diff mbox series

Add an "early rematerialisation" pass

Message ID 87efowyjzw.fsf@linaro.org
State Accepted
Commit 5cce817119cd31d18fbfc1c8245519d86b5e9480
Headers show
Series Add an "early rematerialisation" pass | expand

Commit Message

Richard Sandiford Nov. 17, 2017, 3:58 p.m. UTC
This patch looks for pseudo registers that are live across a call
and for which no call-preserved hard registers exist.  It then
recomputes the pseudos as necessary to ensure that they are no
longer live across a call.  The comment at the head of the file
describes the approach.

A new target hook selects which modes should be treated in this way.
By default none are, in which case the pass is skipped very early.

It might also be worth looking for cases like:

   C1: R1 := f (...)
   ...
   C2: R2 := f (...)
   C3: R1 := C2

and giving the same value number to C1 and C3, effectively treating
it like:

   C1: R1 := f (...)
   ...
   C2: R2 := f (...)
   C3: R1 := f (...)

Another (much more expensive) enhancement would be to apply value
numbering to all pseudo registers (not just rematerialisation
candidates), so that we can handle things like:

  C1: R1 := f (...R2...)
  ...
  C2: R1 := f (...R3...)

where R2 and R3 hold the same value.  But the current pass seems
to catch the vast majority of cases.

Tested on aarch64-linux-gnu (with and without SVE), x86_64-linux-gnu
and powerpc64le-linux-gnu.  OK to install?

Richard


2017-11-17  Richard Sandiford  <richard.sandiford@linaro.org>

gcc/
	* Makefile.in (OBJS): Add early-remat.o.
	* target.def (select_early_remat_modes): New hook.
	* doc/tm.texi.in (TARGET_SELECT_EARLY_REMAT_MODES): New hook.
	* doc/tm.texi: Regenerate.
	* targhooks.h (default_select_early_remat_modes): Declare.
	* targhooks.c (default_select_early_remat_modes): New function.
	* timevar.def (TV_EARLY_REMAT): New timevar.
	* passes.def (pass_early_remat): New pass.
	* tree-pass.h (make_pass_early_remat): Declare.
	* early-remat.c: New file.
	* config/aarch64/aarch64.c (aarch64_select_early_remat_modes): New
	function.
	(TARGET_SELECT_EARLY_REMAT_MODES): Define.

gcc/testsuite/
	* gcc.target/aarch64/sve_spill_1.c: Also test that no predicates
	are spilled.
	* gcc.target/aarch64/sve_spill_2.c: New test.
	* gcc.target/aarch64/sve_spill_3.c: Likewise.
	* gcc.target/aarch64/sve_spill_4.c: Likewise.
	* gcc.target/aarch64/sve_spill_5.c: Likewise.
	* gcc.target/aarch64/sve_spill_6.c: Likewise.
	* gcc.target/aarch64/sve_spill_7.c: Likewise.

Comments

Richard Biener Dec. 14, 2017, 11:09 a.m. UTC | #1
On Fri, Nov 17, 2017 at 4:58 PM, Richard Sandiford
<richard.sandiford@linaro.org> wrote:
> This patch looks for pseudo registers that are live across a call

> and for which no call-preserved hard registers exist.  It then

> recomputes the pseudos as necessary to ensure that they are no

> longer live across a call.  The comment at the head of the file

> describes the approach.

>

> A new target hook selects which modes should be treated in this way.

> By default none are, in which case the pass is skipped very early.

>

> It might also be worth looking for cases like:

>

>    C1: R1 := f (...)

>    ...

>    C2: R2 := f (...)

>    C3: R1 := C2

>

> and giving the same value number to C1 and C3, effectively treating

> it like:

>

>    C1: R1 := f (...)

>    ...

>    C2: R2 := f (...)

>    C3: R1 := f (...)

>

> Another (much more expensive) enhancement would be to apply value

> numbering to all pseudo registers (not just rematerialisation

> candidates), so that we can handle things like:

>

>   C1: R1 := f (...R2...)

>   ...

>   C2: R1 := f (...R3...)

>

> where R2 and R3 hold the same value.  But the current pass seems

> to catch the vast majority of cases.

>

> Tested on aarch64-linux-gnu (with and without SVE), x86_64-linux-gnu

> and powerpc64le-linux-gnu.  OK to install?


Can you tell anything about the complexity of the algorithm?  How
does it relate to what LRA can do?  AFAIK LRA doesn't try to find
any global optimal solution and previous hardreg assignments
may work against it?

That said - I would have expected remat to be done before the
first scheduling pass?  Even before pass_sms (not sure
what pass_live_range_shrinkage does).  Or be integrated
with scheduling and it's register pressure cost model.

Also I would have expected the approach to apply to all modes,
just the cost of spilling is different.  But if you can, say, reduce
register pressure by one by rematerializing a bit-not then that
should be always profitable, no?  postreload-cse will come to
the rescue anyhow.

Disclaimer: didn't look at the pass implementation, RTL isn't my
primary expertise.

Richard.

> Richard

>

>

> 2017-11-17  Richard Sandiford  <richard.sandiford@linaro.org>

>

> gcc/

>         * Makefile.in (OBJS): Add early-remat.o.

>         * target.def (select_early_remat_modes): New hook.

>         * doc/tm.texi.in (TARGET_SELECT_EARLY_REMAT_MODES): New hook.

>         * doc/tm.texi: Regenerate.

>         * targhooks.h (default_select_early_remat_modes): Declare.

>         * targhooks.c (default_select_early_remat_modes): New function.

>         * timevar.def (TV_EARLY_REMAT): New timevar.

>         * passes.def (pass_early_remat): New pass.

>         * tree-pass.h (make_pass_early_remat): Declare.

>         * early-remat.c: New file.

>         * config/aarch64/aarch64.c (aarch64_select_early_remat_modes): New

>         function.

>         (TARGET_SELECT_EARLY_REMAT_MODES): Define.

>

> gcc/testsuite/

>         * gcc.target/aarch64/sve_spill_1.c: Also test that no predicates

>         are spilled.

>         * gcc.target/aarch64/sve_spill_2.c: New test.

>         * gcc.target/aarch64/sve_spill_3.c: Likewise.

>         * gcc.target/aarch64/sve_spill_4.c: Likewise.

>         * gcc.target/aarch64/sve_spill_5.c: Likewise.

>         * gcc.target/aarch64/sve_spill_6.c: Likewise.

>         * gcc.target/aarch64/sve_spill_7.c: Likewise.

>

> Index: gcc/Makefile.in

> ===================================================================

> --- gcc/Makefile.in     2017-11-17 15:56:28.653842781 +0000

> +++ gcc/Makefile.in     2017-11-17 15:56:53.016759921 +0000

> @@ -1273,6 +1273,7 @@ OBJS = \

>         dwarf2asm.o \

>         dwarf2cfi.o \

>         dwarf2out.o \

> +       early-remat.o \

>         emit-rtl.o \

>         et-forest.o \

>         except.o \

> Index: gcc/target.def

> ===================================================================

> --- gcc/target.def      2017-11-17 15:56:45.551761353 +0000

> +++ gcc/target.def      2017-11-17 15:56:53.022759920 +0000

> @@ -5531,6 +5531,19 @@ reload from using some alternatives, lik

>   default_preferred_output_reload_class)

>

>  DEFHOOK

> +(select_early_remat_modes,

> + "On some targets, certain modes cannot be held in registers around a\n\

> +standard ABI call and are relatively expensive to spill to the stack.\n\

> +The early rematerialization pass can help in such cases by aggressively\n\

> +recomputing values after calls, so that they don't need to be spilled.\n\

> +\n\

> +This hook returns the set of such modes by setting the associated bits\n\

> +in @var{modes}.  The default implementation selects no modes, which has\n\

> +the effect of disabling the early rematerialization pass.",

> + void, (sbitmap modes),

> + default_select_early_remat_modes)

> +

> +DEFHOOK

>  (class_likely_spilled_p,

>   "A target hook which returns @code{true} if pseudos that have been assigned\n\

>  to registers of class @var{rclass} would likely be spilled because\n\

> Index: gcc/doc/tm.texi.in

> ===================================================================

> --- gcc/doc/tm.texi.in  2017-11-17 15:56:45.550761354 +0000

> +++ gcc/doc/tm.texi.in  2017-11-17 15:56:53.020759920 +0000

> @@ -2307,6 +2307,8 @@ Do not define this macro if you do not d

>

>  @hook TARGET_SECONDARY_MEMORY_NEEDED_MODE

>

> +@hook TARGET_SELECT_EARLY_REMAT_MODES

> +

>  @hook TARGET_CLASS_LIKELY_SPILLED_P

>

>  @hook TARGET_CLASS_MAX_NREGS

> Index: gcc/doc/tm.texi

> ===================================================================

> --- gcc/doc/tm.texi     2017-11-17 15:56:45.549761354 +0000

> +++ gcc/doc/tm.texi     2017-11-17 15:56:53.019759920 +0000

> @@ -2774,6 +2774,17 @@ details.

>  With LRA, the default is to use @var{mode} unmodified.

>  @end deftypefn

>

> +@deftypefn {Target Hook} void TARGET_SELECT_EARLY_REMAT_MODES (sbitmap @var{modes})

> +On some targets, certain modes cannot be held in registers around a

> +standard ABI call and are relatively expensive to spill to the stack.

> +The early rematerialization pass can help in such cases by aggressively

> +recomputing values after calls, so that they don't need to be spilled.

> +

> +This hook returns the set of such modes by setting the associated bits

> +in @var{modes}.  The default implementation selects no modes, which has

> +the effect of disabling the early rematerialization pass.

> +@end deftypefn

> +

>  @deftypefn {Target Hook} bool TARGET_CLASS_LIKELY_SPILLED_P (reg_class_t @var{rclass})

>  A target hook which returns @code{true} if pseudos that have been assigned

>  to registers of class @var{rclass} would likely be spilled because

> Index: gcc/targhooks.h

> ===================================================================

> --- gcc/targhooks.h     2017-11-17 15:56:45.551761353 +0000

> +++ gcc/targhooks.h     2017-11-17 15:56:53.022759920 +0000

> @@ -284,5 +284,6 @@ extern unsigned int default_min_arithmet

>  extern enum flt_eval_method

>  default_excess_precision (enum excess_precision_type ATTRIBUTE_UNUSED);

>  extern bool default_stack_clash_protection_final_dynamic_probe (rtx);

> +extern void default_select_early_remat_modes (sbitmap);

>

>  #endif /* GCC_TARGHOOKS_H */

> Index: gcc/targhooks.c

> ===================================================================

> --- gcc/targhooks.c     2017-11-17 15:56:45.551761353 +0000

> +++ gcc/targhooks.c     2017-11-17 15:56:53.022759920 +0000

> @@ -82,6 +82,7 @@ Software Foundation; either version 3, o

>  #include "params.h"

>  #include "real.h"

>  #include "langhooks.h"

> +#include "sbitmap.h"

>

>  bool

>  default_legitimate_address_p (machine_mode mode ATTRIBUTE_UNUSED,

> @@ -2314,4 +2315,11 @@ default_stack_clash_protection_final_dyn

>    return 0;

>  }

>

> +/* The default implementation of TARGET_EARLY_REMAT_MODES.  */

> +

> +void

> +default_select_early_remat_modes (sbitmap)

> +{

> +}

> +

>  #include "gt-targhooks.h"

> Index: gcc/timevar.def

> ===================================================================

> --- gcc/timevar.def     2017-11-17 15:56:28.653842781 +0000

> +++ gcc/timevar.def     2017-11-17 15:56:53.023759919 +0000

> @@ -248,6 +248,7 @@ DEFTIMEVAR (TV_MODE_SWITCH           , "

>  DEFTIMEVAR (TV_SMS                  , "sms modulo scheduling")

>  DEFTIMEVAR (TV_LIVE_RANGE_SHRINKAGE  , "live range shrinkage")

>  DEFTIMEVAR (TV_SCHED                 , "scheduling")

> +DEFTIMEVAR (TV_EARLY_REMAT           , "early rematerialization")

>  DEFTIMEVAR (TV_IRA                  , "integrated RA")

>  DEFTIMEVAR (TV_LRA                  , "LRA non-specific")

>  DEFTIMEVAR (TV_LRA_ELIMINATE        , "LRA virtuals elimination")

> Index: gcc/passes.def

> ===================================================================

> --- gcc/passes.def      2017-11-17 15:56:28.653842781 +0000

> +++ gcc/passes.def      2017-11-17 15:56:53.021759920 +0000

> @@ -456,6 +456,7 @@ along with GCC; see the file COPYING3.

>        NEXT_PASS (pass_sms);

>        NEXT_PASS (pass_live_range_shrinkage);

>        NEXT_PASS (pass_sched);

> +      NEXT_PASS (pass_early_remat);

>        NEXT_PASS (pass_ira);

>        NEXT_PASS (pass_reload);

>        NEXT_PASS (pass_postreload);

> Index: gcc/tree-pass.h

> ===================================================================

> --- gcc/tree-pass.h     2017-11-17 15:56:28.653842781 +0000

> +++ gcc/tree-pass.h     2017-11-17 15:56:53.023759919 +0000

> @@ -573,6 +573,7 @@ extern rtl_opt_pass *make_pass_mode_swit

>  extern rtl_opt_pass *make_pass_sms (gcc::context *ctxt);

>  extern rtl_opt_pass *make_pass_sched (gcc::context *ctxt);

>  extern rtl_opt_pass *make_pass_live_range_shrinkage (gcc::context *ctxt);

> +extern rtl_opt_pass *make_pass_early_remat (gcc::context *ctxt);

>  extern rtl_opt_pass *make_pass_ira (gcc::context *ctxt);

>  extern rtl_opt_pass *make_pass_reload (gcc::context *ctxt);

>  extern rtl_opt_pass *make_pass_clean_state (gcc::context *ctxt);

> Index: gcc/early-remat.c

> ===================================================================

> --- /dev/null   2017-11-14 14:28:07.424493901 +0000

> +++ gcc/early-remat.c   2017-11-17 15:56:53.021759920 +0000

> @@ -0,0 +1,2610 @@

> +/* Early (pre-RA) rematerialization

> +   Copyright (C) 2017 Free Software Foundation, Inc.

> +

> +This file is part of GCC.

> +

> +GCC is free software; you can redistribute it and/or modify it under

> +the terms of the GNU General Public License as published by the Free

> +Software Foundation; either version 3, or (at your option) any later

> +version.

> +

> +GCC is distributed in the hope that it will be useful, but WITHOUT ANY

> +WARRANTY; without even the implied warranty of MERCHANTABILITY or

> +FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License

> +for more details.

> +

> +You should have received a copy of the GNU General Public License

> +along with GCC; see the file COPYING3.  If not see

> +<http://www.gnu.org/licenses/>.  */

> +

> +#include "config.h"

> +#include "system.h"

> +#include "coretypes.h"

> +#include "backend.h"

> +#include "rtl.h"

> +#include "df.h"

> +#include "tree-pass.h"

> +#include "memmodel.h"

> +#include "emit-rtl.h"

> +#include "insn-config.h"

> +#include "recog.h"

> +/* FIXME: The next two are only needed for gen_move_insn.  */

> +#include "tree.h"

> +#include "expr.h"

> +#include "target.h"

> +#include "inchash.h"

> +#include "rtlhash.h"

> +#include "print-rtl.h"

> +#include "rtl-iter.h"

> +

> +/* This pass runs before register allocation and implements an aggressive

> +   form of rematerialization.  It looks for pseudo registers R of mode M

> +   for which:

> +

> +     (a) there are no call-preserved registers of mode M; and

> +     (b) spilling R to the stack is expensive.

> +

> +   The assumption is that it's better to recompute R after each call instead

> +   of spilling it, even if this extends the live ranges of other registers.

> +

> +   The motivating example for which these conditions hold are AArch64 SVE

> +   vectors and predicates.  Spilling them to the stack makes the frame

> +   variable-sized, which we'd like to avoid if possible.  It's also very

> +   rare for SVE values to be "naturally" live across a call: usually this

> +   happens as a result of CSE or other code motion.

> +

> +   The pass is split into the following phases:

> +

> +   Collection phase

> +   ================

> +

> +   First we go through all pseudo registers looking for any that meet

> +   the conditions above.  For each such register R, we go through each

> +   instruction that defines R to see whether any of them are suitable

> +   rematerialization candidates.  If at least one is, we treat all the

> +   instructions that define R as candidates, but record which ones are

> +   not in fact suitable.  These unsuitable candidates exist only for the

> +   sake of calculating reaching definitions (see below).

> +

> +   A "candidate" is a single instruction that we want to rematerialize

> +   and a "candidate register" is a register that is set by at least one

> +   candidate.

> +

> +   Candidate sorting

> +   =================

> +

> +   Next we sort the candidates based on the cfg postorder, so that if

> +   candidate C1 uses candidate C2, C1 has a lower index than C2.

> +   This is useful when iterating through candidate bitmaps.

> +

> +   Reaching definition calculation

> +   ===============================

> +

> +   We then compute standard reaching-definition sets for each candidate.

> +   Each set specifies which candidates might provide the current definition

> +   of a live candidate register.

> +

> +   From here on, a candidate C is "live" at a point P if the candidate

> +   register defined by C is live at P and if C's definition reaches P.

> +   An instruction I "uses" a candidate C if I takes the register defined by

> +   C as input and if C is one of the reaching definitions of that register.

> +

> +   Candidate validation and value numbering

> +   ========================================

> +

> +   Next we simultaneously decide which candidates are valid and look

> +   for candidates that are equivalent to each other, assigning numbers

> +   to each unique candidate value.  A candidate C is invalid if:

> +

> +     (a) C uses an invalid candidate;

> +

> +     (b) there is a cycle of candidate uses involving C; or

> +

> +     (c) C takes a candidate register R as input and the reaching

> +         definitions of R do not have the same value number.

> +

> +   We assign a "representative" candidate C to each value number and from

> +   here on replace references to other candidates with that value number

> +   with references to C.  It is then only possible to rematerialize a

> +   register R at point P if (after this replacement) there is a single

> +   reaching definition of R at P.

> +

> +   Local phase

> +   ===========

> +

> +   During this phase we go through each block and look for cases in which:

> +

> +     (a) an instruction I comes between two call instructions CI1 and CI2;

> +

> +     (b) I uses a candidate register R;

> +

> +     (c) a candidate C provides the only reaching definition of R; and

> +

> +     (d) C does not come between CI1 and I.

> +

> +   We then emit a copy of C after CI1, as well as the transitive closure

> +   TC of the candidates used by C.  The copies of TC might use the original

> +   candidate registers or new temporary registers, depending on circumstances.

> +

> +   For example, if elsewhere we have:

> +

> +       C3: R3 <- f3 (...)

> +          ...

> +       C2: R2 <- f2 (...)

> +          ...

> +       C1: R1 <- f1 (R2, R3, ...)  // uses C2 and C3

> +

> +   then for a block containing:

> +

> +      CI1: call

> +          ...

> +       I: use R1  // uses C1

> +          ...

> +      CI2: call

> +

> +   we would emit:

> +

> +      CI1: call

> +      C3': R3' <- f3 (...)

> +      C2': R2' <- f2 (...)

> +      C1': R1 <- f1 (R2', R3', ...)

> +          ...

> +       I: use R1

> +          ...

> +      CI2: call

> +

> +   where R2' and R3' might be fresh registers.  If instead we had:

> +

> +      CI1: call

> +          ...

> +       I1: use R1  // uses C1

> +          ...

> +       I2: use R3  // uses C3

> +          ...

> +      CI2: call

> +

> +   we would keep the original R3:

> +

> +      CI1: call

> +      C3': R3 <- f3 (...)

> +      C2': R2' <- f2 (...)

> +      C1': R1 <- f1 (R2', R3, ...)

> +          ...

> +       I1: use R1  // uses C1

> +          ...

> +       I2: use R3  // uses C3

> +          ...

> +      CI2: call

> +

> +   We also record the last call in each block (if any) and compute:

> +

> +     rd_after_call:

> +       The set of candidates that either (a) are defined outside the block

> +       and are live after the last call or (b) are defined within the block

> +       and reach the end of the last call.  (We don't track whether the

> +       latter values are live or not.)

> +

> +     required_after_call:

> +       The set of candidates that need to be rematerialized after the

> +       last call in order to satisfy uses in the block itself.

> +

> +     required_in:

> +       The set of candidates that are live on entry to the block and are

> +       used without an intervening call.

> +

> +   In addition, we compute the initial values of the sets required by

> +   the global phase below.

> +

> +   Global phase

> +   ============

> +

> +   We next compute a maximal solution to the following availability

> +   problem:

> +

> +     available_in:

> +       The set of candidates that are live on entry to a block and can

> +       be used at that point without rematerialization.

> +

> +     available_out:

> +       The set of candidates that are live on exit from a block and can

> +       be used at that point without rematerialization.

> +

> +     available_locally:

> +       The subset of available_out that is due to code in the block itself.

> +       It contains candidates that are defined or used in the block and

> +       not invalidated by a later call.

> +

> +   We then go through each block B and look for an appropriate place

> +   to insert copies of required_in - available_in.  Conceptually we

> +   start by placing the copies at the head of B, but then move the

> +   copy of a candidate C to predecessors if:

> +

> +     (a) that seems cheaper;

> +

> +     (b) there is more than one reaching definition of C's register at

> +        the head of B; or

> +

> +     (c) copying C would clobber a hard register that is live on entry to B.

> +

> +   Moving a copy of C to a predecessor block PB involves:

> +

> +     (1) adding C to PB's required_after_call, if PB contains a call; or

> +

> +     (2) adding C PB's required_in otherwise.

> +

> +   C is then available on output from each PB and on input to B.

> +

> +   Once all this is done, we emit instructions for the final required_in

> +   and required_after_call sets.  */

> +

> +namespace {

> +

> +/* An invalid candidate index, used to indicate that there is more than

> +   one reaching definition.  */

> +const unsigned int MULTIPLE_CANDIDATES = -1U;

> +

> +/* Pass-specific information about one basic block.  */

> +struct remat_block_info {

> +  /* The last call instruction in the block.  */

> +  rtx_insn *last_call;

> +

> +  /* The set of candidates that are live on entry to the block.  NULL is

> +     equivalent to an empty set.  */

> +  bitmap rd_in;

> +

> +  /* The set of candidates that are live on exit from the block.  This might

> +     reuse rd_in.  NULL is equivalent to an empty set.  */

> +  bitmap rd_out;

> +

> +  /* The subset of RD_OUT that comes from local definitions.  NULL is

> +     equivalent to an empty set.  */

> +  bitmap rd_gen;

> +

> +  /* The set of candidates that the block invalidates (because it defines

> +     the register to something else, or because the register's value is

> +     no longer important).  NULL is equivalent to an empty set.  */

> +  bitmap rd_kill;

> +

> +  /* The set of candidates that either (a) are defined outside the block

> +     and are live after LAST_CALL or (b) are defined within the block

> +     and reach the instruction after LAST_CALL.  (We don't track whether

> +     the latter values are live or not.)

> +

> +     Only used if LAST_CALL is nonnull.  NULL is equivalent to an

> +     empty set.  */

> +  bitmap rd_after_call;

> +

> +  /* Candidates that are live and available without rematerialization

> +     on entry to the block.  NULL is equivalent to an empty set.  */

> +  bitmap available_in;

> +

> +  /* Candidates that become available without rematerialization within the

> +     block, and remain so on exit.  NULL is equivalent to an empty set.  */

> +  bitmap available_locally;

> +

> +  /* Candidates that are available without rematerialization on exit from

> +     the block.  This might reuse available_in or available_locally.  */

> +  bitmap available_out;

> +

> +  /* Candidates that need to be rematerialized either at the start of the

> +     block or before entering the block.  */

> +  bitmap required_in;

> +

> +  /* Candidates that need to be rematerialized after LAST_CALL.

> +     Only used if LAST_CALL is nonnull.  */

> +  bitmap required_after_call;

> +

> +  /* The number of candidates in the block.  */

> +  unsigned int num_candidates;

> +

> +  /* The earliest candidate in the block (i.e. the one with the

> +     highest index).  Only valid if NUM_CANDIDATES is nonzero.  */

> +  unsigned int first_candidate;

> +

> +  /* The best (lowest) execution frequency for rematerializing REQUIRED_IN.

> +     This is the execution frequency of the block if LOCAL_REMAT_CHEAPER_P,

> +     otherwise it is the sum of the execution frequencies of whichever

> +     predecessor blocks would do the rematerialization.  */

> +  int remat_frequency;

> +

> +  /* True if the block ends with an abnormal call.  */

> +  unsigned int abnormal_call_p : 1;

> +

> +  /* Used to record whether a graph traversal has visited this block.  */

> +  unsigned int visited_p : 1;

> +

> +  /* True if we have calculated REMAT_FREQUENCY.  */

> +  unsigned int remat_frequency_valid_p : 1;

> +

> +  /* True if it is cheaper to rematerialize candidates at the start of

> +     the block, rather than moving them to predecessor blocks.  */

> +  unsigned int local_remat_cheaper_p : 1;

> +};

> +

> +/* Information about a group of candidates with the same value number.  */

> +struct remat_equiv_class {

> +  /* The candidates that have the same value number.  */

> +  bitmap members;

> +

> +  /* The candidate that was first added to MEMBERS.  */

> +  unsigned int earliest;

> +

> +  /* The candidate that represents the others.  This is always the one

> +     with the highest index.  */

> +  unsigned int representative;

> +};

> +

> +/* Information about an instruction that we might want to rematerialize.  */

> +struct remat_candidate {

> +  /* The pseudo register that the instruction sets.  */

> +  unsigned int regno;

> +

> +  /* A temporary register used when rematerializing uses of this candidate,

> +     if REGNO doesn't have the right value or isn't worth using.  */

> +  unsigned int copy_regno;

> +

> +  /* True if we intend to rematerialize this instruction by emitting

> +     a move of a constant into REGNO, false if we intend to emit a

> +     copy of the original instruction.  */

> +  unsigned int constant_p : 1;

> +

> +  /* True if we still think it's possible to rematerialize INSN.  */

> +  unsigned int can_copy_p : 1;

> +

> +  /* Used to record whether a graph traversal has visited this candidate.  */

> +  unsigned int visited_p : 1;

> +

> +  /* True if we have verified that it's possible to rematerialize INSN.

> +     Once this is true, both it and CAN_COPY_P remain true.  */

> +  unsigned int validated_p : 1;

> +

> +  /* True if we have "stabilized" INSN, i.e. ensured that all non-candidate

> +     registers read by INSN will have the same value when rematerializing INSN.

> +     Only ever true if CAN_COPY_P.  */

> +  unsigned int stabilized_p : 1;

> +

> +  /* Hash value used for value numbering.  */

> +  hashval_t hash;

> +

> +  /* The instruction that sets REGNO.  */

> +  rtx_insn *insn;

> +

> +  /* If CONSTANT_P, the value that should be moved into REGNO when

> +     rematerializing, otherwise the pattern of the instruction that

> +     should be used.  */

> +  rtx remat_rtx;

> +

> +  /* The set of candidates that INSN takes as input.  NULL is equivalent

> +     to the empty set.  All candidates in this set have a higher index

> +     than the current candidate.  */

> +  bitmap uses;

> +

> +  /* The set of hard registers that would be clobbered by rematerializing

> +     the candidate, including (transitively) all those that would be

> +     clobbered by rematerializing USES.  */

> +  bitmap clobbers;

> +

> +  /* The equivalence class to which the candidate belongs, or null if none.  */

> +  remat_equiv_class *equiv_class;

> +};

> +

> +/* Hash functions used for value numbering.  */

> +struct remat_candidate_hasher : nofree_ptr_hash <remat_candidate>

> +{

> +  typedef value_type compare_type;

> +  static hashval_t hash (const remat_candidate *);

> +  static bool equal (const remat_candidate *, const remat_candidate *);

> +};

> +

> +/* Main class for this pass.  */

> +class early_remat {

> +public:

> +  early_remat (function *, sbitmap);

> +  ~early_remat ();

> +

> +  void run (void);

> +

> +private:

> +  bitmap alloc_bitmap (void);

> +  bitmap get_bitmap (bitmap *);

> +  void init_temp_bitmap (bitmap *);

> +  void copy_temp_bitmap (bitmap *, bitmap *);

> +

> +  void dump_insn_id (rtx_insn *);

> +  void dump_candidate_bitmap (bitmap);

> +  void dump_all_candidates (void);

> +  void dump_edge_list (basic_block, bool);

> +  void dump_block_info (basic_block);

> +  void dump_all_blocks (void);

> +

> +  bool interesting_regno_p (unsigned int);

> +  remat_candidate *add_candidate (rtx_insn *, unsigned int, bool);

> +  bool maybe_add_candidate (rtx_insn *, unsigned int);

> +  bool collect_candidates (void);

> +  void init_block_info (void);

> +  void sort_candidates (void);

> +  void finalize_candidate_indices (void);

> +  void record_equiv_candidates (unsigned int, unsigned int);

> +  static bool rd_confluence_n (edge);

> +  static bool rd_transfer (int);

> +  void compute_rd (void);

> +  unsigned int canon_candidate (unsigned int);

> +  void canon_bitmap (bitmap *);

> +  unsigned int resolve_reaching_def (bitmap);

> +  bool check_candidate_uses (unsigned int);

> +  void compute_clobbers (unsigned int);

> +  void assign_value_number (unsigned int);

> +  void decide_candidate_validity (void);

> +  bool stable_use_p (unsigned int);

> +  void emit_copy_before (unsigned int, rtx, rtx);

> +  void stabilize_pattern (unsigned int);

> +  void replace_dest_with_copy (unsigned int);

> +  void stabilize_candidate_uses (unsigned int, bitmap, bitmap, bitmap,

> +                                bitmap);

> +  void emit_remat_insns (bitmap, bitmap, bitmap, rtx_insn *);

> +  bool set_available_out (remat_block_info *);

> +  void process_block (basic_block);

> +  void local_phase (void);

> +  static bool avail_confluence_n (edge);

> +  static bool avail_transfer (int);

> +  void compute_availability (void);

> +  void unshare_available_sets (remat_block_info *);

> +  bool can_move_across_edge_p (edge);

> +  bool local_remat_cheaper_p (unsigned int);

> +  bool need_to_move_candidate_p (unsigned int, unsigned int);

> +  void compute_minimum_move_set (unsigned int, bitmap);

> +  void move_to_predecessors (unsigned int, bitmap, bitmap);

> +  void choose_rematerialization_points (void);

> +  void emit_remat_insns_for_block (basic_block);

> +  void global_phase (void);

> +

> +  /* The function that we're optimizing.  */

> +  function *m_fn;

> +

> +  /* The modes that we want to rematerialize.  */

> +  sbitmap m_selected_modes;

> +

> +  /* All rematerialization candidates, identified by their index into the

> +     vector.  */

> +  auto_vec<remat_candidate> m_candidates;

> +

> +  /* The set of candidate registers.  */

> +  bitmap_head m_candidate_regnos;

> +

> +  /* Temporary sets.  */

> +  bitmap_head m_tmp_bitmap;

> +  bitmap m_available;

> +  bitmap m_required;

> +

> +  /* Information about each basic block.  */

> +  auto_vec<remat_block_info> m_block_info;

> +

> +  /* A mapping from register numbers to the set of associated candidates.

> +     Only valid for registers in M_CANDIDATE_REGNOS.  */

> +  auto_vec<bitmap> m_regno_to_candidates;

> +

> +  /* An obstack used for allocating bitmaps, so that we can free them all

> +     in one go.  */

> +  bitmap_obstack m_obstack;

> +

> +  /* A hash table of candidates used for value numbering.  If a candidate

> +     in the table is in an equivalence class, the candidate is marked as

> +     the earliest member of the class.  */

> +  hash_table<remat_candidate_hasher> m_value_table;

> +

> +  /* Used temporarily by callback functions.  */

> +  static early_remat *er;

> +};

> +

> +}

> +

> +early_remat *early_remat::er;

> +

> +/* rtx_equal_p_cb callback that treats any two SCRATCHes as equal.

> +   This allows us to compare two copies of a pattern, even though their

> +   SCRATCHes are always distinct.  */

> +

> +static int

> +scratch_equal (const_rtx *x, const_rtx *y, rtx *nx, rtx *ny)

> +{

> +  if (GET_CODE (*x) == SCRATCH && GET_CODE (*y) == SCRATCH)

> +    {

> +      *nx = const0_rtx;

> +      *ny = const0_rtx;

> +      return 1;

> +    }

> +  return 0;

> +}

> +

> +/* Hash callback functions for remat_candidate.  */

> +

> +hashval_t

> +remat_candidate_hasher::hash (const remat_candidate *cand)

> +{

> +  return cand->hash;

> +}

> +

> +bool

> +remat_candidate_hasher::equal (const remat_candidate *cand1,

> +                              const remat_candidate *cand2)

> +{

> +  return (cand1->regno == cand2->regno

> +         && cand1->constant_p == cand2->constant_p

> +         && (cand1->constant_p

> +             ? rtx_equal_p (cand1->remat_rtx, cand2->remat_rtx)

> +             : rtx_equal_p_cb (cand1->remat_rtx, cand2->remat_rtx,

> +                               scratch_equal))

> +         && (!cand1->uses || bitmap_equal_p (cand1->uses, cand2->uses)));

> +}

> +

> +/* Return true if B is null or empty.  */

> +

> +inline bool

> +empty_p (bitmap b)

> +{

> +  return !b || bitmap_empty_p (b);

> +}

> +

> +/* Allocate a new bitmap.  It will be automatically freed at the end of

> +   the pass.  */

> +

> +inline bitmap

> +early_remat::alloc_bitmap (void)

> +{

> +  return bitmap_alloc (&m_obstack);

> +}

> +

> +/* Initialize *PTR to an empty bitmap if it is currently null.  */

> +

> +inline bitmap

> +early_remat::get_bitmap (bitmap *ptr)

> +{

> +  if (!*ptr)

> +    *ptr = alloc_bitmap ();

> +  return *ptr;

> +}

> +

> +/* *PTR is either null or empty.  If it is null, initialize it to an

> +   empty bitmap.  */

> +

> +inline void

> +early_remat::init_temp_bitmap (bitmap *ptr)

> +{

> +  if (!*ptr)

> +    *ptr = alloc_bitmap ();

> +  else

> +    gcc_checking_assert (bitmap_empty_p (*ptr));

> +}

> +

> +/* Move *SRC to *DEST and leave *SRC empty.  */

> +

> +inline void

> +early_remat::copy_temp_bitmap (bitmap *dest, bitmap *src)

> +{

> +  if (!empty_p (*src))

> +    {

> +      *dest = *src;

> +      *src = NULL;

> +    }

> +  else

> +    *dest = NULL;

> +}

> +

> +/* Print INSN's identifier to the dump file.  */

> +

> +void

> +early_remat::dump_insn_id (rtx_insn *insn)

> +{

> +  fprintf (dump_file, "%d[bb:%d]", INSN_UID (insn),

> +          BLOCK_FOR_INSN (insn)->index);

> +}

> +

> +/* Print candidate set CANDIDATES to the dump file, with a leading space.  */

> +

> +void

> +early_remat::dump_candidate_bitmap (bitmap candidates)

> +{

> +  if (empty_p (candidates))

> +    {

> +      fprintf (dump_file, " none");

> +      return;

> +    }

> +

> +  unsigned int cand_index;

> +  bitmap_iterator bi;

> +  EXECUTE_IF_SET_IN_BITMAP (candidates, 0, cand_index, bi)

> +    fprintf (dump_file, " %d", cand_index);

> +}

> +

> +/* Print information about all candidates to the dump file.  */

> +

> +void

> +early_remat::dump_all_candidates (void)

> +{

> +  fprintf (dump_file, "\n;; Candidates:\n;;\n");

> +  fprintf (dump_file, ";; %5s %5s %8s %s\n", "#", "reg", "mode", "insn");

> +  fprintf (dump_file, ";; %5s %5s %8s %s\n", "=", "===", "====", "====");

> +  unsigned int cand_index;

> +  remat_candidate *cand;

> +  FOR_EACH_VEC_ELT (m_candidates, cand_index, cand)

> +    {

> +      fprintf (dump_file, ";; %5d %5d %8s ", cand_index, cand->regno,

> +              GET_MODE_NAME (GET_MODE (regno_reg_rtx[cand->regno])));

> +      dump_insn_id (cand->insn);

> +      if (!cand->can_copy_p)

> +       fprintf (dump_file, "   -- can't copy");

> +      fprintf (dump_file, "\n");

> +    }

> +

> +  fprintf (dump_file, "\n;; Register-to-candidate mapping:\n;;\n");

> +  unsigned int regno;

> +  bitmap_iterator bi;

> +  EXECUTE_IF_SET_IN_BITMAP (&m_candidate_regnos, 0, regno, bi)

> +    {

> +      fprintf (dump_file, ";; %5d:", regno);

> +      dump_candidate_bitmap (m_regno_to_candidates[regno]);

> +      fprintf (dump_file, "\n");

> +    }

> +}

> +

> +/* Print the predecessors or successors of BB to the dump file, with a

> +   leading space.  DO_SUCC is true to print successors and false to print

> +   predecessors.  */

> +

> +void

> +early_remat::dump_edge_list (basic_block bb, bool do_succ)

> +{

> +  edge e;

> +  edge_iterator ei;

> +  FOR_EACH_EDGE (e, ei, do_succ ? bb->succs : bb->preds)

> +    dump_edge_info (dump_file, e, 0, do_succ);

> +}

> +

> +/* Print information about basic block BB to the dump file.  */

> +

> +void

> +early_remat::dump_block_info (basic_block bb)

> +{

> +  remat_block_info *info = &m_block_info[bb->index];

> +  fprintf (dump_file, ";;\n;; Block %d:", bb->index);

> +  int width = 25;

> +

> +  fprintf (dump_file, "\n;;%*s:", width, "predecessors");

> +  dump_edge_list (bb, false);

> +

> +  fprintf (dump_file, "\n;;%*s:", width, "successors");

> +  dump_edge_list (bb, true);

> +

> +  fprintf (dump_file, "\n;;%*s: %d", width, "frequency",

> +          bb->count.to_frequency (m_fn));

> +

> +  if (info->last_call)

> +    fprintf (dump_file, "\n;;%*s: %d", width, "last call",

> +            INSN_UID (info->last_call));

> +

> +  if (!empty_p (info->rd_in))

> +    {

> +      fprintf (dump_file, "\n;;%*s:", width, "RD in");

> +      dump_candidate_bitmap (info->rd_in);

> +    }

> +  if (!empty_p (info->rd_kill))

> +    {

> +      fprintf (dump_file, "\n;;%*s:", width, "RD kill");

> +      dump_candidate_bitmap (info->rd_kill);

> +    }

> +  if (!empty_p (info->rd_gen))

> +    {

> +      fprintf (dump_file, "\n;;%*s:", width, "RD gen");

> +      dump_candidate_bitmap (info->rd_gen);

> +    }

> +  if (!empty_p (info->rd_after_call))

> +    {

> +      fprintf (dump_file, "\n;;%*s:", width, "RD after call");

> +      dump_candidate_bitmap (info->rd_after_call);

> +    }

> +  if (!empty_p (info->rd_out))

> +    {

> +      fprintf (dump_file, "\n;;%*s:", width, "RD out");

> +      if (info->rd_in == info->rd_out)

> +       fprintf (dump_file, " RD in");

> +      else

> +       dump_candidate_bitmap (info->rd_out);

> +    }

> +  if (!empty_p (info->available_in))

> +    {

> +      fprintf (dump_file, "\n;;%*s:", width, "available in");

> +      dump_candidate_bitmap (info->available_in);

> +    }

> +  if (!empty_p (info->available_locally))

> +    {

> +      fprintf (dump_file, "\n;;%*s:", width, "available locally");

> +      dump_candidate_bitmap (info->available_locally);

> +    }

> +  if (!empty_p (info->available_out))

> +    {

> +      fprintf (dump_file, "\n;;%*s:", width, "available out");

> +      if (info->available_in == info->available_out)

> +       fprintf (dump_file, " available in");

> +      else if (info->available_locally == info->available_out)

> +       fprintf (dump_file, " available locally");

> +      else

> +       dump_candidate_bitmap (info->available_out);

> +    }

> +  if (!empty_p (info->required_in))

> +    {

> +      fprintf (dump_file, "\n;;%*s:", width, "required in");

> +      dump_candidate_bitmap (info->required_in);

> +    }

> +  if (!empty_p (info->required_after_call))

> +    {

> +      fprintf (dump_file, "\n;;%*s:", width, "required after call");

> +      dump_candidate_bitmap (info->required_after_call);

> +    }

> +  fprintf (dump_file, "\n");

> +}

> +

> +/* Print information about all basic blocks to the dump file.  */

> +

> +void

> +early_remat::dump_all_blocks (void)

> +{

> +  basic_block bb;

> +  FOR_EACH_BB_FN (bb, m_fn)

> +    dump_block_info (bb);

> +}

> +

> +/* Return true if REGNO is worth rematerializing.  */

> +

> +bool

> +early_remat::interesting_regno_p (unsigned int regno)

> +{

> +  /* Ignore unused registers.  */

> +  rtx reg = regno_reg_rtx[regno];

> +  if (!reg || DF_REG_DEF_COUNT (regno) == 0)

> +    return false;

> +

> +  /* Make sure the register has a mode that we want to rematerialize.  */

> +  if (!bitmap_bit_p (m_selected_modes, GET_MODE (reg)))

> +    return false;

> +

> +  /* Ignore values that might sometimes be used uninitialized.  We could

> +     instead add dummy candidates for the entry block definition, and so

> +     handle uses that are definitely not uninitialized, but the combination

> +     of the two should be rare in practice.  */

> +  if (bitmap_bit_p (DF_LR_OUT (ENTRY_BLOCK_PTR_FOR_FN (m_fn)), regno))

> +    return false;

> +

> +  return true;

> +}

> +

> +/* Record the set of register REGNO in instruction INSN as a

> +   rematerialization candidate.  CAN_COPY_P is true unless we already

> +   know that rematerialization is impossible (in which case the candidate

> +   only exists for the reaching definition calculation).

> +

> +   The candidate's index is not fixed at this stage.  */

> +

> +remat_candidate *

> +early_remat::add_candidate (rtx_insn *insn, unsigned int regno,

> +                           bool can_copy_p)

> +{

> +  remat_candidate cand;

> +  memset (&cand, 0, sizeof (cand));

> +  cand.regno = regno;

> +  cand.insn = insn;

> +  cand.remat_rtx = PATTERN (insn);

> +  cand.can_copy_p = can_copy_p;

> +  m_candidates.safe_push (cand);

> +

> +  bitmap_set_bit (&m_candidate_regnos, regno);

> +

> +  return &m_candidates.last ();

> +}

> +

> +/* Return true if we can rematerialize the set of register REGNO in

> +   instruction INSN, and add it as a candidate if so.  When returning

> +   false, print the reason to the dump file.  */

> +

> +bool

> +early_remat::maybe_add_candidate (rtx_insn *insn, unsigned int regno)

> +{

> +#define FAILURE_FORMAT ";; Can't rematerialize set of reg %d in %d[bb:%d]: "

> +#define FAILURE_ARGS regno, INSN_UID (insn), BLOCK_FOR_INSN (insn)->index

> +

> +  /* The definition must come from an ordinary instruction.  */

> +  basic_block bb = BLOCK_FOR_INSN (insn);

> +  if (!NONJUMP_INSN_P (insn)

> +      || (insn == BB_END (bb)

> +         && has_abnormal_or_eh_outgoing_edge_p (bb)))

> +    {

> +      if (dump_file)

> +       fprintf (dump_file, FAILURE_FORMAT "insn alters control flow\n",

> +                FAILURE_ARGS);

> +      return false;

> +    }

> +

> +  /* Prefer to rematerialize constants directly -- it's much easier.  */

> +  machine_mode mode = GET_MODE (regno_reg_rtx[regno]);

> +  if (rtx note = find_reg_equal_equiv_note (insn))

> +    {

> +      rtx val = XEXP (note, 0);

> +      if (CONSTANT_P (val)

> +         && targetm.legitimate_constant_p (mode, val))

> +       {

> +         remat_candidate *cand = add_candidate (insn, regno, true);

> +         cand->constant_p = true;

> +         cand->remat_rtx = val;

> +         return true;

> +       }

> +    }

> +

> +  /* See whether the target has reasons to prevent a copy.  */

> +  if (targetm.cannot_copy_insn_p && targetm.cannot_copy_insn_p (insn))

> +    {

> +      if (dump_file)

> +       fprintf (dump_file, FAILURE_FORMAT "target forbids copying\n",

> +                FAILURE_ARGS);

> +      return false;

> +    }

> +

> +  /* We can't copy trapping instructions.  */

> +  rtx pat = PATTERN (insn);

> +  if (may_trap_p (pat))

> +    {

> +      if (dump_file)

> +       fprintf (dump_file, FAILURE_FORMAT "insn might trap\n", FAILURE_ARGS);

> +      return false;

> +    }

> +

> +  /* We can't copy instructions that read memory, unless we know that

> +     the contents never change.  */

> +  subrtx_iterator::array_type array;

> +  FOR_EACH_SUBRTX (iter, array, pat, ALL)

> +    if (MEM_P (*iter) && !MEM_READONLY_P (*iter))

> +      {

> +       if (dump_file)

> +         fprintf (dump_file, FAILURE_FORMAT "insn references non-constant"

> +                  " memory\n", FAILURE_ARGS);

> +       return false;

> +      }

> +

> +  /* Check each defined register.  */

> +  df_ref ref;

> +  FOR_EACH_INSN_DEF (ref, insn)

> +    {

> +      unsigned int def_regno = DF_REF_REGNO (ref);

> +      if (def_regno == regno)

> +       {

> +         /* Make sure the definition is write-only.  (Partial definitions,

> +            such as setting the low part and clobbering the high part,

> +            are otherwise OK.)  */

> +         if (DF_REF_FLAGS_IS_SET (ref, DF_REF_READ_WRITE))

> +           {

> +             if (dump_file)

> +               fprintf (dump_file, FAILURE_FORMAT "destination is"

> +                        " read-modify-write\n", FAILURE_ARGS);

> +             return false;

> +           }

> +       }

> +      else

> +       {

> +         /* The instruction can set additional registers, provided that

> +            they're call-clobbered hard registers.  This is useful for

> +            instructions that alter the condition codes.  */

> +         if (!HARD_REGISTER_NUM_P (def_regno))

> +           {

> +             if (dump_file)

> +               fprintf (dump_file, FAILURE_FORMAT "insn also sets"

> +                        " pseudo reg %d\n", FAILURE_ARGS, def_regno);

> +             return false;

> +           }

> +         if (global_regs[def_regno])

> +           {

> +             if (dump_file)

> +               fprintf (dump_file, FAILURE_FORMAT "insn also sets"

> +                        " global reg %d\n", FAILURE_ARGS, def_regno);

> +             return false;

> +           }

> +         if (!TEST_HARD_REG_BIT (regs_invalidated_by_call, def_regno))

> +           {

> +             if (dump_file)

> +               fprintf (dump_file, FAILURE_FORMAT "insn also sets"

> +                        " call-preserved reg %d\n", FAILURE_ARGS, def_regno);

> +             return false;

> +           }

> +       }

> +    }

> +

> +  /* If the instruction uses fixed hard registers, check that those

> +     registers have the same value throughout the function.  If the

> +     instruction uses non-fixed hard registers, check that we can

> +     replace them with pseudos.  */

> +  FOR_EACH_INSN_USE (ref, insn)

> +    {

> +      unsigned int use_regno = DF_REF_REGNO (ref);

> +      if (HARD_REGISTER_NUM_P (use_regno) && fixed_regs[use_regno])

> +       {

> +         if (rtx_unstable_p (DF_REF_REAL_REG (ref)))

> +           {

> +             if (dump_file)

> +               fprintf (dump_file, FAILURE_FORMAT "insn uses fixed hard reg"

> +                        " %d\n", FAILURE_ARGS, use_regno);

> +             return false;

> +           }

> +       }

> +      else if (HARD_REGISTER_NUM_P (use_regno))

> +       {

> +         /* Allocate a dummy pseudo register and temporarily install it.

> +            Make the register number depend on the mode, which should

> +            provide enough sharing for match_dup while also weeding

> +            out cases in which operands with different modes are

> +            explicitly tied.  */

> +         rtx *loc = DF_REF_REAL_LOC (ref);

> +         unsigned int size = RTX_CODE_SIZE (REG);

> +         rtx new_reg = (rtx) alloca (size);

> +         memset (new_reg, 0, size);

> +         PUT_CODE (new_reg, REG);

> +         set_mode_and_regno (new_reg, GET_MODE (*loc),

> +                             LAST_VIRTUAL_REGISTER + 1 + GET_MODE (*loc));

> +         validate_change (insn, loc, new_reg, 1);

> +       }

> +    }

> +  bool ok_p = verify_changes (0);

> +  cancel_changes (0);

> +  if (!ok_p)

> +    {

> +      if (dump_file)

> +       fprintf (dump_file, FAILURE_FORMAT "insn does not allow hard"

> +                " register inputs to be replaced\n", FAILURE_ARGS);

> +      return false;

> +    }

> +

> +#undef FAILURE_ARGS

> +#undef FAILURE_FORMAT

> +

> +  add_candidate (insn, regno, true);

> +  return true;

> +}

> +

> +/* Calculate the set of rematerialization candidates.  Return true if

> +   we find at least one.  */

> +

> +bool

> +early_remat::collect_candidates (void)

> +{

> +  unsigned int nregs = DF_REG_SIZE (df);

> +  for (unsigned int regno = FIRST_PSEUDO_REGISTER; regno < nregs; ++regno)

> +    if (interesting_regno_p (regno))

> +      {

> +       /* Create candidates for all suitable definitions.  */

> +       bitmap_clear (&m_tmp_bitmap);

> +       unsigned int bad = 0;

> +       unsigned int id = 0;

> +       for (df_ref ref = DF_REG_DEF_CHAIN (regno); ref;

> +            ref = DF_REF_NEXT_REG (ref))

> +         {

> +           rtx_insn *insn = DF_REF_INSN (ref);

> +           if (maybe_add_candidate (insn, regno))

> +             bitmap_set_bit (&m_tmp_bitmap, id);

> +           else

> +             bad += 1;

> +           id += 1;

> +         }

> +

> +       /* If we found at least one suitable definition, add dummy

> +          candidates for the rest, so that we can see which definitions

> +          are live where.  */

> +       if (!bitmap_empty_p (&m_tmp_bitmap) && bad)

> +         {

> +           id = 0;

> +           for (df_ref ref = DF_REG_DEF_CHAIN (regno); ref;

> +                ref = DF_REF_NEXT_REG (ref))

> +             {

> +               if (!bitmap_bit_p (&m_tmp_bitmap, id))

> +                 add_candidate (DF_REF_INSN (ref), regno, false);

> +               id += 1;

> +             }

> +         }

> +      }

> +

> +

> +  return !m_candidates.is_empty ();

> +}

> +

> +/* Initialize the m_block_info array.  */

> +

> +void

> +early_remat::init_block_info (void)

> +{

> +  unsigned int n_blocks = last_basic_block_for_fn (m_fn);

> +  m_block_info.safe_grow_cleared (n_blocks);

> +}

> +

> +/* Maps basic block indices to their position in the post order.  */

> +static unsigned int *postorder_index;

> +

> +/* Order remat_candidates X_IN and Y_IN according to the cfg postorder.  */

> +

> +static int

> +compare_candidates (const void *x_in, const void *y_in)

> +{

> +  const remat_candidate *x = (const remat_candidate *) x_in;

> +  const remat_candidate *y = (const remat_candidate *) y_in;

> +  basic_block x_bb = BLOCK_FOR_INSN (x->insn);

> +  basic_block y_bb = BLOCK_FOR_INSN (y->insn);

> +  if (x_bb != y_bb)

> +    /* Make X and Y follow block postorder.  */

> +    return postorder_index[x_bb->index] - postorder_index[y_bb->index];

> +

> +  /* Make X and Y follow a backward traversal of the containing block.  */

> +  return DF_INSN_LUID (y->insn) - DF_INSN_LUID (x->insn);

> +}

> +

> +/* Sort the collected rematerialization candidates so that they follow

> +   cfg postorder.  */

> +

> +void

> +early_remat::sort_candidates (void)

> +{

> +  /* Make sure the DF LUIDs are up-to-date for all the blocks we

> +     care about.  */

> +  bitmap_clear (&m_tmp_bitmap);

> +  unsigned int cand_index;

> +  remat_candidate *cand;

> +  FOR_EACH_VEC_ELT (m_candidates, cand_index, cand)

> +    {

> +      basic_block bb = BLOCK_FOR_INSN (cand->insn);

> +      if (bitmap_set_bit (&m_tmp_bitmap, bb->index))

> +       df_recompute_luids (bb);

> +    }

> +

> +  /* Create a mapping from block numbers to their position in the

> +     postorder.  */

> +  unsigned int n_blocks = last_basic_block_for_fn (m_fn);

> +  int *postorder = df_get_postorder (DF_BACKWARD);

> +  unsigned int postorder_len = df_get_n_blocks (DF_BACKWARD);

> +  postorder_index = new unsigned int[n_blocks];

> +  for (unsigned int i = 0; i < postorder_len; ++i)

> +    postorder_index[postorder[i]] = i;

> +

> +  m_candidates.qsort (compare_candidates);

> +

> +  delete postorder_index;

> +}

> +

> +/* Commit to the current candidate indices and initialize cross-references.  */

> +

> +void

> +early_remat::finalize_candidate_indices (void)

> +{

> +  /* Create a bitmap for each candidate register.  */

> +  m_regno_to_candidates.safe_grow (max_reg_num ());

> +  unsigned int regno;

> +  bitmap_iterator bi;

> +  EXECUTE_IF_SET_IN_BITMAP (&m_candidate_regnos, 0, regno, bi)

> +    m_regno_to_candidates[regno] = alloc_bitmap ();

> +

> +  /* Go through each candidate and record its index.  */

> +  unsigned int cand_index;

> +  remat_candidate *cand;

> +  FOR_EACH_VEC_ELT (m_candidates, cand_index, cand)

> +    {

> +      basic_block bb = BLOCK_FOR_INSN (cand->insn);

> +      remat_block_info *info = &m_block_info[bb->index];

> +      info->num_candidates += 1;

> +      info->first_candidate = cand_index;

> +      bitmap_set_bit (m_regno_to_candidates[cand->regno], cand_index);

> +    }

> +}

> +

> +/* Record that candidates CAND1_INDEX and CAND2_INDEX are equivalent.

> +   CAND1_INDEX might already have an equivalence class, but CAND2_INDEX

> +   doesn't.  */

> +

> +void

> +early_remat::record_equiv_candidates (unsigned int cand1_index,

> +                                     unsigned int cand2_index)

> +{

> +  if (dump_file)

> +    fprintf (dump_file, ";; Candidate %d is equivalent to candidate %d\n",

> +            cand2_index, cand1_index);

> +

> +  remat_candidate *cand1 = &m_candidates[cand1_index];

> +  remat_candidate *cand2 = &m_candidates[cand2_index];

> +  gcc_checking_assert (!cand2->equiv_class);

> +

> +  remat_equiv_class *ec = cand1->equiv_class;

> +  if (!ec)

> +    {

> +      ec = XOBNEW (&m_obstack.obstack, remat_equiv_class);

> +      ec->members = alloc_bitmap ();

> +      bitmap_set_bit (ec->members, cand1_index);

> +      ec->earliest = cand1_index;

> +      ec->representative = cand1_index;

> +      cand1->equiv_class = ec;

> +    }

> +  cand1 = &m_candidates[ec->representative];

> +  cand2->equiv_class = ec;

> +  bitmap_set_bit (ec->members, cand2_index);

> +  if (cand2_index > ec->representative)

> +    ec->representative = cand2_index;

> +}

> +

> +/* Propagate information from the rd_out set of E->src to the rd_in set

> +   of E->dest, when computing global reaching definitions.  Return true

> +   if something changed.  */

> +

> +bool

> +early_remat::rd_confluence_n (edge e)

> +{

> +  remat_block_info *src = &er->m_block_info[e->src->index];

> +  remat_block_info *dest = &er->m_block_info[e->dest->index];

> +

> +  /* available_in temporarily contains the set of candidates whose

> +     registers are live on entry.  */

> +  if (empty_p (src->rd_out) || empty_p (dest->available_in))

> +    return false;

> +

> +  return bitmap_ior_and_into (er->get_bitmap (&dest->rd_in),

> +                             src->rd_out, dest->available_in);

> +}

> +

> +/* Propagate information from the rd_in set of block BB_INDEX to rd_out.

> +   Return true if something changed.  */

> +

> +bool

> +early_remat::rd_transfer (int bb_index)

> +{

> +  remat_block_info *info = &er->m_block_info[bb_index];

> +

> +  if (empty_p (info->rd_in))

> +    return false;

> +

> +  if (empty_p (info->rd_kill))

> +    {

> +      gcc_checking_assert (empty_p (info->rd_gen));

> +      if (!info->rd_out)

> +       info->rd_out = info->rd_in;

> +      else

> +       gcc_checking_assert (info->rd_out == info->rd_in);

> +      /* Assume that we only get called if something changed.  */

> +      return true;

> +    }

> +

> +  if (empty_p (info->rd_gen))

> +    return bitmap_and_compl (er->get_bitmap (&info->rd_out),

> +                            info->rd_in, info->rd_kill);

> +

> +  return bitmap_ior_and_compl (er->get_bitmap (&info->rd_out), info->rd_gen,

> +                              info->rd_in, info->rd_kill);

> +}

> +

> +/* Calculate the rd_* sets for each block.  */

> +

> +void

> +early_remat::compute_rd (void)

> +{

> +  /* First calculate the rd_kill and rd_gen sets, using the fact

> +     that m_candidates is sorted in order of decreasing LUID.  */

> +  unsigned int cand_index;

> +  remat_candidate *cand;

> +  FOR_EACH_VEC_ELT_REVERSE (m_candidates, cand_index, cand)

> +    {

> +      rtx_insn *insn = cand->insn;

> +      basic_block bb = BLOCK_FOR_INSN (insn);

> +      remat_block_info *info = &m_block_info[bb->index];

> +      bitmap kill = m_regno_to_candidates[cand->regno];

> +      bitmap_ior_into (get_bitmap (&info->rd_kill), kill);

> +      if (bitmap_bit_p (DF_LR_OUT (bb), cand->regno))

> +       {

> +         bitmap_and_compl_into (get_bitmap (&info->rd_gen), kill);

> +         bitmap_set_bit (info->rd_gen, cand_index);

> +       }

> +    }

> +

> +  /* Set up the initial values of the other sets.  */

> +  basic_block bb;

> +  FOR_EACH_BB_FN (bb, m_fn)

> +    {

> +      remat_block_info *info = &m_block_info[bb->index];

> +      unsigned int regno;

> +      bitmap_iterator bi;

> +      EXECUTE_IF_AND_IN_BITMAP (DF_LR_IN (bb), &m_candidate_regnos,

> +                               0, regno, bi)

> +       {

> +         /* Use available_in to record the set of candidates whose

> +            registers are live on entry (i.e. a maximum bound on rd_in).  */

> +         bitmap_ior_into (get_bitmap (&info->available_in),

> +                          m_regno_to_candidates[regno]);

> +

> +         /* Add registers that die in a block to the block's kill set,

> +            so that we don't needlessly propagate them through the rest

> +            of the function.  */

> +         if (!bitmap_bit_p (DF_LR_OUT (bb), regno))

> +           bitmap_ior_into (get_bitmap (&info->rd_kill),

> +                            m_regno_to_candidates[regno]);

> +       }

> +

> +      /* Initialize each block's rd_out to the minimal set (the set of

> +        local definitions).  */

> +      if (!empty_p (info->rd_gen))

> +       bitmap_copy (get_bitmap (&info->rd_out), info->rd_gen);

> +    }

> +

> +  /* Iterate until we reach a fixed point.  */

> +  er = this;

> +  bitmap_clear (&m_tmp_bitmap);

> +  bitmap_set_range (&m_tmp_bitmap, 0, last_basic_block_for_fn (m_fn));

> +  df_simple_dataflow (DF_FORWARD, NULL, NULL, rd_confluence_n, rd_transfer,

> +                     &m_tmp_bitmap, df_get_postorder (DF_FORWARD),

> +                     df_get_n_blocks (DF_FORWARD));

> +  er = 0;

> +

> +  /* Work out which definitions reach which candidates, again taking

> +     advantage of the candidate order.  */

> +  bitmap_head reaching;

> +  bitmap_initialize (&reaching, &m_obstack);

> +  basic_block old_bb = NULL;

> +  FOR_EACH_VEC_ELT_REVERSE (m_candidates, cand_index, cand)

> +    {

> +      bb = BLOCK_FOR_INSN (cand->insn);

> +      if (bb != old_bb)

> +       {

> +         /* Get the definitions that reach the start of the new block.  */

> +         remat_block_info *info = &m_block_info[bb->index];

> +         if (info->rd_in)

> +           bitmap_copy (&reaching, info->rd_in);

> +         else

> +           bitmap_clear (&reaching);

> +         old_bb = bb;

> +       }

> +      else

> +       {

> +         /* Process the definitions of the previous instruction.  */

> +         bitmap kill = m_regno_to_candidates[cand[1].regno];

> +         bitmap_and_compl_into (&reaching, kill);

> +         bitmap_set_bit (&reaching, cand_index + 1);

> +       }

> +

> +      if (cand->can_copy_p && !cand->constant_p)

> +       {

> +         df_ref ref;

> +         FOR_EACH_INSN_USE (ref, cand->insn)

> +           {

> +             unsigned int regno = DF_REF_REGNO (ref);

> +             if (bitmap_bit_p (&m_candidate_regnos, regno))

> +               {

> +                 bitmap defs = m_regno_to_candidates[regno];

> +                 bitmap_and (&m_tmp_bitmap, defs, &reaching);

> +                 bitmap_ior_into (get_bitmap (&cand->uses), &m_tmp_bitmap);

> +               }

> +           }

> +       }

> +    }

> +  bitmap_clear (&reaching);

> +}

> +

> +/* If CAND_INDEX is in an equivalence class, return the representative

> +   of the class, otherwise return CAND_INDEX.  */

> +

> +inline unsigned int

> +early_remat::canon_candidate (unsigned int cand_index)

> +{

> +  if (remat_equiv_class *ec = m_candidates[cand_index].equiv_class)

> +    return ec->representative;

> +  return cand_index;

> +}

> +

> +/* Make candidate set *PTR refer to candidates using the representative

> +   of each equivalence class.  */

> +

> +void

> +early_remat::canon_bitmap (bitmap *ptr)

> +{

> +  bitmap old_set = *ptr;

> +  if (empty_p (old_set))

> +    return;

> +

> +  bitmap new_set = NULL;

> +  unsigned int old_index;

> +  bitmap_iterator bi;

> +  EXECUTE_IF_SET_IN_BITMAP (old_set, 0, old_index, bi)

> +    {

> +      unsigned int new_index = canon_candidate (old_index);

> +      if (old_index != new_index)

> +       {

> +         if (!new_set)

> +           {

> +             new_set = alloc_bitmap ();

> +             bitmap_copy (new_set, old_set);

> +           }

> +         bitmap_clear_bit (new_set, old_index);

> +         bitmap_set_bit (new_set, new_index);

> +       }

> +    }

> +  if (new_set)

> +    {

> +      BITMAP_FREE (*ptr);

> +      *ptr = new_set;

> +    }

> +}

> +

> +/* If the candidates in REACHING all have the same value, return the

> +   earliest instance of that value (i.e. the first one to be added

> +   to m_value_table), otherwise return MULTIPLE_CANDIDATES.  */

> +

> +unsigned int

> +early_remat::resolve_reaching_def (bitmap reaching)

> +{

> +  unsigned int cand_index = bitmap_first_set_bit (reaching);

> +  if (remat_equiv_class *ec = m_candidates[cand_index].equiv_class)

> +    {

> +      if (!bitmap_intersect_compl_p (reaching, ec->members))

> +       return ec->earliest;

> +    }

> +  else if (bitmap_single_bit_set_p (reaching))

> +    return cand_index;

> +

> +  return MULTIPLE_CANDIDATES;

> +}

> +

> +/* Check whether all candidate registers used by candidate CAND_INDEX have

> +   unique definitions.  Return true if so, replacing the candidate's uses

> +   set with the appropriate form for value numbering.  */

> +

> +bool

> +early_remat::check_candidate_uses (unsigned int cand_index)

> +{

> +  remat_candidate *cand = &m_candidates[cand_index];

> +

> +  /* Process the uses for each register in turn.  */

> +  bitmap_head uses;

> +  bitmap_initialize (&uses, &m_obstack);

> +  bitmap_copy (&uses, cand->uses);

> +  bitmap uses_ec = alloc_bitmap ();

> +  while (!bitmap_empty_p (&uses))

> +    {

> +      /* Get the register for the lowest-indexed candidate remaining,

> +        and the reaching definitions of that register.  */

> +      unsigned int first = bitmap_first_set_bit (&uses);

> +      unsigned int regno = m_candidates[first].regno;

> +      bitmap_and (&m_tmp_bitmap, &uses, m_regno_to_candidates[regno]);

> +

> +      /* See whether all reaching definitions have the same value and if

> +        so get the index of the first candidate we saw with that value.  */

> +      unsigned int def = resolve_reaching_def (&m_tmp_bitmap);

> +      if (def == MULTIPLE_CANDIDATES)

> +       {

> +         if (dump_file)

> +           fprintf (dump_file, ";; Removing candidate %d because there is"

> +                    " more than one reaching definition of reg %d\n",

> +                    cand_index, regno);

> +         cand->can_copy_p = false;

> +         break;

> +       }

> +      bitmap_set_bit (uses_ec, def);

> +      bitmap_and_compl_into (&uses, &m_tmp_bitmap);

> +    }

> +  BITMAP_FREE (cand->uses);

> +  cand->uses = uses_ec;

> +  return cand->can_copy_p;

> +}

> +

> +/* Calculate the set of hard registers that would be clobbered by

> +   rematerializing candidate CAND_INDEX.  At this point the candidate's

> +   set of uses is final.  */

> +

> +void

> +early_remat::compute_clobbers (unsigned int cand_index)

> +{

> +  remat_candidate *cand = &m_candidates[cand_index];

> +  if (cand->uses)

> +    {

> +      unsigned int use_index;

> +      bitmap_iterator bi;

> +      EXECUTE_IF_SET_IN_BITMAP (cand->uses, 0, use_index, bi)

> +       if (bitmap clobbers = m_candidates[use_index].clobbers)

> +         bitmap_ior_into (get_bitmap (&cand->clobbers), clobbers);

> +    }

> +

> +  df_ref ref;

> +  FOR_EACH_INSN_DEF (ref, cand->insn)

> +    {

> +      unsigned int def_regno = DF_REF_REGNO (ref);

> +      if (def_regno != cand->regno)

> +       bitmap_set_bit (get_bitmap (&cand->clobbers), def_regno);

> +    }

> +}

> +

> +/* Mark candidate CAND_INDEX as validated and add it to the value table.  */

> +

> +void

> +early_remat::assign_value_number (unsigned int cand_index)

> +{

> +  remat_candidate *cand = &m_candidates[cand_index];

> +  gcc_checking_assert (cand->can_copy_p && !cand->validated_p);

> +

> +  compute_clobbers (cand_index);

> +  cand->validated_p = true;

> +

> +  inchash::hash h;

> +  h.add_int (cand->regno);

> +  inchash::add_rtx (cand->remat_rtx, h);

> +  cand->hash = h.end ();

> +

> +  remat_candidate **slot

> +    = m_value_table.find_slot_with_hash (cand, cand->hash, INSERT);

> +  if (!*slot)

> +    {

> +      *slot = cand;

> +      if (dump_file)

> +       fprintf (dump_file, ";; Candidate %d is not equivalent to"

> +                " others seen so far\n", cand_index);

> +    }

> +  else

> +    record_equiv_candidates (*slot - m_candidates.address (), cand_index);

> +}

> +

> +/* Make a final decision about which candidates are valid and assign

> +   value numbers to those that are.  */

> +

> +void

> +early_remat::decide_candidate_validity (void)

> +{

> +  auto_vec<unsigned int, 16> stack;

> +  unsigned int cand1_index;

> +  remat_candidate *cand1;

> +  FOR_EACH_VEC_ELT_REVERSE (m_candidates, cand1_index, cand1)

> +    {

> +      if (!cand1->can_copy_p || cand1->validated_p)

> +       continue;

> +

> +      if (empty_p (cand1->uses))

> +       {

> +         assign_value_number (cand1_index);

> +         continue;

> +       }

> +

> +      stack.safe_push (cand1_index);

> +      while (!stack.is_empty ())

> +       {

> +         unsigned int cand2_index = stack.last ();

> +         unsigned int watermark = stack.length ();

> +         remat_candidate *cand2 = &m_candidates[cand2_index];

> +         if (!cand2->can_copy_p || cand2->validated_p)

> +           {

> +             stack.pop ();

> +             continue;

> +           }

> +         cand2->visited_p = true;

> +         unsigned int cand3_index;

> +         bitmap_iterator bi;

> +         EXECUTE_IF_SET_IN_BITMAP (cand2->uses, 0, cand3_index, bi)

> +           {

> +             remat_candidate *cand3 = &m_candidates[cand3_index];

> +             if (!cand3->can_copy_p)

> +               {

> +                 if (dump_file)

> +                   fprintf (dump_file, ";; Removing candidate %d because"

> +                            " it uses removed candidate %d\n", cand2_index,

> +                            cand3_index);

> +                 cand2->can_copy_p = false;

> +                 break;

> +               }

> +             if (!cand3->validated_p)

> +               {

> +                 if (empty_p (cand3->uses))

> +                   assign_value_number (cand3_index);

> +                 else if (cand3->visited_p)

> +                   {

> +                     if (dump_file)

> +                       fprintf (dump_file, ";; Removing candidate %d"

> +                                " because its definition is cyclic\n",

> +                                cand2_index);

> +                     cand2->can_copy_p = false;

> +                     break;

> +                   }

> +                 else

> +                   stack.safe_push (cand3_index);

> +               }

> +           }

> +         if (!cand2->can_copy_p)

> +           {

> +             cand2->visited_p = false;

> +             stack.truncate (watermark - 1);

> +           }

> +         else if (watermark == stack.length ())

> +           {

> +             cand2->visited_p = false;

> +             if (check_candidate_uses (cand2_index))

> +               assign_value_number (cand2_index);

> +             stack.pop ();

> +           }

> +       }

> +    }

> +

> +  /* Ensure that the candidates always use the same candidate index

> +     to refer to an equivalence class.  */

> +  FOR_EACH_VEC_ELT_REVERSE (m_candidates, cand1_index, cand1)

> +    if (cand1->can_copy_p && !empty_p (cand1->uses))

> +      {

> +       canon_bitmap (&cand1->uses);

> +       gcc_checking_assert (bitmap_first_set_bit (cand1->uses) > cand1_index);

> +      }

> +}

> +

> +/* Assuming that every path reaching a point P contains a copy of a

> +   use U of REGNO, return true if another copy of U at P would have

> +   access to the same value of REGNO.  */

> +

> +bool

> +early_remat::stable_use_p (unsigned int regno)

> +{

> +  /* Conservatively assume not for hard registers.  */

> +  if (HARD_REGISTER_NUM_P (regno))

> +    return false;

> +

> +  /* See if REGNO has a single definition and is never used uninitialized.

> +     In this case the definition of REGNO dominates the common dominator

> +     of the uses U, which in turn dominates P.  */

> +  if (DF_REG_DEF_COUNT (regno) == 1

> +      && !bitmap_bit_p (DF_LR_OUT (ENTRY_BLOCK_PTR_FOR_FN (m_fn)), regno))

> +    return true;

> +

> +  return false;

> +}

> +

> +/* Emit a copy from DEST to SRC before candidate CAND_INDEX's instruction.  */

> +

> +void

> +early_remat::emit_copy_before (unsigned int cand_index, rtx dest, rtx src)

> +{

> +  remat_candidate *cand = &m_candidates[cand_index];

> +  if (dump_file)

> +    {

> +      fprintf (dump_file, ";; Stabilizing insn ");

> +      dump_insn_id (cand->insn);

> +      fprintf (dump_file, " by copying source reg %d:%s to temporary reg %d\n",

> +              REGNO (src), GET_MODE_NAME (GET_MODE (src)), REGNO (dest));

> +    }

> +  emit_insn_before (gen_move_insn (dest, src), cand->insn);

> +}

> +

> +/* Check whether any inputs to candidate CAND_INDEX's instruction could

> +   change at rematerialization points and replace them with new pseudo

> +   registers if so.  */

> +

> +void

> +early_remat::stabilize_pattern (unsigned int cand_index)

> +{

> +  remat_candidate *cand = &m_candidates[cand_index];

> +  if (cand->stabilized_p)

> +    return;

> +

> +  remat_equiv_class *ec = cand->equiv_class;

> +  gcc_checking_assert (!ec || cand_index == ec->representative);

> +

> +  /* Record the replacements we've made so far, so that we don't

> +     create two separate registers for match_dups.  Lookup is O(n),

> +     but the n is very small.  */

> +  typedef std::pair<rtx, rtx> reg_pair;

> +  auto_vec<reg_pair, 16> reg_map;

> +

> +  rtx_insn *insn = cand->insn;

> +  df_ref ref;

> +  FOR_EACH_INSN_USE (ref, insn)

> +    {

> +      unsigned int old_regno = DF_REF_REGNO (ref);

> +      rtx *loc = DF_REF_REAL_LOC (ref);

> +

> +      if (HARD_REGISTER_NUM_P (old_regno) && fixed_regs[old_regno])

> +       {

> +         /* We checked when adding the candidate that the value is stable.  */

> +         gcc_checking_assert (!rtx_unstable_p (*loc));

> +         continue;

> +       }

> +

> +      if (bitmap_bit_p (&m_candidate_regnos, old_regno))

> +       /* We already know which candidate provides the definition

> +          and will handle it during copying.  */

> +       continue;

> +

> +      if (stable_use_p (old_regno))

> +       /* We can continue to use the existing register.  */

> +       continue;

> +

> +      /* We need to replace the register.  See whether we've already

> +        created a suitable copy.  */

> +      rtx old_reg = *loc;

> +      rtx new_reg = NULL_RTX;

> +      machine_mode mode = GET_MODE (old_reg);

> +      reg_pair *p;

> +      unsigned int pi;

> +      FOR_EACH_VEC_ELT (reg_map, pi, p)

> +       if (REGNO (p->first) == old_regno

> +           && GET_MODE (p->first) == mode)

> +         {

> +           new_reg = p->second;

> +           break;

> +         }

> +

> +      if (!new_reg)

> +       {

> +         /* Create a new register and initialize it just before

> +            the instruction.  */

> +         new_reg = gen_reg_rtx (mode);

> +         reg_map.safe_push (reg_pair (old_reg, new_reg));

> +         if (ec)

> +           {

> +             unsigned int member_index;

> +             bitmap_iterator bi;

> +             EXECUTE_IF_SET_IN_BITMAP (ec->members, 0, member_index, bi)

> +               emit_copy_before (member_index, new_reg, old_reg);

> +           }

> +         else

> +           emit_copy_before (cand_index, new_reg, old_reg);

> +       }

> +      validate_change (insn, loc, new_reg, true);

> +    }

> +  if (num_changes_pending ())

> +    {

> +      if (!apply_change_group ())

> +       /* We checked when adding the candidates that the pattern allows

> +          hard registers to be replaced.  Nothing else should make the

> +          changes invalid.  */

> +       gcc_unreachable ();

> +

> +      if (ec)

> +       {

> +         /* Copy the new pattern to other members of the equivalence

> +            class.  */

> +         unsigned int member_index;

> +         bitmap_iterator bi;

> +         EXECUTE_IF_SET_IN_BITMAP (ec->members, 0, member_index, bi)

> +           if (cand_index != member_index)

> +             {

> +               rtx_insn *other_insn = m_candidates[member_index].insn;

> +               if (!validate_change (other_insn, &PATTERN (other_insn),

> +                                     copy_insn (PATTERN (insn)), 0))

> +                 /* If the original instruction was valid then the copy

> +                    should be too.  */

> +                 gcc_unreachable ();

> +             }

> +       }

> +    }

> +

> +  cand->stabilized_p = true;

> +}

> +

> +/* Change CAND's instruction so that it sets CAND->copy_regno instead

> +   of CAND->regno.  */

> +

> +void

> +early_remat::replace_dest_with_copy (unsigned int cand_index)

> +{

> +  remat_candidate *cand = &m_candidates[cand_index];

> +  df_ref def;

> +  FOR_EACH_INSN_DEF (def, cand->insn)

> +    if (DF_REF_REGNO (def) == cand->regno)

> +      validate_change (cand->insn, DF_REF_REAL_LOC (def),

> +                      regno_reg_rtx[cand->copy_regno], 1);

> +}

> +

> +/* Make sure that the candidates used by candidate CAND_INDEX are available.

> +   There are two ways of doing this for an input candidate I:

> +

> +   (1) Using the existing register number and ensuring that I is available.

> +

> +   (2) Using a new register number (recorded in copy_regno) and adding I

> +       to VIA_COPY.  This guarantees that making I available does not

> +       conflict with other uses of the original register.

> +

> +   REQUIRED is the set of candidates that are required but not available

> +   before the copy of CAND_INDEX.  AVAILABLE is the set of candidates

> +   that are already available before the copy of CAND_INDEX.  REACHING

> +   is the set of candidates that reach the copy of CAND_INDEX.  VIA_COPY

> +   is the set of candidates that will use new register numbers recorded

> +   in copy_regno instead of the original ones.  */

> +

> +void

> +early_remat::stabilize_candidate_uses (unsigned int cand_index,

> +                                      bitmap required, bitmap available,

> +                                      bitmap reaching, bitmap via_copy)

> +{

> +  remat_candidate *cand = &m_candidates[cand_index];

> +  df_ref use;

> +  FOR_EACH_INSN_USE (use, cand->insn)

> +    {

> +      unsigned int regno = DF_REF_REGNO (use);

> +      if (!bitmap_bit_p (&m_candidate_regnos, regno))

> +       continue;

> +

> +      /* Work out which candidate provides the definition.  */

> +      bitmap defs = m_regno_to_candidates[regno];

> +      bitmap_and (&m_tmp_bitmap, cand->uses, defs);

> +      gcc_checking_assert (bitmap_single_bit_set_p (&m_tmp_bitmap));

> +      unsigned int def_index = bitmap_first_set_bit (&m_tmp_bitmap);

> +

> +      /* First see if DEF_INDEX is the only reaching definition of REGNO

> +        at this point too and if it is or will become available.  We can

> +        continue to use REGNO if so.  */

> +      bitmap_and (&m_tmp_bitmap, reaching, defs);

> +      if (bitmap_single_bit_set_p (&m_tmp_bitmap)

> +         && bitmap_first_set_bit (&m_tmp_bitmap) == def_index

> +         && ((available && bitmap_bit_p (available, def_index))

> +             || bitmap_bit_p (required, def_index)))

> +       {

> +         if (dump_file)

> +           fprintf (dump_file, ";; Keeping reg %d for use of candidate %d"

> +                    " in candidate %d\n", regno, def_index, cand_index);

> +         continue;

> +       }

> +

> +      /* Otherwise fall back to using a copy.  There are other cases

> +        in which we *could* continue to use REGNO, but there's not

> +        really much point.  Using a separate register ought to make

> +        things easier for the register allocator.  */

> +      remat_candidate *def_cand = &m_candidates[def_index];

> +      rtx *loc = DF_REF_REAL_LOC (use);

> +      rtx new_reg;

> +      if (bitmap_set_bit (via_copy, def_index))

> +       {

> +         new_reg = gen_reg_rtx (GET_MODE (*loc));

> +         def_cand->copy_regno = REGNO (new_reg);

> +         if (dump_file)

> +           fprintf (dump_file, ";; Creating reg %d for use of candidate %d"

> +                    " in candidate %d\n", REGNO (new_reg), def_index,

> +                    cand_index);

> +       }

> +      else

> +       new_reg = regno_reg_rtx[def_cand->copy_regno];

> +      validate_change (cand->insn, loc, new_reg, 1);

> +    }

> +}

> +

> +/* Rematerialize the candidates in REQUIRED after instruction INSN,

> +   given that the candidates in AVAILABLE are already available

> +   and that REACHING is the set of candidates live after INSN.

> +   REQUIRED and AVAILABLE are disjoint on entry.

> +

> +   Clear REQUIRED on exit.  */

> +

> +void

> +early_remat::emit_remat_insns (bitmap required, bitmap available,

> +                              bitmap reaching, rtx_insn *insn)

> +{

> +  /* Quick exit if there's nothing to do.  */

> +  if (empty_p (required))

> +    return;

> +

> +  /* Only reaching definitions should be available or required.  */

> +  gcc_checking_assert (!bitmap_intersect_compl_p (required, reaching));

> +  if (available)

> +    gcc_checking_assert (!bitmap_intersect_compl_p (available, reaching));

> +

> +  bitmap_head via_copy;

> +  bitmap_initialize (&via_copy, &m_obstack);

> +  while (!bitmap_empty_p (required) || !bitmap_empty_p (&via_copy))

> +    {

> +      /* Pick the lowest-indexed candidate left.  */

> +      unsigned int required_index = (bitmap_empty_p (required)

> +                                    ? ~0U : bitmap_first_set_bit (required));

> +      unsigned int via_copy_index = (bitmap_empty_p (&via_copy)

> +                                    ? ~0U : bitmap_first_set_bit (&via_copy));

> +      unsigned int cand_index = MIN (required_index, via_copy_index);

> +      remat_candidate *cand = &m_candidates[cand_index];

> +

> +      bool via_copy_p = (cand_index == via_copy_index);

> +      if (via_copy_p)

> +       bitmap_clear_bit (&via_copy, cand_index);

> +      else

> +       {

> +         /* Remove all candidates for the same register from REQUIRED.  */

> +         bitmap_and (&m_tmp_bitmap, reaching,

> +                     m_regno_to_candidates[cand->regno]);

> +         bitmap_and_compl_into (required, &m_tmp_bitmap);

> +         gcc_checking_assert (!bitmap_bit_p (required, cand_index));

> +

> +         /* Only rematerialize if we have a single reaching definition

> +            of the register.  */

> +         if (!bitmap_single_bit_set_p (&m_tmp_bitmap))

> +           {

> +             if (dump_file)

> +               {

> +                 fprintf (dump_file, ";; Can't rematerialize reg %d after ",

> +                          cand->regno);

> +                 dump_insn_id (insn);

> +                 fprintf (dump_file, ": more than one reaching definition\n");

> +               }

> +             continue;

> +           }

> +

> +         /* Skip candidates that can't be rematerialized.  */

> +         if (!cand->can_copy_p)

> +           continue;

> +

> +         /* Check the function precondition.  */

> +         gcc_checking_assert (!available

> +                              || !bitmap_bit_p (available, cand_index));

> +       }

> +

> +      /* Invalid candidates should have been weeded out by now.  */

> +      gcc_assert (cand->can_copy_p);

> +

> +      rtx new_pattern;

> +      if (cand->constant_p)

> +       {

> +         /* Emit a simple move.  */

> +         unsigned int regno = via_copy_p ? cand->copy_regno : cand->regno;

> +         new_pattern = gen_move_insn (regno_reg_rtx[regno], cand->remat_rtx);

> +       }

> +      else

> +       {

> +         /* If this is the first time we've copied the instruction, make

> +            sure that any inputs will have the same value after INSN.  */

> +         stabilize_pattern (cand_index);

> +

> +         /* Temporarily adjust the original instruction so that it has

> +            the right form for the copy.  */

> +         if (via_copy_p)

> +           replace_dest_with_copy (cand_index);

> +         if (cand->uses)

> +           stabilize_candidate_uses (cand_index, required, available,

> +                                     reaching, &via_copy);

> +

> +         /* Get the new instruction pattern.  */

> +         new_pattern = copy_insn (cand->remat_rtx);

> +

> +         /* Undo the temporary changes.  */

> +         cancel_changes (0);

> +       }

> +

> +      /* Emit the new instruction.  */

> +      rtx_insn *new_insn = emit_insn_after (new_pattern, insn);

> +

> +      if (dump_file)

> +       {

> +         fprintf (dump_file, ";; Rematerializing candidate %d after ",

> +                  cand_index);

> +         dump_insn_id (insn);

> +         if (via_copy_p)

> +           fprintf (dump_file, " with new destination reg %d",

> +                    cand->copy_regno);

> +         fprintf (dump_file, ":\n\n");

> +         print_rtl_single (dump_file, new_insn);

> +         fprintf (dump_file, "\n");

> +       }

> +    }

> +}

> +

> +/* Recompute INFO's available_out set, given that it's distinct from

> +   available_in and available_locally.  */

> +

> +bool

> +early_remat::set_available_out (remat_block_info *info)

> +{

> +  if (empty_p (info->available_locally))

> +    return bitmap_and_compl (get_bitmap (&info->available_out),

> +                            info->available_in, info->rd_kill);

> +

> +  if (empty_p (info->rd_kill))

> +    return bitmap_ior (get_bitmap (&info->available_out),

> +                      info->available_locally, info->available_in);

> +

> +  return bitmap_ior_and_compl (get_bitmap (&info->available_out),

> +                              info->available_locally, info->available_in,

> +                              info->rd_kill);

> +}

> +

> +/* If BB has more than one call, decide which candidates should be

> +   rematerialized after the non-final calls and emit the associated

> +   instructions.  Record other information about the block in preparation

> +   for the global phase.  */

> +

> +void

> +early_remat::process_block (basic_block bb)

> +{

> +  remat_block_info *info = &m_block_info[bb->index];

> +  rtx_insn *last_call = NULL;

> +  rtx_insn *insn;

> +

> +  /* Ensure that we always use the same candidate index to refer to an

> +     equivalence class.  */

> +  if (info->rd_out == info->rd_in)

> +    {

> +      canon_bitmap (&info->rd_in);

> +      info->rd_out = info->rd_in;

> +    }

> +  else

> +    {

> +      canon_bitmap (&info->rd_in);

> +      canon_bitmap (&info->rd_out);

> +    }

> +  canon_bitmap (&info->rd_kill);

> +  canon_bitmap (&info->rd_gen);

> +

> +  /* The set of candidates that should be rematerialized on entry to the

> +     block or after the previous call (whichever is more recent).  */

> +  init_temp_bitmap (&m_required);

> +

> +  /* The set of candidates that reach the current instruction (i.e. are

> +     live just before the instruction).  */

> +  bitmap_head reaching;

> +  bitmap_initialize (&reaching, &m_obstack);

> +  if (info->rd_in)

> +    bitmap_copy (&reaching, info->rd_in);

> +

> +  /* The set of candidates that are live and available without

> +     rematerialization just before the current instruction.  This only

> +     accounts for earlier candidates in the block, or those that become

> +     available by being added to M_REQUIRED.  */

> +  init_temp_bitmap (&m_available);

> +

> +  /* Get the range of candidates in the block.  */

> +  unsigned int next_candidate = info->first_candidate;

> +  unsigned int num_candidates = info->num_candidates;

> +  remat_candidate *next_def = (num_candidates > 0

> +                              ? &m_candidates[next_candidate]

> +                              : NULL);

> +

> +  FOR_BB_INSNS (bb, insn)

> +    {

> +      if (!NONDEBUG_INSN_P (insn))

> +       continue;

> +

> +      /* First process uses, since this is a forward walk.  */

> +      df_ref ref;

> +      FOR_EACH_INSN_USE (ref, insn)

> +       {

> +         unsigned int regno = DF_REF_REGNO (ref);

> +         if (bitmap_bit_p (&m_candidate_regnos, regno))

> +           {

> +             bitmap defs = m_regno_to_candidates[regno];

> +             bitmap_and (&m_tmp_bitmap, defs, &reaching);

> +             gcc_checking_assert (!bitmap_empty_p (&m_tmp_bitmap));

> +             if (!bitmap_intersect_p (defs, m_available))

> +               {

> +                 /* There has been no definition of the register since

> +                    the last call or the start of the block (whichever

> +                    is most recent).  Mark the reaching definitions

> +                    as required at that point and thus available here.  */

> +                 bitmap_ior_into (m_required, &m_tmp_bitmap);

> +                 bitmap_ior_into (m_available, &m_tmp_bitmap);

> +               }

> +           }

> +       }

> +

> +      if (CALL_P (insn))

> +       {

> +         if (!last_call)

> +           {

> +             /* The first call in the block.  Record which candidates are

> +                required at the start of the block.  */

> +             copy_temp_bitmap (&info->required_in, &m_required);

> +             init_temp_bitmap (&m_required);

> +           }

> +         else

> +           /* The fully-local case: candidates that need to be

> +              rematerialized after a previous call in the block.  */

> +           emit_remat_insns (m_required, NULL, info->rd_after_call,

> +                             last_call);

> +         last_call = insn;

> +         bitmap_clear (m_available);

> +         gcc_checking_assert (empty_p (m_required));

> +       }

> +

> +      /* Now process definitions.  */

> +      if (next_def && insn == next_def->insn)

> +       {

> +         unsigned int gen = canon_candidate (next_candidate);

> +

> +         /* Other candidates with the same regno are not available

> +            any more.  */

> +         bitmap kill = m_regno_to_candidates[next_def->regno];

> +         bitmap_and_compl_into (m_available, kill);

> +         bitmap_and_compl_into (&reaching, kill);

> +

> +         /* Record that this candidate is available without

> +            rematerialization.  */

> +         bitmap_set_bit (m_available, gen);

> +         bitmap_set_bit (&reaching, gen);

> +

> +         /* Find the next candidate in the block.  */

> +         num_candidates -= 1;

> +         next_candidate -= 1;

> +         if (num_candidates > 0)

> +           next_def -= 1;

> +         else

> +           next_def = NULL;

> +       }

> +

> +      if (insn == last_call)

> +       bitmap_copy (get_bitmap (&info->rd_after_call), &reaching);

> +    }

> +  bitmap_clear (&reaching);

> +  gcc_checking_assert (num_candidates == 0);

> +

> +  /* Remove values from the available set if they aren't live (and so

> +     aren't interesting to successor blocks).  */

> +  if (info->rd_out)

> +    bitmap_and_into (m_available, info->rd_out);

> +

> +  /* Record the accumulated information.  */

> +  info->last_call = last_call;

> +  info->abnormal_call_p = (last_call

> +                          && last_call == BB_END (bb)

> +                          && has_abnormal_or_eh_outgoing_edge_p (bb));

> +  copy_temp_bitmap (&info->available_locally, &m_available);

> +  if (last_call)

> +    copy_temp_bitmap (&info->required_after_call, &m_required);

> +  else

> +    copy_temp_bitmap (&info->required_in, &m_required);

> +

> +  /* Assume at first that all live-in values are available without

> +     rematerialization (i.e. start with the most optimistic assumption).  */

> +  if (info->available_in)

> +    {

> +      if (info->rd_in)

> +       bitmap_copy (info->available_in, info->rd_in);

> +      else

> +       BITMAP_FREE (info->available_in);

> +    }

> +

> +  if (last_call || empty_p (info->available_in))

> +    /* The values available on exit from the block are exactly those that

> +       are available locally.  This set doesn't change.  */

> +    info->available_out = info->available_locally;

> +  else if (empty_p (info->available_locally) && empty_p (info->rd_kill))

> +    /* The values available on exit are the same as those available on entry.

> +       Updating one updates the other.  */

> +    info->available_out = info->available_in;

> +  else

> +    set_available_out (info);

> +}

> +

> +/* Process each block as for process_block, visiting dominators before

> +   the blocks they dominate.  */

> +

> +void

> +early_remat::local_phase (void)

> +{

> +  if (dump_file)

> +    fprintf (dump_file, "\n;; Local phase:\n");

> +

> +  int *postorder = df_get_postorder (DF_BACKWARD);

> +  unsigned int postorder_len = df_get_n_blocks (DF_BACKWARD);

> +  for (unsigned int i = postorder_len; i-- > 0; )

> +    if (postorder[i] >= NUM_FIXED_BLOCKS)

> +      process_block (BASIC_BLOCK_FOR_FN (m_fn, postorder[i]));

> +}

> +

> +/* Return true if available values survive across edge E.  */

> +

> +static inline bool

> +available_across_edge_p (edge e)

> +{

> +  return (e->flags & EDGE_EH) == 0;

> +}

> +

> +/* Propagate information from the available_out set of E->src to the

> +   available_in set of E->dest, when computing global availability.

> +   Return true if something changed.  */

> +

> +bool

> +early_remat::avail_confluence_n (edge e)

> +{

> +  remat_block_info *src = &er->m_block_info[e->src->index];

> +  remat_block_info *dest = &er->m_block_info[e->dest->index];

> +

> +  if (!available_across_edge_p (e))

> +    return false;

> +

> +  if (empty_p (dest->available_in))

> +    return false;

> +

> +  if (!src->available_out)

> +    {

> +      bitmap_clear (dest->available_in);

> +      return true;

> +    }

> +

> +  return bitmap_and_into (dest->available_in, src->available_out);

> +}

> +

> +/* Propagate information from the available_in set of block BB_INDEX

> +   to available_out.  Return true if something changed.  */

> +

> +bool

> +early_remat::avail_transfer (int bb_index)

> +{

> +  remat_block_info *info = &er->m_block_info[bb_index];

> +

> +  if (info->available_out == info->available_locally)

> +    return false;

> +

> +  if (info->available_out == info->available_in)

> +    /* Assume that we are only called if the input changed.  */

> +    return true;

> +

> +  return er->set_available_out (info);

> +}

> +

> +/* Compute global availability for the function, starting with the local

> +   information computed by local_phase.  */

> +

> +void

> +early_remat::compute_availability (void)

> +{

> +  /* We use df_simple_dataflow instead of the lcm routines for three reasons:

> +

> +     (1) it avoids recomputing the traversal order;

> +     (2) many of the sets are likely to be sparse, so we don't necessarily

> +        want to use sbitmaps; and

> +     (3) it means we can avoid creating an explicit kill set for the call.  */

> +  er = this;

> +  bitmap_clear (&m_tmp_bitmap);

> +  bitmap_set_range (&m_tmp_bitmap, 0, last_basic_block_for_fn (m_fn));

> +  df_simple_dataflow (DF_FORWARD, NULL, NULL,

> +                     avail_confluence_n, avail_transfer,

> +                     &m_tmp_bitmap, df_get_postorder (DF_FORWARD),

> +                     df_get_n_blocks (DF_FORWARD));

> +  er = 0;

> +

> +  /* Restrict the required_in sets to values that aren't available.  */

> +  basic_block bb;

> +  FOR_EACH_BB_FN (bb, m_fn)

> +    {

> +      remat_block_info *info = &m_block_info[bb->index];

> +      if (info->required_in && info->available_in)

> +       bitmap_and_compl_into (info->required_in, info->available_in);

> +    }

> +}

> +

> +/* Make sure that INFO's available_out and available_in sets are unique.  */

> +

> +inline void

> +early_remat::unshare_available_sets (remat_block_info *info)

> +{

> +  if (info->available_in && info->available_in == info->available_out)

> +    {

> +      info->available_in = alloc_bitmap ();

> +      bitmap_copy (info->available_in, info->available_out);

> +    }

> +}

> +

> +/* Return true if it is possible to move rematerializations from the

> +   destination of E to the source of E.  */

> +

> +inline bool

> +early_remat::can_move_across_edge_p (edge e)

> +{

> +  return (available_across_edge_p (e)

> +         && !m_block_info[e->src->index].abnormal_call_p);

> +}

> +

> +/* Return true if it is cheaper to rematerialize values at the head of

> +   block QUERY_BB_INDEX instead of rematerializing in its predecessors.  */

> +

> +bool

> +early_remat::local_remat_cheaper_p (unsigned int query_bb_index)

> +{

> +  if (m_block_info[query_bb_index].remat_frequency_valid_p)

> +    return m_block_info[query_bb_index].local_remat_cheaper_p;

> +

> +  /* Iteratively compute the cost of rematerializing values in the

> +     predecessor blocks, then compare that with the cost of

> +     rematerializing at the head of the block.

> +

> +     A cycle indicates that there is no call on that execution path,

> +     so it isn't necessary to rematerialize on that path.  */

> +  auto_vec<basic_block, 16> stack;

> +  stack.quick_push (BASIC_BLOCK_FOR_FN (m_fn, query_bb_index));

> +  while (!stack.is_empty ())

> +    {

> +      basic_block bb = stack.last ();

> +      remat_block_info *info = &m_block_info[bb->index];

> +      if (info->remat_frequency_valid_p)

> +       {

> +         stack.pop ();

> +         continue;

> +       }

> +

> +      info->visited_p = true;

> +      int frequency = 0;

> +      bool can_move_p = true;

> +      edge e;

> +      edge_iterator ei;

> +      FOR_EACH_EDGE (e, ei, bb->preds)

> +       if (!can_move_across_edge_p (e))

> +         {

> +           can_move_p = false;

> +           break;

> +         }

> +       else if (m_block_info[e->src->index].last_call)

> +         /* We'll rematerialize after the call.  */

> +         frequency += e->src->count.to_frequency (m_fn);

> +       else if (m_block_info[e->src->index].remat_frequency_valid_p)

> +         /* Add the cost of rematerializing at the head of E->src

> +            or in its predecessors (whichever is cheaper).  */

> +         frequency += m_block_info[e->src->index].remat_frequency;

> +       else if (!m_block_info[e->src->index].visited_p)

> +         /* Queue E->src and then revisit this block again.  */

> +         stack.safe_push (e->src);

> +

> +      /* Come back to this block later if we need to process some of

> +        its predecessors.  */

> +      if (stack.last () != bb)

> +       continue;

> +

> +      /* If rematerializing in and before the block have equal cost, prefer

> +        rematerializing in the block.  This should shorten the live range.  */

> +      int bb_frequency = bb->count.to_frequency (m_fn);

> +      if (!can_move_p || frequency >= bb_frequency)

> +       {

> +         info->local_remat_cheaper_p = true;

> +         info->remat_frequency = bb_frequency;

> +       }

> +      else

> +       info->remat_frequency = frequency;

> +      info->remat_frequency_valid_p = true;

> +      info->visited_p = false;

> +      if (dump_file)

> +       {

> +         if (!can_move_p)

> +           fprintf (dump_file, ";; Need to rematerialize at the head of"

> +                    " block %d; cannot move to predecessors.\n", bb->index);

> +         else

> +           {

> +             fprintf (dump_file, ";; Block %d has frequency %d,"

> +                      " rematerializing in predecessors has frequency %d",

> +                      bb->index, bb_frequency, frequency);

> +             if (info->local_remat_cheaper_p)

> +               fprintf (dump_file, "; prefer to rematerialize"

> +                        " in the block\n");

> +             else

> +               fprintf (dump_file, "; prefer to rematerialize"

> +                        " in predecessors\n");

> +           }

> +       }

> +      stack.pop ();

> +    }

> +  return m_block_info[query_bb_index].local_remat_cheaper_p;

> +}

> +

> +/* Return true if we cannot rematerialize candidate CAND_INDEX at the head of

> +   block BB_INDEX.  */

> +

> +bool

> +early_remat::need_to_move_candidate_p (unsigned int bb_index,

> +                                      unsigned int cand_index)

> +{

> +  remat_block_info *info = &m_block_info[bb_index];

> +  remat_candidate *cand = &m_candidates[cand_index];

> +  basic_block bb = BASIC_BLOCK_FOR_FN (m_fn, bb_index);

> +

> +  /* If there is more than one reaching definition of REGNO,

> +     we'll need to rematerialize in predecessors instead.  */

> +  bitmap_and (&m_tmp_bitmap, info->rd_in, m_regno_to_candidates[cand->regno]);

> +  if (!bitmap_single_bit_set_p (&m_tmp_bitmap))

> +    {

> +      if (dump_file)

> +       fprintf (dump_file, ";; Cannot rematerialize %d at the"

> +                " head of block %d because there is more than one"

> +                " reaching definition of reg %d\n", cand_index,

> +                bb_index, cand->regno);

> +      return true;

> +    }

> +

> +  /* Likewise if rematerializing CAND here would clobber a live register.  */

> +  if (cand->clobbers

> +      && bitmap_intersect_p (cand->clobbers, DF_LR_IN (bb)))

> +    {

> +      if (dump_file)

> +       fprintf (dump_file, ";; Cannot rematerialize %d at the"

> +                " head of block %d because it would clobber live"

> +                " registers\n", cand_index, bb_index);

> +      return true;

> +    }

> +

> +  return false;

> +}

> +

> +/* Set REQUIRED to the minimum set of candidates that must be rematerialized

> +   in predecessors of block BB_INDEX instead of at the start of the block.  */

> +

> +void

> +early_remat::compute_minimum_move_set (unsigned int bb_index,

> +                                      bitmap required)

> +{

> +  remat_block_info *info = &m_block_info[bb_index];

> +  bitmap_head remaining;

> +

> +  bitmap_clear (required);

> +  bitmap_initialize (&remaining, &m_obstack);

> +  bitmap_copy (&remaining, info->required_in);

> +  while (!bitmap_empty_p (&remaining))

> +    {

> +      unsigned int cand_index = bitmap_first_set_bit (&remaining);

> +      remat_candidate *cand = &m_candidates[cand_index];

> +      bitmap_clear_bit (&remaining, cand_index);

> +

> +      /* Leave invalid candidates where they are.  */

> +      if (!cand->can_copy_p)

> +       continue;

> +

> +      /* Decide whether to move this candidate.  */

> +      if (!bitmap_bit_p (required, cand_index))

> +       {

> +         if (!need_to_move_candidate_p (bb_index, cand_index))

> +           continue;

> +         bitmap_set_bit (required, cand_index);

> +       }

> +

> +      /* Also move values used by the candidate, so that we don't

> +        rematerialize them twice.  */

> +      if (cand->uses)

> +       {

> +         bitmap_ior_and_into (required, cand->uses, info->required_in);

> +         bitmap_ior_and_into (&remaining, cand->uses, info->required_in);

> +       }

> +    }

> +}

> +

> +/* Make the predecessors of BB_INDEX rematerialize the candidates in

> +   REQUIRED.  Add any blocks whose required_in set changes to

> +   PENDING_BLOCKS.  */

> +

> +void

> +early_remat::move_to_predecessors (unsigned int bb_index, bitmap required,

> +                                  bitmap pending_blocks)

> +{

> +  if (empty_p (required))

> +    return;

> +  remat_block_info *dest_info = &m_block_info[bb_index];

> +  basic_block bb = BASIC_BLOCK_FOR_FN (m_fn, bb_index);

> +  edge e;

> +  edge_iterator ei;

> +  FOR_EACH_EDGE (e, ei, bb->preds)

> +    {

> +      remat_block_info *src_info = &m_block_info[e->src->index];

> +

> +      /* Restrict the set we add to the reaching definitions.  */

> +      bitmap_and (&m_tmp_bitmap, required, src_info->rd_out);

> +      if (bitmap_empty_p (&m_tmp_bitmap))

> +       continue;

> +

> +      if (!can_move_across_edge_p (e))

> +       {

> +         /* We can't move the rematerialization and we can't do it at

> +            the start of the block either.  In this case we just give up

> +            and rely on spilling to make the values available across E.  */

> +         if (dump_file)

> +           {

> +             fprintf (dump_file, ";; Cannot rematerialize the following"

> +                      " candidates in block %d:", e->src->index);

> +             dump_candidate_bitmap (required);

> +             fprintf (dump_file, "\n");

> +           }

> +         continue;

> +       }

> +

> +      /* Remove candidates that are already available.  */

> +      if (src_info->available_out)

> +       {

> +         bitmap_and_compl_into (&m_tmp_bitmap, src_info->available_out);

> +         if (bitmap_empty_p (&m_tmp_bitmap))

> +           continue;

> +       }

> +

> +      /* Add the remaining candidates to the appropriate required set.  */

> +      if (dump_file)

> +       {

> +         fprintf (dump_file, ";; Moving this set from block %d"

> +                  " to block %d:", bb_index, e->src->index);

> +         dump_candidate_bitmap (&m_tmp_bitmap);

> +         fprintf (dump_file, "\n");

> +       }

> +      /* If the source block contains a call, we want to rematerialize

> +        after the call, otherwise we want to rematerialize at the start

> +        of the block.  */

> +      bitmap src_required = get_bitmap (src_info->last_call

> +                                       ? &src_info->required_after_call

> +                                       : &src_info->required_in);

> +      if (bitmap_ior_into (src_required, &m_tmp_bitmap))

> +       {

> +         if (!src_info->last_call)

> +           bitmap_set_bit (pending_blocks, e->src->index);

> +         unshare_available_sets (src_info);

> +         bitmap_ior_into (get_bitmap (&src_info->available_out),

> +                          &m_tmp_bitmap);

> +       }

> +    }

> +

> +  /* The candidates are now available on entry to the block.  */

> +  bitmap_and_compl_into (dest_info->required_in, required);

> +  unshare_available_sets (dest_info);

> +  bitmap_ior_into (get_bitmap (&dest_info->available_in), required);

> +}

> +

> +/* Go through the candidates that are currently marked as being

> +   rematerialized at the beginning of a block.  Decide in each case

> +   whether that's valid and profitable; if it isn't, move the

> +   rematerialization to predecessor blocks instead.  */

> +

> +void

> +early_remat::choose_rematerialization_points (void)

> +{

> +  bitmap_head required;

> +  bitmap_head pending_blocks;

> +

> +  int *postorder = df_get_postorder (DF_BACKWARD);

> +  unsigned int postorder_len = df_get_n_blocks (DF_BACKWARD);

> +  bitmap_initialize (&required, &m_obstack);

> +  bitmap_initialize (&pending_blocks, &m_obstack);

> +  do

> +    /* Process the blocks in postorder, to reduce the number of iterations

> +       of the outer loop.  */

> +    for (unsigned int i = 0; i < postorder_len; ++i)

> +      {

> +       unsigned int bb_index = postorder[i];

> +       remat_block_info *info = &m_block_info[bb_index];

> +       bitmap_clear_bit (&pending_blocks, bb_index);

> +

> +       if (empty_p (info->required_in))

> +         continue;

> +

> +       if (info->available_in)

> +         gcc_checking_assert (!bitmap_intersect_p (info->required_in,

> +                                                   info->available_in));

> +

> +       if (local_remat_cheaper_p (bb_index))

> +         {

> +           /* We'd prefer to rematerialize at the head of the block.

> +              Only move candidates if we need to.  */

> +           compute_minimum_move_set (bb_index, &required);

> +           move_to_predecessors (bb_index, &required, &pending_blocks);

> +         }

> +       else

> +         move_to_predecessors (bb_index, info->required_in,

> +                               &pending_blocks);

> +      }

> +  while (!bitmap_empty_p (&pending_blocks));

> +  bitmap_clear (&required);

> +}

> +

> +/* Emit all rematerialization instructions queued for BB.  */

> +

> +void

> +early_remat::emit_remat_insns_for_block (basic_block bb)

> +{

> +  remat_block_info *info = &m_block_info[bb->index];

> +

> +  if (info->last_call && !empty_p (info->required_after_call))

> +    emit_remat_insns (info->required_after_call, NULL,

> +                     info->rd_after_call, info->last_call);

> +

> +  if (!empty_p (info->required_in))

> +    {

> +      rtx_insn *insn = BB_HEAD (bb);

> +      while (insn != BB_END (bb)

> +            && !INSN_P (NEXT_INSN (insn)))

> +       insn = NEXT_INSN (insn);

> +      emit_remat_insns (info->required_in, info->available_in,

> +                       info->rd_in, insn);

> +    }

> +}

> +

> +/* Decide which candidates in each block's REQUIRED_IN set need to be

> +   rematerialized and decide where the rematerialization instructions

> +   should go.  Emit queued rematerialization instructions at the start

> +   of blocks and after the last calls in blocks.  */

> +

> +void

> +early_remat::global_phase (void)

> +{

> +  compute_availability ();

> +  if (dump_file)

> +    {

> +      fprintf (dump_file, "\n;; Blocks after computing global"

> +              " availability:\n");

> +      dump_all_blocks ();

> +    }

> +

> +  choose_rematerialization_points ();

> +  if (dump_file)

> +    {

> +      fprintf (dump_file, "\n;; Blocks after choosing rematerialization"

> +              " points:\n");

> +      dump_all_blocks ();

> +    }

> +

> +  basic_block bb;

> +  FOR_EACH_BB_FN (bb, m_fn)

> +    emit_remat_insns_for_block (bb);

> +}

> +

> +/* Main function for the pass.  */

> +

> +void

> +early_remat::run (void)

> +{

> +  df_analyze ();

> +

> +  if (!collect_candidates ())

> +    return;

> +

> +  init_block_info ();

> +  sort_candidates ();

> +  finalize_candidate_indices ();

> +  if (dump_file)

> +    dump_all_candidates ();

> +

> +  compute_rd ();

> +  decide_candidate_validity ();

> +  local_phase ();

> +  global_phase ();

> +}

> +

> +early_remat::early_remat (function *fn, sbitmap selected_modes)

> +  : m_fn (fn),

> +    m_selected_modes (selected_modes),

> +    m_available (0),

> +    m_required (0),

> +    m_value_table (63)

> +{

> +  bitmap_obstack_initialize (&m_obstack);

> +  bitmap_initialize (&m_candidate_regnos, &m_obstack);

> +  bitmap_initialize (&m_tmp_bitmap, &m_obstack);

> +}

> +

> +early_remat::~early_remat ()

> +{

> +  bitmap_obstack_release (&m_obstack);

> +}

> +

> +namespace {

> +

> +const pass_data pass_data_early_remat =

> +{

> +  RTL_PASS, /* type */

> +  "early_remat", /* name */

> +  OPTGROUP_NONE, /* optinfo_flags */

> +  TV_EARLY_REMAT, /* tv_id */

> +  0, /* properties_required */

> +  0, /* properties_provided */

> +  0, /* properties_destroyed */

> +  0, /* todo_flags_start */

> +  TODO_df_finish, /* todo_flags_finish */

> +};

> +

> +class pass_early_remat : public rtl_opt_pass

> +{

> +public:

> +  pass_early_remat (gcc::context *ctxt)

> +    : rtl_opt_pass (pass_data_early_remat, ctxt)

> +  {}

> +

> +  /* opt_pass methods: */

> +  virtual bool gate (function *)

> +  {

> +    return optimize > 1 && NUM_POLY_INT_COEFFS > 1;

> +  }

> +

> +  virtual unsigned int execute (function *f)

> +  {

> +    auto_sbitmap selected_modes (NUM_MACHINE_MODES);

> +    bitmap_clear (selected_modes);

> +    targetm.select_early_remat_modes (selected_modes);

> +    if (!bitmap_empty_p (selected_modes))

> +      early_remat (f, selected_modes).run ();

> +    return 0;

> +  }

> +}; // class pass_early_remat

> +

> +} // anon namespace

> +

> +rtl_opt_pass *

> +make_pass_early_remat (gcc::context *ctxt)

> +{

> +  return new pass_early_remat (ctxt);

> +}

> Index: gcc/config/aarch64/aarch64.c

> ===================================================================

> --- gcc/config/aarch64/aarch64.c        2017-11-17 15:56:45.548761354 +0000

> +++ gcc/config/aarch64/aarch64.c        2017-11-17 15:56:53.018759920 +0000

> @@ -17292,6 +17292,22 @@ aarch64_can_change_mode_class (machine_m

>    return true;

>  }

>

> +/* Implement TARGET_EARLY_REMAT_MODES.  */

> +

> +static void

> +aarch64_select_early_remat_modes (sbitmap modes)

> +{

> +  /* SVE values are not normally live across a call, so it should be

> +     worth doing early rematerialization even in VL-specific mode.  */

> +  for (int i = 0; i < NUM_MACHINE_MODES; ++i)

> +    {

> +      machine_mode mode = (machine_mode) i;

> +      unsigned int vec_flags = aarch64_classify_vector_mode (mode);

> +      if (vec_flags & VEC_ANY_SVE)

> +       bitmap_set_bit (modes, i);

> +    }

> +}

> +

>  /* Target-specific selftests.  */

>

>  #if CHECKING_P

> @@ -17766,6 +17782,9 @@ #define TARGET_COMPUTE_PRESSURE_CLASSES

>  #undef TARGET_CAN_CHANGE_MODE_CLASS

>  #define TARGET_CAN_CHANGE_MODE_CLASS aarch64_can_change_mode_class

>

> +#undef TARGET_SELECT_EARLY_REMAT_MODES

> +#define TARGET_SELECT_EARLY_REMAT_MODES aarch64_select_early_remat_modes

> +

>  #if CHECKING_P

>  #undef TARGET_RUN_TARGET_SELFTESTS

>  #define TARGET_RUN_TARGET_SELFTESTS selftest::aarch64_run_selftests

> Index: gcc/testsuite/gcc.target/aarch64/sve_spill_1.c

> ===================================================================

> --- gcc/testsuite/gcc.target/aarch64/sve_spill_1.c      2017-11-17 15:56:35.691963931 +0000

> +++ gcc/testsuite/gcc.target/aarch64/sve_spill_1.c      2017-11-17 15:56:53.022759920 +0000

> @@ -26,3 +26,5 @@ TEST_LOOP (uint64_t, 511);

>  /* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #511\n} 2 } } */

>  /* { dg-final { scan-assembler-not {\tldr\tz[0-9]} } } */

>  /* { dg-final { scan-assembler-not {\tstr\tz[0-9]} } } */

> +/* { dg-final { scan-assembler-not {\tldr\tp[0-9]} } } */

> +/* { dg-final { scan-assembler-not {\tstr\tp[0-9]} } } */

> Index: gcc/testsuite/gcc.target/aarch64/sve_spill_2.c

> ===================================================================

> --- /dev/null   2017-11-14 14:28:07.424493901 +0000

> +++ gcc/testsuite/gcc.target/aarch64/sve_spill_2.c      2017-11-17 15:56:53.022759920 +0000

> @@ -0,0 +1,39 @@

> +/* { dg-do compile } */

> +/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve -msve-vector-bits=scalable" } */

> +

> +#include <stdint.h>

> +

> +void consumer (void *);

> +

> +#define TEST_LOOP(TYPE)                                \

> +  void                                         \

> +  multi_loop_##TYPE (TYPE *x, TYPE val)                \

> +  {                                            \

> +    for (int i = 0; i < 7; ++i)                        \

> +      x[i] += val;                             \

> +    consumer (x);                              \

> +    for (int i = 0; i < 7; ++i)                        \

> +      x[i] += val;                             \

> +    consumer (x);                              \

> +    for (int i = 0; i < 7; ++i)                        \

> +      x[i] += val;                             \

> +    consumer (x);                              \

> +  }

> +

> +/* One iteration is enough.  */

> +TEST_LOOP (uint8_t);

> +TEST_LOOP (uint16_t);

> +/* Two iterations are enough.  Complete unrolling makes sense

> +   even at -O2.  */

> +TEST_LOOP (uint32_t);

> +/* Four iterations are needed; ought to stay a loop.  */

> +TEST_LOOP (uint64_t);

> +

> +/* { dg-final { scan-assembler-times {\twhilelo\tp[0-9]\.b} 3 } } */

> +/* { dg-final { scan-assembler-times {\twhilelo\tp[0-9]\.h} 3 } } */

> +/* { dg-final { scan-assembler {\twhilelo\tp[0-9]\.s} } } */

> +/* { dg-final { scan-assembler-times {\twhilelo\tp[0-9]\.d} 6 } } */

> +/* { dg-final { scan-assembler-not {\tldr\tz[0-9]} } } */

> +/* { dg-final { scan-assembler-not {\tstr\tz[0-9]} } } */

> +/* { dg-final { scan-assembler-not {\tldr\tp[0-9]} } } */

> +/* { dg-final { scan-assembler-not {\tstr\tp[0-9]} } } */

> Index: gcc/testsuite/gcc.target/aarch64/sve_spill_3.c

> ===================================================================

> --- /dev/null   2017-11-14 14:28:07.424493901 +0000

> +++ gcc/testsuite/gcc.target/aarch64/sve_spill_3.c      2017-11-17 15:56:53.023759919 +0000

> @@ -0,0 +1,48 @@

> +/* { dg-do compile } */

> +/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve -msve-vector-bits=scalable" } */

> +

> +#include <stdint.h>

> +

> +void consumer (void *);

> +

> +#define TEST_LOOP(TYPE) \

> +  void                                                         \

> +  multi_loop_##TYPE (TYPE *x, TYPE val1, TYPE val2, int n)     \

> +  {                                                            \

> +    for (int i = 0; i < n; ++i)                                        \

> +      {                                                                \

> +       x[i * 2] += val1;                                       \

> +       x[i * 2 + 1] += val2;                                   \

> +      }                                                                \

> +    consumer (x);                                              \

> +    for (int i = 0; i < n; ++i)                                        \

> +      {                                                                \

> +       x[i * 2] += val1;                                       \

> +       x[i * 2 + 1] += val2;                                   \

> +      }                                                                \

> +    consumer (x);                                              \

> +    for (int i = 0; i < n; ++i)                                        \

> +      {                                                                \

> +       x[i * 2] += val1;                                       \

> +       x[i * 2 + 1] += val2;                                   \

> +      }                                                                \

> +    consumer (x);                                              \

> +  }

> +

> +/* One iteration is enough.  */

> +TEST_LOOP (uint8_t);

> +TEST_LOOP (uint16_t);

> +/* Two iterations are enough.  Complete unrolling makes sense

> +   even at -O2.  */

> +TEST_LOOP (uint32_t);

> +/* Four iterations are needed; ought to stay a loop.  */

> +TEST_LOOP (uint64_t);

> +

> +/* { dg-final { scan-assembler {\tld1b\tz[0-9]\.b} } } */

> +/* { dg-final { scan-assembler {\tld1h\tz[0-9]\.h} } } */

> +/* { dg-final { scan-assembler {\tld1w\tz[0-9]\.s} } } */

> +/* { dg-final { scan-assembler {\tld1d\tz[0-9]\.d} } } */

> +/* { dg-final { scan-assembler-not {\tldr\tz[0-9]} } } */

> +/* { dg-final { scan-assembler-not {\tstr\tz[0-9]} } } */

> +/* { dg-final { scan-assembler-not {\tldr\tp[0-9]} } } */

> +/* { dg-final { scan-assembler-not {\tstr\tp[0-9]} } } */

> Index: gcc/testsuite/gcc.target/aarch64/sve_spill_4.c

> ===================================================================

> --- /dev/null   2017-11-14 14:28:07.424493901 +0000

> +++ gcc/testsuite/gcc.target/aarch64/sve_spill_4.c      2017-11-17 15:56:53.023759919 +0000

> @@ -0,0 +1,36 @@

> +/* { dg-do compile } */

> +/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve" } */

> +

> +#include <stdint.h>

> +

> +void consumer (void *);

> +

> +#define TEST_LOOP(TYPE, VAL)                   \

> +  void                                         \

> +  multi_loop_##TYPE (TYPE *x)                  \

> +  {                                            \

> +    for (int i = 0; i < 100; ++i)              \

> +      x[i] += VAL;                             \

> +    consumer (x);                              \

> +    for (int i = 0; i < 100; ++i)              \

> +      x[i] += VAL;                             \

> +    consumer (x);                              \

> +    for (int i = 0; i < 100; ++i)              \

> +      x[i] += VAL;                             \

> +    consumer (x);                              \

> +  }

> +

> +TEST_LOOP (uint16_t, 0x1234);

> +TEST_LOOP (uint32_t, 0x12345);

> +TEST_LOOP (uint64_t, 0x123456);

> +

> +/* { dg-final { scan-assembler-times {\tptrue\tp[0-9]+\.h,} 3 } } */

> +/* { dg-final { scan-assembler-times {\tptrue\tp[0-9]+\.s,} 3 } } */

> +/* { dg-final { scan-assembler-times {\tptrue\tp[0-9]+\.d,} 3 } } */

> +/* { dg-final { scan-assembler-times {\tld1rh\tz[0-9]+\.h,} 3 } } */

> +/* { dg-final { scan-assembler-times {\tld1rw\tz[0-9]+\.s,} 3 } } */

> +/* { dg-final { scan-assembler-times {\tld1rd\tz[0-9]+\.d,} 3 } } */

> +/* { dg-final { scan-assembler-not {\tldr\tz[0-9]} } } */

> +/* { dg-final { scan-assembler-not {\tstr\tz[0-9]} } } */

> +/* { dg-final { scan-assembler-not {\tldr\tp[0-9]} } } */

> +/* { dg-final { scan-assembler-not {\tstr\tp[0-9]} } } */

> Index: gcc/testsuite/gcc.target/aarch64/sve_spill_5.c

> ===================================================================

> --- /dev/null   2017-11-14 14:28:07.424493901 +0000

> +++ gcc/testsuite/gcc.target/aarch64/sve_spill_5.c      2017-11-17 15:56:53.023759919 +0000

> @@ -0,0 +1,34 @@

> +/* { dg-do compile } */

> +/* { dg-options "-O2 -ftree-vectorize -ffast-math -march=armv8-a+sve" } */

> +

> +#include <stdint.h>

> +

> +void consumer (void *);

> +

> +#define TEST_LOOP(TYPE, VAL)                   \

> +  void                                         \

> +  multi_loop_##TYPE (TYPE *x)                  \

> +  {                                            \

> +    for (int i = 0; i < 100; ++i)              \

> +      x[i] += VAL + i;                         \

> +    consumer (x);                              \

> +    for (int i = 0; i < 100; ++i)              \

> +      x[i] += VAL + i;                         \

> +    consumer (x);                              \

> +    for (int i = 0; i < 100; ++i)              \

> +      x[i] += VAL + i;                         \

> +    consumer (x);                              \

> +  }

> +

> +TEST_LOOP (uint8_t, 3);

> +TEST_LOOP (uint16_t, 4);

> +TEST_LOOP (uint32_t, 5);

> +TEST_LOOP (uint64_t, 6);

> +TEST_LOOP (float, 2.5f);

> +TEST_LOOP (double, 3.5);

> +

> +/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\..,} 18 } } */

> +/* { dg-final { scan-assembler-not {\tldr\tz[0-9]} } } */

> +/* { dg-final { scan-assembler-not {\tstr\tz[0-9]} } } */

> +/* { dg-final { scan-assembler-not {\tldr\tp[0-9]} } } */

> +/* { dg-final { scan-assembler-not {\tstr\tp[0-9]} } } */

> Index: gcc/testsuite/gcc.target/aarch64/sve_spill_6.c

> ===================================================================

> --- /dev/null   2017-11-14 14:28:07.424493901 +0000

> +++ gcc/testsuite/gcc.target/aarch64/sve_spill_6.c      2017-11-17 15:56:53.023759919 +0000

> @@ -0,0 +1,44 @@

> +/* { dg-do compile } */

> +/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve -msve-vector-bits=scalable" } */

> +

> +#include <stdint.h>

> +

> +void consumer (void *);

> +

> +#define TEST_LOOP(TYPE, VAL)                                           \

> +  void                                                                 \

> +  multi_loop_##TYPE (TYPE *x1, TYPE *x2, TYPE *x3, TYPE *x4, int which) \

> +  {                                                                    \

> +    if (which)                                                         \

> +      {                                                                        \

> +       for (int i = 0; i < 7; ++i)                                     \

> +         x1[i] += VAL;                                                 \

> +       consumer (x1);                                                  \

> +       for (int i = 0; i < 7; ++i)                                     \

> +         x2[i] -= VAL;                                                 \

> +       consumer (x2);                                                  \

> +      }                                                                        \

> +    else                                                               \

> +      {                                                                        \

> +       for (int i = 0; i < 7; ++i)                                     \

> +         x3[i] &= VAL;                                                 \

> +       consumer (x3);                                                  \

> +      }                                                                        \

> +    for (int i = 0; i < 7; ++i)                                                \

> +      x4[i] |= VAL;                                                    \

> +    consumer (x4);                                                     \

> +  }

> +

> +TEST_LOOP (uint8_t, 0x12);

> +TEST_LOOP (uint16_t, 0x1234);

> +TEST_LOOP (uint32_t, 0x12345);

> +TEST_LOOP (uint64_t, 0x123456);

> +

> +/* { dg-final { scan-assembler {\tld1b\tz[0-9]+\.b,} } } */

> +/* { dg-final { scan-assembler {\tld1h\tz[0-9]+\.h,} } } */

> +/* { dg-final { scan-assembler {\tld1w\tz[0-9]+\.s,} } } */

> +/* { dg-final { scan-assembler {\tld1d\tz[0-9]+\.d,} } } */

> +/* { dg-final { scan-assembler-not {\tldr\tz[0-9]} } } */

> +/* { dg-final { scan-assembler-not {\tstr\tz[0-9]} } } */

> +/* { dg-final { scan-assembler-not {\tldr\tp[0-9]} } } */

> +/* { dg-final { scan-assembler-not {\tstr\tp[0-9]} } } */

> Index: gcc/testsuite/gcc.target/aarch64/sve_spill_7.c

> ===================================================================

> --- /dev/null   2017-11-14 14:28:07.424493901 +0000

> +++ gcc/testsuite/gcc.target/aarch64/sve_spill_7.c      2017-11-17 15:56:53.023759919 +0000

> @@ -0,0 +1,46 @@

> +/* { dg-do compile } */

> +/* { dg-options "-O2 -ftree-vectorize -ffast-math -march=armv8-a+sve" } */

> +

> +#include <stdint.h>

> +

> +void consumer (void *);

> +

> +#define TEST_LOOP(TYPE, VAL)                   \

> +  void                                         \

> +  multi_loop_##TYPE (TYPE *x, int n)           \

> +  {                                            \

> +    for (int k = 0; k < 4; ++k)                        \

> +      {                                                \

> +       for (int j = 0; j < n; ++j)             \

> +         {                                     \

> +           for (int i = 0; i < 100; ++i)       \

> +             x[i] += VAL + i;                  \

> +           asm volatile ("");                  \

> +         }                                     \

> +       for (int j = 0; j < n; ++j)             \

> +         consumer (x);                         \

> +       for (int j = 0; j < n; ++j)             \

> +         {                                     \

> +           for (int i = 0; i < 100; ++i)       \

> +             x[i] += VAL + i;                  \

> +           asm volatile ("");                  \

> +         }                                     \

> +       consumer (x);                           \

> +       for (int i = 0; i < 100; ++i)           \

> +         x[i] += VAL + i;                      \

> +       consumer (x);                           \

> +      }                                                \

> +  }

> +

> +TEST_LOOP (uint8_t, 3);

> +TEST_LOOP (uint16_t, 4);

> +TEST_LOOP (uint32_t, 5);

> +TEST_LOOP (uint64_t, 6);

> +TEST_LOOP (float, 2.5f);

> +TEST_LOOP (double, 3.5);

> +

> +/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\..,} 18 } } */

> +/* { dg-final { scan-assembler-not {\tldr\tz[0-9]} } } */

> +/* { dg-final { scan-assembler-not {\tstr\tz[0-9]} } } */

> +/* { dg-final { scan-assembler-not {\tldr\tp[0-9]} } } */

> +/* { dg-final { scan-assembler-not {\tstr\tp[0-9]} } } */
Jeff Law Dec. 14, 2017, 3:58 p.m. UTC | #2
On 12/14/2017 04:09 AM, Richard Biener wrote:
> On Fri, Nov 17, 2017 at 4:58 PM, Richard Sandiford

> <richard.sandiford@linaro.org> wrote:

>> This patch looks for pseudo registers that are live across a call

>> and for which no call-preserved hard registers exist.  It then

>> recomputes the pseudos as necessary to ensure that they are no

>> longer live across a call.  The comment at the head of the file

>> describes the approach.

>>

>> A new target hook selects which modes should be treated in this way.

>> By default none are, in which case the pass is skipped very early.

>>

>> It might also be worth looking for cases like:

>>

>>    C1: R1 := f (...)

>>    ...

>>    C2: R2 := f (...)

>>    C3: R1 := C2

>>

>> and giving the same value number to C1 and C3, effectively treating

>> it like:

>>

>>    C1: R1 := f (...)

>>    ...

>>    C2: R2 := f (...)

>>    C3: R1 := f (...)

>>

>> Another (much more expensive) enhancement would be to apply value

>> numbering to all pseudo registers (not just rematerialisation

>> candidates), so that we can handle things like:

>>

>>   C1: R1 := f (...R2...)

>>   ...

>>   C2: R1 := f (...R3...)

>>

>> where R2 and R3 hold the same value.  But the current pass seems

>> to catch the vast majority of cases.

>>

>> Tested on aarch64-linux-gnu (with and without SVE), x86_64-linux-gnu

>> and powerpc64le-linux-gnu.  OK to install?

> 

> Can you tell anything about the complexity of the algorithm?  How

> does it relate to what LRA can do?  AFAIK LRA doesn't try to find

> any global optimal solution and previous hardreg assignments

> may work against it?

> 

> That said - I would have expected remat to be done before the

> first scheduling pass?  Even before pass_sms (not sure

> what pass_live_range_shrinkage does).  Or be integrated

> with scheduling and it's register pressure cost model.

> 

> Also I would have expected the approach to apply to all modes,

> just the cost of spilling is different.  But if you can, say, reduce

> register pressure by one by rematerializing a bit-not then that

> should be always profitable, no?  postreload-cse will come to

> the rescue anyhow.

> 

> Disclaimer: didn't look at the pass implementation, RTL isn't my

> primary expertise.

I'd just started looking at this yesterday.

At a high level, yes, we could consider doing this on other modes.  It
certainly would have helped Sparc in the past (I have no idea if it
still would).  The basic problem was that all FP registers were call
clobbered, so our FP performance really sucked.

We eventually fixed by dramatically improving the basic caller-saves
costing model, avoid unnecessary saves/restores and save/restore in
wider modes (it tended to use word_mode chunks, even for things like 64
bit FP values).

Something like an aggressive remat pass would likely have helped as well.

Things have changed quite a lot since those days (1990-ish), thankfully.

There's other targets that have register classes which are call
clobbered and very painful to spill -- but they tend not to be heavily
used classes.  I'm thinking of things like loop counter registers, shift
amount registers and the like.

Anyway, hoping to dig further into the remat bits today.  And there's
the fully-predicated loop bits that are still unreviewed as well -- want
to take those Richi?


Jeff
Richard Sandiford Dec. 14, 2017, 7:26 p.m. UTC | #3
Jeff Law <law@redhat.com> writes:
> On 12/14/2017 04:09 AM, Richard Biener wrote:

>> On Fri, Nov 17, 2017 at 4:58 PM, Richard Sandiford

>> <richard.sandiford@linaro.org> wrote:

>>> This patch looks for pseudo registers that are live across a call

>>> and for which no call-preserved hard registers exist.  It then

>>> recomputes the pseudos as necessary to ensure that they are no

>>> longer live across a call.  The comment at the head of the file

>>> describes the approach.

>>>

>>> A new target hook selects which modes should be treated in this way.

>>> By default none are, in which case the pass is skipped very early.

>>>

>>> It might also be worth looking for cases like:

>>>

>>>    C1: R1 := f (...)

>>>    ...

>>>    C2: R2 := f (...)

>>>    C3: R1 := C2

>>>

>>> and giving the same value number to C1 and C3, effectively treating

>>> it like:

>>>

>>>    C1: R1 := f (...)

>>>    ...

>>>    C2: R2 := f (...)

>>>    C3: R1 := f (...)

>>>

>>> Another (much more expensive) enhancement would be to apply value

>>> numbering to all pseudo registers (not just rematerialisation

>>> candidates), so that we can handle things like:

>>>

>>>   C1: R1 := f (...R2...)

>>>   ...

>>>   C2: R1 := f (...R3...)

>>>

>>> where R2 and R3 hold the same value.  But the current pass seems

>>> to catch the vast majority of cases.

>>>

>>> Tested on aarch64-linux-gnu (with and without SVE), x86_64-linux-gnu

>>> and powerpc64le-linux-gnu.  OK to install?

>> 

>> Can you tell anything about the complexity of the algorithm?


Have to get back to you on that one. :-)

>> How does it relate to what LRA can do?  AFAIK LRA doesn't try to find

>> any global optimal solution and previous hardreg assignments may work

>> against it?


Yeah, both of those are problems.  But the more important problem is
that it can't increase the live ranges of input registers as easily.
Doing it before RA means that IRA gets to see the new ranges.

>> That said - I would have expected remat to be done before the

>> first scheduling pass?  Even before pass_sms (not sure

>> what pass_live_range_shrinkage does).  Or be integrated

>> with scheduling and it's register pressure cost model.


SMS shouldn't be a problem.  Early remat wouldn't introduce new
instructions into a loop unless the loop also had a call, which would
prevent SMS.  And although it's theoretically possible that it could
remove instructions from a loop, that would only happen if:

  (a) the instruction actually computes the same value every time, so
      could have been moved outside the loop; and

  (b) the result is only used after a following call (and in particular
      isn't used within the loop itself)

(a) is a missed optimisation and (b) seems unlikely.

Integrating remat into scheduling would make it much less powerful,
since scheduling does only limited code motion between blocks.

Doing it before scheduling would be good in principle, but there
would then need to be a fake dependency between the call and remat
instructions to stop the scheduler moving the remat instructions
back before the call.  Adding early remat was a way of avoiding such
fake dependencies in "every" pass, but it might be that scheduling
is one case in which the dependencies make sense.

Either way, being able to run the pass before scheduling seems
like a future enhancement, blocked on a future enhancement to
the scheduler.

>> Also I would have expected the approach to apply to all modes,

>> just the cost of spilling is different.  But if you can, say, reduce

>> register pressure by one by rematerializing a bit-not then that

>> should be always profitable, no?  postreload-cse will come to

>> the rescue anyhow.


But that would then mean taking the register pressure into account when
deciding whether to rematerialise.  On the one hand would make it hard
to do before scheduling (which decides the final pre-RA pressure).
It would also make it a significantly different algorithm, since it
wouldn't be a standard availability problem any more.

For that use case, pressure-dependent remat in the scheduler might
be a better approach, like you were suggesting.  The early remat
pass is specifically for the extreme case of no registers being
call-preserved, where it's more important that we don't miss
remat opportunities, and more important that we treat it as
a global problem.

Thanks,
Richard
Richard Biener Dec. 14, 2017, 7:32 p.m. UTC | #4
On December 14, 2017 8:26:49 PM GMT+01:00, Richard Sandiford <richard.sandiford@linaro.org> wrote:
>Jeff Law <law@redhat.com> writes:

>> On 12/14/2017 04:09 AM, Richard Biener wrote:

>>> On Fri, Nov 17, 2017 at 4:58 PM, Richard Sandiford

>>> <richard.sandiford@linaro.org> wrote:

>>>> This patch looks for pseudo registers that are live across a call

>>>> and for which no call-preserved hard registers exist.  It then

>>>> recomputes the pseudos as necessary to ensure that they are no

>>>> longer live across a call.  The comment at the head of the file

>>>> describes the approach.

>>>>

>>>> A new target hook selects which modes should be treated in this

>way.

>>>> By default none are, in which case the pass is skipped very early.

>>>>

>>>> It might also be worth looking for cases like:

>>>>

>>>>    C1: R1 := f (...)

>>>>    ...

>>>>    C2: R2 := f (...)

>>>>    C3: R1 := C2

>>>>

>>>> and giving the same value number to C1 and C3, effectively treating

>>>> it like:

>>>>

>>>>    C1: R1 := f (...)

>>>>    ...

>>>>    C2: R2 := f (...)

>>>>    C3: R1 := f (...)

>>>>

>>>> Another (much more expensive) enhancement would be to apply value

>>>> numbering to all pseudo registers (not just rematerialisation

>>>> candidates), so that we can handle things like:

>>>>

>>>>   C1: R1 := f (...R2...)

>>>>   ...

>>>>   C2: R1 := f (...R3...)

>>>>

>>>> where R2 and R3 hold the same value.  But the current pass seems

>>>> to catch the vast majority of cases.

>>>>

>>>> Tested on aarch64-linux-gnu (with and without SVE),

>x86_64-linux-gnu

>>>> and powerpc64le-linux-gnu.  OK to install?

>>> 

>>> Can you tell anything about the complexity of the algorithm?

>

>Have to get back to you on that one. :-)

>

>>> How does it relate to what LRA can do?  AFAIK LRA doesn't try to

>find

>>> any global optimal solution and previous hardreg assignments may

>work

>>> against it?

>

>Yeah, both of those are problems.  But the more important problem is

>that it can't increase the live ranges of input registers as easily.

>Doing it before RA means that IRA gets to see the new ranges.

>

>>> That said - I would have expected remat to be done before the

>>> first scheduling pass?  Even before pass_sms (not sure

>>> what pass_live_range_shrinkage does).  Or be integrated

>>> with scheduling and it's register pressure cost model.

>

>SMS shouldn't be a problem.  Early remat wouldn't introduce new

>instructions into a loop unless the loop also had a call, which would

>prevent SMS.  And although it's theoretically possible that it could

>remove instructions from a loop, that would only happen if:

>

>  (a) the instruction actually computes the same value every time, so

>      could have been moved outside the loop; and

>

>  (b) the result is only used after a following call (and in particular

>      isn't used within the loop itself)

>

>(a) is a missed optimisation and (b) seems unlikely.

>

>Integrating remat into scheduling would make it much less powerful,

>since scheduling does only limited code motion between blocks.

>

>Doing it before scheduling would be good in principle, but there

>would then need to be a fake dependency between the call and remat

>instructions to stop the scheduler moving the remat instructions

>back before the call.  Adding early remat was a way of avoiding such

>fake dependencies in "every" pass, but it might be that scheduling

>is one case in which the dependencies make sense.

>

>Either way, being able to run the pass before scheduling seems

>like a future enhancement, blocked on a future enhancement to

>the scheduler.

>

>>> Also I would have expected the approach to apply to all modes,

>>> just the cost of spilling is different.  But if you can, say, reduce

>>> register pressure by one by rematerializing a bit-not then that

>>> should be always profitable, no?  postreload-cse will come to

>>> the rescue anyhow.

>

>But that would then mean taking the register pressure into account when

>deciding whether to rematerialise.  On the one hand would make it hard

>to do before scheduling (which decides the final pre-RA pressure).

>It would also make it a significantly different algorithm, since it

>wouldn't be a standard availability problem any more.

>

>For that use case, pressure-dependent remat in the scheduler might

>be a better approach, like you were suggesting.  The early remat

>pass is specifically for the extreme case of no registers being

>call-preserved, where it's more important that we don't miss

>remat opportunities, and more important that we treat it as

>a global problem.


On x86_64 all xmm registers are caller saved for example. That means all FP regs and all vectors. (yeah, stupid ABI decision....)

Richard. 

>Thanks,

>Richard
Richard Sandiford Dec. 14, 2017, 11:22 p.m. UTC | #5
Richard Biener <richard.guenther@gmail.com> writes:
> On December 14, 2017 8:26:49 PM GMT+01:00, Richard Sandiford

> <richard.sandiford@linaro.org> wrote:

>>Jeff Law <law@redhat.com> writes:

>>> On 12/14/2017 04:09 AM, Richard Biener wrote:

>>>> On Fri, Nov 17, 2017 at 4:58 PM, Richard Sandiford

>>>> <richard.sandiford@linaro.org> wrote:

>>>>> This patch looks for pseudo registers that are live across a call

>>>>> and for which no call-preserved hard registers exist.  It then

>>>>> recomputes the pseudos as necessary to ensure that they are no

>>>>> longer live across a call.  The comment at the head of the file

>>>>> describes the approach.

>>>>>

>>>>> A new target hook selects which modes should be treated in this

>>way.

>>>>> By default none are, in which case the pass is skipped very early.

>>>>>

>>>>> It might also be worth looking for cases like:

>>>>>

>>>>>    C1: R1 := f (...)

>>>>>    ...

>>>>>    C2: R2 := f (...)

>>>>>    C3: R1 := C2

>>>>>

>>>>> and giving the same value number to C1 and C3, effectively treating

>>>>> it like:

>>>>>

>>>>>    C1: R1 := f (...)

>>>>>    ...

>>>>>    C2: R2 := f (...)

>>>>>    C3: R1 := f (...)

>>>>>

>>>>> Another (much more expensive) enhancement would be to apply value

>>>>> numbering to all pseudo registers (not just rematerialisation

>>>>> candidates), so that we can handle things like:

>>>>>

>>>>>   C1: R1 := f (...R2...)

>>>>>   ...

>>>>>   C2: R1 := f (...R3...)

>>>>>

>>>>> where R2 and R3 hold the same value.  But the current pass seems

>>>>> to catch the vast majority of cases.

>>>>>

>>>>> Tested on aarch64-linux-gnu (with and without SVE),

>>x86_64-linux-gnu

>>>>> and powerpc64le-linux-gnu.  OK to install?

>>>> 

>>>> Can you tell anything about the complexity of the algorithm?

>>

>>Have to get back to you on that one. :-)

>>

>>>> How does it relate to what LRA can do?  AFAIK LRA doesn't try to

>>find

>>>> any global optimal solution and previous hardreg assignments may

>>work

>>>> against it?

>>

>>Yeah, both of those are problems.  But the more important problem is

>>that it can't increase the live ranges of input registers as easily.

>>Doing it before RA means that IRA gets to see the new ranges.

>>

>>>> That said - I would have expected remat to be done before the

>>>> first scheduling pass?  Even before pass_sms (not sure

>>>> what pass_live_range_shrinkage does).  Or be integrated

>>>> with scheduling and it's register pressure cost model.

>>

>>SMS shouldn't be a problem.  Early remat wouldn't introduce new

>>instructions into a loop unless the loop also had a call, which would

>>prevent SMS.  And although it's theoretically possible that it could

>>remove instructions from a loop, that would only happen if:

>>

>>  (a) the instruction actually computes the same value every time, so

>>      could have been moved outside the loop; and

>>

>>  (b) the result is only used after a following call (and in particular

>>      isn't used within the loop itself)

>>

>>(a) is a missed optimisation and (b) seems unlikely.

>>

>>Integrating remat into scheduling would make it much less powerful,

>>since scheduling does only limited code motion between blocks.

>>

>>Doing it before scheduling would be good in principle, but there

>>would then need to be a fake dependency between the call and remat

>>instructions to stop the scheduler moving the remat instructions

>>back before the call.  Adding early remat was a way of avoiding such

>>fake dependencies in "every" pass, but it might be that scheduling

>>is one case in which the dependencies make sense.

>>

>>Either way, being able to run the pass before scheduling seems

>>like a future enhancement, blocked on a future enhancement to

>>the scheduler.

>>

>>>> Also I would have expected the approach to apply to all modes,

>>>> just the cost of spilling is different.  But if you can, say, reduce

>>>> register pressure by one by rematerializing a bit-not then that

>>>> should be always profitable, no?  postreload-cse will come to

>>>> the rescue anyhow.

>>

>>But that would then mean taking the register pressure into account when

>>deciding whether to rematerialise.  On the one hand would make it hard

>>to do before scheduling (which decides the final pre-RA pressure).

>>It would also make it a significantly different algorithm, since it

>>wouldn't be a standard availability problem any more.

>>

>>For that use case, pressure-dependent remat in the scheduler might

>>be a better approach, like you were suggesting.  The early remat

>>pass is specifically for the extreme case of no registers being

>>call-preserved, where it's more important that we don't miss

>>remat opportunities, and more important that we treat it as

>>a global problem.

>

> On x86_64 all xmm registers are caller saved for example. That means all

> FP regs and all vectors. (yeah, stupid ABI decision....)


OK.  The patch uses a target hook to select the modes -- basing it off
whether they're variable-length is just the default.  So if this turns
out to be a win for x86_64 or for SPARC (was originally going to reply
to that in Jeff's message, sorry), then the target could opt in if it
wants to.

Thanks,
Richard
Jeff Law Dec. 15, 2017, 7:10 p.m. UTC | #6
On 12/14/2017 12:32 PM, Richard Biener wrote:
> 

> On x86_64 all xmm registers are caller saved for example. That means all FP regs and all vectors. (yeah, stupid ABI decision....)

But that's precisely what I would expect if one was looking to maintain
backwards compatibility within the core runtime libraries.

If you make something callee-saved, then you have to have space for it
in the setjmp buffer.  Expanding that buffer is an ABI change and thus
*highly* discouraged.

jeff
Richard Biener Dec. 15, 2017, 7:14 p.m. UTC | #7
On December 15, 2017 8:10:33 PM GMT+01:00, Jeff Law <law@redhat.com> wrote:
>On 12/14/2017 12:32 PM, Richard Biener wrote:

>> 

>> On x86_64 all xmm registers are caller saved for example. That means

>all FP regs and all vectors. (yeah, stupid ABI decision....)

>But that's precisely what I would expect if one was looking to maintain

>backwards compatibility within the core runtime libraries.

>

>If you make something callee-saved, then you have to have space for it

>in the setjmp buffer.  Expanding that buffer is an ABI change and thus

>*highly* discouraged.


Yes. But the initial 64bit ABI already had 8 xmm regs. Probably all used for parameter passing as well, but... 

For all further extensions I agree. 

Richard. 

>jeff
Jeff Law Dec. 15, 2017, 7:53 p.m. UTC | #8
On 12/14/2017 12:26 PM, Richard Sandiford wrote:

>>> How does it relate to what LRA can do?  AFAIK LRA doesn't try to find

>>> any global optimal solution and previous hardreg assignments may work

>>> against it?

> 

> Yeah, both of those are problems.  But the more important problem is

> that it can't increase the live ranges of input registers as easily.

> Doing it before RA means that IRA gets to see the new ranges.

LRA does not work on a global basis.  It's somewhere between basic block
and extended basic block in its scope.

Remat (along with caller-saves) is really just a case of range splitting
in my mind.  So you really want the pass to either directly integrate
with IRA or run prior to IRA.

You can do splitting in response to failure to get a hard register and
try to hook back into IRA to color those new objects.  I had reasonably
good success with that approach when I was looking at the allocators
prior to LRA.

Basically I let IRA do its thing.  WHen it was done I walked through the
IL splitting ranges to make hard registers available at key points.
Then I'd call back into IRA (using existing mechanisms) to try
allocation again for the allocnos that had not been colored and any new
ones.  The key was there was some very simple and easy range splitting
you could do on already allocated allocnos that in turn would free up
hard registers.

That kind of model doesn't seem to fit here terribly well.  It's not a
lack of hard regs that's the problem, but simply not having any hard
regs available across calls.  So splitting the range of some allocno
that did get a hard register isn't going to help color any of the
allocanos that did not get a register.


If we go back further (circa 1998) we did a pre-allocation range
splitting pass.  We had it working marginally OK, but never really as
well as we wanted.

In that model we looked at pseudos that were likely going to be hard to
allocate and split them into multiple new pseudos.  We tracked the
relationship between the new and original pseudo so that reload could
shove them back together in cases where that made the most sense.  We
had copyin/copyout insns to move back and forth between the range copies
and the original pseudo as needed.

I don't remember the heuristics that drove when/where to split.
Meissner might since my recollection is that he did the major lifting
there.

But again, I don't think that model works here either.  It did nothing
WRT remat.

I know we pondered remat in the context of revamping caller-saves in the
early 90s to help Sparc FP.  But my recollection was that once we had
caller-saves handling the basics well, the performance gains were enough
that digging into remat was never really explored.

Anyway, that's a bit of history.  IMHO remat has to run prior to
allocation or integrated with allocation.   In general I'd expect
running before and independent of IRA to be easier to implement, but
slightly less performant than tightly integrated with IRA.

In addition to potentially avoiding spilling, we have an added benefit
for SVE that we avoid variable sized stack frames if we can eliminate
*all* instances of SVE regsiters live across calls.

I'm guessing that they're relatively rare to begin with based on
comments within the actual code.


> 

>>> That said - I would have expected remat to be done before the

>>> first scheduling pass?  Even before pass_sms (not sure

>>> what pass_live_range_shrinkage does).  Or be integrated

>>> with scheduling and it's register pressure cost model.

> 

> SMS shouldn't be a problem.  Early remat wouldn't introduce new

> instructions into a loop unless the loop also had a call, which would

> prevent SMS.  And although it's theoretically possible that it could

> remove instructions from a loop, that would only happen if:

> 

>   (a) the instruction actually computes the same value every time, so

>       could have been moved outside the loop; and

> 

>   (b) the result is only used after a following call (and in particular

>       isn't used within the loop itself)

> 

> (a) is a missed optimisation and (b) seems unlikely.

> 

> Integrating remat into scheduling would make it much less powerful,

> since scheduling does only limited code motion between blocks.

> 

> Doing it before scheduling would be good in principle, but there

> would then need to be a fake dependency between the call and remat

> instructions to stop the scheduler moving the remat instructions

> back before the call.  Adding early remat was a way of avoiding such

> fake dependencies in "every" pass, but it might be that scheduling

> is one case in which the dependencies make sense.

> 

> Either way, being able to run the pass before scheduling seems

> like a future enhancement, blocked on a future enhancement to

> the scheduler.

I'd expect it to be post-scheduling simply because otherwise you have to
ensure scheduling doesn't muck it back up.  And there's been talk t

> 

>>> Also I would have expected the approach to apply to all modes,

>>> just the cost of spilling is different.  But if you can, say, reduce

>>> register pressure by one by rematerializing a bit-not then that

>>> should be always profitable, no?  postreload-cse will come to

>>> the rescue anyhow.

> 

> But that would then mean taking the register pressure into account when

> deciding whether to rematerialise.  On the one hand would make it hard

> to do before scheduling (which decides the final pre-RA pressure).

> It would also make it a significantly different algorithm, since it

> wouldn't be a standard availability problem any more.

RIght.  In general you don't want to remat or range split speculatively
-- otherwise you end up doing a lot of work later to pull things back
together.  Been there, done that.  Estimation of pressure and costing
models are very important here as the scope of things under
consideration is great and the possibility of makings worse is very real.



> 

> For that use case, pressure-dependent remat in the scheduler might

> be a better approach, like you were suggesting.  The early remat

> pass is specifically for the extreme case of no registers being

> call-preserved, where it's more important that we don't miss

> remat opportunities, and more important that we treat it as

> a global problem.

This case feels different, largely becuase we're talking about a fairly
uncommon case and the cost if we hit that case is very high.  ie, we've
got a larger amount of wiggle room if we don't get the costing model
terribly accurate.

jeff
Jakub Jelinek Dec. 15, 2017, 7:53 p.m. UTC | #9
On Fri, Dec 15, 2017 at 08:14:38PM +0100, Richard Biener wrote:
> On December 15, 2017 8:10:33 PM GMT+01:00, Jeff Law <law@redhat.com> wrote:

> >On 12/14/2017 12:32 PM, Richard Biener wrote:

> >> 

> >> On x86_64 all xmm registers are caller saved for example. That means

> >all FP regs and all vectors. (yeah, stupid ABI decision....)

> >But that's precisely what I would expect if one was looking to maintain

> >backwards compatibility within the core runtime libraries.

> >

> >If you make something callee-saved, then you have to have space for it

> >in the setjmp buffer.  Expanding that buffer is an ABI change and thus

> >*highly* discouraged.

> 

> Yes. But the initial 64bit ABI already had 8 xmm regs. Probably all used for parameter passing as well, but... 

> 

> For all further extensions I agree. 


Another issue is that the ?mm registers keep growing in size, and having
e.g. low 128-bits of the registers call-saved and upper bits call-used is
just weird (I believe that is the MS ABI).

	Jakub
Jeff Law Dec. 15, 2017, 9:50 p.m. UTC | #10
On 11/17/2017 08:58 AM, Richard Sandiford wrote:
> This patch looks for pseudo registers that are live across a call

> and for which no call-preserved hard registers exist.  It then

> recomputes the pseudos as necessary to ensure that they are no

> longer live across a call.  The comment at the head of the file

> describes the approach.

> 

> A new target hook selects which modes should be treated in this way.

> By default none are, in which case the pass is skipped very early.

> 

> It might also be worth looking for cases like:

> 

>    C1: R1 := f (...)

>    ...

>    C2: R2 := f (...)

>    C3: R1 := C2

> 

> and giving the same value number to C1 and C3, effectively treating

> it like:

> 

>    C1: R1 := f (...)

>    ...

>    C2: R2 := f (...)

>    C3: R1 := f (...)

> 

> Another (much more expensive) enhancement would be to apply value

> numbering to all pseudo registers (not just rematerialisation

> candidates), so that we can handle things like:

> 

>   C1: R1 := f (...R2...)

>   ...

>   C2: R1 := f (...R3...)

> 

> where R2 and R3 hold the same value.  But the current pass seems

> to catch the vast majority of cases.

> 

> Tested on aarch64-linux-gnu (with and without SVE), x86_64-linux-gnu

> and powerpc64le-linux-gnu.  OK to install?

> 

> Richard

> 

> 

> 2017-11-17  Richard Sandiford  <richard.sandiford@linaro.org>

> 

> gcc/

> 	* Makefile.in (OBJS): Add early-remat.o.

> 	* target.def (select_early_remat_modes): New hook.

> 	* doc/tm.texi.in (TARGET_SELECT_EARLY_REMAT_MODES): New hook.

> 	* doc/tm.texi: Regenerate.

> 	* targhooks.h (default_select_early_remat_modes): Declare.

> 	* targhooks.c (default_select_early_remat_modes): New function.

> 	* timevar.def (TV_EARLY_REMAT): New timevar.

> 	* passes.def (pass_early_remat): New pass.

> 	* tree-pass.h (make_pass_early_remat): Declare.

> 	* early-remat.c: New file.

> 	* config/aarch64/aarch64.c (aarch64_select_early_remat_modes): New

> 	function.

> 	(TARGET_SELECT_EARLY_REMAT_MODES): Define.

> 

> gcc/testsuite/

> 	* gcc.target/aarch64/sve_spill_1.c: Also test that no predicates

> 	are spilled.

> 	* gcc.target/aarch64/sve_spill_2.c: New test.

> 	* gcc.target/aarch64/sve_spill_3.c: Likewise.

> 	* gcc.target/aarch64/sve_spill_4.c: Likewise.

> 	* gcc.target/aarch64/sve_spill_5.c: Likewise.

> 	* gcc.target/aarch64/sve_spill_6.c: Likewise.

> 	* gcc.target/aarch64/sve_spill_7.c: Likewise.

So it's not the immediate goal here, but it'd be nice if we could extend
this to cover all the other targets that have no registers of certain
classes that can be allocated across a call.

Mostly this is exactly what I'd expect -- not to diminish the work,
there's a ton of stuff here.  But the overall flow of work and
organization is basically what I'd expect.

I know Richi was a bit scared off by the RTL nature, but the vast
majority of the code here isn't related to RTL -- it's operating on
properties that got extracted from the RTL.  I could almost take this
change bits at the start of the pass and the end and run it on gimple
:-)  Which brings up an interesting question, is that a worthwhile thing
to ponder?  Is it gimple optimizers that are causing objects to be live
across calls or is it usually teh RTL bits?

As I was reading one of the thoughts I had was that this reminded me a
bit of LCM.  WHich makes sense -- it's code motion after all.  I was
pleasantly surprised to see the comment explaining why the LCM
infrastructure was not used.

Do you have to do anything about rounding modes?  Or are those not an
issue for aarch64?  It's something I'm pretty sure we'd need to handle
for x86 as we have to ensure the rounding mode is stable if we move/copy
FP computations from one context to another.


> Index: gcc/early-remat.c

> ===================================================================

> --- /dev/null	2017-11-14 14:28:07.424493901 +0000

> +++ gcc/early-remat.c	2017-11-17 15:56:53.021759920 +0000

> @@ -0,0 +1,2610 @@

> +   Candidate validation and value numbering

> +   ========================================

> +

> +   Next we simultaneously decide which candidates are valid and look

> +   for candidates that are equivalent to each other, assigning numbers

> +   to each unique candidate value.  A candidate C is invalid if:

> +

> +     (a) C uses an invalid candidate;

> +

> +     (b) there is a cycle of candidate uses involving C; or

> +

> +     (c) C takes a candidate register R as input and the reaching

> +         definitions of R do not have the same value number.

> +

> +   We assign a "representative" candidate C to each value number and from

> +   here on replace references to other candidates with that value number

> +   with references to C.  It is then only possible to rematerialize a

> +   register R at point P if (after this replacement) there is a single

> +   reaching definition of R at P.

So how often are you seeing candidates that are equal to each other?  I
know Vlad did some experiments in the register allocator and the number
of equivalences seen by VN was relatively small.

Maybe it's the case that there's something fundamentally different with
what you're doing allowing you to see more equivalences.

Of course if you're using VNs as part of the validation process, then I
guess you get equivalences for free, so you might as well go ahead and
use them.


Local/Global Phases seem reasonable at a high level.  About what you'd
expect.


> +/* Return true if REGNO is worth rematerializing.  */

> +

> +bool

> +early_remat::interesting_regno_p (unsigned int regno)

> +{

> +  /* Ignore unused registers.  */

> +  rtx reg = regno_reg_rtx[regno];

> +  if (!reg || DF_REG_DEF_COUNT (regno) == 0)

> +    return false;

> +

> +  /* Make sure the register has a mode that we want to rematerialize.  */

> +  if (!bitmap_bit_p (m_selected_modes, GET_MODE (reg)))

> +    return false;

> +

> +  /* Ignore values that might sometimes be used uninitialized.  We could

> +     instead add dummy candidates for the entry block definition, and so

> +     handle uses that are definitely not uninitialized, but the combination

> +     of the two should be rare in practice.  */

> +  if (bitmap_bit_p (DF_LR_OUT (ENTRY_BLOCK_PTR_FOR_FN (m_fn)), regno))

> +    return false;

> +

> +  return true;

> +}

So triggering on modes isn't really ideal, but I can understand why
you're doing it that way.  I think it's OK for now, but there may be a
day when we want to do something different.

That could one day be dealing with an architecture with a painful to
save register that is accessed in the same mode as other pseudos or
dealing with a remat pass to reduce pressure.

> +

> +/* Record the set of register REGNO in instruction INSN as a

> +   rematerialization candidate.  CAN_COPY_P is true unless we already

> +   know that rematerialization is impossible (in which case the candidate

> +   only exists for the reaching definition calculation).

> +

> +   The candidate's index is not fixed at this stage.  */

> +

> +remat_candidate *

> +early_remat::add_candidate (rtx_insn *insn, unsigned int regno,

> +			    bool can_copy_p)

> +{

> +  remat_candidate cand;

> +  memset (&cand, 0, sizeof (cand));

> +  cand.regno = regno;

> +  cand.insn = insn;

> +  cand.remat_rtx = PATTERN (insn);

> +  cand.can_copy_p = can_copy_p;

FWIW, if the fields you're assigning are at the start or the end of the
object, DSE will trim the memset to avoid unnecessary write traffic.
No need to try and rearrange the object to enable that optimization.
Just making you aware in case such things are of interest to you.


> +      else if (HARD_REGISTER_NUM_P (use_regno))

> +	{

> +	  /* Allocate a dummy pseudo register and temporarily install it.

> +	     Make the register number depend on the mode, which should

> +	     provide enough sharing for match_dup while also weeding

> +	     out cases in which operands with different modes are

> +	     explicitly tied.  */

> +	  rtx *loc = DF_REF_REAL_LOC (ref);

> +	  unsigned int size = RTX_CODE_SIZE (REG);

> +	  rtx new_reg = (rtx) alloca (size);

> +	  memset (new_reg, 0, size);

> +	  PUT_CODE (new_reg, REG);

> +	  set_mode_and_regno (new_reg, GET_MODE (*loc),

> +			      LAST_VIRTUAL_REGISTER + 1 + GET_MODE (*loc));

> +	  validate_change (insn, loc, new_reg, 1);

> +	}

> +    }

Ewww.  But I guess it works OK and you undo it immediately.  There was
another similar instance later.  They're not ideal, but they're well
localized and I don't think worth making a big deal out of.





> +

> +/* Assuming that every path reaching a point P contains a copy of a

> +   use U of REGNO, return true if another copy of U at P would have

> +   access to the same value of REGNO.  */

> +

> +bool

> +early_remat::stable_use_p (unsigned int regno)

> +{

> +  /* Conservatively assume not for hard registers.  */

> +  if (HARD_REGISTER_NUM_P (regno))

> +    return false;

> +

> +  /* See if REGNO has a single definition and is never used uninitialized.

> +     In this case the definition of REGNO dominates the common dominator

> +     of the uses U, which in turn dominates P.  */

> +  if (DF_REG_DEF_COUNT (regno) == 1

> +      && !bitmap_bit_p (DF_LR_OUT (ENTRY_BLOCK_PTR_FOR_FN (m_fn)), regno))

> +    return true;

> +

> +  return false;

> +}

What about loops here?  The def may dominate the use but the value might
be changing each iteration of the loop.  Does that affect correctness of
the algorithm you're using?

> +

> +/* Emit a copy from DEST to SRC before candidate CAND_INDEX's instruction.  */

> +

> +void

> +early_remat::emit_copy_before (unsigned int cand_index, rtx dest, rtx src)

> +{

> +  remat_candidate *cand = &m_candidates[cand_index];

> +  if (dump_file)

> +    {

> +      fprintf (dump_file, ";; Stabilizing insn ");

> +      dump_insn_id (cand->insn);

> +      fprintf (dump_file, " by copying source reg %d:%s to temporary reg %d\n",

> +	       REGNO (src), GET_MODE_NAME (GET_MODE (src)), REGNO (dest));

> +    }

> +  emit_insn_before (gen_move_insn (dest, src), cand->insn);

Might want to clarify that SRC and DEST are regs.  I was about to ask
about it, but then saw the mention in the dumping code.


Again, generally this is largely what I'd expect.  Just a few questions
above to clarify a couple minor concerns, but I think this is fine.

Jeff
Richard Sandiford Dec. 15, 2017, 11:28 p.m. UTC | #11
Jeff Law <law@redhat.com> writes:
> On 12/14/2017 12:26 PM, Richard Sandiford wrote:

>>>> How does it relate to what LRA can do?  AFAIK LRA doesn't try to find

>>>> any global optimal solution and previous hardreg assignments may work

>>>> against it?

>> 

>> Yeah, both of those are problems.  But the more important problem is

>> that it can't increase the live ranges of input registers as easily.

>> Doing it before RA means that IRA gets to see the new ranges.

> LRA does not work on a global basis.  It's somewhere between basic block

> and extended basic block in its scope.

>

> Remat (along with caller-saves) is really just a case of range splitting

> in my mind.  So you really want the pass to either directly integrate

> with IRA or run prior to IRA.

>

> You can do splitting in response to failure to get a hard register and

> try to hook back into IRA to color those new objects.  I had reasonably

> good success with that approach when I was looking at the allocators

> prior to LRA.

>

> Basically I let IRA do its thing.  WHen it was done I walked through the

> IL splitting ranges to make hard registers available at key points.

> Then I'd call back into IRA (using existing mechanisms) to try

> allocation again for the allocnos that had not been colored and any new

> ones.  The key was there was some very simple and easy range splitting

> you could do on already allocated allocnos that in turn would free up

> hard registers.

>

> That kind of model doesn't seem to fit here terribly well.  It's not a

> lack of hard regs that's the problem, but simply not having any hard

> regs available across calls.  So splitting the range of some allocno

> that did get a hard register isn't going to help color any of the

> allocanos that did not get a register.

>

>

> If we go back further (circa 1998) we did a pre-allocation range

> splitting pass.  We had it working marginally OK, but never really as

> well as we wanted.

>

> In that model we looked at pseudos that were likely going to be hard to

> allocate and split them into multiple new pseudos.  We tracked the

> relationship between the new and original pseudo so that reload could

> shove them back together in cases where that made the most sense.  We

> had copyin/copyout insns to move back and forth between the range copies

> and the original pseudo as needed.

>

> I don't remember the heuristics that drove when/where to split.

> Meissner might since my recollection is that he did the major lifting

> there.

>

> But again, I don't think that model works here either.  It did nothing

> WRT remat.

>

> I know we pondered remat in the context of revamping caller-saves in the

> early 90s to help Sparc FP.  But my recollection was that once we had

> caller-saves handling the basics well, the performance gains were enough

> that digging into remat was never really explored.

>

> Anyway, that's a bit of history.  IMHO remat has to run prior to

> allocation or integrated with allocation.   In general I'd expect

> running before and independent of IRA to be easier to implement, but

> slightly less performant than tightly integrated with IRA.

>

> In addition to potentially avoiding spilling, we have an added benefit

> for SVE that we avoid variable sized stack frames if we can eliminate

> *all* instances of SVE regsiters live across calls.

>

> I'm guessing that they're relatively rare to begin with based on

> comments within the actual code.


Yeah, spills due to excess register pressure in vectorised code are
relatively rare, and like you say, it's good to avoid variable-sized
frames if we can.  A lot of the remaining cases are due to duplicated
invariants being spilled, where it would be better to spill the duplicated
scalar instead.  That's future work.

The problem of SVE values becoming artificially live across calls is
actually relatively common though.  Adding early remat was important
in quite a few benchmarks.

Thanks,
Richard
Richard Sandiford Dec. 16, 2017, 12:31 a.m. UTC | #12
Jeff Law <law@redhat.com> writes:
> On 11/17/2017 08:58 AM, Richard Sandiford wrote:

>> This patch looks for pseudo registers that are live across a call

>> and for which no call-preserved hard registers exist.  It then

>> recomputes the pseudos as necessary to ensure that they are no

>> longer live across a call.  The comment at the head of the file

>> describes the approach.

>> 

>> A new target hook selects which modes should be treated in this way.

>> By default none are, in which case the pass is skipped very early.

>> 

>> It might also be worth looking for cases like:

>> 

>>    C1: R1 := f (...)

>>    ...

>>    C2: R2 := f (...)

>>    C3: R1 := C2

>> 

>> and giving the same value number to C1 and C3, effectively treating

>> it like:

>> 

>>    C1: R1 := f (...)

>>    ...

>>    C2: R2 := f (...)

>>    C3: R1 := f (...)

>> 

>> Another (much more expensive) enhancement would be to apply value

>> numbering to all pseudo registers (not just rematerialisation

>> candidates), so that we can handle things like:

>> 

>>   C1: R1 := f (...R2...)

>>   ...

>>   C2: R1 := f (...R3...)

>> 

>> where R2 and R3 hold the same value.  But the current pass seems

>> to catch the vast majority of cases.

>> 

>> Tested on aarch64-linux-gnu (with and without SVE), x86_64-linux-gnu

>> and powerpc64le-linux-gnu.  OK to install?

>> 

>> Richard

>> 

>> 

>> 2017-11-17  Richard Sandiford  <richard.sandiford@linaro.org>

>> 

>> gcc/

>> 	* Makefile.in (OBJS): Add early-remat.o.

>> 	* target.def (select_early_remat_modes): New hook.

>> 	* doc/tm.texi.in (TARGET_SELECT_EARLY_REMAT_MODES): New hook.

>> 	* doc/tm.texi: Regenerate.

>> 	* targhooks.h (default_select_early_remat_modes): Declare.

>> 	* targhooks.c (default_select_early_remat_modes): New function.

>> 	* timevar.def (TV_EARLY_REMAT): New timevar.

>> 	* passes.def (pass_early_remat): New pass.

>> 	* tree-pass.h (make_pass_early_remat): Declare.

>> 	* early-remat.c: New file.

>> 	* config/aarch64/aarch64.c (aarch64_select_early_remat_modes): New

>> 	function.

>> 	(TARGET_SELECT_EARLY_REMAT_MODES): Define.

>> 

>> gcc/testsuite/

>> 	* gcc.target/aarch64/sve_spill_1.c: Also test that no predicates

>> 	are spilled.

>> 	* gcc.target/aarch64/sve_spill_2.c: New test.

>> 	* gcc.target/aarch64/sve_spill_3.c: Likewise.

>> 	* gcc.target/aarch64/sve_spill_4.c: Likewise.

>> 	* gcc.target/aarch64/sve_spill_5.c: Likewise.

>> 	* gcc.target/aarch64/sve_spill_6.c: Likewise.

>> 	* gcc.target/aarch64/sve_spill_7.c: Likewise.

> So it's not the immediate goal here, but it'd be nice if we could extend

> this to cover all the other targets that have no registers of certain

> classes that can be allocated across a call.

>

> Mostly this is exactly what I'd expect -- not to diminish the work,

> there's a ton of stuff here.  But the overall flow of work and

> organization is basically what I'd expect.


That's good.  I wasn't setting out to do anything particularly
innovative here, so the fact that it's not surprising is a positive
sign. :-)

> I know Richi was a bit scared off by the RTL nature, but the vast

> majority of the code here isn't related to RTL -- it's operating on

> properties that got extracted from the RTL.  I could almost take this

> change bits at the start of the pass and the end and run it on gimple

> :-)  Which brings up an interesting question, is that a worthwhile thing

> to ponder?  Is it gimple optimizers that are causing objects to be live

> across calls or is it usually teh RTL bits?


It tends to be a mixture of both.  E.g. if we have two loops with
the same bounds and a call inbetween, the gimple optimisers will reuse
the WHILE_ULT in the first loop header for the second loop.  But there
are also cases introduced in rtl.  E.g. we don't (and IMO shouldn't)
expose to gimple that some SVE operations always require a predicate.
The expanders instead generate all-true predicates where necessary.
These are then an obvious candidate for PRE.

> As I was reading one of the thoughts I had was that this reminded me a

> bit of LCM.  WHich makes sense -- it's code motion after all.  I was

> pleasantly surprised to see the comment explaining why the LCM

> infrastructure was not used.


Yeah, the availablity problem is basically the same.  The main
difference is choosing the final placement based on block frequencies.

> Do you have to do anything about rounding modes?  Or are those not an

> issue for aarch64?  It's something I'm pretty sure we'd need to handle

> for x86 as we have to ensure the rounding mode is stable if we move/copy

> FP computations from one context to another.


Rounding modes don't affect loads, stores or moves, which are just
bit copies.  I suppose there is the problem that we could rematerialise
an FP operation in a region with a different rounding mode, but we just
don't have an IL that tracks that properly (in general, not just here).

>> Index: gcc/early-remat.c

>> ===================================================================

>> --- /dev/null	2017-11-14 14:28:07.424493901 +0000

>> +++ gcc/early-remat.c	2017-11-17 15:56:53.021759920 +0000

>> @@ -0,0 +1,2610 @@

>> +   Candidate validation and value numbering

>> +   ========================================

>> +

>> +   Next we simultaneously decide which candidates are valid and look

>> +   for candidates that are equivalent to each other, assigning numbers

>> +   to each unique candidate value.  A candidate C is invalid if:

>> +

>> +     (a) C uses an invalid candidate;

>> +

>> +     (b) there is a cycle of candidate uses involving C; or

>> +

>> +     (c) C takes a candidate register R as input and the reaching

>> +         definitions of R do not have the same value number.

>> +

>> +   We assign a "representative" candidate C to each value number and from

>> +   here on replace references to other candidates with that value number

>> +   with references to C.  It is then only possible to rematerialize a

>> +   register R at point P if (after this replacement) there is a single

>> +   reaching definition of R at P.

> So how often are you seeing candidates that are equal to each other?  I

> know Vlad did some experiments in the register allocator and the number

> of equivalences seen by VN was relatively small.

>

> Maybe it's the case that there's something fundamentally different with

> what you're doing allowing you to see more equivalences.

>

> Of course if you're using VNs as part of the validation process, then I

> guess you get equivalences for free, so you might as well go ahead and

> use them.


It happened in at least one important case (can't remember which, sorry).
That was enough to get me to implement it without first measuring how
common it was.  Like you say, it almost came for free as part of
validation.

> Local/Global Phases seem reasonable at a high level.  About what you'd

> expect.


Thanks.

>> +/* Return true if REGNO is worth rematerializing.  */

>> +

>> +bool

>> +early_remat::interesting_regno_p (unsigned int regno)

>> +{

>> +  /* Ignore unused registers.  */

>> +  rtx reg = regno_reg_rtx[regno];

>> +  if (!reg || DF_REG_DEF_COUNT (regno) == 0)

>> +    return false;

>> +

>> +  /* Make sure the register has a mode that we want to rematerialize.  */

>> +  if (!bitmap_bit_p (m_selected_modes, GET_MODE (reg)))

>> +    return false;

>> +

>> +  /* Ignore values that might sometimes be used uninitialized.  We could

>> +     instead add dummy candidates for the entry block definition, and so

>> +     handle uses that are definitely not uninitialized, but the combination

>> +     of the two should be rare in practice.  */

>> +  if (bitmap_bit_p (DF_LR_OUT (ENTRY_BLOCK_PTR_FOR_FN (m_fn)), regno))

>> +    return false;

>> +

>> +  return true;

>> +}

> So triggering on modes isn't really ideal, but I can understand why

> you're doing it that way.  I think it's OK for now, but there may be a

> day when we want to do something different.

>

> That could one day be dealing with an architecture with a painful to

> save register that is accessed in the same mode as other pseudos or

> dealing with a remat pass to reduce pressure.


I agree it might be useful to handle cases that aren't determined
solely by the mode.  It should be easy to plug that in when necessary.

I'm still not sure that this is the right place or algorithm for doing
remat to reduce pressure (but I'd be happy to be proven wrong :-)).

>> +/* Record the set of register REGNO in instruction INSN as a

>> +   rematerialization candidate.  CAN_COPY_P is true unless we already

>> +   know that rematerialization is impossible (in which case the candidate

>> +   only exists for the reaching definition calculation).

>> +

>> +   The candidate's index is not fixed at this stage.  */

>> +

>> +remat_candidate *

>> +early_remat::add_candidate (rtx_insn *insn, unsigned int regno,

>> +			    bool can_copy_p)

>> +{

>> +  remat_candidate cand;

>> +  memset (&cand, 0, sizeof (cand));

>> +  cand.regno = regno;

>> +  cand.insn = insn;

>> +  cand.remat_rtx = PATTERN (insn);

>> +  cand.can_copy_p = can_copy_p;

> FWIW, if the fields you're assigning are at the start or the end of the

> object, DSE will trim the memset to avoid unnecessary write traffic.

> No need to try and rearrange the object to enable that optimization.

> Just making you aware in case such things are of interest to you.


OK.  This particular code isn't performance-sensitive, but it's
definitely something that's worth bearing in mind.

>> +      else if (HARD_REGISTER_NUM_P (use_regno))

>> +	{

>> +	  /* Allocate a dummy pseudo register and temporarily install it.

>> +	     Make the register number depend on the mode, which should

>> +	     provide enough sharing for match_dup while also weeding

>> +	     out cases in which operands with different modes are

>> +	     explicitly tied.  */

>> +	  rtx *loc = DF_REF_REAL_LOC (ref);

>> +	  unsigned int size = RTX_CODE_SIZE (REG);

>> +	  rtx new_reg = (rtx) alloca (size);

>> +	  memset (new_reg, 0, size);

>> +	  PUT_CODE (new_reg, REG);

>> +	  set_mode_and_regno (new_reg, GET_MODE (*loc),

>> +			      LAST_VIRTUAL_REGISTER + 1 + GET_MODE (*loc));

>> +	  validate_change (insn, loc, new_reg, 1);

>> +	}

>> +    }

> Ewww.  But I guess it works OK and you undo it immediately.  There was

> another similar instance later.  They're not ideal, but they're well

> localized and I don't think worth making a big deal out of.


Yeah, it's not pretty, but I don't think we can do much better given
the current "try it and see" approach to insn recognition.

>> +

>> +/* Assuming that every path reaching a point P contains a copy of a

>> +   use U of REGNO, return true if another copy of U at P would have

>> +   access to the same value of REGNO.  */

>> +

>> +bool

>> +early_remat::stable_use_p (unsigned int regno)

>> +{

>> +  /* Conservatively assume not for hard registers.  */

>> +  if (HARD_REGISTER_NUM_P (regno))

>> +    return false;

>> +

>> +  /* See if REGNO has a single definition and is never used uninitialized.

>> +     In this case the definition of REGNO dominates the common dominator

>> +     of the uses U, which in turn dominates P.  */

>> +  if (DF_REG_DEF_COUNT (regno) == 1

>> +      && !bitmap_bit_p (DF_LR_OUT (ENTRY_BLOCK_PTR_FOR_FN (m_fn)), regno))

>> +    return true;

>> +

>> +  return false;

>> +}

> What about loops here?  The def may dominate the use but the value might

> be changing each iteration of the loop.  Does that affect correctness of

> the algorithm you're using?


No, that's OK.  Since the pre-existing copies of the candidate
instruction CI all use REGNO, and since every path to the rematerialisation
point P includes a copy of CI, there can't be a path from the definition
of REGNO to P that doesn't also include a copy of CI.  So the value of
REGNO at P will always be the correct one.

>> +

>> +/* Emit a copy from DEST to SRC before candidate CAND_INDEX's

>> instruction.  */

>> +

>> +void

>> +early_remat::emit_copy_before (unsigned int cand_index, rtx dest, rtx src)

>> +{

>> +  remat_candidate *cand = &m_candidates[cand_index];

>> +  if (dump_file)

>> +    {

>> +      fprintf (dump_file, ";; Stabilizing insn ");

>> +      dump_insn_id (cand->insn);

>> + fprintf (dump_file, " by copying source reg %d:%s to temporary reg

>> %d\n",

>> +	       REGNO (src), GET_MODE_NAME (GET_MODE (src)), REGNO (dest));

>> +    }

>> +  emit_insn_before (gen_move_insn (dest, src), cand->insn);

> Might want to clarify that SRC and DEST are regs.  I was about to ask

> about it, but then saw the mention in the dumping code.


OK.

> Again, generally this is largely what I'd expect.  Just a few questions

> above to clarify a couple minor concerns, but I think this is fine.


Thanks,
Richard
Richard Sandiford Jan. 9, 2018, 3:34 p.m. UTC | #13
Possible ping: wasn't sure whether this one needed more work or whether
it was OK to go in.  I've attached the patch with the improved comment
above early_remat::emit_copy_before.

Thanks,
Richard

Richard Sandiford <richard.sandiford@linaro.org> writes:
> Jeff Law <law@redhat.com> writes:

>> On 11/17/2017 08:58 AM, Richard Sandiford wrote:

>>> This patch looks for pseudo registers that are live across a call

>>> and for which no call-preserved hard registers exist.  It then

>>> recomputes the pseudos as necessary to ensure that they are no

>>> longer live across a call.  The comment at the head of the file

>>> describes the approach.

>>> 

>>> A new target hook selects which modes should be treated in this way.

>>> By default none are, in which case the pass is skipped very early.

>>> 

>>> It might also be worth looking for cases like:

>>> 

>>>    C1: R1 := f (...)

>>>    ...

>>>    C2: R2 := f (...)

>>>    C3: R1 := C2

>>> 

>>> and giving the same value number to C1 and C3, effectively treating

>>> it like:

>>> 

>>>    C1: R1 := f (...)

>>>    ...

>>>    C2: R2 := f (...)

>>>    C3: R1 := f (...)

>>> 

>>> Another (much more expensive) enhancement would be to apply value

>>> numbering to all pseudo registers (not just rematerialisation

>>> candidates), so that we can handle things like:

>>> 

>>>   C1: R1 := f (...R2...)

>>>   ...

>>>   C2: R1 := f (...R3...)

>>> 

>>> where R2 and R3 hold the same value.  But the current pass seems

>>> to catch the vast majority of cases.

>>> 

>>> Tested on aarch64-linux-gnu (with and without SVE), x86_64-linux-gnu

>>> and powerpc64le-linux-gnu.  OK to install?

>>> 

>>> Richard

>>> 

>>> 

>>> 2017-11-17  Richard Sandiford  <richard.sandiford@linaro.org>

>>> 

>>> gcc/

>>> 	* Makefile.in (OBJS): Add early-remat.o.

>>> 	* target.def (select_early_remat_modes): New hook.

>>> 	* doc/tm.texi.in (TARGET_SELECT_EARLY_REMAT_MODES): New hook.

>>> 	* doc/tm.texi: Regenerate.

>>> 	* targhooks.h (default_select_early_remat_modes): Declare.

>>> 	* targhooks.c (default_select_early_remat_modes): New function.

>>> 	* timevar.def (TV_EARLY_REMAT): New timevar.

>>> 	* passes.def (pass_early_remat): New pass.

>>> 	* tree-pass.h (make_pass_early_remat): Declare.

>>> 	* early-remat.c: New file.

>>> 	* config/aarch64/aarch64.c (aarch64_select_early_remat_modes): New

>>> 	function.

>>> 	(TARGET_SELECT_EARLY_REMAT_MODES): Define.

>>> 

>>> gcc/testsuite/

>>> 	* gcc.target/aarch64/sve_spill_1.c: Also test that no predicates

>>> 	are spilled.

>>> 	* gcc.target/aarch64/sve_spill_2.c: New test.

>>> 	* gcc.target/aarch64/sve_spill_3.c: Likewise.

>>> 	* gcc.target/aarch64/sve_spill_4.c: Likewise.

>>> 	* gcc.target/aarch64/sve_spill_5.c: Likewise.

>>> 	* gcc.target/aarch64/sve_spill_6.c: Likewise.

>>> 	* gcc.target/aarch64/sve_spill_7.c: Likewise.

>> So it's not the immediate goal here, but it'd be nice if we could extend

>> this to cover all the other targets that have no registers of certain

>> classes that can be allocated across a call.

>>

>> Mostly this is exactly what I'd expect -- not to diminish the work,

>> there's a ton of stuff here.  But the overall flow of work and

>> organization is basically what I'd expect.

>

> That's good.  I wasn't setting out to do anything particularly

> innovative here, so the fact that it's not surprising is a positive

> sign. :-)

>

>> I know Richi was a bit scared off by the RTL nature, but the vast

>> majority of the code here isn't related to RTL -- it's operating on

>> properties that got extracted from the RTL.  I could almost take this

>> change bits at the start of the pass and the end and run it on gimple

>> :-)  Which brings up an interesting question, is that a worthwhile thing

>> to ponder?  Is it gimple optimizers that are causing objects to be live

>> across calls or is it usually teh RTL bits?

>

> It tends to be a mixture of both.  E.g. if we have two loops with

> the same bounds and a call inbetween, the gimple optimisers will reuse

> the WHILE_ULT in the first loop header for the second loop.  But there

> are also cases introduced in rtl.  E.g. we don't (and IMO shouldn't)

> expose to gimple that some SVE operations always require a predicate.

> The expanders instead generate all-true predicates where necessary.

> These are then an obvious candidate for PRE.

>

>> As I was reading one of the thoughts I had was that this reminded me a

>> bit of LCM.  WHich makes sense -- it's code motion after all.  I was

>> pleasantly surprised to see the comment explaining why the LCM

>> infrastructure was not used.

>

> Yeah, the availablity problem is basically the same.  The main

> difference is choosing the final placement based on block frequencies.

>

>> Do you have to do anything about rounding modes?  Or are those not an

>> issue for aarch64?  It's something I'm pretty sure we'd need to handle

>> for x86 as we have to ensure the rounding mode is stable if we move/copy

>> FP computations from one context to another.

>

> Rounding modes don't affect loads, stores or moves, which are just

> bit copies.  I suppose there is the problem that we could rematerialise

> an FP operation in a region with a different rounding mode, but we just

> don't have an IL that tracks that properly (in general, not just here).

>

>>> Index: gcc/early-remat.c

>>> ===================================================================

>>> --- /dev/null	2017-11-14 14:28:07.424493901 +0000

>>> +++ gcc/early-remat.c	2017-11-17 15:56:53.021759920 +0000

>>> @@ -0,0 +1,2610 @@

>>> +   Candidate validation and value numbering

>>> +   ========================================

>>> +

>>> +   Next we simultaneously decide which candidates are valid and look

>>> +   for candidates that are equivalent to each other, assigning numbers

>>> +   to each unique candidate value.  A candidate C is invalid if:

>>> +

>>> +     (a) C uses an invalid candidate;

>>> +

>>> +     (b) there is a cycle of candidate uses involving C; or

>>> +

>>> +     (c) C takes a candidate register R as input and the reaching

>>> +         definitions of R do not have the same value number.

>>> +

>>> +   We assign a "representative" candidate C to each value number and from

>>> +   here on replace references to other candidates with that value number

>>> +   with references to C.  It is then only possible to rematerialize a

>>> +   register R at point P if (after this replacement) there is a single

>>> +   reaching definition of R at P.

>> So how often are you seeing candidates that are equal to each other?  I

>> know Vlad did some experiments in the register allocator and the number

>> of equivalences seen by VN was relatively small.

>>

>> Maybe it's the case that there's something fundamentally different with

>> what you're doing allowing you to see more equivalences.

>>

>> Of course if you're using VNs as part of the validation process, then I

>> guess you get equivalences for free, so you might as well go ahead and

>> use them.

>

> It happened in at least one important case (can't remember which, sorry).

> That was enough to get me to implement it without first measuring how

> common it was.  Like you say, it almost came for free as part of

> validation.

>

>> Local/Global Phases seem reasonable at a high level.  About what you'd

>> expect.

>

> Thanks.

>

>>> +/* Return true if REGNO is worth rematerializing.  */

>>> +

>>> +bool

>>> +early_remat::interesting_regno_p (unsigned int regno)

>>> +{

>>> +  /* Ignore unused registers.  */

>>> +  rtx reg = regno_reg_rtx[regno];

>>> +  if (!reg || DF_REG_DEF_COUNT (regno) == 0)

>>> +    return false;

>>> +

>>> +  /* Make sure the register has a mode that we want to rematerialize.  */

>>> +  if (!bitmap_bit_p (m_selected_modes, GET_MODE (reg)))

>>> +    return false;

>>> +

>>> +  /* Ignore values that might sometimes be used uninitialized.  We could

>>> +     instead add dummy candidates for the entry block definition, and so

>>> +     handle uses that are definitely not uninitialized, but the combination

>>> +     of the two should be rare in practice.  */

>>> +  if (bitmap_bit_p (DF_LR_OUT (ENTRY_BLOCK_PTR_FOR_FN (m_fn)), regno))

>>> +    return false;

>>> +

>>> +  return true;

>>> +}

>> So triggering on modes isn't really ideal, but I can understand why

>> you're doing it that way.  I think it's OK for now, but there may be a

>> day when we want to do something different.

>>

>> That could one day be dealing with an architecture with a painful to

>> save register that is accessed in the same mode as other pseudos or

>> dealing with a remat pass to reduce pressure.

>

> I agree it might be useful to handle cases that aren't determined

> solely by the mode.  It should be easy to plug that in when necessary.

>

> I'm still not sure that this is the right place or algorithm for doing

> remat to reduce pressure (but I'd be happy to be proven wrong :-)).

>

>>> +/* Record the set of register REGNO in instruction INSN as a

>>> +   rematerialization candidate.  CAN_COPY_P is true unless we already

>>> +   know that rematerialization is impossible (in which case the candidate

>>> +   only exists for the reaching definition calculation).

>>> +

>>> +   The candidate's index is not fixed at this stage.  */

>>> +

>>> +remat_candidate *

>>> +early_remat::add_candidate (rtx_insn *insn, unsigned int regno,

>>> +			    bool can_copy_p)

>>> +{

>>> +  remat_candidate cand;

>>> +  memset (&cand, 0, sizeof (cand));

>>> +  cand.regno = regno;

>>> +  cand.insn = insn;

>>> +  cand.remat_rtx = PATTERN (insn);

>>> +  cand.can_copy_p = can_copy_p;

>> FWIW, if the fields you're assigning are at the start or the end of the

>> object, DSE will trim the memset to avoid unnecessary write traffic.

>> No need to try and rearrange the object to enable that optimization.

>> Just making you aware in case such things are of interest to you.

>

> OK.  This particular code isn't performance-sensitive, but it's

> definitely something that's worth bearing in mind.

>

>>> +      else if (HARD_REGISTER_NUM_P (use_regno))

>>> +	{

>>> +	  /* Allocate a dummy pseudo register and temporarily install it.

>>> +	     Make the register number depend on the mode, which should

>>> +	     provide enough sharing for match_dup while also weeding

>>> +	     out cases in which operands with different modes are

>>> +	     explicitly tied.  */

>>> +	  rtx *loc = DF_REF_REAL_LOC (ref);

>>> +	  unsigned int size = RTX_CODE_SIZE (REG);

>>> +	  rtx new_reg = (rtx) alloca (size);

>>> +	  memset (new_reg, 0, size);

>>> +	  PUT_CODE (new_reg, REG);

>>> +	  set_mode_and_regno (new_reg, GET_MODE (*loc),

>>> +			      LAST_VIRTUAL_REGISTER + 1 + GET_MODE (*loc));

>>> +	  validate_change (insn, loc, new_reg, 1);

>>> +	}

>>> +    }

>> Ewww.  But I guess it works OK and you undo it immediately.  There was

>> another similar instance later.  They're not ideal, but they're well

>> localized and I don't think worth making a big deal out of.

>

> Yeah, it's not pretty, but I don't think we can do much better given

> the current "try it and see" approach to insn recognition.

>

>>> +

>>> +/* Assuming that every path reaching a point P contains a copy of a

>>> +   use U of REGNO, return true if another copy of U at P would have

>>> +   access to the same value of REGNO.  */

>>> +

>>> +bool

>>> +early_remat::stable_use_p (unsigned int regno)

>>> +{

>>> +  /* Conservatively assume not for hard registers.  */

>>> +  if (HARD_REGISTER_NUM_P (regno))

>>> +    return false;

>>> +

>>> +  /* See if REGNO has a single definition and is never used uninitialized.

>>> +     In this case the definition of REGNO dominates the common dominator

>>> +     of the uses U, which in turn dominates P.  */

>>> +  if (DF_REG_DEF_COUNT (regno) == 1

>>> +      && !bitmap_bit_p (DF_LR_OUT (ENTRY_BLOCK_PTR_FOR_FN (m_fn)), regno))

>>> +    return true;

>>> +

>>> +  return false;

>>> +}

>> What about loops here?  The def may dominate the use but the value might

>> be changing each iteration of the loop.  Does that affect correctness of

>> the algorithm you're using?

>

> No, that's OK.  Since the pre-existing copies of the candidate

> instruction CI all use REGNO, and since every path to the rematerialisation

> point P includes a copy of CI, there can't be a path from the definition

> of REGNO to P that doesn't also include a copy of CI.  So the value of

> REGNO at P will always be the correct one.

>

>>> +

>>> +/* Emit a copy from DEST to SRC before candidate CAND_INDEX's

>>> instruction.  */

>>> +

>>> +void

>>> +early_remat::emit_copy_before (unsigned int cand_index, rtx dest, rtx src)

>>> +{

>>> +  remat_candidate *cand = &m_candidates[cand_index];

>>> +  if (dump_file)

>>> +    {

>>> +      fprintf (dump_file, ";; Stabilizing insn ");

>>> +      dump_insn_id (cand->insn);

>>> + fprintf (dump_file, " by copying source reg %d:%s to temporary reg

>>> %d\n",

>>> +	       REGNO (src), GET_MODE_NAME (GET_MODE (src)), REGNO (dest));

>>> +    }

>>> +  emit_insn_before (gen_move_insn (dest, src), cand->insn);

>> Might want to clarify that SRC and DEST are regs.  I was about to ask

>> about it, but then saw the mention in the dumping code.

>

> OK.

>

>> Again, generally this is largely what I'd expect.  Just a few questions

>> above to clarify a couple minor concerns, but I think this is fine.

>

> Thanks,

> Richard


2018-01-09  Richard Sandiford  <richard.sandiford@linaro.org>

gcc/
	* Makefile.in (OBJS): Add early-remat.o.
	* target.def (select_early_remat_modes): New hook.
	* doc/tm.texi.in (TARGET_SELECT_EARLY_REMAT_MODES): New hook.
	* doc/tm.texi: Regenerate.
	* targhooks.h (default_select_early_remat_modes): Declare.
	* targhooks.c (default_select_early_remat_modes): New function.
	* timevar.def (TV_EARLY_REMAT): New timevar.
	* passes.def (pass_early_remat): New pass.
	* tree-pass.h (make_pass_early_remat): Declare.
	* early-remat.c: New file.
	* config/aarch64/aarch64.c (aarch64_select_early_remat_modes): New
	function.
	(TARGET_SELECT_EARLY_REMAT_MODES): Define.

gcc/testsuite/
	* gcc.target/aarch64/sve_spill_1.c: Also test that no predicates
	are spilled.
	* gcc.target/aarch64/sve_spill_2.c: New test.
	* gcc.target/aarch64/sve_spill_3.c: Likewise.
	* gcc.target/aarch64/sve_spill_4.c: Likewise.
	* gcc.target/aarch64/sve_spill_5.c: Likewise.
	* gcc.target/aarch64/sve_spill_6.c: Likewise.
	* gcc.target/aarch64/sve_spill_7.c: Likewise.

Index: gcc/Makefile.in
===================================================================
--- gcc/Makefile.in	2018-01-03 11:12:59.095630403 +0000
+++ gcc/Makefile.in	2018-01-09 15:28:36.199732981 +0000
@@ -1277,6 +1277,7 @@ OBJS = \
 	dwarf2asm.o \
 	dwarf2cfi.o \
 	dwarf2out.o \
+	early-remat.o \
 	emit-rtl.o \
 	et-forest.o \
 	except.o \
Index: gcc/target.def
===================================================================
--- gcc/target.def	2018-01-09 15:09:57.243794886 +0000
+++ gcc/target.def	2018-01-09 15:28:36.204732772 +0000
@@ -5571,6 +5571,19 @@ reload from using some alternatives, lik
  default_preferred_output_reload_class)
 
 DEFHOOK
+(select_early_remat_modes,
+ "On some targets, certain modes cannot be held in registers around a\n\
+standard ABI call and are relatively expensive to spill to the stack.\n\
+The early rematerialization pass can help in such cases by aggressively\n\
+recomputing values after calls, so that they don't need to be spilled.\n\
+\n\
+This hook returns the set of such modes by setting the associated bits\n\
+in @var{modes}.  The default implementation selects no modes, which has\n\
+the effect of disabling the early rematerialization pass.",
+ void, (sbitmap modes),
+ default_select_early_remat_modes)
+
+DEFHOOK
 (class_likely_spilled_p,
  "A target hook which returns @code{true} if pseudos that have been assigned\n\
 to registers of class @var{rclass} would likely be spilled because\n\
Index: gcc/doc/tm.texi.in
===================================================================
--- gcc/doc/tm.texi.in	2018-01-09 15:09:57.242794927 +0000
+++ gcc/doc/tm.texi.in	2018-01-09 15:28:36.203732814 +0000
@@ -2307,6 +2307,8 @@ Do not define this macro if you do not d
 
 @hook TARGET_SECONDARY_MEMORY_NEEDED_MODE
 
+@hook TARGET_SELECT_EARLY_REMAT_MODES
+
 @hook TARGET_CLASS_LIKELY_SPILLED_P
 
 @hook TARGET_CLASS_MAX_NREGS
Index: gcc/doc/tm.texi
===================================================================
--- gcc/doc/tm.texi	2018-01-09 15:09:57.241794968 +0000
+++ gcc/doc/tm.texi	2018-01-09 15:28:36.202732856 +0000
@@ -2774,6 +2774,17 @@ details.
 With LRA, the default is to use @var{mode} unmodified.
 @end deftypefn
 
+@deftypefn {Target Hook} void TARGET_SELECT_EARLY_REMAT_MODES (sbitmap @var{modes})
+On some targets, certain modes cannot be held in registers around a
+standard ABI call and are relatively expensive to spill to the stack.
+The early rematerialization pass can help in such cases by aggressively
+recomputing values after calls, so that they don't need to be spilled.
+
+This hook returns the set of such modes by setting the associated bits
+in @var{modes}.  The default implementation selects no modes, which has
+the effect of disabling the early rematerialization pass.
+@end deftypefn
+
 @deftypefn {Target Hook} bool TARGET_CLASS_LIKELY_SPILLED_P (reg_class_t @var{rclass})
 A target hook which returns @code{true} if pseudos that have been assigned
 to registers of class @var{rclass} would likely be spilled because
Index: gcc/targhooks.h
===================================================================
--- gcc/targhooks.h	2018-01-09 15:09:34.792756213 +0000
+++ gcc/targhooks.h	2018-01-09 15:28:36.205732731 +0000
@@ -285,6 +285,7 @@ extern unsigned int default_min_arithmet
 extern enum flt_eval_method
 default_excess_precision (enum excess_precision_type ATTRIBUTE_UNUSED);
 extern bool default_stack_clash_protection_final_dynamic_probe (rtx);
+extern void default_select_early_remat_modes (sbitmap);
 
 extern rtx
 default_inhibit_load_speculation (machine_mode, rtx, rtx, rtx, rtx, rtx, rtx);
Index: gcc/targhooks.c
===================================================================
--- gcc/targhooks.c	2018-01-09 15:09:34.792756213 +0000
+++ gcc/targhooks.c	2018-01-09 15:28:36.205732731 +0000
@@ -83,6 +83,7 @@ Software Foundation; either version 3, o
 #include "real.h"
 #include "langhooks.h"
 #include "dojump.h"
+#include "sbitmap.h"
 
 bool
 default_legitimate_address_p (machine_mode mode ATTRIBUTE_UNUSED,
@@ -2382,4 +2383,11 @@ default_inhibit_load_speculation (machin
   return result;
 }
 
+/* The default implementation of TARGET_EARLY_REMAT_MODES.  */
+
+void
+default_select_early_remat_modes (sbitmap)
+{
+}
+
 #include "gt-targhooks.h"
Index: gcc/timevar.def
===================================================================
--- gcc/timevar.def	2018-01-03 11:12:57.799681474 +0000
+++ gcc/timevar.def	2018-01-09 15:28:36.205732731 +0000
@@ -253,6 +253,7 @@ DEFTIMEVAR (TV_MODE_SWITCH           , "
 DEFTIMEVAR (TV_SMS		     , "sms modulo scheduling")
 DEFTIMEVAR (TV_LIVE_RANGE_SHRINKAGE  , "live range shrinkage")
 DEFTIMEVAR (TV_SCHED                 , "scheduling")
+DEFTIMEVAR (TV_EARLY_REMAT           , "early rematerialization")
 DEFTIMEVAR (TV_IRA		     , "integrated RA")
 DEFTIMEVAR (TV_LRA		     , "LRA non-specific")
 DEFTIMEVAR (TV_LRA_ELIMINATE	     , "LRA virtuals elimination")
Index: gcc/passes.def
===================================================================
--- gcc/passes.def	2018-01-03 11:12:57.677686279 +0000
+++ gcc/passes.def	2018-01-09 15:28:36.203732814 +0000
@@ -460,6 +460,7 @@ along with GCC; see the file COPYING3.
       NEXT_PASS (pass_sms);
       NEXT_PASS (pass_live_range_shrinkage);
       NEXT_PASS (pass_sched);
+      NEXT_PASS (pass_early_remat);
       NEXT_PASS (pass_ira);
       NEXT_PASS (pass_reload);
       NEXT_PASS (pass_postreload);
Index: gcc/tree-pass.h
===================================================================
--- gcc/tree-pass.h	2018-01-03 11:12:56.406736330 +0000
+++ gcc/tree-pass.h	2018-01-09 15:28:36.205732731 +0000
@@ -576,6 +576,7 @@ extern rtl_opt_pass *make_pass_mode_swit
 extern rtl_opt_pass *make_pass_sms (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_sched (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_live_range_shrinkage (gcc::context *ctxt);
+extern rtl_opt_pass *make_pass_early_remat (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_ira (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_reload (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_clean_state (gcc::context *ctxt);
Index: gcc/early-remat.c
===================================================================
--- /dev/null	2018-01-08 18:48:58.045015662 +0000
+++ gcc/early-remat.c	2018-01-09 15:28:36.203732814 +0000
@@ -0,0 +1,2611 @@
+/* Early (pre-RA) rematerialization
+   Copyright (C) 2017 Free Software Foundation, Inc.
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify it under
+the terms of the GNU General Public License as published by the Free
+Software Foundation; either version 3, or (at your option) any later
+version.
+
+GCC is distributed in the hope that it will be useful, but WITHOUT ANY
+WARRANTY; without even the implied warranty of MERCHANTABILITY or
+FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.  If not see
+<http://www.gnu.org/licenses/>.  */
+
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "backend.h"
+#include "rtl.h"
+#include "df.h"
+#include "tree-pass.h"
+#include "memmodel.h"
+#include "emit-rtl.h"
+#include "insn-config.h"
+#include "recog.h"
+/* FIXME: The next two are only needed for gen_move_insn.  */
+#include "tree.h"
+#include "expr.h"
+#include "target.h"
+#include "inchash.h"
+#include "rtlhash.h"
+#include "print-rtl.h"
+#include "rtl-iter.h"
+
+/* This pass runs before register allocation and implements an aggressive
+   form of rematerialization.  It looks for pseudo registers R of mode M
+   for which:
+
+     (a) there are no call-preserved registers of mode M; and
+     (b) spilling R to the stack is expensive.
+
+   The assumption is that it's better to recompute R after each call instead
+   of spilling it, even if this extends the live ranges of other registers.
+
+   The motivating example for which these conditions hold are AArch64 SVE
+   vectors and predicates.  Spilling them to the stack makes the frame
+   variable-sized, which we'd like to avoid if possible.  It's also very
+   rare for SVE values to be "naturally" live across a call: usually this
+   happens as a result of CSE or other code motion.
+
+   The pass is split into the following phases:
+
+   Collection phase
+   ================
+
+   First we go through all pseudo registers looking for any that meet
+   the conditions above.  For each such register R, we go through each
+   instruction that defines R to see whether any of them are suitable
+   rematerialization candidates.  If at least one is, we treat all the
+   instructions that define R as candidates, but record which ones are
+   not in fact suitable.  These unsuitable candidates exist only for the
+   sake of calculating reaching definitions (see below).
+
+   A "candidate" is a single instruction that we want to rematerialize
+   and a "candidate register" is a register that is set by at least one
+   candidate.
+
+   Candidate sorting
+   =================
+
+   Next we sort the candidates based on the cfg postorder, so that if
+   candidate C1 uses candidate C2, C1 has a lower index than C2.
+   This is useful when iterating through candidate bitmaps.
+
+   Reaching definition calculation
+   ===============================
+
+   We then compute standard reaching-definition sets for each candidate.
+   Each set specifies which candidates might provide the current definition
+   of a live candidate register.
+
+   From here on, a candidate C is "live" at a point P if the candidate
+   register defined by C is live at P and if C's definition reaches P.
+   An instruction I "uses" a candidate C if I takes the register defined by
+   C as input and if C is one of the reaching definitions of that register.
+
+   Candidate validation and value numbering
+   ========================================
+
+   Next we simultaneously decide which candidates are valid and look
+   for candidates that are equivalent to each other, assigning numbers
+   to each unique candidate value.  A candidate C is invalid if:
+
+     (a) C uses an invalid candidate;
+
+     (b) there is a cycle of candidate uses involving C; or
+
+     (c) C takes a candidate register R as input and the reaching
+         definitions of R do not have the same value number.
+
+   We assign a "representative" candidate C to each value number and from
+   here on replace references to other candidates with that value number
+   with references to C.  It is then only possible to rematerialize a
+   register R at point P if (after this replacement) there is a single
+   reaching definition of R at P.
+
+   Local phase
+   ===========
+
+   During this phase we go through each block and look for cases in which:
+
+     (a) an instruction I comes between two call instructions CI1 and CI2;
+
+     (b) I uses a candidate register R;
+
+     (c) a candidate C provides the only reaching definition of R; and
+
+     (d) C does not come between CI1 and I.
+
+   We then emit a copy of C after CI1, as well as the transitive closure
+   TC of the candidates used by C.  The copies of TC might use the original
+   candidate registers or new temporary registers, depending on circumstances.
+
+   For example, if elsewhere we have:
+
+       C3: R3 <- f3 (...)
+	   ...
+       C2: R2 <- f2 (...)
+	   ...
+       C1: R1 <- f1 (R2, R3, ...)  // uses C2 and C3
+
+   then for a block containing:
+
+      CI1: call
+	   ...
+	I: use R1  // uses C1
+	   ...
+      CI2: call
+
+   we would emit:
+
+      CI1: call
+      C3': R3' <- f3 (...)
+      C2': R2' <- f2 (...)
+      C1': R1 <- f1 (R2', R3', ...)
+	   ...
+	I: use R1
+	   ...
+      CI2: call
+
+   where R2' and R3' might be fresh registers.  If instead we had:
+
+      CI1: call
+	   ...
+       I1: use R1  // uses C1
+	   ...
+       I2: use R3  // uses C3
+	   ...
+      CI2: call
+
+   we would keep the original R3:
+
+      CI1: call
+      C3': R3 <- f3 (...)
+      C2': R2' <- f2 (...)
+      C1': R1 <- f1 (R2', R3, ...)
+	   ...
+       I1: use R1  // uses C1
+	   ...
+       I2: use R3  // uses C3
+	   ...
+      CI2: call
+
+   We also record the last call in each block (if any) and compute:
+
+     rd_after_call:
+       The set of candidates that either (a) are defined outside the block
+       and are live after the last call or (b) are defined within the block
+       and reach the end of the last call.  (We don't track whether the
+       latter values are live or not.)
+
+     required_after_call:
+       The set of candidates that need to be rematerialized after the
+       last call in order to satisfy uses in the block itself.
+
+     required_in:
+       The set of candidates that are live on entry to the block and are
+       used without an intervening call.
+
+   In addition, we compute the initial values of the sets required by
+   the global phase below.
+
+   Global phase
+   ============
+
+   We next compute a maximal solution to the following availability
+   problem:
+
+     available_in:
+       The set of candidates that are live on entry to a block and can
+       be used at that point without rematerialization.
+
+     available_out:
+       The set of candidates that are live on exit from a block and can
+       be used at that point without rematerialization.
+
+     available_locally:
+       The subset of available_out that is due to code in the block itself.
+       It contains candidates that are defined or used in the block and
+       not invalidated by a later call.
+
+   We then go through each block B and look for an appropriate place
+   to insert copies of required_in - available_in.  Conceptually we
+   start by placing the copies at the head of B, but then move the
+   copy of a candidate C to predecessors if:
+
+     (a) that seems cheaper;
+
+     (b) there is more than one reaching definition of C's register at
+	 the head of B; or
+
+     (c) copying C would clobber a hard register that is live on entry to B.
+
+   Moving a copy of C to a predecessor block PB involves:
+
+     (1) adding C to PB's required_after_call, if PB contains a call; or
+
+     (2) adding C PB's required_in otherwise.
+
+   C is then available on output from each PB and on input to B.
+
+   Once all this is done, we emit instructions for the final required_in
+   and required_after_call sets.  */
+
+namespace {
+
+/* An invalid candidate index, used to indicate that there is more than
+   one reaching definition.  */
+const unsigned int MULTIPLE_CANDIDATES = -1U;
+
+/* Pass-specific information about one basic block.  */
+struct remat_block_info {
+  /* The last call instruction in the block.  */
+  rtx_insn *last_call;
+
+  /* The set of candidates that are live on entry to the block.  NULL is
+     equivalent to an empty set.  */
+  bitmap rd_in;
+
+  /* The set of candidates that are live on exit from the block.  This might
+     reuse rd_in.  NULL is equivalent to an empty set.  */
+  bitmap rd_out;
+
+  /* The subset of RD_OUT that comes from local definitions.  NULL is
+     equivalent to an empty set.  */
+  bitmap rd_gen;
+
+  /* The set of candidates that the block invalidates (because it defines
+     the register to something else, or because the register's value is
+     no longer important).  NULL is equivalent to an empty set.  */
+  bitmap rd_kill;
+
+  /* The set of candidates that either (a) are defined outside the block
+     and are live after LAST_CALL or (b) are defined within the block
+     and reach the instruction after LAST_CALL.  (We don't track whether
+     the latter values are live or not.)
+
+     Only used if LAST_CALL is nonnull.  NULL is equivalent to an
+     empty set.  */
+  bitmap rd_after_call;
+
+  /* Candidates that are live and available without rematerialization
+     on entry to the block.  NULL is equivalent to an empty set.  */
+  bitmap available_in;
+
+  /* Candidates that become available without rematerialization within the
+     block, and remain so on exit.  NULL is equivalent to an empty set.  */
+  bitmap available_locally;
+
+  /* Candidates that are available without rematerialization on exit from
+     the block.  This might reuse available_in or available_locally.  */
+  bitmap available_out;
+
+  /* Candidates that need to be rematerialized either at the start of the
+     block or before entering the block.  */
+  bitmap required_in;
+
+  /* Candidates that need to be rematerialized after LAST_CALL.
+     Only used if LAST_CALL is nonnull.  */
+  bitmap required_after_call;
+
+  /* The number of candidates in the block.  */
+  unsigned int num_candidates;
+
+  /* The earliest candidate in the block (i.e. the one with the
+     highest index).  Only valid if NUM_CANDIDATES is nonzero.  */
+  unsigned int first_candidate;
+
+  /* The best (lowest) execution frequency for rematerializing REQUIRED_IN.
+     This is the execution frequency of the block if LOCAL_REMAT_CHEAPER_P,
+     otherwise it is the sum of the execution frequencies of whichever
+     predecessor blocks would do the rematerialization.  */
+  int remat_frequency;
+
+  /* True if the block ends with an abnormal call.  */
+  unsigned int abnormal_call_p : 1;
+
+  /* Used to record whether a graph traversal has visited this block.  */
+  unsigned int visited_p : 1;
+
+  /* True if we have calculated REMAT_FREQUENCY.  */
+  unsigned int remat_frequency_valid_p : 1;
+
+  /* True if it is cheaper to rematerialize candidates at the start of
+     the block, rather than moving them to predecessor blocks.  */
+  unsigned int local_remat_cheaper_p : 1;
+};
+
+/* Information about a group of candidates with the same value number.  */
+struct remat_equiv_class {
+  /* The candidates that have the same value number.  */
+  bitmap members;
+
+  /* The candidate that was first added to MEMBERS.  */
+  unsigned int earliest;
+
+  /* The candidate that represents the others.  This is always the one
+     with the highest index.  */
+  unsigned int representative;
+};
+
+/* Information about an instruction that we might want to rematerialize.  */
+struct remat_candidate {
+  /* The pseudo register that the instruction sets.  */
+  unsigned int regno;
+
+  /* A temporary register used when rematerializing uses of this candidate,
+     if REGNO doesn't have the right value or isn't worth using.  */
+  unsigned int copy_regno;
+
+  /* True if we intend to rematerialize this instruction by emitting
+     a move of a constant into REGNO, false if we intend to emit a
+     copy of the original instruction.  */
+  unsigned int constant_p : 1;
+
+  /* True if we still think it's possible to rematerialize INSN.  */
+  unsigned int can_copy_p : 1;
+
+  /* Used to record whether a graph traversal has visited this candidate.  */
+  unsigned int visited_p : 1;
+
+  /* True if we have verified that it's possible to rematerialize INSN.
+     Once this is true, both it and CAN_COPY_P remain true.  */
+  unsigned int validated_p : 1;
+
+  /* True if we have "stabilized" INSN, i.e. ensured that all non-candidate
+     registers read by INSN will have the same value when rematerializing INSN.
+     Only ever true if CAN_COPY_P.  */
+  unsigned int stabilized_p : 1;
+
+  /* Hash value used for value numbering.  */
+  hashval_t hash;
+
+  /* The instruction that sets REGNO.  */
+  rtx_insn *insn;
+
+  /* If CONSTANT_P, the value that should be moved into REGNO when
+     rematerializing, otherwise the pattern of the instruction that
+     should be used.  */
+  rtx remat_rtx;
+
+  /* The set of candidates that INSN takes as input.  NULL is equivalent
+     to the empty set.  All candidates in this set have a higher index
+     than the current candidate.  */
+  bitmap uses;
+
+  /* The set of hard registers that would be clobbered by rematerializing
+     the candidate, including (transitively) all those that would be
+     clobbered by rematerializing USES.  */
+  bitmap clobbers;
+
+  /* The equivalence class to which the candidate belongs, or null if none.  */
+  remat_equiv_class *equiv_class;
+};
+
+/* Hash functions used for value numbering.  */
+struct remat_candidate_hasher : nofree_ptr_hash <remat_candidate>
+{
+  typedef value_type compare_type;
+  static hashval_t hash (const remat_candidate *);
+  static bool equal (const remat_candidate *, const remat_candidate *);
+};
+
+/* Main class for this pass.  */
+class early_remat {
+public:
+  early_remat (function *, sbitmap);
+  ~early_remat ();
+
+  void run (void);
+
+private:
+  bitmap alloc_bitmap (void);
+  bitmap get_bitmap (bitmap *);
+  void init_temp_bitmap (bitmap *);
+  void copy_temp_bitmap (bitmap *, bitmap *);
+
+  void dump_insn_id (rtx_insn *);
+  void dump_candidate_bitmap (bitmap);
+  void dump_all_candidates (void);
+  void dump_edge_list (basic_block, bool);
+  void dump_block_info (basic_block);
+  void dump_all_blocks (void);
+
+  bool interesting_regno_p (unsigned int);
+  remat_candidate *add_candidate (rtx_insn *, unsigned int, bool);
+  bool maybe_add_candidate (rtx_insn *, unsigned int);
+  bool collect_candidates (void);
+  void init_block_info (void);
+  void sort_candidates (void);
+  void finalize_candidate_indices (void);
+  void record_equiv_candidates (unsigned int, unsigned int);
+  static bool rd_confluence_n (edge);
+  static bool rd_transfer (int);
+  void compute_rd (void);
+  unsigned int canon_candidate (unsigned int);
+  void canon_bitmap (bitmap *);
+  unsigned int resolve_reaching_def (bitmap);
+  bool check_candidate_uses (unsigned int);
+  void compute_clobbers (unsigned int);
+  void assign_value_number (unsigned int);
+  void decide_candidate_validity (void);
+  bool stable_use_p (unsigned int);
+  void emit_copy_before (unsigned int, rtx, rtx);
+  void stabilize_pattern (unsigned int);
+  void replace_dest_with_copy (unsigned int);
+  void stabilize_candidate_uses (unsigned int, bitmap, bitmap, bitmap,
+				 bitmap);
+  void emit_remat_insns (bitmap, bitmap, bitmap, rtx_insn *);
+  bool set_available_out (remat_block_info *);
+  void process_block (basic_block);
+  void local_phase (void);
+  static bool avail_confluence_n (edge);
+  static bool avail_transfer (int);
+  void compute_availability (void);
+  void unshare_available_sets (remat_block_info *);
+  bool can_move_across_edge_p (edge);
+  bool local_remat_cheaper_p (unsigned int);
+  bool need_to_move_candidate_p (unsigned int, unsigned int);
+  void compute_minimum_move_set (unsigned int, bitmap);
+  void move_to_predecessors (unsigned int, bitmap, bitmap);
+  void choose_rematerialization_points (void);
+  void emit_remat_insns_for_block (basic_block);
+  void global_phase (void);
+
+  /* The function that we're optimizing.  */
+  function *m_fn;
+
+  /* The modes that we want to rematerialize.  */
+  sbitmap m_selected_modes;
+
+  /* All rematerialization candidates, identified by their index into the
+     vector.  */
+  auto_vec<remat_candidate> m_candidates;
+
+  /* The set of candidate registers.  */
+  bitmap_head m_candidate_regnos;
+
+  /* Temporary sets.  */
+  bitmap_head m_tmp_bitmap;
+  bitmap m_available;
+  bitmap m_required;
+
+  /* Information about each basic block.  */
+  auto_vec<remat_block_info> m_block_info;
+
+  /* A mapping from register numbers to the set of associated candidates.
+     Only valid for registers in M_CANDIDATE_REGNOS.  */
+  auto_vec<bitmap> m_regno_to_candidates;
+
+  /* An obstack used for allocating bitmaps, so that we can free them all
+     in one go.  */
+  bitmap_obstack m_obstack;
+
+  /* A hash table of candidates used for value numbering.  If a candidate
+     in the table is in an equivalence class, the candidate is marked as
+     the earliest member of the class.  */
+  hash_table<remat_candidate_hasher> m_value_table;
+
+  /* Used temporarily by callback functions.  */
+  static early_remat *er;
+};
+
+}
+
+early_remat *early_remat::er;
+
+/* rtx_equal_p_cb callback that treats any two SCRATCHes as equal.
+   This allows us to compare two copies of a pattern, even though their
+   SCRATCHes are always distinct.  */
+
+static int
+scratch_equal (const_rtx *x, const_rtx *y, rtx *nx, rtx *ny)
+{
+  if (GET_CODE (*x) == SCRATCH && GET_CODE (*y) == SCRATCH)
+    {
+      *nx = const0_rtx;
+      *ny = const0_rtx;
+      return 1;
+    }
+  return 0;
+}
+
+/* Hash callback functions for remat_candidate.  */
+
+hashval_t
+remat_candidate_hasher::hash (const remat_candidate *cand)
+{
+  return cand->hash;
+}
+
+bool
+remat_candidate_hasher::equal (const remat_candidate *cand1,
+			       const remat_candidate *cand2)
+{
+  return (cand1->regno == cand2->regno
+	  && cand1->constant_p == cand2->constant_p
+	  && (cand1->constant_p
+	      ? rtx_equal_p (cand1->remat_rtx, cand2->remat_rtx)
+	      : rtx_equal_p_cb (cand1->remat_rtx, cand2->remat_rtx,
+				scratch_equal))
+	  && (!cand1->uses || bitmap_equal_p (cand1->uses, cand2->uses)));
+}
+
+/* Return true if B is null or empty.  */
+
+inline bool
+empty_p (bitmap b)
+{
+  return !b || bitmap_empty_p (b);
+}
+
+/* Allocate a new bitmap.  It will be automatically freed at the end of
+   the pass.  */
+
+inline bitmap
+early_remat::alloc_bitmap (void)
+{
+  return bitmap_alloc (&m_obstack);
+}
+
+/* Initialize *PTR to an empty bitmap if it is currently null.  */
+
+inline bitmap
+early_remat::get_bitmap (bitmap *ptr)
+{
+  if (!*ptr)
+    *ptr = alloc_bitmap ();
+  return *ptr;
+}
+
+/* *PTR is either null or empty.  If it is null, initialize it to an
+   empty bitmap.  */
+
+inline void
+early_remat::init_temp_bitmap (bitmap *ptr)
+{
+  if (!*ptr)
+    *ptr = alloc_bitmap ();
+  else
+    gcc_checking_assert (bitmap_empty_p (*ptr));
+}
+
+/* Move *SRC to *DEST and leave *SRC empty.  */
+
+inline void
+early_remat::copy_temp_bitmap (bitmap *dest, bitmap *src)
+{
+  if (!empty_p (*src))
+    {
+      *dest = *src;
+      *src = NULL;
+    }
+  else
+    *dest = NULL;
+}
+
+/* Print INSN's identifier to the dump file.  */
+
+void
+early_remat::dump_insn_id (rtx_insn *insn)
+{
+  fprintf (dump_file, "%d[bb:%d]", INSN_UID (insn),
+	   BLOCK_FOR_INSN (insn)->index);
+}
+
+/* Print candidate set CANDIDATES to the dump file, with a leading space.  */
+
+void
+early_remat::dump_candidate_bitmap (bitmap candidates)
+{
+  if (empty_p (candidates))
+    {
+      fprintf (dump_file, " none");
+      return;
+    }
+
+  unsigned int cand_index;
+  bitmap_iterator bi;
+  EXECUTE_IF_SET_IN_BITMAP (candidates, 0, cand_index, bi)
+    fprintf (dump_file, " %d", cand_index);
+}
+
+/* Print information about all candidates to the dump file.  */
+
+void
+early_remat::dump_all_candidates (void)
+{
+  fprintf (dump_file, "\n;; Candidates:\n;;\n");
+  fprintf (dump_file, ";; %5s %5s %8s %s\n", "#", "reg", "mode", "insn");
+  fprintf (dump_file, ";; %5s %5s %8s %s\n", "=", "===", "====", "====");
+  unsigned int cand_index;
+  remat_candidate *cand;
+  FOR_EACH_VEC_ELT (m_candidates, cand_index, cand)
+    {
+      fprintf (dump_file, ";; %5d %5d %8s ", cand_index, cand->regno,
+	       GET_MODE_NAME (GET_MODE (regno_reg_rtx[cand->regno])));
+      dump_insn_id (cand->insn);
+      if (!cand->can_copy_p)
+	fprintf (dump_file, "   -- can't copy");
+      fprintf (dump_file, "\n");
+    }
+
+  fprintf (dump_file, "\n;; Register-to-candidate mapping:\n;;\n");
+  unsigned int regno;
+  bitmap_iterator bi;
+  EXECUTE_IF_SET_IN_BITMAP (&m_candidate_regnos, 0, regno, bi)
+    {
+      fprintf (dump_file, ";; %5d:", regno);
+      dump_candidate_bitmap (m_regno_to_candidates[regno]);
+      fprintf (dump_file, "\n");
+    }
+}
+
+/* Print the predecessors or successors of BB to the dump file, with a
+   leading space.  DO_SUCC is true to print successors and false to print
+   predecessors.  */
+
+void
+early_remat::dump_edge_list (basic_block bb, bool do_succ)
+{
+  edge e;
+  edge_iterator ei;
+  FOR_EACH_EDGE (e, ei, do_succ ? bb->succs : bb->preds)
+    dump_edge_info (dump_file, e, 0, do_succ);
+}
+
+/* Print information about basic block BB to the dump file.  */
+
+void
+early_remat::dump_block_info (basic_block bb)
+{
+  remat_block_info *info = &m_block_info[bb->index];
+  fprintf (dump_file, ";;\n;; Block %d:", bb->index);
+  int width = 25;
+
+  fprintf (dump_file, "\n;;%*s:", width, "predecessors");
+  dump_edge_list (bb, false);
+
+  fprintf (dump_file, "\n;;%*s:", width, "successors");
+  dump_edge_list (bb, true);
+
+  fprintf (dump_file, "\n;;%*s: %d", width, "frequency",
+	   bb->count.to_frequency (m_fn));
+
+  if (info->last_call)
+    fprintf (dump_file, "\n;;%*s: %d", width, "last call",
+	     INSN_UID (info->last_call));
+
+  if (!empty_p (info->rd_in))
+    {
+      fprintf (dump_file, "\n;;%*s:", width, "RD in");
+      dump_candidate_bitmap (info->rd_in);
+    }
+  if (!empty_p (info->rd_kill))
+    {
+      fprintf (dump_file, "\n;;%*s:", width, "RD kill");
+      dump_candidate_bitmap (info->rd_kill);
+    }
+  if (!empty_p (info->rd_gen))
+    {
+      fprintf (dump_file, "\n;;%*s:", width, "RD gen");
+      dump_candidate_bitmap (info->rd_gen);
+    }
+  if (!empty_p (info->rd_after_call))
+    {
+      fprintf (dump_file, "\n;;%*s:", width, "RD after call");
+      dump_candidate_bitmap (info->rd_after_call);
+    }
+  if (!empty_p (info->rd_out))
+    {
+      fprintf (dump_file, "\n;;%*s:", width, "RD out");
+      if (info->rd_in == info->rd_out)
+	fprintf (dump_file, " RD in");
+      else
+	dump_candidate_bitmap (info->rd_out);
+    }
+  if (!empty_p (info->available_in))
+    {
+      fprintf (dump_file, "\n;;%*s:", width, "available in");
+      dump_candidate_bitmap (info->available_in);
+    }
+  if (!empty_p (info->available_locally))
+    {
+      fprintf (dump_file, "\n;;%*s:", width, "available locally");
+      dump_candidate_bitmap (info->available_locally);
+    }
+  if (!empty_p (info->available_out))
+    {
+      fprintf (dump_file, "\n;;%*s:", width, "available out");
+      if (info->available_in == info->available_out)
+	fprintf (dump_file, " available in");
+      else if (info->available_locally == info->available_out)
+	fprintf (dump_file, " available locally");
+      else
+	dump_candidate_bitmap (info->available_out);
+    }
+  if (!empty_p (info->required_in))
+    {
+      fprintf (dump_file, "\n;;%*s:", width, "required in");
+      dump_candidate_bitmap (info->required_in);
+    }
+  if (!empty_p (info->required_after_call))
+    {
+      fprintf (dump_file, "\n;;%*s:", width, "required after call");
+      dump_candidate_bitmap (info->required_after_call);
+    }
+  fprintf (dump_file, "\n");
+}
+
+/* Print information about all basic blocks to the dump file.  */
+
+void
+early_remat::dump_all_blocks (void)
+{
+  basic_block bb;
+  FOR_EACH_BB_FN (bb, m_fn)
+    dump_block_info (bb);
+}
+
+/* Return true if REGNO is worth rematerializing.  */
+
+bool
+early_remat::interesting_regno_p (unsigned int regno)
+{
+  /* Ignore unused registers.  */
+  rtx reg = regno_reg_rtx[regno];
+  if (!reg || DF_REG_DEF_COUNT (regno) == 0)
+    return false;
+
+  /* Make sure the register has a mode that we want to rematerialize.  */
+  if (!bitmap_bit_p (m_selected_modes, GET_MODE (reg)))
+    return false;
+
+  /* Ignore values that might sometimes be used uninitialized.  We could
+     instead add dummy candidates for the entry block definition, and so
+     handle uses that are definitely not uninitialized, but the combination
+     of the two should be rare in practice.  */
+  if (bitmap_bit_p (DF_LR_OUT (ENTRY_BLOCK_PTR_FOR_FN (m_fn)), regno))
+    return false;
+
+  return true;
+}
+
+/* Record the set of register REGNO in instruction INSN as a
+   rematerialization candidate.  CAN_COPY_P is true unless we already
+   know that rematerialization is impossible (in which case the candidate
+   only exists for the reaching definition calculation).
+
+   The candidate's index is not fixed at this stage.  */
+
+remat_candidate *
+early_remat::add_candidate (rtx_insn *insn, unsigned int regno,
+			    bool can_copy_p)
+{
+  remat_candidate cand;
+  memset (&cand, 0, sizeof (cand));
+  cand.regno = regno;
+  cand.insn = insn;
+  cand.remat_rtx = PATTERN (insn);
+  cand.can_copy_p = can_copy_p;
+  m_candidates.safe_push (cand);
+
+  bitmap_set_bit (&m_candidate_regnos, regno);
+
+  return &m_candidates.last ();
+}
+
+/* Return true if we can rematerialize the set of register REGNO in
+   instruction INSN, and add it as a candidate if so.  When returning
+   false, print the reason to the dump file.  */
+
+bool
+early_remat::maybe_add_candidate (rtx_insn *insn, unsigned int regno)
+{
+#define FAILURE_FORMAT ";; Can't rematerialize set of reg %d in %d[bb:%d]: "
+#define FAILURE_ARGS regno, INSN_UID (insn), BLOCK_FOR_INSN (insn)->index
+
+  /* The definition must come from an ordinary instruction.  */
+  basic_block bb = BLOCK_FOR_INSN (insn);
+  if (!NONJUMP_INSN_P (insn)
+      || (insn == BB_END (bb)
+	  && has_abnormal_or_eh_outgoing_edge_p (bb)))
+    {
+      if (dump_file)
+	fprintf (dump_file, FAILURE_FORMAT "insn alters control flow\n",
+		 FAILURE_ARGS);
+      return false;
+    }
+
+  /* Prefer to rematerialize constants directly -- it's much easier.  */
+  machine_mode mode = GET_MODE (regno_reg_rtx[regno]);
+  if (rtx note = find_reg_equal_equiv_note (insn))
+    {
+      rtx val = XEXP (note, 0);
+      if (CONSTANT_P (val)
+	  && targetm.legitimate_constant_p (mode, val))
+	{
+	  remat_candidate *cand = add_candidate (insn, regno, true);
+	  cand->constant_p = true;
+	  cand->remat_rtx = val;
+	  return true;
+	}
+    }
+
+  /* See whether the target has reasons to prevent a copy.  */
+  if (targetm.cannot_copy_insn_p && targetm.cannot_copy_insn_p (insn))
+    {
+      if (dump_file)
+	fprintf (dump_file, FAILURE_FORMAT "target forbids copying\n",
+		 FAILURE_ARGS);
+      return false;
+    }
+
+  /* We can't copy trapping instructions.  */
+  rtx pat = PATTERN (insn);
+  if (may_trap_p (pat))
+    {
+      if (dump_file)
+	fprintf (dump_file, FAILURE_FORMAT "insn might trap\n", FAILURE_ARGS);
+      return false;
+    }
+
+  /* We can't copy instructions that read memory, unless we know that
+     the contents never change.  */
+  subrtx_iterator::array_type array;
+  FOR_EACH_SUBRTX (iter, array, pat, ALL)
+    if (MEM_P (*iter) && !MEM_READONLY_P (*iter))
+      {
+	if (dump_file)
+	  fprintf (dump_file, FAILURE_FORMAT "insn references non-constant"
+		   " memory\n", FAILURE_ARGS);
+	return false;
+      }
+
+  /* Check each defined register.  */
+  df_ref ref;
+  FOR_EACH_INSN_DEF (ref, insn)
+    {
+      unsigned int def_regno = DF_REF_REGNO (ref);
+      if (def_regno == regno)
+	{
+	  /* Make sure the definition is write-only.  (Partial definitions,
+	     such as setting the low part and clobbering the high part,
+	     are otherwise OK.)  */
+	  if (DF_REF_FLAGS_IS_SET (ref, DF_REF_READ_WRITE))
+	    {
+	      if (dump_file)
+		fprintf (dump_file, FAILURE_FORMAT "destination is"
+			 " read-modify-write\n", FAILURE_ARGS);
+	      return false;
+	    }
+	}
+      else
+	{
+	  /* The instruction can set additional registers, provided that
+	     they're call-clobbered hard registers.  This is useful for
+	     instructions that alter the condition codes.  */
+	  if (!HARD_REGISTER_NUM_P (def_regno))
+	    {
+	      if (dump_file)
+		fprintf (dump_file, FAILURE_FORMAT "insn also sets"
+			 " pseudo reg %d\n", FAILURE_ARGS, def_regno);
+	      return false;
+	    }
+	  if (global_regs[def_regno])
+	    {
+	      if (dump_file)
+		fprintf (dump_file, FAILURE_FORMAT "insn also sets"
+			 " global reg %d\n", FAILURE_ARGS, def_regno);
+	      return false;
+	    }
+	  if (!TEST_HARD_REG_BIT (regs_invalidated_by_call, def_regno))
+	    {
+	      if (dump_file)
+		fprintf (dump_file, FAILURE_FORMAT "insn also sets"
+			 " call-preserved reg %d\n", FAILURE_ARGS, def_regno);
+	      return false;
+	    }
+	}
+    }
+
+  /* If the instruction uses fixed hard registers, check that those
+     registers have the same value throughout the function.  If the
+     instruction uses non-fixed hard registers, check that we can
+     replace them with pseudos.  */
+  FOR_EACH_INSN_USE (ref, insn)
+    {
+      unsigned int use_regno = DF_REF_REGNO (ref);
+      if (HARD_REGISTER_NUM_P (use_regno) && fixed_regs[use_regno])
+	{
+	  if (rtx_unstable_p (DF_REF_REAL_REG (ref)))
+	    {
+	      if (dump_file)
+		fprintf (dump_file, FAILURE_FORMAT "insn uses fixed hard reg"
+			 " %d\n", FAILURE_ARGS, use_regno);
+	      return false;
+	    }
+	}
+      else if (HARD_REGISTER_NUM_P (use_regno))
+	{
+	  /* Allocate a dummy pseudo register and temporarily install it.
+	     Make the register number depend on the mode, which should
+	     provide enough sharing for match_dup while also weeding
+	     out cases in which operands with different modes are
+	     explicitly tied.  */
+	  rtx *loc = DF_REF_REAL_LOC (ref);
+	  unsigned int size = RTX_CODE_SIZE (REG);
+	  rtx new_reg = (rtx) alloca (size);
+	  memset (new_reg, 0, size);
+	  PUT_CODE (new_reg, REG);
+	  set_mode_and_regno (new_reg, GET_MODE (*loc),
+			      LAST_VIRTUAL_REGISTER + 1 + GET_MODE (*loc));
+	  validate_change (insn, loc, new_reg, 1);
+	}
+    }
+  bool ok_p = verify_changes (0);
+  cancel_changes (0);
+  if (!ok_p)
+    {
+      if (dump_file)
+	fprintf (dump_file, FAILURE_FORMAT "insn does not allow hard"
+		 " register inputs to be replaced\n", FAILURE_ARGS);
+      return false;
+    }
+
+#undef FAILURE_ARGS
+#undef FAILURE_FORMAT
+
+  add_candidate (insn, regno, true);
+  return true;
+}
+
+/* Calculate the set of rematerialization candidates.  Return true if
+   we find at least one.  */
+
+bool
+early_remat::collect_candidates (void)
+{
+  unsigned int nregs = DF_REG_SIZE (df);
+  for (unsigned int regno = FIRST_PSEUDO_REGISTER; regno < nregs; ++regno)
+    if (interesting_regno_p (regno))
+      {
+	/* Create candidates for all suitable definitions.  */
+	bitmap_clear (&m_tmp_bitmap);
+	unsigned int bad = 0;
+	unsigned int id = 0;
+	for (df_ref ref = DF_REG_DEF_CHAIN (regno); ref;
+	     ref = DF_REF_NEXT_REG (ref))
+	  {
+	    rtx_insn *insn = DF_REF_INSN (ref);
+	    if (maybe_add_candidate (insn, regno))
+	      bitmap_set_bit (&m_tmp_bitmap, id);
+	    else
+	      bad += 1;
+	    id += 1;
+	  }
+
+	/* If we found at least one suitable definition, add dummy
+	   candidates for the rest, so that we can see which definitions
+	   are live where.  */
+	if (!bitmap_empty_p (&m_tmp_bitmap) && bad)
+	  {
+	    id = 0;
+	    for (df_ref ref = DF_REG_DEF_CHAIN (regno); ref;
+		 ref = DF_REF_NEXT_REG (ref))
+	      {
+		if (!bitmap_bit_p (&m_tmp_bitmap, id))
+		  add_candidate (DF_REF_INSN (ref), regno, false);
+		id += 1;
+	      }
+	  }
+      }
+
+
+  return !m_candidates.is_empty ();
+}
+
+/* Initialize the m_block_info array.  */
+
+void
+early_remat::init_block_info (void)
+{
+  unsigned int n_blocks = last_basic_block_for_fn (m_fn);
+  m_block_info.safe_grow_cleared (n_blocks);
+}
+
+/* Maps basic block indices to their position in the post order.  */
+static unsigned int *postorder_index;
+
+/* Order remat_candidates X_IN and Y_IN according to the cfg postorder.  */
+
+static int
+compare_candidates (const void *x_in, const void *y_in)
+{
+  const remat_candidate *x = (const remat_candidate *) x_in;
+  const remat_candidate *y = (const remat_candidate *) y_in;
+  basic_block x_bb = BLOCK_FOR_INSN (x->insn);
+  basic_block y_bb = BLOCK_FOR_INSN (y->insn);
+  if (x_bb != y_bb)
+    /* Make X and Y follow block postorder.  */
+    return postorder_index[x_bb->index] - postorder_index[y_bb->index];
+
+  /* Make X and Y follow a backward traversal of the containing block.  */
+  return DF_INSN_LUID (y->insn) - DF_INSN_LUID (x->insn);
+}
+
+/* Sort the collected rematerialization candidates so that they follow
+   cfg postorder.  */
+
+void
+early_remat::sort_candidates (void)
+{
+  /* Make sure the DF LUIDs are up-to-date for all the blocks we
+     care about.  */
+  bitmap_clear (&m_tmp_bitmap);
+  unsigned int cand_index;
+  remat_candidate *cand;
+  FOR_EACH_VEC_ELT (m_candidates, cand_index, cand)
+    {
+      basic_block bb = BLOCK_FOR_INSN (cand->insn);
+      if (bitmap_set_bit (&m_tmp_bitmap, bb->index))
+	df_recompute_luids (bb);
+    }
+
+  /* Create a mapping from block numbers to their position in the
+     postorder.  */
+  unsigned int n_blocks = last_basic_block_for_fn (m_fn);
+  int *postorder = df_get_postorder (DF_BACKWARD);
+  unsigned int postorder_len = df_get_n_blocks (DF_BACKWARD);
+  postorder_index = new unsigned int[n_blocks];
+  for (unsigned int i = 0; i < postorder_len; ++i)
+    postorder_index[postorder[i]] = i;
+
+  m_candidates.qsort (compare_candidates);
+
+  delete postorder_index;
+}
+
+/* Commit to the current candidate indices and initialize cross-references.  */
+
+void
+early_remat::finalize_candidate_indices (void)
+{
+  /* Create a bitmap for each candidate register.  */
+  m_regno_to_candidates.safe_grow (max_reg_num ());
+  unsigned int regno;
+  bitmap_iterator bi;
+  EXECUTE_IF_SET_IN_BITMAP (&m_candidate_regnos, 0, regno, bi)
+    m_regno_to_candidates[regno] = alloc_bitmap ();
+
+  /* Go through each candidate and record its index.  */
+  unsigned int cand_index;
+  remat_candidate *cand;
+  FOR_EACH_VEC_ELT (m_candidates, cand_index, cand)
+    {
+      basic_block bb = BLOCK_FOR_INSN (cand->insn);
+      remat_block_info *info = &m_block_info[bb->index];
+      info->num_candidates += 1;
+      info->first_candidate = cand_index;
+      bitmap_set_bit (m_regno_to_candidates[cand->regno], cand_index);
+    }
+}
+
+/* Record that candidates CAND1_INDEX and CAND2_INDEX are equivalent.
+   CAND1_INDEX might already have an equivalence class, but CAND2_INDEX
+   doesn't.  */
+
+void
+early_remat::record_equiv_candidates (unsigned int cand1_index,
+				      unsigned int cand2_index)
+{
+  if (dump_file)
+    fprintf (dump_file, ";; Candidate %d is equivalent to candidate %d\n",
+	     cand2_index, cand1_index);
+
+  remat_candidate *cand1 = &m_candidates[cand1_index];
+  remat_candidate *cand2 = &m_candidates[cand2_index];
+  gcc_checking_assert (!cand2->equiv_class);
+
+  remat_equiv_class *ec = cand1->equiv_class;
+  if (!ec)
+    {
+      ec = XOBNEW (&m_obstack.obstack, remat_equiv_class);
+      ec->members = alloc_bitmap ();
+      bitmap_set_bit (ec->members, cand1_index);
+      ec->earliest = cand1_index;
+      ec->representative = cand1_index;
+      cand1->equiv_class = ec;
+    }
+  cand1 = &m_candidates[ec->representative];
+  cand2->equiv_class = ec;
+  bitmap_set_bit (ec->members, cand2_index);
+  if (cand2_index > ec->representative)
+    ec->representative = cand2_index;
+}
+
+/* Propagate information from the rd_out set of E->src to the rd_in set
+   of E->dest, when computing global reaching definitions.  Return true
+   if something changed.  */
+
+bool
+early_remat::rd_confluence_n (edge e)
+{
+  remat_block_info *src = &er->m_block_info[e->src->index];
+  remat_block_info *dest = &er->m_block_info[e->dest->index];
+
+  /* available_in temporarily contains the set of candidates whose
+     registers are live on entry.  */
+  if (empty_p (src->rd_out) || empty_p (dest->available_in))
+    return false;
+
+  return bitmap_ior_and_into (er->get_bitmap (&dest->rd_in),
+			      src->rd_out, dest->available_in);
+}
+
+/* Propagate information from the rd_in set of block BB_INDEX to rd_out.
+   Return true if something changed.  */
+
+bool
+early_remat::rd_transfer (int bb_index)
+{
+  remat_block_info *info = &er->m_block_info[bb_index];
+
+  if (empty_p (info->rd_in))
+    return false;
+
+  if (empty_p (info->rd_kill))
+    {
+      gcc_checking_assert (empty_p (info->rd_gen));
+      if (!info->rd_out)
+	info->rd_out = info->rd_in;
+      else
+	gcc_checking_assert (info->rd_out == info->rd_in);
+      /* Assume that we only get called if something changed.  */
+      return true;
+    }
+
+  if (empty_p (info->rd_gen))
+    return bitmap_and_compl (er->get_bitmap (&info->rd_out),
+			     info->rd_in, info->rd_kill);
+
+  return bitmap_ior_and_compl (er->get_bitmap (&info->rd_out), info->rd_gen,
+			       info->rd_in, info->rd_kill);
+}
+
+/* Calculate the rd_* sets for each block.  */
+
+void
+early_remat::compute_rd (void)
+{
+  /* First calculate the rd_kill and rd_gen sets, using the fact
+     that m_candidates is sorted in order of decreasing LUID.  */
+  unsigned int cand_index;
+  remat_candidate *cand;
+  FOR_EACH_VEC_ELT_REVERSE (m_candidates, cand_index, cand)
+    {
+      rtx_insn *insn = cand->insn;
+      basic_block bb = BLOCK_FOR_INSN (insn);
+      remat_block_info *info = &m_block_info[bb->index];
+      bitmap kill = m_regno_to_candidates[cand->regno];
+      bitmap_ior_into (get_bitmap (&info->rd_kill), kill);
+      if (bitmap_bit_p (DF_LR_OUT (bb), cand->regno))
+	{
+	  bitmap_and_compl_into (get_bitmap (&info->rd_gen), kill);
+	  bitmap_set_bit (info->rd_gen, cand_index);
+	}
+    }
+
+  /* Set up the initial values of the other sets.  */
+  basic_block bb;
+  FOR_EACH_BB_FN (bb, m_fn)
+    {
+      remat_block_info *info = &m_block_info[bb->index];
+      unsigned int regno;
+      bitmap_iterator bi;
+      EXECUTE_IF_AND_IN_BITMAP (DF_LR_IN (bb), &m_candidate_regnos,
+				0, regno, bi)
+	{
+	  /* Use available_in to record the set of candidates whose
+	     registers are live on entry (i.e. a maximum bound on rd_in).  */
+	  bitmap_ior_into (get_bitmap (&info->available_in),
+			   m_regno_to_candidates[regno]);
+
+	  /* Add registers that die in a block to the block's kill set,
+	     so that we don't needlessly propagate them through the rest
+	     of the function.  */
+	  if (!bitmap_bit_p (DF_LR_OUT (bb), regno))
+	    bitmap_ior_into (get_bitmap (&info->rd_kill),
+			     m_regno_to_candidates[regno]);
+	}
+
+      /* Initialize each block's rd_out to the minimal set (the set of
+	 local definitions).  */
+      if (!empty_p (info->rd_gen))
+	bitmap_copy (get_bitmap (&info->rd_out), info->rd_gen);
+    }
+
+  /* Iterate until we reach a fixed point.  */
+  er = this;
+  bitmap_clear (&m_tmp_bitmap);
+  bitmap_set_range (&m_tmp_bitmap, 0, last_basic_block_for_fn (m_fn));
+  df_simple_dataflow (DF_FORWARD, NULL, NULL, rd_confluence_n, rd_transfer,
+		      &m_tmp_bitmap, df_get_postorder (DF_FORWARD),
+		      df_get_n_blocks (DF_FORWARD));
+  er = 0;
+
+  /* Work out which definitions reach which candidates, again taking
+     advantage of the candidate order.  */
+  bitmap_head reaching;
+  bitmap_initialize (&reaching, &m_obstack);
+  basic_block old_bb = NULL;
+  FOR_EACH_VEC_ELT_REVERSE (m_candidates, cand_index, cand)
+    {
+      bb = BLOCK_FOR_INSN (cand->insn);
+      if (bb != old_bb)
+	{
+	  /* Get the definitions that reach the start of the new block.  */
+	  remat_block_info *info = &m_block_info[bb->index];
+	  if (info->rd_in)
+	    bitmap_copy (&reaching, info->rd_in);
+	  else
+	    bitmap_clear (&reaching);
+	  old_bb = bb;
+	}
+      else
+	{
+	  /* Process the definitions of the previous instruction.  */
+	  bitmap kill = m_regno_to_candidates[cand[1].regno];
+	  bitmap_and_compl_into (&reaching, kill);
+	  bitmap_set_bit (&reaching, cand_index + 1);
+	}
+
+      if (cand->can_copy_p && !cand->constant_p)
+	{
+	  df_ref ref;
+	  FOR_EACH_INSN_USE (ref, cand->insn)
+	    {
+	      unsigned int regno = DF_REF_REGNO (ref);
+	      if (bitmap_bit_p (&m_candidate_regnos, regno))
+		{
+		  bitmap defs = m_regno_to_candidates[regno];
+		  bitmap_and (&m_tmp_bitmap, defs, &reaching);
+		  bitmap_ior_into (get_bitmap (&cand->uses), &m_tmp_bitmap);
+		}
+	    }
+	}
+    }
+  bitmap_clear (&reaching);
+}
+
+/* If CAND_INDEX is in an equivalence class, return the representative
+   of the class, otherwise return CAND_INDEX.  */
+
+inline unsigned int
+early_remat::canon_candidate (unsigned int cand_index)
+{
+  if (remat_equiv_class *ec = m_candidates[cand_index].equiv_class)
+    return ec->representative;
+  return cand_index;
+}
+
+/* Make candidate set *PTR refer to candidates using the representative
+   of each equivalence class.  */
+
+void
+early_remat::canon_bitmap (bitmap *ptr)
+{
+  bitmap old_set = *ptr;
+  if (empty_p (old_set))
+    return;
+
+  bitmap new_set = NULL;
+  unsigned int old_index;
+  bitmap_iterator bi;
+  EXECUTE_IF_SET_IN_BITMAP (old_set, 0, old_index, bi)
+    {
+      unsigned int new_index = canon_candidate (old_index);
+      if (old_index != new_index)
+	{
+	  if (!new_set)
+	    {
+	      new_set = alloc_bitmap ();
+	      bitmap_copy (new_set, old_set);
+	    }
+	  bitmap_clear_bit (new_set, old_index);
+	  bitmap_set_bit (new_set, new_index);
+	}
+    }
+  if (new_set)
+    {
+      BITMAP_FREE (*ptr);
+      *ptr = new_set;
+    }
+}
+
+/* If the candidates in REACHING all have the same value, return the
+   earliest instance of that value (i.e. the first one to be added
+   to m_value_table), otherwise return MULTIPLE_CANDIDATES.  */
+
+unsigned int
+early_remat::resolve_reaching_def (bitmap reaching)
+{
+  unsigned int cand_index = bitmap_first_set_bit (reaching);
+  if (remat_equiv_class *ec = m_candidates[cand_index].equiv_class)
+    {
+      if (!bitmap_intersect_compl_p (reaching, ec->members))
+	return ec->earliest;
+    }
+  else if (bitmap_single_bit_set_p (reaching))
+    return cand_index;
+
+  return MULTIPLE_CANDIDATES;
+}
+
+/* Check whether all candidate registers used by candidate CAND_INDEX have
+   unique definitions.  Return true if so, replacing the candidate's uses
+   set with the appropriate form for value numbering.  */
+
+bool
+early_remat::check_candidate_uses (unsigned int cand_index)
+{
+  remat_candidate *cand = &m_candidates[cand_index];
+
+  /* Process the uses for each register in turn.  */
+  bitmap_head uses;
+  bitmap_initialize (&uses, &m_obstack);
+  bitmap_copy (&uses, cand->uses);
+  bitmap uses_ec = alloc_bitmap ();
+  while (!bitmap_empty_p (&uses))
+    {
+      /* Get the register for the lowest-indexed candidate remaining,
+	 and the reaching definitions of that register.  */
+      unsigned int first = bitmap_first_set_bit (&uses);
+      unsigned int regno = m_candidates[first].regno;
+      bitmap_and (&m_tmp_bitmap, &uses, m_regno_to_candidates[regno]);
+
+      /* See whether all reaching definitions have the same value and if
+	 so get the index of the first candidate we saw with that value.  */
+      unsigned int def = resolve_reaching_def (&m_tmp_bitmap);
+      if (def == MULTIPLE_CANDIDATES)
+	{
+	  if (dump_file)
+	    fprintf (dump_file, ";; Removing candidate %d because there is"
+		     " more than one reaching definition of reg %d\n",
+		     cand_index, regno);
+	  cand->can_copy_p = false;
+	  break;
+	}
+      bitmap_set_bit (uses_ec, def);
+      bitmap_and_compl_into (&uses, &m_tmp_bitmap);
+    }
+  BITMAP_FREE (cand->uses);
+  cand->uses = uses_ec;
+  return cand->can_copy_p;
+}
+
+/* Calculate the set of hard registers that would be clobbered by
+   rematerializing candidate CAND_INDEX.  At this point the candidate's
+   set of uses is final.  */
+
+void
+early_remat::compute_clobbers (unsigned int cand_index)
+{
+  remat_candidate *cand = &m_candidates[cand_index];
+  if (cand->uses)
+    {
+      unsigned int use_index;
+      bitmap_iterator bi;
+      EXECUTE_IF_SET_IN_BITMAP (cand->uses, 0, use_index, bi)
+	if (bitmap clobbers = m_candidates[use_index].clobbers)
+	  bitmap_ior_into (get_bitmap (&cand->clobbers), clobbers);
+    }
+
+  df_ref ref;
+  FOR_EACH_INSN_DEF (ref, cand->insn)
+    {
+      unsigned int def_regno = DF_REF_REGNO (ref);
+      if (def_regno != cand->regno)
+	bitmap_set_bit (get_bitmap (&cand->clobbers), def_regno);
+    }
+}
+
+/* Mark candidate CAND_INDEX as validated and add it to the value table.  */
+
+void
+early_remat::assign_value_number (unsigned int cand_index)
+{
+  remat_candidate *cand = &m_candidates[cand_index];
+  gcc_checking_assert (cand->can_copy_p && !cand->validated_p);
+
+  compute_clobbers (cand_index);
+  cand->validated_p = true;
+
+  inchash::hash h;
+  h.add_int (cand->regno);
+  inchash::add_rtx (cand->remat_rtx, h);
+  cand->hash = h.end ();
+
+  remat_candidate **slot
+    = m_value_table.find_slot_with_hash (cand, cand->hash, INSERT);
+  if (!*slot)
+    {
+      *slot = cand;
+      if (dump_file)
+	fprintf (dump_file, ";; Candidate %d is not equivalent to"
+		 " others seen so far\n", cand_index);
+    }
+  else
+    record_equiv_candidates (*slot - m_candidates.address (), cand_index);
+}
+
+/* Make a final decision about which candidates are valid and assign
+   value numbers to those that are.  */
+
+void
+early_remat::decide_candidate_validity (void)
+{
+  auto_vec<unsigned int, 16> stack;
+  unsigned int cand1_index;
+  remat_candidate *cand1;
+  FOR_EACH_VEC_ELT_REVERSE (m_candidates, cand1_index, cand1)
+    {
+      if (!cand1->can_copy_p || cand1->validated_p)
+	continue;
+
+      if (empty_p (cand1->uses))
+	{
+	  assign_value_number (cand1_index);
+	  continue;
+	}
+
+      stack.safe_push (cand1_index);
+      while (!stack.is_empty ())
+	{
+	  unsigned int cand2_index = stack.last ();
+	  unsigned int watermark = stack.length ();
+	  remat_candidate *cand2 = &m_candidates[cand2_index];
+	  if (!cand2->can_copy_p || cand2->validated_p)
+	    {
+	      stack.pop ();
+	      continue;
+	    }
+	  cand2->visited_p = true;
+	  unsigned int cand3_index;
+	  bitmap_iterator bi;
+	  EXECUTE_IF_SET_IN_BITMAP (cand2->uses, 0, cand3_index, bi)
+	    {
+	      remat_candidate *cand3 = &m_candidates[cand3_index];
+	      if (!cand3->can_copy_p)
+		{
+		  if (dump_file)
+		    fprintf (dump_file, ";; Removing candidate %d because"
+			     " it uses removed candidate %d\n", cand2_index,
+			     cand3_index);
+		  cand2->can_copy_p = false;
+		  break;
+		}
+	      if (!cand3->validated_p)
+		{
+		  if (empty_p (cand3->uses))
+		    assign_value_number (cand3_index);
+		  else if (cand3->visited_p)
+		    {
+		      if (dump_file)
+			fprintf (dump_file, ";; Removing candidate %d"
+				 " because its definition is cyclic\n",
+				 cand2_index);
+		      cand2->can_copy_p = false;
+		      break;
+		    }
+		  else
+		    stack.safe_push (cand3_index);
+		}
+	    }
+	  if (!cand2->can_copy_p)
+	    {
+	      cand2->visited_p = false;
+	      stack.truncate (watermark - 1);
+	    }
+	  else if (watermark == stack.length ())
+	    {
+	      cand2->visited_p = false;
+	      if (check_candidate_uses (cand2_index))
+		assign_value_number (cand2_index);
+	      stack.pop ();
+	    }
+	}
+    }
+
+  /* Ensure that the candidates always use the same candidate index
+     to refer to an equivalence class.  */
+  FOR_EACH_VEC_ELT_REVERSE (m_candidates, cand1_index, cand1)
+    if (cand1->can_copy_p && !empty_p (cand1->uses))
+      {
+	canon_bitmap (&cand1->uses);
+	gcc_checking_assert (bitmap_first_set_bit (cand1->uses) > cand1_index);
+      }
+}
+
+/* Assuming that every path reaching a point P contains a copy of a
+   use U of REGNO, return true if another copy of U at P would have
+   access to the same value of REGNO.  */
+
+bool
+early_remat::stable_use_p (unsigned int regno)
+{
+  /* Conservatively assume not for hard registers.  */
+  if (HARD_REGISTER_NUM_P (regno))
+    return false;
+
+  /* See if REGNO has a single definition and is never used uninitialized.
+     In this case the definition of REGNO dominates the common dominator
+     of the uses U, which in turn dominates P.  */
+  if (DF_REG_DEF_COUNT (regno) == 1
+      && !bitmap_bit_p (DF_LR_OUT (ENTRY_BLOCK_PTR_FOR_FN (m_fn)), regno))
+    return true;
+
+  return false;
+}
+
+/* Emit a copy from register DEST to register SRC before candidate
+   CAND_INDEX's instruction.  */
+
+void
+early_remat::emit_copy_before (unsigned int cand_index, rtx dest, rtx src)
+{
+  remat_candidate *cand = &m_candidates[cand_index];
+  if (dump_file)
+    {
+      fprintf (dump_file, ";; Stabilizing insn ");
+      dump_insn_id (cand->insn);
+      fprintf (dump_file, " by copying source reg %d:%s to temporary reg %d\n",
+	       REGNO (src), GET_MODE_NAME (GET_MODE (src)), REGNO (dest));
+    }
+  emit_insn_before (gen_move_insn (dest, src), cand->insn);
+}
+
+/* Check whether any inputs to candidate CAND_INDEX's instruction could
+   change at rematerialization points and replace them with new pseudo
+   registers if so.  */
+
+void
+early_remat::stabilize_pattern (unsigned int cand_index)
+{
+  remat_candidate *cand = &m_candidates[cand_index];
+  if (cand->stabilized_p)
+    return;
+
+  remat_equiv_class *ec = cand->equiv_class;
+  gcc_checking_assert (!ec || cand_index == ec->representative);
+
+  /* Record the replacements we've made so far, so that we don't
+     create two separate registers for match_dups.  Lookup is O(n),
+     but the n is very small.  */
+  typedef std::pair<rtx, rtx> reg_pair;
+  auto_vec<reg_pair, 16> reg_map;
+
+  rtx_insn *insn = cand->insn;
+  df_ref ref;
+  FOR_EACH_INSN_USE (ref, insn)
+    {
+      unsigned int old_regno = DF_REF_REGNO (ref);
+      rtx *loc = DF_REF_REAL_LOC (ref);
+
+      if (HARD_REGISTER_NUM_P (old_regno) && fixed_regs[old_regno])
+	{
+	  /* We checked when adding the candidate that the value is stable.  */
+	  gcc_checking_assert (!rtx_unstable_p (*loc));
+	  continue;
+	}
+
+      if (bitmap_bit_p (&m_candidate_regnos, old_regno))
+	/* We already know which candidate provides the definition
+	   and will handle it during copying.  */
+	continue;
+
+      if (stable_use_p (old_regno))
+	/* We can continue to use the existing register.  */
+	continue;
+
+      /* We need to replace the register.  See whether we've already
+	 created a suitable copy.  */
+      rtx old_reg = *loc;
+      rtx new_reg = NULL_RTX;
+      machine_mode mode = GET_MODE (old_reg);
+      reg_pair *p;
+      unsigned int pi;
+      FOR_EACH_VEC_ELT (reg_map, pi, p)
+	if (REGNO (p->first) == old_regno
+	    && GET_MODE (p->first) == mode)
+	  {
+	    new_reg = p->second;
+	    break;
+	  }
+
+      if (!new_reg)
+	{
+	  /* Create a new register and initialize it just before
+	     the instruction.  */
+	  new_reg = gen_reg_rtx (mode);
+	  reg_map.safe_push (reg_pair (old_reg, new_reg));
+	  if (ec)
+	    {
+	      unsigned int member_index;
+	      bitmap_iterator bi;
+	      EXECUTE_IF_SET_IN_BITMAP (ec->members, 0, member_index, bi)
+		emit_copy_before (member_index, new_reg, old_reg);
+	    }
+	  else
+	    emit_copy_before (cand_index, new_reg, old_reg);
+	}
+      validate_change (insn, loc, new_reg, true);
+    }
+  if (num_changes_pending ())
+    {
+      if (!apply_change_group ())
+	/* We checked when adding the candidates that the pattern allows
+	   hard registers to be replaced.  Nothing else should make the
+	   changes invalid.  */
+	gcc_unreachable ();
+
+      if (ec)
+	{
+	  /* Copy the new pattern to other members of the equivalence
+	     class.  */
+	  unsigned int member_index;
+	  bitmap_iterator bi;
+	  EXECUTE_IF_SET_IN_BITMAP (ec->members, 0, member_index, bi)
+	    if (cand_index != member_index)
+	      {
+		rtx_insn *other_insn = m_candidates[member_index].insn;
+		if (!validate_change (other_insn, &PATTERN (other_insn),
+				      copy_insn (PATTERN (insn)), 0))
+		  /* If the original instruction was valid then the copy
+		     should be too.  */
+		  gcc_unreachable ();
+	      }
+	}
+    }
+
+  cand->stabilized_p = true;
+}
+
+/* Change CAND's instruction so that it sets CAND->copy_regno instead
+   of CAND->regno.  */
+
+void
+early_remat::replace_dest_with_copy (unsigned int cand_index)
+{
+  remat_candidate *cand = &m_candidates[cand_index];
+  df_ref def;
+  FOR_EACH_INSN_DEF (def, cand->insn)
+    if (DF_REF_REGNO (def) == cand->regno)
+      validate_change (cand->insn, DF_REF_REAL_LOC (def),
+		       regno_reg_rtx[cand->copy_regno], 1);
+}
+
+/* Make sure that the candidates used by candidate CAND_INDEX are available.
+   There are two ways of doing this for an input candidate I:
+
+   (1) Using the existing register number and ensuring that I is available.
+
+   (2) Using a new register number (recorded in copy_regno) and adding I
+       to VIA_COPY.  This guarantees that making I available does not
+       conflict with other uses of the original register.
+
+   REQUIRED is the set of candidates that are required but not available
+   before the copy of CAND_INDEX.  AVAILABLE is the set of candidates
+   that are already available before the copy of CAND_INDEX.  REACHING
+   is the set of candidates that reach the copy of CAND_INDEX.  VIA_COPY
+   is the set of candidates that will use new register numbers recorded
+   in copy_regno instead of the original ones.  */
+
+void
+early_remat::stabilize_candidate_uses (unsigned int cand_index,
+				       bitmap required, bitmap available,
+				       bitmap reaching, bitmap via_copy)
+{
+  remat_candidate *cand = &m_candidates[cand_index];
+  df_ref use;
+  FOR_EACH_INSN_USE (use, cand->insn)
+    {
+      unsigned int regno = DF_REF_REGNO (use);
+      if (!bitmap_bit_p (&m_candidate_regnos, regno))
+	continue;
+
+      /* Work out which candidate provides the definition.  */
+      bitmap defs = m_regno_to_candidates[regno];
+      bitmap_and (&m_tmp_bitmap, cand->uses, defs);
+      gcc_checking_assert (bitmap_single_bit_set_p (&m_tmp_bitmap));
+      unsigned int def_index = bitmap_first_set_bit (&m_tmp_bitmap);
+
+      /* First see if DEF_INDEX is the only reaching definition of REGNO
+	 at this point too and if it is or will become available.  We can
+	 continue to use REGNO if so.  */
+      bitmap_and (&m_tmp_bitmap, reaching, defs);
+      if (bitmap_single_bit_set_p (&m_tmp_bitmap)
+	  && bitmap_first_set_bit (&m_tmp_bitmap) == def_index
+	  && ((available && bitmap_bit_p (available, def_index))
+	      || bitmap_bit_p (required, def_index)))
+	{
+	  if (dump_file)
+	    fprintf (dump_file, ";; Keeping reg %d for use of candidate %d"
+		     " in candidate %d\n", regno, def_index, cand_index);
+	  continue;
+	}
+
+      /* Otherwise fall back to using a copy.  There are other cases
+	 in which we *could* continue to use REGNO, but there's not
+	 really much point.  Using a separate register ought to make
+	 things easier for the register allocator.  */
+      remat_candidate *def_cand = &m_candidates[def_index];
+      rtx *loc = DF_REF_REAL_LOC (use);
+      rtx new_reg;
+      if (bitmap_set_bit (via_copy, def_index))
+	{
+	  new_reg = gen_reg_rtx (GET_MODE (*loc));
+	  def_cand->copy_regno = REGNO (new_reg);
+	  if (dump_file)
+	    fprintf (dump_file, ";; Creating reg %d for use of candidate %d"
+		     " in candidate %d\n", REGNO (new_reg), def_index,
+		     cand_index);
+	}
+      else
+	new_reg = regno_reg_rtx[def_cand->copy_regno];
+      validate_change (cand->insn, loc, new_reg, 1);
+    }
+}
+
+/* Rematerialize the candidates in REQUIRED after instruction INSN,
+   given that the candidates in AVAILABLE are already available
+   and that REACHING is the set of candidates live after INSN.
+   REQUIRED and AVAILABLE are disjoint on entry.
+
+   Clear REQUIRED on exit.  */
+
+void
+early_remat::emit_remat_insns (bitmap required, bitmap available,
+			       bitmap reaching, rtx_insn *insn)
+{
+  /* Quick exit if there's nothing to do.  */
+  if (empty_p (required))
+    return;
+
+  /* Only reaching definitions should be available or required.  */
+  gcc_checking_assert (!bitmap_intersect_compl_p (required, reaching));
+  if (available)
+    gcc_checking_assert (!bitmap_intersect_compl_p (available, reaching));
+
+  bitmap_head via_copy;
+  bitmap_initialize (&via_copy, &m_obstack);
+  while (!bitmap_empty_p (required) || !bitmap_empty_p (&via_copy))
+    {
+      /* Pick the lowest-indexed candidate left.  */
+      unsigned int required_index = (bitmap_empty_p (required)
+				     ? ~0U : bitmap_first_set_bit (required));
+      unsigned int via_copy_index = (bitmap_empty_p (&via_copy)
+				     ? ~0U : bitmap_first_set_bit (&via_copy));
+      unsigned int cand_index = MIN (required_index, via_copy_index);
+      remat_candidate *cand = &m_candidates[cand_index];
+
+      bool via_copy_p = (cand_index == via_copy_index);
+      if (via_copy_p)
+	bitmap_clear_bit (&via_copy, cand_index);
+      else
+	{
+	  /* Remove all candidates for the same register from REQUIRED.  */
+	  bitmap_and (&m_tmp_bitmap, reaching,
+		      m_regno_to_candidates[cand->regno]);
+	  bitmap_and_compl_into (required, &m_tmp_bitmap);
+	  gcc_checking_assert (!bitmap_bit_p (required, cand_index));
+
+	  /* Only rematerialize if we have a single reaching definition
+	     of the register.  */
+	  if (!bitmap_single_bit_set_p (&m_tmp_bitmap))
+	    {
+	      if (dump_file)
+		{
+		  fprintf (dump_file, ";; Can't rematerialize reg %d after ",
+			   cand->regno);
+		  dump_insn_id (insn);
+		  fprintf (dump_file, ": more than one reaching definition\n");
+		}
+	      continue;
+	    }
+
+	  /* Skip candidates that can't be rematerialized.  */
+	  if (!cand->can_copy_p)
+	    continue;
+
+	  /* Check the function precondition.  */
+	  gcc_checking_assert (!available
+			       || !bitmap_bit_p (available, cand_index));
+	}
+
+      /* Invalid candidates should have been weeded out by now.  */
+      gcc_assert (cand->can_copy_p);
+
+      rtx new_pattern;
+      if (cand->constant_p)
+	{
+	  /* Emit a simple move.  */
+	  unsigned int regno = via_copy_p ? cand->copy_regno : cand->regno;
+	  new_pattern = gen_move_insn (regno_reg_rtx[regno], cand->remat_rtx);
+	}
+      else
+	{
+	  /* If this is the first time we've copied the instruction, make
+	     sure that any inputs will have the same value after INSN.  */
+	  stabilize_pattern (cand_index);
+
+	  /* Temporarily adjust the original instruction so that it has
+	     the right form for the copy.  */
+	  if (via_copy_p)
+	    replace_dest_with_copy (cand_index);
+	  if (cand->uses)
+	    stabilize_candidate_uses (cand_index, required, available,
+				      reaching, &via_copy);
+
+	  /* Get the new instruction pattern.  */
+	  new_pattern = copy_insn (cand->remat_rtx);
+
+	  /* Undo the temporary changes.  */
+	  cancel_changes (0);
+	}
+
+      /* Emit the new instruction.  */
+      rtx_insn *new_insn = emit_insn_after (new_pattern, insn);
+
+      if (dump_file)
+	{
+	  fprintf (dump_file, ";; Rematerializing candidate %d after ",
+		   cand_index);
+	  dump_insn_id (insn);
+	  if (via_copy_p)
+	    fprintf (dump_file, " with new destination reg %d",
+		     cand->copy_regno);
+	  fprintf (dump_file, ":\n\n");
+	  print_rtl_single (dump_file, new_insn);
+	  fprintf (dump_file, "\n");
+	}
+    }
+}
+
+/* Recompute INFO's available_out set, given that it's distinct from
+   available_in and available_locally.  */
+
+bool
+early_remat::set_available_out (remat_block_info *info)
+{
+  if (empty_p (info->available_locally))
+    return bitmap_and_compl (get_bitmap (&info->available_out),
+			     info->available_in, info->rd_kill);
+
+  if (empty_p (info->rd_kill))
+    return bitmap_ior (get_bitmap (&info->available_out),
+		       info->available_locally, info->available_in);
+
+  return bitmap_ior_and_compl (get_bitmap (&info->available_out),
+			       info->available_locally, info->available_in,
+			       info->rd_kill);
+}
+
+/* If BB has more than one call, decide which candidates should be
+   rematerialized after the non-final calls and emit the associated
+   instructions.  Record other information about the block in preparation
+   for the global phase.  */
+
+void
+early_remat::process_block (basic_block bb)
+{
+  remat_block_info *info = &m_block_info[bb->index];
+  rtx_insn *last_call = NULL;
+  rtx_insn *insn;
+
+  /* Ensure that we always use the same candidate index to refer to an
+     equivalence class.  */
+  if (info->rd_out == info->rd_in)
+    {
+      canon_bitmap (&info->rd_in);
+      info->rd_out = info->rd_in;
+    }
+  else
+    {
+      canon_bitmap (&info->rd_in);
+      canon_bitmap (&info->rd_out);
+    }
+  canon_bitmap (&info->rd_kill);
+  canon_bitmap (&info->rd_gen);
+
+  /* The set of candidates that should be rematerialized on entry to the
+     block or after the previous call (whichever is more recent).  */
+  init_temp_bitmap (&m_required);
+
+  /* The set of candidates that reach the current instruction (i.e. are
+     live just before the instruction).  */
+  bitmap_head reaching;
+  bitmap_initialize (&reaching, &m_obstack);
+  if (info->rd_in)
+    bitmap_copy (&reaching, info->rd_in);
+
+  /* The set of candidates that are live and available without
+     rematerialization just before the current instruction.  This only
+     accounts for earlier candidates in the block, or those that become
+     available by being added to M_REQUIRED.  */
+  init_temp_bitmap (&m_available);
+
+  /* Get the range of candidates in the block.  */
+  unsigned int next_candidate = info->first_candidate;
+  unsigned int num_candidates = info->num_candidates;
+  remat_candidate *next_def = (num_candidates > 0
+			       ? &m_candidates[next_candidate]
+			       : NULL);
+
+  FOR_BB_INSNS (bb, insn)
+    {
+      if (!NONDEBUG_INSN_P (insn))
+	continue;
+
+      /* First process uses, since this is a forward walk.  */
+      df_ref ref;
+      FOR_EACH_INSN_USE (ref, insn)
+	{
+	  unsigned int regno = DF_REF_REGNO (ref);
+	  if (bitmap_bit_p (&m_candidate_regnos, regno))
+	    {
+	      bitmap defs = m_regno_to_candidates[regno];
+	      bitmap_and (&m_tmp_bitmap, defs, &reaching);
+	      gcc_checking_assert (!bitmap_empty_p (&m_tmp_bitmap));
+	      if (!bitmap_intersect_p (defs, m_available))
+		{
+		  /* There has been no definition of the register since
+		     the last call or the start of the block (whichever
+		     is most recent).  Mark the reaching definitions
+		     as required at that point and thus available here.  */
+		  bitmap_ior_into (m_required, &m_tmp_bitmap);
+		  bitmap_ior_into (m_available, &m_tmp_bitmap);
+		}
+	    }
+	}
+
+      if (CALL_P (insn))
+	{
+	  if (!last_call)
+	    {
+	      /* The first call in the block.  Record which candidates are
+		 required at the start of the block.  */
+	      copy_temp_bitmap (&info->required_in, &m_required);
+	      init_temp_bitmap (&m_required);
+	    }
+	  else
+	    /* The fully-local case: candidates that need to be
+	       rematerialized after a previous call in the block.  */
+	    emit_remat_insns (m_required, NULL, info->rd_after_call,
+			      last_call);
+	  last_call = insn;
+	  bitmap_clear (m_available);
+	  gcc_checking_assert (empty_p (m_required));
+	}
+
+      /* Now process definitions.  */
+      if (next_def && insn == next_def->insn)
+	{
+	  unsigned int gen = canon_candidate (next_candidate);
+
+	  /* Other candidates with the same regno are not available
+	     any more.  */
+	  bitmap kill = m_regno_to_candidates[next_def->regno];
+	  bitmap_and_compl_into (m_available, kill);
+	  bitmap_and_compl_into (&reaching, kill);
+
+	  /* Record that this candidate is available without
+	     rematerialization.  */
+	  bitmap_set_bit (m_available, gen);
+	  bitmap_set_bit (&reaching, gen);
+
+	  /* Find the next candidate in the block.  */
+	  num_candidates -= 1;
+	  next_candidate -= 1;
+	  if (num_candidates > 0)
+	    next_def -= 1;
+	  else
+	    next_def = NULL;
+	}
+
+      if (insn == last_call)
+	bitmap_copy (get_bitmap (&info->rd_after_call), &reaching);
+    }
+  bitmap_clear (&reaching);
+  gcc_checking_assert (num_candidates == 0);
+
+  /* Remove values from the available set if they aren't live (and so
+     aren't interesting to successor blocks).  */
+  if (info->rd_out)
+    bitmap_and_into (m_available, info->rd_out);
+
+  /* Record the accumulated information.  */
+  info->last_call = last_call;
+  info->abnormal_call_p = (last_call
+			   && last_call == BB_END (bb)
+			   && has_abnormal_or_eh_outgoing_edge_p (bb));
+  copy_temp_bitmap (&info->available_locally, &m_available);
+  if (last_call)
+    copy_temp_bitmap (&info->required_after_call, &m_required);
+  else
+    copy_temp_bitmap (&info->required_in, &m_required);
+
+  /* Assume at first that all live-in values are available without
+     rematerialization (i.e. start with the most optimistic assumption).  */
+  if (info->available_in)
+    {
+      if (info->rd_in)
+	bitmap_copy (info->available_in, info->rd_in);
+      else
+	BITMAP_FREE (info->available_in);
+    }
+
+  if (last_call || empty_p (info->available_in))
+    /* The values available on exit from the block are exactly those that
+       are available locally.  This set doesn't change.  */
+    info->available_out = info->available_locally;
+  else if (empty_p (info->available_locally) && empty_p (info->rd_kill))
+    /* The values available on exit are the same as those available on entry.
+       Updating one updates the other.  */
+    info->available_out = info->available_in;
+  else
+    set_available_out (info);
+}
+
+/* Process each block as for process_block, visiting dominators before
+   the blocks they dominate.  */
+
+void
+early_remat::local_phase (void)
+{
+  if (dump_file)
+    fprintf (dump_file, "\n;; Local phase:\n");
+
+  int *postorder = df_get_postorder (DF_BACKWARD);
+  unsigned int postorder_len = df_get_n_blocks (DF_BACKWARD);
+  for (unsigned int i = postorder_len; i-- > 0; )
+    if (postorder[i] >= NUM_FIXED_BLOCKS)
+      process_block (BASIC_BLOCK_FOR_FN (m_fn, postorder[i]));
+}
+
+/* Return true if available values survive across edge E.  */
+
+static inline bool
+available_across_edge_p (edge e)
+{
+  return (e->flags & EDGE_EH) == 0;
+}
+
+/* Propagate information from the available_out set of E->src to the
+   available_in set of E->dest, when computing global availability.
+   Return true if something changed.  */
+
+bool
+early_remat::avail_confluence_n (edge e)
+{
+  remat_block_info *src = &er->m_block_info[e->src->index];
+  remat_block_info *dest = &er->m_block_info[e->dest->index];
+
+  if (!available_across_edge_p (e))
+    return false;
+
+  if (empty_p (dest->available_in))
+    return false;
+
+  if (!src->available_out)
+    {
+      bitmap_clear (dest->available_in);
+      return true;
+    }
+
+  return bitmap_and_into (dest->available_in, src->available_out);
+}
+
+/* Propagate information from the available_in set of block BB_INDEX
+   to available_out.  Return true if something changed.  */
+
+bool
+early_remat::avail_transfer (int bb_index)
+{
+  remat_block_info *info = &er->m_block_info[bb_index];
+
+  if (info->available_out == info->available_locally)
+    return false;
+
+  if (info->available_out == info->available_in)
+    /* Assume that we are only called if the input changed.  */
+    return true;
+
+  return er->set_available_out (info);
+}
+
+/* Compute global availability for the function, starting with the local
+   information computed by local_phase.  */
+
+void
+early_remat::compute_availability (void)
+{
+  /* We use df_simple_dataflow instead of the lcm routines for three reasons:
+
+     (1) it avoids recomputing the traversal order;
+     (2) many of the sets are likely to be sparse, so we don't necessarily
+	 want to use sbitmaps; and
+     (3) it means we can avoid creating an explicit kill set for the call.  */
+  er = this;
+  bitmap_clear (&m_tmp_bitmap);
+  bitmap_set_range (&m_tmp_bitmap, 0, last_basic_block_for_fn (m_fn));
+  df_simple_dataflow (DF_FORWARD, NULL, NULL,
+		      avail_confluence_n, avail_transfer,
+		      &m_tmp_bitmap, df_get_postorder (DF_FORWARD),
+		      df_get_n_blocks (DF_FORWARD));
+  er = 0;
+
+  /* Restrict the required_in sets to values that aren't available.  */
+  basic_block bb;
+  FOR_EACH_BB_FN (bb, m_fn)
+    {
+      remat_block_info *info = &m_block_info[bb->index];
+      if (info->required_in && info->available_in)
+	bitmap_and_compl_into (info->required_in, info->available_in);
+    }
+}
+
+/* Make sure that INFO's available_out and available_in sets are unique.  */
+
+inline void
+early_remat::unshare_available_sets (remat_block_info *info)
+{
+  if (info->available_in && info->available_in == info->available_out)
+    {
+      info->available_in = alloc_bitmap ();
+      bitmap_copy (info->available_in, info->available_out);
+    }
+}
+
+/* Return true if it is possible to move rematerializations from the
+   destination of E to the source of E.  */
+
+inline bool
+early_remat::can_move_across_edge_p (edge e)
+{
+  return (available_across_edge_p (e)
+	  && !m_block_info[e->src->index].abnormal_call_p);
+}
+
+/* Return true if it is cheaper to rematerialize values at the head of
+   block QUERY_BB_INDEX instead of rematerializing in its predecessors.  */
+
+bool
+early_remat::local_remat_cheaper_p (unsigned int query_bb_index)
+{
+  if (m_block_info[query_bb_index].remat_frequency_valid_p)
+    return m_block_info[query_bb_index].local_remat_cheaper_p;
+
+  /* Iteratively compute the cost of rematerializing values in the
+     predecessor blocks, then compare that with the cost of
+     rematerializing at the head of the block.
+
+     A cycle indicates that there is no call on that execution path,
+     so it isn't necessary to rematerialize on that path.  */
+  auto_vec<basic_block, 16> stack;
+  stack.quick_push (BASIC_BLOCK_FOR_FN (m_fn, query_bb_index));
+  while (!stack.is_empty ())
+    {
+      basic_block bb = stack.last ();
+      remat_block_info *info = &m_block_info[bb->index];
+      if (info->remat_frequency_valid_p)
+	{
+	  stack.pop ();
+	  continue;
+	}
+
+      info->visited_p = true;
+      int frequency = 0;
+      bool can_move_p = true;
+      edge e;
+      edge_iterator ei;
+      FOR_EACH_EDGE (e, ei, bb->preds)
+	if (!can_move_across_edge_p (e))
+	  {
+	    can_move_p = false;
+	    break;
+	  }
+	else if (m_block_info[e->src->index].last_call)
+	  /* We'll rematerialize after the call.  */
+	  frequency += e->src->count.to_frequency (m_fn);
+	else if (m_block_info[e->src->index].remat_frequency_valid_p)
+	  /* Add the cost of rematerializing at the head of E->src
+	     or in its predecessors (whichever is cheaper).  */
+	  frequency += m_block_info[e->src->index].remat_frequency;
+	else if (!m_block_info[e->src->index].visited_p)
+	  /* Queue E->src and then revisit this block again.  */
+	  stack.safe_push (e->src);
+
+      /* Come back to this block later if we need to process some of
+	 its predecessors.  */
+      if (stack.last () != bb)
+	continue;
+
+      /* If rematerializing in and before the block have equal cost, prefer
+	 rematerializing in the block.  This should shorten the live range.  */
+      int bb_frequency = bb->count.to_frequency (m_fn);
+      if (!can_move_p || frequency >= bb_frequency)
+	{
+	  info->local_remat_cheaper_p = true;
+	  info->remat_frequency = bb_frequency;
+	}
+      else
+	info->remat_frequency = frequency;
+      info->remat_frequency_valid_p = true;
+      info->visited_p = false;
+      if (dump_file)
+	{
+	  if (!can_move_p)
+	    fprintf (dump_file, ";; Need to rematerialize at the head of"
+		     " block %d; cannot move to predecessors.\n", bb->index);
+	  else
+	    {
+	      fprintf (dump_file, ";; Block %d has frequency %d,"
+		       " rematerializing in predecessors has frequency %d",
+		       bb->index, bb_frequency, frequency);
+	      if (info->local_remat_cheaper_p)
+		fprintf (dump_file, "; prefer to rematerialize"
+			 " in the block\n");
+	      else
+		fprintf (dump_file, "; prefer to rematerialize"
+			 " in predecessors\n");
+	    }
+	}
+      stack.pop ();
+    }
+  return m_block_info[query_bb_index].local_remat_cheaper_p;
+}
+
+/* Return true if we cannot rematerialize candidate CAND_INDEX at the head of
+   block BB_INDEX.  */
+
+bool
+early_remat::need_to_move_candidate_p (unsigned int bb_index,
+				       unsigned int cand_index)
+{
+  remat_block_info *info = &m_block_info[bb_index];
+  remat_candidate *cand = &m_candidates[cand_index];
+  basic_block bb = BASIC_BLOCK_FOR_FN (m_fn, bb_index);
+
+  /* If there is more than one reaching definition of REGNO,
+     we'll need to rematerialize in predecessors instead.  */
+  bitmap_and (&m_tmp_bitmap, info->rd_in, m_regno_to_candidates[cand->regno]);
+  if (!bitmap_single_bit_set_p (&m_tmp_bitmap))
+    {
+      if (dump_file)
+	fprintf (dump_file, ";; Cannot rematerialize %d at the"
+		 " head of block %d because there is more than one"
+		 " reaching definition of reg %d\n", cand_index,
+		 bb_index, cand->regno);
+      return true;
+    }
+
+  /* Likewise if rematerializing CAND here would clobber a live register.  */
+  if (cand->clobbers
+      && bitmap_intersect_p (cand->clobbers, DF_LR_IN (bb)))
+    {
+      if (dump_file)
+	fprintf (dump_file, ";; Cannot rematerialize %d at the"
+		 " head of block %d because it would clobber live"
+		 " registers\n", cand_index, bb_index);
+      return true;
+    }
+
+  return false;
+}
+
+/* Set REQUIRED to the minimum set of candidates that must be rematerialized
+   in predecessors of block BB_INDEX instead of at the start of the block.  */
+
+void
+early_remat::compute_minimum_move_set (unsigned int bb_index,
+				       bitmap required)
+{
+  remat_block_info *info = &m_block_info[bb_index];
+  bitmap_head remaining;
+
+  bitmap_clear (required);
+  bitmap_initialize (&remaining, &m_obstack);
+  bitmap_copy (&remaining, info->required_in);
+  while (!bitmap_empty_p (&remaining))
+    {
+      unsigned int cand_index = bitmap_first_set_bit (&remaining);
+      remat_candidate *cand = &m_candidates[cand_index];
+      bitmap_clear_bit (&remaining, cand_index);
+
+      /* Leave invalid candidates where they are.  */
+      if (!cand->can_copy_p)
+	continue;
+
+      /* Decide whether to move this candidate.  */
+      if (!bitmap_bit_p (required, cand_index))
+	{
+	  if (!need_to_move_candidate_p (bb_index, cand_index))
+	    continue;
+	  bitmap_set_bit (required, cand_index);
+	}
+
+      /* Also move values used by the candidate, so that we don't
+	 rematerialize them twice.  */
+      if (cand->uses)
+	{
+	  bitmap_ior_and_into (required, cand->uses, info->required_in);
+	  bitmap_ior_and_into (&remaining, cand->uses, info->required_in);
+	}
+    }
+}
+
+/* Make the predecessors of BB_INDEX rematerialize the candidates in
+   REQUIRED.  Add any blocks whose required_in set changes to
+   PENDING_BLOCKS.  */
+
+void
+early_remat::move_to_predecessors (unsigned int bb_index, bitmap required,
+				   bitmap pending_blocks)
+{
+  if (empty_p (required))
+    return;
+  remat_block_info *dest_info = &m_block_info[bb_index];
+  basic_block bb = BASIC_BLOCK_FOR_FN (m_fn, bb_index);
+  edge e;
+  edge_iterator ei;
+  FOR_EACH_EDGE (e, ei, bb->preds)
+    {
+      remat_block_info *src_info = &m_block_info[e->src->index];
+
+      /* Restrict the set we add to the reaching definitions.  */
+      bitmap_and (&m_tmp_bitmap, required, src_info->rd_out);
+      if (bitmap_empty_p (&m_tmp_bitmap))
+	continue;
+
+      if (!can_move_across_edge_p (e))
+	{
+	  /* We can't move the rematerialization and we can't do it at
+	     the start of the block either.  In this case we just give up
+	     and rely on spilling to make the values available across E.  */
+	  if (dump_file)
+	    {
+	      fprintf (dump_file, ";; Cannot rematerialize the following"
+		       " candidates in block %d:", e->src->index);
+	      dump_candidate_bitmap (required);
+	      fprintf (dump_file, "\n");
+	    }
+	  continue;
+	}
+
+      /* Remove candidates that are already available.  */
+      if (src_info->available_out)
+	{
+	  bitmap_and_compl_into (&m_tmp_bitmap, src_info->available_out);
+	  if (bitmap_empty_p (&m_tmp_bitmap))
+	    continue;
+	}
+
+      /* Add the remaining candidates to the appropriate required set.  */
+      if (dump_file)
+	{
+	  fprintf (dump_file, ";; Moving this set from block %d"
+		   " to block %d:", bb_index, e->src->index);
+	  dump_candidate_bitmap (&m_tmp_bitmap);
+	  fprintf (dump_file, "\n");
+	}
+      /* If the source block contains a call, we want to rematerialize
+	 after the call, otherwise we want to rematerialize at the start
+	 of the block.  */
+      bitmap src_required = get_bitmap (src_info->last_call
+					? &src_info->required_after_call
+					: &src_info->required_in);
+      if (bitmap_ior_into (src_required, &m_tmp_bitmap))
+	{
+	  if (!src_info->last_call)
+	    bitmap_set_bit (pending_blocks, e->src->index);
+	  unshare_available_sets (src_info);
+	  bitmap_ior_into (get_bitmap (&src_info->available_out),
+			   &m_tmp_bitmap);
+	}
+    }
+
+  /* The candidates are now available on entry to the block.  */
+  bitmap_and_compl_into (dest_info->required_in, required);
+  unshare_available_sets (dest_info);
+  bitmap_ior_into (get_bitmap (&dest_info->available_in), required);
+}
+
+/* Go through the candidates that are currently marked as being
+   rematerialized at the beginning of a block.  Decide in each case
+   whether that's valid and profitable; if it isn't, move the
+   rematerialization to predecessor blocks instead.  */
+
+void
+early_remat::choose_rematerialization_points (void)
+{
+  bitmap_head required;
+  bitmap_head pending_blocks;
+
+  int *postorder = df_get_postorder (DF_BACKWARD);
+  unsigned int postorder_len = df_get_n_blocks (DF_BACKWARD);
+  bitmap_initialize (&required, &m_obstack);
+  bitmap_initialize (&pending_blocks, &m_obstack);
+  do
+    /* Process the blocks in postorder, to reduce the number of iterations
+       of the outer loop.  */
+    for (unsigned int i = 0; i < postorder_len; ++i)
+      {
+	unsigned int bb_index = postorder[i];
+	remat_block_info *info = &m_block_info[bb_index];
+	bitmap_clear_bit (&pending_blocks, bb_index);
+
+	if (empty_p (info->required_in))
+	  continue;
+
+	if (info->available_in)
+	  gcc_checking_assert (!bitmap_intersect_p (info->required_in,
+						    info->available_in));
+
+	if (local_remat_cheaper_p (bb_index))
+	  {
+	    /* We'd prefer to rematerialize at the head of the block.
+	       Only move candidates if we need to.  */
+	    compute_minimum_move_set (bb_index, &required);
+	    move_to_predecessors (bb_index, &required, &pending_blocks);
+	  }
+	else
+	  move_to_predecessors (bb_index, info->required_in,
+				&pending_blocks);
+      }
+  while (!bitmap_empty_p (&pending_blocks));
+  bitmap_clear (&required);
+}
+
+/* Emit all rematerialization instructions queued for BB.  */
+
+void
+early_remat::emit_remat_insns_for_block (basic_block bb)
+{
+  remat_block_info *info = &m_block_info[bb->index];
+
+  if (info->last_call && !empty_p (info->required_after_call))
+    emit_remat_insns (info->required_after_call, NULL,
+		      info->rd_after_call, info->last_call);
+
+  if (!empty_p (info->required_in))
+    {
+      rtx_insn *insn = BB_HEAD (bb);
+      while (insn != BB_END (bb)
+	     && !INSN_P (NEXT_INSN (insn)))
+	insn = NEXT_INSN (insn);
+      emit_remat_insns (info->required_in, info->available_in,
+			info->rd_in, insn);
+    }
+}
+
+/* Decide which candidates in each block's REQUIRED_IN set need to be
+   rematerialized and decide where the rematerialization instructions
+   should go.  Emit queued rematerialization instructions at the start
+   of blocks and after the last calls in blocks.  */
+
+void
+early_remat::global_phase (void)
+{
+  compute_availability ();
+  if (dump_file)
+    {
+      fprintf (dump_file, "\n;; Blocks after computing global"
+	       " availability:\n");
+      dump_all_blocks ();
+    }
+
+  choose_rematerialization_points ();
+  if (dump_file)
+    {
+      fprintf (dump_file, "\n;; Blocks after choosing rematerialization"
+	       " points:\n");
+      dump_all_blocks ();
+    }
+
+  basic_block bb;
+  FOR_EACH_BB_FN (bb, m_fn)
+    emit_remat_insns_for_block (bb);
+}
+
+/* Main function for the pass.  */
+
+void
+early_remat::run (void)
+{
+  df_analyze ();
+
+  if (!collect_candidates ())
+    return;
+
+  init_block_info ();
+  sort_candidates ();
+  finalize_candidate_indices ();
+  if (dump_file)
+    dump_all_candidates ();
+
+  compute_rd ();
+  decide_candidate_validity ();
+  local_phase ();
+  global_phase ();
+}
+
+early_remat::early_remat (function *fn, sbitmap selected_modes)
+  : m_fn (fn),
+    m_selected_modes (selected_modes),
+    m_available (0),
+    m_required (0),
+    m_value_table (63)
+{
+  bitmap_obstack_initialize (&m_obstack);
+  bitmap_initialize (&m_candidate_regnos, &m_obstack);
+  bitmap_initialize (&m_tmp_bitmap, &m_obstack);
+}
+
+early_remat::~early_remat ()
+{
+  bitmap_obstack_release (&m_obstack);
+}
+
+namespace {
+
+const pass_data pass_data_early_remat =
+{
+  RTL_PASS, /* type */
+  "early_remat", /* name */
+  OPTGROUP_NONE, /* optinfo_flags */
+  TV_EARLY_REMAT, /* tv_id */
+  0, /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  TODO_df_finish, /* todo_flags_finish */
+};
+
+class pass_early_remat : public rtl_opt_pass
+{
+public:
+  pass_early_remat (gcc::context *ctxt)
+    : rtl_opt_pass (pass_data_early_remat, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  virtual bool gate (function *)
+  {
+    return optimize > 1 && NUM_POLY_INT_COEFFS > 1;
+  }
+
+  virtual unsigned int execute (function *f)
+  {
+    auto_sbitmap selected_modes (NUM_MACHINE_MODES);
+    bitmap_clear (selected_modes);
+    targetm.select_early_remat_modes (selected_modes);
+    if (!bitmap_empty_p (selected_modes))
+      early_remat (f, selected_modes).run ();
+    return 0;
+  }
+}; // class pass_early_remat
+
+} // anon namespace
+
+rtl_opt_pass *
+make_pass_early_remat (gcc::context *ctxt)
+{
+  return new pass_early_remat (ctxt);
+}
Index: gcc/config/aarch64/aarch64.c
===================================================================
--- gcc/config/aarch64/aarch64.c	2018-01-09 15:09:57.712775534 +0000
+++ gcc/config/aarch64/aarch64.c	2018-01-09 15:28:36.201732898 +0000
@@ -17439,6 +17439,22 @@ aarch64_can_change_mode_class (machine_m
   return true;
 }
 
+/* Implement TARGET_EARLY_REMAT_MODES.  */
+
+static void
+aarch64_select_early_remat_modes (sbitmap modes)
+{
+  /* SVE values are not normally live across a call, so it should be
+     worth doing early rematerialization even in VL-specific mode.  */
+  for (int i = 0; i < NUM_MACHINE_MODES; ++i)
+    {
+      machine_mode mode = (machine_mode) i;
+      unsigned int vec_flags = aarch64_classify_vector_mode (mode);
+      if (vec_flags & VEC_ANY_SVE)
+	bitmap_set_bit (modes, i);
+    }
+}
+
 /* Target-specific selftests.  */
 
 #if CHECKING_P
@@ -17909,6 +17925,9 @@ #define TARGET_COMPUTE_PRESSURE_CLASSES
 #undef TARGET_CAN_CHANGE_MODE_CLASS
 #define TARGET_CAN_CHANGE_MODE_CLASS aarch64_can_change_mode_class
 
+#undef TARGET_SELECT_EARLY_REMAT_MODES
+#define TARGET_SELECT_EARLY_REMAT_MODES aarch64_select_early_remat_modes
+
 #if CHECKING_P
 #undef TARGET_RUN_TARGET_SELFTESTS
 #define TARGET_RUN_TARGET_SELFTESTS selftest::aarch64_run_selftests
Index: gcc/testsuite/gcc.target/aarch64/sve_spill_1.c
===================================================================
--- gcc/testsuite/gcc.target/aarch64/sve_spill_1.c	2018-01-09 15:09:55.875851493 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_spill_1.c	2018-01-09 15:28:36.205732731 +0000
@@ -26,3 +26,5 @@ TEST_LOOP (uint64_t, 511);
 /* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #511\n} 2 } } */
 /* { dg-final { scan-assembler-not {\tldr\tz[0-9]} } } */
 /* { dg-final { scan-assembler-not {\tstr\tz[0-9]} } } */
+/* { dg-final { scan-assembler-not {\tldr\tp[0-9]} } } */
+/* { dg-final { scan-assembler-not {\tstr\tp[0-9]} } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_spill_2.c
===================================================================
--- /dev/null	2018-01-08 18:48:58.045015662 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_spill_2.c	2018-01-09 15:28:36.205732731 +0000
@@ -0,0 +1,39 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve -msve-vector-bits=scalable" } */
+
+#include <stdint.h>
+
+void consumer (void *);
+
+#define TEST_LOOP(TYPE)				\
+  void						\
+  multi_loop_##TYPE (TYPE *x, TYPE val)		\
+  {						\
+    for (int i = 0; i < 7; ++i)			\
+      x[i] += val;				\
+    consumer (x);				\
+    for (int i = 0; i < 7; ++i)			\
+      x[i] += val;				\
+    consumer (x);				\
+    for (int i = 0; i < 7; ++i)			\
+      x[i] += val;				\
+    consumer (x);				\
+  }
+
+/* One iteration is enough.  */
+TEST_LOOP (uint8_t);
+TEST_LOOP (uint16_t);
+/* Two iterations are enough.  Complete unrolling makes sense
+   even at -O2.  */
+TEST_LOOP (uint32_t);
+/* Four iterations are needed; ought to stay a loop.  */
+TEST_LOOP (uint64_t);
+
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-9]\.b} 3 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-9]\.h} 3 } } */
+/* { dg-final { scan-assembler {\twhilelo\tp[0-9]\.s} } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-9]\.d} 6 } } */
+/* { dg-final { scan-assembler-not {\tldr\tz[0-9]} } } */
+/* { dg-final { scan-assembler-not {\tstr\tz[0-9]} } } */
+/* { dg-final { scan-assembler-not {\tldr\tp[0-9]} } } */
+/* { dg-final { scan-assembler-not {\tstr\tp[0-9]} } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_spill_3.c
===================================================================
--- /dev/null	2018-01-08 18:48:58.045015662 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_spill_3.c	2018-01-09 15:28:36.205732731 +0000
@@ -0,0 +1,48 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve -msve-vector-bits=scalable" } */
+
+#include <stdint.h>
+
+void consumer (void *);
+
+#define TEST_LOOP(TYPE) \
+  void								\
+  multi_loop_##TYPE (TYPE *x, TYPE val1, TYPE val2, int n)	\
+  {								\
+    for (int i = 0; i < n; ++i)					\
+      {								\
+	x[i * 2] += val1;					\
+	x[i * 2 + 1] += val2;					\
+      }								\
+    consumer (x);						\
+    for (int i = 0; i < n; ++i)					\
+      {								\
+	x[i * 2] += val1;					\
+	x[i * 2 + 1] += val2;					\
+      }								\
+    consumer (x);						\
+    for (int i = 0; i < n; ++i)					\
+      {								\
+	x[i * 2] += val1;					\
+	x[i * 2 + 1] += val2;					\
+      }								\
+    consumer (x);						\
+  }
+
+/* One iteration is enough.  */
+TEST_LOOP (uint8_t);
+TEST_LOOP (uint16_t);
+/* Two iterations are enough.  Complete unrolling makes sense
+   even at -O2.  */
+TEST_LOOP (uint32_t);
+/* Four iterations are needed; ought to stay a loop.  */
+TEST_LOOP (uint64_t);
+
+/* { dg-final { scan-assembler {\tld1b\tz[0-9]\.b} } } */
+/* { dg-final { scan-assembler {\tld1h\tz[0-9]\.h} } } */
+/* { dg-final { scan-assembler {\tld1w\tz[0-9]\.s} } } */
+/* { dg-final { scan-assembler {\tld1d\tz[0-9]\.d} } } */
+/* { dg-final { scan-assembler-not {\tldr\tz[0-9]} } } */
+/* { dg-final { scan-assembler-not {\tstr\tz[0-9]} } } */
+/* { dg-final { scan-assembler-not {\tldr\tp[0-9]} } } */
+/* { dg-final { scan-assembler-not {\tstr\tp[0-9]} } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_spill_4.c
===================================================================
--- /dev/null	2018-01-08 18:48:58.045015662 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_spill_4.c	2018-01-09 15:28:36.205732731 +0000
@@ -0,0 +1,36 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve" } */
+
+#include <stdint.h>
+
+void consumer (void *);
+
+#define TEST_LOOP(TYPE, VAL)			\
+  void						\
+  multi_loop_##TYPE (TYPE *x)			\
+  {						\
+    for (int i = 0; i < 100; ++i)		\
+      x[i] += VAL;				\
+    consumer (x);				\
+    for (int i = 0; i < 100; ++i)		\
+      x[i] += VAL;				\
+    consumer (x);				\
+    for (int i = 0; i < 100; ++i)		\
+      x[i] += VAL;				\
+    consumer (x);				\
+  }
+
+TEST_LOOP (uint16_t, 0x1234);
+TEST_LOOP (uint32_t, 0x12345);
+TEST_LOOP (uint64_t, 0x123456);
+
+/* { dg-final { scan-assembler-times {\tptrue\tp[0-9]+\.h,} 3 } } */
+/* { dg-final { scan-assembler-times {\tptrue\tp[0-9]+\.s,} 3 } } */
+/* { dg-final { scan-assembler-times {\tptrue\tp[0-9]+\.d,} 3 } } */
+/* { dg-final { scan-assembler-times {\tld1rh\tz[0-9]+\.h,} 3 } } */
+/* { dg-final { scan-assembler-times {\tld1rw\tz[0-9]+\.s,} 3 } } */
+/* { dg-final { scan-assembler-times {\tld1rd\tz[0-9]+\.d,} 3 } } */
+/* { dg-final { scan-assembler-not {\tldr\tz[0-9]} } } */
+/* { dg-final { scan-assembler-not {\tstr\tz[0-9]} } } */
+/* { dg-final { scan-assembler-not {\tldr\tp[0-9]} } } */
+/* { dg-final { scan-assembler-not {\tstr\tp[0-9]} } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_spill_5.c
===================================================================
--- /dev/null	2018-01-08 18:48:58.045015662 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_spill_5.c	2018-01-09 15:28:36.205732731 +0000
@@ -0,0 +1,34 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -ffast-math -march=armv8-a+sve" } */
+
+#include <stdint.h>
+
+void consumer (void *);
+
+#define TEST_LOOP(TYPE, VAL)			\
+  void						\
+  multi_loop_##TYPE (TYPE *x)			\
+  {						\
+    for (int i = 0; i < 100; ++i)		\
+      x[i] += VAL + i;				\
+    consumer (x);				\
+    for (int i = 0; i < 100; ++i)		\
+      x[i] += VAL + i;				\
+    consumer (x);				\
+    for (int i = 0; i < 100; ++i)		\
+      x[i] += VAL + i;				\
+    consumer (x);				\
+  }
+
+TEST_LOOP (uint8_t, 3);
+TEST_LOOP (uint16_t, 4);
+TEST_LOOP (uint32_t, 5);
+TEST_LOOP (uint64_t, 6);
+TEST_LOOP (float, 2.5f);
+TEST_LOOP (double, 3.5);
+
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\..,} 18 } } */
+/* { dg-final { scan-assembler-not {\tldr\tz[0-9]} } } */
+/* { dg-final { scan-assembler-not {\tstr\tz[0-9]} } } */
+/* { dg-final { scan-assembler-not {\tldr\tp[0-9]} } } */
+/* { dg-final { scan-assembler-not {\tstr\tp[0-9]} } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_spill_6.c
===================================================================
--- /dev/null	2018-01-08 18:48:58.045015662 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_spill_6.c	2018-01-09 15:28:36.205732731 +0000
@@ -0,0 +1,44 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve -msve-vector-bits=scalable" } */
+
+#include <stdint.h>
+
+void consumer (void *);
+
+#define TEST_LOOP(TYPE, VAL)						\
+  void									\
+  multi_loop_##TYPE (TYPE *x1, TYPE *x2, TYPE *x3, TYPE *x4, int which) \
+  {									\
+    if (which)								\
+      {									\
+	for (int i = 0; i < 7; ++i)					\
+	  x1[i] += VAL;							\
+	consumer (x1);							\
+	for (int i = 0; i < 7; ++i)					\
+	  x2[i] -= VAL;							\
+	consumer (x2);							\
+      }									\
+    else								\
+      {									\
+	for (int i = 0; i < 7; ++i)					\
+	  x3[i] &= VAL;							\
+	consumer (x3);							\
+      }									\
+    for (int i = 0; i < 7; ++i)						\
+      x4[i] |= VAL;							\
+    consumer (x4);							\
+  }
+
+TEST_LOOP (uint8_t, 0x12);
+TEST_LOOP (uint16_t, 0x1234);
+TEST_LOOP (uint32_t, 0x12345);
+TEST_LOOP (uint64_t, 0x123456);
+
+/* { dg-final { scan-assembler {\tld1b\tz[0-9]+\.b,} } } */
+/* { dg-final { scan-assembler {\tld1h\tz[0-9]+\.h,} } } */
+/* { dg-final { scan-assembler {\tld1w\tz[0-9]+\.s,} } } */
+/* { dg-final { scan-assembler {\tld1d\tz[0-9]+\.d,} } } */
+/* { dg-final { scan-assembler-not {\tldr\tz[0-9]} } } */
+/* { dg-final { scan-assembler-not {\tstr\tz[0-9]} } } */
+/* { dg-final { scan-assembler-not {\tldr\tp[0-9]} } } */
+/* { dg-final { scan-assembler-not {\tstr\tp[0-9]} } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_spill_7.c
===================================================================
--- /dev/null	2018-01-08 18:48:58.045015662 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_spill_7.c	2018-01-09 15:28:36.205732731 +0000
@@ -0,0 +1,46 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -ffast-math -march=armv8-a+sve" } */
+
+#include <stdint.h>
+
+void consumer (void *);
+
+#define TEST_LOOP(TYPE, VAL)			\
+  void						\
+  multi_loop_##TYPE (TYPE *x, int n)		\
+  {						\
+    for (int k = 0; k < 4; ++k)			\
+      {						\
+	for (int j = 0; j < n; ++j)		\
+	  {					\
+	    for (int i = 0; i < 100; ++i)	\
+	      x[i] += VAL + i;			\
+	    asm volatile ("");			\
+	  }					\
+	for (int j = 0; j < n; ++j)		\
+	  consumer (x);				\
+	for (int j = 0; j < n; ++j)		\
+	  {					\
+	    for (int i = 0; i < 100; ++i)	\
+	      x[i] += VAL + i;			\
+	    asm volatile ("");			\
+	  }					\
+	consumer (x);				\
+	for (int i = 0; i < 100; ++i)		\
+	  x[i] += VAL + i;			\
+	consumer (x);				\
+      }						\
+  }
+
+TEST_LOOP (uint8_t, 3);
+TEST_LOOP (uint16_t, 4);
+TEST_LOOP (uint32_t, 5);
+TEST_LOOP (uint64_t, 6);
+TEST_LOOP (float, 2.5f);
+TEST_LOOP (double, 3.5);
+
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\..,} 18 } } */
+/* { dg-final { scan-assembler-not {\tldr\tz[0-9]} } } */
+/* { dg-final { scan-assembler-not {\tstr\tz[0-9]} } } */
+/* { dg-final { scan-assembler-not {\tldr\tp[0-9]} } } */
+/* { dg-final { scan-assembler-not {\tstr\tp[0-9]} } } */
Jeff Law Jan. 9, 2018, 6:45 p.m. UTC | #14
On 01/09/2018 08:34 AM, Richard Sandiford wrote:
> Possible ping: wasn't sure whether this one needed more work or whether

> it was OK to go in.  I've attached the patch with the improved comment

> above early_remat::emit_copy_before.

> 

You answered all the questions I had.  I should have explicitly ack'd at
that point.  Sorry.

OK for the trunk,
Jeff
diff mbox series

Patch

Index: gcc/Makefile.in
===================================================================
--- gcc/Makefile.in	2017-11-17 15:56:28.653842781 +0000
+++ gcc/Makefile.in	2017-11-17 15:56:53.016759921 +0000
@@ -1273,6 +1273,7 @@  OBJS = \
 	dwarf2asm.o \
 	dwarf2cfi.o \
 	dwarf2out.o \
+	early-remat.o \
 	emit-rtl.o \
 	et-forest.o \
 	except.o \
Index: gcc/target.def
===================================================================
--- gcc/target.def	2017-11-17 15:56:45.551761353 +0000
+++ gcc/target.def	2017-11-17 15:56:53.022759920 +0000
@@ -5531,6 +5531,19 @@  reload from using some alternatives, lik
  default_preferred_output_reload_class)
 
 DEFHOOK
+(select_early_remat_modes,
+ "On some targets, certain modes cannot be held in registers around a\n\
+standard ABI call and are relatively expensive to spill to the stack.\n\
+The early rematerialization pass can help in such cases by aggressively\n\
+recomputing values after calls, so that they don't need to be spilled.\n\
+\n\
+This hook returns the set of such modes by setting the associated bits\n\
+in @var{modes}.  The default implementation selects no modes, which has\n\
+the effect of disabling the early rematerialization pass.",
+ void, (sbitmap modes),
+ default_select_early_remat_modes)
+
+DEFHOOK
 (class_likely_spilled_p,
  "A target hook which returns @code{true} if pseudos that have been assigned\n\
 to registers of class @var{rclass} would likely be spilled because\n\
Index: gcc/doc/tm.texi.in
===================================================================
--- gcc/doc/tm.texi.in	2017-11-17 15:56:45.550761354 +0000
+++ gcc/doc/tm.texi.in	2017-11-17 15:56:53.020759920 +0000
@@ -2307,6 +2307,8 @@  Do not define this macro if you do not d
 
 @hook TARGET_SECONDARY_MEMORY_NEEDED_MODE
 
+@hook TARGET_SELECT_EARLY_REMAT_MODES
+
 @hook TARGET_CLASS_LIKELY_SPILLED_P
 
 @hook TARGET_CLASS_MAX_NREGS
Index: gcc/doc/tm.texi
===================================================================
--- gcc/doc/tm.texi	2017-11-17 15:56:45.549761354 +0000
+++ gcc/doc/tm.texi	2017-11-17 15:56:53.019759920 +0000
@@ -2774,6 +2774,17 @@  details.
 With LRA, the default is to use @var{mode} unmodified.
 @end deftypefn
 
+@deftypefn {Target Hook} void TARGET_SELECT_EARLY_REMAT_MODES (sbitmap @var{modes})
+On some targets, certain modes cannot be held in registers around a
+standard ABI call and are relatively expensive to spill to the stack.
+The early rematerialization pass can help in such cases by aggressively
+recomputing values after calls, so that they don't need to be spilled.
+
+This hook returns the set of such modes by setting the associated bits
+in @var{modes}.  The default implementation selects no modes, which has
+the effect of disabling the early rematerialization pass.
+@end deftypefn
+
 @deftypefn {Target Hook} bool TARGET_CLASS_LIKELY_SPILLED_P (reg_class_t @var{rclass})
 A target hook which returns @code{true} if pseudos that have been assigned
 to registers of class @var{rclass} would likely be spilled because
Index: gcc/targhooks.h
===================================================================
--- gcc/targhooks.h	2017-11-17 15:56:45.551761353 +0000
+++ gcc/targhooks.h	2017-11-17 15:56:53.022759920 +0000
@@ -284,5 +284,6 @@  extern unsigned int default_min_arithmet
 extern enum flt_eval_method
 default_excess_precision (enum excess_precision_type ATTRIBUTE_UNUSED);
 extern bool default_stack_clash_protection_final_dynamic_probe (rtx);
+extern void default_select_early_remat_modes (sbitmap);
 
 #endif /* GCC_TARGHOOKS_H */
Index: gcc/targhooks.c
===================================================================
--- gcc/targhooks.c	2017-11-17 15:56:45.551761353 +0000
+++ gcc/targhooks.c	2017-11-17 15:56:53.022759920 +0000
@@ -82,6 +82,7 @@  Software Foundation; either version 3, o
 #include "params.h"
 #include "real.h"
 #include "langhooks.h"
+#include "sbitmap.h"
 
 bool
 default_legitimate_address_p (machine_mode mode ATTRIBUTE_UNUSED,
@@ -2314,4 +2315,11 @@  default_stack_clash_protection_final_dyn
   return 0;
 }
 
+/* The default implementation of TARGET_EARLY_REMAT_MODES.  */
+
+void
+default_select_early_remat_modes (sbitmap)
+{
+}
+
 #include "gt-targhooks.h"
Index: gcc/timevar.def
===================================================================
--- gcc/timevar.def	2017-11-17 15:56:28.653842781 +0000
+++ gcc/timevar.def	2017-11-17 15:56:53.023759919 +0000
@@ -248,6 +248,7 @@  DEFTIMEVAR (TV_MODE_SWITCH           , "
 DEFTIMEVAR (TV_SMS		     , "sms modulo scheduling")
 DEFTIMEVAR (TV_LIVE_RANGE_SHRINKAGE  , "live range shrinkage")
 DEFTIMEVAR (TV_SCHED                 , "scheduling")
+DEFTIMEVAR (TV_EARLY_REMAT           , "early rematerialization")
 DEFTIMEVAR (TV_IRA		     , "integrated RA")
 DEFTIMEVAR (TV_LRA		     , "LRA non-specific")
 DEFTIMEVAR (TV_LRA_ELIMINATE	     , "LRA virtuals elimination")
Index: gcc/passes.def
===================================================================
--- gcc/passes.def	2017-11-17 15:56:28.653842781 +0000
+++ gcc/passes.def	2017-11-17 15:56:53.021759920 +0000
@@ -456,6 +456,7 @@  along with GCC; see the file COPYING3.
       NEXT_PASS (pass_sms);
       NEXT_PASS (pass_live_range_shrinkage);
       NEXT_PASS (pass_sched);
+      NEXT_PASS (pass_early_remat);
       NEXT_PASS (pass_ira);
       NEXT_PASS (pass_reload);
       NEXT_PASS (pass_postreload);
Index: gcc/tree-pass.h
===================================================================
--- gcc/tree-pass.h	2017-11-17 15:56:28.653842781 +0000
+++ gcc/tree-pass.h	2017-11-17 15:56:53.023759919 +0000
@@ -573,6 +573,7 @@  extern rtl_opt_pass *make_pass_mode_swit
 extern rtl_opt_pass *make_pass_sms (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_sched (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_live_range_shrinkage (gcc::context *ctxt);
+extern rtl_opt_pass *make_pass_early_remat (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_ira (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_reload (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_clean_state (gcc::context *ctxt);
Index: gcc/early-remat.c
===================================================================
--- /dev/null	2017-11-14 14:28:07.424493901 +0000
+++ gcc/early-remat.c	2017-11-17 15:56:53.021759920 +0000
@@ -0,0 +1,2610 @@ 
+/* Early (pre-RA) rematerialization
+   Copyright (C) 2017 Free Software Foundation, Inc.
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify it under
+the terms of the GNU General Public License as published by the Free
+Software Foundation; either version 3, or (at your option) any later
+version.
+
+GCC is distributed in the hope that it will be useful, but WITHOUT ANY
+WARRANTY; without even the implied warranty of MERCHANTABILITY or
+FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.  If not see
+<http://www.gnu.org/licenses/>.  */
+
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "backend.h"
+#include "rtl.h"
+#include "df.h"
+#include "tree-pass.h"
+#include "memmodel.h"
+#include "emit-rtl.h"
+#include "insn-config.h"
+#include "recog.h"
+/* FIXME: The next two are only needed for gen_move_insn.  */
+#include "tree.h"
+#include "expr.h"
+#include "target.h"
+#include "inchash.h"
+#include "rtlhash.h"
+#include "print-rtl.h"
+#include "rtl-iter.h"
+
+/* This pass runs before register allocation and implements an aggressive
+   form of rematerialization.  It looks for pseudo registers R of mode M
+   for which:
+
+     (a) there are no call-preserved registers of mode M; and
+     (b) spilling R to the stack is expensive.
+
+   The assumption is that it's better to recompute R after each call instead
+   of spilling it, even if this extends the live ranges of other registers.
+
+   The motivating example for which these conditions hold are AArch64 SVE
+   vectors and predicates.  Spilling them to the stack makes the frame
+   variable-sized, which we'd like to avoid if possible.  It's also very
+   rare for SVE values to be "naturally" live across a call: usually this
+   happens as a result of CSE or other code motion.
+
+   The pass is split into the following phases:
+
+   Collection phase
+   ================
+
+   First we go through all pseudo registers looking for any that meet
+   the conditions above.  For each such register R, we go through each
+   instruction that defines R to see whether any of them are suitable
+   rematerialization candidates.  If at least one is, we treat all the
+   instructions that define R as candidates, but record which ones are
+   not in fact suitable.  These unsuitable candidates exist only for the
+   sake of calculating reaching definitions (see below).
+
+   A "candidate" is a single instruction that we want to rematerialize
+   and a "candidate register" is a register that is set by at least one
+   candidate.
+
+   Candidate sorting
+   =================
+
+   Next we sort the candidates based on the cfg postorder, so that if
+   candidate C1 uses candidate C2, C1 has a lower index than C2.
+   This is useful when iterating through candidate bitmaps.
+
+   Reaching definition calculation
+   ===============================
+
+   We then compute standard reaching-definition sets for each candidate.
+   Each set specifies which candidates might provide the current definition
+   of a live candidate register.
+
+   From here on, a candidate C is "live" at a point P if the candidate
+   register defined by C is live at P and if C's definition reaches P.
+   An instruction I "uses" a candidate C if I takes the register defined by
+   C as input and if C is one of the reaching definitions of that register.
+
+   Candidate validation and value numbering
+   ========================================
+
+   Next we simultaneously decide which candidates are valid and look
+   for candidates that are equivalent to each other, assigning numbers
+   to each unique candidate value.  A candidate C is invalid if:
+
+     (a) C uses an invalid candidate;
+
+     (b) there is a cycle of candidate uses involving C; or
+
+     (c) C takes a candidate register R as input and the reaching
+         definitions of R do not have the same value number.
+
+   We assign a "representative" candidate C to each value number and from
+   here on replace references to other candidates with that value number
+   with references to C.  It is then only possible to rematerialize a
+   register R at point P if (after this replacement) there is a single
+   reaching definition of R at P.
+
+   Local phase
+   ===========
+
+   During this phase we go through each block and look for cases in which:
+
+     (a) an instruction I comes between two call instructions CI1 and CI2;
+
+     (b) I uses a candidate register R;
+
+     (c) a candidate C provides the only reaching definition of R; and
+
+     (d) C does not come between CI1 and I.
+
+   We then emit a copy of C after CI1, as well as the transitive closure
+   TC of the candidates used by C.  The copies of TC might use the original
+   candidate registers or new temporary registers, depending on circumstances.
+
+   For example, if elsewhere we have:
+
+       C3: R3 <- f3 (...)
+	   ...
+       C2: R2 <- f2 (...)
+	   ...
+       C1: R1 <- f1 (R2, R3, ...)  // uses C2 and C3
+
+   then for a block containing:
+
+      CI1: call
+	   ...
+	I: use R1  // uses C1
+	   ...
+      CI2: call
+
+   we would emit:
+
+      CI1: call
+      C3': R3' <- f3 (...)
+      C2': R2' <- f2 (...)
+      C1': R1 <- f1 (R2', R3', ...)
+	   ...
+	I: use R1
+	   ...
+      CI2: call
+
+   where R2' and R3' might be fresh registers.  If instead we had:
+
+      CI1: call
+	   ...
+       I1: use R1  // uses C1
+	   ...
+       I2: use R3  // uses C3
+	   ...
+      CI2: call
+
+   we would keep the original R3:
+
+      CI1: call
+      C3': R3 <- f3 (...)
+      C2': R2' <- f2 (...)
+      C1': R1 <- f1 (R2', R3, ...)
+	   ...
+       I1: use R1  // uses C1
+	   ...
+       I2: use R3  // uses C3
+	   ...
+      CI2: call
+
+   We also record the last call in each block (if any) and compute:
+
+     rd_after_call:
+       The set of candidates that either (a) are defined outside the block
+       and are live after the last call or (b) are defined within the block
+       and reach the end of the last call.  (We don't track whether the
+       latter values are live or not.)
+
+     required_after_call:
+       The set of candidates that need to be rematerialized after the
+       last call in order to satisfy uses in the block itself.
+
+     required_in:
+       The set of candidates that are live on entry to the block and are
+       used without an intervening call.
+
+   In addition, we compute the initial values of the sets required by
+   the global phase below.
+
+   Global phase
+   ============
+
+   We next compute a maximal solution to the following availability
+   problem:
+
+     available_in:
+       The set of candidates that are live on entry to a block and can
+       be used at that point without rematerialization.
+
+     available_out:
+       The set of candidates that are live on exit from a block and can
+       be used at that point without rematerialization.
+
+     available_locally:
+       The subset of available_out that is due to code in the block itself.
+       It contains candidates that are defined or used in the block and
+       not invalidated by a later call.
+
+   We then go through each block B and look for an appropriate place
+   to insert copies of required_in - available_in.  Conceptually we
+   start by placing the copies at the head of B, but then move the
+   copy of a candidate C to predecessors if:
+
+     (a) that seems cheaper;
+
+     (b) there is more than one reaching definition of C's register at
+	 the head of B; or
+
+     (c) copying C would clobber a hard register that is live on entry to B.
+
+   Moving a copy of C to a predecessor block PB involves:
+
+     (1) adding C to PB's required_after_call, if PB contains a call; or
+
+     (2) adding C PB's required_in otherwise.
+
+   C is then available on output from each PB and on input to B.
+
+   Once all this is done, we emit instructions for the final required_in
+   and required_after_call sets.  */
+
+namespace {
+
+/* An invalid candidate index, used to indicate that there is more than
+   one reaching definition.  */
+const unsigned int MULTIPLE_CANDIDATES = -1U;
+
+/* Pass-specific information about one basic block.  */
+struct remat_block_info {
+  /* The last call instruction in the block.  */
+  rtx_insn *last_call;
+
+  /* The set of candidates that are live on entry to the block.  NULL is
+     equivalent to an empty set.  */
+  bitmap rd_in;
+
+  /* The set of candidates that are live on exit from the block.  This might
+     reuse rd_in.  NULL is equivalent to an empty set.  */
+  bitmap rd_out;
+
+  /* The subset of RD_OUT that comes from local definitions.  NULL is
+     equivalent to an empty set.  */
+  bitmap rd_gen;
+
+  /* The set of candidates that the block invalidates (because it defines
+     the register to something else, or because the register's value is
+     no longer important).  NULL is equivalent to an empty set.  */
+  bitmap rd_kill;
+
+  /* The set of candidates that either (a) are defined outside the block
+     and are live after LAST_CALL or (b) are defined within the block
+     and reach the instruction after LAST_CALL.  (We don't track whether
+     the latter values are live or not.)
+
+     Only used if LAST_CALL is nonnull.  NULL is equivalent to an
+     empty set.  */
+  bitmap rd_after_call;
+
+  /* Candidates that are live and available without rematerialization
+     on entry to the block.  NULL is equivalent to an empty set.  */
+  bitmap available_in;
+
+  /* Candidates that become available without rematerialization within the
+     block, and remain so on exit.  NULL is equivalent to an empty set.  */
+  bitmap available_locally;
+
+  /* Candidates that are available without rematerialization on exit from
+     the block.  This might reuse available_in or available_locally.  */
+  bitmap available_out;
+
+  /* Candidates that need to be rematerialized either at the start of the
+     block or before entering the block.  */
+  bitmap required_in;
+
+  /* Candidates that need to be rematerialized after LAST_CALL.
+     Only used if LAST_CALL is nonnull.  */
+  bitmap required_after_call;
+
+  /* The number of candidates in the block.  */
+  unsigned int num_candidates;
+
+  /* The earliest candidate in the block (i.e. the one with the
+     highest index).  Only valid if NUM_CANDIDATES is nonzero.  */
+  unsigned int first_candidate;
+
+  /* The best (lowest) execution frequency for rematerializing REQUIRED_IN.
+     This is the execution frequency of the block if LOCAL_REMAT_CHEAPER_P,
+     otherwise it is the sum of the execution frequencies of whichever
+     predecessor blocks would do the rematerialization.  */
+  int remat_frequency;
+
+  /* True if the block ends with an abnormal call.  */
+  unsigned int abnormal_call_p : 1;
+
+  /* Used to record whether a graph traversal has visited this block.  */
+  unsigned int visited_p : 1;
+
+  /* True if we have calculated REMAT_FREQUENCY.  */
+  unsigned int remat_frequency_valid_p : 1;
+
+  /* True if it is cheaper to rematerialize candidates at the start of
+     the block, rather than moving them to predecessor blocks.  */
+  unsigned int local_remat_cheaper_p : 1;
+};
+
+/* Information about a group of candidates with the same value number.  */
+struct remat_equiv_class {
+  /* The candidates that have the same value number.  */
+  bitmap members;
+
+  /* The candidate that was first added to MEMBERS.  */
+  unsigned int earliest;
+
+  /* The candidate that represents the others.  This is always the one
+     with the highest index.  */
+  unsigned int representative;
+};
+
+/* Information about an instruction that we might want to rematerialize.  */
+struct remat_candidate {
+  /* The pseudo register that the instruction sets.  */
+  unsigned int regno;
+
+  /* A temporary register used when rematerializing uses of this candidate,
+     if REGNO doesn't have the right value or isn't worth using.  */
+  unsigned int copy_regno;
+
+  /* True if we intend to rematerialize this instruction by emitting
+     a move of a constant into REGNO, false if we intend to emit a
+     copy of the original instruction.  */
+  unsigned int constant_p : 1;
+
+  /* True if we still think it's possible to rematerialize INSN.  */
+  unsigned int can_copy_p : 1;
+
+  /* Used to record whether a graph traversal has visited this candidate.  */
+  unsigned int visited_p : 1;
+
+  /* True if we have verified that it's possible to rematerialize INSN.
+     Once this is true, both it and CAN_COPY_P remain true.  */
+  unsigned int validated_p : 1;
+
+  /* True if we have "stabilized" INSN, i.e. ensured that all non-candidate
+     registers read by INSN will have the same value when rematerializing INSN.
+     Only ever true if CAN_COPY_P.  */
+  unsigned int stabilized_p : 1;
+
+  /* Hash value used for value numbering.  */
+  hashval_t hash;
+
+  /* The instruction that sets REGNO.  */
+  rtx_insn *insn;
+
+  /* If CONSTANT_P, the value that should be moved into REGNO when
+     rematerializing, otherwise the pattern of the instruction that
+     should be used.  */
+  rtx remat_rtx;
+
+  /* The set of candidates that INSN takes as input.  NULL is equivalent
+     to the empty set.  All candidates in this set have a higher index
+     than the current candidate.  */
+  bitmap uses;
+
+  /* The set of hard registers that would be clobbered by rematerializing
+     the candidate, including (transitively) all those that would be
+     clobbered by rematerializing USES.  */
+  bitmap clobbers;
+
+  /* The equivalence class to which the candidate belongs, or null if none.  */
+  remat_equiv_class *equiv_class;
+};
+
+/* Hash functions used for value numbering.  */
+struct remat_candidate_hasher : nofree_ptr_hash <remat_candidate>
+{
+  typedef value_type compare_type;
+  static hashval_t hash (const remat_candidate *);
+  static bool equal (const remat_candidate *, const remat_candidate *);
+};
+
+/* Main class for this pass.  */
+class early_remat {
+public:
+  early_remat (function *, sbitmap);
+  ~early_remat ();
+
+  void run (void);
+
+private:
+  bitmap alloc_bitmap (void);
+  bitmap get_bitmap (bitmap *);
+  void init_temp_bitmap (bitmap *);
+  void copy_temp_bitmap (bitmap *, bitmap *);
+
+  void dump_insn_id (rtx_insn *);
+  void dump_candidate_bitmap (bitmap);
+  void dump_all_candidates (void);
+  void dump_edge_list (basic_block, bool);
+  void dump_block_info (basic_block);
+  void dump_all_blocks (void);
+
+  bool interesting_regno_p (unsigned int);
+  remat_candidate *add_candidate (rtx_insn *, unsigned int, bool);
+  bool maybe_add_candidate (rtx_insn *, unsigned int);
+  bool collect_candidates (void);
+  void init_block_info (void);
+  void sort_candidates (void);
+  void finalize_candidate_indices (void);
+  void record_equiv_candidates (unsigned int, unsigned int);
+  static bool rd_confluence_n (edge);
+  static bool rd_transfer (int);
+  void compute_rd (void);
+  unsigned int canon_candidate (unsigned int);
+  void canon_bitmap (bitmap *);
+  unsigned int resolve_reaching_def (bitmap);
+  bool check_candidate_uses (unsigned int);
+  void compute_clobbers (unsigned int);
+  void assign_value_number (unsigned int);
+  void decide_candidate_validity (void);
+  bool stable_use_p (unsigned int);
+  void emit_copy_before (unsigned int, rtx, rtx);
+  void stabilize_pattern (unsigned int);
+  void replace_dest_with_copy (unsigned int);
+  void stabilize_candidate_uses (unsigned int, bitmap, bitmap, bitmap,
+				 bitmap);
+  void emit_remat_insns (bitmap, bitmap, bitmap, rtx_insn *);
+  bool set_available_out (remat_block_info *);
+  void process_block (basic_block);
+  void local_phase (void);
+  static bool avail_confluence_n (edge);
+  static bool avail_transfer (int);
+  void compute_availability (void);
+  void unshare_available_sets (remat_block_info *);
+  bool can_move_across_edge_p (edge);
+  bool local_remat_cheaper_p (unsigned int);
+  bool need_to_move_candidate_p (unsigned int, unsigned int);
+  void compute_minimum_move_set (unsigned int, bitmap);
+  void move_to_predecessors (unsigned int, bitmap, bitmap);
+  void choose_rematerialization_points (void);
+  void emit_remat_insns_for_block (basic_block);
+  void global_phase (void);
+
+  /* The function that we're optimizing.  */
+  function *m_fn;
+
+  /* The modes that we want to rematerialize.  */
+  sbitmap m_selected_modes;
+
+  /* All rematerialization candidates, identified by their index into the
+     vector.  */
+  auto_vec<remat_candidate> m_candidates;
+
+  /* The set of candidate registers.  */
+  bitmap_head m_candidate_regnos;
+
+  /* Temporary sets.  */
+  bitmap_head m_tmp_bitmap;
+  bitmap m_available;
+  bitmap m_required;
+
+  /* Information about each basic block.  */
+  auto_vec<remat_block_info> m_block_info;
+
+  /* A mapping from register numbers to the set of associated candidates.
+     Only valid for registers in M_CANDIDATE_REGNOS.  */
+  auto_vec<bitmap> m_regno_to_candidates;
+
+  /* An obstack used for allocating bitmaps, so that we can free them all
+     in one go.  */
+  bitmap_obstack m_obstack;
+
+  /* A hash table of candidates used for value numbering.  If a candidate
+     in the table is in an equivalence class, the candidate is marked as
+     the earliest member of the class.  */
+  hash_table<remat_candidate_hasher> m_value_table;
+
+  /* Used temporarily by callback functions.  */
+  static early_remat *er;
+};
+
+}
+
+early_remat *early_remat::er;
+
+/* rtx_equal_p_cb callback that treats any two SCRATCHes as equal.
+   This allows us to compare two copies of a pattern, even though their
+   SCRATCHes are always distinct.  */
+
+static int
+scratch_equal (const_rtx *x, const_rtx *y, rtx *nx, rtx *ny)
+{
+  if (GET_CODE (*x) == SCRATCH && GET_CODE (*y) == SCRATCH)
+    {
+      *nx = const0_rtx;
+      *ny = const0_rtx;
+      return 1;
+    }
+  return 0;
+}
+
+/* Hash callback functions for remat_candidate.  */
+
+hashval_t
+remat_candidate_hasher::hash (const remat_candidate *cand)
+{
+  return cand->hash;
+}
+
+bool
+remat_candidate_hasher::equal (const remat_candidate *cand1,
+			       const remat_candidate *cand2)
+{
+  return (cand1->regno == cand2->regno
+	  && cand1->constant_p == cand2->constant_p
+	  && (cand1->constant_p
+	      ? rtx_equal_p (cand1->remat_rtx, cand2->remat_rtx)
+	      : rtx_equal_p_cb (cand1->remat_rtx, cand2->remat_rtx,
+				scratch_equal))
+	  && (!cand1->uses || bitmap_equal_p (cand1->uses, cand2->uses)));
+}
+
+/* Return true if B is null or empty.  */
+
+inline bool
+empty_p (bitmap b)
+{
+  return !b || bitmap_empty_p (b);
+}
+
+/* Allocate a new bitmap.  It will be automatically freed at the end of
+   the pass.  */
+
+inline bitmap
+early_remat::alloc_bitmap (void)
+{
+  return bitmap_alloc (&m_obstack);
+}
+
+/* Initialize *PTR to an empty bitmap if it is currently null.  */
+
+inline bitmap
+early_remat::get_bitmap (bitmap *ptr)
+{
+  if (!*ptr)
+    *ptr = alloc_bitmap ();
+  return *ptr;
+}
+
+/* *PTR is either null or empty.  If it is null, initialize it to an
+   empty bitmap.  */
+
+inline void
+early_remat::init_temp_bitmap (bitmap *ptr)
+{
+  if (!*ptr)
+    *ptr = alloc_bitmap ();
+  else
+    gcc_checking_assert (bitmap_empty_p (*ptr));
+}
+
+/* Move *SRC to *DEST and leave *SRC empty.  */
+
+inline void
+early_remat::copy_temp_bitmap (bitmap *dest, bitmap *src)
+{
+  if (!empty_p (*src))
+    {
+      *dest = *src;
+      *src = NULL;
+    }
+  else
+    *dest = NULL;
+}
+
+/* Print INSN's identifier to the dump file.  */
+
+void
+early_remat::dump_insn_id (rtx_insn *insn)
+{
+  fprintf (dump_file, "%d[bb:%d]", INSN_UID (insn),
+	   BLOCK_FOR_INSN (insn)->index);
+}
+
+/* Print candidate set CANDIDATES to the dump file, with a leading space.  */
+
+void
+early_remat::dump_candidate_bitmap (bitmap candidates)
+{
+  if (empty_p (candidates))
+    {
+      fprintf (dump_file, " none");
+      return;
+    }
+
+  unsigned int cand_index;
+  bitmap_iterator bi;
+  EXECUTE_IF_SET_IN_BITMAP (candidates, 0, cand_index, bi)
+    fprintf (dump_file, " %d", cand_index);
+}
+
+/* Print information about all candidates to the dump file.  */
+
+void
+early_remat::dump_all_candidates (void)
+{
+  fprintf (dump_file, "\n;; Candidates:\n;;\n");
+  fprintf (dump_file, ";; %5s %5s %8s %s\n", "#", "reg", "mode", "insn");
+  fprintf (dump_file, ";; %5s %5s %8s %s\n", "=", "===", "====", "====");
+  unsigned int cand_index;
+  remat_candidate *cand;
+  FOR_EACH_VEC_ELT (m_candidates, cand_index, cand)
+    {
+      fprintf (dump_file, ";; %5d %5d %8s ", cand_index, cand->regno,
+	       GET_MODE_NAME (GET_MODE (regno_reg_rtx[cand->regno])));
+      dump_insn_id (cand->insn);
+      if (!cand->can_copy_p)
+	fprintf (dump_file, "   -- can't copy");
+      fprintf (dump_file, "\n");
+    }
+
+  fprintf (dump_file, "\n;; Register-to-candidate mapping:\n;;\n");
+  unsigned int regno;
+  bitmap_iterator bi;
+  EXECUTE_IF_SET_IN_BITMAP (&m_candidate_regnos, 0, regno, bi)
+    {
+      fprintf (dump_file, ";; %5d:", regno);
+      dump_candidate_bitmap (m_regno_to_candidates[regno]);
+      fprintf (dump_file, "\n");
+    }
+}
+
+/* Print the predecessors or successors of BB to the dump file, with a
+   leading space.  DO_SUCC is true to print successors and false to print
+   predecessors.  */
+
+void
+early_remat::dump_edge_list (basic_block bb, bool do_succ)
+{
+  edge e;
+  edge_iterator ei;
+  FOR_EACH_EDGE (e, ei, do_succ ? bb->succs : bb->preds)
+    dump_edge_info (dump_file, e, 0, do_succ);
+}
+
+/* Print information about basic block BB to the dump file.  */
+
+void
+early_remat::dump_block_info (basic_block bb)
+{
+  remat_block_info *info = &m_block_info[bb->index];
+  fprintf (dump_file, ";;\n;; Block %d:", bb->index);
+  int width = 25;
+
+  fprintf (dump_file, "\n;;%*s:", width, "predecessors");
+  dump_edge_list (bb, false);
+
+  fprintf (dump_file, "\n;;%*s:", width, "successors");
+  dump_edge_list (bb, true);
+
+  fprintf (dump_file, "\n;;%*s: %d", width, "frequency",
+	   bb->count.to_frequency (m_fn));
+
+  if (info->last_call)
+    fprintf (dump_file, "\n;;%*s: %d", width, "last call",
+	     INSN_UID (info->last_call));
+
+  if (!empty_p (info->rd_in))
+    {
+      fprintf (dump_file, "\n;;%*s:", width, "RD in");
+      dump_candidate_bitmap (info->rd_in);
+    }
+  if (!empty_p (info->rd_kill))
+    {
+      fprintf (dump_file, "\n;;%*s:", width, "RD kill");
+      dump_candidate_bitmap (info->rd_kill);
+    }
+  if (!empty_p (info->rd_gen))
+    {
+      fprintf (dump_file, "\n;;%*s:", width, "RD gen");
+      dump_candidate_bitmap (info->rd_gen);
+    }
+  if (!empty_p (info->rd_after_call))
+    {
+      fprintf (dump_file, "\n;;%*s:", width, "RD after call");
+      dump_candidate_bitmap (info->rd_after_call);
+    }
+  if (!empty_p (info->rd_out))
+    {
+      fprintf (dump_file, "\n;;%*s:", width, "RD out");
+      if (info->rd_in == info->rd_out)
+	fprintf (dump_file, " RD in");
+      else
+	dump_candidate_bitmap (info->rd_out);
+    }
+  if (!empty_p (info->available_in))
+    {
+      fprintf (dump_file, "\n;;%*s:", width, "available in");
+      dump_candidate_bitmap (info->available_in);
+    }
+  if (!empty_p (info->available_locally))
+    {
+      fprintf (dump_file, "\n;;%*s:", width, "available locally");
+      dump_candidate_bitmap (info->available_locally);
+    }
+  if (!empty_p (info->available_out))
+    {
+      fprintf (dump_file, "\n;;%*s:", width, "available out");
+      if (info->available_in == info->available_out)
+	fprintf (dump_file, " available in");
+      else if (info->available_locally == info->available_out)
+	fprintf (dump_file, " available locally");
+      else
+	dump_candidate_bitmap (info->available_out);
+    }
+  if (!empty_p (info->required_in))
+    {
+      fprintf (dump_file, "\n;;%*s:", width, "required in");
+      dump_candidate_bitmap (info->required_in);
+    }
+  if (!empty_p (info->required_after_call))
+    {
+      fprintf (dump_file, "\n;;%*s:", width, "required after call");
+      dump_candidate_bitmap (info->required_after_call);
+    }
+  fprintf (dump_file, "\n");
+}
+
+/* Print information about all basic blocks to the dump file.  */
+
+void
+early_remat::dump_all_blocks (void)
+{
+  basic_block bb;
+  FOR_EACH_BB_FN (bb, m_fn)
+    dump_block_info (bb);
+}
+
+/* Return true if REGNO is worth rematerializing.  */
+
+bool
+early_remat::interesting_regno_p (unsigned int regno)
+{
+  /* Ignore unused registers.  */
+  rtx reg = regno_reg_rtx[regno];
+  if (!reg || DF_REG_DEF_COUNT (regno) == 0)
+    return false;
+
+  /* Make sure the register has a mode that we want to rematerialize.  */
+  if (!bitmap_bit_p (m_selected_modes, GET_MODE (reg)))
+    return false;
+
+  /* Ignore values that might sometimes be used uninitialized.  We could
+     instead add dummy candidates for the entry block definition, and so
+     handle uses that are definitely not uninitialized, but the combination
+     of the two should be rare in practice.  */
+  if (bitmap_bit_p (DF_LR_OUT (ENTRY_BLOCK_PTR_FOR_FN (m_fn)), regno))
+    return false;
+
+  return true;
+}
+
+/* Record the set of register REGNO in instruction INSN as a
+   rematerialization candidate.  CAN_COPY_P is true unless we already
+   know that rematerialization is impossible (in which case the candidate
+   only exists for the reaching definition calculation).
+
+   The candidate's index is not fixed at this stage.  */
+
+remat_candidate *
+early_remat::add_candidate (rtx_insn *insn, unsigned int regno,
+			    bool can_copy_p)
+{
+  remat_candidate cand;
+  memset (&cand, 0, sizeof (cand));
+  cand.regno = regno;
+  cand.insn = insn;
+  cand.remat_rtx = PATTERN (insn);
+  cand.can_copy_p = can_copy_p;
+  m_candidates.safe_push (cand);
+
+  bitmap_set_bit (&m_candidate_regnos, regno);
+
+  return &m_candidates.last ();
+}
+
+/* Return true if we can rematerialize the set of register REGNO in
+   instruction INSN, and add it as a candidate if so.  When returning
+   false, print the reason to the dump file.  */
+
+bool
+early_remat::maybe_add_candidate (rtx_insn *insn, unsigned int regno)
+{
+#define FAILURE_FORMAT ";; Can't rematerialize set of reg %d in %d[bb:%d]: "
+#define FAILURE_ARGS regno, INSN_UID (insn), BLOCK_FOR_INSN (insn)->index
+
+  /* The definition must come from an ordinary instruction.  */
+  basic_block bb = BLOCK_FOR_INSN (insn);
+  if (!NONJUMP_INSN_P (insn)
+      || (insn == BB_END (bb)
+	  && has_abnormal_or_eh_outgoing_edge_p (bb)))
+    {
+      if (dump_file)
+	fprintf (dump_file, FAILURE_FORMAT "insn alters control flow\n",
+		 FAILURE_ARGS);
+      return false;
+    }
+
+  /* Prefer to rematerialize constants directly -- it's much easier.  */
+  machine_mode mode = GET_MODE (regno_reg_rtx[regno]);
+  if (rtx note = find_reg_equal_equiv_note (insn))
+    {
+      rtx val = XEXP (note, 0);
+      if (CONSTANT_P (val)
+	  && targetm.legitimate_constant_p (mode, val))
+	{
+	  remat_candidate *cand = add_candidate (insn, regno, true);
+	  cand->constant_p = true;
+	  cand->remat_rtx = val;
+	  return true;
+	}
+    }
+
+  /* See whether the target has reasons to prevent a copy.  */
+  if (targetm.cannot_copy_insn_p && targetm.cannot_copy_insn_p (insn))
+    {
+      if (dump_file)
+	fprintf (dump_file, FAILURE_FORMAT "target forbids copying\n",
+		 FAILURE_ARGS);
+      return false;
+    }
+
+  /* We can't copy trapping instructions.  */
+  rtx pat = PATTERN (insn);
+  if (may_trap_p (pat))
+    {
+      if (dump_file)
+	fprintf (dump_file, FAILURE_FORMAT "insn might trap\n", FAILURE_ARGS);
+      return false;
+    }
+
+  /* We can't copy instructions that read memory, unless we know that
+     the contents never change.  */
+  subrtx_iterator::array_type array;
+  FOR_EACH_SUBRTX (iter, array, pat, ALL)
+    if (MEM_P (*iter) && !MEM_READONLY_P (*iter))
+      {
+	if (dump_file)
+	  fprintf (dump_file, FAILURE_FORMAT "insn references non-constant"
+		   " memory\n", FAILURE_ARGS);
+	return false;
+      }
+
+  /* Check each defined register.  */
+  df_ref ref;
+  FOR_EACH_INSN_DEF (ref, insn)
+    {
+      unsigned int def_regno = DF_REF_REGNO (ref);
+      if (def_regno == regno)
+	{
+	  /* Make sure the definition is write-only.  (Partial definitions,
+	     such as setting the low part and clobbering the high part,
+	     are otherwise OK.)  */
+	  if (DF_REF_FLAGS_IS_SET (ref, DF_REF_READ_WRITE))
+	    {
+	      if (dump_file)
+		fprintf (dump_file, FAILURE_FORMAT "destination is"
+			 " read-modify-write\n", FAILURE_ARGS);
+	      return false;
+	    }
+	}
+      else
+	{
+	  /* The instruction can set additional registers, provided that
+	     they're call-clobbered hard registers.  This is useful for
+	     instructions that alter the condition codes.  */
+	  if (!HARD_REGISTER_NUM_P (def_regno))
+	    {
+	      if (dump_file)
+		fprintf (dump_file, FAILURE_FORMAT "insn also sets"
+			 " pseudo reg %d\n", FAILURE_ARGS, def_regno);
+	      return false;
+	    }
+	  if (global_regs[def_regno])
+	    {
+	      if (dump_file)
+		fprintf (dump_file, FAILURE_FORMAT "insn also sets"
+			 " global reg %d\n", FAILURE_ARGS, def_regno);
+	      return false;
+	    }
+	  if (!TEST_HARD_REG_BIT (regs_invalidated_by_call, def_regno))
+	    {
+	      if (dump_file)
+		fprintf (dump_file, FAILURE_FORMAT "insn also sets"
+			 " call-preserved reg %d\n", FAILURE_ARGS, def_regno);
+	      return false;
+	    }
+	}
+    }
+
+  /* If the instruction uses fixed hard registers, check that those
+     registers have the same value throughout the function.  If the
+     instruction uses non-fixed hard registers, check that we can
+     replace them with pseudos.  */
+  FOR_EACH_INSN_USE (ref, insn)
+    {
+      unsigned int use_regno = DF_REF_REGNO (ref);
+      if (HARD_REGISTER_NUM_P (use_regno) && fixed_regs[use_regno])
+	{
+	  if (rtx_unstable_p (DF_REF_REAL_REG (ref)))
+	    {
+	      if (dump_file)
+		fprintf (dump_file, FAILURE_FORMAT "insn uses fixed hard reg"
+			 " %d\n", FAILURE_ARGS, use_regno);
+	      return false;
+	    }
+	}
+      else if (HARD_REGISTER_NUM_P (use_regno))
+	{
+	  /* Allocate a dummy pseudo register and temporarily install it.
+	     Make the register number depend on the mode, which should
+	     provide enough sharing for match_dup while also weeding
+	     out cases in which operands with different modes are
+	     explicitly tied.  */
+	  rtx *loc = DF_REF_REAL_LOC (ref);
+	  unsigned int size = RTX_CODE_SIZE (REG);
+	  rtx new_reg = (rtx) alloca (size);
+	  memset (new_reg, 0, size);
+	  PUT_CODE (new_reg, REG);
+	  set_mode_and_regno (new_reg, GET_MODE (*loc),
+			      LAST_VIRTUAL_REGISTER + 1 + GET_MODE (*loc));
+	  validate_change (insn, loc, new_reg, 1);
+	}
+    }
+  bool ok_p = verify_changes (0);
+  cancel_changes (0);
+  if (!ok_p)
+    {
+      if (dump_file)
+	fprintf (dump_file, FAILURE_FORMAT "insn does not allow hard"
+		 " register inputs to be replaced\n", FAILURE_ARGS);
+      return false;
+    }
+
+#undef FAILURE_ARGS
+#undef FAILURE_FORMAT
+
+  add_candidate (insn, regno, true);
+  return true;
+}
+
+/* Calculate the set of rematerialization candidates.  Return true if
+   we find at least one.  */
+
+bool
+early_remat::collect_candidates (void)
+{
+  unsigned int nregs = DF_REG_SIZE (df);
+  for (unsigned int regno = FIRST_PSEUDO_REGISTER; regno < nregs; ++regno)
+    if (interesting_regno_p (regno))
+      {
+	/* Create candidates for all suitable definitions.  */
+	bitmap_clear (&m_tmp_bitmap);
+	unsigned int bad = 0;
+	unsigned int id = 0;
+	for (df_ref ref = DF_REG_DEF_CHAIN (regno); ref;
+	     ref = DF_REF_NEXT_REG (ref))
+	  {
+	    rtx_insn *insn = DF_REF_INSN (ref);
+	    if (maybe_add_candidate (insn, regno))
+	      bitmap_set_bit (&m_tmp_bitmap, id);
+	    else
+	      bad += 1;
+	    id += 1;
+	  }
+
+	/* If we found at least one suitable definition, add dummy
+	   candidates for the rest, so that we can see which definitions
+	   are live where.  */
+	if (!bitmap_empty_p (&m_tmp_bitmap) && bad)
+	  {
+	    id = 0;
+	    for (df_ref ref = DF_REG_DEF_CHAIN (regno); ref;
+		 ref = DF_REF_NEXT_REG (ref))
+	      {
+		if (!bitmap_bit_p (&m_tmp_bitmap, id))
+		  add_candidate (DF_REF_INSN (ref), regno, false);
+		id += 1;
+	      }
+	  }
+      }
+
+
+  return !m_candidates.is_empty ();
+}
+
+/* Initialize the m_block_info array.  */
+
+void
+early_remat::init_block_info (void)
+{
+  unsigned int n_blocks = last_basic_block_for_fn (m_fn);
+  m_block_info.safe_grow_cleared (n_blocks);
+}
+
+/* Maps basic block indices to their position in the post order.  */
+static unsigned int *postorder_index;
+
+/* Order remat_candidates X_IN and Y_IN according to the cfg postorder.  */
+
+static int
+compare_candidates (const void *x_in, const void *y_in)
+{
+  const remat_candidate *x = (const remat_candidate *) x_in;
+  const remat_candidate *y = (const remat_candidate *) y_in;
+  basic_block x_bb = BLOCK_FOR_INSN (x->insn);
+  basic_block y_bb = BLOCK_FOR_INSN (y->insn);
+  if (x_bb != y_bb)
+    /* Make X and Y follow block postorder.  */
+    return postorder_index[x_bb->index] - postorder_index[y_bb->index];
+
+  /* Make X and Y follow a backward traversal of the containing block.  */
+  return DF_INSN_LUID (y->insn) - DF_INSN_LUID (x->insn);
+}
+
+/* Sort the collected rematerialization candidates so that they follow
+   cfg postorder.  */
+
+void
+early_remat::sort_candidates (void)
+{
+  /* Make sure the DF LUIDs are up-to-date for all the blocks we
+     care about.  */
+  bitmap_clear (&m_tmp_bitmap);
+  unsigned int cand_index;
+  remat_candidate *cand;
+  FOR_EACH_VEC_ELT (m_candidates, cand_index, cand)
+    {
+      basic_block bb = BLOCK_FOR_INSN (cand->insn);
+      if (bitmap_set_bit (&m_tmp_bitmap, bb->index))
+	df_recompute_luids (bb);
+    }
+
+  /* Create a mapping from block numbers to their position in the
+     postorder.  */
+  unsigned int n_blocks = last_basic_block_for_fn (m_fn);
+  int *postorder = df_get_postorder (DF_BACKWARD);
+  unsigned int postorder_len = df_get_n_blocks (DF_BACKWARD);
+  postorder_index = new unsigned int[n_blocks];
+  for (unsigned int i = 0; i < postorder_len; ++i)
+    postorder_index[postorder[i]] = i;
+
+  m_candidates.qsort (compare_candidates);
+
+  delete postorder_index;
+}
+
+/* Commit to the current candidate indices and initialize cross-references.  */
+
+void
+early_remat::finalize_candidate_indices (void)
+{
+  /* Create a bitmap for each candidate register.  */
+  m_regno_to_candidates.safe_grow (max_reg_num ());
+  unsigned int regno;
+  bitmap_iterator bi;
+  EXECUTE_IF_SET_IN_BITMAP (&m_candidate_regnos, 0, regno, bi)
+    m_regno_to_candidates[regno] = alloc_bitmap ();
+
+  /* Go through each candidate and record its index.  */
+  unsigned int cand_index;
+  remat_candidate *cand;
+  FOR_EACH_VEC_ELT (m_candidates, cand_index, cand)
+    {
+      basic_block bb = BLOCK_FOR_INSN (cand->insn);
+      remat_block_info *info = &m_block_info[bb->index];
+      info->num_candidates += 1;
+      info->first_candidate = cand_index;
+      bitmap_set_bit (m_regno_to_candidates[cand->regno], cand_index);
+    }
+}
+
+/* Record that candidates CAND1_INDEX and CAND2_INDEX are equivalent.
+   CAND1_INDEX might already have an equivalence class, but CAND2_INDEX
+   doesn't.  */
+
+void
+early_remat::record_equiv_candidates (unsigned int cand1_index,
+				      unsigned int cand2_index)
+{
+  if (dump_file)
+    fprintf (dump_file, ";; Candidate %d is equivalent to candidate %d\n",
+	     cand2_index, cand1_index);
+
+  remat_candidate *cand1 = &m_candidates[cand1_index];
+  remat_candidate *cand2 = &m_candidates[cand2_index];
+  gcc_checking_assert (!cand2->equiv_class);
+
+  remat_equiv_class *ec = cand1->equiv_class;
+  if (!ec)
+    {
+      ec = XOBNEW (&m_obstack.obstack, remat_equiv_class);
+      ec->members = alloc_bitmap ();
+      bitmap_set_bit (ec->members, cand1_index);
+      ec->earliest = cand1_index;
+      ec->representative = cand1_index;
+      cand1->equiv_class = ec;
+    }
+  cand1 = &m_candidates[ec->representative];
+  cand2->equiv_class = ec;
+  bitmap_set_bit (ec->members, cand2_index);
+  if (cand2_index > ec->representative)
+    ec->representative = cand2_index;
+}
+
+/* Propagate information from the rd_out set of E->src to the rd_in set
+   of E->dest, when computing global reaching definitions.  Return true
+   if something changed.  */
+
+bool
+early_remat::rd_confluence_n (edge e)
+{
+  remat_block_info *src = &er->m_block_info[e->src->index];
+  remat_block_info *dest = &er->m_block_info[e->dest->index];
+
+  /* available_in temporarily contains the set of candidates whose
+     registers are live on entry.  */
+  if (empty_p (src->rd_out) || empty_p (dest->available_in))
+    return false;
+
+  return bitmap_ior_and_into (er->get_bitmap (&dest->rd_in),
+			      src->rd_out, dest->available_in);
+}
+
+/* Propagate information from the rd_in set of block BB_INDEX to rd_out.
+   Return true if something changed.  */
+
+bool
+early_remat::rd_transfer (int bb_index)
+{
+  remat_block_info *info = &er->m_block_info[bb_index];
+
+  if (empty_p (info->rd_in))
+    return false;
+
+  if (empty_p (info->rd_kill))
+    {
+      gcc_checking_assert (empty_p (info->rd_gen));
+      if (!info->rd_out)
+	info->rd_out = info->rd_in;
+      else
+	gcc_checking_assert (info->rd_out == info->rd_in);
+      /* Assume that we only get called if something changed.  */
+      return true;
+    }
+
+  if (empty_p (info->rd_gen))
+    return bitmap_and_compl (er->get_bitmap (&info->rd_out),
+			     info->rd_in, info->rd_kill);
+
+  return bitmap_ior_and_compl (er->get_bitmap (&info->rd_out), info->rd_gen,
+			       info->rd_in, info->rd_kill);
+}
+
+/* Calculate the rd_* sets for each block.  */
+
+void
+early_remat::compute_rd (void)
+{
+  /* First calculate the rd_kill and rd_gen sets, using the fact
+     that m_candidates is sorted in order of decreasing LUID.  */
+  unsigned int cand_index;
+  remat_candidate *cand;
+  FOR_EACH_VEC_ELT_REVERSE (m_candidates, cand_index, cand)
+    {
+      rtx_insn *insn = cand->insn;
+      basic_block bb = BLOCK_FOR_INSN (insn);
+      remat_block_info *info = &m_block_info[bb->index];
+      bitmap kill = m_regno_to_candidates[cand->regno];
+      bitmap_ior_into (get_bitmap (&info->rd_kill), kill);
+      if (bitmap_bit_p (DF_LR_OUT (bb), cand->regno))
+	{
+	  bitmap_and_compl_into (get_bitmap (&info->rd_gen), kill);
+	  bitmap_set_bit (info->rd_gen, cand_index);
+	}
+    }
+
+  /* Set up the initial values of the other sets.  */
+  basic_block bb;
+  FOR_EACH_BB_FN (bb, m_fn)
+    {
+      remat_block_info *info = &m_block_info[bb->index];
+      unsigned int regno;
+      bitmap_iterator bi;
+      EXECUTE_IF_AND_IN_BITMAP (DF_LR_IN (bb), &m_candidate_regnos,
+				0, regno, bi)
+	{
+	  /* Use available_in to record the set of candidates whose
+	     registers are live on entry (i.e. a maximum bound on rd_in).  */
+	  bitmap_ior_into (get_bitmap (&info->available_in),
+			   m_regno_to_candidates[regno]);
+
+	  /* Add registers that die in a block to the block's kill set,
+	     so that we don't needlessly propagate them through the rest
+	     of the function.  */
+	  if (!bitmap_bit_p (DF_LR_OUT (bb), regno))
+	    bitmap_ior_into (get_bitmap (&info->rd_kill),
+			     m_regno_to_candidates[regno]);
+	}
+
+      /* Initialize each block's rd_out to the minimal set (the set of
+	 local definitions).  */
+      if (!empty_p (info->rd_gen))
+	bitmap_copy (get_bitmap (&info->rd_out), info->rd_gen);
+    }
+
+  /* Iterate until we reach a fixed point.  */
+  er = this;
+  bitmap_clear (&m_tmp_bitmap);
+  bitmap_set_range (&m_tmp_bitmap, 0, last_basic_block_for_fn (m_fn));
+  df_simple_dataflow (DF_FORWARD, NULL, NULL, rd_confluence_n, rd_transfer,
+		      &m_tmp_bitmap, df_get_postorder (DF_FORWARD),
+		      df_get_n_blocks (DF_FORWARD));
+  er = 0;
+
+  /* Work out which definitions reach which candidates, again taking
+     advantage of the candidate order.  */
+  bitmap_head reaching;
+  bitmap_initialize (&reaching, &m_obstack);
+  basic_block old_bb = NULL;
+  FOR_EACH_VEC_ELT_REVERSE (m_candidates, cand_index, cand)
+    {
+      bb = BLOCK_FOR_INSN (cand->insn);
+      if (bb != old_bb)
+	{
+	  /* Get the definitions that reach the start of the new block.  */
+	  remat_block_info *info = &m_block_info[bb->index];
+	  if (info->rd_in)
+	    bitmap_copy (&reaching, info->rd_in);
+	  else
+	    bitmap_clear (&reaching);
+	  old_bb = bb;
+	}
+      else
+	{
+	  /* Process the definitions of the previous instruction.  */
+	  bitmap kill = m_regno_to_candidates[cand[1].regno];
+	  bitmap_and_compl_into (&reaching, kill);
+	  bitmap_set_bit (&reaching, cand_index + 1);
+	}
+
+      if (cand->can_copy_p && !cand->constant_p)
+	{
+	  df_ref ref;
+	  FOR_EACH_INSN_USE (ref, cand->insn)
+	    {
+	      unsigned int regno = DF_REF_REGNO (ref);
+	      if (bitmap_bit_p (&m_candidate_regnos, regno))
+		{
+		  bitmap defs = m_regno_to_candidates[regno];
+		  bitmap_and (&m_tmp_bitmap, defs, &reaching);
+		  bitmap_ior_into (get_bitmap (&cand->uses), &m_tmp_bitmap);
+		}
+	    }
+	}
+    }
+  bitmap_clear (&reaching);
+}
+
+/* If CAND_INDEX is in an equivalence class, return the representative
+   of the class, otherwise return CAND_INDEX.  */
+
+inline unsigned int
+early_remat::canon_candidate (unsigned int cand_index)
+{
+  if (remat_equiv_class *ec = m_candidates[cand_index].equiv_class)
+    return ec->representative;
+  return cand_index;
+}
+
+/* Make candidate set *PTR refer to candidates using the representative
+   of each equivalence class.  */
+
+void
+early_remat::canon_bitmap (bitmap *ptr)
+{
+  bitmap old_set = *ptr;
+  if (empty_p (old_set))
+    return;
+
+  bitmap new_set = NULL;
+  unsigned int old_index;
+  bitmap_iterator bi;
+  EXECUTE_IF_SET_IN_BITMAP (old_set, 0, old_index, bi)
+    {
+      unsigned int new_index = canon_candidate (old_index);
+      if (old_index != new_index)
+	{
+	  if (!new_set)
+	    {
+	      new_set = alloc_bitmap ();
+	      bitmap_copy (new_set, old_set);
+	    }
+	  bitmap_clear_bit (new_set, old_index);
+	  bitmap_set_bit (new_set, new_index);
+	}
+    }
+  if (new_set)
+    {
+      BITMAP_FREE (*ptr);
+      *ptr = new_set;
+    }
+}
+
+/* If the candidates in REACHING all have the same value, return the
+   earliest instance of that value (i.e. the first one to be added
+   to m_value_table), otherwise return MULTIPLE_CANDIDATES.  */
+
+unsigned int
+early_remat::resolve_reaching_def (bitmap reaching)
+{
+  unsigned int cand_index = bitmap_first_set_bit (reaching);
+  if (remat_equiv_class *ec = m_candidates[cand_index].equiv_class)
+    {
+      if (!bitmap_intersect_compl_p (reaching, ec->members))
+	return ec->earliest;
+    }
+  else if (bitmap_single_bit_set_p (reaching))
+    return cand_index;
+
+  return MULTIPLE_CANDIDATES;
+}
+
+/* Check whether all candidate registers used by candidate CAND_INDEX have
+   unique definitions.  Return true if so, replacing the candidate's uses
+   set with the appropriate form for value numbering.  */
+
+bool
+early_remat::check_candidate_uses (unsigned int cand_index)
+{
+  remat_candidate *cand = &m_candidates[cand_index];
+
+  /* Process the uses for each register in turn.  */
+  bitmap_head uses;
+  bitmap_initialize (&uses, &m_obstack);
+  bitmap_copy (&uses, cand->uses);
+  bitmap uses_ec = alloc_bitmap ();
+  while (!bitmap_empty_p (&uses))
+    {
+      /* Get the register for the lowest-indexed candidate remaining,
+	 and the reaching definitions of that register.  */
+      unsigned int first = bitmap_first_set_bit (&uses);
+      unsigned int regno = m_candidates[first].regno;
+      bitmap_and (&m_tmp_bitmap, &uses, m_regno_to_candidates[regno]);
+
+      /* See whether all reaching definitions have the same value and if
+	 so get the index of the first candidate we saw with that value.  */
+      unsigned int def = resolve_reaching_def (&m_tmp_bitmap);
+      if (def == MULTIPLE_CANDIDATES)
+	{
+	  if (dump_file)
+	    fprintf (dump_file, ";; Removing candidate %d because there is"
+		     " more than one reaching definition of reg %d\n",
+		     cand_index, regno);
+	  cand->can_copy_p = false;
+	  break;
+	}
+      bitmap_set_bit (uses_ec, def);
+      bitmap_and_compl_into (&uses, &m_tmp_bitmap);
+    }
+  BITMAP_FREE (cand->uses);
+  cand->uses = uses_ec;
+  return cand->can_copy_p;
+}
+
+/* Calculate the set of hard registers that would be clobbered by
+   rematerializing candidate CAND_INDEX.  At this point the candidate's
+   set of uses is final.  */
+
+void
+early_remat::compute_clobbers (unsigned int cand_index)
+{
+  remat_candidate *cand = &m_candidates[cand_index];
+  if (cand->uses)
+    {
+      unsigned int use_index;
+      bitmap_iterator bi;
+      EXECUTE_IF_SET_IN_BITMAP (cand->uses, 0, use_index, bi)
+	if (bitmap clobbers = m_candidates[use_index].clobbers)
+	  bitmap_ior_into (get_bitmap (&cand->clobbers), clobbers);
+    }
+
+  df_ref ref;
+  FOR_EACH_INSN_DEF (ref, cand->insn)
+    {
+      unsigned int def_regno = DF_REF_REGNO (ref);
+      if (def_regno != cand->regno)
+	bitmap_set_bit (get_bitmap (&cand->clobbers), def_regno);
+    }
+}
+
+/* Mark candidate CAND_INDEX as validated and add it to the value table.  */
+
+void
+early_remat::assign_value_number (unsigned int cand_index)
+{
+  remat_candidate *cand = &m_candidates[cand_index];
+  gcc_checking_assert (cand->can_copy_p && !cand->validated_p);
+
+  compute_clobbers (cand_index);
+  cand->validated_p = true;
+
+  inchash::hash h;
+  h.add_int (cand->regno);
+  inchash::add_rtx (cand->remat_rtx, h);
+  cand->hash = h.end ();
+
+  remat_candidate **slot
+    = m_value_table.find_slot_with_hash (cand, cand->hash, INSERT);
+  if (!*slot)
+    {
+      *slot = cand;
+      if (dump_file)
+	fprintf (dump_file, ";; Candidate %d is not equivalent to"
+		 " others seen so far\n", cand_index);
+    }
+  else
+    record_equiv_candidates (*slot - m_candidates.address (), cand_index);
+}
+
+/* Make a final decision about which candidates are valid and assign
+   value numbers to those that are.  */
+
+void
+early_remat::decide_candidate_validity (void)
+{
+  auto_vec<unsigned int, 16> stack;
+  unsigned int cand1_index;
+  remat_candidate *cand1;
+  FOR_EACH_VEC_ELT_REVERSE (m_candidates, cand1_index, cand1)
+    {
+      if (!cand1->can_copy_p || cand1->validated_p)
+	continue;
+
+      if (empty_p (cand1->uses))
+	{
+	  assign_value_number (cand1_index);
+	  continue;
+	}
+
+      stack.safe_push (cand1_index);
+      while (!stack.is_empty ())
+	{
+	  unsigned int cand2_index = stack.last ();
+	  unsigned int watermark = stack.length ();
+	  remat_candidate *cand2 = &m_candidates[cand2_index];
+	  if (!cand2->can_copy_p || cand2->validated_p)
+	    {
+	      stack.pop ();
+	      continue;
+	    }
+	  cand2->visited_p = true;
+	  unsigned int cand3_index;
+	  bitmap_iterator bi;
+	  EXECUTE_IF_SET_IN_BITMAP (cand2->uses, 0, cand3_index, bi)
+	    {
+	      remat_candidate *cand3 = &m_candidates[cand3_index];
+	      if (!cand3->can_copy_p)
+		{
+		  if (dump_file)
+		    fprintf (dump_file, ";; Removing candidate %d because"
+			     " it uses removed candidate %d\n", cand2_index,
+			     cand3_index);
+		  cand2->can_copy_p = false;
+		  break;
+		}
+	      if (!cand3->validated_p)
+		{
+		  if (empty_p (cand3->uses))
+		    assign_value_number (cand3_index);
+		  else if (cand3->visited_p)
+		    {
+		      if (dump_file)
+			fprintf (dump_file, ";; Removing candidate %d"
+				 " because its definition is cyclic\n",
+				 cand2_index);
+		      cand2->can_copy_p = false;
+		      break;
+		    }
+		  else
+		    stack.safe_push (cand3_index);
+		}
+	    }
+	  if (!cand2->can_copy_p)
+	    {
+	      cand2->visited_p = false;
+	      stack.truncate (watermark - 1);
+	    }
+	  else if (watermark == stack.length ())
+	    {
+	      cand2->visited_p = false;
+	      if (check_candidate_uses (cand2_index))
+		assign_value_number (cand2_index);
+	      stack.pop ();
+	    }
+	}
+    }
+
+  /* Ensure that the candidates always use the same candidate index
+     to refer to an equivalence class.  */
+  FOR_EACH_VEC_ELT_REVERSE (m_candidates, cand1_index, cand1)
+    if (cand1->can_copy_p && !empty_p (cand1->uses))
+      {
+	canon_bitmap (&cand1->uses);
+	gcc_checking_assert (bitmap_first_set_bit (cand1->uses) > cand1_index);
+      }
+}
+
+/* Assuming that every path reaching a point P contains a copy of a
+   use U of REGNO, return true if another copy of U at P would have
+   access to the same value of REGNO.  */
+
+bool
+early_remat::stable_use_p (unsigned int regno)
+{
+  /* Conservatively assume not for hard registers.  */
+  if (HARD_REGISTER_NUM_P (regno))
+    return false;
+
+  /* See if REGNO has a single definition and is never used uninitialized.
+     In this case the definition of REGNO dominates the common dominator
+     of the uses U, which in turn dominates P.  */
+  if (DF_REG_DEF_COUNT (regno) == 1
+      && !bitmap_bit_p (DF_LR_OUT (ENTRY_BLOCK_PTR_FOR_FN (m_fn)), regno))
+    return true;
+
+  return false;
+}
+
+/* Emit a copy from DEST to SRC before candidate CAND_INDEX's instruction.  */
+
+void
+early_remat::emit_copy_before (unsigned int cand_index, rtx dest, rtx src)
+{
+  remat_candidate *cand = &m_candidates[cand_index];
+  if (dump_file)
+    {
+      fprintf (dump_file, ";; Stabilizing insn ");
+      dump_insn_id (cand->insn);
+      fprintf (dump_file, " by copying source reg %d:%s to temporary reg %d\n",
+	       REGNO (src), GET_MODE_NAME (GET_MODE (src)), REGNO (dest));
+    }
+  emit_insn_before (gen_move_insn (dest, src), cand->insn);
+}
+
+/* Check whether any inputs to candidate CAND_INDEX's instruction could
+   change at rematerialization points and replace them with new pseudo
+   registers if so.  */
+
+void
+early_remat::stabilize_pattern (unsigned int cand_index)
+{
+  remat_candidate *cand = &m_candidates[cand_index];
+  if (cand->stabilized_p)
+    return;
+
+  remat_equiv_class *ec = cand->equiv_class;
+  gcc_checking_assert (!ec || cand_index == ec->representative);
+
+  /* Record the replacements we've made so far, so that we don't
+     create two separate registers for match_dups.  Lookup is O(n),
+     but the n is very small.  */
+  typedef std::pair<rtx, rtx> reg_pair;
+  auto_vec<reg_pair, 16> reg_map;
+
+  rtx_insn *insn = cand->insn;
+  df_ref ref;
+  FOR_EACH_INSN_USE (ref, insn)
+    {
+      unsigned int old_regno = DF_REF_REGNO (ref);
+      rtx *loc = DF_REF_REAL_LOC (ref);
+
+      if (HARD_REGISTER_NUM_P (old_regno) && fixed_regs[old_regno])
+	{
+	  /* We checked when adding the candidate that the value is stable.  */
+	  gcc_checking_assert (!rtx_unstable_p (*loc));
+	  continue;
+	}
+
+      if (bitmap_bit_p (&m_candidate_regnos, old_regno))
+	/* We already know which candidate provides the definition
+	   and will handle it during copying.  */
+	continue;
+
+      if (stable_use_p (old_regno))
+	/* We can continue to use the existing register.  */
+	continue;
+
+      /* We need to replace the register.  See whether we've already
+	 created a suitable copy.  */
+      rtx old_reg = *loc;
+      rtx new_reg = NULL_RTX;
+      machine_mode mode = GET_MODE (old_reg);
+      reg_pair *p;
+      unsigned int pi;
+      FOR_EACH_VEC_ELT (reg_map, pi, p)
+	if (REGNO (p->first) == old_regno
+	    && GET_MODE (p->first) == mode)
+	  {
+	    new_reg = p->second;
+	    break;
+	  }
+
+      if (!new_reg)
+	{
+	  /* Create a new register and initialize it just before
+	     the instruction.  */
+	  new_reg = gen_reg_rtx (mode);
+	  reg_map.safe_push (reg_pair (old_reg, new_reg));
+	  if (ec)
+	    {
+	      unsigned int member_index;
+	      bitmap_iterator bi;
+	      EXECUTE_IF_SET_IN_BITMAP (ec->members, 0, member_index, bi)
+		emit_copy_before (member_index, new_reg, old_reg);
+	    }
+	  else
+	    emit_copy_before (cand_index, new_reg, old_reg);
+	}
+      validate_change (insn, loc, new_reg, true);
+    }
+  if (num_changes_pending ())
+    {
+      if (!apply_change_group ())
+	/* We checked when adding the candidates that the pattern allows
+	   hard registers to be replaced.  Nothing else should make the
+	   changes invalid.  */
+	gcc_unreachable ();
+
+      if (ec)
+	{
+	  /* Copy the new pattern to other members of the equivalence
+	     class.  */
+	  unsigned int member_index;
+	  bitmap_iterator bi;
+	  EXECUTE_IF_SET_IN_BITMAP (ec->members, 0, member_index, bi)
+	    if (cand_index != member_index)
+	      {
+		rtx_insn *other_insn = m_candidates[member_index].insn;
+		if (!validate_change (other_insn, &PATTERN (other_insn),
+				      copy_insn (PATTERN (insn)), 0))
+		  /* If the original instruction was valid then the copy
+		     should be too.  */
+		  gcc_unreachable ();
+	      }
+	}
+    }
+
+  cand->stabilized_p = true;
+}
+
+/* Change CAND's instruction so that it sets CAND->copy_regno instead
+   of CAND->regno.  */
+
+void
+early_remat::replace_dest_with_copy (unsigned int cand_index)
+{
+  remat_candidate *cand = &m_candidates[cand_index];
+  df_ref def;
+  FOR_EACH_INSN_DEF (def, cand->insn)
+    if (DF_REF_REGNO (def) == cand->regno)
+      validate_change (cand->insn, DF_REF_REAL_LOC (def),
+		       regno_reg_rtx[cand->copy_regno], 1);
+}
+
+/* Make sure that the candidates used by candidate CAND_INDEX are available.
+   There are two ways of doing this for an input candidate I:
+
+   (1) Using the existing register number and ensuring that I is available.
+
+   (2) Using a new register number (recorded in copy_regno) and adding I
+       to VIA_COPY.  This guarantees that making I available does not
+       conflict with other uses of the original register.
+
+   REQUIRED is the set of candidates that are required but not available
+   before the copy of CAND_INDEX.  AVAILABLE is the set of candidates
+   that are already available before the copy of CAND_INDEX.  REACHING
+   is the set of candidates that reach the copy of CAND_INDEX.  VIA_COPY
+   is the set of candidates that will use new register numbers recorded
+   in copy_regno instead of the original ones.  */
+
+void
+early_remat::stabilize_candidate_uses (unsigned int cand_index,
+				       bitmap required, bitmap available,
+				       bitmap reaching, bitmap via_copy)
+{
+  remat_candidate *cand = &m_candidates[cand_index];
+  df_ref use;
+  FOR_EACH_INSN_USE (use, cand->insn)
+    {
+      unsigned int regno = DF_REF_REGNO (use);
+      if (!bitmap_bit_p (&m_candidate_regnos, regno))
+	continue;
+
+      /* Work out which candidate provides the definition.  */
+      bitmap defs = m_regno_to_candidates[regno];
+      bitmap_and (&m_tmp_bitmap, cand->uses, defs);
+      gcc_checking_assert (bitmap_single_bit_set_p (&m_tmp_bitmap));
+      unsigned int def_index = bitmap_first_set_bit (&m_tmp_bitmap);
+
+      /* First see if DEF_INDEX is the only reaching definition of REGNO
+	 at this point too and if it is or will become available.  We can
+	 continue to use REGNO if so.  */
+      bitmap_and (&m_tmp_bitmap, reaching, defs);
+      if (bitmap_single_bit_set_p (&m_tmp_bitmap)
+	  && bitmap_first_set_bit (&m_tmp_bitmap) == def_index
+	  && ((available && bitmap_bit_p (available, def_index))
+	      || bitmap_bit_p (required, def_index)))
+	{
+	  if (dump_file)
+	    fprintf (dump_file, ";; Keeping reg %d for use of candidate %d"
+		     " in candidate %d\n", regno, def_index, cand_index);
+	  continue;
+	}
+
+      /* Otherwise fall back to using a copy.  There are other cases
+	 in which we *could* continue to use REGNO, but there's not
+	 really much point.  Using a separate register ought to make
+	 things easier for the register allocator.  */
+      remat_candidate *def_cand = &m_candidates[def_index];
+      rtx *loc = DF_REF_REAL_LOC (use);
+      rtx new_reg;
+      if (bitmap_set_bit (via_copy, def_index))
+	{
+	  new_reg = gen_reg_rtx (GET_MODE (*loc));
+	  def_cand->copy_regno = REGNO (new_reg);
+	  if (dump_file)
+	    fprintf (dump_file, ";; Creating reg %d for use of candidate %d"
+		     " in candidate %d\n", REGNO (new_reg), def_index,
+		     cand_index);
+	}
+      else
+	new_reg = regno_reg_rtx[def_cand->copy_regno];
+      validate_change (cand->insn, loc, new_reg, 1);
+    }
+}
+
+/* Rematerialize the candidates in REQUIRED after instruction INSN,
+   given that the candidates in AVAILABLE are already available
+   and that REACHING is the set of candidates live after INSN.
+   REQUIRED and AVAILABLE are disjoint on entry.
+
+   Clear REQUIRED on exit.  */
+
+void
+early_remat::emit_remat_insns (bitmap required, bitmap available,
+			       bitmap reaching, rtx_insn *insn)
+{
+  /* Quick exit if there's nothing to do.  */
+  if (empty_p (required))
+    return;
+
+  /* Only reaching definitions should be available or required.  */
+  gcc_checking_assert (!bitmap_intersect_compl_p (required, reaching));
+  if (available)
+    gcc_checking_assert (!bitmap_intersect_compl_p (available, reaching));
+
+  bitmap_head via_copy;
+  bitmap_initialize (&via_copy, &m_obstack);
+  while (!bitmap_empty_p (required) || !bitmap_empty_p (&via_copy))
+    {
+      /* Pick the lowest-indexed candidate left.  */
+      unsigned int required_index = (bitmap_empty_p (required)
+				     ? ~0U : bitmap_first_set_bit (required));
+      unsigned int via_copy_index = (bitmap_empty_p (&via_copy)
+				     ? ~0U : bitmap_first_set_bit (&via_copy));
+      unsigned int cand_index = MIN (required_index, via_copy_index);
+      remat_candidate *cand = &m_candidates[cand_index];
+
+      bool via_copy_p = (cand_index == via_copy_index);
+      if (via_copy_p)
+	bitmap_clear_bit (&via_copy, cand_index);
+      else
+	{
+	  /* Remove all candidates for the same register from REQUIRED.  */
+	  bitmap_and (&m_tmp_bitmap, reaching,
+		      m_regno_to_candidates[cand->regno]);
+	  bitmap_and_compl_into (required, &m_tmp_bitmap);
+	  gcc_checking_assert (!bitmap_bit_p (required, cand_index));
+
+	  /* Only rematerialize if we have a single reaching definition
+	     of the register.  */
+	  if (!bitmap_single_bit_set_p (&m_tmp_bitmap))
+	    {
+	      if (dump_file)
+		{
+		  fprintf (dump_file, ";; Can't rematerialize reg %d after ",
+			   cand->regno);
+		  dump_insn_id (insn);
+		  fprintf (dump_file, ": more than one reaching definition\n");
+		}
+	      continue;
+	    }
+
+	  /* Skip candidates that can't be rematerialized.  */
+	  if (!cand->can_copy_p)
+	    continue;
+
+	  /* Check the function precondition.  */
+	  gcc_checking_assert (!available
+			       || !bitmap_bit_p (available, cand_index));
+	}
+
+      /* Invalid candidates should have been weeded out by now.  */
+      gcc_assert (cand->can_copy_p);
+
+      rtx new_pattern;
+      if (cand->constant_p)
+	{
+	  /* Emit a simple move.  */
+	  unsigned int regno = via_copy_p ? cand->copy_regno : cand->regno;
+	  new_pattern = gen_move_insn (regno_reg_rtx[regno], cand->remat_rtx);
+	}
+      else
+	{
+	  /* If this is the first time we've copied the instruction, make
+	     sure that any inputs will have the same value after INSN.  */
+	  stabilize_pattern (cand_index);
+
+	  /* Temporarily adjust the original instruction so that it has
+	     the right form for the copy.  */
+	  if (via_copy_p)
+	    replace_dest_with_copy (cand_index);
+	  if (cand->uses)
+	    stabilize_candidate_uses (cand_index, required, available,
+				      reaching, &via_copy);
+
+	  /* Get the new instruction pattern.  */
+	  new_pattern = copy_insn (cand->remat_rtx);
+
+	  /* Undo the temporary changes.  */
+	  cancel_changes (0);
+	}
+
+      /* Emit the new instruction.  */
+      rtx_insn *new_insn = emit_insn_after (new_pattern, insn);
+
+      if (dump_file)
+	{
+	  fprintf (dump_file, ";; Rematerializing candidate %d after ",
+		   cand_index);
+	  dump_insn_id (insn);
+	  if (via_copy_p)
+	    fprintf (dump_file, " with new destination reg %d",
+		     cand->copy_regno);
+	  fprintf (dump_file, ":\n\n");
+	  print_rtl_single (dump_file, new_insn);
+	  fprintf (dump_file, "\n");
+	}
+    }
+}
+
+/* Recompute INFO's available_out set, given that it's distinct from
+   available_in and available_locally.  */
+
+bool
+early_remat::set_available_out (remat_block_info *info)
+{
+  if (empty_p (info->available_locally))
+    return bitmap_and_compl (get_bitmap (&info->available_out),
+			     info->available_in, info->rd_kill);
+
+  if (empty_p (info->rd_kill))
+    return bitmap_ior (get_bitmap (&info->available_out),
+		       info->available_locally, info->available_in);
+
+  return bitmap_ior_and_compl (get_bitmap (&info->available_out),
+			       info->available_locally, info->available_in,
+			       info->rd_kill);
+}
+
+/* If BB has more than one call, decide which candidates should be
+   rematerialized after the non-final calls and emit the associated
+   instructions.  Record other information about the block in preparation
+   for the global phase.  */
+
+void
+early_remat::process_block (basic_block bb)
+{
+  remat_block_info *info = &m_block_info[bb->index];
+  rtx_insn *last_call = NULL;
+  rtx_insn *insn;
+
+  /* Ensure that we always use the same candidate index to refer to an
+     equivalence class.  */
+  if (info->rd_out == info->rd_in)
+    {
+      canon_bitmap (&info->rd_in);
+      info->rd_out = info->rd_in;
+    }
+  else
+    {
+      canon_bitmap (&info->rd_in);
+      canon_bitmap (&info->rd_out);
+    }
+  canon_bitmap (&info->rd_kill);
+  canon_bitmap (&info->rd_gen);
+
+  /* The set of candidates that should be rematerialized on entry to the
+     block or after the previous call (whichever is more recent).  */
+  init_temp_bitmap (&m_required);
+
+  /* The set of candidates that reach the current instruction (i.e. are
+     live just before the instruction).  */
+  bitmap_head reaching;
+  bitmap_initialize (&reaching, &m_obstack);
+  if (info->rd_in)
+    bitmap_copy (&reaching, info->rd_in);
+
+  /* The set of candidates that are live and available without
+     rematerialization just before the current instruction.  This only
+     accounts for earlier candidates in the block, or those that become
+     available by being added to M_REQUIRED.  */
+  init_temp_bitmap (&m_available);
+
+  /* Get the range of candidates in the block.  */
+  unsigned int next_candidate = info->first_candidate;
+  unsigned int num_candidates = info->num_candidates;
+  remat_candidate *next_def = (num_candidates > 0
+			       ? &m_candidates[next_candidate]
+			       : NULL);
+
+  FOR_BB_INSNS (bb, insn)
+    {
+      if (!NONDEBUG_INSN_P (insn))
+	continue;
+
+      /* First process uses, since this is a forward walk.  */
+      df_ref ref;
+      FOR_EACH_INSN_USE (ref, insn)
+	{
+	  unsigned int regno = DF_REF_REGNO (ref);
+	  if (bitmap_bit_p (&m_candidate_regnos, regno))
+	    {
+	      bitmap defs = m_regno_to_candidates[regno];
+	      bitmap_and (&m_tmp_bitmap, defs, &reaching);
+	      gcc_checking_assert (!bitmap_empty_p (&m_tmp_bitmap));
+	      if (!bitmap_intersect_p (defs, m_available))
+		{
+		  /* There has been no definition of the register since
+		     the last call or the start of the block (whichever
+		     is most recent).  Mark the reaching definitions
+		     as required at that point and thus available here.  */
+		  bitmap_ior_into (m_required, &m_tmp_bitmap);
+		  bitmap_ior_into (m_available, &m_tmp_bitmap);
+		}
+	    }
+	}
+
+      if (CALL_P (insn))
+	{
+	  if (!last_call)
+	    {
+	      /* The first call in the block.  Record which candidates are
+		 required at the start of the block.  */
+	      copy_temp_bitmap (&info->required_in, &m_required);
+	      init_temp_bitmap (&m_required);
+	    }
+	  else
+	    /* The fully-local case: candidates that need to be
+	       rematerialized after a previous call in the block.  */
+	    emit_remat_insns (m_required, NULL, info->rd_after_call,
+			      last_call);
+	  last_call = insn;
+	  bitmap_clear (m_available);
+	  gcc_checking_assert (empty_p (m_required));
+	}
+
+      /* Now process definitions.  */
+      if (next_def && insn == next_def->insn)
+	{
+	  unsigned int gen = canon_candidate (next_candidate);
+
+	  /* Other candidates with the same regno are not available
+	     any more.  */
+	  bitmap kill = m_regno_to_candidates[next_def->regno];
+	  bitmap_and_compl_into (m_available, kill);
+	  bitmap_and_compl_into (&reaching, kill);
+
+	  /* Record that this candidate is available without
+	     rematerialization.  */
+	  bitmap_set_bit (m_available, gen);
+	  bitmap_set_bit (&reaching, gen);
+
+	  /* Find the next candidate in the block.  */
+	  num_candidates -= 1;
+	  next_candidate -= 1;
+	  if (num_candidates > 0)
+	    next_def -= 1;
+	  else
+	    next_def = NULL;
+	}
+
+      if (insn == last_call)
+	bitmap_copy (get_bitmap (&info->rd_after_call), &reaching);
+    }
+  bitmap_clear (&reaching);
+  gcc_checking_assert (num_candidates == 0);
+
+  /* Remove values from the available set if they aren't live (and so
+     aren't interesting to successor blocks).  */
+  if (info->rd_out)
+    bitmap_and_into (m_available, info->rd_out);
+
+  /* Record the accumulated information.  */
+  info->last_call = last_call;
+  info->abnormal_call_p = (last_call
+			   && last_call == BB_END (bb)
+			   && has_abnormal_or_eh_outgoing_edge_p (bb));
+  copy_temp_bitmap (&info->available_locally, &m_available);
+  if (last_call)
+    copy_temp_bitmap (&info->required_after_call, &m_required);
+  else
+    copy_temp_bitmap (&info->required_in, &m_required);
+
+  /* Assume at first that all live-in values are available without
+     rematerialization (i.e. start with the most optimistic assumption).  */
+  if (info->available_in)
+    {
+      if (info->rd_in)
+	bitmap_copy (info->available_in, info->rd_in);
+      else
+	BITMAP_FREE (info->available_in);
+    }
+
+  if (last_call || empty_p (info->available_in))
+    /* The values available on exit from the block are exactly those that
+       are available locally.  This set doesn't change.  */
+    info->available_out = info->available_locally;
+  else if (empty_p (info->available_locally) && empty_p (info->rd_kill))
+    /* The values available on exit are the same as those available on entry.
+       Updating one updates the other.  */
+    info->available_out = info->available_in;
+  else
+    set_available_out (info);
+}
+
+/* Process each block as for process_block, visiting dominators before
+   the blocks they dominate.  */
+
+void
+early_remat::local_phase (void)
+{
+  if (dump_file)
+    fprintf (dump_file, "\n;; Local phase:\n");
+
+  int *postorder = df_get_postorder (DF_BACKWARD);
+  unsigned int postorder_len = df_get_n_blocks (DF_BACKWARD);
+  for (unsigned int i = postorder_len; i-- > 0; )
+    if (postorder[i] >= NUM_FIXED_BLOCKS)
+      process_block (BASIC_BLOCK_FOR_FN (m_fn, postorder[i]));
+}
+
+/* Return true if available values survive across edge E.  */
+
+static inline bool
+available_across_edge_p (edge e)
+{
+  return (e->flags & EDGE_EH) == 0;
+}
+
+/* Propagate information from the available_out set of E->src to the
+   available_in set of E->dest, when computing global availability.
+   Return true if something changed.  */
+
+bool
+early_remat::avail_confluence_n (edge e)
+{
+  remat_block_info *src = &er->m_block_info[e->src->index];
+  remat_block_info *dest = &er->m_block_info[e->dest->index];
+
+  if (!available_across_edge_p (e))
+    return false;
+
+  if (empty_p (dest->available_in))
+    return false;
+
+  if (!src->available_out)
+    {
+      bitmap_clear (dest->available_in);
+      return true;
+    }
+
+  return bitmap_and_into (dest->available_in, src->available_out);
+}
+
+/* Propagate information from the available_in set of block BB_INDEX
+   to available_out.  Return true if something changed.  */
+
+bool
+early_remat::avail_transfer (int bb_index)
+{
+  remat_block_info *info = &er->m_block_info[bb_index];
+
+  if (info->available_out == info->available_locally)
+    return false;
+
+  if (info->available_out == info->available_in)
+    /* Assume that we are only called if the input changed.  */
+    return true;
+
+  return er->set_available_out (info);
+}
+
+/* Compute global availability for the function, starting with the local
+   information computed by local_phase.  */
+
+void
+early_remat::compute_availability (void)
+{
+  /* We use df_simple_dataflow instead of the lcm routines for three reasons:
+
+     (1) it avoids recomputing the traversal order;
+     (2) many of the sets are likely to be sparse, so we don't necessarily
+	 want to use sbitmaps; and
+     (3) it means we can avoid creating an explicit kill set for the call.  */
+  er = this;
+  bitmap_clear (&m_tmp_bitmap);
+  bitmap_set_range (&m_tmp_bitmap, 0, last_basic_block_for_fn (m_fn));
+  df_simple_dataflow (DF_FORWARD, NULL, NULL,
+		      avail_confluence_n, avail_transfer,
+		      &m_tmp_bitmap, df_get_postorder (DF_FORWARD),
+		      df_get_n_blocks (DF_FORWARD));
+  er = 0;
+
+  /* Restrict the required_in sets to values that aren't available.  */
+  basic_block bb;
+  FOR_EACH_BB_FN (bb, m_fn)
+    {
+      remat_block_info *info = &m_block_info[bb->index];
+      if (info->required_in && info->available_in)
+	bitmap_and_compl_into (info->required_in, info->available_in);
+    }
+}
+
+/* Make sure that INFO's available_out and available_in sets are unique.  */
+
+inline void
+early_remat::unshare_available_sets (remat_block_info *info)
+{
+  if (info->available_in && info->available_in == info->available_out)
+    {
+      info->available_in = alloc_bitmap ();
+      bitmap_copy (info->available_in, info->available_out);
+    }
+}
+
+/* Return true if it is possible to move rematerializations from the
+   destination of E to the source of E.  */
+
+inline bool
+early_remat::can_move_across_edge_p (edge e)
+{
+  return (available_across_edge_p (e)
+	  && !m_block_info[e->src->index].abnormal_call_p);
+}
+
+/* Return true if it is cheaper to rematerialize values at the head of
+   block QUERY_BB_INDEX instead of rematerializing in its predecessors.  */
+
+bool
+early_remat::local_remat_cheaper_p (unsigned int query_bb_index)
+{
+  if (m_block_info[query_bb_index].remat_frequency_valid_p)
+    return m_block_info[query_bb_index].local_remat_cheaper_p;
+
+  /* Iteratively compute the cost of rematerializing values in the
+     predecessor blocks, then compare that with the cost of
+     rematerializing at the head of the block.
+
+     A cycle indicates that there is no call on that execution path,
+     so it isn't necessary to rematerialize on that path.  */
+  auto_vec<basic_block, 16> stack;
+  stack.quick_push (BASIC_BLOCK_FOR_FN (m_fn, query_bb_index));
+  while (!stack.is_empty ())
+    {
+      basic_block bb = stack.last ();
+      remat_block_info *info = &m_block_info[bb->index];
+      if (info->remat_frequency_valid_p)
+	{
+	  stack.pop ();
+	  continue;
+	}
+
+      info->visited_p = true;
+      int frequency = 0;
+      bool can_move_p = true;
+      edge e;
+      edge_iterator ei;
+      FOR_EACH_EDGE (e, ei, bb->preds)
+	if (!can_move_across_edge_p (e))
+	  {
+	    can_move_p = false;
+	    break;
+	  }
+	else if (m_block_info[e->src->index].last_call)
+	  /* We'll rematerialize after the call.  */
+	  frequency += e->src->count.to_frequency (m_fn);
+	else if (m_block_info[e->src->index].remat_frequency_valid_p)
+	  /* Add the cost of rematerializing at the head of E->src
+	     or in its predecessors (whichever is cheaper).  */
+	  frequency += m_block_info[e->src->index].remat_frequency;
+	else if (!m_block_info[e->src->index].visited_p)
+	  /* Queue E->src and then revisit this block again.  */
+	  stack.safe_push (e->src);
+
+      /* Come back to this block later if we need to process some of
+	 its predecessors.  */
+      if (stack.last () != bb)
+	continue;
+
+      /* If rematerializing in and before the block have equal cost, prefer
+	 rematerializing in the block.  This should shorten the live range.  */
+      int bb_frequency = bb->count.to_frequency (m_fn);
+      if (!can_move_p || frequency >= bb_frequency)
+	{
+	  info->local_remat_cheaper_p = true;
+	  info->remat_frequency = bb_frequency;
+	}
+      else
+	info->remat_frequency = frequency;
+      info->remat_frequency_valid_p = true;
+      info->visited_p = false;
+      if (dump_file)
+	{
+	  if (!can_move_p)
+	    fprintf (dump_file, ";; Need to rematerialize at the head of"
+		     " block %d; cannot move to predecessors.\n", bb->index);
+	  else
+	    {
+	      fprintf (dump_file, ";; Block %d has frequency %d,"
+		       " rematerializing in predecessors has frequency %d",
+		       bb->index, bb_frequency, frequency);
+	      if (info->local_remat_cheaper_p)
+		fprintf (dump_file, "; prefer to rematerialize"
+			 " in the block\n");
+	      else
+		fprintf (dump_file, "; prefer to rematerialize"
+			 " in predecessors\n");
+	    }
+	}
+      stack.pop ();
+    }
+  return m_block_info[query_bb_index].local_remat_cheaper_p;
+}
+
+/* Return true if we cannot rematerialize candidate CAND_INDEX at the head of
+   block BB_INDEX.  */
+
+bool
+early_remat::need_to_move_candidate_p (unsigned int bb_index,
+				       unsigned int cand_index)
+{
+  remat_block_info *info = &m_block_info[bb_index];
+  remat_candidate *cand = &m_candidates[cand_index];
+  basic_block bb = BASIC_BLOCK_FOR_FN (m_fn, bb_index);
+
+  /* If there is more than one reaching definition of REGNO,
+     we'll need to rematerialize in predecessors instead.  */
+  bitmap_and (&m_tmp_bitmap, info->rd_in, m_regno_to_candidates[cand->regno]);
+  if (!bitmap_single_bit_set_p (&m_tmp_bitmap))
+    {
+      if (dump_file)
+	fprintf (dump_file, ";; Cannot rematerialize %d at the"
+		 " head of block %d because there is more than one"
+		 " reaching definition of reg %d\n", cand_index,
+		 bb_index, cand->regno);
+      return true;
+    }
+
+  /* Likewise if rematerializing CAND here would clobber a live register.  */
+  if (cand->clobbers
+      && bitmap_intersect_p (cand->clobbers, DF_LR_IN (bb)))
+    {
+      if (dump_file)
+	fprintf (dump_file, ";; Cannot rematerialize %d at the"
+		 " head of block %d because it would clobber live"
+		 " registers\n", cand_index, bb_index);
+      return true;
+    }
+
+  return false;
+}
+
+/* Set REQUIRED to the minimum set of candidates that must be rematerialized
+   in predecessors of block BB_INDEX instead of at the start of the block.  */
+
+void
+early_remat::compute_minimum_move_set (unsigned int bb_index,
+				       bitmap required)
+{
+  remat_block_info *info = &m_block_info[bb_index];
+  bitmap_head remaining;
+
+  bitmap_clear (required);
+  bitmap_initialize (&remaining, &m_obstack);
+  bitmap_copy (&remaining, info->required_in);
+  while (!bitmap_empty_p (&remaining))
+    {
+      unsigned int cand_index = bitmap_first_set_bit (&remaining);
+      remat_candidate *cand = &m_candidates[cand_index];
+      bitmap_clear_bit (&remaining, cand_index);
+
+      /* Leave invalid candidates where they are.  */
+      if (!cand->can_copy_p)
+	continue;
+
+      /* Decide whether to move this candidate.  */
+      if (!bitmap_bit_p (required, cand_index))
+	{
+	  if (!need_to_move_candidate_p (bb_index, cand_index))
+	    continue;
+	  bitmap_set_bit (required, cand_index);
+	}
+
+      /* Also move values used by the candidate, so that we don't
+	 rematerialize them twice.  */
+      if (cand->uses)
+	{
+	  bitmap_ior_and_into (required, cand->uses, info->required_in);
+	  bitmap_ior_and_into (&remaining, cand->uses, info->required_in);
+	}
+    }
+}
+
+/* Make the predecessors of BB_INDEX rematerialize the candidates in
+   REQUIRED.  Add any blocks whose required_in set changes to
+   PENDING_BLOCKS.  */
+
+void
+early_remat::move_to_predecessors (unsigned int bb_index, bitmap required,
+				   bitmap pending_blocks)
+{
+  if (empty_p (required))
+    return;
+  remat_block_info *dest_info = &m_block_info[bb_index];
+  basic_block bb = BASIC_BLOCK_FOR_FN (m_fn, bb_index);
+  edge e;
+  edge_iterator ei;
+  FOR_EACH_EDGE (e, ei, bb->preds)
+    {
+      remat_block_info *src_info = &m_block_info[e->src->index];
+
+      /* Restrict the set we add to the reaching definitions.  */
+      bitmap_and (&m_tmp_bitmap, required, src_info->rd_out);
+      if (bitmap_empty_p (&m_tmp_bitmap))
+	continue;
+
+      if (!can_move_across_edge_p (e))
+	{
+	  /* We can't move the rematerialization and we can't do it at
+	     the start of the block either.  In this case we just give up
+	     and rely on spilling to make the values available across E.  */
+	  if (dump_file)
+	    {
+	      fprintf (dump_file, ";; Cannot rematerialize the following"
+		       " candidates in block %d:", e->src->index);
+	      dump_candidate_bitmap (required);
+	      fprintf (dump_file, "\n");
+	    }
+	  continue;
+	}
+
+      /* Remove candidates that are already available.  */
+      if (src_info->available_out)
+	{
+	  bitmap_and_compl_into (&m_tmp_bitmap, src_info->available_out);
+	  if (bitmap_empty_p (&m_tmp_bitmap))
+	    continue;
+	}
+
+      /* Add the remaining candidates to the appropriate required set.  */
+      if (dump_file)
+	{
+	  fprintf (dump_file, ";; Moving this set from block %d"
+		   " to block %d:", bb_index, e->src->index);
+	  dump_candidate_bitmap (&m_tmp_bitmap);
+	  fprintf (dump_file, "\n");
+	}
+      /* If the source block contains a call, we want to rematerialize
+	 after the call, otherwise we want to rematerialize at the start
+	 of the block.  */
+      bitmap src_required = get_bitmap (src_info->last_call
+					? &src_info->required_after_call
+					: &src_info->required_in);
+      if (bitmap_ior_into (src_required, &m_tmp_bitmap))
+	{
+	  if (!src_info->last_call)
+	    bitmap_set_bit (pending_blocks, e->src->index);
+	  unshare_available_sets (src_info);
+	  bitmap_ior_into (get_bitmap (&src_info->available_out),
+			   &m_tmp_bitmap);
+	}
+    }
+
+  /* The candidates are now available on entry to the block.  */
+  bitmap_and_compl_into (dest_info->required_in, required);
+  unshare_available_sets (dest_info);
+  bitmap_ior_into (get_bitmap (&dest_info->available_in), required);
+}
+
+/* Go through the candidates that are currently marked as being
+   rematerialized at the beginning of a block.  Decide in each case
+   whether that's valid and profitable; if it isn't, move the
+   rematerialization to predecessor blocks instead.  */
+
+void
+early_remat::choose_rematerialization_points (void)
+{
+  bitmap_head required;
+  bitmap_head pending_blocks;
+
+  int *postorder = df_get_postorder (DF_BACKWARD);
+  unsigned int postorder_len = df_get_n_blocks (DF_BACKWARD);
+  bitmap_initialize (&required, &m_obstack);
+  bitmap_initialize (&pending_blocks, &m_obstack);
+  do
+    /* Process the blocks in postorder, to reduce the number of iterations
+       of the outer loop.  */
+    for (unsigned int i = 0; i < postorder_len; ++i)
+      {
+	unsigned int bb_index = postorder[i];
+	remat_block_info *info = &m_block_info[bb_index];
+	bitmap_clear_bit (&pending_blocks, bb_index);
+
+	if (empty_p (info->required_in))
+	  continue;
+
+	if (info->available_in)
+	  gcc_checking_assert (!bitmap_intersect_p (info->required_in,
+						    info->available_in));
+
+	if (local_remat_cheaper_p (bb_index))
+	  {
+	    /* We'd prefer to rematerialize at the head of the block.
+	       Only move candidates if we need to.  */
+	    compute_minimum_move_set (bb_index, &required);
+	    move_to_predecessors (bb_index, &required, &pending_blocks);
+	  }
+	else
+	  move_to_predecessors (bb_index, info->required_in,
+				&pending_blocks);
+      }
+  while (!bitmap_empty_p (&pending_blocks));
+  bitmap_clear (&required);
+}
+
+/* Emit all rematerialization instructions queued for BB.  */
+
+void
+early_remat::emit_remat_insns_for_block (basic_block bb)
+{
+  remat_block_info *info = &m_block_info[bb->index];
+
+  if (info->last_call && !empty_p (info->required_after_call))
+    emit_remat_insns (info->required_after_call, NULL,
+		      info->rd_after_call, info->last_call);
+
+  if (!empty_p (info->required_in))
+    {
+      rtx_insn *insn = BB_HEAD (bb);
+      while (insn != BB_END (bb)
+	     && !INSN_P (NEXT_INSN (insn)))
+	insn = NEXT_INSN (insn);
+      emit_remat_insns (info->required_in, info->available_in,
+			info->rd_in, insn);
+    }
+}
+
+/* Decide which candidates in each block's REQUIRED_IN set need to be
+   rematerialized and decide where the rematerialization instructions
+   should go.  Emit queued rematerialization instructions at the start
+   of blocks and after the last calls in blocks.  */
+
+void
+early_remat::global_phase (void)
+{
+  compute_availability ();
+  if (dump_file)
+    {
+      fprintf (dump_file, "\n;; Blocks after computing global"
+	       " availability:\n");
+      dump_all_blocks ();
+    }
+
+  choose_rematerialization_points ();
+  if (dump_file)
+    {
+      fprintf (dump_file, "\n;; Blocks after choosing rematerialization"
+	       " points:\n");
+      dump_all_blocks ();
+    }
+
+  basic_block bb;
+  FOR_EACH_BB_FN (bb, m_fn)
+    emit_remat_insns_for_block (bb);
+}
+
+/* Main function for the pass.  */
+
+void
+early_remat::run (void)
+{
+  df_analyze ();
+
+  if (!collect_candidates ())
+    return;
+
+  init_block_info ();
+  sort_candidates ();
+  finalize_candidate_indices ();
+  if (dump_file)
+    dump_all_candidates ();
+
+  compute_rd ();
+  decide_candidate_validity ();
+  local_phase ();
+  global_phase ();
+}
+
+early_remat::early_remat (function *fn, sbitmap selected_modes)
+  : m_fn (fn),
+    m_selected_modes (selected_modes),
+    m_available (0),
+    m_required (0),
+    m_value_table (63)
+{
+  bitmap_obstack_initialize (&m_obstack);
+  bitmap_initialize (&m_candidate_regnos, &m_obstack);
+  bitmap_initialize (&m_tmp_bitmap, &m_obstack);
+}
+
+early_remat::~early_remat ()
+{
+  bitmap_obstack_release (&m_obstack);
+}
+
+namespace {
+
+const pass_data pass_data_early_remat =
+{
+  RTL_PASS, /* type */
+  "early_remat", /* name */
+  OPTGROUP_NONE, /* optinfo_flags */
+  TV_EARLY_REMAT, /* tv_id */
+  0, /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  TODO_df_finish, /* todo_flags_finish */
+};
+
+class pass_early_remat : public rtl_opt_pass
+{
+public:
+  pass_early_remat (gcc::context *ctxt)
+    : rtl_opt_pass (pass_data_early_remat, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  virtual bool gate (function *)
+  {
+    return optimize > 1 && NUM_POLY_INT_COEFFS > 1;
+  }
+
+  virtual unsigned int execute (function *f)
+  {
+    auto_sbitmap selected_modes (NUM_MACHINE_MODES);
+    bitmap_clear (selected_modes);
+    targetm.select_early_remat_modes (selected_modes);
+    if (!bitmap_empty_p (selected_modes))
+      early_remat (f, selected_modes).run ();
+    return 0;
+  }
+}; // class pass_early_remat
+
+} // anon namespace
+
+rtl_opt_pass *
+make_pass_early_remat (gcc::context *ctxt)
+{
+  return new pass_early_remat (ctxt);
+}
Index: gcc/config/aarch64/aarch64.c
===================================================================
--- gcc/config/aarch64/aarch64.c	2017-11-17 15:56:45.548761354 +0000
+++ gcc/config/aarch64/aarch64.c	2017-11-17 15:56:53.018759920 +0000
@@ -17292,6 +17292,22 @@  aarch64_can_change_mode_class (machine_m
   return true;
 }
 
+/* Implement TARGET_EARLY_REMAT_MODES.  */
+
+static void
+aarch64_select_early_remat_modes (sbitmap modes)
+{
+  /* SVE values are not normally live across a call, so it should be
+     worth doing early rematerialization even in VL-specific mode.  */
+  for (int i = 0; i < NUM_MACHINE_MODES; ++i)
+    {
+      machine_mode mode = (machine_mode) i;
+      unsigned int vec_flags = aarch64_classify_vector_mode (mode);
+      if (vec_flags & VEC_ANY_SVE)
+	bitmap_set_bit (modes, i);
+    }
+}
+
 /* Target-specific selftests.  */
 
 #if CHECKING_P
@@ -17766,6 +17782,9 @@  #define TARGET_COMPUTE_PRESSURE_CLASSES
 #undef TARGET_CAN_CHANGE_MODE_CLASS
 #define TARGET_CAN_CHANGE_MODE_CLASS aarch64_can_change_mode_class
 
+#undef TARGET_SELECT_EARLY_REMAT_MODES
+#define TARGET_SELECT_EARLY_REMAT_MODES aarch64_select_early_remat_modes
+
 #if CHECKING_P
 #undef TARGET_RUN_TARGET_SELFTESTS
 #define TARGET_RUN_TARGET_SELFTESTS selftest::aarch64_run_selftests
Index: gcc/testsuite/gcc.target/aarch64/sve_spill_1.c
===================================================================
--- gcc/testsuite/gcc.target/aarch64/sve_spill_1.c	2017-11-17 15:56:35.691963931 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_spill_1.c	2017-11-17 15:56:53.022759920 +0000
@@ -26,3 +26,5 @@  TEST_LOOP (uint64_t, 511);
 /* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #511\n} 2 } } */
 /* { dg-final { scan-assembler-not {\tldr\tz[0-9]} } } */
 /* { dg-final { scan-assembler-not {\tstr\tz[0-9]} } } */
+/* { dg-final { scan-assembler-not {\tldr\tp[0-9]} } } */
+/* { dg-final { scan-assembler-not {\tstr\tp[0-9]} } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_spill_2.c
===================================================================
--- /dev/null	2017-11-14 14:28:07.424493901 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_spill_2.c	2017-11-17 15:56:53.022759920 +0000
@@ -0,0 +1,39 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve -msve-vector-bits=scalable" } */
+
+#include <stdint.h>
+
+void consumer (void *);
+
+#define TEST_LOOP(TYPE)				\
+  void						\
+  multi_loop_##TYPE (TYPE *x, TYPE val)		\
+  {						\
+    for (int i = 0; i < 7; ++i)			\
+      x[i] += val;				\
+    consumer (x);				\
+    for (int i = 0; i < 7; ++i)			\
+      x[i] += val;				\
+    consumer (x);				\
+    for (int i = 0; i < 7; ++i)			\
+      x[i] += val;				\
+    consumer (x);				\
+  }
+
+/* One iteration is enough.  */
+TEST_LOOP (uint8_t);
+TEST_LOOP (uint16_t);
+/* Two iterations are enough.  Complete unrolling makes sense
+   even at -O2.  */
+TEST_LOOP (uint32_t);
+/* Four iterations are needed; ought to stay a loop.  */
+TEST_LOOP (uint64_t);
+
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-9]\.b} 3 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-9]\.h} 3 } } */
+/* { dg-final { scan-assembler {\twhilelo\tp[0-9]\.s} } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-9]\.d} 6 } } */
+/* { dg-final { scan-assembler-not {\tldr\tz[0-9]} } } */
+/* { dg-final { scan-assembler-not {\tstr\tz[0-9]} } } */
+/* { dg-final { scan-assembler-not {\tldr\tp[0-9]} } } */
+/* { dg-final { scan-assembler-not {\tstr\tp[0-9]} } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_spill_3.c
===================================================================
--- /dev/null	2017-11-14 14:28:07.424493901 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_spill_3.c	2017-11-17 15:56:53.023759919 +0000
@@ -0,0 +1,48 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve -msve-vector-bits=scalable" } */
+
+#include <stdint.h>
+
+void consumer (void *);
+
+#define TEST_LOOP(TYPE) \
+  void								\
+  multi_loop_##TYPE (TYPE *x, TYPE val1, TYPE val2, int n)	\
+  {								\
+    for (int i = 0; i < n; ++i)					\
+      {								\
+	x[i * 2] += val1;					\
+	x[i * 2 + 1] += val2;					\
+      }								\
+    consumer (x);						\
+    for (int i = 0; i < n; ++i)					\
+      {								\
+	x[i * 2] += val1;					\
+	x[i * 2 + 1] += val2;					\
+      }								\
+    consumer (x);						\
+    for (int i = 0; i < n; ++i)					\
+      {								\
+	x[i * 2] += val1;					\
+	x[i * 2 + 1] += val2;					\
+      }								\
+    consumer (x);						\
+  }
+
+/* One iteration is enough.  */
+TEST_LOOP (uint8_t);
+TEST_LOOP (uint16_t);
+/* Two iterations are enough.  Complete unrolling makes sense
+   even at -O2.  */
+TEST_LOOP (uint32_t);
+/* Four iterations are needed; ought to stay a loop.  */
+TEST_LOOP (uint64_t);
+
+/* { dg-final { scan-assembler {\tld1b\tz[0-9]\.b} } } */
+/* { dg-final { scan-assembler {\tld1h\tz[0-9]\.h} } } */
+/* { dg-final { scan-assembler {\tld1w\tz[0-9]\.s} } } */
+/* { dg-final { scan-assembler {\tld1d\tz[0-9]\.d} } } */
+/* { dg-final { scan-assembler-not {\tldr\tz[0-9]} } } */
+/* { dg-final { scan-assembler-not {\tstr\tz[0-9]} } } */
+/* { dg-final { scan-assembler-not {\tldr\tp[0-9]} } } */
+/* { dg-final { scan-assembler-not {\tstr\tp[0-9]} } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_spill_4.c
===================================================================
--- /dev/null	2017-11-14 14:28:07.424493901 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_spill_4.c	2017-11-17 15:56:53.023759919 +0000
@@ -0,0 +1,36 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve" } */
+
+#include <stdint.h>
+
+void consumer (void *);
+
+#define TEST_LOOP(TYPE, VAL)			\
+  void						\
+  multi_loop_##TYPE (TYPE *x)			\
+  {						\
+    for (int i = 0; i < 100; ++i)		\
+      x[i] += VAL;				\
+    consumer (x);				\
+    for (int i = 0; i < 100; ++i)		\
+      x[i] += VAL;				\
+    consumer (x);				\
+    for (int i = 0; i < 100; ++i)		\
+      x[i] += VAL;				\
+    consumer (x);				\
+  }
+
+TEST_LOOP (uint16_t, 0x1234);
+TEST_LOOP (uint32_t, 0x12345);
+TEST_LOOP (uint64_t, 0x123456);
+
+/* { dg-final { scan-assembler-times {\tptrue\tp[0-9]+\.h,} 3 } } */
+/* { dg-final { scan-assembler-times {\tptrue\tp[0-9]+\.s,} 3 } } */
+/* { dg-final { scan-assembler-times {\tptrue\tp[0-9]+\.d,} 3 } } */
+/* { dg-final { scan-assembler-times {\tld1rh\tz[0-9]+\.h,} 3 } } */
+/* { dg-final { scan-assembler-times {\tld1rw\tz[0-9]+\.s,} 3 } } */
+/* { dg-final { scan-assembler-times {\tld1rd\tz[0-9]+\.d,} 3 } } */
+/* { dg-final { scan-assembler-not {\tldr\tz[0-9]} } } */
+/* { dg-final { scan-assembler-not {\tstr\tz[0-9]} } } */
+/* { dg-final { scan-assembler-not {\tldr\tp[0-9]} } } */
+/* { dg-final { scan-assembler-not {\tstr\tp[0-9]} } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_spill_5.c
===================================================================
--- /dev/null	2017-11-14 14:28:07.424493901 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_spill_5.c	2017-11-17 15:56:53.023759919 +0000
@@ -0,0 +1,34 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -ffast-math -march=armv8-a+sve" } */
+
+#include <stdint.h>
+
+void consumer (void *);
+
+#define TEST_LOOP(TYPE, VAL)			\
+  void						\
+  multi_loop_##TYPE (TYPE *x)			\
+  {						\
+    for (int i = 0; i < 100; ++i)		\
+      x[i] += VAL + i;				\
+    consumer (x);				\
+    for (int i = 0; i < 100; ++i)		\
+      x[i] += VAL + i;				\
+    consumer (x);				\
+    for (int i = 0; i < 100; ++i)		\
+      x[i] += VAL + i;				\
+    consumer (x);				\
+  }
+
+TEST_LOOP (uint8_t, 3);
+TEST_LOOP (uint16_t, 4);
+TEST_LOOP (uint32_t, 5);
+TEST_LOOP (uint64_t, 6);
+TEST_LOOP (float, 2.5f);
+TEST_LOOP (double, 3.5);
+
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\..,} 18 } } */
+/* { dg-final { scan-assembler-not {\tldr\tz[0-9]} } } */
+/* { dg-final { scan-assembler-not {\tstr\tz[0-9]} } } */
+/* { dg-final { scan-assembler-not {\tldr\tp[0-9]} } } */
+/* { dg-final { scan-assembler-not {\tstr\tp[0-9]} } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_spill_6.c
===================================================================
--- /dev/null	2017-11-14 14:28:07.424493901 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_spill_6.c	2017-11-17 15:56:53.023759919 +0000
@@ -0,0 +1,44 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve -msve-vector-bits=scalable" } */
+
+#include <stdint.h>
+
+void consumer (void *);
+
+#define TEST_LOOP(TYPE, VAL)						\
+  void									\
+  multi_loop_##TYPE (TYPE *x1, TYPE *x2, TYPE *x3, TYPE *x4, int which) \
+  {									\
+    if (which)								\
+      {									\
+	for (int i = 0; i < 7; ++i)					\
+	  x1[i] += VAL;							\
+	consumer (x1);							\
+	for (int i = 0; i < 7; ++i)					\
+	  x2[i] -= VAL;							\
+	consumer (x2);							\
+      }									\
+    else								\
+      {									\
+	for (int i = 0; i < 7; ++i)					\
+	  x3[i] &= VAL;							\
+	consumer (x3);							\
+      }									\
+    for (int i = 0; i < 7; ++i)						\
+      x4[i] |= VAL;							\
+    consumer (x4);							\
+  }
+
+TEST_LOOP (uint8_t, 0x12);
+TEST_LOOP (uint16_t, 0x1234);
+TEST_LOOP (uint32_t, 0x12345);
+TEST_LOOP (uint64_t, 0x123456);
+
+/* { dg-final { scan-assembler {\tld1b\tz[0-9]+\.b,} } } */
+/* { dg-final { scan-assembler {\tld1h\tz[0-9]+\.h,} } } */
+/* { dg-final { scan-assembler {\tld1w\tz[0-9]+\.s,} } } */
+/* { dg-final { scan-assembler {\tld1d\tz[0-9]+\.d,} } } */
+/* { dg-final { scan-assembler-not {\tldr\tz[0-9]} } } */
+/* { dg-final { scan-assembler-not {\tstr\tz[0-9]} } } */
+/* { dg-final { scan-assembler-not {\tldr\tp[0-9]} } } */
+/* { dg-final { scan-assembler-not {\tstr\tp[0-9]} } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_spill_7.c
===================================================================
--- /dev/null	2017-11-14 14:28:07.424493901 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_spill_7.c	2017-11-17 15:56:53.023759919 +0000
@@ -0,0 +1,46 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -ffast-math -march=armv8-a+sve" } */
+
+#include <stdint.h>
+
+void consumer (void *);
+
+#define TEST_LOOP(TYPE, VAL)			\
+  void						\
+  multi_loop_##TYPE (TYPE *x, int n)		\
+  {						\
+    for (int k = 0; k < 4; ++k)			\
+      {						\
+	for (int j = 0; j < n; ++j)		\
+	  {					\
+	    for (int i = 0; i < 100; ++i)	\
+	      x[i] += VAL + i;			\
+	    asm volatile ("");			\
+	  }					\
+	for (int j = 0; j < n; ++j)		\
+	  consumer (x);				\
+	for (int j = 0; j < n; ++j)		\
+	  {					\
+	    for (int i = 0; i < 100; ++i)	\
+	      x[i] += VAL + i;			\
+	    asm volatile ("");			\
+	  }					\
+	consumer (x);				\
+	for (int i = 0; i < 100; ++i)		\
+	  x[i] += VAL + i;			\
+	consumer (x);				\
+      }						\
+  }
+
+TEST_LOOP (uint8_t, 3);
+TEST_LOOP (uint16_t, 4);
+TEST_LOOP (uint32_t, 5);
+TEST_LOOP (uint64_t, 6);
+TEST_LOOP (float, 2.5f);
+TEST_LOOP (double, 3.5);
+
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\..,} 18 } } */
+/* { dg-final { scan-assembler-not {\tldr\tz[0-9]} } } */
+/* { dg-final { scan-assembler-not {\tstr\tz[0-9]} } } */
+/* { dg-final { scan-assembler-not {\tldr\tp[0-9]} } } */
+/* { dg-final { scan-assembler-not {\tstr\tp[0-9]} } } */