[AArch64] Separate shrink wrapping hooks implementation

Message ID 5824836B.5030302@foss.arm.com
State Superseded

Commit Message

Kyrill Tkachov Nov. 10, 2016, 2:25 p.m. UTC
Hi all,

This patch implements the new separate shrink-wrapping hooks for aarch64.
In separate shrink wrapping (as I understand it) we consider each register save/restore as
a 'component' that can be performed independently of the other save/restores in the prologue/epilogue
and can be moved outside the prologue/epilogue and instead performed only in the basic blocks where it's
actually needed. This allows us to avoid saving and restoring registers on execution paths where a register
might not be needed.
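
To make the idea concrete, here is a small hypothetical C function (not taken from the patch or from SPEC; the names `expensive` and `sum_transformed` are invented for illustration). It has a cheap early exit that needs no callee-saved registers and a loop that keeps values live across calls. Whether a given compiler actually moves the saves depends on register allocation and on the target hooks, so treat this only as the kind of code that is expected to benefit.

```c
#include <stddef.h>

/* Hypothetical out-of-line helper; noinline keeps the call in place so
   that the caller really has values live across a call.  */
static long __attribute__ ((noinline))
expensive (long x)
{
  return x * 3 + 1;
}

/* The early return touches no callee-saved registers.  Only the loop
   keeps values (the running sum, the index, the pointer) live across
   calls, so only that path needs the register saves and restores.  */
long
sum_transformed (const long *v, size_t n)
{
  if (v == NULL || n == 0)
    return 0;                   /* Fast path: nothing to save here.  */

  long sum = 0;
  for (size_t i = 0; i < n; i++)
    sum += expensive (v[i]);    /* Values live across the call.  */
  return sum;
}
```

With a monolithic prologue the saves happen before the `n == 0` test; with separate shrink-wrapping they can in principle sink to the loop path.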

In the most general form a 'component' can be any operation that the prologue/epilogue performs, for example
stack adjustment. But in this patch we only consider callee-saved register save/restores as components.
The code is in many ways similar to the powerpc implementation of the hooks.

The hooks implemented are:
* TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS: Returns a bitmap containing a bit for each register that should
be considered a 'component', i.e. its save/restore should be separated from the prologue and epilogue and placed
at the basic block where it's needed.

* TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB: Determines for a given basic block which 'component' registers it needs.
This is determined through dataflow. If a component register is in the IN, GEN or KILL sets for the basic block
it's considered as needed and marked as such in the bitmap.

* TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS and TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS: Given a bitmap
of component registers, emit the save or restore code for them.

* TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS: Given a bitmap of component registers, records in the backend that
these registers are shrink-wrapped using this approach and that the normal prologue and epilogue expansion code
should not emit code for them. This is done similarly to powerpc by defining a bool array in machine_function
where we record whether each register is separately shrink-wrapped.  The prologue and epilogue expansion code
(through aarch64_save_callee_saves and aarch64_restore_callee_saves) is updated to not emit save/restores for
these registers if they are marked in that array.
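
The interplay between these hooks can be modelled outside GCC. The following is a minimal standalone sketch, assuming a 64-register toy register file and a plain `uint64_t` standing in for GCC's sbitmap/bitmap types. The function names echo the hooks but are hypothetical and do not have the real GCC signatures.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_REGS 64             /* Toy register file; one bit per register.  */

typedef uint64_t regset;        /* Stand-in for GCC's sbitmap/bitmap.  */

/* Stand-in for machine_function::reg_is_wrapped_separately.  */
static bool reg_is_wrapped_separately[NUM_REGS];

/* Model of COMPONENTS_FOR_BB: a callee-saved register is needed by a
   block if it is in the block's IN, GEN or KILL set.  */
regset
components_for_bb (regset in, regset gen, regset kill, regset callee_saved)
{
  return (in | gen | kill) & callee_saved;
}

/* Model of SET_HANDLED_COMPONENTS: remember which registers the
   separate shrink-wrapping pass has taken over.  */
void
set_handled_components (regset components)
{
  for (unsigned regno = 0; regno < NUM_REGS; regno++)
    if (components & ((regset) 1 << regno))
      reg_is_wrapped_separately[regno] = true;
}

/* Model of the ordinary prologue: return the registers it still has to
   save, skipping those that are wrapped separately.  */
regset
prologue_saves (regset saved_on_entry)
{
  regset result = 0;
  for (unsigned regno = 0; regno < NUM_REGS; regno++)
    if ((saved_on_entry & ((regset) 1 << regno))
        && !reg_is_wrapped_separately[regno])
      result |= (regset) 1 << regno;
  return result;
}
```

For example, if registers 19, 20 and 21 are saved on entry and register 19 is handed to `set_handled_components`, `prologue_saves` returns only registers 20 and 21.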

Our prologue and epilogue code has a lot of intricate logic to perform stack adjustments using the writeback
forms of the load/store instructions. Separately shrink-wrapping those registers marked for writeback
(cfun->machine->frame.wb_candidate1 and cfun->machine->frame.wb_candidate2) broke that codegen and I had to
emit an explicit stack adjustment instruction that created ugly prologue/epilogue sequences. So this patch
is conservative and doesn't allow shrink-wrapping of the registers marked for writeback. Maybe in the future
we can relax it (for example allow wrapping of one of the two writeback registers if the writeback amount
can be encoded in a single-register writeback store/load) but given the development stage of GCC I thought
I'd play it safe.
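
As a rough standalone model of that conservative selection (a sketch under assumptions, not the GCC code): registers are numbered 0 to 63, `uint64_t` replaces the sbitmap, and `INVALID_REG` is a made-up counterpart of INVALID_REGNUM. A register is a candidate only if its slot can be reached by a single 8-byte load/store, and the two writeback candidates are always removed.

```c
#include <stdbool.h>
#include <stdint.h>

#define INVALID_REG 0xffffffffu /* Toy counterpart of INVALID_REGNUM.  */

/* Same range test as offset_12bit_unsigned_scaled_p in the patch:
   a non-negative multiple of SIZE below 4096 * SIZE.  */
static bool
offset_fits_single_access (int64_t offset, int64_t size)
{
  return offset >= 0 && offset < 4096 * size && offset % size == 0;
}

/* Return the saved registers whose slots are directly addressable,
   minus the two writeback candidates.  OFFSETS[regno] is the slot
   offset from the base register.  */
uint64_t
separate_components (uint64_t saved, const int64_t *offsets,
                     unsigned wb1, unsigned wb2)
{
  uint64_t components = 0;
  for (unsigned regno = 0; regno < 64; regno++)
    if ((saved & ((uint64_t) 1 << regno))
        && offset_fits_single_access (offsets[regno], 8))
      components |= (uint64_t) 1 << regno;

  /* Leave the writeback registers to the ordinary prologue/epilogue.  */
  if (wb1 != INVALID_REG)
    components &= ~((uint64_t) 1 << wb1);
  if (wb2 != INVALID_REG)
    components &= ~((uint64_t) 1 << wb2);
  return components;
}
```

The real hook additionally drops the hard frame pointer (when needed), LR and SP, which this sketch leaves out for brevity.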

I ran SPEC2006 on a Cortex-A72. Overall scores were neutral but there were some interesting swings.
458.sjeng     +1.45%
471.omnetpp   +2.19%
445.gobmk     -2.01%

On SPECFP:
453.povray    +7.00%

I'll be re-running the benchmarks with Segher's recent patch [1] to see if it fixes the regression,
and if it does I think this can go in.

[1] https://gcc.gnu.org/ml/gcc-patches/2016-11/msg00889.html

Bootstrapped and tested on aarch64-none-linux-gnu.

Thanks,
Kyrill

2016-11-10  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>

     * config/aarch64/aarch64.h (machine_function): Add
     reg_is_wrapped_separately field.
     * config/aarch64/aarch64.c (emit_set_insn): Change return type to
     rtx_insn *.
     (aarch64_save_callee_saves): Don't save registers that are wrapped
     separately.
     (aarch64_restore_callee_saves): Don't restore registers that are
     wrapped separately.
     (offset_9bit_signed_unscaled_p, offset_12bit_unsigned_scaled_p,
     aarch64_offset_7bit_signed_scaled_p): Move earlier in the file.
     (aarch64_get_separate_components): New function.
     (aarch64_components_for_bb): Likewise.
     (aarch64_disqualify_components): Likewise.
     (aarch64_emit_prologue_components): Likewise.
     (aarch64_emit_epilogue_components): Likewise.
     (aarch64_set_handled_components): Likewise.
     (TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS,
     TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB,
     TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS,
     TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS,
     TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS,
     TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS): Define.

Comments

Segher Boessenkool Nov. 10, 2016, 4:26 p.m. UTC | #1
Hi!

Great to see this.  Just a few comments...

On Thu, Nov 10, 2016 at 02:25:47PM +0000, Kyrill Tkachov wrote:
> +/* Implement TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS.  */
> +
> +static sbitmap
> +aarch64_get_separate_components (void)
> +{
> +  /* Calls to alloca further extend the stack frame and it can be messy to
> +     figure out the location of the stack slots for each register.
> +     For now be conservative.  */
> +  if (cfun->calls_alloca)
> +    return NULL;

The generic code already disallows functions with alloca (in
try_shrink_wrapping_separate).

> +static void
> +aarch64_emit_prologue_components (sbitmap components)
> +{
> +  rtx ptr_reg = gen_rtx_REG (Pmode, frame_pointer_needed
> +			     ? HARD_FRAME_POINTER_REGNUM
> +			     : STACK_POINTER_REGNUM);
> +
> +  for (unsigned regno = R0_REGNUM; regno <= V31_REGNUM; regno++)
> +    if (bitmap_bit_p (components, regno))
> +      {
> +	rtx reg = gen_rtx_REG (Pmode, regno);
> +	HOST_WIDE_INT offset = cfun->machine->frame.reg_offset[regno];
> +	if (!frame_pointer_needed)
> +	offset += cfun->machine->frame.frame_size
> +		  - cfun->machine->frame.hard_fp_offset;
> +	rtx addr = plus_constant (Pmode, ptr_reg, offset);
> +	rtx mem = gen_frame_mem (Pmode, addr);
> +
> +	RTX_FRAME_RELATED_P (emit_move_insn (mem, reg)) = 1;
> +      }
> +}

I think you should emit the CFI notes here directly, just like for the
epilogue components.


Segher

Kyrill Tkachov Nov. 10, 2016, 4:45 p.m. UTC | #2
On 10/11/16 16:26, Segher Boessenkool wrote:
> Hi!

Hi,

> Great to see this.  Just a few comments...
>
> On Thu, Nov 10, 2016 at 02:25:47PM +0000, Kyrill Tkachov wrote:
>> +/* Implement TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS.  */
>> +
>> +static sbitmap
>> +aarch64_get_separate_components (void)
>> +{
>> +  /* Calls to alloca further extend the stack frame and it can be messy to
>> +     figure out the location of the stack slots for each register.
>> +     For now be conservative.  */
>> +  if (cfun->calls_alloca)
>> +    return NULL;
> The generic code already disallows functions with alloca (in
> try_shrink_wrapping_separate).

Ok, I'll remove this.

>> +static void
>> +aarch64_emit_prologue_components (sbitmap components)
>> +{
>> +  rtx ptr_reg = gen_rtx_REG (Pmode, frame_pointer_needed
>> +			     ? HARD_FRAME_POINTER_REGNUM
>> +			     : STACK_POINTER_REGNUM);
>> +
>> +  for (unsigned regno = R0_REGNUM; regno <= V31_REGNUM; regno++)
>> +    if (bitmap_bit_p (components, regno))
>> +      {
>> +	rtx reg = gen_rtx_REG (Pmode, regno);
>> +	HOST_WIDE_INT offset = cfun->machine->frame.reg_offset[regno];
>> +	if (!frame_pointer_needed)
>> +	offset += cfun->machine->frame.frame_size
>> +		  - cfun->machine->frame.hard_fp_offset;
>> +	rtx addr = plus_constant (Pmode, ptr_reg, offset);
>> +	rtx mem = gen_frame_mem (Pmode, addr);
>> +
>> +	RTX_FRAME_RELATED_P (emit_move_insn (mem, reg)) = 1;
>> +      }
>> +}
> I think you should emit the CFI notes here directly, just like for the
> epilogue components.

The prologue code in expand_prologue doesn't attach any explicit notes,
so I didn't want to deviate from that. Looking at the powerpc implementation,
would that be a REG_CFA_OFFSET with the (SET (mem) (reg)) expression for saving
the reg?

Thanks,
Kyrill

> Segher

Andrew Pinski Nov. 10, 2016, 10:42 p.m. UTC | #3
On Thu, Nov 10, 2016 at 6:25 AM, Kyrill Tkachov
<kyrylo.tkachov@foss.arm.com> wrote:
> Hi all,
>
> This patch implements the new separate shrink-wrapping hooks for aarch64.
> In separate shrink wrapping (as I understand it) we consider each register save/restore as
> a 'component' that can be performed independently of the other save/restores in the prologue/epilogue
> and can be moved outside the prologue/epilogue and instead performed only in the basic blocks where it's
> actually needed. This allows us to avoid saving and restoring registers on execution paths where a register
> might not be needed.
>
> In the most general form a 'component' can be any operation that the prologue/epilogue performs, for example
> stack adjustment. But in this patch we only consider callee-saved register save/restores as components.
> The code is in many ways similar to the powerpc implementation of the hooks.
>
> The hooks implemented are:
> * TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS: Returns a bitmap containing a bit for each register that should
> be considered a 'component' i.e. its save/restore should be separated from the prologue and epilogue and placed
> at the basic block where it's needed.
>
> * TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB: Determine for a given basic block which 'component' registers it needs.
> This is determined through dataflow. If a component register is in the IN,GEN or KILL sets for the basic block
> it's considered as needed and marked as such in the bitmap.
>
> * TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS and TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS: Given a bitmap
> of component registers emits the save or restore code for them.
>
> * TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS: Given a bitmap of component registers record in the backend that
> the register is shrink-wrapped using this approach and that the normal prologue and epilogue expansion code
> should not emit code for them. This is done similarly to powerpc by defining a bool array in machine_function
> where we record whether each register is separately shrink-wrapped.  The prologue and epilogue expansion code
> (through aarch64_save_callee_saves and aarch64_restore_callee_saves) is updated to not emit save/restores for
> these registers if they appear in that array.
>
> Our prologue and epilogue code has a lot of intricate logic to perform stack adjustments using the writeback
> forms of the load/store instructions. Separately shrink-wrapping those registers marked for writeback
> (cfun->machine->frame.wb_candidate1 and cfun->machine->frame.wb_candidate2) broke that codegen and I had to
> emit an explicit stack adjustment instruction that created ugly prologue/epilogue sequences. So this patch
> is conservative and doesn't allow shrink-wrapping of the registers marked for writeback. Maybe in the future
> we can relax it (for example allow wrapping of one of the two writeback registers if the writeback amount
> can be encoded in a single-register writeback store/load) but given the development stage of GCC I thought
> I'd play it safe.
>
> I ran SPEC2006 on a Cortex-A72. Overall scores were neutral but there were some interesting swings.
> 458.sjeng     +1.45%
> 471.omnetpp   +2.19%
> 445.gobmk     -2.01%
>
> On SPECFP:
> 453.povray    +7.00%

Wow, this looks really good.  Thank you for implementing this.  If I
get some time I am going to try it out on other processors than A72
but I doubt I have time any time soon.

Thanks,
Andrew

> I'll be re-running the benchmarks with Segher's recent patch [1] to see if they fix the regression
> and if it does I think this can go in.
>
> [1] https://gcc.gnu.org/ml/gcc-patches/2016-11/msg00889.html
>
> Bootstrapped and tested on aarch64-none-linux-gnu.
>
> Thanks,
> Kyrill
>
> 2016-11-10  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>
>     * config/aarch64/aarch64.h (machine_function): Add
>     reg_is_wrapped_separately field.
>     * config/aarch64/aarch64.c (emit_set_insn): Change return type to
>     rtx_insn *.
>     (aarch64_save_callee_saves): Don't save registers that are wrapped
>     separately.
>     (aarch64_restore_callee_saves): Don't restore registers that are
>     wrapped separately.
>     (offset_9bit_signed_unscaled_p, offset_12bit_unsigned_scaled_p,
>     aarch64_offset_7bit_signed_scaled_p): Move earlier in the file.
>     (aarch64_get_separate_components): New function.
>     (aarch64_components_for_bb): Likewise.
>     (aarch64_disqualify_components): Likewise.
>     (aarch64_emit_prologue_components): Likewise.
>     (aarch64_emit_epilogue_components): Likewise.
>     (aarch64_set_handled_components): Likewise.
>     (TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS,
>     TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB,
>     TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS,
>     TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS,
>     TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS,
>     TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS): Define.

Segher Boessenkool Nov. 10, 2016, 11:39 p.m. UTC | #4
On Thu, Nov 10, 2016 at 02:42:24PM -0800, Andrew Pinski wrote:
> On Thu, Nov 10, 2016 at 6:25 AM, Kyrill Tkachov
> > I ran SPEC2006 on a Cortex-A72. Overall scores were neutral but there were
> > some interesting swings.
> > 458.sjeng     +1.45%
> > 471.omnetpp   +2.19%
> > 445.gobmk     -2.01%
> >
> > On SPECFP:
> > 453.povray    +7.00%
>
> Wow, this looks really good.  Thank you for implementing this.  If I
> get some time I am going to try it out on other processors than A72
> but I doubt I have time any time soon.

I'd love to hear what causes the slowdown for gobmk as well, btw.


Segher

Kyrill Tkachov Nov. 11, 2016, 10:17 a.m. UTC | #5
On 10/11/16 23:39, Segher Boessenkool wrote:
> On Thu, Nov 10, 2016 at 02:42:24PM -0800, Andrew Pinski wrote:
>> On Thu, Nov 10, 2016 at 6:25 AM, Kyrill Tkachov
>>> I ran SPEC2006 on a Cortex-A72. Overall scores were neutral but there were
>>> some interesting swings.
>>> 458.sjeng     +1.45%
>>> 471.omnetpp   +2.19%
>>> 445.gobmk     -2.01%
>>>
>>> On SPECFP:
>>> 453.povray    +7.00%
>>
>> Wow, this looks really good.  Thank you for implementing this.  If I
>> get some time I am going to try it out on other processors than A72
>> but I doubt I have time any time soon.
> I'd love to hear what causes the slowdown for gobmk as well, btw.

I haven't yet gotten a direct answer for that (through performance analysis tools)
but I have noticed that load/store pairs are not generated as aggressively as I hoped.
They are being merged by the sched fusion pass and peepholes (which run after this)
but that still misses cases. I've hacked the SWS hooks to generate pairs explicitly and that
increases the number of pairs and helps code size to boot. It complicates the logic of
the hooks a bit but not too much.

I'll make those changes and re-benchmark, hopefully that
will help performance.
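
For readers following along, the pairing idea can be sketched in standalone C. This is only a guess at the shape of such logic, not the follow-up patch: it assumes toy register numbering (0-31 general, 32-63 FP/SIMD), 8-byte slots, and a greedy walk, and `count_pairs` is a hypothetical name. Two consecutive component registers are paired when they are in the same class, their slots are adjacent, and the first offset satisfies the same range test as aarch64_offset_7bit_signed_scaled_p.

```c
#include <stdbool.h>
#include <stdint.h>

#define WORD_SIZE 8             /* Bytes per 64-bit register slot.  */

/* Range test matching aarch64_offset_7bit_signed_scaled_p for
   8-byte slots.  */
static bool
offset_fits_pair (int64_t offset)
{
  return (offset >= -64 * WORD_SIZE
          && offset < 64 * WORD_SIZE
          && offset % WORD_SIZE == 0);
}

/* Greedily count the store pairs that could be formed over the
   registers in COMPONENTS.  The number of leftover single stores is
   written to *SINGLES.  */
unsigned
count_pairs (uint64_t components, const int64_t *offsets, unsigned *singles)
{
  unsigned pairs = 0;
  int prev = -1;                /* Register waiting for a partner, or -1.  */

  *singles = 0;
  for (int regno = 0; regno < 64; regno++)
    {
      if (!(components & ((uint64_t) 1 << regno)))
        continue;
      if (prev >= 0
          && (prev < 32) == (regno < 32)        /* Same register class.  */
          && offsets[prev] + WORD_SIZE == offsets[regno]
          && offset_fits_pair (offsets[prev]))
        {
          pairs++;
          prev = -1;
        }
      else
        {
          if (prev >= 0)
            (*singles)++;       /* Previous register stays unpaired.  */
          prev = regno;
        }
    }
  if (prev >= 0)
    (*singles)++;
  return pairs;
}
```

With components at registers 19, 20, 21 and 23 and slot offsets 16, 24, 32 and 48, the walk forms one pair (19 with 20) and leaves 21 and 23 as single stores.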

Thanks,
Kyrill

> Segher

Patch

commit 14c7a66d9f3a44ef40499e61ca9643c7dfbc6c82
Author: Kyrylo Tkachov <kyrylo.tkachov@arm.com>
Date:   Tue Oct 11 09:25:54 2016 +0100

    [AArch64] Separate shrink wrapping hooks implementation

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 325e725..5508333 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -1138,7 +1138,7 @@  aarch64_is_extend_from_extract (machine_mode mode, rtx mult_imm,
 
 /* Emit an insn that's a simple single-set.  Both the operands must be
    known to be valid.  */
-inline static rtx
+inline static rtx_insn *
 emit_set_insn (rtx x, rtx y)
 {
   return emit_insn (gen_rtx_SET (x, y));
@@ -3135,6 +3135,9 @@  aarch64_save_callee_saves (machine_mode mode, HOST_WIDE_INT start_offset,
 	      || regno == cfun->machine->frame.wb_candidate2))
 	continue;
 
+      if (cfun->machine->reg_is_wrapped_separately[regno])
+       continue;
+
       reg = gen_rtx_REG (mode, regno);
       offset = start_offset + cfun->machine->frame.reg_offset[regno];
       mem = gen_mem_ref (mode, plus_constant (Pmode, stack_pointer_rtx,
@@ -3143,6 +3146,7 @@  aarch64_save_callee_saves (machine_mode mode, HOST_WIDE_INT start_offset,
       regno2 = aarch64_next_callee_save (regno + 1, limit);
 
       if (regno2 <= limit
+	  && !cfun->machine->reg_is_wrapped_separately[regno2]
 	  && ((cfun->machine->frame.reg_offset[regno] + UNITS_PER_WORD)
 	      == cfun->machine->frame.reg_offset[regno2]))
 
@@ -3191,6 +3195,9 @@  aarch64_restore_callee_saves (machine_mode mode,
        regno <= limit;
        regno = aarch64_next_callee_save (regno + 1, limit))
     {
+      if (cfun->machine->reg_is_wrapped_separately[regno])
+       continue;
+
       rtx reg, mem;
 
       if (skip_wb
@@ -3205,6 +3212,7 @@  aarch64_restore_callee_saves (machine_mode mode,
       regno2 = aarch64_next_callee_save (regno + 1, limit);
 
       if (regno2 <= limit
+	  && !cfun->machine->reg_is_wrapped_separately[regno2]
 	  && ((cfun->machine->frame.reg_offset[regno] + UNITS_PER_WORD)
 	      == cfun->machine->frame.reg_offset[regno2]))
 	{
@@ -3224,6 +3232,169 @@  aarch64_restore_callee_saves (machine_mode mode,
     }
 }
 
+static inline bool
+offset_9bit_signed_unscaled_p (machine_mode mode ATTRIBUTE_UNUSED,
+			       HOST_WIDE_INT offset)
+{
+  return offset >= -256 && offset < 256;
+}
+
+static inline bool
+offset_12bit_unsigned_scaled_p (machine_mode mode, HOST_WIDE_INT offset)
+{
+  return (offset >= 0
+	  && offset < 4096 * GET_MODE_SIZE (mode)
+	  && offset % GET_MODE_SIZE (mode) == 0);
+}
+
+bool
+aarch64_offset_7bit_signed_scaled_p (machine_mode mode, HOST_WIDE_INT offset)
+{
+  return (offset >= -64 * GET_MODE_SIZE (mode)
+	  && offset < 64 * GET_MODE_SIZE (mode)
+	  && offset % GET_MODE_SIZE (mode) == 0);
+}
+
+/* Implement TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS.  */
+
+static sbitmap
+aarch64_get_separate_components (void)
+{
+  /* Calls to alloca further extend the stack frame and it can be messy to
+     figure out the location of the stack slots for each register.
+     For now be conservative.  */
+  if (cfun->calls_alloca)
+    return NULL;
+
+  aarch64_layout_frame ();
+
+  sbitmap components = sbitmap_alloc (V31_REGNUM + 1);
+  bitmap_clear (components);
+
+  /* The registers we need saved to the frame.  */
+  for (unsigned regno = R0_REGNUM; regno <= V31_REGNUM; regno++)
+    if (aarch64_register_saved_on_entry (regno))
+      {
+	HOST_WIDE_INT offset = cfun->machine->frame.reg_offset[regno];
+	if (!frame_pointer_needed)
+	  offset += cfun->machine->frame.frame_size
+		    - cfun->machine->frame.hard_fp_offset;
+	/* Check that we can access the stack slot of the register with one
+	   direct load with no adjustments needed.  */
+	if (offset_12bit_unsigned_scaled_p (DImode, offset))
+	  bitmap_set_bit (components, regno);
+      }
+
+  /* Don't mess with the hard frame pointer.  */
+  if (frame_pointer_needed)
+    bitmap_clear_bit (components, HARD_FRAME_POINTER_REGNUM);
+
+  unsigned reg1 = cfun->machine->frame.wb_candidate1;
+  unsigned reg2 = cfun->machine->frame.wb_candidate2;
+  /* If aarch64_layout_frame has chosen registers to store/restore with
+     writeback don't interfere with them to avoid having to output explicit
+     stack adjustment instructions.  */
+  if (reg2 != INVALID_REGNUM)
+    bitmap_clear_bit (components, reg2);
+  if (reg1 != INVALID_REGNUM)
+    bitmap_clear_bit (components, reg1);
+
+  bitmap_clear_bit (components, LR_REGNUM);
+  bitmap_clear_bit (components, SP_REGNUM);
+
+  return components;
+}
+
+/* Implement TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB.  */
+
+static sbitmap
+aarch64_components_for_bb (basic_block bb)
+{
+  bitmap in = DF_LIVE_IN (bb);
+  bitmap gen = &DF_LIVE_BB_INFO (bb)->gen;
+  bitmap kill = &DF_LIVE_BB_INFO (bb)->kill;
+
+  sbitmap components = sbitmap_alloc (V31_REGNUM + 1);
+  bitmap_clear (components);
+
+  /* GPRs are used in a bb if they are in the IN, GEN, or KILL sets.  */
+  for (unsigned regno = R0_REGNUM; regno <= V31_REGNUM; regno++)
+    if ((!call_used_regs[regno])
+       && (bitmap_bit_p (in, regno)
+	   || bitmap_bit_p (gen, regno)
+	   || bitmap_bit_p (kill, regno)))
+	  bitmap_set_bit (components, regno);
+
+  return components;
+}
+
+/* Implement TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS.
+   Nothing to do for aarch64.  */
+
+static void
+aarch64_disqualify_components (sbitmap, edge, sbitmap, bool)
+{
+}
+
+/* Implement TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS.  */
+
+static void
+aarch64_emit_prologue_components (sbitmap components)
+{
+  rtx ptr_reg = gen_rtx_REG (Pmode, frame_pointer_needed
+			     ? HARD_FRAME_POINTER_REGNUM
+			     : STACK_POINTER_REGNUM);
+
+  for (unsigned regno = R0_REGNUM; regno <= V31_REGNUM; regno++)
+    if (bitmap_bit_p (components, regno))
+      {
+	rtx reg = gen_rtx_REG (Pmode, regno);
+	HOST_WIDE_INT offset = cfun->machine->frame.reg_offset[regno];
+	if (!frame_pointer_needed)
+	offset += cfun->machine->frame.frame_size
+		  - cfun->machine->frame.hard_fp_offset;
+	rtx addr = plus_constant (Pmode, ptr_reg, offset);
+	rtx mem = gen_frame_mem (Pmode, addr);
+
+	RTX_FRAME_RELATED_P (emit_move_insn (mem, reg)) = 1;
+      }
+}
+
+/* Implement TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS.  */
+
+static void
+aarch64_emit_epilogue_components (sbitmap components)
+{
+
+  rtx ptr_reg = gen_rtx_REG (Pmode, frame_pointer_needed
+			     ? HARD_FRAME_POINTER_REGNUM
+			     : STACK_POINTER_REGNUM);
+  for (unsigned regno = R0_REGNUM; regno <= V31_REGNUM; regno++)
+    if (bitmap_bit_p (components, regno))
+      {
+	rtx reg = gen_rtx_REG (Pmode, regno);
+	HOST_WIDE_INT offset = cfun->machine->frame.reg_offset[regno];
+	if (!frame_pointer_needed)
+	  offset += cfun->machine->frame.frame_size
+		     - cfun->machine->frame.hard_fp_offset;
+	rtx addr = plus_constant (Pmode, ptr_reg, offset);
+	rtx mem = gen_frame_mem (Pmode, addr);
+
+	RTX_FRAME_RELATED_P (emit_move_insn (reg, mem)) = 1;
+	add_reg_note (get_last_insn (), REG_CFA_RESTORE, reg);
+      }
+}
+
+/* Implement TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS.  */
+
+static void
+aarch64_set_handled_components (sbitmap components)
+{
+  for (unsigned regno = R0_REGNUM; regno <= V31_REGNUM; regno++)
+    if (bitmap_bit_p (components, regno))
+      cfun->machine->reg_is_wrapped_separately[regno] = true;
+}
+
 /* AArch64 stack frames generated by this compiler look like:
 
 	+-------------------------------+
@@ -3944,29 +4115,6 @@  aarch64_classify_index (struct aarch64_address_info *info, rtx x,
   return false;
 }
 
-bool
-aarch64_offset_7bit_signed_scaled_p (machine_mode mode, HOST_WIDE_INT offset)
-{
-  return (offset >= -64 * GET_MODE_SIZE (mode)
-	  && offset < 64 * GET_MODE_SIZE (mode)
-	  && offset % GET_MODE_SIZE (mode) == 0);
-}
-
-static inline bool
-offset_9bit_signed_unscaled_p (machine_mode mode ATTRIBUTE_UNUSED,
-			       HOST_WIDE_INT offset)
-{
-  return offset >= -256 && offset < 256;
-}
-
-static inline bool
-offset_12bit_unsigned_scaled_p (machine_mode mode, HOST_WIDE_INT offset)
-{
-  return (offset >= 0
-	  && offset < 4096 * GET_MODE_SIZE (mode)
-	  && offset % GET_MODE_SIZE (mode) == 0);
-}
-
 /* Return true if MODE is one of the modes for which we
    support LDP/STP operations.  */
 
@@ -14452,6 +14600,30 @@  aarch64_optab_supported_p (int op, machine_mode mode1, machine_mode,
 #define TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD_GUARD \
   aarch64_first_cycle_multipass_dfa_lookahead_guard
 
+#undef TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS
+#define TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS \
+  aarch64_get_separate_components
+
+#undef TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB
+#define TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB \
+  aarch64_components_for_bb
+
+#undef TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS
+#define TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS \
+  aarch64_disqualify_components
+
+#undef TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS
+#define TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS \
+  aarch64_emit_prologue_components
+
+#undef TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS
+#define TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS \
+  aarch64_emit_epilogue_components
+
+#undef TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS
+#define TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS \
+  aarch64_set_handled_components
+
 #undef TARGET_TRAMPOLINE_INIT
 #define TARGET_TRAMPOLINE_INIT aarch64_trampoline_init
 
diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h
index 584ff5c..fb89e5a 100644
--- a/gcc/config/aarch64/aarch64.h
+++ b/gcc/config/aarch64/aarch64.h
@@ -591,6 +591,8 @@  struct GTY (()) aarch64_frame
 typedef struct GTY (()) machine_function
 {
   struct aarch64_frame frame;
+  /* One entry for each GPR and FP register.  */
+  bool reg_is_wrapped_separately[V31_REGNUM + 1];
 } machine_function;
 #endif