From patchwork Thu Nov 10 14:25:47 2016
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Kyrill Tkachov <kyrylo.tkachov@foss.arm.com>
X-Patchwork-Id: 81678
Message-ID: <5824836B.5030302@foss.arm.com>
Date: Thu, 10 Nov 2016 14:25:47 +0000
From: Kyrill Tkachov <kyrylo.tkachov@foss.arm.com>
User-Agent: Mozilla/5.0 (X11; Linux x86_64;
 rv:31.0) Gecko/20100101 Thunderbird/31.2.0
MIME-Version: 1.0
To: GCC Patches <gcc-patches@gcc.gnu.org>
CC: Marcus Shawcroft <marcus.shawcroft@arm.com>,
 Richard Earnshaw <Richard.Earnshaw@arm.com>,
 James Greenhalgh <james.greenhalgh@arm.com>,
 Segher Boessenkool <segher@kernel.crashing.org>
Subject: [PATCH][AArch64] Separate shrink wrapping hooks implementation

Hi all,

This patch implements the new separate shrink-wrapping hooks for aarch64.
In separate shrink-wrapping (as I understand it) we treat each register save/restore as a 'component'
that can be performed independently of the other saves/restores in the prologue/epilogue. A component
can be moved out of the prologue/epilogue and performed only in the basic blocks where it's actually
needed. This lets us avoid saving and restoring a register on execution paths that never use it.
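
To make the idea concrete, here is a small C function of the shape that could benefit. This is only an
illustrative sketch: the names transform and sum_transformed are made up for this example and are not
part of the patch, and whether the saves actually get moved depends on the target and optimization level.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical out-of-line helper.  Because the caller keeps values live
   across this call, it needs callee-saved registers on the loop path.  */
__attribute__ ((noinline)) long
transform (long x)
{
  return x * 3 + 1;
}

/* The n == 0 path returns immediately and needs no callee-saved registers.
   With separate shrink-wrapping the saves/restores for the loop's registers
   can be placed on the loop path only instead of in the common prologue.  */
long
sum_transformed (const long *p, size_t n)
{
  if (n == 0)
    return 0;

  assert (p != NULL);
  long acc = 0;
  for (size_t i = 0; i < n; i++)
    acc += transform (p[i]);
  return acc;
}
```

Comparing the assembly for sum_transformed with and without the patch should show whether the
callee-saved stores stay in the entry block or move next to the loop.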

In the most general form a 'component' can be any operation that the prologue/epilogue performs, for example
stack adjustment. But in this patch we only consider callee-saved register save/restores as components.
The code is in many ways similar to the powerpc implementation of the hooks.

The hooks implemented are:
* TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS: Returns a bitmap containing a bit for each register that should
be considered a 'component', i.e. a register whose save/restore should be separated from the prologue and
epilogue and placed at the basic blocks where it's needed.

* TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB: Determines for a given basic block which 'component' registers it needs.
This is determined through dataflow. If a component register is in the IN, GEN or KILL sets for the basic block
it's considered needed and marked as such in the bitmap.

* TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS: Implemented as an empty function, since there is nothing to do
here for aarch64.

* TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS and TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS: Given a bitmap
of component registers, emit the save or restore code for them.

* TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS: Given a bitmap of component registers, record in the backend that
these registers are shrink-wrapped using this approach and that the normal prologue and epilogue expansion code
should not emit code for them. This is done similarly to powerpc by defining a bool array in machine_function
where we record whether each register is separately shrink-wrapped.  The prologue and epilogue expansion code
(through aarch64_save_callee_saves and aarch64_restore_callee_saves) is updated to not emit save/restores for
registers that are marked in that array.
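
As a rough model of how the components_for_bb and set_handled_components hooks interact, here is a
standalone toy version in plain C. It is a sketch only: it uses a 64-bit mask instead of GCC's
sbitmap/bitmap types, and every toy_* name is made up for illustration rather than taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NUM_REGS 64		/* Assumed register count for the toy model.  */

typedef uint64_t toy_regset;	/* One bit per register.  */

/* Per-block dataflow sets, standing in for the DF_LIVE information.  */
struct toy_bb
{
  toy_regset live_in, gen, kill;
};

/* Analogue of the components_for_bb hook: a callee-saved register is needed
   in BB if it appears in the IN, GEN or KILL set.  */
toy_regset
toy_components_for_bb (const struct toy_bb *bb, toy_regset callee_saved)
{
  assert (bb != NULL);
  return (bb->live_in | bb->gen | bb->kill) & callee_saved;
}

/* Analogue of the bool array added to machine_function.  */
static bool toy_wrapped[TOY_NUM_REGS];

/* Analogue of the set_handled_components hook.  */
void
toy_set_handled_components (toy_regset components)
{
  for (unsigned r = 0; r < TOY_NUM_REGS; r++)
    if ((components >> r) & 1)
      toy_wrapped[r] = true;
}

/* Registers the ordinary prologue still has to save: everything saved on
   entry except the registers that are wrapped separately.  */
toy_regset
toy_prologue_saves (toy_regset saved_on_entry)
{
  toy_regset out = 0;
  for (unsigned r = 0; r < TOY_NUM_REGS; r++)
    if (((saved_on_entry >> r) & 1) && !toy_wrapped[r])
      out |= (toy_regset) 1 << r;
  return out;
}
```

In this model a block that only touches registers 19 and 20 gets exactly those two components, and once
they are marked as handled the ordinary prologue only saves whatever remains.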

Our prologue and epilogue code has a lot of intricate logic to perform stack adjustments using the writeback
forms of the load/store instructions. Separately shrink-wrapping those registers marked for writeback
(cfun->machine->frame.wb_candidate1 and cfun->machine->frame.wb_candidate2) broke that codegen and I had to
emit an explicit stack adjustment instruction that created ugly prologue/epilogue sequences. So this patch
is conservative and doesn't allow shrink-wrapping of the registers marked for writeback. Maybe in the future
we can relax it (for example allow wrapping of one of the two writeback registers if the writeback amount
can be encoded in a single-register writeback store/load) but given the development stage of GCC I thought
I'd play it safe.
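
The selection rule can be sketched in standalone C as follows. This is a simplified model, not the code
in the patch: the offset check mirrors offset_12bit_unsigned_scaled_p but takes the access size directly,
the register set is a 64-bit mask, and the toy_* names and TOY_INVALID_REG are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_INVALID_REG 0xffffffffu	/* Stand-in for INVALID_REGNUM.  */

/* True if OFFSET fits an unsigned 12-bit immediate scaled by SIZE bytes.  */
static bool
toy_offset_12bit_unsigned_scaled_p (int64_t size, int64_t offset)
{
  return offset >= 0 && offset < 4096 * size && offset % size == 0;
}

/* Pick the registers that may be shrink-wrapped separately: those saved on
   entry whose 8-byte slot is reachable with one load/store, minus the two
   writeback candidates WB1 and WB2.  */
uint64_t
toy_get_separate_components (uint64_t saved, const int64_t *slot_offset,
			     unsigned wb1, unsigned wb2)
{
  assert (slot_offset != NULL);
  assert (wb1 == TOY_INVALID_REG || wb1 < 64);
  assert (wb2 == TOY_INVALID_REG || wb2 < 64);

  uint64_t components = 0;
  for (unsigned r = 0; r < 64; r++)
    if (((saved >> r) & 1)
	&& toy_offset_12bit_unsigned_scaled_p (8, slot_offset[r]))
      components |= (uint64_t) 1 << r;

  /* Leave the writeback candidates to the normal prologue/epilogue.  */
  if (wb1 != TOY_INVALID_REG)
    components &= ~((uint64_t) 1 << wb1);
  if (wb2 != TOY_INVALID_REG)
    components &= ~((uint64_t) 1 << wb2);

  return components;
}
```

With slots at offsets 16, 24 and 32 for registers 19, 20 and 21, and registers 19 and 20 chosen for
writeback, only register 21 remains a component. A slot that is misaligned or out of range is
rejected as well.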

I ran SPEC2006 on a Cortex-A72. Overall scores were neutral but there were some interesting swings.
On SPECINT:
458.sjeng     +1.45%
471.omnetpp   +2.19%
445.gobmk     -2.01%

On SPECFP:
453.povray    +7.00%

I'll be re-running the benchmarks with Segher's recent patch [1] to see if it fixes the regression,
and if it does I think this can go in.

[1] https://gcc.gnu.org/ml/gcc-patches/2016-11/msg00889.html

Bootstrapped and tested on aarch64-none-linux-gnu.

Thanks,
Kyrill

2016-11-10  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>

     * config/aarch64/aarch64.h (machine_function): Add
     reg_is_wrapped_separately field.
     * config/aarch64/aarch64.c (emit_set_insn): Change return type to
     rtx_insn *.
     (aarch64_save_callee_saves): Don't save registers that are wrapped
     separately.
     (aarch64_restore_callee_saves): Don't restore registers that are
     wrapped separately.
     (offset_9bit_signed_unscaled_p, offset_12bit_unsigned_scaled_p,
     aarch64_offset_7bit_signed_scaled_p): Move earlier in the file.
     (aarch64_get_separate_components): New function.
     (aarch64_components_for_bb): Likewise.
     (aarch64_disqualify_components): Likewise.
     (aarch64_emit_prologue_components): Likewise.
     (aarch64_emit_epilogue_components): Likewise.
     (aarch64_set_handled_components): Likewise.
     (TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS,
     TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB,
     TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS,
     TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS,
     TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS,
     TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS): Define.

commit 14c7a66d9f3a44ef40499e61ca9643c7dfbc6c82
Author: Kyrylo Tkachov <kyrylo.tkachov@arm.com>
Date:   Tue Oct 11 09:25:54 2016 +0100

    [AArch64] Separate shrink wrapping hooks implementation

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 325e725..5508333 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -1138,7 +1138,7 @@ aarch64_is_extend_from_extract (machine_mode mode, rtx mult_imm,
 
 /* Emit an insn that's a simple single-set.  Both the operands must be
    known to be valid.  */
-inline static rtx
+inline static rtx_insn *
 emit_set_insn (rtx x, rtx y)
 {
   return emit_insn (gen_rtx_SET (x, y));
@@ -3135,6 +3135,9 @@ aarch64_save_callee_saves (machine_mode mode, HOST_WIDE_INT start_offset,
 	      || regno == cfun->machine->frame.wb_candidate2))
 	continue;
 
+      if (cfun->machine->reg_is_wrapped_separately[regno])
+	continue;
+
       reg = gen_rtx_REG (mode, regno);
       offset = start_offset + cfun->machine->frame.reg_offset[regno];
       mem = gen_mem_ref (mode, plus_constant (Pmode, stack_pointer_rtx,
@@ -3143,6 +3146,7 @@ aarch64_save_callee_saves (machine_mode mode, HOST_WIDE_INT start_offset,
       regno2 = aarch64_next_callee_save (regno + 1, limit);
 
       if (regno2 <= limit
+	  && !cfun->machine->reg_is_wrapped_separately[regno2]
 	  && ((cfun->machine->frame.reg_offset[regno] + UNITS_PER_WORD)
 	      == cfun->machine->frame.reg_offset[regno2]))
 
@@ -3191,6 +3195,9 @@ aarch64_restore_callee_saves (machine_mode mode,
        regno <= limit;
        regno = aarch64_next_callee_save (regno + 1, limit))
     {
+      if (cfun->machine->reg_is_wrapped_separately[regno])
+	continue;
+
       rtx reg, mem;
 
       if (skip_wb
@@ -3205,6 +3212,7 @@ aarch64_restore_callee_saves (machine_mode mode,
       regno2 = aarch64_next_callee_save (regno + 1, limit);
 
       if (regno2 <= limit
+	  && !cfun->machine->reg_is_wrapped_separately[regno2]
 	  && ((cfun->machine->frame.reg_offset[regno] + UNITS_PER_WORD)
 	      == cfun->machine->frame.reg_offset[regno2]))
 	{
@@ -3224,6 +3232,169 @@ aarch64_restore_callee_saves (machine_mode mode,
     }
 }
 
+static inline bool
+offset_9bit_signed_unscaled_p (machine_mode mode ATTRIBUTE_UNUSED,
+			       HOST_WIDE_INT offset)
+{
+  return offset >= -256 && offset < 256;
+}
+
+static inline bool
+offset_12bit_unsigned_scaled_p (machine_mode mode, HOST_WIDE_INT offset)
+{
+  return (offset >= 0
+	  && offset < 4096 * GET_MODE_SIZE (mode)
+	  && offset % GET_MODE_SIZE (mode) == 0);
+}
+
+bool
+aarch64_offset_7bit_signed_scaled_p (machine_mode mode, HOST_WIDE_INT offset)
+{
+  return (offset >= -64 * GET_MODE_SIZE (mode)
+	  && offset < 64 * GET_MODE_SIZE (mode)
+	  && offset % GET_MODE_SIZE (mode) == 0);
+}
+
+/* Implement TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS.  */
+
+static sbitmap
+aarch64_get_separate_components (void)
+{
+  /* Calls to alloca further extend the stack frame and it can be messy to
+     figure out the location of the stack slots for each register.
+     For now be conservative.  */
+  if (cfun->calls_alloca)
+    return NULL;
+
+  aarch64_layout_frame ();
+
+  sbitmap components = sbitmap_alloc (V31_REGNUM + 1);
+  bitmap_clear (components);
+
+  /* The registers we need saved to the frame.  */
+  for (unsigned regno = R0_REGNUM; regno <= V31_REGNUM; regno++)
+    if (aarch64_register_saved_on_entry (regno))
+      {
+	HOST_WIDE_INT offset = cfun->machine->frame.reg_offset[regno];
+	if (!frame_pointer_needed)
+	  offset += cfun->machine->frame.frame_size
+		    - cfun->machine->frame.hard_fp_offset;
+	/* Check that we can access the stack slot of the register with one
+	   direct load with no adjustments needed.  */
+	if (offset_12bit_unsigned_scaled_p (DImode, offset))
+	  bitmap_set_bit (components, regno);
+      }
+
+  /* Don't mess with the hard frame pointer.  */
+  if (frame_pointer_needed)
+    bitmap_clear_bit (components, HARD_FRAME_POINTER_REGNUM);
+
+  unsigned reg1 = cfun->machine->frame.wb_candidate1;
+  unsigned reg2 = cfun->machine->frame.wb_candidate2;
+  /* If aarch64_layout_frame has chosen registers to store/restore with
+     writeback don't interfere with them to avoid having to output explicit
+     stack adjustment instructions.  */
+  if (reg2 != INVALID_REGNUM)
+    bitmap_clear_bit (components, reg2);
+  if (reg1 != INVALID_REGNUM)
+    bitmap_clear_bit (components, reg1);
+
+  bitmap_clear_bit (components, LR_REGNUM);
+  bitmap_clear_bit (components, SP_REGNUM);
+
+  return components;
+}
+
+/* Implement TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB.  */
+
+static sbitmap
+aarch64_components_for_bb (basic_block bb)
+{
+  bitmap in = DF_LIVE_IN (bb);
+  bitmap gen = &DF_LIVE_BB_INFO (bb)->gen;
+  bitmap kill = &DF_LIVE_BB_INFO (bb)->kill;
+
+  sbitmap components = sbitmap_alloc (V31_REGNUM + 1);
+  bitmap_clear (components);
+
+  /* Registers are used in a bb if they are in the IN, GEN, or KILL sets.  */
+  for (unsigned regno = R0_REGNUM; regno <= V31_REGNUM; regno++)
+    if (!call_used_regs[regno]
+	&& (bitmap_bit_p (in, regno)
+	    || bitmap_bit_p (gen, regno)
+	    || bitmap_bit_p (kill, regno)))
+      bitmap_set_bit (components, regno);
+
+  return components;
+}
+
+/* Implement TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS.
+   Nothing to do for aarch64.  */
+
+static void
+aarch64_disqualify_components (sbitmap, edge, sbitmap, bool)
+{
+}
+
+/* Implement TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS.  */
+
+static void
+aarch64_emit_prologue_components (sbitmap components)
+{
+  rtx ptr_reg = gen_rtx_REG (Pmode, frame_pointer_needed
+			     ? HARD_FRAME_POINTER_REGNUM
+			     : STACK_POINTER_REGNUM);
+
+  for (unsigned regno = R0_REGNUM; regno <= V31_REGNUM; regno++)
+    if (bitmap_bit_p (components, regno))
+      {
+	rtx reg = gen_rtx_REG (Pmode, regno);
+	HOST_WIDE_INT offset = cfun->machine->frame.reg_offset[regno];
+	if (!frame_pointer_needed)
+	  offset += cfun->machine->frame.frame_size
+		    - cfun->machine->frame.hard_fp_offset;
+	rtx addr = plus_constant (Pmode, ptr_reg, offset);
+	rtx mem = gen_frame_mem (Pmode, addr);
+
+	RTX_FRAME_RELATED_P (emit_move_insn (mem, reg)) = 1;
+      }
+}
+
+/* Implement TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS.  */
+
+static void
+aarch64_emit_epilogue_components (sbitmap components)
+{
+  rtx ptr_reg = gen_rtx_REG (Pmode, frame_pointer_needed
+			     ? HARD_FRAME_POINTER_REGNUM
+			     : STACK_POINTER_REGNUM);
+
+  for (unsigned regno = R0_REGNUM; regno <= V31_REGNUM; regno++)
+    if (bitmap_bit_p (components, regno))
+      {
+	rtx reg = gen_rtx_REG (Pmode, regno);
+	HOST_WIDE_INT offset = cfun->machine->frame.reg_offset[regno];
+	if (!frame_pointer_needed)
+	  offset += cfun->machine->frame.frame_size
+		    - cfun->machine->frame.hard_fp_offset;
+	rtx addr = plus_constant (Pmode, ptr_reg, offset);
+	rtx mem = gen_frame_mem (Pmode, addr);
+
+	RTX_FRAME_RELATED_P (emit_move_insn (reg, mem)) = 1;
+	add_reg_note (get_last_insn (), REG_CFA_RESTORE, reg);
+      }
+}
+
+/* Implement TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS.  */
+
+static void
+aarch64_set_handled_components (sbitmap components)
+{
+  for (unsigned regno = R0_REGNUM; regno <= V31_REGNUM; regno++)
+    if (bitmap_bit_p (components, regno))
+      cfun->machine->reg_is_wrapped_separately[regno] = true;
+}
+
 /* AArch64 stack frames generated by this compiler look like:
 
 	+-------------------------------+
@@ -3944,29 +4115,6 @@ aarch64_classify_index (struct aarch64_address_info *info, rtx x,
   return false;
 }
 
-bool
-aarch64_offset_7bit_signed_scaled_p (machine_mode mode, HOST_WIDE_INT offset)
-{
-  return (offset >= -64 * GET_MODE_SIZE (mode)
-	  && offset < 64 * GET_MODE_SIZE (mode)
-	  && offset % GET_MODE_SIZE (mode) == 0);
-}
-
-static inline bool
-offset_9bit_signed_unscaled_p (machine_mode mode ATTRIBUTE_UNUSED,
-			       HOST_WIDE_INT offset)
-{
-  return offset >= -256 && offset < 256;
-}
-
-static inline bool
-offset_12bit_unsigned_scaled_p (machine_mode mode, HOST_WIDE_INT offset)
-{
-  return (offset >= 0
-	  && offset < 4096 * GET_MODE_SIZE (mode)
-	  && offset % GET_MODE_SIZE (mode) == 0);
-}
-
 /* Return true if MODE is one of the modes for which we
    support LDP/STP operations.  */
 
@@ -14452,6 +14600,30 @@ aarch64_optab_supported_p (int op, machine_mode mode1, machine_mode,
 #define TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD_GUARD \
   aarch64_first_cycle_multipass_dfa_lookahead_guard
 
+#undef TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS
+#define TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS \
+  aarch64_get_separate_components
+
+#undef TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB
+#define TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB \
+  aarch64_components_for_bb
+
+#undef TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS
+#define TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS \
+  aarch64_disqualify_components
+
+#undef TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS
+#define TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS \
+  aarch64_emit_prologue_components
+
+#undef TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS
+#define TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS \
+  aarch64_emit_epilogue_components
+
+#undef TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS
+#define TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS \
+  aarch64_set_handled_components
+
 #undef TARGET_TRAMPOLINE_INIT
 #define TARGET_TRAMPOLINE_INIT aarch64_trampoline_init
 
diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h
index 584ff5c..fb89e5a 100644
--- a/gcc/config/aarch64/aarch64.h
+++ b/gcc/config/aarch64/aarch64.h
@@ -591,6 +591,8 @@ struct GTY (()) aarch64_frame
 typedef struct GTY (()) machine_function
 {
   struct aarch64_frame frame;
+  /* One entry for each GPR and FP register.  */
+  bool reg_is_wrapped_separately[V31_REGNUM + 1];
 } machine_function;
 #endif