From patchwork Thu Nov 10 14:25:47 2016 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Kyrill Tkachov X-Patchwork-Id: 81678 Delivered-To: patch@linaro.org Received: by 10.140.97.165 with SMTP id m34csp752535qge; Thu, 10 Nov 2016 06:26:16 -0800 (PST) X-Received: by 10.36.132.204 with SMTP id h195mr18732937itd.41.1478787976319; Thu, 10 Nov 2016 06:26:16 -0800 (PST) Return-Path: Received: from sourceware.org (server1.sourceware.org. [209.132.180.131]) by mx.google.com with ESMTPS id j11si4322952pag.229.2016.11.10.06.26.16 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 10 Nov 2016 06:26:16 -0800 (PST) Received-SPF: pass (google.com: domain of gcc-patches-return-440959-patch=linaro.org@gcc.gnu.org designates 209.132.180.131 as permitted sender) client-ip=209.132.180.131; Authentication-Results: mx.google.com; dkim=pass header.i=@gcc.gnu.org; spf=pass (google.com: domain of gcc-patches-return-440959-patch=linaro.org@gcc.gnu.org designates 209.132.180.131 as permitted sender) smtp.mailfrom=gcc-patches-return-440959-patch=linaro.org@gcc.gnu.org DomainKey-Signature: a=rsa-sha1; c=nofws; d=gcc.gnu.org; h=list-id :list-unsubscribe:list-archive:list-post:list-help:sender :message-id:date:from:mime-version:to:cc:subject:content-type; q=dns; s=default; b=YOqAB231uJYzw7ovRGezqNO/v3tl/hP747Wo7FM7Zo2 1tBqDnburZVuXgXk+XW0YdqugTZEkS92c347zDwU5vgZCoQfT+WDb/5ye5CW2khL J4FbScpoA/TI6MUHHg2HQU97BTVE9mRXmaNzEpo6+oqiEi5n+m2UB4nwTXAkMRdE = DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=gcc.gnu.org; h=list-id :list-unsubscribe:list-archive:list-post:list-help:sender :message-id:date:from:mime-version:to:cc:subject:content-type; s=default; bh=FvNN3bnQFBmIWi/fRxGpXvQBIwM=; b=f0yk1IcvYBrsDd6AC 5KofWUWQETEhwAGgdDb72guETdKa4sWZXvVKF8wZiJWiOls1j21zHPH7gFYoqhzE FG9yendiygRYQwyWPvw1bYj3iYkXABBB0ht80I6uV/1eROPQw7/j14x6ILDIvkQ2 ohmGSRCR5FdW08V6K46G41vUvM= Received: (qmail 71813 invoked by alias); 10 Nov 2016 14:26:03 -0000 Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Id: List-Unsubscribe: List-Archive: List-Post: List-Help: Sender: gcc-patches-owner@gcc.gnu.org Delivered-To: mailing list gcc-patches@gcc.gnu.org Received: (qmail 71790 invoked by uid 89); 10 Nov 2016 14:26:02 -0000 Authentication-Results: sourceware.org; auth=none X-Virus-Found: No X-Spam-SWARE-Status: No, score=-3.8 required=5.0 tests=BAYES_00, KAM_LAZY_DOMAIN_SECURITY, RP_MATCHES_RCVD autolearn=ham version=3.3.2 spammy=interfere, gen, 5918, messy X-HELO: foss.arm.com Received: from foss.arm.com (HELO foss.arm.com) (217.140.101.70) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Thu, 10 Nov 2016 14:25:52 +0000 Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.72.51.249]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 4990716; Thu, 10 Nov 2016 06:25:50 -0800 (PST) Received: from [10.2.207.77] (e100706-lin.cambridge.arm.com [10.2.207.77]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id 5821D3F318; Thu, 10 Nov 2016 06:25:49 -0800 (PST) Message-ID: <5824836B.5030302@foss.arm.com> Date: Thu, 10 Nov 2016 14:25:47 +0000 From: Kyrill Tkachov User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.2.0 MIME-Version: 1.0 To: GCC Patches CC: Marcus Shawcroft , Richard Earnshaw , James Greenhalgh , Segher Boessenkool Subject: [PATCH][AArch64] Separate shrink wrapping hooks implementation Hi all, This patch implements the new separate shrink-wrapping hooks for aarch64. In separate shrink wrapping (as I understand it) we consider each register save/restore as a 'component' that can be performed independently of the other save/restores in the prologue/epilogue and can be moved outside the prologue/epilogue and instead performed only in the basic blocks where it's actually needed. This allows us to avoid saving and restoring registers on execution paths where a register might not be needed. In the most general form a 'component' can be any operation that the prologue/epilogue performs, for example stack adjustment. But in this patch we only consider callee-saved register save/restores as components. The code is in many ways similar to the powerpc implementation of the hooks. The hooks implemented are: * TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS: Returns a bitmap containing a bit for each register that should be considered a 'component' i.e. its save/restore should be separated from the prologue and epilogue and placed at the basic block where it's needed. * TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB: Determine for a given basic block which 'component' registers it needs. This is determined through dataflow. If a component register is in the IN,GEN or KILL sets for the basic block it's considered as needed and marked as such in the bitmap. * TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS and TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS: Given a bitmap of component registers emits the save or restore code for them. * TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS: Given a bitmap of component registers record in the backend that the register is shrink-wrapped using this approach and that the normal prologue and epilogue expansion code should not emit code for them. This is done similarly to powerpc by defining a bool array in machine_function where we record whether each register is separately shrink-wrapped. The prologue and epilogue expansion code (through aarch64_save_callee_saves and aarch64_restore_callee_saves) is updated to not emit save/restores for these registers if they appear in that array. Our prologue and epilogue code has a lot of intricate logic to perform stack adjustments using the writeback forms of the load/store instructions. Separately shrink-wrapping those registers marked for writeback (cfun->machine->frame.wb_candidate1 and cfun->machine->frame.wb_candidate2) broke that codegen and I had to emit an explicit stack adjustment instruction that created ugly prologue/epilogue sequences. So this patch is conservative and doesn't allow shrink-wrapping of the registers marked for writeback. Maybe in the future we can relax it (for example allow wrapping of one of the two writeback registers if the writeback amount can be encoded in a single-register writeback store/load) but given the development stage of GCC I thought I'd play it safe. I ran SPEC2006 on a Cortex-A72. Overall scores were neutral but there were some interesting swings. 458.sjeng +1.45% 471.omnetpp +2.19% 445.gobmk -2.01% On SPECFP: 453.povray +7.00% I'll be re-running the benchmarks with Segher's recent patch [1] to see if they fix the regression and if it does I think this can go in. [1] https://gcc.gnu.org/ml/gcc-patches/2016-11/msg00889.html Bootstrapped and tested on aarch64-none-linux-gnu. Thanks, Kyrill 2016-11-10 Kyrylo Tkachov * config/aarch64/aarch64.h (machine_function): Add reg_is_wrapped_separately field. * config/aarch64/aarch64.c (emit_set_insn): Change return type to rtx_insn *. (aarch64_save_callee_saves): Don't save registers that are wrapped separately. (aarch64_restore_callee_saves): Don't restore registers that are wrapped separately. (offset_9bit_signed_unscaled_p, offset_12bit_unsigned_scaled_p, aarch64_offset_7bit_signed_scaled_p): Move earlier in the file. (aarch64_get_separate_components): New function. (aarch64_components_for_bb): Likewise. (aarch64_disqualify_components): Likewise. (aarch64_emit_prologue_components): Likewise. (aarch64_emit_epilogue_components): Likewise. (aarch64_set_handled_components): Likewise. (TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS, TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB, TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS, TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS, TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS, TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS): Define. commit 14c7a66d9f3a44ef40499e61ca9643c7dfbc6c82 Author: Kyrylo Tkachov Date: Tue Oct 11 09:25:54 2016 +0100 [AArch64] Separate shrink wrapping hooks implementation diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c index 325e725..5508333 100644 --- a/gcc/config/aarch64/aarch64.c +++ b/gcc/config/aarch64/aarch64.c @@ -1138,7 +1138,7 @@ aarch64_is_extend_from_extract (machine_mode mode, rtx mult_imm, /* Emit an insn that's a simple single-set. Both the operands must be known to be valid. */ -inline static rtx +inline static rtx_insn * emit_set_insn (rtx x, rtx y) { return emit_insn (gen_rtx_SET (x, y)); @@ -3135,6 +3135,9 @@ aarch64_save_callee_saves (machine_mode mode, HOST_WIDE_INT start_offset, || regno == cfun->machine->frame.wb_candidate2)) continue; + if (cfun->machine->reg_is_wrapped_separately[regno]) + continue; + reg = gen_rtx_REG (mode, regno); offset = start_offset + cfun->machine->frame.reg_offset[regno]; mem = gen_mem_ref (mode, plus_constant (Pmode, stack_pointer_rtx, @@ -3143,6 +3146,7 @@ aarch64_save_callee_saves (machine_mode mode, HOST_WIDE_INT start_offset, regno2 = aarch64_next_callee_save (regno + 1, limit); if (regno2 <= limit + && !cfun->machine->reg_is_wrapped_separately[regno2] && ((cfun->machine->frame.reg_offset[regno] + UNITS_PER_WORD) == cfun->machine->frame.reg_offset[regno2])) @@ -3191,6 +3195,9 @@ aarch64_restore_callee_saves (machine_mode mode, regno <= limit; regno = aarch64_next_callee_save (regno + 1, limit)) { + if (cfun->machine->reg_is_wrapped_separately[regno]) + continue; + rtx reg, mem; if (skip_wb @@ -3205,6 +3212,7 @@ aarch64_restore_callee_saves (machine_mode mode, regno2 = aarch64_next_callee_save (regno + 1, limit); if (regno2 <= limit + && !cfun->machine->reg_is_wrapped_separately[regno2] && ((cfun->machine->frame.reg_offset[regno] + UNITS_PER_WORD) == cfun->machine->frame.reg_offset[regno2])) { @@ -3224,6 +3232,169 @@ aarch64_restore_callee_saves (machine_mode mode, } } +static inline bool +offset_9bit_signed_unscaled_p (machine_mode mode ATTRIBUTE_UNUSED, + HOST_WIDE_INT offset) +{ + return offset >= -256 && offset < 256; +} + +static inline bool +offset_12bit_unsigned_scaled_p (machine_mode mode, HOST_WIDE_INT offset) +{ + return (offset >= 0 + && offset < 4096 * GET_MODE_SIZE (mode) + && offset % GET_MODE_SIZE (mode) == 0); +} + +bool +aarch64_offset_7bit_signed_scaled_p (machine_mode mode, HOST_WIDE_INT offset) +{ + return (offset >= -64 * GET_MODE_SIZE (mode) + && offset < 64 * GET_MODE_SIZE (mode) + && offset % GET_MODE_SIZE (mode) == 0); +} + +/* Implement TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS. */ + +static sbitmap +aarch64_get_separate_components (void) +{ + /* Calls to alloca further extend the stack frame and it can be messy to + figure out the location of the stack slots for each register. + For now be conservative. */ + if (cfun->calls_alloca) + return NULL; + + aarch64_layout_frame (); + + sbitmap components = sbitmap_alloc (V31_REGNUM + 1); + bitmap_clear (components); + + /* The registers we need saved to the frame. */ + for (unsigned regno = R0_REGNUM; regno <= V31_REGNUM; regno++) + if (aarch64_register_saved_on_entry (regno)) + { + HOST_WIDE_INT offset = cfun->machine->frame.reg_offset[regno]; + if (!frame_pointer_needed) + offset += cfun->machine->frame.frame_size + - cfun->machine->frame.hard_fp_offset; + /* Check that we can access the stack slot of the register with one + direct load with no adjustments needed. */ + if (offset_12bit_unsigned_scaled_p (DImode, offset)) + bitmap_set_bit (components, regno); + } + + /* Don't mess with the hard frame pointer. */ + if (frame_pointer_needed) + bitmap_clear_bit (components, HARD_FRAME_POINTER_REGNUM); + + unsigned reg1 = cfun->machine->frame.wb_candidate1; + unsigned reg2 = cfun->machine->frame.wb_candidate2; + /* If aarch64_layout_frame has chosen registers to store/restore with + writeback don't interfere with them to avoid having to output explicit + stack adjustment instructions. */ + if (reg2 != INVALID_REGNUM) + bitmap_clear_bit (components, reg2); + if (reg1 != INVALID_REGNUM) + bitmap_clear_bit (components, reg1); + + bitmap_clear_bit (components, LR_REGNUM); + bitmap_clear_bit (components, SP_REGNUM); + + return components; +} + +/* Implement TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB. */ + +static sbitmap +aarch64_components_for_bb (basic_block bb) +{ + bitmap in = DF_LIVE_IN (bb); + bitmap gen = &DF_LIVE_BB_INFO (bb)->gen; + bitmap kill = &DF_LIVE_BB_INFO (bb)->kill; + + sbitmap components = sbitmap_alloc (V31_REGNUM + 1); + bitmap_clear (components); + + /* GPRs are used in a bb if they are in the IN, GEN, or KILL sets. */ + for (unsigned regno = R0_REGNUM; regno <= V31_REGNUM; regno++) + if ((!call_used_regs[regno]) + && (bitmap_bit_p (in, regno) + || bitmap_bit_p (gen, regno) + || bitmap_bit_p (kill, regno))) + bitmap_set_bit (components, regno); + + return components; +} + +/* Implement TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS. + Nothing to do for aarch64. */ + +static void +aarch64_disqualify_components (sbitmap, edge, sbitmap, bool) +{ +} + +/* Implement TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS. */ + +static void +aarch64_emit_prologue_components (sbitmap components) +{ + rtx ptr_reg = gen_rtx_REG (Pmode, frame_pointer_needed + ? HARD_FRAME_POINTER_REGNUM + : STACK_POINTER_REGNUM); + + for (unsigned regno = R0_REGNUM; regno <= V31_REGNUM; regno++) + if (bitmap_bit_p (components, regno)) + { + rtx reg = gen_rtx_REG (Pmode, regno); + HOST_WIDE_INT offset = cfun->machine->frame.reg_offset[regno]; + if (!frame_pointer_needed) + offset += cfun->machine->frame.frame_size + - cfun->machine->frame.hard_fp_offset; + rtx addr = plus_constant (Pmode, ptr_reg, offset); + rtx mem = gen_frame_mem (Pmode, addr); + + RTX_FRAME_RELATED_P (emit_move_insn (mem, reg)) = 1; + } +} + +/* Implement TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS. */ + +static void +aarch64_emit_epilogue_components (sbitmap components) +{ + + rtx ptr_reg = gen_rtx_REG (Pmode, frame_pointer_needed + ? HARD_FRAME_POINTER_REGNUM + : STACK_POINTER_REGNUM); + for (unsigned regno = R0_REGNUM; regno <= V31_REGNUM; regno++) + if (bitmap_bit_p (components, regno)) + { + rtx reg = gen_rtx_REG (Pmode, regno); + HOST_WIDE_INT offset = cfun->machine->frame.reg_offset[regno]; + if (!frame_pointer_needed) + offset += cfun->machine->frame.frame_size + - cfun->machine->frame.hard_fp_offset; + rtx addr = plus_constant (Pmode, ptr_reg, offset); + rtx mem = gen_frame_mem (Pmode, addr); + + RTX_FRAME_RELATED_P (emit_move_insn (reg, mem)) = 1; + add_reg_note (get_last_insn (), REG_CFA_RESTORE, reg); + } +} + +/* Implement TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS. */ + +static void +aarch64_set_handled_components (sbitmap components) +{ + for (unsigned regno = R0_REGNUM; regno <= V31_REGNUM; regno++) + if (bitmap_bit_p (components, regno)) + cfun->machine->reg_is_wrapped_separately[regno] = true; +} + /* AArch64 stack frames generated by this compiler look like: +-------------------------------+ @@ -3944,29 +4115,6 @@ aarch64_classify_index (struct aarch64_address_info *info, rtx x, return false; } -bool -aarch64_offset_7bit_signed_scaled_p (machine_mode mode, HOST_WIDE_INT offset) -{ - return (offset >= -64 * GET_MODE_SIZE (mode) - && offset < 64 * GET_MODE_SIZE (mode) - && offset % GET_MODE_SIZE (mode) == 0); -} - -static inline bool -offset_9bit_signed_unscaled_p (machine_mode mode ATTRIBUTE_UNUSED, - HOST_WIDE_INT offset) -{ - return offset >= -256 && offset < 256; -} - -static inline bool -offset_12bit_unsigned_scaled_p (machine_mode mode, HOST_WIDE_INT offset) -{ - return (offset >= 0 - && offset < 4096 * GET_MODE_SIZE (mode) - && offset % GET_MODE_SIZE (mode) == 0); -} - /* Return true if MODE is one of the modes for which we support LDP/STP operations. */ @@ -14452,6 +14600,30 @@ aarch64_optab_supported_p (int op, machine_mode mode1, machine_mode, #define TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD_GUARD \ aarch64_first_cycle_multipass_dfa_lookahead_guard +#undef TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS +#define TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS \ + aarch64_get_separate_components + +#undef TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB +#define TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB \ + aarch64_components_for_bb + +#undef TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS +#define TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS \ + aarch64_disqualify_components + +#undef TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS +#define TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS \ + aarch64_emit_prologue_components + +#undef TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS +#define TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS \ + aarch64_emit_epilogue_components + +#undef TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS +#define TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS \ + aarch64_set_handled_components + #undef TARGET_TRAMPOLINE_INIT #define TARGET_TRAMPOLINE_INIT aarch64_trampoline_init diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h index 584ff5c..fb89e5a 100644 --- a/gcc/config/aarch64/aarch64.h +++ b/gcc/config/aarch64/aarch64.h @@ -591,6 +591,8 @@ struct GTY (()) aarch64_frame typedef struct GTY (()) machine_function { struct aarch64_frame frame; + /* One entry for each GPR and FP register. */ + bool reg_is_wrapped_separately[V31_REGNUM + 1]; } machine_function; #endif