From patchwork Fri Nov 11 15:31:32 2016
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Kyrill Tkachov
X-Patchwork-Id: 81850
Message-ID: <5825E454.6070302@foss.arm.com>
Date: Fri, 11 Nov 2016 15:31:32 +0000
From: Kyrill Tkachov
To: Segher Boessenkool, Andrew Pinski
CC: GCC Patches, Marcus Shawcroft, Richard Earnshaw, James Greenhalgh
Subject: Re: [PATCH][AArch64] Separate shrink wrapping hooks implementation
References: <5824836B.5030302@foss.arm.com> <20161110233943.GC17570@gate.crashing.org> <58259AD6.4040203@foss.arm.com>
In-Reply-To: <58259AD6.4040203@foss.arm.com>

On 11/11/16 10:17, Kyrill Tkachov wrote:
>
> On 10/11/16 23:39, Segher Boessenkool wrote:
>> On Thu, Nov 10, 2016 at 02:42:24PM -0800, Andrew Pinski wrote:
>>> On Thu, Nov 10, 2016 at 6:25 AM, Kyrill Tkachov
>>>> I ran SPEC2006 on a Cortex-A72. Overall scores were neutral but there were
>>>> some interesting swings.
>>>> 458.sjeng     +1.45%
>>>> 471.omnetpp   +2.19%
>>>> 445.gobmk     -2.01%
>>>>
>>>> On SPECFP:
>>>> 453.povray    +7.00%
>>>
>>> Wow, this looks really good.  Thank you for implementing this.  If I
>>> get some time I am going to try it out on other processors than A72
>>> but I doubt I have time any time soon.
>> I'd love to hear what causes the slowdown for gobmk as well, btw.
>
> I haven't yet gotten a direct answer for that (through performance analysis tools)
> but I have noticed that load/store pairs are not generated as aggressively as I hoped.
> They are being merged by the sched fusion pass and peepholes (which runs after this)
> but it still misses cases. I've hacked the SWS hooks to generate pairs explicitly and that
> increases the number of pairs and helps code size to boot. It complicates the logic of
> the hooks a bit but not too much.
>
> I'll make those changes and re-benchmark, hopefully that
> will help performance.
>

And here's a version that explicitly emits pairs. I've looked at assembly codegen on SPEC2006
and it generates quite a few more LDP/STP pairs than the original version.
I kicked off benchmarks over the weekend to see the effect.
Andrew, if you want to try it out (more benchmarking and testing always welcome) this is the
one to try.

Thanks,
Kyrill

> Thanks,
> Kyrill
>
>>
>> Segher
>

commit bedb71d6f6f772eed33ba35e93cc4104326675da
Author: Kyrylo Tkachov
Date:   Tue Oct 11 09:25:54 2016 +0100

    [AArch64] Separate shrink wrapping hooks implementation

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 325e725..15b5bdf 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -1138,7 +1138,7 @@ aarch64_is_extend_from_extract (machine_mode mode, rtx mult_imm,

 /* Emit an insn that's a simple single-set.  Both the operands must be
    known to be valid.  */
-inline static rtx
+inline static rtx_insn *
 emit_set_insn (rtx x, rtx y)
 {
   return emit_insn (gen_rtx_SET (x, y));
@@ -3135,6 +3135,9 @@ aarch64_save_callee_saves (machine_mode mode, HOST_WIDE_INT start_offset,
               || regno == cfun->machine->frame.wb_candidate2))
         continue;

+      if (cfun->machine->reg_is_wrapped_separately[regno])
+        continue;
+
       reg = gen_rtx_REG (mode, regno);
       offset = start_offset + cfun->machine->frame.reg_offset[regno];
       mem = gen_mem_ref (mode, plus_constant (Pmode, stack_pointer_rtx,
@@ -3143,6 +3146,7 @@ aarch64_save_callee_saves (machine_mode mode, HOST_WIDE_INT start_offset,
       regno2 = aarch64_next_callee_save (regno + 1, limit);

       if (regno2 <= limit
+          && !cfun->machine->reg_is_wrapped_separately[regno2]
           && ((cfun->machine->frame.reg_offset[regno] + UNITS_PER_WORD)
               == cfun->machine->frame.reg_offset[regno2]))

@@ -3191,6 +3195,9 @@ aarch64_restore_callee_saves (machine_mode mode,
        regno <= limit;
        regno = aarch64_next_callee_save (regno + 1, limit))
     {
+      if (cfun->machine->reg_is_wrapped_separately[regno])
+        continue;
+
       rtx reg, mem;

       if (skip_wb
@@ -3205,6 +3212,7 @@ aarch64_restore_callee_saves (machine_mode mode,
       regno2 = aarch64_next_callee_save (regno + 1, limit);

       if (regno2 <= limit
+          && !cfun->machine->reg_is_wrapped_separately[regno2]
           && ((cfun->machine->frame.reg_offset[regno] + UNITS_PER_WORD)
               == cfun->machine->frame.reg_offset[regno2]))
         {
@@ -3224,6 +3232,273 @@ aarch64_restore_callee_saves (machine_mode mode,
     }
 }

+static inline bool
+offset_9bit_signed_unscaled_p (machine_mode mode ATTRIBUTE_UNUSED,
+                               HOST_WIDE_INT offset)
+{
+  return offset >= -256 && offset < 256;
+}
+
+static inline bool
+offset_12bit_unsigned_scaled_p (machine_mode mode, HOST_WIDE_INT offset)
+{
+  return (offset >= 0
+          && offset < 4096 * GET_MODE_SIZE (mode)
+          && offset % GET_MODE_SIZE (mode) == 0);
+}
+
+bool
+aarch64_offset_7bit_signed_scaled_p (machine_mode mode, HOST_WIDE_INT offset)
+{
+  return (offset >= -64 * GET_MODE_SIZE (mode)
+          && offset < 64 * GET_MODE_SIZE (mode)
+          && offset % GET_MODE_SIZE (mode) == 0);
+}
+
+/* Implement TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS.  */
+
+static sbitmap
+aarch64_get_separate_components (void)
+{
+  aarch64_layout_frame ();
+
+  sbitmap components = sbitmap_alloc (V31_REGNUM + 1);
+  bitmap_clear (components);
+
+  /* The registers we need saved to the frame.  */
+  for (unsigned regno = R0_REGNUM; regno <= V31_REGNUM; regno++)
+    if (aarch64_register_saved_on_entry (regno))
+      {
+        HOST_WIDE_INT offset = cfun->machine->frame.reg_offset[regno];
+        if (!frame_pointer_needed)
+          offset += cfun->machine->frame.frame_size
+                    - cfun->machine->frame.hard_fp_offset;
+        /* Check that we can access the stack slot of the register with one
+           direct load with no adjustments needed.  */
+        if (offset_12bit_unsigned_scaled_p (DImode, offset))
+          bitmap_set_bit (components, regno);
+      }
+
+  /* Don't mess with the hard frame pointer.  */
+  if (frame_pointer_needed)
+    bitmap_clear_bit (components, HARD_FRAME_POINTER_REGNUM);
+
+  unsigned reg1 = cfun->machine->frame.wb_candidate1;
+  unsigned reg2 = cfun->machine->frame.wb_candidate2;
+  /* If aarch64_layout_frame has chosen registers to store/restore with
+     writeback don't interfere with them to avoid having to output explicit
+     stack adjustment instructions.  */
+  if (reg2 != INVALID_REGNUM)
+    bitmap_clear_bit (components, reg2);
+  if (reg1 != INVALID_REGNUM)
+    bitmap_clear_bit (components, reg1);
+
+  bitmap_clear_bit (components, LR_REGNUM);
+  bitmap_clear_bit (components, SP_REGNUM);
+
+  return components;
+}
+
+/* Implement TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB.  */
+
+static sbitmap
+aarch64_components_for_bb (basic_block bb)
+{
+  bitmap in = DF_LIVE_IN (bb);
+  bitmap gen = &DF_LIVE_BB_INFO (bb)->gen;
+  bitmap kill = &DF_LIVE_BB_INFO (bb)->kill;
+
+  sbitmap components = sbitmap_alloc (V31_REGNUM + 1);
+  bitmap_clear (components);
+
+  /* GPRs are used in a bb if they are in the IN, GEN, or KILL sets.  */
+  for (unsigned regno = R0_REGNUM; regno <= V31_REGNUM; regno++)
+    if ((!call_used_regs[regno])
+       && (bitmap_bit_p (in, regno)
+           || bitmap_bit_p (gen, regno)
+           || bitmap_bit_p (kill, regno)))
+          bitmap_set_bit (components, regno);
+
+  return components;
+}
+
+/* Implement TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS.
+   Nothing to do for aarch64.  */
+
+static void
+aarch64_disqualify_components (sbitmap, edge, sbitmap, bool)
+{
+}
+
+/* Return the next set bit in BMP from START onwards.  Return the total number
+   of bits in BMP if no set bit is found at or after START.  */
+
+static unsigned int
+aarch64_get_next_set_bit (sbitmap bmp, unsigned int start)
+{
+  unsigned int nbits = SBITMAP_SIZE (bmp);
+  if (start == nbits)
+    return start;
+
+  gcc_assert (start < nbits);
+  for (unsigned int i = start; i < nbits; i++)
+    if (bitmap_bit_p (bmp, i))
+      return i;
+
+  return nbits;
+}
+
+/* Implement TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS.  */
+
+static void
+aarch64_emit_prologue_components (sbitmap components)
+{
+  rtx ptr_reg = gen_rtx_REG (Pmode, frame_pointer_needed
+                             ? HARD_FRAME_POINTER_REGNUM
+                             : STACK_POINTER_REGNUM);
+
+  unsigned total_bits = SBITMAP_SIZE (components);
+  unsigned regno = aarch64_get_next_set_bit (components, R0_REGNUM);
+  rtx_insn *insn = NULL;
+
+  while (regno != total_bits)
+    {
+      machine_mode mode = GP_REGNUM_P (regno) ? DImode : DFmode;
+      rtx reg = gen_rtx_REG (mode, regno);
+      HOST_WIDE_INT offset = cfun->machine->frame.reg_offset[regno];
+      if (!frame_pointer_needed)
+        offset += cfun->machine->frame.frame_size
+                  - cfun->machine->frame.hard_fp_offset;
+      rtx addr = plus_constant (Pmode, ptr_reg, offset);
+      rtx mem = gen_frame_mem (mode, addr);
+
+      rtx set = gen_rtx_SET (mem, reg);
+      unsigned regno2 = aarch64_get_next_set_bit (components, regno + 1);
+      /* No more registers to save after REGNO or the memory slot of the
+         register is not suitable for a store pair.
+         Emit a single save and exit.  */
+      if (!satisfies_constraint_Ump (mem)
+          || regno2 == total_bits)
+        {
+          insn = emit_insn (set);
+          RTX_FRAME_RELATED_P (insn) = 1;
+          add_reg_note (insn, REG_CFA_OFFSET, copy_rtx (set));
+          break;
+        }
+
+      HOST_WIDE_INT offset2 = cfun->machine->frame.reg_offset[regno2];
+      /* The next register is not of the same class or its offset is not
+         mergeable with the current one into a pair.  */
+      if (GP_REGNUM_P (regno) != GP_REGNUM_P (regno2)
+          || (offset2 - cfun->machine->frame.reg_offset[regno])
+             != GET_MODE_SIZE (DImode))
+        {
+          insn = emit_insn (set);
+          RTX_FRAME_RELATED_P (insn) = 1;
+          add_reg_note (insn, REG_CFA_OFFSET, copy_rtx (set));
+
+          regno = regno2;
+          continue;
+        }
+
+      /* REGNO2 can be stored in a pair with REGNO.  */
+      rtx reg2 = gen_rtx_REG (mode, regno2);
+      if (!frame_pointer_needed)
+        offset2 += cfun->machine->frame.frame_size
+                   - cfun->machine->frame.hard_fp_offset;
+      rtx addr2 = plus_constant (Pmode, ptr_reg, offset2);
+      rtx mem2 = gen_frame_mem (mode, addr2);
+      rtx set2 = gen_rtx_SET (mem2, reg2);
+
+      insn = emit_insn (aarch64_gen_store_pair (mode, mem, reg, mem2, reg2));
+      RTX_FRAME_RELATED_P (insn) = 1;
+      add_reg_note (insn, REG_CFA_OFFSET, set);
+      add_reg_note (insn, REG_CFA_OFFSET, set2);
+
+      regno = aarch64_get_next_set_bit (components, regno2 + 1);
+    }
+
+}
+
+/* Implement TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS.  */
+
+static void
+aarch64_emit_epilogue_components (sbitmap components)
+{
+
+  rtx ptr_reg = gen_rtx_REG (Pmode, frame_pointer_needed
+                             ? HARD_FRAME_POINTER_REGNUM
+                             : STACK_POINTER_REGNUM);
+  unsigned total_bits = SBITMAP_SIZE (components);
+  unsigned regno = aarch64_get_next_set_bit (components, R0_REGNUM);
+  rtx_insn *insn = NULL;
+
+  while (regno != total_bits)
+    {
+      machine_mode mode = GP_REGNUM_P (regno) ? DImode : DFmode;
+      rtx reg = gen_rtx_REG (mode, regno);
+      HOST_WIDE_INT offset = cfun->machine->frame.reg_offset[regno];
+      if (!frame_pointer_needed)
+        offset += cfun->machine->frame.frame_size
+                  - cfun->machine->frame.hard_fp_offset;
+      rtx addr = plus_constant (Pmode, ptr_reg, offset);
+      rtx mem = gen_frame_mem (mode, addr);
+
+      unsigned regno2 = aarch64_get_next_set_bit (components, regno + 1);
+      /* No more registers to restore or the memory location is not suitable
+         for a load pair.  Emit a single restore and exit.  */
+      if (!satisfies_constraint_Ump (mem)
+          || regno2 == total_bits)
+        {
+          insn = emit_move_insn (reg, mem);
+          RTX_FRAME_RELATED_P (insn) = 1;
+          add_reg_note (insn, REG_CFA_RESTORE, reg);
+          break;
+        }
+
+      HOST_WIDE_INT offset2 = cfun->machine->frame.reg_offset[regno2];
+      /* The next register is not of the same class or its offset is not
+         mergeable with the current one into a pair.  Emit a single restore
+         and continue from REGNO2.  */
+      if (GP_REGNUM_P (regno) != GP_REGNUM_P (regno2)
+          || (offset2 - cfun->machine->frame.reg_offset[regno])
+             != GET_MODE_SIZE (DImode))
+        {
+          insn = emit_move_insn (reg, mem);
+          RTX_FRAME_RELATED_P (insn) = 1;
+          add_reg_note (insn, REG_CFA_RESTORE, reg);
+
+          regno = regno2;
+          continue;
+        }
+
+      /* REGNO2 can be loaded in a pair with REGNO.  */
+      rtx reg2 = gen_rtx_REG (mode, regno2);
+      if (!frame_pointer_needed)
+        offset2 += cfun->machine->frame.frame_size
+                   - cfun->machine->frame.hard_fp_offset;
+      rtx addr2 = plus_constant (Pmode, ptr_reg, offset2);
+      rtx mem2 = gen_frame_mem (mode, addr2);
+
+      insn = emit_insn (aarch64_gen_load_pair (mode, reg, mem, reg2, mem2));
+      RTX_FRAME_RELATED_P (insn) = 1;
+      add_reg_note (insn, REG_CFA_RESTORE, reg);
+      add_reg_note (insn, REG_CFA_RESTORE, reg2);
+
+      regno = aarch64_get_next_set_bit (components, regno2 + 1);
+    }
+}
+
+/* Implement TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS.  */
+
+static void
+aarch64_set_handled_components (sbitmap components)
+{
+  for (unsigned regno = R0_REGNUM; regno <= V31_REGNUM; regno++)
+    if (bitmap_bit_p (components, regno))
+      cfun->machine->reg_is_wrapped_separately[regno] = true;
+}
+
 /* AArch64 stack frames generated by this compiler look like:

         +-------------------------------+
@@ -3944,29 +4219,6 @@ aarch64_classify_index (struct aarch64_address_info *info, rtx x,
   return false;
 }

-bool
-aarch64_offset_7bit_signed_scaled_p (machine_mode mode, HOST_WIDE_INT offset)
-{
-  return (offset >= -64 * GET_MODE_SIZE (mode)
-          && offset < 64 * GET_MODE_SIZE (mode)
-          && offset % GET_MODE_SIZE (mode) == 0);
-}
-
-static inline bool
-offset_9bit_signed_unscaled_p (machine_mode mode ATTRIBUTE_UNUSED,
-                               HOST_WIDE_INT offset)
-{
-  return offset >= -256 && offset < 256;
-}
-
-static inline bool
-offset_12bit_unsigned_scaled_p (machine_mode mode, HOST_WIDE_INT offset)
-{
-  return (offset >= 0
-          && offset < 4096 * GET_MODE_SIZE (mode)
-          && offset % GET_MODE_SIZE (mode) == 0);
-}
-
 /* Return true if MODE is one of the modes for which we
    support LDP/STP operations.  */

@@ -14452,6 +14704,30 @@ aarch64_optab_supported_p (int op, machine_mode mode1, machine_mode,
 #define TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD_GUARD \
   aarch64_first_cycle_multipass_dfa_lookahead_guard

+#undef TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS
+#define TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS \
+  aarch64_get_separate_components
+
+#undef TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB
+#define TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB \
+  aarch64_components_for_bb
+
+#undef TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS
+#define TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS \
+  aarch64_disqualify_components
+
+#undef TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS
+#define TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS \
+  aarch64_emit_prologue_components
+
+#undef TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS
+#define TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS \
+  aarch64_emit_epilogue_components
+
+#undef TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS
+#define TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS \
+  aarch64_set_handled_components
+
 #undef TARGET_TRAMPOLINE_INIT
 #define TARGET_TRAMPOLINE_INIT aarch64_trampoline_init

diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h
index 584ff5c..fb89e5a 100644
--- a/gcc/config/aarch64/aarch64.h
+++ b/gcc/config/aarch64/aarch64.h
@@ -591,6 +591,8 @@ struct GTY (()) aarch64_frame
 typedef struct GTY (()) machine_function
 {
   struct aarch64_frame frame;
+  /* One entry for each GPR and FP register.  */
+  bool reg_is_wrapped_separately[V31_REGNUM + 1];
 } machine_function;
 #endif

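
The pairing walk in aarch64_emit_prologue_components and aarch64_emit_epilogue_components can be tried outside GCC with a small standalone C sketch. This is an illustration under stated assumptions, not the patch's code. The names plan_saves, struct save_op, next_set, pair_offset_ok, NREGS, FIRST_FP_REG and SLOT_SIZE are hypothetical. Plain arrays stand in for the sbitmap and for cfun->machine->frame.reg_offset. pair_offset_ok only copies the range test of aarch64_offset_7bit_signed_scaled_p as an assumed approximation of the Ump constraint. One deliberate difference: when a slot is not suitable for a pair, the sketch records a single save and keeps walking, whereas the hooks in the patch emit the single save and leave the loop.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the AArch64 register numbering:
   indices 0..31 model GPRs, 32..63 model FP/SIMD registers.  */
enum { NREGS = 64, FIRST_FP_REG = 32, SLOT_SIZE = 8 };

/* One planned save: a pair (reg2 >= 0) or a single (reg2 == -1).  */
struct save_op { int reg1; int reg2; };

/* Return the first index >= START that is set in SELECTED, or NREGS
   if there is none (the role aarch64_get_next_set_bit plays).  */
static unsigned
next_set (const bool *selected, unsigned start)
{
  assert (start <= NREGS);
  for (unsigned i = start; i < NREGS; i++)
    if (selected[i])
      return i;
  return NREGS;
}

/* Assumed approximation of the pair-offset test: a 7-bit signed
   immediate scaled by the slot size.  */
static bool
pair_offset_ok (long offset)
{
  return (offset >= -64 * SLOT_SIZE
          && offset < 64 * SLOT_SIZE
          && offset % SLOT_SIZE == 0);
}

/* Walk the selected registers in increasing order and record, in OUT,
   which ones would be saved as a pair and which as singles.
   Returns the number of entries written to OUT.  */
size_t
plan_saves (const bool *selected, const long *slot_offset,
            struct save_op *out)
{
  size_t n = 0;
  unsigned regno = next_set (selected, 0);

  while (regno != NREGS)
    {
      unsigned regno2 = next_set (selected, regno + 1);

      /* Pair only registers of the same class whose slots are adjacent
         and whose first slot is reachable by a pair instruction.  */
      bool pairable
        = (regno2 != NREGS
           && pair_offset_ok (slot_offset[regno])
           && (regno < FIRST_FP_REG) == (regno2 < FIRST_FP_REG)
           && slot_offset[regno2] - slot_offset[regno] == SLOT_SIZE);

      if (pairable)
        {
          out[n++] = (struct save_op) { (int) regno, (int) regno2 };
          regno = next_set (selected, regno2 + 1);
        }
      else
        {
          out[n++] = (struct save_op) { (int) regno, -1 };
          regno = regno2;
        }
    }
  return n;
}
```

With registers 19 and 20 in adjacent slots, register 22 on its own, and FP registers 40 and 41 in adjacent slots, the plan comes out as a pair, a single, and a pair. Register 22 is not merged with register 40 because they belong to different classes.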