Message ID: 1463476002-513-1-git-send-email-james.greenhalgh@arm.com
State: New
On 17 May 2016 at 10:06, James Greenhalgh <james.greenhalgh@arm.com> wrote:
>
> Hi,
>
> This is just a simplification, it probably makes life easier for register
> allocation in some corner cases and seems the right thing to do. We don't
> use the internal version elsewhere, so we're safe to delete it and change
> the types.
>
> OK?
>
> Bootstrapped on AArch64 with no issues.

Help me understand why this is ok for BE ?

Cheers
/Marcus
On Tue, May 17, 2016 at 11:32:36AM +0100, Marcus Shawcroft wrote:
> On 17 May 2016 at 10:06, James Greenhalgh <james.greenhalgh@arm.com> wrote:
> >
> > Hi,
> >
> > This is just a simplification, it probably makes life easier for register
> > allocation in some corner cases and seems the right thing to do. We don't
> > use the internal version elsewhere, so we're safe to delete it and change
> > the types.
> >
> > OK?
> >
> > Bootstrapped on AArch64 with no issues.
>
> Help me understand why this is ok for BE ?

The reduc_plus_scal_<mode> pattern wants to take a vector and return a
scalar value representing the sum of the lanes of that vector. We want to
go from V2DFmode to DFmode.

The architectural instruction FADDP writes a scalar value to the low bits
of the register, leaving zeroes in the upper bits. i.e.

  faddp d0, v1.2d

  128                64                     0
   |       0x0       |  v1.d[0] + v1.d[1]  |

In the current implementation, we use the aarch64_reduc_plus_internal<mode>
pattern, which treats the result of FADDP as a vector of two elements. We
then need an extra step to extract the correct scalar value from that
vector. From GCC's point of view the lane containing the result is either
lane 0 (little-endian) or lane 1 (big-endian), which is why the current
code is endian dependent. The extract operation will always be a NOP move
from architectural bits 0-63 to architectural bits 0-63 - but we never
elide the move, as later passes can't be certain that the upper bits are
zero (they come out of an UNSPEC, so could be anything).

However, this is all unnecessary. FADDP does exactly what we want,
regardless of endianness; we just need to model the instruction as writing
the scalar value in the first place, which is what this patch wires up.

We probably just missed this optimization in the migration from the
reduc_splus optabs (which required a vector return value) to the
reduc_plus_scal optabs (which require a scalar return value).

Does that help?

Thanks,
James
On 17 May 2016 at 12:02, James Greenhalgh <james.greenhalgh@arm.com> wrote:
> On Tue, May 17, 2016 at 11:32:36AM +0100, Marcus Shawcroft wrote:
>> On 17 May 2016 at 10:06, James Greenhalgh <james.greenhalgh@arm.com> wrote:
>> >
>> > Hi,
>> >
>> > This is just a simplification, it probably makes life easier for register
>> > allocation in some corner cases and seems the right thing to do. We don't
>> > use the internal version elsewhere, so we're safe to delete it and change
>> > the types.
>> >
>> > OK?
>> >
>> > Bootstrapped on AArch64 with no issues.
>>
>> Help me understand why this is ok for BE ?
>
> The reduc_plus_scal_<mode> pattern wants to take a vector and return a
> scalar value representing the sum of the lanes of that vector. We want
> to go from V2DFmode to DFmode.
>
> The architectural instruction FADDP writes a scalar value to the low
> bits of the register, leaving zeroes in the upper bits. i.e.
>
>   faddp d0, v1.2d
>
>   128                64                     0
>    |       0x0       |  v1.d[0] + v1.d[1]  |
>
> In the current implementation, we use the
> aarch64_reduc_plus_internal<mode> pattern, which treats the result of
> FADDP as a vector of two elements. We then need an extra step to extract
> the correct scalar value from that vector. From GCC's point of view the
> lane containing the result is either lane 0 (little-endian) or lane 1
> (big-endian), which is why the current code is endian dependent. The
> extract operation will always be a NOP move from architectural bits 0-63
> to architectural bits 0-63 - but we never elide the move, as later
> passes can't be certain that the upper bits are zero (they come out of
> an UNSPEC, so could be anything).
>
> However, this is all unnecessary. FADDP does exactly what we want,
> regardless of endianness; we just need to model the instruction as
> writing the scalar value in the first place, which is what this patch
> wires up.
>
> We probably just missed this optimization in the migration from the
> reduc_splus optabs (which required a vector return value) to the
> reduc_plus_scal optabs (which require a scalar return value).
>
> Does that help?

Yep. Thanks. OK to commit.

/Marcus

> Thanks,
> James
diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index bd73bce..30023f0 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -1989,19 +1989,6 @@
 }
 )
 
-(define_expand "reduc_plus_scal_<mode>"
-  [(match_operand:<VEL> 0 "register_operand" "=w")
-   (match_operand:V2F 1 "register_operand" "w")]
-  "TARGET_SIMD"
-  {
-    rtx elt = GEN_INT (ENDIAN_LANE_N (<MODE>mode, 0));
-    rtx scratch = gen_reg_rtx (<MODE>mode);
-    emit_insn (gen_aarch64_reduc_plus_internal<mode> (scratch, operands[1]));
-    emit_insn (gen_aarch64_get_lane<mode> (operands[0], scratch, elt));
-    DONE;
-  }
-)
-
 (define_insn "aarch64_reduc_plus_internal<mode>"
   [(set (match_operand:VDQV 0 "register_operand" "=w")
        (unspec:VDQV [(match_operand:VDQV 1 "register_operand" "w")]
@@ -2020,9 +2007,9 @@
   [(set_attr "type" "neon_reduc_add")]
 )
 
-(define_insn "aarch64_reduc_plus_internal<mode>"
-  [(set (match_operand:V2F 0 "register_operand" "=w")
-       (unspec:V2F [(match_operand:V2F 1 "register_operand" "w")]
+(define_insn "reduc_plus_scal_<mode>"
+  [(set (match_operand:<VEL> 0 "register_operand" "=w")
+       (unspec:<VEL> [(match_operand:V2F 1 "register_operand" "w")]
 	UNSPEC_FADDV))]
   "TARGET_SIMD"
   "faddp\\t%<Vetype>0, %1.<Vtype>"