
[AArch64] Simplify reduc_plus_scal_v2[sd]f sequence

Message ID 1463476002-513-1-git-send-email-james.greenhalgh@arm.com
State New

Commit Message

James Greenhalgh May 17, 2016, 9:06 a.m. UTC
Hi,

This is just a simplification; it probably makes life easier for register
allocation in some corner cases and seems the right thing to do. We don't
use the internal version elsewhere, so we're safe to delete it and change
the types.

OK?

Bootstrapped on AArch64 with no issues.

Thanks,
James

---
2016-05-17  James Greenhalgh  <james.greenhalgh@arm.com>

	* config/aarch64/aarch64-simd.md
	(aarch64_reduc_plus_internal<mode>): Rename to...
	(reduc_plus_scal): ...This, and remove previous implementation.

Comments

Marcus Shawcroft May 17, 2016, 10:32 a.m. UTC | #1
On 17 May 2016 at 10:06, James Greenhalgh <james.greenhalgh@arm.com> wrote:
>
> Hi,
>
> This is just a simplification; it probably makes life easier for register
> allocation in some corner cases and seems the right thing to do. We don't
> use the internal version elsewhere, so we're safe to delete it and change
> the types.
>
> OK?
>
> Bootstrapped on AArch64 with no issues.

Help me understand why this is ok for BE ?

Cheers
/Marcus
James Greenhalgh May 17, 2016, 11:02 a.m. UTC | #2
On Tue, May 17, 2016 at 11:32:36AM +0100, Marcus Shawcroft wrote:
> On 17 May 2016 at 10:06, James Greenhalgh <james.greenhalgh@arm.com> wrote:
> > [...]
>
> Help me understand why this is ok for BE ?

The reduc_plus_scal_<mode> pattern wants to take a vector and return a scalar
value representing the sum of the lanes of that vector. We want to go
from V2DFmode to DFmode.

The architectural instruction FADDP writes a scalar value to the low
bits of the register, leaving zeroes in the upper bits.

i.e.

	faddp  d0, v1.2d

128                 64                    0
 |    0x0            | v1.d[0] + v1.d[1]  |
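
For illustration, this is the same scalar form of FADDP that the arm_neon.h
intrinsic vpaddd_f64 maps to; a minimal sketch of the desired behaviour
(my example, not from the patch):

	#include <arm_neon.h>

	/* Sum the two double-precision lanes of a vector and return the
	   result as a scalar.  With optimisation this should compile to the
	   scalar form "faddp d0, v0.2d", leaving the upper bits zeroed.  */
	float64_t
	sum_lanes (float64x2_t v)
	{
	  return vpaddd_f64 (v);
	}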

In the current implementation, we use the
aarch64_reduc_plus_internal<mode> pattern, which treats the result of
FADDP as a vector of two elements. We then need an extra step to extract
the correct scalar value from that vector. From GCC's point of view, the lane
containing the result is either lane 0 (little-endian) or lane 1
(big-endian), which is why the current code is endian-dependent. The extract
operation will always be a NOP move from architectural bits 0-63 to
architectural bits 0-63 - but we never elide the move as future passes can't
be certain that the upper bits are zero (they come out of an UNSPEC so
could be anything).

However, this is all unnecessary. FADDP does exactly what we want
regardless of endianness; we just need to model the instruction as writing
the scalar value in the first place, which is what this patch wires up.

We probably just missed this optimization in the migration from the
reduc_splus optabs (which required a vector return value) to the
reduc_plus_scal optabs (which require a scalar return value).
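
As a rough illustration of where the pattern fires (an example of mine, not
taken from the patch): with -O2 -ffast-math the vectorizer can use
reduc_plus_scal_v2df for the reduction epilogue of a plain sum loop, and with
this change that epilogue should be the single scalar FADDP, with no trailing
lane extract on either endianness.

	/* Reduction over doubles; -ffast-math allows the additions to be
	   reassociated, so the loop can be vectorised with V2DF and the
	   final "add the two lanes" step goes through reduc_plus_scal_v2df.  */
	double
	sum (const double *x, int n)
	{
	  double s = 0.0;
	  for (int i = 0; i < n; i++)
	    s += x[i];
	  return s;
	}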

Does that help?

Thanks,
James
Marcus Shawcroft May 17, 2016, 1:36 p.m. UTC | #3
On 17 May 2016 at 12:02, James Greenhalgh <james.greenhalgh@arm.com> wrote:
> On Tue, May 17, 2016 at 11:32:36AM +0100, Marcus Shawcroft wrote:
>> Help me understand why this is ok for BE ?
>
> [...]
>
> Does that help?

Yep. Thanks. OK to commit. /Marcus

Patch

diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index bd73bce..30023f0 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -1989,19 +1989,6 @@ 
   }
 )
 
-(define_expand "reduc_plus_scal_<mode>"
-  [(match_operand:<VEL> 0 "register_operand" "=w")
-   (match_operand:V2F 1 "register_operand" "w")]
-  "TARGET_SIMD"
-  {
-    rtx elt = GEN_INT (ENDIAN_LANE_N (<MODE>mode, 0));
-    rtx scratch = gen_reg_rtx (<MODE>mode);
-    emit_insn (gen_aarch64_reduc_plus_internal<mode> (scratch, operands[1]));
-    emit_insn (gen_aarch64_get_lane<mode> (operands[0], scratch, elt));
-    DONE;
-  }
-)
-
 (define_insn "aarch64_reduc_plus_internal<mode>"
  [(set (match_operand:VDQV 0 "register_operand" "=w")
        (unspec:VDQV [(match_operand:VDQV 1 "register_operand" "w")]
@@ -2020,9 +2007,9 @@ 
   [(set_attr "type" "neon_reduc_add")]
 )
 
-(define_insn "aarch64_reduc_plus_internal<mode>"
- [(set (match_operand:V2F 0 "register_operand" "=w")
-       (unspec:V2F [(match_operand:V2F 1 "register_operand" "w")]
+(define_insn "reduc_plus_scal_<mode>"
+ [(set (match_operand:<VEL> 0 "register_operand" "=w")
+       (unspec:<VEL> [(match_operand:V2F 1 "register_operand" "w")]
 		   UNSPEC_FADDV))]
  "TARGET_SIMD"
  "faddp\\t%<Vetype>0, %1.<Vtype>"