Message ID | 87h8r8og4o.fsf@linaro.org
---|---
State | Accepted |
Commit | 9bfb28ed3c6fb702c2cab6798959679e1bbd7d09 |
Series | [SLP/AArch64] Fix unpack handling for big-endian SVE
On Fri, Jan 26, 2018 at 3:21 PM, Richard Sandiford
<richard.sandiford@linaro.org> wrote:
> I hadn't realised that on big-endian targets, VEC_UNPACK*HI_EXPR unpacks
> the low-numbered lanes and VEC_UNPACK*LO_EXPR unpacks the high-numbered
> lanes.  This meant that both the SVE patterns and the handling of
> fully-masked loops were wrong.
>
> The patch deals with that by making sure that all vec_unpack* optabs
> are define_expands, using BYTES_BIG_ENDIAN to choose the appropriate
> define_insn.  This in turn meant that we can get rid of the duplication
> between the signed and unsigned patterns for predicates.  (We provide
> implementations of both the signed and unsigned optabs because the sign
> doesn't matter for predicates: every element contains only one
> significant bit.)
>
> Also, the float unpacks need to unpack one half of the input vector,
> but the unpacked upper bits are "don't care".  There are two obvious
> ways of handling that: use an unpack (filling with zeros) or use a ZIP
> (filling with a duplicate of the low bits).  The code previously used
> unpacks, but the sequence involved a subreg that is semantically an
> element reverse on big-endian targets.  Using the ZIP patterns avoids
> that, and at the moment there's no reason to prefer one over the other
> for performance reasons, so the patch switches to ZIP unconditionally.
> As the comment says, it would be easy to optimise this later if UUNPK
> turns out to be better for some implementations.
>
> Tested on aarch64_be-elf, aarch64-linux-gnu and x86_64-linux-gnu.
> I think the tree-vect-loop-manip.c part is obvious, but OK for the
> AArch64 bits?
>
> Richard
>
>
> 2018-01-26  Richard Sandiford  <richard.sandiford@linaro.org>
>
> gcc/
> 	* tree-vect-loop-manip.c (vect_maybe_permute_loop_masks):
> 	Reverse the choice between VEC_UNPACK_LO_EXPR and VEC_UNPACK_HI_EXPR
> 	for big-endian.
> 	* config/aarch64/iterators.md (hi_lanes_optab): New int attribute.
> 	* config/aarch64/aarch64-sve.md
> 	(*aarch64_sve_<perm_insn><perm_hilo><mode>): Rename to...
> 	(aarch64_sve_<perm_insn><perm_hilo><mode>): ...this.
> 	(*extend<mode><Vwide>2): Rename to...
> 	(aarch64_sve_extend<mode><Vwide>2): ...this.
> 	(vec_unpack<su>_<perm_hilo>_<mode>): Turn into a define_expand,
> 	renaming the old pattern to...
> 	(aarch64_sve_punpk<perm_hilo>_<mode>): ...this.  Only define
> 	unsigned packs.
> 	(vec_unpack<su>_<perm_hilo>_<SVE_BHSI:mode>): Turn into a
> 	define_expand, renaming the old pattern to...
> 	(aarch64_sve_<su>unpk<perm_hilo>_<SVE_BHSI:mode>): ...this.
> 	(*vec_unpacku_<perm_hilo>_<mode>_no_convert): Delete.
> 	(vec_unpacks_<perm_hilo>_<mode>): Take BYTES_BIG_ENDIAN into
> 	account when deciding which SVE instruction the optab should use.
> 	(vec_unpack<su_optab>_float_<perm_hilo>_vnx4si): Likewise.
>
> gcc/testsuite/
> 	* gcc.target/aarch64/sve/unpack_fcvt_signed_1.c: Expect zips rather
> 	than unpacks.
> 	* gcc.target/aarch64/sve/unpack_fcvt_unsigned_1.c: Likewise.
> 	* gcc.target/aarch64/sve/unpack_float_1.c: Likewise.
>
> Index: gcc/tree-vect-loop-manip.c
> ===================================================================
> --- gcc/tree-vect-loop-manip.c	2018-01-13 18:02:00.948360274 +0000
> +++ gcc/tree-vect-loop-manip.c	2018-01-26 14:00:15.717916957 +0000
> @@ -334,7 +334,8 @@ vect_maybe_permute_loop_masks (gimple_se
>      {
>        tree src = src_rgm->masks[i / 2];
>        tree dest = dest_rgm->masks[i];
> -      tree_code code = (i & 1 ? VEC_UNPACK_HI_EXPR
> +      tree_code code = ((i & 1) == (BYTES_BIG_ENDIAN ? 0 : 1)
> +                        ? VEC_UNPACK_HI_EXPR
>                          : VEC_UNPACK_LO_EXPR);

This looks broken anyways -- the element order on GIMPLE is the same
for little and big-endian.  I see two other BYTES_BIG_ENDIAN uses in
tree-vect-*, not sure when they crept in but they are all broken.

Richard.

>        gassign *stmt;
>        if (dest_masktype == unpack_masktype)
>
> [...]
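The lane-numbering question at the heart of this exchange can be made concrete with a toy model. The sketch below is illustrative C only, not GCC code, and the lane count N is an arbitrary assumption. It applies the "high part of the register" convention that the rest of the thread defends: big-endian puts the low-numbered (first-in-memory) lanes in the register's high half, so UNPACK_HI selects those lanes there. This is also why the patch tests (i & 1) == (BYTES_BIG_ENDIAN ? 0 : 1) rather than plain i & 1.

#include <stdio.h>

#define N 8   /* narrow lanes per vector; arbitrary, for illustration only */

/* First memory-order lane selected by VEC_UNPACK_HI_EXPR under the
   "high part of the register" convention: the high register half holds
   lanes N/2..N-1 on little-endian but lanes 0..N/2-1 on big-endian.  */
static int
unpack_hi_first_lane (int bytes_big_endian)
{
  return bytes_big_endian ? 0 : N / 2;
}

int
main (void)
{
  for (int be = 0; be <= 1; be++)
    {
      int first = unpack_hi_first_lane (be);
      printf ("%s-endian: UNPACK_HI takes lanes %d..%d, UNPACK_LO the rest\n",
              be ? "big" : "little", first, first + N / 2 - 1);
    }
  return 0;
}

Running this prints lanes 4..7 for little-endian and lanes 0..3 for big-endian, which matches the cover letter's observation that VEC_UNPACK*HI_EXPR unpacks the low-numbered lanes on big-endian targets.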
Richard Biener <richard.guenther@gmail.com> writes:
> On Fri, Jan 26, 2018 at 3:21 PM, Richard Sandiford
> <richard.sandiford@linaro.org> wrote:
>> I hadn't realised that on big-endian targets, VEC_UNPACK*HI_EXPR unpacks
>> the low-numbered lanes and VEC_UNPACK*LO_EXPR unpacks the high-numbered
>> lanes.  This meant that both the SVE patterns and the handling of
>> fully-masked loops were wrong.
>>
>> [...]
>>
>> Index: gcc/tree-vect-loop-manip.c
>> ===================================================================
>> --- gcc/tree-vect-loop-manip.c	2018-01-13 18:02:00.948360274 +0000
>> +++ gcc/tree-vect-loop-manip.c	2018-01-26 14:00:15.717916957 +0000
>> @@ -334,7 +334,8 @@ vect_maybe_permute_loop_masks (gimple_se
>>      {
>>        tree src = src_rgm->masks[i / 2];
>>        tree dest = dest_rgm->masks[i];
>> -      tree_code code = (i & 1 ? VEC_UNPACK_HI_EXPR
>> +      tree_code code = ((i & 1) == (BYTES_BIG_ENDIAN ? 0 : 1)
>> +                        ? VEC_UNPACK_HI_EXPR
>>                          : VEC_UNPACK_LO_EXPR);
>
> This looks broken anyways -- the element order on GIMPLE is the same
> for little and big-endian.  I see two other BYTES_BIG_ENDIAN uses in
> tree-vect-*, not sure when they crept in but they are all broken.

Are you sure?  Everything apart from this new code seems consistently
to treat UNPACK_HI as "high part of the register" rather than
"high memory address/lane number".  I think changing it now would
break powerpc64 and mips.

Thanks,
Richard
On Fri, Jan 26, 2018 at 4:04 PM, Richard Sandiford
<richard.sandiford@linaro.org> wrote:
> Richard Biener <richard.guenther@gmail.com> writes:
>> On Fri, Jan 26, 2018 at 3:21 PM, Richard Sandiford
>> <richard.sandiford@linaro.org> wrote:
>>> I hadn't realised that on big-endian targets, VEC_UNPACK*HI_EXPR unpacks
>>> the low-numbered lanes and VEC_UNPACK*LO_EXPR unpacks the high-numbered
>>> lanes.  This meant that both the SVE patterns and the handling of
>>> fully-masked loops were wrong.
>>>
>>> [...]
>>
>> This looks broken anyways -- the element order on GIMPLE is the same
>> for little and big-endian.  I see two other BYTES_BIG_ENDIAN uses in
>> tree-vect-*, not sure when they crept in but they are all broken.
>
> Are you sure?  Everything apart from this new code seems consistently
> to treat UNPACK_HI as "high part of the register" rather than
> "high memory address/lane number".  I think changing it now would
> break powerpc64 and mips.

Looks like so.  But we agreed on changing this, looks like this is a
left-over.  (I'm always looking at constant folding for how it's
supposed to work.)  Looks like VEC_WIDEN_LSHIFT_HI/LO_EXPR doesn't
have constant folding.

Richard.

> Thanks,
> Richard
On Fri, Jan 26, 2018 at 02:21:11PM +0000, Richard Sandiford wrote:
> I hadn't realised that on big-endian targets, VEC_UNPACK*HI_EXPR unpacks
> the low-numbered lanes and VEC_UNPACK*LO_EXPR unpacks the high-numbered
> lanes.  This meant that both the SVE patterns and the handling of
> fully-masked loops were wrong.
>
> [...]
>
> Tested on aarch64_be-elf, aarch64-linux-gnu and x86_64-linux-gnu.
> I think the tree-vect-loop-manip.c part is obvious, but OK for the
> AArch64 bits?

The AArch64 bits are OK by me, if this direction is appropriate for the
tree-vect-loop-manip.c parts.

Thanks,
James

> [...]
Index: gcc/tree-vect-loop-manip.c
===================================================================
--- gcc/tree-vect-loop-manip.c	2018-01-13 18:02:00.948360274 +0000
+++ gcc/tree-vect-loop-manip.c	2018-01-26 14:00:15.717916957 +0000
@@ -334,7 +334,8 @@ vect_maybe_permute_loop_masks (gimple_se
     {
       tree src = src_rgm->masks[i / 2];
       tree dest = dest_rgm->masks[i];
-      tree_code code = (i & 1 ? VEC_UNPACK_HI_EXPR
+      tree_code code = ((i & 1) == (BYTES_BIG_ENDIAN ? 0 : 1)
+                        ? VEC_UNPACK_HI_EXPR
                         : VEC_UNPACK_LO_EXPR);
       gassign *stmt;
       if (dest_masktype == unpack_masktype)
Index: gcc/config/aarch64/iterators.md
===================================================================
--- gcc/config/aarch64/iterators.md	2018-01-13 18:01:26.106685901 +0000
+++ gcc/config/aarch64/iterators.md	2018-01-26 14:00:15.716916999 +0000
@@ -1674,6 +1674,15 @@ (define_int_attr perm_hilo [(UNSPEC_ZIP1
                             (UNSPEC_UNPACKSHI "hi") (UNSPEC_UNPACKUHI "hi")
                             (UNSPEC_UNPACKSLO "lo") (UNSPEC_UNPACKULO "lo")])
 
+;; Return true if the associated optab refers to the high-numbered lanes,
+;; false if it refers to the low-numbered lanes.  The convention is for
+;; "hi" to refer to the low-numbered lanes (the first ones in memory)
+;; for big-endian.
+(define_int_attr hi_lanes_optab [(UNSPEC_UNPACKSHI "!BYTES_BIG_ENDIAN")
+                                 (UNSPEC_UNPACKUHI "!BYTES_BIG_ENDIAN")
+                                 (UNSPEC_UNPACKSLO "BYTES_BIG_ENDIAN")
+                                 (UNSPEC_UNPACKULO "BYTES_BIG_ENDIAN")])
+
 (define_int_attr frecp_suffix  [(UNSPEC_FRECPE "e") (UNSPEC_FRECPX "x")])
 
 (define_int_attr crc_variant [(UNSPEC_CRC32B "crc32b") (UNSPEC_CRC32H "crc32h")
Index: gcc/config/aarch64/aarch64-sve.md
===================================================================
--- gcc/config/aarch64/aarch64-sve.md	2018-01-26 13:55:12.277219211 +0000
+++ gcc/config/aarch64/aarch64-sve.md	2018-01-26 14:00:15.716916999 +0000
@@ -817,7 +817,7 @@ (define_insn "*aarch64_sve_<perm_insn><p
   "<perm_insn><perm_hilo>\t%0.<Vetype>, %1.<Vetype>, %2.<Vetype>"
 )
 
-(define_insn "*aarch64_sve_<perm_insn><perm_hilo><mode>"
+(define_insn "aarch64_sve_<perm_insn><perm_hilo><mode>"
   [(set (match_operand:SVE_ALL 0 "register_operand" "=w")
         (unspec:SVE_ALL [(match_operand:SVE_ALL 1 "register_operand" "w")
                          (match_operand:SVE_ALL 2 "register_operand" "w")]
@@ -2156,7 +2156,7 @@ (define_insn "*<optab><mode>vnx4sf2"
 )
 
 ;; Conversion of DI or SI to DF, predicated with a PTRUE.
-(define_insn "*<optab><mode>vnx2df2"
+(define_insn "aarch64_sve_<optab><mode>vnx2df2"
   [(set (match_operand:VNx2DF 0 "register_operand" "=w")
         (unspec:VNx2DF
           [(match_operand:VNx2BI 1 "register_operand" "Upl")
@@ -2183,7 +2183,7 @@ (define_insn "*trunc<Vwide><mode>2"
 
 ;; Conversion of SFs to the same number of DFs, or HFs to the same number
 ;; of SFs.
-(define_insn "*extend<mode><Vwide>2"
+(define_insn "aarch64_sve_extend<mode><Vwide>2"
   [(set (match_operand:<VWIDE> 0 "register_operand" "=w")
         (unspec:<VWIDE>
           [(match_operand:<VWIDE_PRED> 1 "register_operand" "Upl")
@@ -2195,17 +2195,50 @@ (define_insn "*extend<mode><Vwide>2"
   "fcvt\t%0.<Vewtype>, %1/m, %2.<Vetype>"
 )
 
+;; Unpack the low or high half of a predicate, where "high" refers to
+;; the low-numbered lanes for big-endian and the high-numbered lanes
+;; for little-endian.
+(define_expand "vec_unpack<su>_<perm_hilo>_<mode>"
+  [(match_operand:<VWIDE> 0 "register_operand")
+   (unspec:<VWIDE> [(match_operand:PRED_BHS 1 "register_operand")]
+                   UNPACK)]
+  "TARGET_SVE"
+  {
+    emit_insn ((<hi_lanes_optab>
+                ? gen_aarch64_sve_punpkhi_<PRED_BHS:mode>
+                : gen_aarch64_sve_punpklo_<PRED_BHS:mode>)
+               (operands[0], operands[1]));
+    DONE;
+  }
+)
+
 ;; PUNPKHI and PUNPKLO.
-(define_insn "vec_unpack<su>_<perm_hilo>_<mode>"
+(define_insn "aarch64_sve_punpk<perm_hilo>_<mode>"
   [(set (match_operand:<VWIDE> 0 "register_operand" "=Upa")
         (unspec:<VWIDE> [(match_operand:PRED_BHS 1 "register_operand" "Upa")]
-                        UNPACK))]
+                        UNPACK_UNSIGNED))]
   "TARGET_SVE"
   "punpk<perm_hilo>\t%0.h, %1.b"
 )
 
+;; Unpack the low or high half of a vector, where "high" refers to
+;; the low-numbered lanes for big-endian and the high-numbered lanes
+;; for little-endian.
+(define_expand "vec_unpack<su>_<perm_hilo>_<SVE_BHSI:mode>"
+  [(match_operand:<VWIDE> 0 "register_operand")
+   (unspec:<VWIDE> [(match_operand:SVE_BHSI 1 "register_operand")] UNPACK)]
+  "TARGET_SVE"
+  {
+    emit_insn ((<hi_lanes_optab>
+                ? gen_aarch64_sve_<su>unpkhi_<SVE_BHSI:mode>
+                : gen_aarch64_sve_<su>unpklo_<SVE_BHSI:mode>)
+               (operands[0], operands[1]));
+    DONE;
+  }
+)
+
 ;; SUNPKHI, UUNPKHI, SUNPKLO and UUNPKLO.
-(define_insn "vec_unpack<su>_<perm_hilo>_<SVE_BHSI:mode>"
+(define_insn "aarch64_sve_<su>unpk<perm_hilo>_<SVE_BHSI:mode>"
   [(set (match_operand:<VWIDE> 0 "register_operand" "=w")
         (unspec:<VWIDE> [(match_operand:SVE_BHSI 1 "register_operand" "w")]
                         UNPACK))]
@@ -2213,32 +2246,28 @@ (define_insn "vec_unpack<su>_<perm_hilo>
   "<su>unpk<perm_hilo>\t%0.<Vewtype>, %1.<Vetype>"
 )
 
-;; Used by the vec_unpacks_<perm_hilo>_<mode> expander to unpack the bit
-;; representation of a VNx4SF or VNx8HF without conversion.  The choice
-;; between signed and unsigned isn't significant.
-(define_insn "*vec_unpacku_<perm_hilo>_<mode>_no_convert"
-  [(set (match_operand:SVE_HSF 0 "register_operand" "=w")
-        (unspec:SVE_HSF [(match_operand:SVE_HSF 1 "register_operand" "w")]
-                        UNPACK_UNSIGNED))]
-  "TARGET_SVE"
-  "uunpk<perm_hilo>\t%0.<Vewtype>, %1.<Vetype>"
-)
-
 ;; Unpack one half of a VNx4SF to VNx2DF, or one half of a VNx8HF to VNx4SF.
 ;; First unpack the source without conversion, then float-convert the
 ;; unpacked source.
 (define_expand "vec_unpacks_<perm_hilo>_<mode>"
-  [(set (match_dup 2)
-        (unspec:SVE_HSF [(match_operand:SVE_HSF 1 "register_operand")]
-                        UNPACK_UNSIGNED))
-   (set (match_operand:<VWIDE> 0 "register_operand")
-        (unspec:<VWIDE> [(match_dup 3)
-                         (unspec:<VWIDE> [(match_dup 2)] UNSPEC_FLOAT_CONVERT)]
-                        UNSPEC_MERGE_PTRUE))]
-  "TARGET_SVE"
-  {
-    operands[2] = gen_reg_rtx (<MODE>mode);
-    operands[3] = force_reg (<VWIDE_PRED>mode, CONSTM1_RTX (<VWIDE_PRED>mode));
+  [(match_operand:<VWIDE> 0 "register_operand")
+   (unspec:SVE_HSF [(match_operand:SVE_HSF 1 "register_operand")]
+                   UNPACK_UNSIGNED)]
+  "TARGET_SVE"
+  {
+    /* Use ZIP to do the unpack, since we don't care about the upper halves
+       and since it has the nice property of not needing any subregs.
+       If using UUNPK* turns out to be preferable, we could model it as
+       a ZIP whose first operand is zero.  */
+    rtx temp = gen_reg_rtx (<MODE>mode);
+    emit_insn ((<hi_lanes_optab>
+                ? gen_aarch64_sve_zip2<mode>
+                : gen_aarch64_sve_zip1<mode>)
+               (temp, operands[1], operands[1]));
+    rtx ptrue = force_reg (<VWIDE_PRED>mode, CONSTM1_RTX (<VWIDE_PRED>mode));
+    emit_insn (gen_aarch64_sve_extend<mode><Vwide>2 (operands[0],
+                                                     ptrue, temp));
+    DONE;
   }
 )
 
@@ -2246,18 +2275,25 @@ (define_expand "vec_unpacks_<perm_hilo>_
 ;; to VNx2DI, reinterpret the VNx2DI as a VNx4SI, then convert the
 ;; unpacked VNx4SI to VNx2DF.
 (define_expand "vec_unpack<su_optab>_float_<perm_hilo>_vnx4si"
-  [(set (match_dup 2)
-        (unspec:VNx2DI [(match_operand:VNx4SI 1 "register_operand")]
-                       UNPACK_UNSIGNED))
-   (set (match_operand:VNx2DF 0 "register_operand")
-        (unspec:VNx2DF [(match_dup 3)
-                        (FLOATUORS:VNx2DF (match_dup 4))]
-                       UNSPEC_MERGE_PTRUE))]
-  "TARGET_SVE"
-  {
-    operands[2] = gen_reg_rtx (VNx2DImode);
-    operands[3] = force_reg (VNx2BImode, CONSTM1_RTX (VNx2BImode));
-    operands[4] = gen_rtx_SUBREG (VNx4SImode, operands[2], 0);
+  [(match_operand:VNx2DF 0 "register_operand")
+   (FLOATUORS:VNx2DF
+     (unspec:VNx2DI [(match_operand:VNx4SI 1 "register_operand")]
+                    UNPACK_UNSIGNED))]
+  "TARGET_SVE"
+  {
+    /* Use ZIP to do the unpack, since we don't care about the upper halves
+       and since it has the nice property of not needing any subregs.
+       If using UUNPK* turns out to be preferable, we could model it as
+       a ZIP whose first operand is zero.  */
+    rtx temp = gen_reg_rtx (VNx4SImode);
+    emit_insn ((<hi_lanes_optab>
+                ? gen_aarch64_sve_zip2vnx4si
+                : gen_aarch64_sve_zip1vnx4si)
+               (temp, operands[1], operands[1]));
+    rtx ptrue = force_reg (VNx2BImode, CONSTM1_RTX (VNx2BImode));
+    emit_insn (gen_aarch64_sve_<FLOATUORS:optab>vnx4sivnx2df2 (operands[0],
+                                                               ptrue, temp));
+    DONE;
   }
 )
 
Index: gcc/testsuite/gcc.target/aarch64/sve/unpack_fcvt_signed_1.c
===================================================================
--- gcc/testsuite/gcc.target/aarch64/sve/unpack_fcvt_signed_1.c	2018-01-13 17:54:52.934269783 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve/unpack_fcvt_signed_1.c	2018-01-26 14:00:15.716916999 +0000
@@ -10,6 +10,6 @@ unpack_double_int_plus8 (double *d, int3
     d[i] = s[i] + 8;
 }
 
-/* { dg-final { scan-assembler-times {\tuunpklo\tz[0-9]+\.d, z[0-9]+\.s\n} 1 } } */
-/* { dg-final { scan-assembler-times {\tuunpkhi\tz[0-9]+\.d, z[0-9]+\.s\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tzip1\tz[0-9]+\.s, z[0-9]+\.s, z[0-9]+\.s\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tzip2\tz[0-9]+\.s, z[0-9]+\.s, z[0-9]+\.s\n} 1 } } */
 /* { dg-final { scan-assembler-times {\tscvtf\tz[0-9]+\.d, p[0-7]/m, z[0-9]+\.s\n} 2 } } */
Index: gcc/testsuite/gcc.target/aarch64/sve/unpack_fcvt_unsigned_1.c
===================================================================
--- gcc/testsuite/gcc.target/aarch64/sve/unpack_fcvt_unsigned_1.c	2018-01-13 17:54:52.934269783 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve/unpack_fcvt_unsigned_1.c	2018-01-26 14:00:15.717916957 +0000
@@ -10,6 +10,6 @@ unpack_double_int_plus9 (double *d, uint
     d[i] = (double) (s[i] + 9);
 }
 
-/* { dg-final { scan-assembler-times {\tuunpklo\tz[0-9]+\.d, z[0-9]+\.s\n} 1 } } */
-/* { dg-final { scan-assembler-times {\tuunpkhi\tz[0-9]+\.d, z[0-9]+\.s\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tzip1\tz[0-9]+\.s, z[0-9]+\.s, z[0-9]+\.s\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tzip2\tz[0-9]+\.s, z[0-9]+\.s, z[0-9]+\.s\n} 1 } } */
 /* { dg-final { scan-assembler-times {\tucvtf\tz[0-9]+\.d, p[0-7]/m, z[0-9]+\.s\n} 2 } } */
Index: gcc/testsuite/gcc.target/aarch64/sve/unpack_float_1.c
===================================================================
--- gcc/testsuite/gcc.target/aarch64/sve/unpack_float_1.c	2018-01-13 17:54:52.935269745 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve/unpack_float_1.c	2018-01-26 14:00:15.717916957 +0000
@@ -8,6 +8,6 @@ unpack_float_plus_7point9 (double *d, fl
     d[i] = s[i] + 7.9;
 }
 
-/* { dg-final { scan-assembler-times {\tuunpklo\tz[0-9]+\.d, z[0-9]+\.s\n} 1 } } */
-/* { dg-final { scan-assembler-times {\tuunpkhi\tz[0-9]+\.d, z[0-9]+\.s\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tzip1\tz[0-9]+\.s, z[0-9]+\.s, z[0-9]+\.s\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tzip2\tz[0-9]+\.s, z[0-9]+\.s, z[0-9]+\.s\n} 1 } } */
 /* { dg-final { scan-assembler-times {\tfcvt\tz[0-9]+\.d, p[0-7]/m, z[0-9]+\.s\n} 2 } } */
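As a footnote to the ZIP-versus-UUNPK choice in the expanders above, here is a small self-contained C model of the trick. It is illustrative only, not GCC or SVE code, and the lane count is fixed at 4 as an assumption: zip1/zip2 with the same register for both inputs duplicates each selected narrow element into both halves of the corresponding wide lane, so the "don't care" upper bits are filled harmlessly and no subreg (an element reverse on big-endian) is needed.

#include <stdint.h>
#include <stdio.h>

#define N 4   /* 32-bit lanes per vector; fixed here for illustration */

/* Toy models of SVE ZIP1/ZIP2: interleave the low (zip1) or high (zip2)
   halves of the two inputs.  */
static void
zip1 (uint32_t *dst, const uint32_t *a, const uint32_t *b)
{
  for (int i = 0; i < N / 2; i++)
    {
      dst[2 * i] = a[i];
      dst[2 * i + 1] = b[i];
    }
}

static void
zip2 (uint32_t *dst, const uint32_t *a, const uint32_t *b)
{
  for (int i = 0; i < N / 2; i++)
    {
      dst[2 * i] = a[N / 2 + i];
      dst[2 * i + 1] = b[N / 2 + i];
    }
}

int
main (void)
{
  uint32_t x[N] = { 10, 11, 12, 13 }, lo[N], hi[N];

  /* zip1 t, x, x and zip2 t, x, x duplicate each selected element into
     both halves of the corresponding wide (.d) lane.  */
  zip1 (lo, x, x);   /* { 10, 10, 11, 11 } */
  zip2 (hi, x, x);   /* { 12, 12, 13, 13 } */

  for (int i = 0; i < N; i++)
    printf ("lo[%d]=%u hi[%d]=%u\n", i, lo[i], i, hi[i]);

  /* The predicated fcvt then converts the significant half of each wide
     lane; since both halves hold the same element, it reads x[i] no
     matter which half that is, so filling the unused half with a
     duplicate instead of zeros costs nothing.  */
  return 0;
}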