Message ID | 87shdnwpnr.fsf@linaro.org |
---|---|
State | New |
Series | Add optabs for common types of permutation |
On 11/09/2017 06:24 AM, Richard Sandiford wrote:
> ...so that we can use them for variable-length vectors.  For now
> constant-length vectors continue to use VEC_PERM_EXPR and the
> vec_perm_const optab even for cases that the new optabs could
> handle.
>
> The vector optabs are inconsistent about whether there should be
> an underscore before the mode part of the name, but the other lo/hi
> optabs have one.
>
> Doing this means that we're able to optimise some SLP tests using
> non-SLP (for now) on targets with variable-length vectors, so the
> patch needs to add a few XFAILs.  Most of these go away with later
> patches.
>
> Tested on aarch64-linux-gnu (with and without SVE), x86_64-linux-gnu
> and powerpc64le-linux-gnu.  OK to install?
>
> Richard
>
>
> 2017-11-09  Richard Sandiford  <richard.sandiford@linaro.org>
>	      Alan Hayward  <alan.hayward@arm.com>
>	      David Sherwood  <david.sherwood@arm.com>
>
> gcc/
>	* doc/md.texi (vec_reverse, vec_interleave_lo, vec_interleave_hi)
>	(vec_extract_even, vec_extract_odd): Document new optabs.
>	* internal-fn.def (VEC_INTERLEAVE_LO, VEC_INTERLEAVE_HI)
>	(VEC_EXTRACT_EVEN, VEC_EXTRACT_ODD, VEC_REVERSE): New internal
>	functions.
>	* optabs.def (vec_interleave_lo_optab, vec_interleave_hi_optab)
>	(vec_extract_even_optab, vec_extract_odd_optab, vec_reverse_optab):
>	New optabs.
>	* tree-vect-data-refs.c: Include internal-fn.h.
>	(vect_grouped_store_supported): Try using IFN_VEC_INTERLEAVE_{LO,HI}.
>	(vect_permute_store_chain): Use them here too.
>	(vect_grouped_load_supported): Try using IFN_VEC_EXTRACT_{EVEN,ODD}.
>	(vect_permute_load_chain): Use them here too.
>	* tree-vect-stmts.c (can_reverse_vector_p): New function.
>	(get_negative_load_store_type): Use it.
>	(reverse_vector): New function.
>	(vectorizable_store, vectorizable_load): Use it.
>	* config/aarch64/iterators.md (perm_optab): New iterator.
>	* config/aarch64/aarch64-sve.md (<perm_optab>_<mode>): New expander.
>	(vec_reverse_<mode>): Likewise.
>
> gcc/testsuite/
>	* gcc.dg/vect/no-vfa-vect-depend-2.c: Remove XFAIL.
>	* gcc.dg/vect/no-vfa-vect-depend-3.c: Likewise.
>	* gcc.dg/vect/pr33953.c: XFAIL for vect_variable_length.
>	* gcc.dg/vect/pr68445.c: Likewise.
>	* gcc.dg/vect/slp-12a.c: Likewise.
>	* gcc.dg/vect/slp-13-big-array.c: Likewise.
>	* gcc.dg/vect/slp-13.c: Likewise.
>	* gcc.dg/vect/slp-14.c: Likewise.
>	* gcc.dg/vect/slp-15.c: Likewise.
>	* gcc.dg/vect/slp-42.c: Likewise.
>	* gcc.dg/vect/slp-multitypes-2.c: Likewise.
>	* gcc.dg/vect/slp-multitypes-4.c: Likewise.
>	* gcc.dg/vect/slp-multitypes-5.c: Likewise.
>	* gcc.dg/vect/slp-reduc-4.c: Likewise.
>	* gcc.dg/vect/slp-reduc-7.c: Likewise.
>	* gcc.target/aarch64/sve_vec_perm_2.c: New test.
>	* gcc.target/aarch64/sve_vec_perm_2_run.c: Likewise.
>	* gcc.target/aarch64/sve_vec_perm_3.c: New test.
>	* gcc.target/aarch64/sve_vec_perm_3_run.c: Likewise.
>	* gcc.target/aarch64/sve_vec_perm_4.c: New test.
>	* gcc.target/aarch64/sve_vec_perm_4_run.c: Likewise.

OK.
jeff
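[For reference, a minimal fixed-length sketch (not part of the patch) of what
the five permutations being given optabs compute, written with GNU C vector
extensions and __builtin_shuffle.  The selector values assume the usual
little-endian element numbering; the variable-length SVE forms do the same
thing for a runtime element count.]

typedef int v4si __attribute__ ((vector_size (16)));

v4si
permute_examples (v4si a, v4si b)
{
  /* a = { a0, a1, a2, a3 }, b = { b0, b1, b2, b3 }.  */
  v4si interleave_lo = __builtin_shuffle (a, b, (v4si) { 0, 4, 1, 5 }); /* { a0, b0, a1, b1 } */
  v4si interleave_hi = __builtin_shuffle (a, b, (v4si) { 2, 6, 3, 7 }); /* { a2, b2, a3, b3 } */
  v4si extract_even  = __builtin_shuffle (a, b, (v4si) { 0, 2, 4, 6 }); /* { a0, a2, b0, b2 } */
  v4si extract_odd   = __builtin_shuffle (a, b, (v4si) { 1, 3, 5, 7 }); /* { a1, a3, b1, b3 } */
  v4si reverse       = __builtin_shuffle (a,    (v4si) { 3, 2, 1, 0 }); /* { a3, a2, a1, a0 } */
  return interleave_lo + interleave_hi + extract_even + extract_odd + reverse;
}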
On Mon, Nov 20, 2017 at 12:56 AM, Jeff Law <law@redhat.com> wrote:
> On 11/09/2017 06:24 AM, Richard Sandiford wrote:
>> [... patch description and ChangeLog quoted in full; snipped ...]
> OK.

It's really a step backwards - we had those optabs and a tree code in
the past and canonicalizing things to VEC_PERM_EXPR made things simpler.

Why doesn't VEC_PERM <v1, v2, that-constant-series-expr-thing> not work? :/

Richard.

> jeff
Richard Biener <richard.guenther@gmail.com> writes:
> On Mon, Nov 20, 2017 at 12:56 AM, Jeff Law <law@redhat.com> wrote:
>> On 11/09/2017 06:24 AM, Richard Sandiford wrote:
>>> [... patch description and ChangeLog quoted in full; snipped ...]
>> OK.
>
> It's really a step backwards - we had those optabs and a tree code in
> the past and
> canonicalizing things to VEC_PERM_EXPR made things simpler.
>
> Why doesn't VEC_PERM <v1, v2, that-constant-series-expr-thing> not work?
The problems with that are:

- It doesn't work for vectors with 256-bit elements because the indices
  wrap round.

- Supporting a fake VEC_PERM_EXPR <v256qi, v256qi, v256hi> for a few
  special cases would be hard, especially since v256hi isn't a normal
  vector mode.  I imagine everything dealing with VEC_PERM_EXPR would
  then have to worry about that special case.

- VEC_SERIES_CST only copes naturally with EXTRACT_EVEN, EXTRACT_ODD
  and REVERSE.  INTERLEAVE_LO is { 0, N/2, 1, N/2+1, ... }.
  I guess it's possible to represent that using a combination of
  shifts, masks, and additions, but then:

  1) when generating them, we'd need to make sure that we cost the
     operation as a single permute, rather than costing all the shifts,
     masks and additions

  2) we'd need to make sure that all gimple optimisations that run
     afterwards don't perturb the sequence, otherwise we'll end up
     with something that's very expensive.

  3) that sequence wouldn't be handled by existing VEC_PERM_EXPR
     optimisations, and it wouldn't be trivial to add it, since we'd
     need to re-recognise the sequence first.

  4) expand would need to re-recognise the sequence and use the
     optab anyway.

Using an internal function seems much simpler :-)

I think VEC_PERM_EXPR is useful because it represents the same
operation as __builtin_shuffle, and we want to optimise that as best
we can.  But these internal functions are only used by the vectoriser,
which should always see what the final form of the permute should be.

Thanks,
Richard
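[To make the INTERLEAVE_LO point concrete, a small illustration (not from the
thread) of the per-element arithmetic that a shift/mask/add encoding of the
selector amounts to; a single linear series base + i * step cannot produce it.]

#include <stdio.h>

/* Element i of the INTERLEAVE_LO selector { 0, N/2, 1, N/2+1, ... }:
   it needs a shift, a mask and an addition, not just base + i * step.  */
static unsigned
interleave_lo_index (unsigned i, unsigned n)
{
  return (i >> 1) + (i & 1) * (n / 2);
}

int
main (void)
{
  unsigned n = 8;
  for (unsigned i = 0; i < n; i++)
    printf ("%u ", interleave_lo_index (i, n));
  printf ("\n");   /* Prints: 0 4 1 5 2 6 3 7.  */
  return 0;
}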
On Mon, Nov 20, 2017 at 1:35 PM, Richard Sandiford
<richard.sandiford@linaro.org> wrote:
> Richard Biener <richard.guenther@gmail.com> writes:
>> On Mon, Nov 20, 2017 at 12:56 AM, Jeff Law <law@redhat.com> wrote:
>>> On 11/09/2017 06:24 AM, Richard Sandiford wrote:
>>>> [... patch description and ChangeLog quoted in full; snipped ...]
>>> OK.
>>
>> It's really a step backwards - we had those optabs and a tree code in
>> the past and
>> canonicalizing things to VEC_PERM_EXPR made things simpler.
>>
>> Why doesn't VEC_PERM <v1, v2, that-constant-series-expr-thing> not work?
>
> The problems with that are:
>
> - It doesn't work for vectors with 256-bit elements because the indices
>   wrap round.

That's a general issue that would need to be addressed for larger
vectors (GCN?).  I presume the requirement that the permutation vector
have the same size needs to be relaxed.

> - Supporting a fake VEC_PERM_EXPR <v256qi, v256qi, v256hi> for a few
>   special cases would be hard, especially since v256hi isn't a normal
>   vector mode.  I imagine everything dealing with VEC_PERM_EXPR would
>   then have to worry about that special case.

I think it's not really a special case - any code here should just
expect the same number of vector elements and not a particular size.
You already dealt with using a char[] vector for permutations I think.

> - VEC_SERIES_CST only copes naturally with EXTRACT_EVEN, EXTRACT_ODD
>   and REVERSE.  INTERLEAVE_LO is { 0, N/2, 1, N/2+1, ... }.
>   I guess it's possible to represent that using a combination of
>   shifts, masks, and additions, but then:
>
>   1) when generating them, we'd need to make sure that we cost the
>      operation as a single permute, rather than costing all the shifts,
>      masks and additions
>
>   2) we'd need to make sure that all gimple optimisations that run
>      afterwards don't perturb the sequence, otherwise we'll end up
>      with something that's very expensive.
>
>   3) that sequence wouldn't be handled by existing VEC_PERM_EXPR
>      optimisations, and it wouldn't be trivial to add it, since we'd
>      need to re-recognise the sequence first.
>
>   4) expand would need to re-recognise the sequence and use the
>      optab anyway.

Well, the answer is of course that you just need a more powerful
VEC_SERIES_CST that can handle INTERLEAVE_HI/LO.  It seems to me SVE
can generate such masks relatively cheaply -- do a 0, 1, 2, 3...
sequence and then do a INTERLEAVE_HI/LO on it.  So it makes sense
that we can directly specify it.

Suggested fix: add a interleaved bit to VEC_SERIES_CST.

At least I'd like to see it used for the cases it can already handle.

VEC_PERM_EXPR is supposed to be the only permutation operation, if it
cannot handle some cases it needs to be fixed / its constraints relaxed
(like the v256qi case).

> Using an internal function seems much simpler :-)
>
> I think VEC_PERM_EXPR is useful because it represents the same
> operation as __builtin_shuffle, and we want to optimise that as best
> we can.  But these internal functions are only used by the vectoriser,
> which should always see what the final form of the permute should be.

You hope so.  We have several cases where later unrolling and CSE/forwprop
optimize permutations away.

Richard.

> Thanks,
> Richard
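[As a side note, the two-step construction suggested above is easy to picture
in scalar terms; this sketch is illustrative only and is not SVE code: zipping
the two halves of the linear series { 0, 1, ..., N-1 } recovers exactly the
INTERLEAVE_LO selector.]

#include <stdio.h>

#define N 8

int
main (void)
{
  unsigned series[N], selector[N];

  /* Step 1: the plain linear series 0, 1, 2, 3, ...  */
  for (unsigned i = 0; i < N; i++)
    series[i] = i;

  /* Step 2: interleave the low and high halves of that series.  */
  for (unsigned i = 0; i < N / 2; i++)
    {
      selector[2 * i] = series[i];
      selector[2 * i + 1] = series[N / 2 + i];
    }

  for (unsigned i = 0; i < N; i++)
    printf ("%u ", selector[i]);
  printf ("\n");   /* Prints: 0 4 1 5 2 6 3 7.  */
  return 0;
}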
Richard Biener <richard.guenther@gmail.com> writes:
> On Mon, Nov 20, 2017 at 1:35 PM, Richard Sandiford
> <richard.sandiford@linaro.org> wrote:
>> Richard Biener <richard.guenther@gmail.com> writes:
>>> On Mon, Nov 20, 2017 at 12:56 AM, Jeff Law <law@redhat.com> wrote:
>>>> On 11/09/2017 06:24 AM, Richard Sandiford wrote:
>>>>> [... patch description and ChangeLog quoted in full; snipped ...]
>>>> OK.
>>>
>>> It's really a step backwards - we had those optabs and a tree code in
>>> the past and
>>> canonicalizing things to VEC_PERM_EXPR made things simpler.
>>>
>>> Why doesn't VEC_PERM <v1, v2, that-constant-series-expr-thing> not work?
>>
>> The problems with that are:
>>
>> - It doesn't work for vectors with 256-bit elements because the indices
>>   wrap round.
>
> That's a general issue that would need to be addressed for larger
> vectors (GCN?).
> I presume the requirement that the permutation vector have the same size
> needs to be relaxed.
>
>> - Supporting a fake VEC_PERM_EXPR <v256qi, v256qi, v256hi> for a few
>>   special cases would be hard, especially since v256hi isn't a normal
>>   vector mode.  I imagine everything dealing with VEC_PERM_EXPR would
>>   then have to worry about that special case.
>
> I think it's not really a special case - any code here should just
> expect the same
> number of vector elements and not a particular size.  You already dealt with
> using a char[] vector for permutations I think.

It sounds like you're talking about the case in which the permutation
vector is a VECTOR_CST.  We still use VEC_PERM_EXPRs for constant-length
vectors, so that doesn't change.  (And yes, that probably means that it
does break for *fixed-length* 2048-bit vectors.)

But this patch is about the variable-length case, in which the
permutation vector is never a VECTOR_CST, and couldn't get converted
to a vec_perm_indices array.  As far as existing code is concerned,
it's no different from a VEC_PERM_EXPR with a variable permutation
vector.

So by taking this approach, we'd effectively be committing to supporting
VEC_PERM_EXPRs with variable permutation vectors that are wider than
the vectors being permuted.  Those permutation vectors will usually not
have a vector_mode_supported_p mode and will have to be synthesised
somehow.  Trying to support the general case like this could be incredibly
expensive.  Only certain special cases like interleave hi/lo could be
handled cheaply.

>> - VEC_SERIES_CST only copes naturally with EXTRACT_EVEN, EXTRACT_ODD
>>   and REVERSE.  INTERLEAVE_LO is { 0, N/2, 1, N/2+1, ... }.
>>   I guess it's possible to represent that using a combination of
>>   shifts, masks, and additions, but then:
>>
>>   1) when generating them, we'd need to make sure that we cost the
>>      operation as a single permute, rather than costing all the shifts,
>>      masks and additions
>>
>>   2) we'd need to make sure that all gimple optimisations that run
>>      afterwards don't perturb the sequence, otherwise we'll end up
>>      with something that's very expensive.
>>
>>   3) that sequence wouldn't be handled by existing VEC_PERM_EXPR
>>      optimisations, and it wouldn't be trivial to add it, since we'd
>>      need to re-recognise the sequence first.
>>
>>   4) expand would need to re-recognise the sequence and use the
>>      optab anyway.
>
> Well, the answer is of course that you just need a more powerful VEC_SERIES_CST
> that can handle INTERLEAVE_HI/LO.  It seems to me SVE can generate
> such masks relatively cheaply -- do a 0, 1, 2, 3... sequence and then do
> a INTERLEAVE_HI/LO on it.  So it makes sense that we can directly specify it.

It can do lots of other things too :-)  But in all cases as separate
statements.  It seems better to expose them as separate statements in
gimple so that they get optimised by the more powerful gimple optimisers,
rather than waiting until rtl.

I think if we go down the route of building more and more operations into
the constant, we'll end up inventing a gimple version of rtx CONST.
I also don't see why it's OK to expose the concept of interleave hi/lo
as an operation on constants but not as an operation on general vectors.

> Suggested fix: add a interleaved bit to VEC_SERIES_CST.

That only handles this one case though.  We'd have to keep making it
more and more complicated as more cases come up.  E.g. the extra bit
couldn't represent { 0, 1, 2, ..., n/2-1, n, n+1, ... }.

> At least I'd like to see it used for the cases it can already handle.
>
> VEC_PERM_EXPR is supposed to be the only permutation operation, if it cannot
> handle some cases it needs to be fixed / its constraints relaxed (like
> the v256qi case).

I don't think we gain anything by shoehorning everything into one code
for the variable-length case though.  None of the existing vec_perm_const
code can (or should) be used, and none of the existing VECTOR_CST-based
VEC_PERM_EXPR handling will do anything.  That accounts for the majority
of the VEC_PERM_EXPR support.  Also, like with VEC_DUPLICATE_CST vs.
VECTOR_CST, there won't be any vector types that use a mixture of
current VEC_PERM_EXPRs and new VEC_PERM_EXPRs.

So if we used VEC_PERM_EXPRs with VEC_SERIES_CSTs instead of internal
permute functions, we wouldn't get any new optimisations for free: we'd
have to write new code to match the new constants.  So it becomes a
question of whether it's easier to do that on VEC_SERIES_CSTs or
internal functions.  I think match.pd makes it much easier to optimise
internal functions, and you get the added benefit that the result is
automatically checked against what the target supports.  And the
direct mapping to optabs means that no re-recognition is necessary.

It also seems inconsistent to allow things like TARGET_MEM_REF vs.
MEM_REF but still require a single gimple permute operation, regardless
of circumstances.  I thought we were trying to move in the other direction,
i.e. trying to get the power of the gimple optimisers for things that
were previously handled by rtl.

>> Using an internal function seems much simpler :-)
>>
>> I think VEC_PERM_EXPR is useful because it represents the same
>> operation as __builtin_shuffle, and we want to optimise that as best
>> we can.  But these internal functions are only used by the vectoriser,
>> which should always see what the final form of the permute should be.
>
> You hope so.  We have several cases where later unrolling and CSE/forwprop
> optimize permutations away.

Unrolling doesn't usually expose anything useful for variable-length
vectors though, since the iv step is also variable.  I guess it could
still happen, but TBH I'd rather take the hit of that than the risk
that optimisers could create expensive non-native permutes.

Thanks,
Richard
On Tue, Nov 21, 2017 at 11:47 PM, Richard Sandiford
<richard.sandiford@linaro.org> wrote:
> Richard Biener <richard.guenther@gmail.com> writes:
>> On Mon, Nov 20, 2017 at 1:35 PM, Richard Sandiford
>> <richard.sandiford@linaro.org> wrote:
>>> Richard Biener <richard.guenther@gmail.com> writes:
>>>> On Mon, Nov 20, 2017 at 12:56 AM, Jeff Law <law@redhat.com> wrote:
>>>>> On 11/09/2017 06:24 AM, Richard Sandiford wrote:
>>>>>> [... patch description and ChangeLog quoted in full; snipped ...]
>>>>> OK.
>>>>
>>>> It's really a step backwards - we had those optabs and a tree code in
>>>> the past and
>>>> canonicalizing things to VEC_PERM_EXPR made things simpler.
>>>>
>>>> Why doesn't VEC_PERM <v1, v2, that-constant-series-expr-thing> not work?
>>>
>>> The problems with that are:
>>>
>>> - It doesn't work for vectors with 256-bit elements because the indices
>>>   wrap round.
>>
>> That's a general issue that would need to be addressed for larger
>> vectors (GCN?).
>> I presume the requirement that the permutation vector have the same size
>> needs to be relaxed.
>>
>>> - Supporting a fake VEC_PERM_EXPR <v256qi, v256qi, v256hi> for a few
>>>   special cases would be hard, especially since v256hi isn't a normal
>>>   vector mode.  I imagine everything dealing with VEC_PERM_EXPR would
>>>   then have to worry about that special case.
>>
>> I think it's not really a special case - any code here should just
>> expect the same
>> number of vector elements and not a particular size.  You already dealt with
>> using a char[] vector for permutations I think.
>
> It sounds like you're talking about the case in which the permutation
> vector is a VECTOR_CST.  We still use VEC_PERM_EXPRs for constant-length
> vectors, so that doesn't change.  (And yes, that probably means that it
> does break for *fixed-length* 2048-bit vectors.)
>
> But this patch is about the variable-length case, in which the
> permutation vector is never a VECTOR_CST, and couldn't get converted
> to a vec_perm_indices array.  As far as existing code is concerned,
> it's no different from a VEC_PERM_EXPR with a variable permutation
> vector.

But the permutation vector is constant as well - this is what you added those
VEC_SERIES_CST stuff and whatnot for.

I don't want variable-size vector special-casing everywhere.  I want it to be
somehow naturally integrating with existing stuff.

> So by taking this approach, we'd effectively be committing to supporting
> VEC_PERM_EXPRs with variable permutation vectors that are wider than
> the vectors being permuted.  Those permutation vectors will usually not
> have a vector_mode_supported_p mode and will have to be synthesised
> somehow.  Trying to support the general case like this could be incredibly
> expensive.  Only certain special cases like interleave hi/lo could be
> handled cheaply.

As far as I understand SVE only supports interleave / extract even/odd anyway.

>>> [... the VEC_SERIES_CST / INTERLEAVE_LO points 1-4 quoted again; snipped ...]
>>
>> Well, the answer is of course that you just need a more powerful VEC_SERIES_CST
>> that can handle INTERLEAVE_HI/LO.  It seems to me SVE can generate
>> such masks relatively cheaply -- do a 0, 1, 2, 3... sequence and then do
>> a INTERLEAVE_HI/LO on it.  So it makes sense that we can directly specify it.
>
> It can do lots of other things too :-)  But in all cases as separate
> statements.  It seems better to expose them as separate statements in
> gimple so that they get optimised by the more powerful gimple optimisers,
> rather than waiting until rtl.
>
> I think if we go down the route of building more and more operations into
> the constant, we'll end up inventing a gimple version of rtx CONST.
>
> I also don't see why it's OK to expose the concept of interleave hi/lo
> as an operation on constants but not as an operation on general vectors.

I'm not suggesting to expose it as an operation.  I'm suggesting that if the
target can vec_perm_const_ok () with an "interleave/extract" permutation
then we should be able to represent that with VEC_PERM_EXPR and thus
also represent the permutation vector.

I wasn't too happy with VEC_SERIES_CST either you know.

As said, having something as disruptive as poly_int everywhere but then
still need all those special casing for variable length vectors in the
vectorizer looks just wrong.

How are you going to handle __builtin_shuffle () with SVE & intrinsics?
How are you going to handle generic vector "lowering"?

All the predication stuff is hidden from the middle-end as well.  It would
have been nice to finally have a nice way to express these things in GIMPLE.

Bah.

>> Suggested fix: add a interleaved bit to VEC_SERIES_CST.
>
> That only handles this one case though.  We'd have to keep making it
> more and more complicated as more cases come up.  E.g. the extra bit
> couldn't represent { 0, 1, 2, ..., n/2-1, n, n+1, ... }.

But is there an instruction for this in SVE?  I understand there's a single
instruction doing interleave low/high and extract even/odd?

But is there more?  Possibly a generic permute but for it you'd have
to explicitly construct a permutation vector using some primitives
like that "series" instruction?  So for that case it's reasonable to
have GIMPLE like

  perm_vector_1 = VEC_SERIES_EXPR <...>
  ...
  v_2 = VEC_PERM_EXPR <.., .., perm_vector_1>;

that is, it's not required to pretend the VEC_PERM_EXPR is a single
instruction or the permutation vector is "constant"?

>> At least I'd like to see it used for the cases it can already handle.
>>
>> VEC_PERM_EXPR is supposed to be the only permutation operation, if it cannot
>> handle some cases it needs to be fixed / its constraints relaxed (like
>> the v256qi case).
>
> I don't think we gain anything by shoehorning everything into one code
> for the variable-length case though.  None of the existing vec_perm_const
> code can (or should) be used, and none of the existing VECTOR_CST-based
> VEC_PERM_EXPR handling will do anything.  That accounts for the majority
> of the VEC_PERM_EXPR support.  Also, like with VEC_DUPLICATE_CST vs.
> VECTOR_CST, there won't be any vector types that use a mixture of
> current VEC_PERM_EXPRs and new VEC_PERM_EXPRs.

I dislike having those as you know.

> So if we used VEC_PERM_EXPRs with VEC_SERIES_CSTs instead of internal
> permute functions, we wouldn't get any new optimisations for free: we'd
> have to write new code to match the new constants.

Yeah, too bad we have those new constants ;)

> So it becomes a
> question of whether it's easier to do that on VEC_SERIES_CSTs or
> internal functions.  I think match.pd makes it much easier to optimise
> internal functions, and you get the added benefit that the result is
> automatically checked against what the target supports.  And the
> direct mapping to optabs means that no re-recognition is necessary.
>
> It also seems inconsistent to allow things like TARGET_MEM_REF vs.
> MEM_REF but still require a single gimple permute operation, regardless
> of circumstances.  I thought we were trying to move in the other direction,
> i.e. trying to get the power of the gimple optimisers for things that
> were previously handled by rtl.

Indeed I don't like TARGET_MEM_REF too much either.

>>> Using an internal function seems much simpler :-)
>>>
>>> I think VEC_PERM_EXPR is useful because it represents the same
>>> operation as __builtin_shuffle, and we want to optimise that as best
>>> we can.  But these internal functions are only used by the vectoriser,
>>> which should always see what the final form of the permute should be.
>>
>> You hope so.  We have several cases where later unrolling and CSE/forwprop
>> optimize permutations away.
>
> Unrolling doesn't usually expose anything useful for variable-length
> vectors though, since the iv step is also variable.  I guess it could
> still happen, but TBH I'd rather take the hit of that than the risk
> that optimisers could create expensive non-native permutes.

optimizers / folders have to (and do) check vec_perm_const_ok if they
change a constant permute vector.

Note there's always an advantage of exposing target capabilities directly,
like on x86 those vec_perm_const_ok VEC_PERM_EXPRs could be
expanded to the (series of!) native "permute" instructions of x86 by adding
(target specific!) IFNs.  But then those would be black boxes to all followup
optimizers which means we could as well have none of those.  But fact is
the vectorizer isn't perfect and we rely on useless permutes being
a) CSEd, b) combined, c) eliminated against extracts, etc.  You'd have to
replicate all VEC_PERM/BIT_FIELD_REF/etc. patterns we have in match.pd
for all of the target IFNs.  Yes, if we were right before RTL expansion we can
have those "target IFNs" immediately but fact is we do a _lot_ of optimizations
after vectorization.

Oh, and there I mentioned "target IFNs" (and related, "target match.pd").
You are adding IFNs that exist for each target (because that's how optabs
work(?)) but in reality you are generating ones that match SVE.  Not
very nice either.

That said, all this feels a bit like a hack throughout GCC rather
than designing variable-length vectors into GIMPLE and then providing
some implementation meat in the arm backend(s).  I know you're time
constrained but I think we're carrying a quite big maintenance burden
that will be very difficult to "fix" afterwards (because of lack of
motivation once this is in).  And we've not even seen SVE silicon...
(happened with SSE5 for example but that at least was x86-only).

Richard.

> Thanks,
> Richard
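[As an illustration of the kind of follow-up cleanup being described here
(this example is mine, not from the thread): two back-to-back reversals
written with __builtin_shuffle cancel out, and the expectation is that the
gimple permute patterns fold the pair away rather than leave two permutes
for the target to execute.]

typedef int v4si __attribute__ ((vector_size (16)));

int
reverse_twice_first_element (v4si a)
{
  v4si rev  = __builtin_shuffle (a,   (v4si) { 3, 2, 1, 0 });
  v4si back = __builtin_shuffle (rev, (v4si) { 3, 2, 1, 0 });
  /* Combining the two selectors gives the identity permutation,
     so this should ideally reduce to just a[0].  */
  return back[0];
}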
Richard Biener <richard.guenther@gmail.com> writes:
> On Tue, Nov 21, 2017 at 11:47 PM, Richard Sandiford
> <richard.sandiford@linaro.org> wrote:
>> Richard Biener <richard.guenther@gmail.com> writes:
>>> On Mon, Nov 20, 2017 at 1:35 PM, Richard Sandiford
>>> <richard.sandiford@linaro.org> wrote:
>>>> Richard Biener <richard.guenther@gmail.com> writes:
>>>>> On Mon, Nov 20, 2017 at 12:56 AM, Jeff Law <law@redhat.com> wrote:
>>>>>> On 11/09/2017 06:24 AM, Richard Sandiford wrote:
>>>>>>> [... patch description and ChangeLog quoted in full; snipped ...]
>>>>>> OK.
>>>>>
>>>>> [... earlier discussion quoted in full; snipped ...]
>>
>> It sounds like you're talking about the case in which the permutation
>> vector is a VECTOR_CST.  We still use VEC_PERM_EXPRs for constant-length
>> vectors, so that doesn't change.  (And yes, that probably means that it
>> does break for *fixed-length* 2048-bit vectors.)
>>
>> But this patch is about the variable-length case, in which the
>> permutation vector is never a VECTOR_CST, and couldn't get converted
>> to a vec_perm_indices array.  As far as existing code is concerned,
>> it's no different from a VEC_PERM_EXPR with a variable permutation
>> vector.
>
> But the permutation vector is constant as well - this is what you added those
> VEC_SERIES_CST stuff and whatnot for.
>
> I don't want variable-size vector special-casing everywhere.  I want it to be
> somehow naturally integrating with existing stuff.

It's going to be a special case whatever happens though.  If it's
a VEC_PERM_EXPR then it'll be a new form of VEC_PERM_EXPR.  The
advantage of the internal functions and optabs is that they map to
a concept that already exists.  The code that generates the permutation
already knows that it's generating an interleave lo/hi, and like you say,
it used to do that directly via special tree codes.  I agree that having
a VEC_PERM_EXPR makes more sense for the constant-length case, but the
concept is still there.

And although using VEC_PERM_EXPR in gimple makes sense, I think not
having the optabs is a step backwards, because it means that every
target with interleave lo/hi has to duplicate the detection logic.

>> So by taking this approach, we'd effectively be committing to supporting
>> VEC_PERM_EXPRs with variable permutation vectors that are wider than
>> the vectors being permuted.  Those permutation vectors will usually not
>> have a vector_mode_supported_p mode and will have to be synthesised
>> somehow.  Trying to support the general case like this could be incredibly
>> expensive.  Only certain special cases like interleave hi/lo could be
>> handled cheaply.
>
> As far as I understand SVE only supports interleave / extract even/odd anyway.
>
>>>> [... the VEC_SERIES_CST / INTERLEAVE_LO points 1-4 quoted again; snipped ...]
>>>
>>> Well, the answer is of course that you just need a more powerful
>>> VEC_SERIES_CST
>>> that can handle INTERLEAVE_HI/LO.  It seems to me SVE can generate
>>> such masks relatively cheaply -- do a 0, 1, 2, 3... sequence and then do
>>> a INTERLEAVE_HI/LO on it.  So it makes sense that we can directly specify it.
>>
>> It can do lots of other things too :-)  But in all cases as separate
>> statements.  It seems better to expose them as separate statements in
>> gimple so that they get optimised by the more powerful gimple optimisers,
>> rather than waiting until rtl.
>>
>> I think if we go down the route of building more and more operations into
>> the constant, we'll end up inventing a gimple version of rtx CONST.
>>
>> I also don't see why it's OK to expose the concept of interleave hi/lo
>> as an operation on constants but not as an operation on general vectors.
>
> I'm not suggesting to expose it as an operation.  I'm suggesting that if the
> target can vec_perm_const_ok () with an "interleave/extract" permutation
> then we should be able to represent that with VEC_PERM_EXPR and thus
> also represent the permutation vector.

But vec_perm_const_ok () takes a fixed-length mask, so it can't be
used here.  It would need to be a new hook (and thus a new special
case for variable-length vectors).

> I wasn't too happy with VEC_SERIES_CST either you know.
>
> As said, having something as disruptive as poly_int everywhere but then
> still need all those special casing for variable length vectors in the
> vectorizer
> looks just wrong.
>
> How are you going to handle __builtin_shuffle () with SVE & intrinsics?

The variable case should work with the current constraints, i.e. with
the permutation vector having the same element width as the vectors
being permuted, once there's a way of writing __builtin_shuffle with
variable-length vectors.  That means that 256-element shuffles can't
refer to the second vector, but that's correct.  (The problem here is
that the interleaves *do* need to refer to the second vector, but using
wider vectors for the permutation vector wouldn't be a native operation
in general for SVE or for any existing target.)

> How are you going to handle generic vector "lowering"?

We shouldn't generate variable-length vectors that don't exist on the
target (and there's no syntax for doing that in C).

> All the predication stuff is hidden from the middle-end as well.  It would
> have been nice to finally have a nice way to express these things in GIMPLE.

Not sure what you mean here.  The only time predication is hidden is
when SVE requires an all-true predicate for a full-vector operation,
which seems like a target-specific detail.  All "real" predication is
exposed in GIMPLE.  It builds on the existing support for vector
boolean types.

> Bah.
>
>>> Suggested fix: add a interleaved bit to VEC_SERIES_CST.
>>
>> That only handles this one case though.  We'd have to keep making it
>> more and more complicated as more cases come up.  E.g. the extra bit
>> couldn't represent { 0, 1, 2, ..., n/2-1, n, n+1, ... }.
>
> But is there an instruction for this in SVE?  I understand there's a single
> instruction doing interleave low/high and extract even/odd?

This specific example, no.  But my point is that...

> But is there more?  Possibly a generic permute but for it you'd have
> to explicitly construct a permutation vector using some primitives
> like that "series" instruction?  So for that case it's reasonable to
> have GIMPLE like
>
>   perm_vector_1 = VEC_SERIES_EXPR <...>
>   ...
>   v_2 = VEC_PERM_EXPR <.., .., perm_vector_1>;
>
> that is, it's not required to pretend the VEC_PERM_EXPR is a single
> instruction or the permutation vector is "constant"?

...by taking this approach, we're saying that we need to ensure that
there is always a way of representing every directly-supported variable-
length permutation mask as a constant, so that it doesn't get split from
VEC_PERM_EXPR.  I don't see why that's better than having internal
functions.  You said that you don't like the extra constants, and each
time we make the constants more complicated, we have to support the more
complicated constants everywhere that handles the constants (rather than
everywhere that handles the VEC_PERM_EXPRs).

>>> At least I'd like to see it used for the cases it can already handle.
>>>
>>> VEC_PERM_EXPR is supposed to be the only permutation operation, if it cannot
>>> handle some cases it needs to be fixed / its constraints relaxed (like
>>> the v256qi case).
>>
>> I don't think we gain anything by shoehorning everything into one code
>> for the variable-length case though.  None of the existing vec_perm_const
>> code can (or should) be used, and none of the existing VECTOR_CST-based
>> VEC_PERM_EXPR handling will do anything.  That accounts for the majority
>> of the VEC_PERM_EXPR support.  Also, like with VEC_DUPLICATE_CST vs.
>> VECTOR_CST, there won't be any vector types that use a mixture of
>> current VEC_PERM_EXPRs and new VEC_PERM_EXPRs.
>
> I dislike having those as you know.
>
>> So if we used VEC_PERM_EXPRs with VEC_SERIES_CSTs instead of internal
>> permute functions, we wouldn't get any new optimisations for free: we'd
>> have to write new code to match the new constants.
>
> Yeah, too bad we have those new constants ;)
>
>> So it becomes a
>> question of whether it's easier to do that on VEC_SERIES_CSTs or
>> internal functions.  I think match.pd makes it much easier to optimise
>> internal functions, and you get the added benefit that the result is
>> automatically checked against what the target supports.  And the
>> direct mapping to optabs means that no re-recognition is necessary.
>>
>> It also seems inconsistent to allow things like TARGET_MEM_REF vs.
>> MEM_REF but still require a single gimple permute operation, regardless
>> of circumstances.  I thought we were trying to move in the other direction,
>> i.e. trying to get the power of the gimple optimisers for things that
>> were previously handled by rtl.
>
> Indeed I don't like TARGET_MEM_REF too much either.

Hmm, ok.  Maybe that's the difference here.  It seems like a really
nice feature to me :-)

>>>> Using an internal function seems much simpler :-)
>>>>
>>>> I think VEC_PERM_EXPR is useful because it represents the same
>>>> operation as __builtin_shuffle, and we want to optimise that as best
>>>> we can.  But these internal functions are only used by the vectoriser,
>>>> which should always see what the final form of the permute should be.
>>>
>>> You hope so.  We have several cases where later unrolling and CSE/forwprop
>>> optimize permutations away.
>>
>> Unrolling doesn't usually expose anything useful for variable-length
>> vectors though, since the iv step is also variable.  I guess it could
>> still happen, but TBH I'd rather take the hit of that than the risk
>> that optimisers could create expensive non-native permutes.
>
> optimizers / folders have to (and do) check vec_perm_const_ok if they
> change a constant permute vector.
>
> Note there's always an advantage of exposing target capabilities directly,
> like on x86 those vec_perm_const_ok VEC_PERM_EXPRs could be
> expanded to the (series of!) native "permute" instructions of x86 by adding
> (target specific!) IFNs.  But then those would be black boxes to all followup
> optimizers which means we could as well have none of those.  But fact is
> the vectorizer isn't perfect and we rely on useless permutes being
> a) CSEd, b) combined, c) eliminated against extracts, etc.  You'd have to
> replicate all VEC_PERM/BIT_FIELD_REF/etc. patterns we have in match.pd
> for all of the target IFNs.  Yes, if we were right before RTL expansion we can
> have those "target IFNs" immediately but fact is we do a _lot_ of optimizations
> after vectorization.
>
> Oh, and there I mentioned "target IFNs" (and related, "target match.pd").
> You are adding IFNs that exist for each target (because that's how optabs
> work(?)) but in reality you are generating ones that match SVE.  Not
> very nice either.

But the concepts are general, even if they're implemented by only one
target at the moment.  One architecture always has to come first.
E.g. when IFN_MASK_LOAD went in, it was only supported for x86_64.
Adding it as a generic function was still the right thing to do and
meant that all SVE had to do was define the optab.

I think target IFNs would only make sense if we have some sort of
pre-expand target-specific lowering pass.  (Which might be a good thing.)
Here we're adding internal functions for things that the vectoriser has
to be aware of.

> That said, all this feels a bit like a hack throughout GCC rather
> than designing variable-length vectors into GIMPLE and then providing
> some implementation meat in the arm backend(s).  I know you're time
> constrained but I think we're carrying a quite big maintenance burden
> that will be very difficult to "fix" afterwards (because of lack of
> motivation once this is in).  And we've not even seen SVE silicon...
> (happened with SSE5 for example but that at least was x86-only).

I don't think it's a hack, and it didn't end up this way because of
time constraints.  IMO designing variable-length vectors into gimple
means that (a) it needs to be possible to create variable-length vector
constants in gimple (hence the new constants) and (b) gimple optimisers
need to be aware of the fact that vectors have a variable length,
element offsets can be variable, etc. (hence the poly_int stuff, which
is also needed for rtl).

Thanks,
Richard
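[A fixed-length sketch of the element-width constraint described above; the
types here are purely for illustration.  With 256 single-byte elements, a
same-width selector can address every element of one input but has no
encoding left for the 256..511 indices that a two-input shuffle would need.]

typedef unsigned char v256qi __attribute__ ((vector_size (256)));

v256qi
reverse_one_input (v256qi a)
{
  v256qi sel;
  /* 255 - i always fits in an unsigned char, so a one-input shuffle
     (including this reversal) is representable...  */
  for (int i = 0; i < 256; i++)
    sel[i] = 255 - i;
  return __builtin_shuffle (a, sel);
  /* ...but __builtin_shuffle (a, b, sel) would need selector values
     up to 511, which an unsigned char element cannot hold.  */
}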
Hi, On Thu, 23 Nov 2017, Richard Sandiford wrote: > > I don't want variable-size vector special-casing everywhere. I want > > it to be somehow naturally integrating with existing stuff. > > It's going to be a special case whatever happens though. It wouldn't have to be this way. It's like saying that loops with a constant upper bound should be represented in a different way than loops with an invariant upper bound. That would seem like a bad idea. > If it's a VEC_PERM_EXPR then it'll be a new form of VEC_PERM_EXPR. No, it'd be a VEC_PERM_EXPR where the magic mask is generated by a new EXPR type, instead of being a mere constant. > The advantage of the internal functions and optabs is that they map to a > concept that already exists. The code that generates the permutation > already knows that it's generating an interleave lo/hi, and like you > say, it used to do that directly via special tree codes. I agree that > having a VEC_PERM_EXPR makes more sense for the constant-length case, > but the concept is still there. > > And although using VEC_PERM_EXPR in gimple makes sense, I think not > having the optabs is a step backwards, because it means that every > target with interleave lo/hi has to duplicate the detection logic. The middle end can provide helper routines to make detection easy. The RTL expander could also match VEC_PERM_EXPR to specific optabs, if we really really want to add optab over optab for each specific kind of permutation in the future. In a way the difference boils down to having PERM(x,y, TYPE) (with TYPE being, say, HI_LO, EXTR_EVEN/ODD, REVERSE, and what not) vs. PERM_HI_LO(x,y) PERM_EVEN(x,y) PERM_ODD(x,y) PERM_REVERSE(x,y) ... The former way seems saner for an intermediate representation. In this specific case TYPE would be detected by magicness of the constant, and if extended to SVE by magicness of the definition of the variably-sized invariant. > > I'm not suggesting to expose it as an operation. I'm suggesting that > > if the target can vec_perm_const_ok () with an "interleave/extract" > > permutation then we should be able to represent that with > > VEC_PERM_EXPR and thus also represent the permutation vector. > > But vec_perm_const_ok () takes a fixed-length mask, so it can't be > used here. It would need to be a new hook (and thus a new special > case for variable-length vectors). Why do you reject extending vec_perm_const_ok to _do_ take an invariant mask? > > But is there more? Possibly a generic permute but for it you'd have > > to explicitly construct a permutation vector using some primitives > > like that "series" instruction? So for that case it's reasonable to > > have GIMPLE like > > > > perm_vector_1 = VEC_SERIES_EXPR <...> > > ... > > v_2 = VEC_PERM_EXPR <.., .., perm_vector_1>; > > > > that is, it's not required to pretend the VEC_PERM_EXPR is a single > > instruction or the permutation vector is "constant"? > > ...by taking this approach, we're saying that we need to ensure that > there is always a way of representing every directly-supported variable- > length permutation mask as a constant, so that it doesn't get split from > VEC_PERM_EXPR. I'm having trouble understanding this. Why would splitting away the definition of perm_vector_1 from VEC_PERM_EXPR be a problem? It's still the same VEC_SERIES_EXPR, and hence still recognizable as a special permutation (if it is one). The optimizer won't touch VEC_SERIES_EXPR, or if they do (e.g. 
combine two of them) and they feed a VEC_PERM_EXPR, they will make sure the combined result still is supported by the target. In a way, on targets which support only specific forms of permutation for the vector type in question, this invariant mask won't be explicitly generated in code, it's an abstract tag in the IR to specify the type of the transformation. Hence moving the def for that tag around is no problem. > I don't see why that's better than having internal > functions. The real difference isn't internal functions vs. expression nodes, but rather multiple node types vs. a single node type. Ciao, Michael.
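For readers following along, here is a minimal fixed-length sketch of the permutations the thread keeps referring to, written with GCC's generic vector extensions and __builtin_shuffle; the 4-element int vectors and the function names are chosen purely for illustration and are not part of the patch. The constant third operand of each call is the "magic mask" that VEC_PERM_EXPR carries today and that the variable-length case would somehow have to encode:

typedef int v4si __attribute__ ((vector_size (16)));

/* Interleave lo: { x0, y0, x1, y1 }.  */
v4si
interleave_lo (v4si x, v4si y)
{
  return __builtin_shuffle (x, y, (v4si) { 0, 4, 1, 5 });
}

/* Interleave hi: { x2, y2, x3, y3 }.  */
v4si
interleave_hi (v4si x, v4si y)
{
  return __builtin_shuffle (x, y, (v4si) { 2, 6, 3, 7 });
}

/* Extract even: { x0, x2, y0, y2 }.  */
v4si
extract_even (v4si x, v4si y)
{
  return __builtin_shuffle (x, y, (v4si) { 0, 2, 4, 6 });
}

/* Extract odd: { x1, x3, y1, y3 }.  */
v4si
extract_odd (v4si x, v4si y)
{
  return __builtin_shuffle (x, y, (v4si) { 1, 3, 5, 7 });
}

/* Reverse: { x3, x2, x1, x0 }.  */
v4si
reverse (v4si x)
{
  return __builtin_shuffle (x, (v4si) { 3, 2, 1, 0 });
}

For fixed-length vectors each of these selectors is an ordinary VECTOR_CST; the disagreement is about what stands in for them once the number of elements is no longer known at compile time.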
On Thu, Nov 23, 2017 at 02:43:32PM +0100, Michael Matz wrote: > > If it's a VEC_PERM_EXPR then it'll be a new form of VEC_PERM_EXPR. > > No, it'd be a VEC_PERM_EXPR where the magic mask is generated by a new > EXPR type, instead of being a mere constant. Or an internal function that would produce the permutation mask vector given kind and number of vector elements and element type for the mask. Jakub
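Jakub's suggestion is not fleshed out in the thread, but purely as an illustration, the rule such a mask-producing function would have to encode might look like the sketch below; the enum, the helper name and its interface are hypothetical and are not existing GCC code:

/* Hypothetical sketch: element I of the selector for a permutation of
   KIND on two NELTS-element input vectors (one input for PERM_REVERSE).  */

enum perm_kind { PERM_INTERLEAVE_LO, PERM_INTERLEAVE_HI,
		 PERM_EXTRACT_EVEN, PERM_EXTRACT_ODD, PERM_REVERSE };

static unsigned int
perm_mask_element (enum perm_kind kind, unsigned int i, unsigned int nelts)
{
  switch (kind)
    {
    case PERM_INTERLEAVE_LO:
      /* { 0, nelts, 1, nelts + 1, ... }  */
      return (i & 1) ? i / 2 + nelts : i / 2;
    case PERM_INTERLEAVE_HI:
      /* { nelts/2, nelts*3/2, nelts/2 + 1, nelts*3/2 + 1, ... }  */
      return nelts / 2 + ((i & 1) ? i / 2 + nelts : i / 2);
    case PERM_EXTRACT_EVEN:
      return i * 2;
    case PERM_EXTRACT_ODD:
      return i * 2 + 1;
    case PERM_REVERSE:
      return nelts - i - 1;
    }
  return 0;
}

A follow-up question, taken up in Richard's reply further down, is whether producing the mask as a separate value forces the mask elements into a type wide enough to hold every index.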
Michael Matz <matz@suse.de> writes: > Hi, > > On Thu, 23 Nov 2017, Richard Sandiford wrote: > >> > I don't want variable-size vector special-casing everywhere. I want >> > it to be somehow naturally integrating with existing stuff. >> >> It's going to be a special case whatever happens though. > > It wouldn't have to be this way. It's like saying that loops with a > constant upper bound should be represented in a different way than loops > with an invariant upper bound. That would seem like a bad idea. The difference is that with a loop, each iteration follows a set pattern. But: (1) for constant-length VEC_PERM_EXPRs, each element of the permutation vector is independent of the others: you can't predict what the selector for element i is given the selectors for the other elements. (2) for variable-length permutes, the elements *do* have to follow a set pattern that can be extended indefinitely. Or do you mean that we should use the new representation of interleave masks even for constant-length vectors, rather than using a VECTOR_CST? I suppose that would be more consistent, but we'd then have to check when generating a VEC_PERM_EXPR of a VECTOR_CST whether it should be represented in this new way instead. I think we then lose the benefit of using a single tree code. The decision for VEC_DUPLICATE_CST and VEC_SERIES_CST was to restrict them only to variable-length vectors. >> If it's a VEC_PERM_EXPR then it'll be a new form of VEC_PERM_EXPR. > > No, it'd be a VEC_PERM_EXPR where the magic mask is generated by a new > EXPR type, instead of being a mere constant. (See [1] below) >> The advantage of the internal functions and optabs is that they map to a >> concept that already exists. The code that generates the permutation >> already knows that it's generating an interleave lo/hi, and like you >> say, it used to do that directly via special tree codes. I agree that >> having a VEC_PERM_EXPR makes more sense for the constant-length case, >> but the concept is still there. >> >> And although using VEC_PERM_EXPR in gimple makes sense, I think not >> having the optabs is a step backwards, because it means that every >> target with interleave lo/hi has to duplicate the detection logic. > > The middle end can provide helper routines to make detection easy. The > RTL expander could also match VEC_PERM_EXPR to specific optabs, if we > really really want to add optab over optab for each specific kind of > permutation in the future. Adding a new optab doesn't seem like a big deal to me, if it's for something that target-independent code already needs to worry about. (The reason we need these specific optabs is because the target-independent code already generates these particular permutes.) The overhead attached to adding an optab isn't really any higher than adding a new detector function, especially on targets that don't implement the optab. > In a way the difference boils down to having > PERM(x,y, TYPE) > (with TYPE being, say, HI_LO, EXTR_EVEN/ODD, REVERSE, and what not) > vs. > PERM_HI_LO(x,y) > PERM_EVEN(x,y) > PERM_ODD(x,y) > PERM_REVERSE(x,y) > ... > > The former way seems saner for an intermediate representation. In this > specific case TYPE would be detected by magicness of the constant, and if > extended to SVE by magicness of the definition of the variably-sized > invariant. [1] That sounds similar to the way that COND_EXPR and VEC_COND_EXPR can embed the comparison in the first operand. 
I think Richard has complained about that in the past (and it does cause some ugliness in the way mask types are calculated during vectorisation). >> > I'm not suggesting to expose it as an operation. I'm suggesting that >> > if the target can vec_perm_const_ok () with an "interleave/extract" >> > permutation then we should be able to represent that with >> > VEC_PERM_EXPR and thus also represent the permutation vector. >> >> But vec_perm_const_ok () takes a fixed-length mask, so it can't be >> used here. It would need to be a new hook (and thus a new special >> case for variable-length vectors). > > Why do you reject extending vec_perm_const_ok to _do_ take an invariant > mask? What kind of interface were you thinking of though? Note that the current interface is independent of the tree or rtl levels, since it's called by both gimple optimisers and vec_perm_const expanders. I assume we'd want to keep that. >> > But is there more? Possibly a generic permute but for it you'd have >> > to explicitly construct a permutation vector using some primitives >> > like that "series" instruction? So for that case it's reasonable to >> > have GIMPLE like >> > >> > perm_vector_1 = VEC_SERIES_EXPR <...> >> > ... >> > v_2 = VEC_PERM_EXPR <.., .., perm_vector_1>; >> > >> > that is, it's not required to pretend the VEC_PERM_EXPR is a single >> > instruction or the permutation vector is "constant"? >> >> ...by taking this approach, we're saying that we need to ensure that >> there is always a way of representing every directly-supported variable- >> length permutation mask as a constant, so that it doesn't get split from >> VEC_PERM_EXPR. > > I'm having trouble understanding this. Why would splitting away the > definition of perm_vector_1 from VEC_PERM_EXPR be a problem? It's still > the same VEC_SERIES_EXPR, and hence still recognizable as a special > permutation (if it is one). The optimizer won't touch VEC_SERIES_EXPR, or > if they do (e.g. combine two of them) and they feed a VEC_PERM_EXPR, they > will make sure the combined result still is supported by the target. The problem with splitting it out is that it just becomes any old gassign, and you don't normally have to check the uses of an SSA_NAME before optimising the definition. > In a way, on targets which support only specific forms of permutation for > the vector type in question, this invariant mask won't be explicitly > generated in code, it's an abstract tag in the IR to specify the type of > the transformation. Hence moving the def for that tag around is no > problem. But why should the def exist as a separate gimple statement in that case? If it's one operation then it seems better to keep it as one operation, both for code-gen and for optimisation. >> I don't see why that's better than having internal >> functions. > > The real difference isn't internal functions vs. expression nodes, but > rather multiple node types vs. a single node type. But maybe more specifically: multiple node types that each have a single form vs. one node type that has multiple forms. And at that level it seems like it's a difference between: gimple_assign_rhs_code (x) == VEC_PERM_EXPR vs. gimple_vec_perm_p (x) Thanks, Richard
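To make the "helper routines" alternative concrete, the kind of middle-end detection helper being discussed might look something like the following sketch; the name and interface are made up for illustration and nothing like it is part of the patch:

#include <stdbool.h>

/* Sketch only: return true if SEL describes the interleave-lo selector
   for a two-input permutation of NELTS-element vectors, i.e.
   { 0, nelts, 1, nelts + 1, ... }.  */
static bool
interleave_lo_mask_p (const unsigned int *sel, unsigned int nelts)
{
  for (unsigned int i = 0; i < nelts / 2; ++i)
    if (sel[2 * i] != i || sel[2 * i + 1] != i + nelts)
      return false;
  return true;
}

With the internal-function approach the question of re-recognition does not arise: an IFN_VEC_INTERLEAVE_LO call is an interleave-lo by construction, which is what the "no re-recognition is necessary" point earlier in the thread is getting at.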
Jakub Jelinek <jakub@redhat.com> writes: > On Thu, Nov 23, 2017 at 02:43:32PM +0100, Michael Matz wrote: >> > If it's a VEC_PERM_EXPR then it'll be a new form of VEC_PERM_EXPR. >> >> No, it'd be a VEC_PERM_EXPR where the magic mask is generated by a new >> EXPR type, instead of being a mere constant. > > Or an internal function that would produce the permutation mask vector > given kind and number of vector elements and element type for the mask. I think this comes back to what the return type of the functions should be. A vector of QImodes wouldn't be wide enough when vectors are 256 elements or wider, so we'd need a vector of HImodes with the same number of elements as a native vector of QImodes. This means that the function will need to return a non-native vector type. That doesn't matter if the function remains glued to the VEC_PERM_EXPR, but once we expose it as a separate operation, it can get optimised separately from the VEC_PERM_EXPR. Also, even with this representation, the operation isn't truly variable-length. It just increases the maximum number of elements from 256 to 65536. That ought to be enough realistically (in the same way that 640k ought to be enough realistically), but the patches as posted have avoided needing to encode a maximum like that. Having the internal function do the permute rather than produce the mask means that the operation really is variable-length: we don't require a vector index to fit within a specific integer type. Thanks, Richard
Index: gcc/doc/md.texi =================================================================== --- gcc/doc/md.texi 2017-11-09 13:21:01.989917982 +0000 +++ gcc/doc/md.texi 2017-11-09 13:21:02.323463345 +0000 @@ -5017,6 +5017,46 @@ There is no need for a target to supply and @samp{vec_perm_const@var{m}} if the former can trivially implement the operation with, say, the vector constant loaded into a register. +@cindex @code{vec_reverse_@var{m}} instruction pattern +@item @samp{vec_reverse_@var{m}} +Reverse the order of the elements in vector input operand 1 and store +the result in vector output operand 0. Both operands have mode @var{m}. + +This pattern is provided mainly for targets with variable-length vectors. +Targets with fixed-length vectors can instead handle any reverse-specific +optimizations in @samp{vec_perm_const@var{m}}. + +@cindex @code{vec_interleave_lo_@var{m}} instruction pattern +@item @samp{vec_interleave_lo_@var{m}} +Take the lowest-indexed halves of vector input operands 1 and 2 and +interleave the elements, so that element @var{x} of operand 1 is followed by +element @var{x} of operand 2. Store the result in vector output operand 0. +All three operands have mode @var{m}. + +This pattern is provided mainly for targets with variable-length +vectors. Targets with fixed-length vectors can instead handle any +interleave-specific optimizations in @samp{vec_perm_const@var{m}}. + +@cindex @code{vec_interleave_hi_@var{m}} instruction pattern +@item @samp{vec_interleave_hi_@var{m}} +Like @samp{vec_interleave_lo_@var{m}}, but operate on the highest-indexed +halves instead of the lowest-indexed halves. + +@cindex @code{vec_extract_even_@var{m}} instruction pattern +@item @samp{vec_extract_even_@var{m}} +Concatenate vector input operands 1 and 2, extract the elements with +even-numbered indices, and store the result in vector output operand 0. +All three operands have mode @var{m}. + +This pattern is provided mainly for targets with variable-length vectors. +Targets with fixed-length vectors can instead handle any +extract-specific optimizations in @samp{vec_perm_const@var{m}}. + +@cindex @code{vec_extract_odd_@var{m}} instruction pattern +@item @samp{vec_extract_odd_@var{m}} +Like @samp{vec_extract_even_@var{m}}, but extract the elements with +odd-numbered indices. + @cindex @code{push@var{m}1} instruction pattern @item @samp{push@var{m}1} Output a push instruction. Operand 0 is value to push. Used only when Index: gcc/internal-fn.def =================================================================== --- gcc/internal-fn.def 2017-11-09 13:21:01.989917982 +0000 +++ gcc/internal-fn.def 2017-11-09 13:21:02.323463345 +0000 @@ -102,6 +102,17 @@ DEF_INTERNAL_OPTAB_FN (STORE_LANES, ECF_ DEF_INTERNAL_OPTAB_FN (MASK_STORE_LANES, 0, vec_mask_store_lanes, mask_store_lanes) +DEF_INTERNAL_OPTAB_FN (VEC_INTERLEAVE_LO, ECF_CONST | ECF_NOTHROW, + vec_interleave_lo, binary) +DEF_INTERNAL_OPTAB_FN (VEC_INTERLEAVE_HI, ECF_CONST | ECF_NOTHROW, + vec_interleave_hi, binary) +DEF_INTERNAL_OPTAB_FN (VEC_EXTRACT_EVEN, ECF_CONST | ECF_NOTHROW, + vec_extract_even, binary) +DEF_INTERNAL_OPTAB_FN (VEC_EXTRACT_ODD, ECF_CONST | ECF_NOTHROW, + vec_extract_odd, binary) +DEF_INTERNAL_OPTAB_FN (VEC_REVERSE, ECF_CONST | ECF_NOTHROW, + vec_reverse, unary) + DEF_INTERNAL_OPTAB_FN (RSQRT, ECF_CONST, rsqrt, unary) /* Unary math functions. 
*/ Index: gcc/optabs.def =================================================================== --- gcc/optabs.def 2017-11-09 13:21:01.989917982 +0000 +++ gcc/optabs.def 2017-11-09 13:21:02.323463345 +0000 @@ -309,6 +309,11 @@ OPTAB_D (vec_perm_optab, "vec_perm$a") OPTAB_D (vec_realign_load_optab, "vec_realign_load_$a") OPTAB_D (vec_set_optab, "vec_set$a") OPTAB_D (vec_shr_optab, "vec_shr_$a") +OPTAB_D (vec_interleave_lo_optab, "vec_interleave_lo_$a") +OPTAB_D (vec_interleave_hi_optab, "vec_interleave_hi_$a") +OPTAB_D (vec_extract_even_optab, "vec_extract_even_$a") +OPTAB_D (vec_extract_odd_optab, "vec_extract_odd_$a") +OPTAB_D (vec_reverse_optab, "vec_reverse_$a") OPTAB_D (vec_unpacks_float_hi_optab, "vec_unpacks_float_hi_$a") OPTAB_D (vec_unpacks_float_lo_optab, "vec_unpacks_float_lo_$a") OPTAB_D (vec_unpacks_hi_optab, "vec_unpacks_hi_$a") Index: gcc/tree-vect-data-refs.c =================================================================== --- gcc/tree-vect-data-refs.c 2017-11-09 13:21:01.989917982 +0000 +++ gcc/tree-vect-data-refs.c 2017-11-09 13:21:02.326167766 +0000 @@ -52,6 +52,7 @@ Software Foundation; either version 3, o #include "params.h" #include "tree-cfg.h" #include "tree-hash-traits.h" +#include "internal-fn.h" /* Return true if load- or store-lanes optab OPTAB is implemented for COUNT vectors of type VECTYPE. NAME is the name of OPTAB. */ @@ -4636,7 +4637,16 @@ vect_grouped_store_supported (tree vecty return false; } - /* Check that the permutation is supported. */ + /* Powers of 2 use a tree of interleaving operations. See whether the + target supports them directly. */ + if (count != 3 + && direct_internal_fn_supported_p (IFN_VEC_INTERLEAVE_LO, vectype, + OPTIMIZE_FOR_SPEED) + && direct_internal_fn_supported_p (IFN_VEC_INTERLEAVE_HI, vectype, + OPTIMIZE_FOR_SPEED)) + return true; + + /* Otherwise check for support in the form of general permutations. */ unsigned int nelt; if (VECTOR_MODE_P (mode) && GET_MODE_NUNITS (mode).is_constant (&nelt)) { @@ -4881,50 +4891,78 @@ vect_permute_store_chain (vec<tree> dr_c /* If length is not equal to 3 then only power of 2 is supported. */ gcc_assert (pow2p_hwi (length)); - /* vect_grouped_store_supported ensures that this is constant. */ - unsigned int nelt = TYPE_VECTOR_SUBPARTS (vectype).to_constant (); - auto_vec_perm_indices sel (nelt); - sel.quick_grow (nelt); - for (i = 0, n = nelt / 2; i < n; i++) + if (direct_internal_fn_supported_p (IFN_VEC_INTERLEAVE_LO, vectype, + OPTIMIZE_FOR_SPEED) + && direct_internal_fn_supported_p (IFN_VEC_INTERLEAVE_HI, vectype, + OPTIMIZE_FOR_SPEED)) + { + /* We could support the case where only one of the optabs is + implemented, but that seems unlikely. */ + perm_mask_low = NULL_TREE; + perm_mask_high = NULL_TREE; + } + else { - sel[i * 2] = i; - sel[i * 2 + 1] = i + nelt; + /* vect_grouped_store_supported ensures that this is constant. 
*/ + unsigned int nelt = TYPE_VECTOR_SUBPARTS (vectype).to_constant (); + auto_vec_perm_indices sel (nelt); + sel.quick_grow (nelt); + for (i = 0, n = nelt / 2; i < n; i++) + { + sel[i * 2] = i; + sel[i * 2 + 1] = i + nelt; + } + perm_mask_low = vect_gen_perm_mask_checked (vectype, sel); + + for (i = 0; i < nelt; i++) + sel[i] += nelt / 2; + perm_mask_high = vect_gen_perm_mask_checked (vectype, sel); } - perm_mask_high = vect_gen_perm_mask_checked (vectype, sel); - for (i = 0; i < nelt; i++) - sel[i] += nelt / 2; - perm_mask_low = vect_gen_perm_mask_checked (vectype, sel); + for (i = 0, n = log_length; i < n; i++) + { + for (j = 0; j < length / 2; j++) + { + vect1 = dr_chain[j]; + vect2 = dr_chain[j + length / 2]; - for (i = 0, n = log_length; i < n; i++) - { - for (j = 0; j < length/2; j++) - { - vect1 = dr_chain[j]; - vect2 = dr_chain[j+length/2]; + /* Create interleaving stmt: + high = VEC_PERM_EXPR <vect1, vect2, + {0, nelt, 1, nelt + 1, ...}> */ + low = make_temp_ssa_name (vectype, NULL, "vect_inter_low"); + if (perm_mask_low) + perm_stmt = gimple_build_assign (low, VEC_PERM_EXPR, vect1, + vect2, perm_mask_low); + else + { + perm_stmt = gimple_build_call_internal + (IFN_VEC_INTERLEAVE_LO, 2, vect1, vect2); + gimple_set_lhs (perm_stmt, low); + } + vect_finish_stmt_generation (stmt, perm_stmt, gsi); + (*result_chain)[2 * j] = low; - /* Create interleaving stmt: - high = VEC_PERM_EXPR <vect1, vect2, {0, nelt, 1, nelt+1, - ...}> */ - high = make_temp_ssa_name (vectype, NULL, "vect_inter_high"); + /* Create interleaving stmt: + high = VEC_PERM_EXPR <vect1, vect2, + {nelt / 2, nelt * 3 / 2, + nelt / 2 + 1, nelt * 3 / 2 + 1, + ...}> */ + high = make_temp_ssa_name (vectype, NULL, "vect_inter_high"); + if (perm_mask_high) perm_stmt = gimple_build_assign (high, VEC_PERM_EXPR, vect1, vect2, perm_mask_high); - vect_finish_stmt_generation (stmt, perm_stmt, gsi); - (*result_chain)[2*j] = high; - - /* Create interleaving stmt: - low = VEC_PERM_EXPR <vect1, vect2, - {nelt/2, nelt*3/2, nelt/2+1, nelt*3/2+1, - ...}> */ - low = make_temp_ssa_name (vectype, NULL, "vect_inter_low"); - perm_stmt = gimple_build_assign (low, VEC_PERM_EXPR, vect1, - vect2, perm_mask_low); - vect_finish_stmt_generation (stmt, perm_stmt, gsi); - (*result_chain)[2*j+1] = low; - } - memcpy (dr_chain.address (), result_chain->address (), - length * sizeof (tree)); - } + else + { + perm_stmt = gimple_build_call_internal + (IFN_VEC_INTERLEAVE_HI, 2, vect1, vect2); + gimple_set_lhs (perm_stmt, high); + } + vect_finish_stmt_generation (stmt, perm_stmt, gsi); + (*result_chain)[2 * j + 1] = high; + } + memcpy (dr_chain.address (), result_chain->address (), + length * sizeof (tree)); + } } } @@ -5235,7 +5273,16 @@ vect_grouped_load_supported (tree vectyp return false; } - /* Check that the permutation is supported. */ + /* Powers of 2 use a tree of extract operations. See whether the + target supports them directly. */ + if (count != 3 + && direct_internal_fn_supported_p (IFN_VEC_EXTRACT_EVEN, vectype, + OPTIMIZE_FOR_SPEED) + && direct_internal_fn_supported_p (IFN_VEC_EXTRACT_ODD, vectype, + OPTIMIZE_FOR_SPEED)) + return true; + + /* Otherwise check for support in the form of general permutations. */ unsigned int nelt; if (VECTOR_MODE_P (mode) && GET_MODE_NUNITS (mode).is_constant (&nelt)) { @@ -5464,17 +5511,30 @@ vect_permute_load_chain (vec<tree> dr_ch /* If length is not equal to 3 then only power of 2 is supported. */ gcc_assert (pow2p_hwi (length)); - /* vect_grouped_load_supported ensures that this is constant. 
*/ - unsigned nelt = TYPE_VECTOR_SUBPARTS (vectype).to_constant (); - auto_vec_perm_indices sel (nelt); - sel.quick_grow (nelt); - for (i = 0; i < nelt; ++i) - sel[i] = i * 2; - perm_mask_even = vect_gen_perm_mask_checked (vectype, sel); - - for (i = 0; i < nelt; ++i) - sel[i] = i * 2 + 1; - perm_mask_odd = vect_gen_perm_mask_checked (vectype, sel); + if (direct_internal_fn_supported_p (IFN_VEC_EXTRACT_EVEN, vectype, + OPTIMIZE_FOR_SPEED) + && direct_internal_fn_supported_p (IFN_VEC_EXTRACT_ODD, vectype, + OPTIMIZE_FOR_SPEED)) + { + /* We could support the case where only one of the optabs is + implemented, but that seems unlikely. */ + perm_mask_even = NULL_TREE; + perm_mask_odd = NULL_TREE; + } + else + { + /* vect_grouped_load_supported ensures that this is constant. */ + unsigned nelt = TYPE_VECTOR_SUBPARTS (vectype).to_constant (); + auto_vec_perm_indices sel (nelt); + sel.quick_grow (nelt); + for (i = 0; i < nelt; ++i) + sel[i] = i * 2; + perm_mask_even = vect_gen_perm_mask_checked (vectype, sel); + + for (i = 0; i < nelt; ++i) + sel[i] = i * 2 + 1; + perm_mask_odd = vect_gen_perm_mask_checked (vectype, sel); + } for (i = 0; i < log_length; i++) { @@ -5485,19 +5545,33 @@ vect_permute_load_chain (vec<tree> dr_ch /* data_ref = permute_even (first_data_ref, second_data_ref); */ data_ref = make_temp_ssa_name (vectype, NULL, "vect_perm_even"); - perm_stmt = gimple_build_assign (data_ref, VEC_PERM_EXPR, - first_vect, second_vect, - perm_mask_even); + if (perm_mask_even) + perm_stmt = gimple_build_assign (data_ref, VEC_PERM_EXPR, + first_vect, second_vect, + perm_mask_even); + else + { + perm_stmt = gimple_build_call_internal + (IFN_VEC_EXTRACT_EVEN, 2, first_vect, second_vect); + gimple_set_lhs (perm_stmt, data_ref); + } vect_finish_stmt_generation (stmt, perm_stmt, gsi); - (*result_chain)[j/2] = data_ref; + (*result_chain)[j / 2] = data_ref; /* data_ref = permute_odd (first_data_ref, second_data_ref); */ data_ref = make_temp_ssa_name (vectype, NULL, "vect_perm_odd"); - perm_stmt = gimple_build_assign (data_ref, VEC_PERM_EXPR, - first_vect, second_vect, - perm_mask_odd); + if (perm_mask_odd) + perm_stmt = gimple_build_assign (data_ref, VEC_PERM_EXPR, + first_vect, second_vect, + perm_mask_odd); + else + { + perm_stmt = gimple_build_call_internal + (IFN_VEC_EXTRACT_ODD, 2, first_vect, second_vect); + gimple_set_lhs (perm_stmt, data_ref); + } vect_finish_stmt_generation (stmt, perm_stmt, gsi); - (*result_chain)[j/2+length/2] = data_ref; + (*result_chain)[j / 2 + length / 2] = data_ref; } memcpy (dr_chain.address (), result_chain->address (), length * sizeof (tree)); Index: gcc/tree-vect-stmts.c =================================================================== --- gcc/tree-vect-stmts.c 2017-11-09 13:21:01.989917982 +0000 +++ gcc/tree-vect-stmts.c 2017-11-09 13:21:02.327069240 +0000 @@ -1796,6 +1796,46 @@ perm_mask_for_reverse (tree vectype) return vect_gen_perm_mask_checked (vectype, sel); } +/* Return true if the target can reverse the elements in a vector of + type VECTOR_TYPE. */ + +static bool +can_reverse_vector_p (tree vector_type) +{ + return (direct_internal_fn_supported_p (IFN_VEC_REVERSE, vector_type, + OPTIMIZE_FOR_SPEED) + || perm_mask_for_reverse (vector_type)); +} + +/* Generate a statement to reverse the elements in vector INPUT and + return the SSA name that holds the result. GSI is a statement iterator + pointing to STMT, which is the scalar statement we're vectorizing. + VEC_DEST is the destination variable with which new SSA names + should be associated. 
*/ + +static tree +reverse_vector (tree vec_dest, tree input, gimple *stmt, + gimple_stmt_iterator *gsi) +{ + tree new_temp = make_ssa_name (vec_dest); + tree vector_type = TREE_TYPE (input); + gimple *perm_stmt; + if (direct_internal_fn_supported_p (IFN_VEC_REVERSE, vector_type, + OPTIMIZE_FOR_SPEED)) + { + perm_stmt = gimple_build_call_internal (IFN_VEC_REVERSE, 1, input); + gimple_set_lhs (perm_stmt, new_temp); + } + else + { + tree perm_mask = perm_mask_for_reverse (vector_type); + perm_stmt = gimple_build_assign (new_temp, VEC_PERM_EXPR, + input, input, perm_mask); + } + vect_finish_stmt_generation (stmt, perm_stmt, gsi); + return new_temp; +} + /* A subroutine of get_load_store_type, with a subset of the same arguments. Handle the case where STMT is part of a grouped load or store. @@ -1999,7 +2039,7 @@ get_negative_load_store_type (gimple *st return VMAT_CONTIGUOUS_DOWN; } - if (!perm_mask_for_reverse (vectype)) + if (!can_reverse_vector_p (vectype)) { if (dump_enabled_p ()) dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, @@ -6760,20 +6800,10 @@ vectorizable_store (gimple *stmt, gimple if (memory_access_type == VMAT_CONTIGUOUS_REVERSE) { - tree perm_mask = perm_mask_for_reverse (vectype); tree perm_dest = vect_create_destination_var (gimple_assign_rhs1 (stmt), vectype); - tree new_temp = make_ssa_name (perm_dest); - - /* Generate the permute statement. */ - gimple *perm_stmt - = gimple_build_assign (new_temp, VEC_PERM_EXPR, vec_oprnd, - vec_oprnd, perm_mask); - vect_finish_stmt_generation (stmt, perm_stmt, gsi); - - perm_stmt = SSA_NAME_DEF_STMT (new_temp); - vec_oprnd = new_temp; + vec_oprnd = reverse_vector (perm_dest, vec_oprnd, stmt, gsi); } /* Arguments are ready. Create the new vector stmt. */ @@ -7998,9 +8028,7 @@ vectorizable_load (gimple *stmt, gimple_ if (memory_access_type == VMAT_CONTIGUOUS_REVERSE) { - tree perm_mask = perm_mask_for_reverse (vectype); - new_temp = permute_vec_elements (new_temp, new_temp, - perm_mask, stmt, gsi); + new_temp = reverse_vector (vec_dest, new_temp, stmt, gsi); new_stmt = SSA_NAME_DEF_STMT (new_temp); } Index: gcc/config/aarch64/iterators.md =================================================================== --- gcc/config/aarch64/iterators.md 2017-11-09 13:21:01.989917982 +0000 +++ gcc/config/aarch64/iterators.md 2017-11-09 13:21:02.322561871 +0000 @@ -1556,6 +1556,11 @@ (define_int_attr pauth_hint_num_a [(UNSP (UNSPEC_PACI1716 "8") (UNSPEC_AUTI1716 "12")]) +(define_int_attr perm_optab [(UNSPEC_ZIP1 "vec_interleave_lo") + (UNSPEC_ZIP2 "vec_interleave_hi") + (UNSPEC_UZP1 "vec_extract_even") + (UNSPEC_UZP2 "vec_extract_odd")]) + (define_int_attr perm_insn [(UNSPEC_ZIP1 "zip") (UNSPEC_ZIP2 "zip") (UNSPEC_TRN1 "trn") (UNSPEC_TRN2 "trn") (UNSPEC_UZP1 "uzp") (UNSPEC_UZP2 "uzp")]) Index: gcc/config/aarch64/aarch64-sve.md =================================================================== --- gcc/config/aarch64/aarch64-sve.md 2017-11-09 13:21:01.989917982 +0000 +++ gcc/config/aarch64/aarch64-sve.md 2017-11-09 13:21:02.320758923 +0000 @@ -630,6 +630,19 @@ (define_expand "vec_perm<mode>" } ) +(define_expand "<perm_optab>_<mode>" + [(set (match_operand:SVE_ALL 0 "register_operand") + (unspec:SVE_ALL [(match_operand:SVE_ALL 1 "register_operand") + (match_operand:SVE_ALL 2 "register_operand")] + OPTAB_PERMUTE))] + "TARGET_SVE && !GET_MODE_NUNITS (<MODE>mode).is_constant ()") + +(define_expand "vec_reverse_<mode>" + [(set (match_operand:SVE_ALL 0 "register_operand") + (unspec:SVE_ALL [(match_operand:SVE_ALL 1 "register_operand")] + UNSPEC_REV))] 
+ "TARGET_SVE && !GET_MODE_NUNITS (<MODE>mode).is_constant ()") + (define_insn "*aarch64_sve_tbl<mode>" [(set (match_operand:SVE_ALL 0 "register_operand" "=w") (unspec:SVE_ALL Index: gcc/testsuite/gcc.dg/vect/no-vfa-vect-depend-2.c =================================================================== --- gcc/testsuite/gcc.dg/vect/no-vfa-vect-depend-2.c 2017-11-09 13:21:01.989917982 +0000 +++ gcc/testsuite/gcc.dg/vect/no-vfa-vect-depend-2.c 2017-11-09 13:21:02.323463345 +0000 @@ -51,7 +51,4 @@ int main (void) } /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" {xfail { vect_no_align && { ! vect_hw_misalign } } } } } */ -/* Requires reverse for variable-length SVE, which is implemented for - by a later patch. Until then we report it twice, once for SVE and - once for 128-bit Advanced SIMD. */ -/* { dg-final { scan-tree-dump-times "dependence distance negative" 1 "vect" { xfail { aarch64_sve && vect_variable_length } } } } */ +/* { dg-final { scan-tree-dump-times "dependence distance negative" 1 "vect" } } */ Index: gcc/testsuite/gcc.dg/vect/no-vfa-vect-depend-3.c =================================================================== --- gcc/testsuite/gcc.dg/vect/no-vfa-vect-depend-3.c 2017-11-09 13:21:01.989917982 +0000 +++ gcc/testsuite/gcc.dg/vect/no-vfa-vect-depend-3.c 2017-11-09 13:21:02.323463345 +0000 @@ -183,7 +183,4 @@ int main () } /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 4 "vect" {xfail { vect_no_align && { ! vect_hw_misalign } } } } } */ -/* f4 requires reverse for SVE, which is implemented by a later patch. - Until then we report it twice, once for SVE and once for 128-bit - Advanced SIMD. */ -/* { dg-final { scan-tree-dump-times "dependence distance negative" 4 "vect" { xfail { aarch64_sve && vect_variable_length } } } } */ +/* { dg-final { scan-tree-dump-times "dependence distance negative" 4 "vect" } } */ Index: gcc/testsuite/gcc.dg/vect/pr33953.c =================================================================== --- gcc/testsuite/gcc.dg/vect/pr33953.c 2017-11-09 13:21:01.989917982 +0000 +++ gcc/testsuite/gcc.dg/vect/pr33953.c 2017-11-09 13:21:02.323463345 +0000 @@ -29,6 +29,6 @@ void blockmove_NtoN_blend_noremap32 (con } /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { xfail { vect_no_align && { ! vect_hw_misalign } } } } } */ -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { xfail { vect_no_align && { ! vect_hw_misalign } } } } } */ +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { xfail { { vect_no_align && { ! 
vect_hw_misalign } } || vect_variable_length } } } } */ Index: gcc/testsuite/gcc.dg/vect/pr68445.c =================================================================== --- gcc/testsuite/gcc.dg/vect/pr68445.c 2017-11-09 13:21:01.989917982 +0000 +++ gcc/testsuite/gcc.dg/vect/pr68445.c 2017-11-09 13:21:02.323463345 +0000 @@ -16,4 +16,4 @@ void IMB_double_fast_x (int *destf, int } } -/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */ +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" { xfail vect_variable_length } } } */ Index: gcc/testsuite/gcc.dg/vect/slp-12a.c =================================================================== --- gcc/testsuite/gcc.dg/vect/slp-12a.c 2017-11-09 13:21:01.989917982 +0000 +++ gcc/testsuite/gcc.dg/vect/slp-12a.c 2017-11-09 13:21:02.323463345 +0000 @@ -75,5 +75,5 @@ int main (void) /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target { vect_strided8 && vect_int_mult } } } } */ /* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" { target { ! { vect_strided8 && vect_int_mult } } } } } */ -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { target { vect_strided8 && vect_int_mult } } } } */ +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { target { vect_strided8 && vect_int_mult } xfail vect_variable_length } } } */ /* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 0 "vect" { target { ! { vect_strided8 && vect_int_mult } } } } } */ Index: gcc/testsuite/gcc.dg/vect/slp-13-big-array.c =================================================================== --- gcc/testsuite/gcc.dg/vect/slp-13-big-array.c 2017-11-09 13:21:01.989917982 +0000 +++ gcc/testsuite/gcc.dg/vect/slp-13-big-array.c 2017-11-09 13:21:02.324364818 +0000 @@ -134,4 +134,4 @@ int main (void) /* { dg-final { scan-tree-dump-times "vectorized 2 loops" 1 "vect" { target { { vect_interleave && vect_extract_even_odd } && { ! vect_pack_trunc } } } } } */ /* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" { target { ! vect_pack_trunc } } } } */ /* { dg-final { scan-tree-dump-times "vectorized 3 loops" 1 "vect" { target { { vect_interleave && vect_extract_even_odd } && vect_pack_trunc } } } } */ -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 3 "vect" { target vect_pack_trunc } } } */ +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 3 "vect" { target vect_pack_trunc xfail vect_variable_length } } } */ Index: gcc/testsuite/gcc.dg/vect/slp-13.c =================================================================== --- gcc/testsuite/gcc.dg/vect/slp-13.c 2017-11-09 13:21:01.989917982 +0000 +++ gcc/testsuite/gcc.dg/vect/slp-13.c 2017-11-09 13:21:02.324364818 +0000 @@ -128,4 +128,4 @@ int main (void) /* { dg-final { scan-tree-dump-times "vectorized 2 loops" 1 "vect" { target { { vect_interleave && vect_extract_even_odd } && { ! vect_pack_trunc } } } } } */ /* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" { target { ! 
vect_pack_trunc } } } } */ /* { dg-final { scan-tree-dump-times "vectorized 3 loops" 1 "vect" { target { { vect_interleave && vect_extract_even_odd } && vect_pack_trunc } } } } */ -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 3 "vect" { target vect_pack_trunc } } } */ +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 3 "vect" { target vect_pack_trunc xfail vect_variable_length } } } */ Index: gcc/testsuite/gcc.dg/vect/slp-14.c =================================================================== --- gcc/testsuite/gcc.dg/vect/slp-14.c 2017-11-09 13:21:01.989917982 +0000 +++ gcc/testsuite/gcc.dg/vect/slp-14.c 2017-11-09 13:21:02.324364818 +0000 @@ -111,5 +111,5 @@ int main (void) } /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target vect_int_mult } } } */ -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" { target vect_int_mult } } } */ +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" { target vect_int_mult xfail vect_variable_length } } } */ Index: gcc/testsuite/gcc.dg/vect/slp-15.c =================================================================== --- gcc/testsuite/gcc.dg/vect/slp-15.c 2017-11-09 13:21:01.989917982 +0000 +++ gcc/testsuite/gcc.dg/vect/slp-15.c 2017-11-09 13:21:02.324364818 +0000 @@ -112,6 +112,6 @@ int main (void) /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" {target vect_int_mult } } } */ /* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" {target { ! { vect_int_mult } } } } } */ -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" {target vect_int_mult } } } */ +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" { target vect_int_mult xfail vect_variable_length } } } */ /* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 0 "vect" {target { ! 
{ vect_int_mult } } } } } */ Index: gcc/testsuite/gcc.dg/vect/slp-42.c =================================================================== --- gcc/testsuite/gcc.dg/vect/slp-42.c 2017-11-09 13:21:01.989917982 +0000 +++ gcc/testsuite/gcc.dg/vect/slp-42.c 2017-11-09 13:21:02.324364818 +0000 @@ -15,5 +15,5 @@ void foo (int n) } } -/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */ +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" { xfail vect_variable_length } } } */ /* { dg-final { scan-tree-dump "vectorized 1 loops" "vect" } } */ Index: gcc/testsuite/gcc.dg/vect/slp-multitypes-2.c =================================================================== --- gcc/testsuite/gcc.dg/vect/slp-multitypes-2.c 2017-11-09 13:21:01.989917982 +0000 +++ gcc/testsuite/gcc.dg/vect/slp-multitypes-2.c 2017-11-09 13:21:02.324364818 +0000 @@ -77,5 +77,5 @@ int main (void) } /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */ -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" } } */ +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" { xfail vect_variable_length } } } */ Index: gcc/testsuite/gcc.dg/vect/slp-multitypes-4.c =================================================================== --- gcc/testsuite/gcc.dg/vect/slp-multitypes-4.c 2017-11-09 13:21:01.989917982 +0000 +++ gcc/testsuite/gcc.dg/vect/slp-multitypes-4.c 2017-11-09 13:21:02.324364818 +0000 @@ -52,5 +52,5 @@ int main (void) } /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target vect_unpack } } } */ -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { target vect_unpack } } } */ +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { target vect_unpack xfail vect_variable_length } } } */ Index: gcc/testsuite/gcc.dg/vect/slp-multitypes-5.c =================================================================== --- gcc/testsuite/gcc.dg/vect/slp-multitypes-5.c 2017-11-09 13:21:01.989917982 +0000 +++ gcc/testsuite/gcc.dg/vect/slp-multitypes-5.c 2017-11-09 13:21:02.324364818 +0000 @@ -52,5 +52,5 @@ int main (void) } /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target vect_pack_trunc } } } */ -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { target vect_pack_trunc } } } */ +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { target vect_pack_trunc xfail vect_variable_length } } } */ Index: gcc/testsuite/gcc.dg/vect/slp-reduc-4.c =================================================================== --- gcc/testsuite/gcc.dg/vect/slp-reduc-4.c 2017-11-09 13:21:01.989917982 +0000 +++ gcc/testsuite/gcc.dg/vect/slp-reduc-4.c 2017-11-09 13:21:02.325266292 +0000 @@ -57,5 +57,5 @@ int main (void) } /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { xfail vect_no_int_min_max } } } */ -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { xfail vect_no_int_min_max } } } */ +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { xfail { vect_no_int_min_max || vect_variable_length } } } } */ Index: gcc/testsuite/gcc.dg/vect/slp-reduc-7.c =================================================================== --- gcc/testsuite/gcc.dg/vect/slp-reduc-7.c 2017-11-09 13:21:01.989917982 +0000 +++ gcc/testsuite/gcc.dg/vect/slp-reduc-7.c 2017-11-09 13:21:02.325266292 +0000 @@ -55,5 +55,5 @@ int main (void) } /* { dg-final { scan-tree-dump-times "vectorized 1 
loops" 1 "vect" { xfail vect_no_int_add } } } */ -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { xfail vect_no_int_add } } } */ +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { xfail { vect_no_int_add || vect_variable_length } } } } */ Index: gcc/testsuite/gcc.target/aarch64/sve_vec_perm_2.c =================================================================== --- /dev/null 2017-11-09 12:47:20.377612760 +0000 +++ gcc/testsuite/gcc.target/aarch64/sve_vec_perm_2.c 2017-11-09 13:21:02.325266292 +0000 @@ -0,0 +1,31 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve" } */ + +#include <stdint.h> + +#define VEC_PERM(TYPE) \ +TYPE __attribute__ ((noinline, noclone)) \ +vec_reverse_##TYPE (TYPE *restrict a, TYPE *restrict b, int n) \ +{ \ + for (int i = 0; i < n; ++i) \ + a[i] = b[n - i - 1]; \ +} + +#define TEST_ALL(T) \ + T (int8_t) \ + T (uint8_t) \ + T (int16_t) \ + T (uint16_t) \ + T (int32_t) \ + T (uint32_t) \ + T (int64_t) \ + T (uint64_t) \ + T (float) \ + T (double) + +TEST_ALL (VEC_PERM) + +/* { dg-final { scan-assembler-times {\trev\tz[0-9]+\.b, z[0-9]+\.b\n} 2 } } */ +/* { dg-final { scan-assembler-times {\trev\tz[0-9]+\.h, z[0-9]+\.h\n} 2 } } */ +/* { dg-final { scan-assembler-times {\trev\tz[0-9]+\.s, z[0-9]+\.s\n} 3 } } */ +/* { dg-final { scan-assembler-times {\trev\tz[0-9]+\.d, z[0-9]+\.d\n} 3 } } */ Index: gcc/testsuite/gcc.target/aarch64/sve_vec_perm_2_run.c =================================================================== --- /dev/null 2017-11-09 12:47:20.377612760 +0000 +++ gcc/testsuite/gcc.target/aarch64/sve_vec_perm_2_run.c 2017-11-09 13:21:02.325266292 +0000 @@ -0,0 +1,29 @@ +/* { dg-do run { target aarch64_sve_hw } } */ +/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve" } */ + +#include "sve_vec_perm_2.c" + +#define N 153 + +#define HARNESS(TYPE) \ + { \ + TYPE a[N], b[N]; \ + for (unsigned int i = 0; i < N; ++i) \ + { \ + b[i] = i * 2 + i % 5; \ + asm volatile ("" ::: "memory"); \ + } \ + vec_reverse_##TYPE (a, b, N); \ + for (unsigned int i = 0; i < N; ++i) \ + { \ + TYPE expected = (N - i - 1) * 2 + (N - i - 1) % 5; \ + if (a[i] != expected) \ + __builtin_abort (); \ + } \ + } + +int __attribute__ ((optimize (1))) +main (void) +{ + TEST_ALL (HARNESS) +} Index: gcc/testsuite/gcc.target/aarch64/sve_vec_perm_3.c =================================================================== --- /dev/null 2017-11-09 12:47:20.377612760 +0000 +++ gcc/testsuite/gcc.target/aarch64/sve_vec_perm_3.c 2017-11-09 13:21:02.325266292 +0000 @@ -0,0 +1,46 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve -msve-vector-bits=scalable" } */ + +#include <stdint.h> + +#define VEC_PERM(TYPE) \ +TYPE __attribute__ ((noinline, noclone)) \ +vec_zip_##TYPE (TYPE *restrict a, TYPE *restrict b, \ + TYPE *restrict c, long n) \ +{ \ + for (long i = 0; i < n; ++i) \ + { \ + a[i * 8] = c[i * 4]; \ + a[i * 8 + 1] = b[i * 4]; \ + a[i * 8 + 2] = c[i * 4 + 1]; \ + a[i * 8 + 3] = b[i * 4 + 1]; \ + a[i * 8 + 4] = c[i * 4 + 2]; \ + a[i * 8 + 5] = b[i * 4 + 2]; \ + a[i * 8 + 6] = c[i * 4 + 3]; \ + a[i * 8 + 7] = b[i * 4 + 3]; \ + } \ +} + +#define TEST_ALL(T) \ + T (int8_t) \ + T (uint8_t) \ + T (int16_t) \ + T (uint16_t) \ + T (int32_t) \ + T (uint32_t) \ + T (int64_t) \ + T (uint64_t) \ + T (float) \ + T (double) + +TEST_ALL (VEC_PERM) + +/* { dg-final { scan-assembler-times {\tzip1\tz[0-9]+\.b, z[0-9]+\.b, z[0-9]+\.b\n} 24 } } */ +/* { dg-final { scan-assembler-times 
{\tzip2\tz[0-9]+\.b, z[0-9]+\.b, z[0-9]+\.b\n} 24 } } */ +/* { dg-final { scan-assembler-times {\tzip1\tz[0-9]+\.h, z[0-9]+\.h, z[0-9]+\.h\n} 24 } } */ +/* { dg-final { scan-assembler-times {\tzip2\tz[0-9]+\.h, z[0-9]+\.h, z[0-9]+\.h\n} 24 } } */ +/* Currently we can't use SLP for groups bigger than 128 bits. */ +/* { dg-final { scan-assembler-times {\tzip1\tz[0-9]+\.s, z[0-9]+\.s, z[0-9]+\.s\n} 36 { xfail *-*-* } } } */ +/* { dg-final { scan-assembler-times {\tzip2\tz[0-9]+\.s, z[0-9]+\.s, z[0-9]+\.s\n} 36 { xfail *-*-* } } } */ +/* { dg-final { scan-assembler-times {\tzip1\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 36 { xfail *-*-* } } } */ +/* { dg-final { scan-assembler-times {\tzip2\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 36 { xfail *-*-* } } } */ Index: gcc/testsuite/gcc.target/aarch64/sve_vec_perm_3_run.c =================================================================== --- /dev/null 2017-11-09 12:47:20.377612760 +0000 +++ gcc/testsuite/gcc.target/aarch64/sve_vec_perm_3_run.c 2017-11-09 13:21:02.325266292 +0000 @@ -0,0 +1,31 @@ +/* { dg-do run { target aarch64_sve_hw } } */ +/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve" } */ + +#include "sve_vec_perm_3.c" + +#define N (43 * 8) + +#define HARNESS(TYPE) \ + { \ + TYPE a[N], b[N], c[N]; \ + for (unsigned int i = 0; i < N; ++i) \ + { \ + b[i] = i * 2 + i % 5; \ + c[i] = i * 3; \ + asm volatile ("" ::: "memory"); \ + } \ + vec_zip_##TYPE (a, b, c, N / 8); \ + for (unsigned int i = 0; i < N / 2; ++i) \ + { \ + TYPE expected1 = i * 3; \ + TYPE expected2 = i * 2 + i % 5; \ + if (a[i * 2] != expected1 || a[i * 2 + 1] != expected2) \ + __builtin_abort (); \ + } \ + } + +int __attribute__ ((optimize (1))) +main (void) +{ + TEST_ALL (HARNESS) +} Index: gcc/testsuite/gcc.target/aarch64/sve_vec_perm_4.c =================================================================== --- /dev/null 2017-11-09 12:47:20.377612760 +0000 +++ gcc/testsuite/gcc.target/aarch64/sve_vec_perm_4.c 2017-11-09 13:21:02.325266292 +0000 @@ -0,0 +1,52 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve -msve-vector-bits=scalable" } */ + +#include <stdint.h> + +#define VEC_PERM(TYPE) \ +TYPE __attribute__ ((noinline, noclone)) \ +vec_uzp_##TYPE (TYPE *restrict a, TYPE *restrict b, \ + TYPE *restrict c, long n) \ +{ \ + for (long i = 0; i < n; ++i) \ + { \ + a[i * 4] = c[i * 8]; \ + b[i * 4] = c[i * 8 + 1]; \ + a[i * 4 + 1] = c[i * 8 + 2]; \ + b[i * 4 + 1] = c[i * 8 + 3]; \ + a[i * 4 + 2] = c[i * 8 + 4]; \ + b[i * 4 + 2] = c[i * 8 + 5]; \ + a[i * 4 + 3] = c[i * 8 + 6]; \ + b[i * 4 + 3] = c[i * 8 + 7]; \ + } \ +} + +#define TEST_ALL(T) \ + T (int8_t) \ + T (uint8_t) \ + T (int16_t) \ + T (uint16_t) \ + T (int32_t) \ + T (uint32_t) \ + T (int64_t) \ + T (uint64_t) \ + T (float) \ + T (double) + +TEST_ALL (VEC_PERM) + +/* We could use a single uzp1 and uzp2 per function by implementing + SLP load permutation for variable width. XFAIL until then. 
*/ +/* { dg-final { scan-assembler-times {\tuzp1\tz[0-9]+\.b, z[0-9]+\.b, z[0-9]+\.b\n} 2 { xfail *-*-* } } } */ +/* { dg-final { scan-assembler-times {\tuzp2\tz[0-9]+\.b, z[0-9]+\.b, z[0-9]+\.b\n} 2 { xfail *-*-* } } } */ +/* { dg-final { scan-assembler-times {\tuzp1\tz[0-9]+\.h, z[0-9]+\.h, z[0-9]+\.h\n} 2 { xfail *-*-* } } } */ +/* { dg-final { scan-assembler-times {\tuzp2\tz[0-9]+\.h, z[0-9]+\.h, z[0-9]+\.h\n} 2 { xfail *-*-* } } } */ +/* { dg-final { scan-assembler-times {\tuzp1\tz[0-9]+\.s, z[0-9]+\.s, z[0-9]+\.s\n} 3 { xfail *-*-* } } } */ +/* { dg-final { scan-assembler-times {\tuzp2\tz[0-9]+\.s, z[0-9]+\.s, z[0-9]+\.s\n} 3 { xfail *-*-* } } } */ +/* { dg-final { scan-assembler-times {\tuzp1\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 3 { xfail *-*-* } } } */ +/* { dg-final { scan-assembler-times {\tuzp2\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 3 { xfail *-*-* } } } */ +/* Delete these if the tests above start passing instead. */ +/* { dg-final { scan-assembler-times {\tuzp1\tz[0-9]+\.b, z[0-9]+\.b, z[0-9]+\.b\n} 24 } } */ +/* { dg-final { scan-assembler-times {\tuzp2\tz[0-9]+\.b, z[0-9]+\.b, z[0-9]+\.b\n} 24 } } */ +/* { dg-final { scan-assembler-times {\tuzp1\tz[0-9]+\.h, z[0-9]+\.h, z[0-9]+\.h\n} 24 } } */ +/* { dg-final { scan-assembler-times {\tuzp2\tz[0-9]+\.h, z[0-9]+\.h, z[0-9]+\.h\n} 24 } } */ Index: gcc/testsuite/gcc.target/aarch64/sve_vec_perm_4_run.c =================================================================== --- /dev/null 2017-11-09 12:47:20.377612760 +0000 +++ gcc/testsuite/gcc.target/aarch64/sve_vec_perm_4_run.c 2017-11-09 13:21:02.325266292 +0000 @@ -0,0 +1,29 @@ +/* { dg-do run { target aarch64_sve_hw } } */ +/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve" } */ + +#include "sve_vec_perm_4.c" + +#define N (43 * 8) + +#define HARNESS(TYPE) \ + { \ + TYPE a[N], b[N], c[N]; \ + for (unsigned int i = 0; i < N; ++i) \ + { \ + c[i] = i * 2 + i % 5; \ + asm volatile ("" ::: "memory"); \ + } \ + vec_uzp_##TYPE (a, b, c, N / 8); \ + for (unsigned int i = 0; i < N; ++i) \ + { \ + TYPE expected = i * 2 + i % 5; \ + if ((i & 1 ? b[i / 2] : a[i / 2]) != expected) \ + __builtin_abort (); \ + } \ + } + +int __attribute__ ((optimize (1))) +main (void) +{ + TEST_ALL (HARNESS) +}