Add missing cases to vect_get_smallest_scalar_type (PR 85286)

Message ID 877epg6zlq.fsf@linaro.org
State New
Series
  • Add missing cases to vect_get_smallest_scalar_type (PR 85286)

Commit Message

Richard Sandiford April 9, 2018, 5:47 p.m.
In this PR we used WIDEN_SUM_EXPR to vectorise:

  short i, y;
  int sum;
  [...]
  for (i = x; i > 0; i--)
    sum += y;

with 4 ints and 8 shorts per vector.  The problem was that we set
the VF based only on the ints, then calculated the number of vector
copies based on the shorts, giving 4/8.  Previously that led to
ncopies==0, but after r249897 we pick it up as an ICE.

In this particular case we could vectorise the reduction by setting
ncopies based on the output type rather than the input type, but it
doesn't seem worth adding a special "optimisation" for such a
pathological case.  I think it's really an instance of the more general
problem that we can't vectorise using combinations of (say) 64-bit and
128-bit vectors on targets that support both.

Tested on aarch64-linux-gnu and x86_64-linux-gnu.  OK to install?

Richard


2018-04-09  Richard Sandiford  <richard.sandiford@linaro.org>

gcc/
	PR tree-optimization/85286
	* tree-vect-data-refs.c (vect_get_smallest_scalar_type): Handle
	DOT_PROD_EXPR and WIDEN_SUM_EXPR.

gcc/testsuite/
	* gcc.dg/vect/pr85286.c: New test.

Comments

Jakub Jelinek April 9, 2018, 6:01 p.m. | #1
On Mon, Apr 09, 2018 at 06:47:45PM +0100, Richard Sandiford wrote:
> In this PR we used WIDEN_SUM_EXPR to vectorise:
>
>   short i, y;
>   int sum;
>   [...]
>   for (i = x; i > 0; i--)
>     sum += y;
>
> with 4 ints and 8 shorts per vector.  The problem was that we set
> the VF based only on the ints, then calculated the number of vector
> copies based on the shorts, giving 4/8.  Previously that led to
> ncopies==0, but after r249897 we pick it up as an ICE.
>
> In this particular case we could vectorise the reduction by setting
> ncopies based on the output type rather than the input type, but it
> doesn't seem worth adding a special "optimisation" for such a
> pathological case.  I think it's really an instance of the more general
> problem that we can't vectorise using combinations of (say) 64-bit and
> 128-bit vectors on targets that support both.

We badly need that, there are plenty of PRs where we generate really large
vectorized loops because of it, e.g. on x86 where we can easily use 128-bit,
256-bit and 512-bit vectors; but I'm afraid it is not stage4 material.

	Jakub
Richard Biener April 10, 2018, 6:50 a.m. | #2
On Mon, Apr 9, 2018 at 7:47 PM, Richard Sandiford
<richard.sandiford@linaro.org> wrote:
> In this PR we used WIDEN_SUM_EXPR to vectorise:
>
>   short i, y;
>   int sum;
>   [...]
>   for (i = x; i > 0; i--)
>     sum += y;
>
> with 4 ints and 8 shorts per vector.  The problem was that we set
> the VF based only on the ints, then calculated the number of vector
> copies based on the shorts, giving 4/8.  Previously that led to
> ncopies==0, but after r249897 we pick it up as an ICE.
>
> In this particular case we could vectorise the reduction by setting
> ncopies based on the output type rather than the input type, but it
> doesn't seem worth adding a special "optimisation" for such a
> pathological case.  I think it's really an instance of the more general
> problem that we can't vectorise using combinations of (say) 64-bit and
> 128-bit vectors on targets that support both.
>
> Tested on aarch64-linux-gnu and x86_64-linux-gnu.  OK to install?

OK.

Richard.

Richard Sandiford April 10, 2018, 10:40 a.m. | #3
Jakub Jelinek <jakub@redhat.com> writes:
> On Mon, Apr 09, 2018 at 06:47:45PM +0100, Richard Sandiford wrote:
>> In this PR we used WIDEN_SUM_EXPR to vectorise:
>>
>>   short i, y;
>>   int sum;
>>   [...]
>>   for (i = x; i > 0; i--)
>>     sum += y;
>>
>> with 4 ints and 8 shorts per vector.  The problem was that we set
>> the VF based only on the ints, then calculated the number of vector
>> copies based on the shorts, giving 4/8.  Previously that led to
>> ncopies==0, but after r249897 we pick it up as an ICE.
>>
>> In this particular case we could vectorise the reduction by setting
>> ncopies based on the output type rather than the input type, but it
>> doesn't seem worth adding a special "optimisation" for such a
>> pathological case.  I think it's really an instance of the more general
>> problem that we can't vectorise using combinations of (say) 64-bit and
>> 128-bit vectors on targets that support both.
>
> We badly need that, there are plenty of PRs where we generate really large
> vectorized loops because of it, e.g. on x86 where we can easily use 128-bit,
> 256-bit and 512-bit vectors; but I'm afraid it is not stage4 material.

Yeah.  We also need it on AArch64 for a proper implementation of simd
clones for Advanced SIMD.

I think it's related to one of the most important missed optimisations
for SVE: when using mixed data sizes, it's usually better to store the
smaller data unpacked in wider lanes, and there's direct support for
loading and storing it that way.  In both the SVE and non-SVE cases,
we want the VF sometimes to be based on wider sizes rather than the
narrowest one.

FWIW, I have some patches queued for GCC 9 that should make it
easier to implement this (but no promises).  They're also supposed
to make it possible to compare the costs of different implementations
side-by-side, rather than always picking the first one that has
a lower cost than the scalar code.

Thanks,
Richard
Richard Biener April 10, 2018, 12:58 p.m. | #4
On Tue, Apr 10, 2018 at 12:40 PM, Richard Sandiford
<richard.sandiford@linaro.org> wrote:
> Jakub Jelinek <jakub@redhat.com> writes:
>> On Mon, Apr 09, 2018 at 06:47:45PM +0100, Richard Sandiford wrote:
>>> In this PR we used WIDEN_SUM_EXPR to vectorise:
>>>
>>>   short i, y;
>>>   int sum;
>>>   [...]
>>>   for (i = x; i > 0; i--)
>>>     sum += y;
>>>
>>> with 4 ints and 8 shorts per vector.  The problem was that we set
>>> the VF based only on the ints, then calculated the number of vector
>>> copies based on the shorts, giving 4/8.  Previously that led to
>>> ncopies==0, but after r249897 we pick it up as an ICE.
>>>
>>> In this particular case we could vectorise the reduction by setting
>>> ncopies based on the output type rather than the input type, but it
>>> doesn't seem worth adding a special "optimisation" for such a
>>> pathological case.  I think it's really an instance of the more general
>>> problem that we can't vectorise using combinations of (say) 64-bit and
>>> 128-bit vectors on targets that support both.
>>
>> We badly need that, there are plenty of PRs where we generate really large
>> vectorized loops because of it, e.g. on x86 where we can easily use 128-bit,
>> 256-bit and 512-bit vectors; but I'm afraid it is not stage4 material.
>
> Yeah.  We also need it on AArch64 for a proper implementation of simd
> clones for Advanced SIMD.
>
> I think it's related to one of the most important missed optimisations
> for SVE: when using mixed data sizes, it's usually better to store the
> smaller data unpacked in wider lanes, and there's direct support for
> loading and storing it that way.  In both the SVE and non-SVE cases,
> we want the VF sometimes to be based on wider sizes rather than the
> narrowest one.

It's unfortunately not very easy to remove the limitation in full, and in
general it widens the space we need to search for the best vectorization
even further...

> FWIW, I have some patches queued for GCC 9 that should make it
> easier to implement this (but no promises).  They're also supposed
> to make it possible to compare the costs of different implementations
> side-by-side, rather than always picking the first one that has
> a lower cost than the scalar code.

I have also a similar patch in the works.

Richard.

> Thanks,
> Richard

Patch

Index: gcc/tree-vect-data-refs.c
===================================================================
--- gcc/tree-vect-data-refs.c	2018-03-24 10:52:25.616823316 +0000
+++ gcc/tree-vect-data-refs.c	2018-04-09 18:44:09.676561821 +0100
@@ -132,6 +132,8 @@  vect_get_smallest_scalar_type (gimple *s
 
   if (is_gimple_assign (stmt)
       && (gimple_assign_cast_p (stmt)
+          || gimple_assign_rhs_code (stmt) == DOT_PROD_EXPR
+          || gimple_assign_rhs_code (stmt) == WIDEN_SUM_EXPR
           || gimple_assign_rhs_code (stmt) == WIDEN_MULT_EXPR
           || gimple_assign_rhs_code (stmt) == WIDEN_LSHIFT_EXPR
           || gimple_assign_rhs_code (stmt) == FLOAT_EXPR))
Index: gcc/testsuite/gcc.dg/vect/pr85286.c
===================================================================
--- /dev/null	2018-04-08 19:55:28.217132277 +0100
+++ gcc/testsuite/gcc.dg/vect/pr85286.c	2018-04-09 18:44:09.675561881 +0100
@@ -0,0 +1,19 @@ 
+/* PR tree-optimization/85286 */
+/* { dg-do compile } */
+/* { dg-additional-options "--param scev-max-expr-complexity=0" } */
+
+int
+foo (short x)
+{
+  short i, y;
+  int sum;
+
+  for (i = 0; i < x; i++)
+    y = x * i;
+
+  for (i = x; i > 0; i--)
+    sum += y;
+
+  return sum;
+}
+