diff mbox series

Prefer open-coding vector integer division

Message ID 87h8mxea5s.fsf@linaro.org
State New
Headers show
Series Prefer open-coding vector integer division | expand

Commit Message

Richard Sandiford May 24, 2018, 8:21 a.m. UTC
vect_recog_divmod_pattern currently bails out if the target has
native support for integer division, but I think in practice
it's always going to be better to open-code it anyway, just as
we usually open-code scalar divisions by constants.

I think the only currently affected target is MIPS MSA, where for:

  void
  foo (int *x)
  {
    for (int i = 0; i < 100; ++i)
      x[i] /= 2;
  }

we previously preferred to use division for powers of 2:

        .set    noreorder
        bnz.w   $w1,1f
        div_s.w $w0,$w0,$w1
        break   7
        .set    reorder
1:

(or just the div_s.w for -mno-check-zero-division), but after the patch
we open-code them using shifts:

        clt_s.w $w1,$w0,$w2
        subv.w  $w0,$w0,$w1
        srai.w  $w0,$w0,1

I assume that's better.  Matthew, is that right?

MSA doesn't define a high-part pattern, so it still uses a division
instruction for the non-power-of-2 case.

Richard B pointed out that this would disable SLP of division by
different amounts, but I think in practice that's a price worth paying,
since the current cost model can't really tell whether using a general
vector division is better than using open-coded scalar divisions.
The fix would be either to support SLP of mixed open-coded divisions
or to improve the cost model and try SLP again without the patterns.
The patch adds an XFAILed test for this.

Tested on aarch64-linux-gnu (with and without SVE), aarch64_be-elf
and x86_64-linux-gnu.  OK to install?

Richard


2018-05-24  Richard Sandiford  <richard.sandiford@linaro.org>

gcc/
	* tree-vect-patterns.c (vect_recog_divmod_pattern): Remove check
	for division support.

gcc/testsuite/
	* gcc.dg/vect/bb-slp-div-1.c: New XFAILed test.

Comments

Jakub Jelinek May 24, 2018, 8:31 a.m. UTC | #1
On Thu, May 24, 2018 at 09:21:35AM +0100, Richard Sandiford wrote:
> vect_recog_divmod_pattern currently bails out if the target has

> native support for integer division, but I think in practice

> it's always going to be better to open-code it anyway, just as

> we usually open-code scalar divisions by constants.

> 

> I think the only currently affected target is MIPS MSA, where for:


Isn't powerpcspe affected too?  It has a divv2si3 pattern.

> --- gcc/tree-vect-patterns.c	2018-05-16 12:48:59.115202362 +0100

> +++ gcc/tree-vect-patterns.c	2018-05-24 09:18:10.445466941 +0100

> @@ -2639,7 +2639,6 @@ vect_recog_divmod_pattern (vec<gimple *>

>    enum tree_code rhs_code;

>    stmt_vec_info stmt_vinfo = vinfo_for_stmt (last_stmt);

>    vec_info *vinfo = stmt_vinfo->vinfo;

> -  optab optab;

>    tree q;

>    int dummy_int, prec;

>    stmt_vec_info def_stmt_vinfo;

> @@ -2674,17 +2673,6 @@ vect_recog_divmod_pattern (vec<gimple *>

>    if (vectype == NULL_TREE)

>      return NULL;

>  

> -  /* If the target can handle vectorized division or modulo natively,

> -     don't attempt to optimize this.  */

> -  optab = optab_for_tree_code (rhs_code, vectype, optab_default);

> -  if (optab != unknown_optab)

> -    {

> -      machine_mode vec_mode = TYPE_MODE (vectype);

> -      int icode = (int) optab_handler (optab, vec_mode);

> -      if (icode != CODE_FOR_nothing)

> -	return NULL;

> -    }

> -


Shouldn't we instead keep it, but only do it if
optimize_bb_for_size_p (gimple_bb (last_stmt)) ?
I mean, a hw division is very likely shorter than what we replace it with...

	Jakub
Richard Biener May 24, 2018, 10:05 a.m. UTC | #2
On Thu, May 24, 2018 at 10:31 AM Jakub Jelinek <jakub@redhat.com> wrote:

> On Thu, May 24, 2018 at 09:21:35AM +0100, Richard Sandiford wrote:

> > vect_recog_divmod_pattern currently bails out if the target has

> > native support for integer division, but I think in practice

> > it's always going to be better to open-code it anyway, just as

> > we usually open-code scalar divisions by constants.

> >

> > I think the only currently affected target is MIPS MSA, where for:


> Isn't powerpcspe affected too?  It has a divv2si3 pattern.


> > --- gcc/tree-vect-patterns.c  2018-05-16 12:48:59.115202362 +0100

> > +++ gcc/tree-vect-patterns.c  2018-05-24 09:18:10.445466941 +0100

> > @@ -2639,7 +2639,6 @@ vect_recog_divmod_pattern (vec<gimple *>

> >    enum tree_code rhs_code;

> >    stmt_vec_info stmt_vinfo = vinfo_for_stmt (last_stmt);

> >    vec_info *vinfo = stmt_vinfo->vinfo;

> > -  optab optab;

> >    tree q;

> >    int dummy_int, prec;

> >    stmt_vec_info def_stmt_vinfo;

> > @@ -2674,17 +2673,6 @@ vect_recog_divmod_pattern (vec<gimple *>

> >    if (vectype == NULL_TREE)

> >      return NULL;

> >

> > -  /* If the target can handle vectorized division or modulo natively,

> > -     don't attempt to optimize this.  */

> > -  optab = optab_for_tree_code (rhs_code, vectype, optab_default);

> > -  if (optab != unknown_optab)

> > -    {

> > -      machine_mode vec_mode = TYPE_MODE (vectype);

> > -      int icode = (int) optab_handler (optab, vec_mode);

> > -      if (icode != CODE_FOR_nothing)

> > -     return NULL;

> > -    }

> > -


> Shouldn't we instead keep it, but only do it if

> optimize_bb_for_size_p (gimple_bb (last_stmt)) ?

> I mean, a hw division is very likely shorter than what we replace it

with...

Good idea.  While loops are checked for optimize_loop_nest_for_speed if
not forced vect BB vectorization has no such predicate.

Richard.

>          Jakub
Richard Sandiford May 24, 2018, 12:52 p.m. UTC | #3
Richard Biener <richard.guenther@gmail.com> writes:
> On Thu, May 24, 2018 at 10:31 AM Jakub Jelinek <jakub@redhat.com> wrote:

>> On Thu, May 24, 2018 at 09:21:35AM +0100, Richard Sandiford wrote:

>> > vect_recog_divmod_pattern currently bails out if the target has

>> > native support for integer division, but I think in practice

>> > it's always going to be better to open-code it anyway, just as

>> > we usually open-code scalar divisions by constants.

>> >

>> > I think the only currently affected target is MIPS MSA, where for:

>

>> Isn't powerpcspe affected too?  It has a divv2si3 pattern.


Ah, yeah.  I forgot to update the description after you mentioned
that on IRC, sorry.

>> > --- gcc/tree-vect-patterns.c  2018-05-16 12:48:59.115202362 +0100

>> > +++ gcc/tree-vect-patterns.c  2018-05-24 09:18:10.445466941 +0100

>> > @@ -2639,7 +2639,6 @@ vect_recog_divmod_pattern (vec<gimple *>

>> >    enum tree_code rhs_code;

>> >    stmt_vec_info stmt_vinfo = vinfo_for_stmt (last_stmt);

>> >    vec_info *vinfo = stmt_vinfo->vinfo;

>> > -  optab optab;

>> >    tree q;

>> >    int dummy_int, prec;

>> >    stmt_vec_info def_stmt_vinfo;

>> > @@ -2674,17 +2673,6 @@ vect_recog_divmod_pattern (vec<gimple *>

>> >    if (vectype == NULL_TREE)

>> >      return NULL;

>> >

>> > -  /* If the target can handle vectorized division or modulo natively,

>> > -     don't attempt to optimize this.  */

>> > -  optab = optab_for_tree_code (rhs_code, vectype, optab_default);

>> > -  if (optab != unknown_optab)

>> > -    {

>> > -      machine_mode vec_mode = TYPE_MODE (vectype);

>> > -      int icode = (int) optab_handler (optab, vec_mode);

>> > -      if (icode != CODE_FOR_nothing)

>> > -     return NULL;

>> > -    }

>> > -

>

>> Shouldn't we instead keep it, but only do it if

>> optimize_bb_for_size_p (gimple_bb (last_stmt)) ?

>> I mean, a hw division is very likely shorter than what we replace it

> with...

>

> Good idea.  While loops are checked for optimize_loop_nest_for_speed if

> not forced vect BB vectorization has no such predicate.


Like this, if it passes testing?

Thanks,
Richard


2018-05-24  Richard Sandiford  <richard.sandiford@linaro.org>

gcc/
	* tree-vect-patterns.c: Include predict.h.
	(vect_recog_divmod_pattern): Restrict check for division support
	to when optimizing for size.

gcc/testsuite/
	* gcc.dg/vect/bb-slp-div-1.c: New XFAILed test.

Index: gcc/tree-vect-patterns.c
===================================================================
*** gcc/tree-vect-patterns.c	2018-05-24 13:50:21.656138942 +0100
--- gcc/tree-vect-patterns.c	2018-05-24 13:50:21.819134113 +0100
*************** vect_recog_divmod_pattern (vec<gimple *>
*** 2674,2688 ****
    if (vectype == NULL_TREE)
      return NULL;
  
!   /* If the target can handle vectorized division or modulo natively,
!      don't attempt to optimize this.  */
!   optab = optab_for_tree_code (rhs_code, vectype, optab_default);
!   if (optab != unknown_optab)
      {
!       machine_mode vec_mode = TYPE_MODE (vectype);
!       int icode = (int) optab_handler (optab, vec_mode);
!       if (icode != CODE_FOR_nothing)
! 	return NULL;
      }
  
    prec = TYPE_PRECISION (itype);
--- 2674,2692 ----
    if (vectype == NULL_TREE)
      return NULL;
  
!   if (optimize_bb_for_size_p (gimple_bb (last_stmt)))
      {
!       /* If the target can handle vectorized division or modulo natively,
! 	 don't attempt to optimize this, since native division is likely
! 	 to give smaller code.  */
!       optab = optab_for_tree_code (rhs_code, vectype, optab_default);
!       if (optab != unknown_optab)
! 	{
! 	  machine_mode vec_mode = TYPE_MODE (vectype);
! 	  int icode = (int) optab_handler (optab, vec_mode);
! 	  if (icode != CODE_FOR_nothing)
! 	    return NULL;
! 	}
      }
  
    prec = TYPE_PRECISION (itype);
Index: gcc/testsuite/gcc.dg/vect/bb-slp-div-1.c
===================================================================
*** /dev/null	2018-04-20 16:19:46.369131350 +0100
--- gcc/testsuite/gcc.dg/vect/bb-slp-div-1.c	2018-05-24 13:50:21.818134143 +0100
***************
*** 0 ****
--- 1,19 ----
+ /* { dg-do compile } */
+ /* { dg-additional-options "-msve-vector-bits=256" { target aarch64_sve } } */
+ 
+ int x[8];
+ 
+ void
+ f (void)
+ {
+   x[0] /= 2;
+   x[1] /= 3;
+   x[2] /= 4;
+   x[3] /= 5;
+   x[4] /= 6;
+   x[5] /= 7;
+   x[6] /= 8;
+   x[7] /= 9;
+ }
+ 
+ /* { dg-final { scan-tree-dump "basic block vectorized" "slp2" { xfail *-*-* } } } */
Richard Biener May 25, 2018, 7:26 a.m. UTC | #4
On Thu, May 24, 2018 at 2:52 PM Richard Sandiford <
richard.sandiford@linaro.org> wrote:

> Richard Biener <richard.guenther@gmail.com> writes:

> > On Thu, May 24, 2018 at 10:31 AM Jakub Jelinek <jakub@redhat.com> wrote:

> >> On Thu, May 24, 2018 at 09:21:35AM +0100, Richard Sandiford wrote:

> >> > vect_recog_divmod_pattern currently bails out if the target has

> >> > native support for integer division, but I think in practice

> >> > it's always going to be better to open-code it anyway, just as

> >> > we usually open-code scalar divisions by constants.

> >> >

> >> > I think the only currently affected target is MIPS MSA, where for:

> >

> >> Isn't powerpcspe affected too?  It has a divv2si3 pattern.


> Ah, yeah.  I forgot to update the description after you mentioned

> that on IRC, sorry.


> >> > --- gcc/tree-vect-patterns.c  2018-05-16 12:48:59.115202362 +0100

> >> > +++ gcc/tree-vect-patterns.c  2018-05-24 09:18:10.445466941 +0100

> >> > @@ -2639,7 +2639,6 @@ vect_recog_divmod_pattern (vec<gimple *>

> >> >    enum tree_code rhs_code;

> >> >    stmt_vec_info stmt_vinfo = vinfo_for_stmt (last_stmt);

> >> >    vec_info *vinfo = stmt_vinfo->vinfo;

> >> > -  optab optab;

> >> >    tree q;

> >> >    int dummy_int, prec;

> >> >    stmt_vec_info def_stmt_vinfo;

> >> > @@ -2674,17 +2673,6 @@ vect_recog_divmod_pattern (vec<gimple *>

> >> >    if (vectype == NULL_TREE)

> >> >      return NULL;

> >> >

> >> > -  /* If the target can handle vectorized division or modulo

natively,
> >> > -     don't attempt to optimize this.  */

> >> > -  optab = optab_for_tree_code (rhs_code, vectype, optab_default);

> >> > -  if (optab != unknown_optab)

> >> > -    {

> >> > -      machine_mode vec_mode = TYPE_MODE (vectype);

> >> > -      int icode = (int) optab_handler (optab, vec_mode);

> >> > -      if (icode != CODE_FOR_nothing)

> >> > -     return NULL;

> >> > -    }

> >> > -

> >

> >> Shouldn't we instead keep it, but only do it if

> >> optimize_bb_for_size_p (gimple_bb (last_stmt)) ?

> >> I mean, a hw division is very likely shorter than what we replace it

> > with...

> >

> > Good idea.  While loops are checked for optimize_loop_nest_for_speed if

> > not forced vect BB vectorization has no such predicate.


> Like this, if it passes testing?


Looks good.

Thanks,
Richard.

> Thanks,

> Richard



> 2018-05-24  Richard Sandiford  <richard.sandiford@linaro.org>


> gcc/

>          * tree-vect-patterns.c: Include predict.h.

>          (vect_recog_divmod_pattern): Restrict check for division support

>          to when optimizing for size.


> gcc/testsuite/

>          * gcc.dg/vect/bb-slp-div-1.c: New XFAILed test.


> Index: gcc/tree-vect-patterns.c

> ===================================================================

> *** gcc/tree-vect-patterns.c    2018-05-24 13:50:21.656138942 +0100

> --- gcc/tree-vect-patterns.c    2018-05-24 13:50:21.819134113 +0100

> *************** vect_recog_divmod_pattern (vec<gimple *>

> *** 2674,2688 ****

>      if (vectype == NULL_TREE)

>        return NULL;


> !   /* If the target can handle vectorized division or modulo natively,

> !      don't attempt to optimize this.  */

> !   optab = optab_for_tree_code (rhs_code, vectype, optab_default);

> !   if (optab != unknown_optab)

>        {

> !       machine_mode vec_mode = TYPE_MODE (vectype);

> !       int icode = (int) optab_handler (optab, vec_mode);

> !       if (icode != CODE_FOR_nothing)

> !       return NULL;

>        }


>      prec = TYPE_PRECISION (itype);

> --- 2674,2692 ----

>      if (vectype == NULL_TREE)

>        return NULL;


> !   if (optimize_bb_for_size_p (gimple_bb (last_stmt)))

>        {

> !       /* If the target can handle vectorized division or modulo

natively,
> !        don't attempt to optimize this, since native division is likely

> !        to give smaller code.  */

> !       optab = optab_for_tree_code (rhs_code, vectype, optab_default);

> !       if (optab != unknown_optab)

> !       {

> !         machine_mode vec_mode = TYPE_MODE (vectype);

> !         int icode = (int) optab_handler (optab, vec_mode);

> !         if (icode != CODE_FOR_nothing)

> !           return NULL;

> !       }

>        }


>      prec = TYPE_PRECISION (itype);

> Index: gcc/testsuite/gcc.dg/vect/bb-slp-div-1.c

> ===================================================================

> *** /dev/null   2018-04-20 16:19:46.369131350 +0100

> --- gcc/testsuite/gcc.dg/vect/bb-slp-div-1.c    2018-05-24

13:50:21.818134143 +0100
> ***************

> *** 0 ****

> --- 1,19 ----

> + /* { dg-do compile } */

> + /* { dg-additional-options "-msve-vector-bits=256" { target aarch64_sve

} } */
> +

> + int x[8];

> +

> + void

> + f (void)

> + {

> +   x[0] /= 2;

> +   x[1] /= 3;

> +   x[2] /= 4;

> +   x[3] /= 5;

> +   x[4] /= 6;

> +   x[5] /= 7;

> +   x[6] /= 8;

> +   x[7] /= 9;

> + }

> +

> + /* { dg-final { scan-tree-dump "basic block vectorized" "slp2" { xfail

*-*-* } } } */
Matthew Fortune June 8, 2018, 1:36 p.m. UTC | #5
Richard Sandiford <richard.sandiford@linaro.org> writes:
> vect_recog_divmod_pattern currently bails out if the target has

> native support for integer division, but I think in practice

> it's always going to be better to open-code it anyway, just as

> we usually open-code scalar divisions by constants.

> 

> I think the only currently affected target is MIPS MSA, where for:

> 

>   void

>   foo (int *x)

>   {

>     for (int i = 0; i < 100; ++i)

>       x[i] /= 2;

>   }

> 

> we previously preferred to use division for powers of 2:

> 

>         .set    noreorder

>         bnz.w   $w1,1f

>         div_s.w $w0,$w0,$w1

>         break   7

>         .set    reorder

> 1:

> 

> (or just the div_s.w for -mno-check-zero-division), but after the patch

> we open-code them using shifts:

> 

>         clt_s.w $w1,$w0,$w2

>         subv.w  $w0,$w0,$w1

>         srai.w  $w0,$w0,1

> 

> I assume that's better.  Matthew, is that right?


Sorry for extreme tardiness. Yes, the alternate sequence has a max latency
of 6. Although I don't have the range of latencies to hand for the FDIV, as
far as I remember 6 cycles is better than the fastest FDIV case at least for
i6400/i6500.

Matthew
diff mbox series

Patch

Index: gcc/tree-vect-patterns.c
===================================================================
--- gcc/tree-vect-patterns.c	2018-05-16 12:48:59.115202362 +0100
+++ gcc/tree-vect-patterns.c	2018-05-24 09:18:10.445466941 +0100
@@ -2639,7 +2639,6 @@  vect_recog_divmod_pattern (vec<gimple *>
   enum tree_code rhs_code;
   stmt_vec_info stmt_vinfo = vinfo_for_stmt (last_stmt);
   vec_info *vinfo = stmt_vinfo->vinfo;
-  optab optab;
   tree q;
   int dummy_int, prec;
   stmt_vec_info def_stmt_vinfo;
@@ -2674,17 +2673,6 @@  vect_recog_divmod_pattern (vec<gimple *>
   if (vectype == NULL_TREE)
     return NULL;
 
-  /* If the target can handle vectorized division or modulo natively,
-     don't attempt to optimize this.  */
-  optab = optab_for_tree_code (rhs_code, vectype, optab_default);
-  if (optab != unknown_optab)
-    {
-      machine_mode vec_mode = TYPE_MODE (vectype);
-      int icode = (int) optab_handler (optab, vec_mode);
-      if (icode != CODE_FOR_nothing)
-	return NULL;
-    }
-
   prec = TYPE_PRECISION (itype);
   if (integer_pow2p (oprnd1))
     {
Index: gcc/testsuite/gcc.dg/vect/bb-slp-div-1.c
===================================================================
--- /dev/null	2018-04-20 16:19:46.369131350 +0100
+++ gcc/testsuite/gcc.dg/vect/bb-slp-div-1.c	2018-05-24 09:18:10.444466986 +0100
@@ -0,0 +1,19 @@ 
+/* { dg-do compile } */
+/* { dg-additional-options "-msve-vector-bits=256" { target aarch64_sve } } */
+
+int x[8];
+
+void
+f (void)
+{
+  x[0] /= 2;
+  x[1] /= 3;
+  x[2] /= 4;
+  x[3] /= 5;
+  x[4] /= 6;
+  x[5] /= 7;
+  x[6] /= 8;
+  x[7] /= 9;
+}
+
+/* { dg-final { scan-tree-dump "basic block vectorized" "slp2" { xfail *-*-* } } } */