diff mbox

[rtlanal] Do a better job of costing parallel sets containing flag-setting operations.

Message ID 66275bc9-7d97-b990-4c86-2de1f4a6a2fa@arm.com
State New
Headers show

Commit Message

Richard Earnshaw (lists) June 19, 2017, 1:46 p.m. UTC
Many parallel set insns are of the form of a single set that also sets
the condition code flags.  In this case the cost of such an insn is
normally the cost of the part that doesn't set the flags, since updating
the condition flags is simply a side effect.

At present all such insns are treated as having unknown cost (ie 0) and
combine assumes that such insns are infinitely more expensive than any
other insn sequence with a non-zero cost.

This patch addresses this problem by allowing insn_rtx_cost to ignore
the condition setting part of a PARALLEL iff there is exactly one
comparison set and one non-comparison set.  If the only set operation is
a comparison we still use that as the basis of the insn cost.

	* rtlanal.c (insn_rtx_cost): If a parallel contains exactly one
	comparison set and one other set, use the cost of the
	non-comparison set.

Bootstrapped on aarch64-none-linuxgnu

OK?

R.

Comments

Segher Boessenkool June 19, 2017, 2:08 p.m. UTC | #1
Hi!

On Mon, Jun 19, 2017 at 02:46:59PM +0100, Richard Earnshaw (lists) wrote:
> Many parallel set insns are of the form of a single set that also sets

> the condition code flags.  In this case the cost of such an insn is

> normally the cost of the part that doesn't set the flags, since updating

> the condition flags is simply a side effect.

> 

> At present all such insns are treated as having unknown cost (ie 0) and

> combine assumes that such insns are infinitely more expensive than any

> other insn sequence with a non-zero cost.


That's not what combine does: it optimistically assumes any combination
with unknown costs is an improvement.

> This patch addresses this problem by allowing insn_rtx_cost to ignore

> the condition setting part of a PARALLEL iff there is exactly one

> comparison set and one non-comparison set.  If the only set operation is

> a comparison we still use that as the basis of the insn cost.


I'll test this on a zillion archs, see what the effect is.

Have you considered costing general parallels as well?


Segher
Richard Earnshaw (lists) June 19, 2017, 2:28 p.m. UTC | #2
On 19/06/17 15:08, Segher Boessenkool wrote:
> Hi!

> 

> On Mon, Jun 19, 2017 at 02:46:59PM +0100, Richard Earnshaw (lists) wrote:

>> Many parallel set insns are of the form of a single set that also sets

>> the condition code flags.  In this case the cost of such an insn is

>> normally the cost of the part that doesn't set the flags, since updating

>> the condition flags is simply a side effect.

>>

>> At present all such insns are treated as having unknown cost (ie 0) and

>> combine assumes that such insns are infinitely more expensive than any

>> other insn sequence with a non-zero cost.

> 

> That's not what combine does: it optimistically assumes any combination

> with unknown costs is an improvement.


So try this testcase on ARM.

unsigned long x, y, z;
int b;
void test()
{
   b = __builtin_sub_overflow (y,z, &x);
}


Without the patch, combine rips apart a compare and subtract insn
because it sees it as having cost zero and substitutes it with separate
compare and subtract insns.

ie before:


        ldr     r3, .L5
        ldr     r2, .L5+4
        ldr     r3, [r3]
        ldr     r2, [r2]
        cmp     r3, r2    <=====
        movcs   r0, #0
        movcc   r0, #1
        ldr     ip, .L5+8
        ldr     r1, .L5+12
        sub     r3, r3, r2  <=====
        str     r3, [ip]
        str     r0, [r1]
        bx      lr

after:

        ldr     r3, .L5
        ldr     r2, .L5+4
        ldr     r3, [r3]
        ldr     r2, [r2]
        subs    r3, r3, r2  <====
        movcc   r1, #1
        movcs   r1, #0
        ldr     r0, .L5+8
        ldr     r2, .L5+12
        str     r3, [r0]
        str     r1, [r2]
        bx      lr

The combine log before the patch shows:

allowing combination of insns 10 and 51
original costs 0 + 8 = 0
replacement costs 4 + 12 = 16

So it is clearly deciding that the original costs are greater than the
replacement costs.

> 

>> This patch addresses this problem by allowing insn_rtx_cost to ignore

>> the condition setting part of a PARALLEL iff there is exactly one

>> comparison set and one non-comparison set.  If the only set operation is

>> a comparison we still use that as the basis of the insn cost.

> 

> I'll test this on a zillion archs, see what the effect is.

> 

> Have you considered costing general parallels as well?

> 

> 


I thought about it but concluded that there's no generically correct
answer.  It might be the max of all the individual sets or it might be
the sum, or it might be somewhere in between.  For example on ARM the
load/store multiple operations are expressed as parallels, but their
cost will depend on how many loads/stores happen in parallel within the
hardware.

I think we'd need a new back-end hook to handle the other cases sensibly.

R.

> Segher

>
Richard Earnshaw (lists) June 19, 2017, 2:45 p.m. UTC | #3
On 19/06/17 15:08, Segher Boessenkool wrote:
> Hi!

> 

> On Mon, Jun 19, 2017 at 02:46:59PM +0100, Richard Earnshaw (lists) wrote:

>> Many parallel set insns are of the form of a single set that also sets

>> the condition code flags.  In this case the cost of such an insn is

>> normally the cost of the part that doesn't set the flags, since updating

>> the condition flags is simply a side effect.

>>

>> At present all such insns are treated as having unknown cost (ie 0) and

>> combine assumes that such insns are infinitely more expensive than any

>> other insn sequence with a non-zero cost.

> 

> That's not what combine does: it optimistically assumes any combination

> with unknown costs is an improvement.


Actually the logic is

  int reject = old_cost > 0 && new_cost > old_cost;


So reject will never be true if old cost is zero.

R.
> 

>> This patch addresses this problem by allowing insn_rtx_cost to ignore

>> the condition setting part of a PARALLEL iff there is exactly one

>> comparison set and one non-comparison set.  If the only set operation is

>> a comparison we still use that as the basis of the insn cost.

> 

> I'll test this on a zillion archs, see what the effect is.

> 

> Have you considered costing general parallels as well?

> 

> 

> Segher

>
Segher Boessenkool June 19, 2017, 3:06 p.m. UTC | #4
On Mon, Jun 19, 2017 at 03:28:20PM +0100, Richard Earnshaw (lists) wrote:
> > That's not what combine does: it optimistically assumes any combination

> > with unknown costs is an improvement.

> 

> So try this testcase on ARM.

> 

> unsigned long x, y, z;

> int b;

> void test()

> {

>    b = __builtin_sub_overflow (y,z, &x);

> }

> 

> 

> Without the patch, combine rips apart a compare and subtract insn

> because it sees it as having cost zero and substitutes it with separate

> compare and subtract insns.


> The combine log before the patch shows:

> 

> allowing combination of insns 10 and 51

> original costs 0 + 8 = 0

> replacement costs 4 + 12 = 16


Yes, this is a good example of a case where your patch helps.  Thanks.

> So it is clearly deciding that the original costs are greater than the

> replacement costs.


No: it allows any combination with unknown cost (either old or new cost).
See combine_validate_cost.

> >> This patch addresses this problem by allowing insn_rtx_cost to ignore

> >> the condition setting part of a PARALLEL iff there is exactly one

> >> comparison set and one non-comparison set.  If the only set operation is

> >> a comparison we still use that as the basis of the insn cost.

> > 

> > I'll test this on a zillion archs, see what the effect is.

> > 

> > Have you considered costing general parallels as well?

> 

> I thought about it but concluded that there's no generically correct

> answer.  It might be the max of all the individual sets or it might be

> the sum, or it might be somewhere in between.  For example on ARM the

> load/store multiple operations are expressed as parallels, but their

> cost will depend on how many loads/stores happen in parallel within the

> hardware.

> 

> I think we'd need a new back-end hook to handle the other cases sensibly.


And in general make insn_rtx_cost do something more sane than just looking
at a set_src_cost, yeah.

The problem is changing any of this without regressing some targets.
Of course we are in stage 1 now ;-)


Segher
Segher Boessenkool June 19, 2017, 3:09 p.m. UTC | #5
On Mon, Jun 19, 2017 at 03:45:23PM +0100, Richard Earnshaw (lists) wrote:
> >> At present all such insns are treated as having unknown cost (ie 0) and

> >> combine assumes that such insns are infinitely more expensive than any

> >> other insn sequence with a non-zero cost.

> > 

> > That's not what combine does: it optimistically assumes any combination

> > with unknown costs is an improvement.

> 

> Actually the logic is

> 

>   int reject = old_cost > 0 && new_cost > old_cost;

> 

> So reject will never be true if old cost is zero.


Yes, exactly; and neither if new_cost is zero.  If any cost is unknown
combine just hopes for the best.


Segher
Richard Earnshaw (lists) June 19, 2017, 4:01 p.m. UTC | #6
On 19/06/17 16:09, Segher Boessenkool wrote:
> On Mon, Jun 19, 2017 at 03:45:23PM +0100, Richard Earnshaw (lists) wrote:

>>>> At present all such insns are treated as having unknown cost (ie 0) and

>>>> combine assumes that such insns are infinitely more expensive than any

>>>> other insn sequence with a non-zero cost.

>>>

>>> That's not what combine does: it optimistically assumes any combination

>>> with unknown costs is an improvement.

>>

>> Actually the logic is

>>

>>   int reject = old_cost > 0 && new_cost > old_cost;

>>

>> So reject will never be true if old cost is zero.

> 

> Yes, exactly; and neither if new_cost is zero.  If any cost is unknown

> combine just hopes for the best.

> 

> 

> Segher

> 


Yeah, and I'm not suggesting we change the logic there (sorry if the
description was misleading).  Instead I'm proposing that we handle more
cases for parallels to not return zero.

R.
Segher Boessenkool June 19, 2017, 5:40 p.m. UTC | #7
On Mon, Jun 19, 2017 at 05:01:10PM +0100, Richard Earnshaw (lists) wrote:
> Yeah, and I'm not suggesting we change the logic there (sorry if the

> description was misleading).  Instead I'm proposing that we handle more

> cases for parallels to not return zero.


Right.  My test run is half way through, will have results later --
your change looks good to me, but it is always surprising whether
better costs help or not, or even *hurt* good code generation (things
are just too tightly tuned to the current behaviour, so some things
may need retuning).


Segher
Segher Boessenkool June 20, 2017, 12:54 p.m. UTC | #8
On Mon, Jun 19, 2017 at 12:40:53PM -0500, Segher Boessenkool wrote:
> On Mon, Jun 19, 2017 at 05:01:10PM +0100, Richard Earnshaw (lists) wrote:

> > Yeah, and I'm not suggesting we change the logic there (sorry if the

> > description was misleading).  Instead I'm proposing that we handle more

> > cases for parallels to not return zero.

> 

> Right.  My test run is half way through, will have results later --

> your change looks good to me, but it is always surprising whether

> better costs help or not, or even *hurt* good code generation (things

> are just too tightly tuned to the current behaviour, so some things

> may need retuning).


Everything built successfully (31 targets); --enable-checking=yes,rtl,tree
so it took a while, sorry.

The targets with any differences (table shows code size):

                 old     patched
         arm  11545709  11545797
     powerpc   8442762   8442746
      x86_64  10627428  10627363

Arm has very many differences, the others do not.  For powerpc (which
is 32-bit, 64-bit showed no differences) most of the difference is
scheduling deciding to do things a bit differently, and most of it
in places where we have not-so-good costs anyway.  For arm the effects
often cascade to bb-reorder making different decisions.

Anyway, all differences are small, it is not likely to hurt anything.
I support the patch, if that helps -- but I cannot approve it.


Segher
Richard Earnshaw (lists) June 30, 2017, 9:03 a.m. UTC | #9
On 19/06/17 14:46, Richard Earnshaw (lists) wrote:
> Many parallel set insns are of the form of a single set that also sets

> the condition code flags.  In this case the cost of such an insn is

> normally the cost of the part that doesn't set the flags, since updating

> the condition flags is simply a side effect.

> 

> At present all such insns are treated as having unknown cost (ie 0) and

> combine assumes that such insns are infinitely more expensive than any

> other insn sequence with a non-zero cost.

> 

> This patch addresses this problem by allowing insn_rtx_cost to ignore

> the condition setting part of a PARALLEL iff there is exactly one

> comparison set and one non-comparison set.  If the only set operation is

> a comparison we still use that as the basis of the insn cost.

> 

> 	* rtlanal.c (insn_rtx_cost): If a parallel contains exactly one

> 	comparison set and one other set, use the cost of the

> 	non-comparison set.

> 

> Bootstrapped on aarch64-none-linuxgnu

> 

> OK?

> 


Ping?

R.

> R.

> 

> 

> insn-costs.patch

> 

> 

> diff --git a/gcc/rtlanal.c b/gcc/rtlanal.c

> index d9f57c3..5cae793 100644

> --- a/gcc/rtlanal.c

> +++ b/gcc/rtlanal.c

> @@ -5260,23 +5260,41 @@ insn_rtx_cost (rtx pat, bool speed)

>    int i, cost;

>    rtx set;

>  

> -  /* Extract the single set rtx from the instruction pattern.

> -     We can't use single_set since we only have the pattern.  */

> +  /* Extract the single set rtx from the instruction pattern.  We

> +     can't use single_set since we only have the pattern.  We also

> +     consider PARALLELs of a normal set and and a single comparison.

> +     In that case we use the cost of the non-comparison SET operation,

> +     which is most-likely to be the real cost of this operation.  */

>    if (GET_CODE (pat) == SET)

>      set = pat;

>    else if (GET_CODE (pat) == PARALLEL)

>      {

>        set = NULL_RTX;

> +      rtx comparison = NULL_RTX;

> +

>        for (i = 0; i < XVECLEN (pat, 0); i++)

>  	{

>  	  rtx x = XVECEXP (pat, 0, i);

>  	  if (GET_CODE (x) == SET)

>  	    {

> -	      if (set)

> -		return 0;

> -	      set = x;

> +	      if (GET_CODE (SET_SRC (x)) == COMPARE)

> +		{

> +		  if (comparison)

> +		    return 0;

> +		  comparison = x;

> +		}

> +	      else

> +		{

> +		  if (set)

> +		    return 0;

> +		  set = x;

> +		}

>  	    }

>  	}

> +

> +      if (!set && comparison)

> +	set = comparison;

> +

>        if (!set)

>  	return 0;

>      }

>
Jeff Law June 30, 2017, 3:20 p.m. UTC | #10
On 06/30/2017 03:03 AM, Richard Earnshaw (lists) wrote:
> On 19/06/17 14:46, Richard Earnshaw (lists) wrote:

>> Many parallel set insns are of the form of a single set that also sets

>> the condition code flags.  In this case the cost of such an insn is

>> normally the cost of the part that doesn't set the flags, since updating

>> the condition flags is simply a side effect.

>>

>> At present all such insns are treated as having unknown cost (ie 0) and

>> combine assumes that such insns are infinitely more expensive than any

>> other insn sequence with a non-zero cost.

>>

>> This patch addresses this problem by allowing insn_rtx_cost to ignore

>> the condition setting part of a PARALLEL iff there is exactly one

>> comparison set and one non-comparison set.  If the only set operation is

>> a comparison we still use that as the basis of the insn cost.

>>

>> 	* rtlanal.c (insn_rtx_cost): If a parallel contains exactly one

>> 	comparison set and one other set, use the cost of the

>> 	non-comparison set.

>>

>> Bootstrapped on aarch64-none-linuxgnu

>>

>> OK?

>>

> 

> Ping?

Needs a ChangeLog, with that, OK.  Ideally we'd have something which
verifies we're getting the cost right, but that's probably nontrivial to
do in a generic manner.

jeff
diff mbox

Patch

diff --git a/gcc/rtlanal.c b/gcc/rtlanal.c
index d9f57c3..5cae793 100644
--- a/gcc/rtlanal.c
+++ b/gcc/rtlanal.c
@@ -5260,23 +5260,41 @@  insn_rtx_cost (rtx pat, bool speed)
   int i, cost;
   rtx set;
 
-  /* Extract the single set rtx from the instruction pattern.
-     We can't use single_set since we only have the pattern.  */
+  /* Extract the single set rtx from the instruction pattern.  We
+     can't use single_set since we only have the pattern.  We also
+     consider PARALLELs of a normal set and and a single comparison.
+     In that case we use the cost of the non-comparison SET operation,
+     which is most-likely to be the real cost of this operation.  */
   if (GET_CODE (pat) == SET)
     set = pat;
   else if (GET_CODE (pat) == PARALLEL)
     {
       set = NULL_RTX;
+      rtx comparison = NULL_RTX;
+
       for (i = 0; i < XVECLEN (pat, 0); i++)
 	{
 	  rtx x = XVECEXP (pat, 0, i);
 	  if (GET_CODE (x) == SET)
 	    {
-	      if (set)
-		return 0;
-	      set = x;
+	      if (GET_CODE (SET_SRC (x)) == COMPARE)
+		{
+		  if (comparison)
+		    return 0;
+		  comparison = x;
+		}
+	      else
+		{
+		  if (set)
+		    return 0;
+		  set = x;
+		}
 	    }
 	}
+
+      if (!set && comparison)
+	set = comparison;
+
       if (!set)
 	return 0;
     }