diff mbox

[RFC] PR tree-optimization/67628: Make tree ifcombine more symmetric and interactions with dom

Message ID 56015958.3090009@arm.com
State New
Headers show

Commit Message

Kyrylo Tkachov Sept. 22, 2015, 1:36 p.m. UTC
Hi all,

I'm looking into improving usage of aarch64 conditional compare instructions and PR 67628
is one area of improvement I identified. The problem there is different tree-level behaviour
for the expressions:
(a > b && b <= c) && c > d
vs.
a > b && (b <= c && c > d)

The second variant generates worse code for aarch64 (and x86_64) because tree ifcombine doesn't
do as good a job at merging basic blocks.

I've tracked this down to the function ifcombine_andif in tree-ssa-ifcombine.c.
With this patch I get the good codegen from that PR, both forms are combined into a single basic block
and the conditional compare expansion code in ccmp.c does a good job at generating the ccmp instructions.
I also see about 2% more ccmp instructions being generated for SPEC2006 with the code generally looking
better as well (if I remove this whole single-conditional restriction completely I get 19% more conditional
compares in SPEC2006, but that's something I want to investigate separately).

The idea is that for the two expressions given above the control flow and contents of the basic blocks
is the same, but when ifcombine_ifandif is called the inner_cond_bb and outer_cond_bb arguments are
reversed i.e. for the first expression the inner_cond_bb contains a single comparison c > d
and outer_cond_bb contains multiple comparisons a > b && b <= c.
In the second case the outer cond contains the single comparison a > b and the inner block this
time contains the multiple conditions a > b && b <= c, thus failing the
"Only do this optimization if the inner bb contains only the conditional" check.
The patch makes this condition more symmetrical by rejecting this optimisation only if
neither the inner nor the outer condition blocks are simple.

Unfortunately, I see a testsuite regression with this patch:
FAIL: gcc.dg/pr66299-2.c scan-tree-dump-not optimized "<<"

The reduced part of that test is:
void
test1 (int x, unsigned u)
{
   if ((1U << x) != 64
       || (2 << x) != u
       || (x << x) != 384
       || (3 << x) == 9
       || (x << 14) != 98304U
       || (1 << x) == 14
       || (3 << 2) != 12)
     __builtin_abort ();
}

The patched ifcombine pass works more or less as expected and produces fewer basic blocks.
Before this patch a relevant part of the ifcombine dump for test1 is:
;;   basic block 2, loop depth 0, count 0, freq 10000, maybe hot
   if (x_1(D) != 6)
     goto <bb 6>;
   else
     goto <bb 3>;

;;   basic block 3, loop depth 0, count 0, freq 9996, maybe hot
   _2 = 2 << x_1(D);
   _3 = (unsigned intD.10) _2;
   if (_3 != u_4(D))
     goto <bb 6>;
   else
     goto <bb 4>;


After this patch it is:
;;   basic block 2, loop depth 0, count 0, freq 10000, maybe hot
   _2 = 2 << x_1(D);
   _3 = (unsigned intD.10) _2;
   _9 = _3 != u_4(D);
   _10 = x_1(D) != 6;
   _11 = _9 | _10;
   if (_11 != 0)
     goto <bb 5>;
   else
     goto <bb 3>;

The second form ends up generating worse codegen however, and the badness starts with the dom1 pass.
In the unpatched case it manages to deduce that x must be 6 by the time it reaches basic block 3 and
uses that information to eliminate the shift in "_2 = 2 << x_1(D)" from basic block 3
In the patched case it is unable to make that call, I think because the x != 6 condition is IORed
with another test.

I'm not familiar with the internals of the dom pass, so I'm not sure where to go looking for a fix for this.
Is the ifcombine change a step in the right direction? If so, what would need to be done to fix the issue with
the dom pass?

I suppose what we want is to not combine basic blocks if the sequence and conditions of the basic blocks are
such that dom can potentially exploit them, but how do we express that?

Thanks,
Kyrill

P.S. I can provide more complete tree dumps for the examples if needed.

2015-09-22  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>

     PR tree-optimization/67628
     * tree-ssa-ifcombine.c (ifcombine_ifandif): Allow optimization
     when either the inner or outer block contain a single conditional.

Comments

Jeff Law Sept. 22, 2015, 7:31 p.m. UTC | #1
On 09/22/2015 07:36 AM, Kyrill Tkachov wrote:
> Hi all,

> Unfortunately, I see a testsuite regression with this patch:
> FAIL: gcc.dg/pr66299-2.c scan-tree-dump-not optimized "<<"
>
> The reduced part of that test is:
> void
> test1 (int x, unsigned u)
> {
>    if ((1U << x) != 64
>        || (2 << x) != u
>        || (x << x) != 384
>        || (3 << x) == 9
>        || (x << 14) != 98304U
>        || (1 << x) == 14
>        || (3 << 2) != 12)
>      __builtin_abort ();
> }
>
> The patched ifcombine pass works more or less as expected and produces
> fewer basic blocks.
> Before this patch a relevant part of the ifcombine dump for test1 is:
> ;;   basic block 2, loop depth 0, count 0, freq 10000, maybe hot
>    if (x_1(D) != 6)
>      goto <bb 6>;
>    else
>      goto <bb 3>;
>
> ;;   basic block 3, loop depth 0, count 0, freq 9996, maybe hot
>    _2 = 2 << x_1(D);
>    _3 = (unsigned intD.10) _2;
>    if (_3 != u_4(D))
>      goto <bb 6>;
>    else
>      goto <bb 4>;
>
>
> After this patch it is:
> ;;   basic block 2, loop depth 0, count 0, freq 10000, maybe hot
>    _2 = 2 << x_1(D);
>    _3 = (unsigned intD.10) _2;
>    _9 = _3 != u_4(D);
>    _10 = x_1(D) != 6;
>    _11 = _9 | _10;
>    if (_11 != 0)
>      goto <bb 5>;
>    else
>      goto <bb 3>;
>
> The second form ends up generating worse codegen however, and the
> badness starts with the dom1 pass.
> In the unpatched case it manages to deduce that x must be 6 by the time
> it reaches basic block 3 and
> uses that information to eliminate the shift in "_2 = 2 << x_1(D)" from
> basic block 3
> In the patched case it is unable to make that call, I think because the
> x != 6 condition is IORed
> with another test.
>
> I'm not familiar with the internals of the dom pass, so I'm not sure
> where to go looking for a fix for this.
> Is the ifcombine change a step in the right direction? If so, what would
> need to be done to fix the issue with
> the dom pass?
I don't see how you can reasonably fix this in DOM.  if _9 or _10 is 
true, then _11 is true.  But we can't reasonably record any kind of 
equivalence for _9 or _10 individually.

If the statement
_11 = _9 | _10;

Were changed to

_11 = _9 & _10;

Then we could record something useful about _9 and _10.


> I suppose what we want is to not combine basic blocks if the sequence
> and conditions of the basic blocks are
> such that dom can potentially exploit them, but how do we express that?
I don't think there's going to be a way to directly express that.  You 
could essentially claim that TRUTH_OR is more expensive than TRUTH_AND 
because of the impact on DOM, but that in and of itself may not resolve 
the situation either.

I think the question we need to answer is whether or not your changes 
are generally better, even if there's specific instances where they make 
things worse.  If the benefits outweigh the negatives then we can xfail 
that test.

jeff
Kyrylo Tkachov Sept. 23, 2015, 8:59 a.m. UTC | #2
On 22/09/15 20:31, Jeff Law wrote:
> On 09/22/2015 07:36 AM, Kyrill Tkachov wrote:
>> Hi all,
>> Unfortunately, I see a testsuite regression with this patch:
>> FAIL: gcc.dg/pr66299-2.c scan-tree-dump-not optimized "<<"
>>
>> The reduced part of that test is:
>> void
>> test1 (int x, unsigned u)
>> {
>>     if ((1U << x) != 64
>>         || (2 << x) != u
>>         || (x << x) != 384
>>         || (3 << x) == 9
>>         || (x << 14) != 98304U
>>         || (1 << x) == 14
>>         || (3 << 2) != 12)
>>       __builtin_abort ();
>> }
>>
>> The patched ifcombine pass works more or less as expected and produces
>> fewer basic blocks.
>> Before this patch a relevant part of the ifcombine dump for test1 is:
>> ;;   basic block 2, loop depth 0, count 0, freq 10000, maybe hot
>>     if (x_1(D) != 6)
>>       goto <bb 6>;
>>     else
>>       goto <bb 3>;
>>
>> ;;   basic block 3, loop depth 0, count 0, freq 9996, maybe hot
>>     _2 = 2 << x_1(D);
>>     _3 = (unsigned intD.10) _2;
>>     if (_3 != u_4(D))
>>       goto <bb 6>;
>>     else
>>       goto <bb 4>;
>>
>>
>> After this patch it is:
>> ;;   basic block 2, loop depth 0, count 0, freq 10000, maybe hot
>>     _2 = 2 << x_1(D);
>>     _3 = (unsigned intD.10) _2;
>>     _9 = _3 != u_4(D);
>>     _10 = x_1(D) != 6;
>>     _11 = _9 | _10;
>>     if (_11 != 0)
>>       goto <bb 5>;
>>     else
>>       goto <bb 3>;
>>
>> The second form ends up generating worse codegen however, and the
>> badness starts with the dom1 pass.
>> In the unpatched case it manages to deduce that x must be 6 by the time
>> it reaches basic block 3 and
>> uses that information to eliminate the shift in "_2 = 2 << x_1(D)" from
>> basic block 3
>> In the patched case it is unable to make that call, I think because the
>> x != 6 condition is IORed
>> with another test.
>>
>> I'm not familiar with the internals of the dom pass, so I'm not sure
>> where to go looking for a fix for this.
>> Is the ifcombine change a step in the right direction? If so, what would
>> need to be done to fix the issue with
>> the dom pass?
> I don't see how you can reasonably fix this in DOM.  if _9 or _10 is
> true, then _11 is true.  But we can't reasonably record any kind of
> equivalence for _9 or _10 individually.
>
> If the statement
> _11 = _9 | _10;
>
> Were changed to
>
> _11 = _9 & _10;
>
> Then we could record something useful about _9 and _10.
>
>
>> I suppose what we want is to not combine basic blocks if the sequence
>> and conditions of the basic blocks are
>> such that dom can potentially exploit them, but how do we express that?
> I don't think there's going to be a way to directly express that.  You
> could essentially claim that TRUTH_OR is more expensive than TRUTH_AND
> because of the impact on DOM, but that in and of itself may not resolve
> the situation either.
>
> I think the question we need to answer is whether or not your changes
> are generally better, even if there's specific instances where they make
> things worse.  If the benefits outweigh the negatives then we can xfail
> that test.

Ok, I'll investigate and benchmark some more.
Andrew, this transformation to ifcombine (together with the restriction that the inner condition block
has to be a single comparison) was added by you with r204194.
Is there a particular reason for that restriction and why it is applied to the inner block and not either?

Thanks,
Kyrill



>
> jeff
>
Andrew Pinski Sept. 23, 2015, 9:09 a.m. UTC | #3
> On Sep 23, 2015, at 1:59 AM, Kyrill Tkachov <kyrylo.tkachov@arm.com> wrote:
> 
> 
>> On 22/09/15 20:31, Jeff Law wrote:
>>> On 09/22/2015 07:36 AM, Kyrill Tkachov wrote:
>>> Hi all,
>>> Unfortunately, I see a testsuite regression with this patch:
>>> FAIL: gcc.dg/pr66299-2.c scan-tree-dump-not optimized "<<"
>>> 
>>> The reduced part of that test is:
>>> void
>>> test1 (int x, unsigned u)
>>> {
>>>    if ((1U << x) != 64
>>>        || (2 << x) != u
>>>        || (x << x) != 384
>>>        || (3 << x) == 9
>>>        || (x << 14) != 98304U
>>>        || (1 << x) == 14
>>>        || (3 << 2) != 12)
>>>      __builtin_abort ();
>>> }
>>> 
>>> The patched ifcombine pass works more or less as expected and produces
>>> fewer basic blocks.
>>> Before this patch a relevant part of the ifcombine dump for test1 is:
>>> ;;   basic block 2, loop depth 0, count 0, freq 10000, maybe hot
>>>    if (x_1(D) != 6)
>>>      goto <bb 6>;
>>>    else
>>>      goto <bb 3>;
>>> 
>>> ;;   basic block 3, loop depth 0, count 0, freq 9996, maybe hot
>>>    _2 = 2 << x_1(D);
>>>    _3 = (unsigned intD.10) _2;
>>>    if (_3 != u_4(D))
>>>      goto <bb 6>;
>>>    else
>>>      goto <bb 4>;
>>> 
>>> 
>>> After this patch it is:
>>> ;;   basic block 2, loop depth 0, count 0, freq 10000, maybe hot
>>>    _2 = 2 << x_1(D);
>>>    _3 = (unsigned intD.10) _2;
>>>    _9 = _3 != u_4(D);
>>>    _10 = x_1(D) != 6;
>>>    _11 = _9 | _10;
>>>    if (_11 != 0)
>>>      goto <bb 5>;
>>>    else
>>>      goto <bb 3>;
>>> 
>>> The second form ends up generating worse codegen however, and the
>>> badness starts with the dom1 pass.
>>> In the unpatched case it manages to deduce that x must be 6 by the time
>>> it reaches basic block 3 and
>>> uses that information to eliminate the shift in "_2 = 2 << x_1(D)" from
>>> basic block 3
>>> In the patched case it is unable to make that call, I think because the
>>> x != 6 condition is IORed
>>> with another test.
>>> 
>>> I'm not familiar with the internals of the dom pass, so I'm not sure
>>> where to go looking for a fix for this.
>>> Is the ifcombine change a step in the right direction? If so, what would
>>> need to be done to fix the issue with
>>> the dom pass?
>> I don't see how you can reasonably fix this in DOM.  if _9 or _10 is
>> true, then _11 is true.  But we can't reasonably record any kind of
>> equivalence for _9 or _10 individually.
>> 
>> If the statement
>> _11 = _9 | _10;
>> 
>> Were changed to
>> 
>> _11 = _9 & _10;
>> 
>> Then we could record something useful about _9 and _10.
>> 
>> 
>>> I suppose what we want is to not combine basic blocks if the sequence
>>> and conditions of the basic blocks are
>>> such that dom can potentially exploit them, but how do we express that?
>> I don't think there's going to be a way to directly express that.  You
>> could essentially claim that TRUTH_OR is more expensive than TRUTH_AND
>> because of the impact on DOM, but that in and of itself may not resolve
>> the situation either.
>> 
>> I think the question we need to answer is whether or not your changes
>> are generally better, even if there's specific instances where they make
>> things worse.  If the benefits outweigh the negatives then we can xfail
>> that test.
> 
> Ok, I'll investigate and benchmark some more.
> Andrew, this transformation to ifcombine (together with the restriction that the inner condition block
> has to be a single comparison) was added by you with r204194.
> Is there a particular reason for that restriction and why it is applied to the inner block and not either?

My reasoning at the time was there might be an "expensive" instruction or one that might trap (I did not check to see if the other part of the code was detecting that).
The outer block did not need any checks as we have something like
...
If (a)
  If (b)

Or
....
If (a)
  Goto f
else if (b)
 ....
Else
{
F:
....
}

And there was no need to check what was before the if (a) part just what is in between the two ifs. 

What I mean by expensive for an example is division or some function call. 

Thanks,
Andrew


> 
> Thanks,
> Kyrill
> 
> 
> 
>> 
>> jeff
>
Kyrylo Tkachov Sept. 23, 2015, 9:24 a.m. UTC | #4
On 23/09/15 10:09, Pinski, Andrew wrote:
>> On Sep 23, 2015, at 1:59 AM, Kyrill Tkachov <kyrylo.tkachov@arm.com> wrote:
>>
>>
>>> On 22/09/15 20:31, Jeff Law wrote:
>>>> On 09/22/2015 07:36 AM, Kyrill Tkachov wrote:
>>>> Hi all,
>>>> Unfortunately, I see a testsuite regression with this patch:
>>>> FAIL: gcc.dg/pr66299-2.c scan-tree-dump-not optimized "<<"
>>>>
>>>> The reduced part of that test is:
>>>> void
>>>> test1 (int x, unsigned u)
>>>> {
>>>>     if ((1U << x) != 64
>>>>         || (2 << x) != u
>>>>         || (x << x) != 384
>>>>         || (3 << x) == 9
>>>>         || (x << 14) != 98304U
>>>>         || (1 << x) == 14
>>>>         || (3 << 2) != 12)
>>>>       __builtin_abort ();
>>>> }
>>>>
>>>> The patched ifcombine pass works more or less as expected and produces
>>>> fewer basic blocks.
>>>> Before this patch a relevant part of the ifcombine dump for test1 is:
>>>> ;;   basic block 2, loop depth 0, count 0, freq 10000, maybe hot
>>>>     if (x_1(D) != 6)
>>>>       goto <bb 6>;
>>>>     else
>>>>       goto <bb 3>;
>>>>
>>>> ;;   basic block 3, loop depth 0, count 0, freq 9996, maybe hot
>>>>     _2 = 2 << x_1(D);
>>>>     _3 = (unsigned intD.10) _2;
>>>>     if (_3 != u_4(D))
>>>>       goto <bb 6>;
>>>>     else
>>>>       goto <bb 4>;
>>>>
>>>>
>>>> After this patch it is:
>>>> ;;   basic block 2, loop depth 0, count 0, freq 10000, maybe hot
>>>>     _2 = 2 << x_1(D);
>>>>     _3 = (unsigned intD.10) _2;
>>>>     _9 = _3 != u_4(D);
>>>>     _10 = x_1(D) != 6;
>>>>     _11 = _9 | _10;
>>>>     if (_11 != 0)
>>>>       goto <bb 5>;
>>>>     else
>>>>       goto <bb 3>;
>>>>
>>>> The second form ends up generating worse codegen however, and the
>>>> badness starts with the dom1 pass.
>>>> In the unpatched case it manages to deduce that x must be 6 by the time
>>>> it reaches basic block 3 and
>>>> uses that information to eliminate the shift in "_2 = 2 << x_1(D)" from
>>>> basic block 3
>>>> In the patched case it is unable to make that call, I think because the
>>>> x != 6 condition is IORed
>>>> with another test.
>>>>
>>>> I'm not familiar with the internals of the dom pass, so I'm not sure
>>>> where to go looking for a fix for this.
>>>> Is the ifcombine change a step in the right direction? If so, what would
>>>> need to be done to fix the issue with
>>>> the dom pass?
>>> I don't see how you can reasonably fix this in DOM.  if _9 or _10 is
>>> true, then _11 is true.  But we can't reasonably record any kind of
>>> equivalence for _9 or _10 individually.
>>>
>>> If the statement
>>> _11 = _9 | _10;
>>>
>>> Were changed to
>>>
>>> _11 = _9 & _10;
>>>
>>> Then we could record something useful about _9 and _10.
>>>
>>>
>>>> I suppose what we want is to not combine basic blocks if the sequence
>>>> and conditions of the basic blocks are
>>>> such that dom can potentially exploit them, but how do we express that?
>>> I don't think there's going to be a way to directly express that.  You
>>> could essentially claim that TRUTH_OR is more expensive than TRUTH_AND
>>> because of the impact on DOM, but that in and of itself may not resolve
>>> the situation either.
>>>
>>> I think the question we need to answer is whether or not your changes
>>> are generally better, even if there's specific instances where they make
>>> things worse.  If the benefits outweigh the negatives then we can xfail
>>> that test.
>> Ok, I'll investigate and benchmark some more.
>> Andrew, this transformation to ifcombine (together with the restriction that the inner condition block
>> has to be a single comparison) was added by you with r204194.
>> Is there a particular reason for that restriction and why it is applied to the inner block and not either?
> My reasoning at the time was there might be an "expensive" instruction or one that might trap (I did not check to see if the other part of the code was detecting that).
> The outer block did not need any checks as we have something like
> ...
> If (a)
>    If (b)
>
> Or
> ....
> If (a)
>    Goto f
> else if (b)
>   ....
> Else
> {
> F:
> ....
> }
>
> And there was no need to check what was before the if (a) part just what is in between the two ifs.

Ah, because the code in outer_cond_bb would have to be executed anyway whether
we perform the conversion or not, right?

Thanks,
Kyrill

>
> What I mean by expensive for an example is division or some function call.
>
> Thanks,
> Andrew
>
>
>> Thanks,
>> Kyrill
>>
>>
>>
>>> jeff
Richard Biener Sept. 23, 2015, 10:10 a.m. UTC | #5
On Wed, 23 Sep 2015, Kyrill Tkachov wrote:

> 
> On 23/09/15 10:09, Pinski, Andrew wrote:
> > > On Sep 23, 2015, at 1:59 AM, Kyrill Tkachov <kyrylo.tkachov@arm.com>
> > > wrote:
> > > 
> > > 
> > > > On 22/09/15 20:31, Jeff Law wrote:
> > > > > On 09/22/2015 07:36 AM, Kyrill Tkachov wrote:
> > > > > Hi all,
> > > > > Unfortunately, I see a testsuite regression with this patch:
> > > > > FAIL: gcc.dg/pr66299-2.c scan-tree-dump-not optimized "<<"
> > > > > 
> > > > > The reduced part of that test is:
> > > > > void
> > > > > test1 (int x, unsigned u)
> > > > > {
> > > > >     if ((1U << x) != 64
> > > > >         || (2 << x) != u
> > > > >         || (x << x) != 384
> > > > >         || (3 << x) == 9
> > > > >         || (x << 14) != 98304U
> > > > >         || (1 << x) == 14
> > > > >         || (3 << 2) != 12)
> > > > >       __builtin_abort ();
> > > > > }
> > > > > 
> > > > > The patched ifcombine pass works more or less as expected and produces
> > > > > fewer basic blocks.
> > > > > Before this patch a relevant part of the ifcombine dump for test1 is:
> > > > > ;;   basic block 2, loop depth 0, count 0, freq 10000, maybe hot
> > > > >     if (x_1(D) != 6)
> > > > >       goto <bb 6>;
> > > > >     else
> > > > >       goto <bb 3>;
> > > > > 
> > > > > ;;   basic block 3, loop depth 0, count 0, freq 9996, maybe hot
> > > > >     _2 = 2 << x_1(D);
> > > > >     _3 = (unsigned intD.10) _2;
> > > > >     if (_3 != u_4(D))
> > > > >       goto <bb 6>;
> > > > >     else
> > > > >       goto <bb 4>;
> > > > > 
> > > > > 
> > > > > After this patch it is:
> > > > > ;;   basic block 2, loop depth 0, count 0, freq 10000, maybe hot
> > > > >     _2 = 2 << x_1(D);
> > > > >     _3 = (unsigned intD.10) _2;
> > > > >     _9 = _3 != u_4(D);
> > > > >     _10 = x_1(D) != 6;
> > > > >     _11 = _9 | _10;
> > > > >     if (_11 != 0)
> > > > >       goto <bb 5>;
> > > > >     else
> > > > >       goto <bb 3>;
> > > > > 
> > > > > The second form ends up generating worse codegen however, and the
> > > > > badness starts with the dom1 pass.
> > > > > In the unpatched case it manages to deduce that x must be 6 by the
> > > > > time
> > > > > it reaches basic block 3 and
> > > > > uses that information to eliminate the shift in "_2 = 2 << x_1(D)"
> > > > > from
> > > > > basic block 3
> > > > > In the patched case it is unable to make that call, I think because
> > > > > the
> > > > > x != 6 condition is IORed
> > > > > with another test.
> > > > > 
> > > > > I'm not familiar with the internals of the dom pass, so I'm not sure
> > > > > where to go looking for a fix for this.
> > > > > Is the ifcombine change a step in the right direction? If so, what
> > > > > would
> > > > > need to be done to fix the issue with
> > > > > the dom pass?
> > > > I don't see how you can reasonably fix this in DOM.  if _9 or _10 is
> > > > true, then _11 is true.  But we can't reasonably record any kind of
> > > > equivalence for _9 or _10 individually.
> > > > 
> > > > If the statement
> > > > _11 = _9 | _10;
> > > > 
> > > > Were changed to
> > > > 
> > > > _11 = _9 & _10;
> > > > 
> > > > Then we could record something useful about _9 and _10.
> > > > 
> > > > 
> > > > > I suppose what we want is to not combine basic blocks if the sequence
> > > > > and conditions of the basic blocks are
> > > > > such that dom can potentially exploit them, but how do we express
> > > > > that?
> > > > I don't think there's going to be a way to directly express that.  You
> > > > could essentially claim that TRUTH_OR is more expensive than TRUTH_AND
> > > > because of the impact on DOM, but that in and of itself may not resolve
> > > > the situation either.
> > > > 
> > > > I think the question we need to answer is whether or not your changes
> > > > are generally better, even if there's specific instances where they make
> > > > things worse.  If the benefits outweigh the negatives then we can xfail
> > > > that test.
> > > Ok, I'll investigate and benchmark some more.
> > > Andrew, this transformation to ifcombine (together with the restriction
> > > that the inner condition block
> > > has to be a single comparison) was added by you with r204194.
> > > Is there a particular reason for that restriction and why it is applied to
> > > the inner block and not either?
> > My reasoning at the time was there might be an "expensive" instruction or
> > one that might trap (I did not check to see if the other part of the code
> > was detecting that).
> > The outer block did not need any checks as we have something like
> > ...
> > If (a)
> >    If (b)
> > 
> > Or
> > ....
> > If (a)
> >    Goto f
> > else if (b)
> >   ....
> > Else
> > {
> > F:
> > ....
> > }
> > 
> > And there was no need to check what was before the if (a) part just what is
> > in between the two ifs.
> 
> Ah, because the code in outer_cond_bb would have to be executed anyway whether
> we perform the conversion or not, right?

All ifcombine transforms make the outer condition unconditionally 
true/false thus the check should have been on whether the outer
cond BB is "empty".  Which would solve your problem, right?

Note that other transforms (bit test recognition) don't care (sth
we might want to fix?).

In general this needs a better cost function, maybe simply use
estimate_num_insns with speed estimates and compare against a
new --param.

Thanks,
Richard.

> Thanks,
> Kyrill
> 
> > 
> > What I mean by expensive for an example is division or some function call.
> > 
> > Thanks,
> > Andrew
> > 
> > 
> > > Thanks,
> > > Kyrill
> > > 
> > > 
> > > 
> > > > jeff
> 
>
Kyrylo Tkachov Sept. 23, 2015, 11:24 a.m. UTC | #6
On 23/09/15 11:10, Richard Biener wrote:
> On Wed, 23 Sep 2015, Kyrill Tkachov wrote:
>
>> On 23/09/15 10:09, Pinski, Andrew wrote:
>>>> On Sep 23, 2015, at 1:59 AM, Kyrill Tkachov <kyrylo.tkachov@arm.com>
>>>> wrote:
>>>>
>>>>
>>>>> On 22/09/15 20:31, Jeff Law wrote:
>>>>>> On 09/22/2015 07:36 AM, Kyrill Tkachov wrote:
>>>>>> Hi all,
>>>>>> Unfortunately, I see a testsuite regression with this patch:
>>>>>> FAIL: gcc.dg/pr66299-2.c scan-tree-dump-not optimized "<<"
>>>>>>
>>>>>> The reduced part of that test is:
>>>>>> void
>>>>>> test1 (int x, unsigned u)
>>>>>> {
>>>>>>      if ((1U << x) != 64
>>>>>>          || (2 << x) != u
>>>>>>          || (x << x) != 384
>>>>>>          || (3 << x) == 9
>>>>>>          || (x << 14) != 98304U
>>>>>>          || (1 << x) == 14
>>>>>>          || (3 << 2) != 12)
>>>>>>        __builtin_abort ();
>>>>>> }
>>>>>>
>>>>>> The patched ifcombine pass works more or less as expected and produces
>>>>>> fewer basic blocks.
>>>>>> Before this patch a relevant part of the ifcombine dump for test1 is:
>>>>>> ;;   basic block 2, loop depth 0, count 0, freq 10000, maybe hot
>>>>>>      if (x_1(D) != 6)
>>>>>>        goto <bb 6>;
>>>>>>      else
>>>>>>        goto <bb 3>;
>>>>>>
>>>>>> ;;   basic block 3, loop depth 0, count 0, freq 9996, maybe hot
>>>>>>      _2 = 2 << x_1(D);
>>>>>>      _3 = (unsigned intD.10) _2;
>>>>>>      if (_3 != u_4(D))
>>>>>>        goto <bb 6>;
>>>>>>      else
>>>>>>        goto <bb 4>;
>>>>>>
>>>>>>
>>>>>> After this patch it is:
>>>>>> ;;   basic block 2, loop depth 0, count 0, freq 10000, maybe hot
>>>>>>      _2 = 2 << x_1(D);
>>>>>>      _3 = (unsigned intD.10) _2;
>>>>>>      _9 = _3 != u_4(D);
>>>>>>      _10 = x_1(D) != 6;
>>>>>>      _11 = _9 | _10;
>>>>>>      if (_11 != 0)
>>>>>>        goto <bb 5>;
>>>>>>      else
>>>>>>        goto <bb 3>;
>>>>>>
>>>>>> The second form ends up generating worse codegen however, and the
>>>>>> badness starts with the dom1 pass.
>>>>>> In the unpatched case it manages to deduce that x must be 6 by the
>>>>>> time
>>>>>> it reaches basic block 3 and
>>>>>> uses that information to eliminate the shift in "_2 = 2 << x_1(D)"
>>>>>> from
>>>>>> basic block 3
>>>>>> In the patched case it is unable to make that call, I think because
>>>>>> the
>>>>>> x != 6 condition is IORed
>>>>>> with another test.
>>>>>>
>>>>>> I'm not familiar with the internals of the dom pass, so I'm not sure
>>>>>> where to go looking for a fix for this.
>>>>>> Is the ifcombine change a step in the right direction? If so, what
>>>>>> would
>>>>>> need to be done to fix the issue with
>>>>>> the dom pass?
>>>>> I don't see how you can reasonably fix this in DOM.  if _9 or _10 is
>>>>> true, then _11 is true.  But we can't reasonably record any kind of
>>>>> equivalence for _9 or _10 individually.
>>>>>
>>>>> If the statement
>>>>> _11 = _9 | _10;
>>>>>
>>>>> Were changed to
>>>>>
>>>>> _11 = _9 & _10;
>>>>>
>>>>> Then we could record something useful about _9 and _10.
>>>>>
>>>>>
>>>>>> I suppose what we want is to not combine basic blocks if the sequence
>>>>>> and conditions of the basic blocks are
>>>>>> such that dom can potentially exploit them, but how do we express
>>>>>> that?
>>>>> I don't think there's going to be a way to directly express that.  You
>>>>> could essentially claim that TRUTH_OR is more expensive than TRUTH_AND
>>>>> because of the impact on DOM, but that in and of itself may not resolve
>>>>> the situation either.
>>>>>
>>>>> I think the question we need to answer is whether or not your changes
>>>>> are generally better, even if there's specific instances where they make
>>>>> things worse.  If the benefits outweigh the negatives then we can xfail
>>>>> that test.
>>>> Ok, I'll investigate and benchmark some more.
>>>> Andrew, this transformation to ifcombine (together with the restriction
>>>> that the inner condition block
>>>> has to be a single comparison) was added by you with r204194.
>>>> Is there a particular reason for that restriction and why it is applied to
>>>> the inner block and not either?
>>> My reasoning at the time was there might be an "expensive" instruction or
>>> one that might trap (I did not check to see if the other part of the code
>>> was detecting that).
>>> The outer block did not need any checks as we have something like
>>> ...
>>> If (a)
>>>     If (b)
>>>
>>> Or
>>> ....
>>> If (a)
>>>     Goto f
>>> else if (b)
>>>    ....
>>> Else
>>> {
>>> F:
>>> ....
>>> }
>>>
>>> And there was no need to check what was before the if (a) part just what is
>>> in between the two ifs.
>> Ah, because the code in outer_cond_bb would have to be executed anyway whether
>> we perform the conversion or not, right?
> All ifcombine transforms make the outer condition unconditionally
> true/false thus the check should have been on whether the outer
> cond BB is "empty".  Which would solve your problem, right?

I'm not sure I follow. Why does cond bb has to be empty?

>
> Note that other transforms (bit test recognition) don't care (sth
> we might want to fix?).
>
> In general this needs a better cost function, maybe simply use
> estimate_num_insns with speed estimates and compare against a
> new --param.

Thanks, that looks like a starting point.
If we were add some kind of costing check here, would we even need
the checks mentioned above? I don't think it will affect correctness
(the inner cond bb is checked for no side-effects before entering this function).

Thanks,
Kyrill

>
> Thanks,
> Richard.
>
>> Thanks,
>> Kyrill
>>
>>> What I mean by expensive for an example is division or some function call.
>>>
>>> Thanks,
>>> Andrew
>>>
>>>
>>>> Thanks,
>>>> Kyrill
>>>>
>>>>
>>>>
>>>>> jeff
>>
Richard Biener Sept. 23, 2015, 11:25 a.m. UTC | #7
On Wed, 23 Sep 2015, Kyrill Tkachov wrote:

> 
> On 23/09/15 11:10, Richard Biener wrote:
> > On Wed, 23 Sep 2015, Kyrill Tkachov wrote:
> > 
> > > On 23/09/15 10:09, Pinski, Andrew wrote:
> > > > > On Sep 23, 2015, at 1:59 AM, Kyrill Tkachov <kyrylo.tkachov@arm.com>
> > > > > wrote:
> > > > > 
> > > > > 
> > > > > > On 22/09/15 20:31, Jeff Law wrote:
> > > > > > > On 09/22/2015 07:36 AM, Kyrill Tkachov wrote:
> > > > > > > Hi all,
> > > > > > > Unfortunately, I see a testsuite regression with this patch:
> > > > > > > FAIL: gcc.dg/pr66299-2.c scan-tree-dump-not optimized "<<"
> > > > > > > 
> > > > > > > The reduced part of that test is:
> > > > > > > void
> > > > > > > test1 (int x, unsigned u)
> > > > > > > {
> > > > > > >      if ((1U << x) != 64
> > > > > > >          || (2 << x) != u
> > > > > > >          || (x << x) != 384
> > > > > > >          || (3 << x) == 9
> > > > > > >          || (x << 14) != 98304U
> > > > > > >          || (1 << x) == 14
> > > > > > >          || (3 << 2) != 12)
> > > > > > >        __builtin_abort ();
> > > > > > > }
> > > > > > > 
> > > > > > > The patched ifcombine pass works more or less as expected and
> > > > > > > produces
> > > > > > > fewer basic blocks.
> > > > > > > Before this patch a relevant part of the ifcombine dump for test1
> > > > > > > is:
> > > > > > > ;;   basic block 2, loop depth 0, count 0, freq 10000, maybe hot
> > > > > > >      if (x_1(D) != 6)
> > > > > > >        goto <bb 6>;
> > > > > > >      else
> > > > > > >        goto <bb 3>;
> > > > > > > 
> > > > > > > ;;   basic block 3, loop depth 0, count 0, freq 9996, maybe hot
> > > > > > >      _2 = 2 << x_1(D);
> > > > > > >      _3 = (unsigned intD.10) _2;
> > > > > > >      if (_3 != u_4(D))
> > > > > > >        goto <bb 6>;
> > > > > > >      else
> > > > > > >        goto <bb 4>;
> > > > > > > 
> > > > > > > 
> > > > > > > After this patch it is:
> > > > > > > ;;   basic block 2, loop depth 0, count 0, freq 10000, maybe hot
> > > > > > >      _2 = 2 << x_1(D);
> > > > > > >      _3 = (unsigned intD.10) _2;
> > > > > > >      _9 = _3 != u_4(D);
> > > > > > >      _10 = x_1(D) != 6;
> > > > > > >      _11 = _9 | _10;
> > > > > > >      if (_11 != 0)
> > > > > > >        goto <bb 5>;
> > > > > > >      else
> > > > > > >        goto <bb 3>;
> > > > > > > 
> > > > > > > The second form ends up generating worse codegen however, and the
> > > > > > > badness starts with the dom1 pass.
> > > > > > > In the unpatched case it manages to deduce that x must be 6 by the
> > > > > > > time
> > > > > > > it reaches basic block 3 and
> > > > > > > uses that information to eliminate the shift in "_2 = 2 << x_1(D)"
> > > > > > > from
> > > > > > > basic block 3
> > > > > > > In the patched case it is unable to make that call, I think
> > > > > > > because
> > > > > > > the
> > > > > > > x != 6 condition is IORed
> > > > > > > with another test.
> > > > > > > 
> > > > > > > I'm not familiar with the internals of the dom pass, so I'm not
> > > > > > > sure
> > > > > > > where to go looking for a fix for this.
> > > > > > > Is the ifcombine change a step in the right direction? If so, what
> > > > > > > would
> > > > > > > need to be done to fix the issue with
> > > > > > > the dom pass?
> > > > > > I don't see how you can reasonably fix this in DOM.  if _9 or _10 is
> > > > > > true, then _11 is true.  But we can't reasonably record any kind of
> > > > > > equivalence for _9 or _10 individually.
> > > > > > 
> > > > > > If the statement
> > > > > > _11 = _9 | _10;
> > > > > > 
> > > > > > Were changed to
> > > > > > 
> > > > > > _11 = _9 & _10;
> > > > > > 
> > > > > > Then we could record something useful about _9 and _10.
> > > > > > 
> > > > > > 
> > > > > > > I suppose what we want is to not combine basic blocks if the
> > > > > > > sequence
> > > > > > > and conditions of the basic blocks are
> > > > > > > such that dom can potentially exploit them, but how do we express
> > > > > > > that?
> > > > > > I don't think there's going to be a way to directly express that.
> > > > > > You
> > > > > > could essentially claim that TRUTH_OR is more expensive than
> > > > > > TRUTH_AND
> > > > > > because of the impact on DOM, but that in and of itself may not
> > > > > > resolve
> > > > > > the situation either.
> > > > > > 
> > > > > > I think the question we need to answer is whether or not your
> > > > > > changes
> > > > > > are generally better, even if there's specific instances where they
> > > > > > make
> > > > > > things worse.  If the benefits outweigh the negatives then we can
> > > > > > xfail
> > > > > > that test.
> > > > > Ok, I'll investigate and benchmark some more.
> > > > > Andrew, this transformation to ifcombine (together with the
> > > > > restriction
> > > > > that the inner condition block
> > > > > has to be a single comparison) was added by you with r204194.
> > > > > Is there a particular reason for that restriction and why it is
> > > > > applied to
> > > > > the inner block and not either?
> > > > My reasoning at the time was there might be an "expensive" instruction
> > > > or
> > > > one that might trap (I did not check to see if the other part of the
> > > > code
> > > > was detecting that).
> > > > The outer block did not need any checks as we have something like
> > > > ...
> > > > If (a)
> > > >     If (b)
> > > > 
> > > > Or
> > > > ....
> > > > If (a)
> > > >     Goto f
> > > > else if (b)
> > > >    ....
> > > > Else
> > > > {
> > > > F:
> > > > ....
> > > > }
> > > > 
> > > > And there was no need to check what was before the if (a) part just what
> > > > is
> > > > in between the two ifs.
> > > Ah, because the code in outer_cond_bb would have to be executed anyway
> > > whether
> > > we perform the conversion or not, right?
> > All ifcombine transforms make the outer condition unconditionally
> > true/false thus the check should have been on whether the outer
> > cond BB is "empty".  Which would solve your problem, right?
> 
> I'm not sure I follow. Why does cond bb has to be empty?

Sorry, I mixed it up.  Of course the cond is at the end of cond bb...

> > 
> > Note that other transforms (bit test recognition) don't care (sth
> > we might want to fix?).
> > 
> > In general this needs a better cost function, maybe simply use
> > estimate_num_insns with speed estimates and compare against a
> > new --param.
> 
> Thanks, that looks like a starting point.
> If we were add some kind of costing check here, would we even need
> the checks mentioned above? I don't think it will affect correctness
> (the inner cond bb is checked for no side-effects before entering this
> function).

No, we'd replace the check with the cost function.

Richard.

> Thanks,
> Kyrill
> 
> > 
> > Thanks,
> > Richard.
> > 
> > > Thanks,
> > > Kyrill
> > > 
> > > > What I mean by expensive for an example is division or some function
> > > > call.
> > > > 
> > > > Thanks,
> > > > Andrew
> > > > 
> > > > 
> > > > > Thanks,
> > > > > Kyrill
> > > > > 
> > > > > 
> > > > > 
> > > > > > jeff
> > > 
> 
>
Kyrylo Tkachov Sept. 25, 2015, 9:45 a.m. UTC | #8
Hi all,

On 23/09/15 11:10, Richard Biener wrote:
> On Wed, 23 Sep 2015, Kyrill Tkachov wrote:
>
>> On 23/09/15 10:09, Pinski, Andrew wrote:
>>>> On Sep 23, 2015, at 1:59 AM, Kyrill Tkachov <kyrylo.tkachov@arm.com>
>>>> wrote:
>>>>
>>>>
>>>>> On 22/09/15 20:31, Jeff Law wrote:
>>>>>> On 09/22/2015 07:36 AM, Kyrill Tkachov wrote:
>>>>>> Hi all,
>>>>>> Unfortunately, I see a testsuite regression with this patch:
>>>>>> FAIL: gcc.dg/pr66299-2.c scan-tree-dump-not optimized "<<"
>>>>>>
>>>>>> The reduced part of that test is:
>>>>>> void
>>>>>> test1 (int x, unsigned u)
>>>>>> {
>>>>>>      if ((1U << x) != 64
>>>>>>          || (2 << x) != u
>>>>>>          || (x << x) != 384
>>>>>>          || (3 << x) == 9
>>>>>>          || (x << 14) != 98304U
>>>>>>          || (1 << x) == 14
>>>>>>          || (3 << 2) != 12)
>>>>>>        __builtin_abort ();
>>>>>> }
>>>>>>
>>>>>> The patched ifcombine pass works more or less as expected and produces
>>>>>> fewer basic blocks.
>>>>>> Before this patch a relevant part of the ifcombine dump for test1 is:
>>>>>> ;;   basic block 2, loop depth 0, count 0, freq 10000, maybe hot
>>>>>>      if (x_1(D) != 6)
>>>>>>        goto <bb 6>;
>>>>>>      else
>>>>>>        goto <bb 3>;
>>>>>>
>>>>>> ;;   basic block 3, loop depth 0, count 0, freq 9996, maybe hot
>>>>>>      _2 = 2 << x_1(D);
>>>>>>      _3 = (unsigned intD.10) _2;
>>>>>>      if (_3 != u_4(D))
>>>>>>        goto <bb 6>;
>>>>>>      else
>>>>>>        goto <bb 4>;
>>>>>>
>>>>>>
>>>>>> After this patch it is:
>>>>>> ;;   basic block 2, loop depth 0, count 0, freq 10000, maybe hot
>>>>>>      _2 = 2 << x_1(D);
>>>>>>      _3 = (unsigned intD.10) _2;
>>>>>>      _9 = _3 != u_4(D);
>>>>>>      _10 = x_1(D) != 6;
>>>>>>      _11 = _9 | _10;
>>>>>>      if (_11 != 0)
>>>>>>        goto <bb 5>;
>>>>>>      else
>>>>>>        goto <bb 3>;
>>>>>>
>>>>>> The second form ends up generating worse codegen however, and the
>>>>>> badness starts with the dom1 pass.
>>>>>> In the unpatched case it manages to deduce that x must be 6 by the
>>>>>> time
>>>>>> it reaches basic block 3 and
>>>>>> uses that information to eliminate the shift in "_2 = 2 << x_1(D)"
>>>>>> from
>>>>>> basic block 3
>>>>>> In the patched case it is unable to make that call, I think because
>>>>>> the
>>>>>> x != 6 condition is IORed
>>>>>> with another test.
>>>>>>
>>>>>> I'm not familiar with the internals of the dom pass, so I'm not sure
>>>>>> where to go looking for a fix for this.
>>>>>> Is the ifcombine change a step in the right direction? If so, what
>>>>>> would
>>>>>> need to be done to fix the issue with
>>>>>> the dom pass?
>>>>> I don't see how you can reasonably fix this in DOM.  if _9 or _10 is
>>>>> true, then _11 is true.  But we can't reasonably record any kind of
>>>>> equivalence for _9 or _10 individually.
>>>>>
>>>>> If the statement
>>>>> _11 = _9 | _10;
>>>>>
>>>>> Were changed to
>>>>>
>>>>> _11 = _9 & _10;
>>>>>
>>>>> Then we could record something useful about _9 and _10.
>>>>>
>>>>>
>>>>>> I suppose what we want is to not combine basic blocks if the sequence
>>>>>> and conditions of the basic blocks are
>>>>>> such that dom can potentially exploit them, but how do we express
>>>>>> that?
>>>>> I don't think there's going to be a way to directly express that.  You
>>>>> could essentially claim that TRUTH_OR is more expensive than TRUTH_AND
>>>>> because of the impact on DOM, but that in and of itself may not resolve
>>>>> the situation either.
>>>>>
>>>>> I think the question we need to answer is whether or not your changes
>>>>> are generally better, even if there's specific instances where they make
>>>>> things worse.  If the benefits outweigh the negatives then we can xfail
>>>>> that test.
>>>> Ok, I'll investigate and benchmark some more.
>>>> Andrew, this transformation to ifcombine (together with the restriction
>>>> that the inner condition block
>>>> has to be a single comparison) was added by you with r204194.
>>>> Is there a particular reason for that restriction and why it is applied to
>>>> the inner block and not either?
>>> My reasoning at the time was there might be an "expensive" instruction or
>>> one that might trap (I did not check to see if the other part of the code
>>> was detecting that).
>>> The outer block did not need any checks as we have something like
>>> ...
>>> If (a)
>>>     If (b)
>>>
>>> Or
>>> ....
>>> If (a)
>>>     Goto f
>>> else if (b)
>>>    ....
>>> Else
>>> {
>>> F:
>>> ....
>>> }
>>>
>>> And there was no need to check what was before the if (a) part just what is
>>> in between the two ifs.
>> Ah, because the code in outer_cond_bb would have to be executed anyway whether
>> we perform the conversion or not, right?
> All ifcombine transforms make the outer condition unconditionally
> true/false thus the check should have been on whether the outer
> cond BB is "empty".  Which would solve your problem, right?
>
> Note that other transforms (bit test recognition) don't care (sth
> we might want to fix?).
>
> In general this needs a better cost function, maybe simply use
> estimate_num_insns with speed estimates and compare against a
> new --param.

So I've played around with that code and I think I'd like to make it a bit
more intricate. Just comparing against estimate_num_insns is too coarse-grained and
is just a comparison by a magic number number and I've been struggling to make this
aggressive enough without pulling in too much code into the unconditional path.

As far as aarch64 is concerned I want to make this transformation more aggressive when
it can facilitate conditional comparison generation during expand. This means that I want
to allow the cases where the inner block contains comparisons, combined using logical operations
like TRUTH_AND_EXPR, TRUTH_IOR_EXPR, TRUTH_NOT_EXPR, their bitwise variants etc.
The expand code in ccmp.c knows how to handle these chains of comparisons and emit the appropriate
conditional compare patterns. If, however, the inner block contains other types of operations like
arithmetic ops, pointer dereferencing etc I'd want to be conservative to avoid pulling in operations
that don't facilitate the ccmp.c expansions.

So what I'm proposing is:
- If a target supports conditional comparisons through TARGET_GEN_CCMP_FIRST
(currently only aarch64) then we allow the aforementioned comparisons+logical-combinations blocks.
If the block also contains other types of operations we apply the estimate_num_insns cost comparison
with the default value for the comparison picked to be such so that it changes codegen the least from
the current situation i.e. one instruction. This value will be a new param that targets can increase
if they want to.

- If the target does not support conditional comparisons we follow only the second scheme (the conservative
estimate_num_insns comparison). This should cause the minimal codegen difference for the targets that
don't support conditional compares (which is all targets but aarch64) while allowing them to scale
the aggressiveness of this transformation if their benchmarking shows it to be beneficial.


  I believe such a scheme would avoid pulling in too much code that doesn't
facilitate conditional compares generation into the unconditional path and would minimise
impact on existing targets that don't do conditional compares.

Would that be an acceptable plan for the ifcombine_ifandif transformation in
tree-ssa-ifcombine.c (pending benchmarking, of course)?

Thanks,
Kyrill

>
> Thanks,
> Richard.
>
>> Thanks,
>> Kyrill
>>
>>> What I mean by expensive for an example is division or some function call.
>>>
>>> Thanks,
>>> Andrew
>>>
>>>
>>>> Thanks,
>>>> Kyrill
>>>>
>>>>
>>>>
>>>>> jeff
>>
Richard Biener Sept. 25, 2015, 9:49 a.m. UTC | #9
On Fri, 25 Sep 2015, Kyrill Tkachov wrote:

> Hi all,
> 
> On 23/09/15 11:10, Richard Biener wrote:
> > On Wed, 23 Sep 2015, Kyrill Tkachov wrote:
> > 
> > > On 23/09/15 10:09, Pinski, Andrew wrote:
> > > > > On Sep 23, 2015, at 1:59 AM, Kyrill Tkachov <kyrylo.tkachov@arm.com>
> > > > > wrote:
> > > > > 
> > > > > 
> > > > > > On 22/09/15 20:31, Jeff Law wrote:
> > > > > > > On 09/22/2015 07:36 AM, Kyrill Tkachov wrote:
> > > > > > > Hi all,
> > > > > > > Unfortunately, I see a testsuite regression with this patch:
> > > > > > > FAIL: gcc.dg/pr66299-2.c scan-tree-dump-not optimized "<<"
> > > > > > > 
> > > > > > > The reduced part of that test is:
> > > > > > > void
> > > > > > > test1 (int x, unsigned u)
> > > > > > > {
> > > > > > >      if ((1U << x) != 64
> > > > > > >          || (2 << x) != u
> > > > > > >          || (x << x) != 384
> > > > > > >          || (3 << x) == 9
> > > > > > >          || (x << 14) != 98304U
> > > > > > >          || (1 << x) == 14
> > > > > > >          || (3 << 2) != 12)
> > > > > > >        __builtin_abort ();
> > > > > > > }
> > > > > > > 
> > > > > > > The patched ifcombine pass works more or less as expected and
> > > > > > > produces
> > > > > > > fewer basic blocks.
> > > > > > > Before this patch a relevant part of the ifcombine dump for test1
> > > > > > > is:
> > > > > > > ;;   basic block 2, loop depth 0, count 0, freq 10000, maybe hot
> > > > > > >      if (x_1(D) != 6)
> > > > > > >        goto <bb 6>;
> > > > > > >      else
> > > > > > >        goto <bb 3>;
> > > > > > > 
> > > > > > > ;;   basic block 3, loop depth 0, count 0, freq 9996, maybe hot
> > > > > > >      _2 = 2 << x_1(D);
> > > > > > >      _3 = (unsigned intD.10) _2;
> > > > > > >      if (_3 != u_4(D))
> > > > > > >        goto <bb 6>;
> > > > > > >      else
> > > > > > >        goto <bb 4>;
> > > > > > > 
> > > > > > > 
> > > > > > > After this patch it is:
> > > > > > > ;;   basic block 2, loop depth 0, count 0, freq 10000, maybe hot
> > > > > > >      _2 = 2 << x_1(D);
> > > > > > >      _3 = (unsigned intD.10) _2;
> > > > > > >      _9 = _3 != u_4(D);
> > > > > > >      _10 = x_1(D) != 6;
> > > > > > >      _11 = _9 | _10;
> > > > > > >      if (_11 != 0)
> > > > > > >        goto <bb 5>;
> > > > > > >      else
> > > > > > >        goto <bb 3>;
> > > > > > > 
> > > > > > > The second form ends up generating worse codegen however, and the
> > > > > > > badness starts with the dom1 pass.
> > > > > > > In the unpatched case it manages to deduce that x must be 6 by the
> > > > > > > time
> > > > > > > it reaches basic block 3 and
> > > > > > > uses that information to eliminate the shift in "_2 = 2 << x_1(D)"
> > > > > > > from
> > > > > > > basic block 3
> > > > > > > In the patched case it is unable to make that call, I think
> > > > > > > because
> > > > > > > the
> > > > > > > x != 6 condition is IORed
> > > > > > > with another test.
> > > > > > > 
> > > > > > > I'm not familiar with the internals of the dom pass, so I'm not
> > > > > > > sure
> > > > > > > where to go looking for a fix for this.
> > > > > > > Is the ifcombine change a step in the right direction? If so, what
> > > > > > > would
> > > > > > > need to be done to fix the issue with
> > > > > > > the dom pass?
> > > > > > I don't see how you can reasonably fix this in DOM.  if _9 or _10 is
> > > > > > true, then _11 is true.  But we can't reasonably record any kind of
> > > > > > equivalence for _9 or _10 individually.
> > > > > > 
> > > > > > If the statement
> > > > > > _11 = _9 | _10;
> > > > > > 
> > > > > > Were changed to
> > > > > > 
> > > > > > _11 = _9 & _10;
> > > > > > 
> > > > > > Then we could record something useful about _9 and _10.
> > > > > > 
> > > > > > 
> > > > > > > I suppose what we want is to not combine basic blocks if the
> > > > > > > sequence
> > > > > > > and conditions of the basic blocks are
> > > > > > > such that dom can potentially exploit them, but how do we express
> > > > > > > that?
> > > > > > I don't think there's going to be a way to directly express that.
> > > > > > You
> > > > > > could essentially claim that TRUTH_OR is more expensive than
> > > > > > TRUTH_AND
> > > > > > because of the impact on DOM, but that in and of itself may not
> > > > > > resolve
> > > > > > the situation either.
> > > > > > 
> > > > > > I think the question we need to answer is whether or not your
> > > > > > changes
> > > > > > are generally better, even if there's specific instances where they
> > > > > > make
> > > > > > things worse.  If the benefits outweigh the negatives then we can
> > > > > > xfail
> > > > > > that test.
> > > > > Ok, I'll investigate and benchmark some more.
> > > > > Andrew, this transformation to ifcombine (together with the
> > > > > restriction
> > > > > that the inner condition block
> > > > > has to be a single comparison) was added by you with r204194.
> > > > > Is there a particular reason for that restriction and why it is
> > > > > applied to
> > > > > the inner block and not either?
> > > > My reasoning at the time was there might be an "expensive" instruction
> > > > or
> > > > one that might trap (I did not check to see if the other part of the
> > > > code
> > > > was detecting that).
> > > > The outer block did not need any checks as we have something like
> > > > ...
> > > > If (a)
> > > >     If (b)
> > > > 
> > > > Or
> > > > ....
> > > > If (a)
> > > >     Goto f
> > > > else if (b)
> > > >    ....
> > > > Else
> > > > {
> > > > F:
> > > > ....
> > > > }
> > > > 
> > > > And there was no need to check what was before the if (a) part just what
> > > > is
> > > > in between the two ifs.
> > > Ah, because the code in outer_cond_bb would have to be executed anyway
> > > whether
> > > we perform the conversion or not, right?
> > All ifcombine transforms make the outer condition unconditionally
> > true/false thus the check should have been on whether the outer
> > cond BB is "empty".  Which would solve your problem, right?
> > 
> > Note that other transforms (bit test recognition) don't care (sth
> > we might want to fix?).
> > 
> > In general this needs a better cost function, maybe simply use
> > estimate_num_insns with speed estimates and compare against a
> > new --param.
> 
> So I've played around with that code and I think I'd like to make it a bit
> more intricate. Just comparing against estimate_num_insns is too
> coarse-grained and
> is just a comparison by a magic number number and I've been struggling to make
> this
> aggressive enough without pulling in too much code into the unconditional
> path.
> 
> As far as aarch64 is concerned I want to make this transformation more
> aggressive when
> it can facilitate conditional comparison generation during expand. This means
> that I want
> to allow the cases where the inner block contains comparisons, combined using
> logical operations
> like TRUTH_AND_EXPR, TRUTH_IOR_EXPR, TRUTH_NOT_EXPR, their bitwise variants
> etc.
> The expand code in ccmp.c knows how to handle these chains of comparisons and
> emit the appropriate
> conditional compare patterns. If, however, the inner block contains other
> types of operations like
> arithmetic ops, pointer dereferencing etc I'd want to be conservative to avoid
> pulling in operations
> that don't facilitate the ccmp.c expansions.

So the inner block only contains stmts feeding a condition?  As we're
combining that condition with the outer one and the result simplified(?)
it makes sense to allow that as "empty" block generally.

No?

Thanks,
Richard.

> So what I'm proposing is:
> - If a target supports conditional comparisons through TARGET_GEN_CCMP_FIRST
> (currently only aarch64) then we allow the aforementioned
> comparisons+logical-combinations blocks.
> If the block also contains other types of operations we apply the
> estimate_num_insns cost comparison
> with the default value for the comparison picked to be such so that it changes
> codegen the least from
> the current situation i.e. one instruction. This value will be a new param
> that targets can increase
> if they want to.
> 
> - If the target does not support conditional comparisons we follow only the
> second scheme (the conservative
> estimate_num_insns comparison). This should cause the minimal codegen
> difference for the targets that
> don't support conditional compares (which is all targets but aarch64) while
> allowing them to scale
> the aggressiveness of this transformation if their benchmarking shows it to be
> beneficial.
> 
> 
>  I believe such a scheme would avoid pulling in too much code that doesn't
> facilitate conditional compares generation into the unconditional path and
> would minimise
> impact on existing targets that don't do conditional compares.
> 
> Would that be an acceptable plan for the ifcombine_ifandif transformation in
> tree-ssa-ifcombine.c (pending benchmarking, of course)?
> 
> Thanks,
> Kyrill
> 
> > 
> > Thanks,
> > Richard.
> > 
> > > Thanks,
> > > Kyrill
> > > 
> > > > What I mean by expensive for an example is division or some function
> > > > call.
> > > > 
> > > > Thanks,
> > > > Andrew
> > > > 
> > > > 
> > > > > Thanks,
> > > > > Kyrill
> > > > > 
> > > > > 
> > > > > 
> > > > > > jeff
> > > 
> 
>
Kyrylo Tkachov Sept. 25, 2015, 9:57 a.m. UTC | #10
On 25/09/15 10:49, Richard Biener wrote:
> On Fri, 25 Sep 2015, Kyrill Tkachov wrote:
>
>> Hi all,
>>
>> On 23/09/15 11:10, Richard Biener wrote:
>>> On Wed, 23 Sep 2015, Kyrill Tkachov wrote:
>>>
>>>> On 23/09/15 10:09, Pinski, Andrew wrote:
>>>>>> On Sep 23, 2015, at 1:59 AM, Kyrill Tkachov <kyrylo.tkachov@arm.com>
>>>>>> wrote:
>>>>>>
>>>>>>
>>>>>>> On 22/09/15 20:31, Jeff Law wrote:
>>>>>>>> On 09/22/2015 07:36 AM, Kyrill Tkachov wrote:
>>>>>>>> Hi all,
>>>>>>>> Unfortunately, I see a testsuite regression with this patch:
>>>>>>>> FAIL: gcc.dg/pr66299-2.c scan-tree-dump-not optimized "<<"
>>>>>>>>
>>>>>>>> The reduced part of that test is:
>>>>>>>> void
>>>>>>>> test1 (int x, unsigned u)
>>>>>>>> {
>>>>>>>>       if ((1U << x) != 64
>>>>>>>>           || (2 << x) != u
>>>>>>>>           || (x << x) != 384
>>>>>>>>           || (3 << x) == 9
>>>>>>>>           || (x << 14) != 98304U
>>>>>>>>           || (1 << x) == 14
>>>>>>>>           || (3 << 2) != 12)
>>>>>>>>         __builtin_abort ();
>>>>>>>> }
>>>>>>>>
>>>>>>>> The patched ifcombine pass works more or less as expected and
>>>>>>>> produces
>>>>>>>> fewer basic blocks.
>>>>>>>> Before this patch a relevant part of the ifcombine dump for test1
>>>>>>>> is:
>>>>>>>> ;;   basic block 2, loop depth 0, count 0, freq 10000, maybe hot
>>>>>>>>       if (x_1(D) != 6)
>>>>>>>>         goto <bb 6>;
>>>>>>>>       else
>>>>>>>>         goto <bb 3>;
>>>>>>>>
>>>>>>>> ;;   basic block 3, loop depth 0, count 0, freq 9996, maybe hot
>>>>>>>>       _2 = 2 << x_1(D);
>>>>>>>>       _3 = (unsigned intD.10) _2;
>>>>>>>>       if (_3 != u_4(D))
>>>>>>>>         goto <bb 6>;
>>>>>>>>       else
>>>>>>>>         goto <bb 4>;
>>>>>>>>
>>>>>>>>
>>>>>>>> After this patch it is:
>>>>>>>> ;;   basic block 2, loop depth 0, count 0, freq 10000, maybe hot
>>>>>>>>       _2 = 2 << x_1(D);
>>>>>>>>       _3 = (unsigned intD.10) _2;
>>>>>>>>       _9 = _3 != u_4(D);
>>>>>>>>       _10 = x_1(D) != 6;
>>>>>>>>       _11 = _9 | _10;
>>>>>>>>       if (_11 != 0)
>>>>>>>>         goto <bb 5>;
>>>>>>>>       else
>>>>>>>>         goto <bb 3>;
>>>>>>>>
>>>>>>>> The second form ends up generating worse codegen however, and the
>>>>>>>> badness starts with the dom1 pass.
>>>>>>>> In the unpatched case it manages to deduce that x must be 6 by the
>>>>>>>> time
>>>>>>>> it reaches basic block 3 and
>>>>>>>> uses that information to eliminate the shift in "_2 = 2 << x_1(D)"
>>>>>>>> from
>>>>>>>> basic block 3
>>>>>>>> In the patched case it is unable to make that call, I think
>>>>>>>> because
>>>>>>>> the
>>>>>>>> x != 6 condition is IORed
>>>>>>>> with another test.
>>>>>>>>
>>>>>>>> I'm not familiar with the internals of the dom pass, so I'm not
>>>>>>>> sure
>>>>>>>> where to go looking for a fix for this.
>>>>>>>> Is the ifcombine change a step in the right direction? If so, what
>>>>>>>> would
>>>>>>>> need to be done to fix the issue with
>>>>>>>> the dom pass?
>>>>>>> I don't see how you can reasonably fix this in DOM.  if _9 or _10 is
>>>>>>> true, then _11 is true.  But we can't reasonably record any kind of
>>>>>>> equivalence for _9 or _10 individually.
>>>>>>>
>>>>>>> If the statement
>>>>>>> _11 = _9 | _10;
>>>>>>>
>>>>>>> Were changed to
>>>>>>>
>>>>>>> _11 = _9 & _10;
>>>>>>>
>>>>>>> Then we could record something useful about _9 and _10.
>>>>>>>
>>>>>>>
>>>>>>>> I suppose what we want is to not combine basic blocks if the
>>>>>>>> sequence
>>>>>>>> and conditions of the basic blocks are
>>>>>>>> such that dom can potentially exploit them, but how do we express
>>>>>>>> that?
>>>>>>> I don't think there's going to be a way to directly express that.
>>>>>>> You
>>>>>>> could essentially claim that TRUTH_OR is more expensive than
>>>>>>> TRUTH_AND
>>>>>>> because of the impact on DOM, but that in and of itself may not
>>>>>>> resolve
>>>>>>> the situation either.
>>>>>>>
>>>>>>> I think the question we need to answer is whether or not your
>>>>>>> changes
>>>>>>> are generally better, even if there's specific instances where they
>>>>>>> make
>>>>>>> things worse.  If the benefits outweigh the negatives then we can
>>>>>>> xfail
>>>>>>> that test.
>>>>>> Ok, I'll investigate and benchmark some more.
>>>>>> Andrew, this transformation to ifcombine (together with the
>>>>>> restriction
>>>>>> that the inner condition block
>>>>>> has to be a single comparison) was added by you with r204194.
>>>>>> Is there a particular reason for that restriction and why it is
>>>>>> applied to
>>>>>> the inner block and not either?
>>>>> My reasoning at the time was there might be an "expensive" instruction
>>>>> or
>>>>> one that might trap (I did not check to see if the other part of the
>>>>> code
>>>>> was detecting that).
>>>>> The outer block did not need any checks as we have something like
>>>>> ...
>>>>> If (a)
>>>>>      If (b)
>>>>>
>>>>> Or
>>>>> ....
>>>>> If (a)
>>>>>      Goto f
>>>>> else if (b)
>>>>>     ....
>>>>> Else
>>>>> {
>>>>> F:
>>>>> ....
>>>>> }
>>>>>
>>>>> And there was no need to check what was before the if (a) part just what
>>>>> is
>>>>> in between the two ifs.
>>>> Ah, because the code in outer_cond_bb would have to be executed anyway
>>>> whether
>>>> we perform the conversion or not, right?
>>> All ifcombine transforms make the outer condition unconditionally
>>> true/false thus the check should have been on whether the outer
>>> cond BB is "empty".  Which would solve your problem, right?
>>>
>>> Note that other transforms (bit test recognition) don't care (sth
>>> we might want to fix?).
>>>
>>> In general this needs a better cost function, maybe simply use
>>> estimate_num_insns with speed estimates and compare against a
>>> new --param.
>> So I've played around with that code and I think I'd like to make it a bit
>> more intricate. Just comparing against estimate_num_insns is too
>> coarse-grained and
>> is just a comparison by a magic number number and I've been struggling to make
>> this
>> aggressive enough without pulling in too much code into the unconditional
>> path.
>>
>> As far as aarch64 is concerned I want to make this transformation more
>> aggressive when
>> it can facilitate conditional comparison generation during expand. This means
>> that I want
>> to allow the cases where the inner block contains comparisons, combined using
>> logical operations
>> like TRUTH_AND_EXPR, TRUTH_IOR_EXPR, TRUTH_NOT_EXPR, their bitwise variants
>> etc.
>> The expand code in ccmp.c knows how to handle these chains of comparisons and
>> emit the appropriate
>> conditional compare patterns. If, however, the inner block contains other
>> types of operations like
>> arithmetic ops, pointer dereferencing etc I'd want to be conservative to avoid
>> pulling in operations
>> that don't facilitate the ccmp.c expansions.
> So the inner block only contains stmts feeding a condition?  As we're
> combining that condition with the outer one and the result simplified(?)
> it makes sense to allow that as "empty" block generally.

My concern is that those stmts feeding the condition will
now be executed unconditionally (rather than only after jumping
to the inner_cond_bb) so we might end up doing redundant work
if the the outer condition would actually have jumped to the else block
rather than to the inner one.

Kyrill

>
> No?
>
> Thanks,
> Richard.
>
>> So what I'm proposing is:
>> - If a target supports conditional comparisons through TARGET_GEN_CCMP_FIRST
>> (currently only aarch64) then we allow the aforementioned
>> comparisons+logical-combinations blocks.
>> If the block also contains other types of operations we apply the
>> estimate_num_insns cost comparison
>> with the default value for the comparison picked to be such so that it changes
>> codegen the least from
>> the current situation i.e. one instruction. This value will be a new param
>> that targets can increase
>> if they want to.
>>
>> - If the target does not support conditional comparisons we follow only the
>> second scheme (the conservative
>> estimate_num_insns comparison). This should cause the minimal codegen
>> difference for the targets that
>> don't support conditional compares (which is all targets but aarch64) while
>> allowing them to scale
>> the aggressiveness of this transformation if their benchmarking shows it to be
>> beneficial.
>>
>>
>>   I believe such a scheme would avoid pulling in too much code that doesn't
>> facilitate conditional compares generation into the unconditional path and
>> would minimise
>> impact on existing targets that don't do conditional compares.
>>
>> Would that be an acceptable plan for the ifcombine_ifandif transformation in
>> tree-ssa-ifcombine.c (pending benchmarking, of course)?
>>
>> Thanks,
>> Kyrill
>>
>>> Thanks,
>>> Richard.
>>>
>>>> Thanks,
>>>> Kyrill
>>>>
>>>>> What I mean by expensive for an example is division or some function
>>>>> call.
>>>>>
>>>>> Thanks,
>>>>> Andrew
>>>>>
>>>>>
>>>>>> Thanks,
>>>>>> Kyrill
>>>>>>
>>>>>>
>>>>>>
>>>>>>> jeff
>>
Richard Biener Sept. 25, 2015, 10:15 a.m. UTC | #11
On Fri, 25 Sep 2015, Kyrill Tkachov wrote:

> 
> On 25/09/15 10:49, Richard Biener wrote:
> > On Fri, 25 Sep 2015, Kyrill Tkachov wrote:
> > 
> > > Hi all,
> > > 
> > > On 23/09/15 11:10, Richard Biener wrote:
> > > > On Wed, 23 Sep 2015, Kyrill Tkachov wrote:
> > > > 
> > > > > On 23/09/15 10:09, Pinski, Andrew wrote:
> > > > > > > On Sep 23, 2015, at 1:59 AM, Kyrill Tkachov
> > > > > > > <kyrylo.tkachov@arm.com>
> > > > > > > wrote:
> > > > > > > 
> > > > > > > 
> > > > > > > > On 22/09/15 20:31, Jeff Law wrote:
> > > > > > > > > On 09/22/2015 07:36 AM, Kyrill Tkachov wrote:
> > > > > > > > > Hi all,
> > > > > > > > > Unfortunately, I see a testsuite regression with this patch:
> > > > > > > > > FAIL: gcc.dg/pr66299-2.c scan-tree-dump-not optimized "<<"
> > > > > > > > > 
> > > > > > > > > The reduced part of that test is:
> > > > > > > > > void
> > > > > > > > > test1 (int x, unsigned u)
> > > > > > > > > {
> > > > > > > > >       if ((1U << x) != 64
> > > > > > > > >           || (2 << x) != u
> > > > > > > > >           || (x << x) != 384
> > > > > > > > >           || (3 << x) == 9
> > > > > > > > >           || (x << 14) != 98304U
> > > > > > > > >           || (1 << x) == 14
> > > > > > > > >           || (3 << 2) != 12)
> > > > > > > > >         __builtin_abort ();
> > > > > > > > > }
> > > > > > > > > 
> > > > > > > > > The patched ifcombine pass works more or less as expected and
> > > > > > > > > produces
> > > > > > > > > fewer basic blocks.
> > > > > > > > > Before this patch a relevant part of the ifcombine dump for
> > > > > > > > > test1
> > > > > > > > > is:
> > > > > > > > > ;;   basic block 2, loop depth 0, count 0, freq 10000, maybe
> > > > > > > > > hot
> > > > > > > > >       if (x_1(D) != 6)
> > > > > > > > >         goto <bb 6>;
> > > > > > > > >       else
> > > > > > > > >         goto <bb 3>;
> > > > > > > > > 
> > > > > > > > > ;;   basic block 3, loop depth 0, count 0, freq 9996, maybe
> > > > > > > > > hot
> > > > > > > > >       _2 = 2 << x_1(D);
> > > > > > > > >       _3 = (unsigned intD.10) _2;
> > > > > > > > >       if (_3 != u_4(D))
> > > > > > > > >         goto <bb 6>;
> > > > > > > > >       else
> > > > > > > > >         goto <bb 4>;
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > After this patch it is:
> > > > > > > > > ;;   basic block 2, loop depth 0, count 0, freq 10000, maybe
> > > > > > > > > hot
> > > > > > > > >       _2 = 2 << x_1(D);
> > > > > > > > >       _3 = (unsigned intD.10) _2;
> > > > > > > > >       _9 = _3 != u_4(D);
> > > > > > > > >       _10 = x_1(D) != 6;
> > > > > > > > >       _11 = _9 | _10;
> > > > > > > > >       if (_11 != 0)
> > > > > > > > >         goto <bb 5>;
> > > > > > > > >       else
> > > > > > > > >         goto <bb 3>;
> > > > > > > > > 
> > > > > > > > > The second form ends up generating worse codegen however, and
> > > > > > > > > the
> > > > > > > > > badness starts with the dom1 pass.
> > > > > > > > > In the unpatched case it manages to deduce that x must be 6 by
> > > > > > > > > the
> > > > > > > > > time
> > > > > > > > > it reaches basic block 3 and
> > > > > > > > > uses that information to eliminate the shift in "_2 = 2 <<
> > > > > > > > > x_1(D)"
> > > > > > > > > from
> > > > > > > > > basic block 3
> > > > > > > > > In the patched case it is unable to make that call, I think
> > > > > > > > > because
> > > > > > > > > the
> > > > > > > > > x != 6 condition is IORed
> > > > > > > > > with another test.
> > > > > > > > > 
> > > > > > > > > I'm not familiar with the internals of the dom pass, so I'm
> > > > > > > > > not
> > > > > > > > > sure
> > > > > > > > > where to go looking for a fix for this.
> > > > > > > > > Is the ifcombine change a step in the right direction? If so,
> > > > > > > > > what
> > > > > > > > > would
> > > > > > > > > need to be done to fix the issue with
> > > > > > > > > the dom pass?
> > > > > > > > I don't see how you can reasonably fix this in DOM.  if _9 or
> > > > > > > > _10 is
> > > > > > > > true, then _11 is true.  But we can't reasonably record any kind
> > > > > > > > of
> > > > > > > > equivalence for _9 or _10 individually.
> > > > > > > > 
> > > > > > > > If the statement
> > > > > > > > _11 = _9 | _10;
> > > > > > > > 
> > > > > > > > Were changed to
> > > > > > > > 
> > > > > > > > _11 = _9 & _10;
> > > > > > > > 
> > > > > > > > Then we could record something useful about _9 and _10.
> > > > > > > > 
> > > > > > > > 
> > > > > > > > > I suppose what we want is to not combine basic blocks if the
> > > > > > > > > sequence
> > > > > > > > > and conditions of the basic blocks are
> > > > > > > > > such that dom can potentially exploit them, but how do we
> > > > > > > > > express
> > > > > > > > > that?
> > > > > > > > I don't think there's going to be a way to directly express
> > > > > > > > that.
> > > > > > > > You
> > > > > > > > could essentially claim that TRUTH_OR is more expensive than
> > > > > > > > TRUTH_AND
> > > > > > > > because of the impact on DOM, but that in and of itself may not
> > > > > > > > resolve
> > > > > > > > the situation either.
> > > > > > > > 
> > > > > > > > I think the question we need to answer is whether or not your
> > > > > > > > changes
> > > > > > > > are generally better, even if there's specific instances where
> > > > > > > > they
> > > > > > > > make
> > > > > > > > things worse.  If the benefits outweigh the negatives then we
> > > > > > > > can
> > > > > > > > xfail
> > > > > > > > that test.
> > > > > > > Ok, I'll investigate and benchmark some more.
> > > > > > > Andrew, this transformation to ifcombine (together with the
> > > > > > > restriction
> > > > > > > that the inner condition block
> > > > > > > has to be a single comparison) was added by you with r204194.
> > > > > > > Is there a particular reason for that restriction and why it is
> > > > > > > applied to
> > > > > > > the inner block and not either?
> > > > > > My reasoning at the time was there might be an "expensive"
> > > > > > instruction
> > > > > > or
> > > > > > one that might trap (I did not check to see if the other part of the
> > > > > > code
> > > > > > was detecting that).
> > > > > > The outer block did not need any checks as we have something like
> > > > > > ...
> > > > > > If (a)
> > > > > >      If (b)
> > > > > > 
> > > > > > Or
> > > > > > ....
> > > > > > If (a)
> > > > > >      Goto f
> > > > > > else if (b)
> > > > > >     ....
> > > > > > Else
> > > > > > {
> > > > > > F:
> > > > > > ....
> > > > > > }
> > > > > > 
> > > > > > And there was no need to check what was before the if (a) part just
> > > > > > what
> > > > > > is
> > > > > > in between the two ifs.
> > > > > Ah, because the code in outer_cond_bb would have to be executed anyway
> > > > > whether
> > > > > we perform the conversion or not, right?
> > > > All ifcombine transforms make the outer condition unconditionally
> > > > true/false thus the check should have been on whether the outer
> > > > cond BB is "empty".  Which would solve your problem, right?
> > > > 
> > > > Note that other transforms (bit test recognition) don't care (sth
> > > > we might want to fix?).
> > > > 
> > > > In general this needs a better cost function, maybe simply use
> > > > estimate_num_insns with speed estimates and compare against a
> > > > new --param.
> > > So I've played around with that code and I think I'd like to make it a bit
> > > more intricate. Just comparing against estimate_num_insns is too
> > > coarse-grained and
> > > is just a comparison by a magic number number and I've been struggling to
> > > make
> > > this
> > > aggressive enough without pulling in too much code into the unconditional
> > > path.
> > > 
> > > As far as aarch64 is concerned I want to make this transformation more
> > > aggressive when
> > > it can facilitate conditional comparison generation during expand. This
> > > means
> > > that I want
> > > to allow the cases where the inner block contains comparisons, combined
> > > using
> > > logical operations
> > > like TRUTH_AND_EXPR, TRUTH_IOR_EXPR, TRUTH_NOT_EXPR, their bitwise
> > > variants
> > > etc.
> > > The expand code in ccmp.c knows how to handle these chains of comparisons
> > > and
> > > emit the appropriate
> > > conditional compare patterns. If, however, the inner block contains other
> > > types of operations like
> > > arithmetic ops, pointer dereferencing etc I'd want to be conservative to
> > > avoid
> > > pulling in operations
> > > that don't facilitate the ccmp.c expansions.
> > So the inner block only contains stmts feeding a condition?  As we're
> > combining that condition with the outer one and the result simplified(?)
> > it makes sense to allow that as "empty" block generally.
> 
> My concern is that those stmts feeding the condition will
> now be executed unconditionally (rather than only after jumping
> to the inner_cond_bb) so we might end up doing redundant work
> if the the outer condition would actually have jumped to the else block
> rather than to the inner one.

That's true.  But how is arm not affected by this?  That is,
what's so special about ccmps here?

> Kyrill
> 
> > 
> > No?
> > 
> > Thanks,
> > Richard.
> > 
> > > So what I'm proposing is:
> > > - If a target supports conditional comparisons through
> > > TARGET_GEN_CCMP_FIRST
> > > (currently only aarch64) then we allow the aforementioned
> > > comparisons+logical-combinations blocks.
> > > If the block also contains other types of operations we apply the
> > > estimate_num_insns cost comparison
> > > with the default value for the comparison picked to be such so that it
> > > changes
> > > codegen the least from
> > > the current situation i.e. one instruction. This value will be a new param
> > > that targets can increase
> > > if they want to.
> > > 
> > > - If the target does not support conditional comparisons we follow only
> > > the
> > > second scheme (the conservative
> > > estimate_num_insns comparison). This should cause the minimal codegen
> > > difference for the targets that
> > > don't support conditional compares (which is all targets but aarch64)
> > > while
> > > allowing them to scale
> > > the aggressiveness of this transformation if their benchmarking shows it
> > > to be
> > > beneficial.
> > > 
> > > 
> > >   I believe such a scheme would avoid pulling in too much code that
> > > doesn't
> > > facilitate conditional compares generation into the unconditional path and
> > > would minimise
> > > impact on existing targets that don't do conditional compares.
> > > 
> > > Would that be an acceptable plan for the ifcombine_ifandif transformation
> > > in
> > > tree-ssa-ifcombine.c (pending benchmarking, of course)?
> > > 
> > > Thanks,
> > > Kyrill
> > > 
> > > > Thanks,
> > > > Richard.
> > > > 
> > > > > Thanks,
> > > > > Kyrill
> > > > > 
> > > > > > What I mean by expensive for an example is division or some function
> > > > > > call.
> > > > > > 
> > > > > > Thanks,
> > > > > > Andrew
> > > > > > 
> > > > > > 
> > > > > > > Thanks,
> > > > > > > Kyrill
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > > jeff
> > > 
> 
>
Kyrylo Tkachov Sept. 25, 2015, 11:20 a.m. UTC | #12
On 25/09/15 11:15, Richard Biener wrote:
> On Fri, 25 Sep 2015, Kyrill Tkachov wrote:
>
>> On 25/09/15 10:49, Richard Biener wrote:
>>> On Fri, 25 Sep 2015, Kyrill Tkachov wrote:
>>>
>>>> Hi all,
>>>>
>>>> On 23/09/15 11:10, Richard Biener wrote:
>>>>> On Wed, 23 Sep 2015, Kyrill Tkachov wrote:
>>>>>
>>>>>> On 23/09/15 10:09, Pinski, Andrew wrote:
>>>>>>>> On Sep 23, 2015, at 1:59 AM, Kyrill Tkachov
>>>>>>>> <kyrylo.tkachov@arm.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>> On 22/09/15 20:31, Jeff Law wrote:
>>>>>>>>>> On 09/22/2015 07:36 AM, Kyrill Tkachov wrote:
>>>>>>>>>> Hi all,
>>>>>>>>>> Unfortunately, I see a testsuite regression with this patch:
>>>>>>>>>> FAIL: gcc.dg/pr66299-2.c scan-tree-dump-not optimized "<<"
>>>>>>>>>>
>>>>>>>>>> The reduced part of that test is:
>>>>>>>>>> void
>>>>>>>>>> test1 (int x, unsigned u)
>>>>>>>>>> {
>>>>>>>>>>        if ((1U << x) != 64
>>>>>>>>>>            || (2 << x) != u
>>>>>>>>>>            || (x << x) != 384
>>>>>>>>>>            || (3 << x) == 9
>>>>>>>>>>            || (x << 14) != 98304U
>>>>>>>>>>            || (1 << x) == 14
>>>>>>>>>>            || (3 << 2) != 12)
>>>>>>>>>>          __builtin_abort ();
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> The patched ifcombine pass works more or less as expected and
>>>>>>>>>> produces
>>>>>>>>>> fewer basic blocks.
>>>>>>>>>> Before this patch a relevant part of the ifcombine dump for
>>>>>>>>>> test1
>>>>>>>>>> is:
>>>>>>>>>> ;;   basic block 2, loop depth 0, count 0, freq 10000, maybe
>>>>>>>>>> hot
>>>>>>>>>>        if (x_1(D) != 6)
>>>>>>>>>>          goto <bb 6>;
>>>>>>>>>>        else
>>>>>>>>>>          goto <bb 3>;
>>>>>>>>>>
>>>>>>>>>> ;;   basic block 3, loop depth 0, count 0, freq 9996, maybe
>>>>>>>>>> hot
>>>>>>>>>>        _2 = 2 << x_1(D);
>>>>>>>>>>        _3 = (unsigned intD.10) _2;
>>>>>>>>>>        if (_3 != u_4(D))
>>>>>>>>>>          goto <bb 6>;
>>>>>>>>>>        else
>>>>>>>>>>          goto <bb 4>;
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> After this patch it is:
>>>>>>>>>> ;;   basic block 2, loop depth 0, count 0, freq 10000, maybe
>>>>>>>>>> hot
>>>>>>>>>>        _2 = 2 << x_1(D);
>>>>>>>>>>        _3 = (unsigned intD.10) _2;
>>>>>>>>>>        _9 = _3 != u_4(D);
>>>>>>>>>>        _10 = x_1(D) != 6;
>>>>>>>>>>        _11 = _9 | _10;
>>>>>>>>>>        if (_11 != 0)
>>>>>>>>>>          goto <bb 5>;
>>>>>>>>>>        else
>>>>>>>>>>          goto <bb 3>;
>>>>>>>>>>
>>>>>>>>>> The second form ends up generating worse codegen however, and
>>>>>>>>>> the
>>>>>>>>>> badness starts with the dom1 pass.
>>>>>>>>>> In the unpatched case it manages to deduce that x must be 6 by
>>>>>>>>>> the
>>>>>>>>>> time
>>>>>>>>>> it reaches basic block 3 and
>>>>>>>>>> uses that information to eliminate the shift in "_2 = 2 <<
>>>>>>>>>> x_1(D)"
>>>>>>>>>> from
>>>>>>>>>> basic block 3
>>>>>>>>>> In the patched case it is unable to make that call, I think
>>>>>>>>>> because
>>>>>>>>>> the
>>>>>>>>>> x != 6 condition is IORed
>>>>>>>>>> with another test.
>>>>>>>>>>
>>>>>>>>>> I'm not familiar with the internals of the dom pass, so I'm
>>>>>>>>>> not
>>>>>>>>>> sure
>>>>>>>>>> where to go looking for a fix for this.
>>>>>>>>>> Is the ifcombine change a step in the right direction? If so,
>>>>>>>>>> what
>>>>>>>>>> would
>>>>>>>>>> need to be done to fix the issue with
>>>>>>>>>> the dom pass?
>>>>>>>>> I don't see how you can reasonably fix this in DOM.  if _9 or
>>>>>>>>> _10 is
>>>>>>>>> true, then _11 is true.  But we can't reasonably record any kind
>>>>>>>>> of
>>>>>>>>> equivalence for _9 or _10 individually.
>>>>>>>>>
>>>>>>>>> If the statement
>>>>>>>>> _11 = _9 | _10;
>>>>>>>>>
>>>>>>>>> Were changed to
>>>>>>>>>
>>>>>>>>> _11 = _9 & _10;
>>>>>>>>>
>>>>>>>>> Then we could record something useful about _9 and _10.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> I suppose what we want is to not combine basic blocks if the
>>>>>>>>>> sequence
>>>>>>>>>> and conditions of the basic blocks are
>>>>>>>>>> such that dom can potentially exploit them, but how do we
>>>>>>>>>> express
>>>>>>>>>> that?
>>>>>>>>> I don't think there's going to be a way to directly express
>>>>>>>>> that.
>>>>>>>>> You
>>>>>>>>> could essentially claim that TRUTH_OR is more expensive than
>>>>>>>>> TRUTH_AND
>>>>>>>>> because of the impact on DOM, but that in and of itself may not
>>>>>>>>> resolve
>>>>>>>>> the situation either.
>>>>>>>>>
>>>>>>>>> I think the question we need to answer is whether or not your
>>>>>>>>> changes
>>>>>>>>> are generally better, even if there's specific instances where
>>>>>>>>> they
>>>>>>>>> make
>>>>>>>>> things worse.  If the benefits outweigh the negatives then we
>>>>>>>>> can
>>>>>>>>> xfail
>>>>>>>>> that test.
>>>>>>>> Ok, I'll investigate and benchmark some more.
>>>>>>>> Andrew, this transformation to ifcombine (together with the
>>>>>>>> restriction
>>>>>>>> that the inner condition block
>>>>>>>> has to be a single comparison) was added by you with r204194.
>>>>>>>> Is there a particular reason for that restriction and why it is
>>>>>>>> applied to
>>>>>>>> the inner block and not either?
>>>>>>> My reasoning at the time was there might be an "expensive"
>>>>>>> instruction
>>>>>>> or
>>>>>>> one that might trap (I did not check to see if the other part of the
>>>>>>> code
>>>>>>> was detecting that).
>>>>>>> The outer block did not need any checks as we have something like
>>>>>>> ...
>>>>>>> If (a)
>>>>>>>       If (b)
>>>>>>>
>>>>>>> Or
>>>>>>> ....
>>>>>>> If (a)
>>>>>>>       Goto f
>>>>>>> else if (b)
>>>>>>>      ....
>>>>>>> Else
>>>>>>> {
>>>>>>> F:
>>>>>>> ....
>>>>>>> }
>>>>>>>
>>>>>>> And there was no need to check what was before the if (a) part just
>>>>>>> what
>>>>>>> is
>>>>>>> in between the two ifs.
>>>>>> Ah, because the code in outer_cond_bb would have to be executed anyway
>>>>>> whether
>>>>>> we perform the conversion or not, right?
>>>>> All ifcombine transforms make the outer condition unconditionally
>>>>> true/false thus the check should have been on whether the outer
>>>>> cond BB is "empty".  Which would solve your problem, right?
>>>>>
>>>>> Note that other transforms (bit test recognition) don't care (sth
>>>>> we might want to fix?).
>>>>>
>>>>> In general this needs a better cost function, maybe simply use
>>>>> estimate_num_insns with speed estimates and compare against a
>>>>> new --param.
>>>> So I've played around with that code and I think I'd like to make it a bit
>>>> more intricate. Just comparing against estimate_num_insns is too
>>>> coarse-grained and
>>>> is just a comparison by a magic number number and I've been struggling to
>>>> make
>>>> this
>>>> aggressive enough without pulling in too much code into the unconditional
>>>> path.
>>>>
>>>> As far as aarch64 is concerned I want to make this transformation more
>>>> aggressive when
>>>> it can facilitate conditional comparison generation during expand. This
>>>> means
>>>> that I want
>>>> to allow the cases where the inner block contains comparisons, combined
>>>> using
>>>> logical operations
>>>> like TRUTH_AND_EXPR, TRUTH_IOR_EXPR, TRUTH_NOT_EXPR, their bitwise
>>>> variants
>>>> etc.
>>>> The expand code in ccmp.c knows how to handle these chains of comparisons
>>>> and
>>>> emit the appropriate
>>>> conditional compare patterns. If, however, the inner block contains other
>>>> types of operations like
>>>> arithmetic ops, pointer dereferencing etc I'd want to be conservative to
>>>> avoid
>>>> pulling in operations
>>>> that don't facilitate the ccmp.c expansions.
>>> So the inner block only contains stmts feeding a condition?  As we're
>>> combining that condition with the outer one and the result simplified(?)
>>> it makes sense to allow that as "empty" block generally.
>> My concern is that those stmts feeding the condition will
>> now be executed unconditionally (rather than only after jumping
>> to the inner_cond_bb) so we might end up doing redundant work
>> if the the outer condition would actually have jumped to the else block
>> rather than to the inner one.
> That's true.  But how is arm not affected by this?  That is,
> what's so special about ccmps here?

arm doesn't implement the TARGET_GEN_CCMP_FIRST hook and thus does not
make use of the ccmp.c code that transforms ANDs and IORs of comparisons into
ccmp chains.
  In principle it could (since it can just use normal conditional execution)
but we'd need to implement that hook for the arm backend, which is a separate task.

If/when we implement the ccmp hooks for arm it should automatically benefit from a more
aggressive ifcombine with this scheme but I'm not entirely confident that being aggressive
here for targets that cannot do ccmps is a good idea.

Thanks,
Kyrill

>> Kyrill
>>
>>> No?
>>>
>>> Thanks,
>>> Richard.
>>>
>>>> So what I'm proposing is:
>>>> - If a target supports conditional comparisons through
>>>> TARGET_GEN_CCMP_FIRST
>>>> (currently only aarch64) then we allow the aforementioned
>>>> comparisons+logical-combinations blocks.
>>>> If the block also contains other types of operations we apply the
>>>> estimate_num_insns cost comparison
>>>> with the default value for the comparison picked to be such so that it
>>>> changes
>>>> codegen the least from
>>>> the current situation i.e. one instruction. This value will be a new param
>>>> that targets can increase
>>>> if they want to.
>>>>
>>>> - If the target does not support conditional comparisons we follow only
>>>> the
>>>> second scheme (the conservative
>>>> estimate_num_insns comparison). This should cause the minimal codegen
>>>> difference for the targets that
>>>> don't support conditional compares (which is all targets but aarch64)
>>>> while
>>>> allowing them to scale
>>>> the aggressiveness of this transformation if their benchmarking shows it
>>>> to be
>>>> beneficial.
>>>>
>>>>
>>>>    I believe such a scheme would avoid pulling in too much code that
>>>> doesn't
>>>> facilitate conditional compares generation into the unconditional path and
>>>> would minimise
>>>> impact on existing targets that don't do conditional compares.
>>>>
>>>> Would that be an acceptable plan for the ifcombine_ifandif transformation
>>>> in
>>>> tree-ssa-ifcombine.c (pending benchmarking, of course)?
>>>>
>>>> Thanks,
>>>> Kyrill
>>>>
>>>>> Thanks,
>>>>> Richard.
>>>>>
>>>>>> Thanks,
>>>>>> Kyrill
>>>>>>
>>>>>>> What I mean by expensive for an example is division or some function
>>>>>>> call.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Andrew
>>>>>>>
>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Kyrill
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> jeff
>>
Richard Biener Sept. 25, 2015, 11:35 a.m. UTC | #13
On Fri, 25 Sep 2015, Kyrill Tkachov wrote:

> 
> On 25/09/15 11:15, Richard Biener wrote:
> > On Fri, 25 Sep 2015, Kyrill Tkachov wrote:
> > 
> > > On 25/09/15 10:49, Richard Biener wrote:
> > > > On Fri, 25 Sep 2015, Kyrill Tkachov wrote:
> > > > 
> > > > > Hi all,
> > > > > 
> > > > > On 23/09/15 11:10, Richard Biener wrote:
> > > > > > On Wed, 23 Sep 2015, Kyrill Tkachov wrote:
> > > > > > 
> > > > > > > On 23/09/15 10:09, Pinski, Andrew wrote:
> > > > > > > > > On Sep 23, 2015, at 1:59 AM, Kyrill Tkachov
> > > > > > > > > <kyrylo.tkachov@arm.com>
> > > > > > > > > wrote:
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > > On 22/09/15 20:31, Jeff Law wrote:
> > > > > > > > > > > On 09/22/2015 07:36 AM, Kyrill Tkachov wrote:
> > > > > > > > > > > Hi all,
> > > > > > > > > > > Unfortunately, I see a testsuite regression with this
> > > > > > > > > > > patch:
> > > > > > > > > > > FAIL: gcc.dg/pr66299-2.c scan-tree-dump-not optimized "<<"
> > > > > > > > > > > 
> > > > > > > > > > > The reduced part of that test is:
> > > > > > > > > > > void
> > > > > > > > > > > test1 (int x, unsigned u)
> > > > > > > > > > > {
> > > > > > > > > > >        if ((1U << x) != 64
> > > > > > > > > > >            || (2 << x) != u
> > > > > > > > > > >            || (x << x) != 384
> > > > > > > > > > >            || (3 << x) == 9
> > > > > > > > > > >            || (x << 14) != 98304U
> > > > > > > > > > >            || (1 << x) == 14
> > > > > > > > > > >            || (3 << 2) != 12)
> > > > > > > > > > >          __builtin_abort ();
> > > > > > > > > > > }
> > > > > > > > > > > 
> > > > > > > > > > > The patched ifcombine pass works more or less as expected
> > > > > > > > > > > and
> > > > > > > > > > > produces
> > > > > > > > > > > fewer basic blocks.
> > > > > > > > > > > Before this patch a relevant part of the ifcombine dump
> > > > > > > > > > > for
> > > > > > > > > > > test1
> > > > > > > > > > > is:
> > > > > > > > > > > ;;   basic block 2, loop depth 0, count 0, freq 10000,
> > > > > > > > > > > maybe
> > > > > > > > > > > hot
> > > > > > > > > > >        if (x_1(D) != 6)
> > > > > > > > > > >          goto <bb 6>;
> > > > > > > > > > >        else
> > > > > > > > > > >          goto <bb 3>;
> > > > > > > > > > > 
> > > > > > > > > > > ;;   basic block 3, loop depth 0, count 0, freq 9996,
> > > > > > > > > > > maybe
> > > > > > > > > > > hot
> > > > > > > > > > >        _2 = 2 << x_1(D);
> > > > > > > > > > >        _3 = (unsigned intD.10) _2;
> > > > > > > > > > >        if (_3 != u_4(D))
> > > > > > > > > > >          goto <bb 6>;
> > > > > > > > > > >        else
> > > > > > > > > > >          goto <bb 4>;
> > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > After this patch it is:
> > > > > > > > > > > ;;   basic block 2, loop depth 0, count 0, freq 10000,
> > > > > > > > > > > maybe
> > > > > > > > > > > hot
> > > > > > > > > > >        _2 = 2 << x_1(D);
> > > > > > > > > > >        _3 = (unsigned intD.10) _2;
> > > > > > > > > > >        _9 = _3 != u_4(D);
> > > > > > > > > > >        _10 = x_1(D) != 6;
> > > > > > > > > > >        _11 = _9 | _10;
> > > > > > > > > > >        if (_11 != 0)
> > > > > > > > > > >          goto <bb 5>;
> > > > > > > > > > >        else
> > > > > > > > > > >          goto <bb 3>;
> > > > > > > > > > > 
> > > > > > > > > > > The second form ends up generating worse codegen however,
> > > > > > > > > > > and
> > > > > > > > > > > the
> > > > > > > > > > > badness starts with the dom1 pass.
> > > > > > > > > > > In the unpatched case it manages to deduce that x must be
> > > > > > > > > > > 6 by
> > > > > > > > > > > the
> > > > > > > > > > > time
> > > > > > > > > > > it reaches basic block 3 and
> > > > > > > > > > > uses that information to eliminate the shift in "_2 = 2 <<
> > > > > > > > > > > x_1(D)"
> > > > > > > > > > > from
> > > > > > > > > > > basic block 3
> > > > > > > > > > > In the patched case it is unable to make that call, I
> > > > > > > > > > > think
> > > > > > > > > > > because
> > > > > > > > > > > the
> > > > > > > > > > > x != 6 condition is IORed
> > > > > > > > > > > with another test.
> > > > > > > > > > > 
> > > > > > > > > > > I'm not familiar with the internals of the dom pass, so
> > > > > > > > > > > I'm
> > > > > > > > > > > not
> > > > > > > > > > > sure
> > > > > > > > > > > where to go looking for a fix for this.
> > > > > > > > > > > Is the ifcombine change a step in the right direction? If
> > > > > > > > > > > so,
> > > > > > > > > > > what
> > > > > > > > > > > would
> > > > > > > > > > > need to be done to fix the issue with
> > > > > > > > > > > the dom pass?
> > > > > > > > > > I don't see how you can reasonably fix this in DOM.  if _9
> > > > > > > > > > or
> > > > > > > > > > _10 is
> > > > > > > > > > true, then _11 is true.  But we can't reasonably record any
> > > > > > > > > > kind
> > > > > > > > > > of
> > > > > > > > > > equivalence for _9 or _10 individually.
> > > > > > > > > > 
> > > > > > > > > > If the statement
> > > > > > > > > > _11 = _9 | _10;
> > > > > > > > > > 
> > > > > > > > > > Were changed to
> > > > > > > > > > 
> > > > > > > > > > _11 = _9 & _10;
> > > > > > > > > > 
> > > > > > > > > > Then we could record something useful about _9 and _10.
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > > I suppose what we want is to not combine basic blocks if
> > > > > > > > > > > the
> > > > > > > > > > > sequence
> > > > > > > > > > > and conditions of the basic blocks are
> > > > > > > > > > > such that dom can potentially exploit them, but how do we
> > > > > > > > > > > express
> > > > > > > > > > > that?
> > > > > > > > > > I don't think there's going to be a way to directly express
> > > > > > > > > > that.
> > > > > > > > > > You
> > > > > > > > > > could essentially claim that TRUTH_OR is more expensive than
> > > > > > > > > > TRUTH_AND
> > > > > > > > > > because of the impact on DOM, but that in and of itself may
> > > > > > > > > > not
> > > > > > > > > > resolve
> > > > > > > > > > the situation either.
> > > > > > > > > > 
> > > > > > > > > > I think the question we need to answer is whether or not
> > > > > > > > > > your
> > > > > > > > > > changes
> > > > > > > > > > are generally better, even if there's specific instances
> > > > > > > > > > where
> > > > > > > > > > they
> > > > > > > > > > make
> > > > > > > > > > things worse.  If the benefits outweigh the negatives then
> > > > > > > > > > we
> > > > > > > > > > can
> > > > > > > > > > xfail
> > > > > > > > > > that test.
> > > > > > > > > Ok, I'll investigate and benchmark some more.
> > > > > > > > > Andrew, this transformation to ifcombine (together with the
> > > > > > > > > restriction
> > > > > > > > > that the inner condition block
> > > > > > > > > has to be a single comparison) was added by you with r204194.
> > > > > > > > > Is there a particular reason for that restriction and why it
> > > > > > > > > is
> > > > > > > > > applied to
> > > > > > > > > the inner block and not either?
> > > > > > > > My reasoning at the time was there might be an "expensive"
> > > > > > > > instruction
> > > > > > > > or
> > > > > > > > one that might trap (I did not check to see if the other part of
> > > > > > > > the
> > > > > > > > code
> > > > > > > > was detecting that).
> > > > > > > > The outer block did not need any checks as we have something
> > > > > > > > like
> > > > > > > > ...
> > > > > > > > If (a)
> > > > > > > >       If (b)
> > > > > > > > 
> > > > > > > > Or
> > > > > > > > ....
> > > > > > > > If (a)
> > > > > > > >       Goto f
> > > > > > > > else if (b)
> > > > > > > >      ....
> > > > > > > > Else
> > > > > > > > {
> > > > > > > > F:
> > > > > > > > ....
> > > > > > > > }
> > > > > > > > 
> > > > > > > > And there was no need to check what was before the if (a) part
> > > > > > > > just
> > > > > > > > what
> > > > > > > > is
> > > > > > > > in between the two ifs.
> > > > > > > Ah, because the code in outer_cond_bb would have to be executed
> > > > > > > anyway
> > > > > > > whether
> > > > > > > we perform the conversion or not, right?
> > > > > > All ifcombine transforms make the outer condition unconditionally
> > > > > > true/false thus the check should have been on whether the outer
> > > > > > cond BB is "empty".  Which would solve your problem, right?
> > > > > > 
> > > > > > Note that other transforms (bit test recognition) don't care (sth
> > > > > > we might want to fix?).
> > > > > > 
> > > > > > In general this needs a better cost function, maybe simply use
> > > > > > estimate_num_insns with speed estimates and compare against a
> > > > > > new --param.
> > > > > So I've played around with that code and I think I'd like to make it a
> > > > > bit
> > > > > more intricate. Just comparing against estimate_num_insns is too
> > > > > coarse-grained and
> > > > > is just a comparison by a magic number number and I've been struggling
> > > > > to
> > > > > make
> > > > > this
> > > > > aggressive enough without pulling in too much code into the
> > > > > unconditional
> > > > > path.
> > > > > 
> > > > > As far as aarch64 is concerned I want to make this transformation more
> > > > > aggressive when
> > > > > it can facilitate conditional comparison generation during expand.
> > > > > This
> > > > > means
> > > > > that I want
> > > > > to allow the cases where the inner block contains comparisons,
> > > > > combined
> > > > > using
> > > > > logical operations
> > > > > like TRUTH_AND_EXPR, TRUTH_IOR_EXPR, TRUTH_NOT_EXPR, their bitwise
> > > > > variants
> > > > > etc.
> > > > > The expand code in ccmp.c knows how to handle these chains of
> > > > > comparisons
> > > > > and
> > > > > emit the appropriate
> > > > > conditional compare patterns. If, however, the inner block contains
> > > > > other
> > > > > types of operations like
> > > > > arithmetic ops, pointer dereferencing etc I'd want to be conservative
> > > > > to
> > > > > avoid
> > > > > pulling in operations
> > > > > that don't facilitate the ccmp.c expansions.
> > > > So the inner block only contains stmts feeding a condition?  As we're
> > > > combining that condition with the outer one and the result simplified(?)
> > > > it makes sense to allow that as "empty" block generally.
> > > My concern is that those stmts feeding the condition will
> > > now be executed unconditionally (rather than only after jumping
> > > to the inner_cond_bb) so we might end up doing redundant work
> > > if the the outer condition would actually have jumped to the else block
> > > rather than to the inner one.
> > That's true.  But how is arm not affected by this?  That is,
> > what's so special about ccmps here?
> 
> arm doesn't implement the TARGET_GEN_CCMP_FIRST hook and thus does not
> make use of the ccmp.c code that transforms ANDs and IORs of comparisons into
> ccmp chains.
>  In principle it could (since it can just use normal conditional execution)
> but we'd need to implement that hook for the arm backend, which is a separate
> task.
> 
> If/when we implement the ccmp hooks for arm it should automatically benefit
> from a more
> aggressive ifcombine with this scheme but I'm not entirely confident that
> being aggressive
> here for targets that cannot do ccmps is a good idea.

I'd hate to put this kind of target dependence into GIMPLE optimizer
behavior this early in the pipeline...

> > > > > So what I'm proposing is:
> > > > > - If a target supports conditional comparisons through
> > > > > TARGET_GEN_CCMP_FIRST
> > > > > (currently only aarch64) then we allow the aforementioned
> > > > > comparisons+logical-combinations blocks.
> > > > > If the block also contains other types of operations we apply the
> > > > > estimate_num_insns cost comparison
> > > > > with the default value for the comparison picked to be such so that it
> > > > > changes
> > > > > codegen the least from
> > > > > the current situation i.e. one instruction. This value will be a new
> > > > > param
> > > > > that targets can increase
> > > > > if they want to.
> > > > > 
> > > > > - If the target does not support conditional comparisons we follow
> > > > > only
> > > > > the
> > > > > second scheme (the conservative
> > > > > estimate_num_insns comparison). This should cause the minimal codegen
> > > > > difference for the targets that
> > > > > don't support conditional compares (which is all targets but aarch64)
> > > > > while
> > > > > allowing them to scale
> > > > > the aggressiveness of this transformation if their benchmarking shows
> > > > > it
> > > > > to be
> > > > > beneficial.
> > > > > 
> > > > > 
> > > > >    I believe such a scheme would avoid pulling in too much code that
> > > > > doesn't
> > > > > facilitate conditional compares generation into the unconditional path
> > > > > and
> > > > > would minimise
> > > > > impact on existing targets that don't do conditional compares.
> > > > > 
> > > > > Would that be an acceptable plan for the ifcombine_ifandif
> > > > > transformation
> > > > > in
> > > > > tree-ssa-ifcombine.c (pending benchmarking, of course)?

...but yes, that would be an acceptable plan.

Note that I think we run ifcombine a little early and should allow
some jump threading and phiopt to take place first.  I think
running it inbettween dce and forwprop that is followed by the 2nd phiopt
makes some sense (and has PHIOPT1 and DOM1 run first).

Maybe you can play with that idea a bit.

Richard.
Kyrylo Tkachov Sept. 25, 2015, 12:21 p.m. UTC | #14
On 25/09/15 12:35, Richard Biener wrote:
> On Fri, 25 Sep 2015, Kyrill Tkachov wrote:
>
>> On 25/09/15 11:15, Richard Biener wrote:
>>> On Fri, 25 Sep 2015, Kyrill Tkachov wrote:
>>>
>>>> On 25/09/15 10:49, Richard Biener wrote:
>>>>> On Fri, 25 Sep 2015, Kyrill Tkachov wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> On 23/09/15 11:10, Richard Biener wrote:
>>>>>>> On Wed, 23 Sep 2015, Kyrill Tkachov wrote:
>>>>>>>
>>>>>>>> On 23/09/15 10:09, Pinski, Andrew wrote:
>>>>>>>>>> On Sep 23, 2015, at 1:59 AM, Kyrill Tkachov
>>>>>>>>>> <kyrylo.tkachov@arm.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> On 22/09/15 20:31, Jeff Law wrote:
>>>>>>>>>>>> On 09/22/2015 07:36 AM, Kyrill Tkachov wrote:
>>>>>>>>>>>> Hi all,
>>>>>>>>>>>> Unfortunately, I see a testsuite regression with this
>>>>>>>>>>>> patch:
>>>>>>>>>>>> FAIL: gcc.dg/pr66299-2.c scan-tree-dump-not optimized "<<"
>>>>>>>>>>>>
>>>>>>>>>>>> The reduced part of that test is:
>>>>>>>>>>>> void
>>>>>>>>>>>> test1 (int x, unsigned u)
>>>>>>>>>>>> {
>>>>>>>>>>>>         if ((1U << x) != 64
>>>>>>>>>>>>             || (2 << x) != u
>>>>>>>>>>>>             || (x << x) != 384
>>>>>>>>>>>>             || (3 << x) == 9
>>>>>>>>>>>>             || (x << 14) != 98304U
>>>>>>>>>>>>             || (1 << x) == 14
>>>>>>>>>>>>             || (3 << 2) != 12)
>>>>>>>>>>>>           __builtin_abort ();
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>> The patched ifcombine pass works more or less as expected
>>>>>>>>>>>> and
>>>>>>>>>>>> produces
>>>>>>>>>>>> fewer basic blocks.
>>>>>>>>>>>> Before this patch a relevant part of the ifcombine dump
>>>>>>>>>>>> for
>>>>>>>>>>>> test1
>>>>>>>>>>>> is:
>>>>>>>>>>>> ;;   basic block 2, loop depth 0, count 0, freq 10000,
>>>>>>>>>>>> maybe
>>>>>>>>>>>> hot
>>>>>>>>>>>>         if (x_1(D) != 6)
>>>>>>>>>>>>           goto <bb 6>;
>>>>>>>>>>>>         else
>>>>>>>>>>>>           goto <bb 3>;
>>>>>>>>>>>>
>>>>>>>>>>>> ;;   basic block 3, loop depth 0, count 0, freq 9996,
>>>>>>>>>>>> maybe
>>>>>>>>>>>> hot
>>>>>>>>>>>>         _2 = 2 << x_1(D);
>>>>>>>>>>>>         _3 = (unsigned intD.10) _2;
>>>>>>>>>>>>         if (_3 != u_4(D))
>>>>>>>>>>>>           goto <bb 6>;
>>>>>>>>>>>>         else
>>>>>>>>>>>>           goto <bb 4>;
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> After this patch it is:
>>>>>>>>>>>> ;;   basic block 2, loop depth 0, count 0, freq 10000,
>>>>>>>>>>>> maybe
>>>>>>>>>>>> hot
>>>>>>>>>>>>         _2 = 2 << x_1(D);
>>>>>>>>>>>>         _3 = (unsigned intD.10) _2;
>>>>>>>>>>>>         _9 = _3 != u_4(D);
>>>>>>>>>>>>         _10 = x_1(D) != 6;
>>>>>>>>>>>>         _11 = _9 | _10;
>>>>>>>>>>>>         if (_11 != 0)
>>>>>>>>>>>>           goto <bb 5>;
>>>>>>>>>>>>         else
>>>>>>>>>>>>           goto <bb 3>;
>>>>>>>>>>>>
>>>>>>>>>>>> The second form ends up generating worse codegen however,
>>>>>>>>>>>> and
>>>>>>>>>>>> the
>>>>>>>>>>>> badness starts with the dom1 pass.
>>>>>>>>>>>> In the unpatched case it manages to deduce that x must be
>>>>>>>>>>>> 6 by
>>>>>>>>>>>> the
>>>>>>>>>>>> time
>>>>>>>>>>>> it reaches basic block 3 and
>>>>>>>>>>>> uses that information to eliminate the shift in "_2 = 2 <<
>>>>>>>>>>>> x_1(D)"
>>>>>>>>>>>> from
>>>>>>>>>>>> basic block 3
>>>>>>>>>>>> In the patched case it is unable to make that call, I
>>>>>>>>>>>> think
>>>>>>>>>>>> because
>>>>>>>>>>>> the
>>>>>>>>>>>> x != 6 condition is IORed
>>>>>>>>>>>> with another test.
>>>>>>>>>>>>
>>>>>>>>>>>> I'm not familiar with the internals of the dom pass, so
>>>>>>>>>>>> I'm
>>>>>>>>>>>> not
>>>>>>>>>>>> sure
>>>>>>>>>>>> where to go looking for a fix for this.
>>>>>>>>>>>> Is the ifcombine change a step in the right direction? If
>>>>>>>>>>>> so,
>>>>>>>>>>>> what
>>>>>>>>>>>> would
>>>>>>>>>>>> need to be done to fix the issue with
>>>>>>>>>>>> the dom pass?
>>>>>>>>>>> I don't see how you can reasonably fix this in DOM.  if _9
>>>>>>>>>>> or
>>>>>>>>>>> _10 is
>>>>>>>>>>> true, then _11 is true.  But we can't reasonably record any
>>>>>>>>>>> kind
>>>>>>>>>>> of
>>>>>>>>>>> equivalence for _9 or _10 individually.
>>>>>>>>>>>
>>>>>>>>>>> If the statement
>>>>>>>>>>> _11 = _9 | _10;
>>>>>>>>>>>
>>>>>>>>>>> Were changed to
>>>>>>>>>>>
>>>>>>>>>>> _11 = _9 & _10;
>>>>>>>>>>>
>>>>>>>>>>> Then we could record something useful about _9 and _10.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> I suppose what we want is to not combine basic blocks if
>>>>>>>>>>>> the
>>>>>>>>>>>> sequence
>>>>>>>>>>>> and conditions of the basic blocks are
>>>>>>>>>>>> such that dom can potentially exploit them, but how do we
>>>>>>>>>>>> express
>>>>>>>>>>>> that?
>>>>>>>>>>> I don't think there's going to be a way to directly express
>>>>>>>>>>> that.
>>>>>>>>>>> You
>>>>>>>>>>> could essentially claim that TRUTH_OR is more expensive than
>>>>>>>>>>> TRUTH_AND
>>>>>>>>>>> because of the impact on DOM, but that in and of itself may
>>>>>>>>>>> not
>>>>>>>>>>> resolve
>>>>>>>>>>> the situation either.
>>>>>>>>>>>
>>>>>>>>>>> I think the question we need to answer is whether or not
>>>>>>>>>>> your
>>>>>>>>>>> changes
>>>>>>>>>>> are generally better, even if there's specific instances
>>>>>>>>>>> where
>>>>>>>>>>> they
>>>>>>>>>>> make
>>>>>>>>>>> things worse.  If the benefits outweigh the negatives then
>>>>>>>>>>> we
>>>>>>>>>>> can
>>>>>>>>>>> xfail
>>>>>>>>>>> that test.
>>>>>>>>>> Ok, I'll investigate and benchmark some more.
>>>>>>>>>> Andrew, this transformation to ifcombine (together with the
>>>>>>>>>> restriction
>>>>>>>>>> that the inner condition block
>>>>>>>>>> has to be a single comparison) was added by you with r204194.
>>>>>>>>>> Is there a particular reason for that restriction and why it
>>>>>>>>>> is
>>>>>>>>>> applied to
>>>>>>>>>> the inner block and not either?
>>>>>>>>> My reasoning at the time was there might be an "expensive"
>>>>>>>>> instruction
>>>>>>>>> or
>>>>>>>>> one that might trap (I did not check to see if the other part of
>>>>>>>>> the
>>>>>>>>> code
>>>>>>>>> was detecting that).
>>>>>>>>> The outer block did not need any checks as we have something
>>>>>>>>> like
>>>>>>>>> ...
>>>>>>>>> If (a)
>>>>>>>>>        If (b)
>>>>>>>>>
>>>>>>>>> Or
>>>>>>>>> ....
>>>>>>>>> If (a)
>>>>>>>>>        Goto f
>>>>>>>>> else if (b)
>>>>>>>>>       ....
>>>>>>>>> Else
>>>>>>>>> {
>>>>>>>>> F:
>>>>>>>>> ....
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> And there was no need to check what was before the if (a) part
>>>>>>>>> just
>>>>>>>>> what
>>>>>>>>> is
>>>>>>>>> in between the two ifs.
>>>>>>>> Ah, because the code in outer_cond_bb would have to be executed
>>>>>>>> anyway
>>>>>>>> whether
>>>>>>>> we perform the conversion or not, right?
>>>>>>> All ifcombine transforms make the outer condition unconditionally
>>>>>>> true/false thus the check should have been on whether the outer
>>>>>>> cond BB is "empty".  Which would solve your problem, right?
>>>>>>>
>>>>>>> Note that other transforms (bit test recognition) don't care (sth
>>>>>>> we might want to fix?).
>>>>>>>
>>>>>>> In general this needs a better cost function, maybe simply use
>>>>>>> estimate_num_insns with speed estimates and compare against a
>>>>>>> new --param.
>>>>>> So I've played around with that code and I think I'd like to make it a
>>>>>> bit
>>>>>> more intricate. Just comparing against estimate_num_insns is too
>>>>>> coarse-grained and
>>>>>> is just a comparison by a magic number number and I've been struggling
>>>>>> to
>>>>>> make
>>>>>> this
>>>>>> aggressive enough without pulling in too much code into the
>>>>>> unconditional
>>>>>> path.
>>>>>>
>>>>>> As far as aarch64 is concerned I want to make this transformation more
>>>>>> aggressive when
>>>>>> it can facilitate conditional comparison generation during expand.
>>>>>> This
>>>>>> means
>>>>>> that I want
>>>>>> to allow the cases where the inner block contains comparisons,
>>>>>> combined
>>>>>> using
>>>>>> logical operations
>>>>>> like TRUTH_AND_EXPR, TRUTH_IOR_EXPR, TRUTH_NOT_EXPR, their bitwise
>>>>>> variants
>>>>>> etc.
>>>>>> The expand code in ccmp.c knows how to handle these chains of
>>>>>> comparisons
>>>>>> and
>>>>>> emit the appropriate
>>>>>> conditional compare patterns. If, however, the inner block contains
>>>>>> other
>>>>>> types of operations like
>>>>>> arithmetic ops, pointer dereferencing etc I'd want to be conservative
>>>>>> to
>>>>>> avoid
>>>>>> pulling in operations
>>>>>> that don't facilitate the ccmp.c expansions.
>>>>> So the inner block only contains stmts feeding a condition?  As we're
>>>>> combining that condition with the outer one and the result simplified(?)
>>>>> it makes sense to allow that as "empty" block generally.
>>>> My concern is that those stmts feeding the condition will
>>>> now be executed unconditionally (rather than only after jumping
>>>> to the inner_cond_bb) so we might end up doing redundant work
>>>> if the the outer condition would actually have jumped to the else block
>>>> rather than to the inner one.
>>> That's true.  But how is arm not affected by this?  That is,
>>> what's so special about ccmps here?
>> arm doesn't implement the TARGET_GEN_CCMP_FIRST hook and thus does not
>> make use of the ccmp.c code that transforms ANDs and IORs of comparisons into
>> ccmp chains.
>>   In principle it could (since it can just use normal conditional execution)
>> but we'd need to implement that hook for the arm backend, which is a separate
>> task.
>>
>> If/when we implement the ccmp hooks for arm it should automatically benefit
>> from a more
>> aggressive ifcombine with this scheme but I'm not entirely confident that
>> being aggressive
>> here for targets that cannot do ccmps is a good idea.
> I'd hate to put this kind of target dependence into GIMPLE optimizer
> behavior this early in the pipeline...

I see what you mean. I'll try instead being unconditionally
more aggressive with the AND/IOR/compare blocks and conservative
with the other types of blocks and see if that causes regressions
on a non-ccmp target. Maybe we can get a way with it...

>
>>>>>> So what I'm proposing is:
>>>>>> - If a target supports conditional comparisons through
>>>>>> TARGET_GEN_CCMP_FIRST
>>>>>> (currently only aarch64) then we allow the aforementioned
>>>>>> comparisons+logical-combinations blocks.
>>>>>> If the block also contains other types of operations we apply the
>>>>>> estimate_num_insns cost comparison
>>>>>> with the default value for the comparison picked to be such so that it
>>>>>> changes
>>>>>> codegen the least from
>>>>>> the current situation i.e. one instruction. This value will be a new
>>>>>> param
>>>>>> that targets can increase
>>>>>> if they want to.
>>>>>>
>>>>>> - If the target does not support conditional comparisons we follow
>>>>>> only
>>>>>> the
>>>>>> second scheme (the conservative
>>>>>> estimate_num_insns comparison). This should cause the minimal codegen
>>>>>> difference for the targets that
>>>>>> don't support conditional compares (which is all targets but aarch64)
>>>>>> while
>>>>>> allowing them to scale
>>>>>> the aggressiveness of this transformation if their benchmarking shows
>>>>>> it
>>>>>> to be
>>>>>> beneficial.
>>>>>>
>>>>>>
>>>>>>     I believe such a scheme would avoid pulling in too much code that
>>>>>> doesn't
>>>>>> facilitate conditional compares generation into the unconditional path
>>>>>> and
>>>>>> would minimise
>>>>>> impact on existing targets that don't do conditional compares.
>>>>>>
>>>>>> Would that be an acceptable plan for the ifcombine_ifandif
>>>>>> transformation
>>>>>> in
>>>>>> tree-ssa-ifcombine.c (pending benchmarking, of course)?
> ...but yes, that would be an acceptable plan.
>
> Note that I think we run ifcombine a little early and should allow
> some jump threading and phiopt to take place first.  I think
> running it inbettween dce and forwprop that is followed by the 2nd phiopt
> makes some sense (and has PHIOPT1 and DOM1 run first).
>
> Maybe you can play with that idea a bit.

Will do.

Thanks,
Kyrill

>
> Richard.
>
diff mbox

Patch

diff --git a/gcc/tree-ssa-ifcombine.c b/gcc/tree-ssa-ifcombine.c
index 9f04174..bfe17ba 100644
--- a/gcc/tree-ssa-ifcombine.c
+++ b/gcc/tree-ssa-ifcombine.c
@@ -511,8 +511,12 @@  ifcombine_ifandif (basic_block inner_cond_bb, bool inner_inv,
 	  gimple_stmt_iterator gsi;
 	  if (!LOGICAL_OP_NON_SHORT_CIRCUIT)
 	    return false;
-	  /* Only do this optimization if the inner bb contains only the conditional. */
-	  if (!gsi_one_before_end_p (gsi_start_nondebug_after_labels_bb (inner_cond_bb)))
+	  /* Only do this optimization if inner or outer bb contains
+	     only the conditional. */
+	  if (!gsi_one_before_end_p (gsi_start_nondebug_after_labels_bb (
+				     inner_cond_bb))
+	      && !gsi_one_before_end_p (gsi_start_nondebug_after_labels_bb (
+					 outer_cond_bb)))
 	    return false;
 	  t1 = fold_build2_loc (gimple_location (inner_cond),
 				inner_cond_code,