diff mbox

[match.pd] Add a simplify rule for x * copysign (1.0, y);

Message ID 1443707835-6888-1-git-send-email-james.greenhalgh@arm.com
State New
Headers show

Commit Message

James Greenhalgh Oct. 1, 2015, 1:57 p.m. UTC
Hi,

If it is cheap enough to treat a floating-point value as an integer and
to do bitwise arithmetic on it (as it is for AArch64) we can rewrite:

  x * copysign (1.0, y)

as:

  x ^ (y & (1 << sign_bit_position))

This patch implements that rewriting rule in match.pd, and a testcase
expecting the transform.

This is worth about 6% in 481.wrf for AArch64. I don't don't know enough
about the x86 microarchitectures to know how productive this transformation
is there. In Spec2006FP I didn't see any interesting results in either
direction. Looking at code generation for the testcase I add, I think the
x86 code generation looks worse, but I can't understand why it doesn't use
a vector-side xor and load the mask vector-side. With that fixed up I think
the code generation would look better - though as I say, I'm not an expert
here...

Bootstrapped on both aarch64-none-linux-gnu and x86_64 with no issues.

OK for trunk?

Thanks,
James

---
gcc/

2015-10-01  James Greenhalgh  <james.greenhalgh@arm.com>

	* match.pd (mult (COPYSIGN:s real_onep @0) @1): New simplifier.

gcc/testsuite/

2015-10-01  James Greenhalgh  <james.greenhalgh@arm.com>

	* gcc.dg/tree-ssa/copysign.c: New.

Comments

Andrew Pinski Oct. 1, 2015, 2:28 p.m. UTC | #1
> 
> On Oct 1, 2015, at 6:57 AM, James Greenhalgh <james.greenhalgh@arm.com> wrote:
> 
> 
> Hi,
> 
> If it is cheap enough to treat a floating-point value as an integer and
> to do bitwise arithmetic on it (as it is for AArch64) we can rewrite:
> 
>  x * copysign (1.0, y)
> 
> as:
> 
>  x ^ (y & (1 << sign_bit_position))

Why not just convert it to copysign (x, y) instead and let expand chose the better implementation?  Also I think this can only be done for finite and non trapping types. 

Thanks,
Andrew

> 
> This patch implements that rewriting rule in match.pd, and a testcase
> expecting the transform.
> 
> This is worth about 6% in 481.wrf for AArch64. I don't don't know enough
> about the x86 microarchitectures to know how productive this transformation
> is there. In Spec2006FP I didn't see any interesting results in either
> direction. Looking at code generation for the testcase I add, I think the
> x86 code generation looks worse, but I can't understand why it doesn't use
> a vector-side xor and load the mask vector-side. With that fixed up I think
> the code generation would look better - though as I say, I'm not an expert
> here...
> 
> Bootstrapped on both aarch64-none-linux-gnu and x86_64 with no issues.
> 
> OK for trunk?
> 
> Thanks,
> James
> 
> ---
> gcc/
> 
> 2015-10-01  James Greenhalgh  <james.greenhalgh@arm.com>
> 
>    * match.pd (mult (COPYSIGN:s real_onep @0) @1): New simplifier.
> 
> gcc/testsuite/
> 
> 2015-10-01  James Greenhalgh  <james.greenhalgh@arm.com>
> 
>    * gcc.dg/tree-ssa/copysign.c: New.
> 
> <0001-Patch-match.pd-Add-a-simplify-rule-for-x-copysign-1..patch>
James Greenhalgh Oct. 1, 2015, 2:51 p.m. UTC | #2
On Thu, Oct 01, 2015 at 03:28:22PM +0100, pinskia@gmail.com wrote:
> > 
> > On Oct 1, 2015, at 6:57 AM, James Greenhalgh <james.greenhalgh@arm.com> wrote:
> > 
> > 
> > Hi,
> > 
> > If it is cheap enough to treat a floating-point value as an integer and
> > to do bitwise arithmetic on it (as it is for AArch64) we can rewrite:
> > 
> >  x * copysign (1.0, y)
> > 
> > as:
> > 
> >  x ^ (y & (1 << sign_bit_position))
> 
> Why not just convert it to copysign (x, y) instead and let expand chose
> the better implementation?

Because that transformation is invalid :-)

let x = -1.0, y = -1.0

  x * copysign (1.0, y)
    =  -1.0 * copysign (1.0, -1.0)
    = -1.0 * -1.0
    = 1.0

  copysign (x, y)
    = copysign (-1.0, -1.0)
    = -1.0

Or have I completely lost my maths skills :-)

> Also I think this can only be done for finite and non trapping types. 

That may be well true, I swithered either way and went for no checks, but
I'd happily go back on that and wrap this in something suitable restrictive
if I need to.

Thanks,
James


> > 
> > This patch implements that rewriting rule in match.pd, and a testcase
> > expecting the transform.
> > 
> > This is worth about 6% in 481.wrf for AArch64. I don't don't know enough
> > about the x86 microarchitectures to know how productive this transformation
> > is there. In Spec2006FP I didn't see any interesting results in either
> > direction. Looking at code generation for the testcase I add, I think the
> > x86 code generation looks worse, but I can't understand why it doesn't use
> > a vector-side xor and load the mask vector-side. With that fixed up I think
> > the code generation would look better - though as I say, I'm not an expert
> > here...
> > 
> > Bootstrapped on both aarch64-none-linux-gnu and x86_64 with no issues.
> > 
> > OK for trunk?
> > 
> > Thanks,
> > James
> > 
> > ---
> > gcc/
> > 
> > 2015-10-01  James Greenhalgh  <james.greenhalgh@arm.com>
> > 
> >    * match.pd (mult (COPYSIGN:s real_onep @0) @1): New simplifier.
> > 
> > gcc/testsuite/
> > 
> > 2015-10-01  James Greenhalgh  <james.greenhalgh@arm.com>
> > 
> >    * gcc.dg/tree-ssa/copysign.c: New.
> > 
> > <0001-Patch-match.pd-Add-a-simplify-rule-for-x-copysign-1..patch>
>
Andrew Pinski Oct. 1, 2015, 3:02 p.m. UTC | #3
> On Oct 1, 2015, at 7:51 AM, James Greenhalgh <james.greenhalgh@arm.com> wrote:
> 
> On Thu, Oct 01, 2015 at 03:28:22PM +0100, pinskia@gmail.com wrote:
>>> 
>>> On Oct 1, 2015, at 6:57 AM, James Greenhalgh <james.greenhalgh@arm.com> wrote:
>>> 
>>> 
>>> Hi,
>>> 
>>> If it is cheap enough to treat a floating-point value as an integer and
>>> to do bitwise arithmetic on it (as it is for AArch64) we can rewrite:
>>> 
>>> x * copysign (1.0, y)
>>> 
>>> as:
>>> 
>>> x ^ (y & (1 << sign_bit_position))
>> 
>> Why not just convert it to copysign (x, y) instead and let expand chose
>> the better implementation?
> 
> Because that transformation is invalid :-)
> 
> let x = -1.0, y = -1.0
> 
>  x * copysign (1.0, y)
>    =  -1.0 * copysign (1.0, -1.0)
>    = -1.0 * -1.0
>    = 1.0
> 
>  copysign (x, y)
>    = copysign (-1.0, -1.0)
>    = -1.0
> 
> Or have I completely lost my maths skills :-)

No you are correct. Note I would rather see the copysign form in the tree level and have the integer form on the rtl level. So placing this in expand would be better instead of match.md. 

Thanks,
Andrew

> 
>> Also I think this can only be done for finite and non trapping types. 
> 
> That may be well true, I swithered either way and went for no checks, but
> I'd happily go back on that and wrap this in something suitable restrictive
> if I need to.
> 
> Thanks,
> James
> 
> 
>>> 
>>> This patch implements that rewriting rule in match.pd, and a testcase
>>> expecting the transform.
>>> 
>>> This is worth about 6% in 481.wrf for AArch64. I don't don't know enough
>>> about the x86 microarchitectures to know how productive this transformation
>>> is there. In Spec2006FP I didn't see any interesting results in either
>>> direction. Looking at code generation for the testcase I add, I think the
>>> x86 code generation looks worse, but I can't understand why it doesn't use
>>> a vector-side xor and load the mask vector-side. With that fixed up I think
>>> the code generation would look better - though as I say, I'm not an expert
>>> here...
>>> 
>>> Bootstrapped on both aarch64-none-linux-gnu and x86_64 with no issues.
>>> 
>>> OK for trunk?
>>> 
>>> Thanks,
>>> James
>>> 
>>> ---
>>> gcc/
>>> 
>>> 2015-10-01  James Greenhalgh  <james.greenhalgh@arm.com>
>>> 
>>>   * match.pd (mult (COPYSIGN:s real_onep @0) @1): New simplifier.
>>> 
>>> gcc/testsuite/
>>> 
>>> 2015-10-01  James Greenhalgh  <james.greenhalgh@arm.com>
>>> 
>>>   * gcc.dg/tree-ssa/copysign.c: New.
>>> 
>>> <0001-Patch-match.pd-Add-a-simplify-rule-for-x-copysign-1..patch>
>>
Michael Matz Oct. 1, 2015, 3:43 p.m. UTC | #4
Hi,

On Thu, 1 Oct 2015, James Greenhalgh wrote:

> > >  x * copysign (1.0, y)
> > > 
> > >  x ^ (y & (1 << sign_bit_position))
> > 
> > Also I think this can only be done for finite and non trapping types. 
> 
> That may be well true, I swithered either way and went for no checks, 
> but I'd happily go back on that and wrap this in something suitable 
> restrictive if I need to.

I don't think that's necessary.  copysign (1.0, y) is always 1.0 or -1.0, 
even with y being a NaN or inf.  Additionally copysign is allowed to not 
signal even if y is a sNaN.  That leaves only the form of x to doubt.  If 
x is inf all is well (multiplying by +-1.0 is defined and both sequences 
get the same result), if x is NaN the result must be a NaN, and it is in 
both cases.  The catch is that strictly speaking (NaN * -1.0) needs to 
deliver NaN, not -NaN (operations involving quiet NaNs need to provide 
one of the input NaNs as result), and here both are not equivalent.  OTOH 
the sign of NaNs isn't specified, so I think we could reasonably decide to 
not care about this case (it would have to be checked if the hardware 
multiplication even follows that rule, otherwise it's moot anyway).

And yes, also on x86-64 cores the new sequence would be better (or at 
least as good; the latency of xor[sp][sd] is less than or equal to mul), 
but that only is the case if the arithmetic really happen in SSE 
registers, not in integer ones, and this isn't the case right now.  Hmpf.


Ciao,
Michael.
Jakub Jelinek Oct. 1, 2015, 3:48 p.m. UTC | #5
On Thu, Oct 01, 2015 at 05:43:15PM +0200, Michael Matz wrote:
> Hi,
> 
> On Thu, 1 Oct 2015, James Greenhalgh wrote:
> 
> > > >  x * copysign (1.0, y)
> > > > 
> > > >  x ^ (y & (1 << sign_bit_position))
> > > 
> > > Also I think this can only be done for finite and non trapping types. 
> > 
> > That may be well true, I swithered either way and went for no checks, 
> > but I'd happily go back on that and wrap this in something suitable 
> > restrictive if I need to.
> 
> I don't think that's necessary.  copysign (1.0, y) is always 1.0 or -1.0, 
> even with y being a NaN or inf.  Additionally copysign is allowed to not 
> signal even if y is a sNaN.  That leaves only the form of x to doubt.  If 
> x is inf all is well (multiplying by +-1.0 is defined and both sequences 
> get the same result), if x is NaN the result must be a NaN, and it is in 
> both cases.  The catch is that strictly speaking (NaN * -1.0) needs to 
> deliver NaN, not -NaN (operations involving quiet NaNs need to provide 
> one of the input NaNs as result), and here both are not equivalent.  OTOH 
> the sign of NaNs isn't specified, so I think we could reasonably decide to 
> not care about this case (it would have to be checked if the hardware 
> multiplication even follows that rule, otherwise it's moot anyway).

But if x is a sNaN, then the multiplication will throw an exception, while
the transformed operation will not.  So perhaps it should be guarded by
!HONOR_SNANS (TYPE_MODE (type))
?  And sure, somebody should look at why this isn't done in SSE.

	Jakub
Joseph Myers Oct. 1, 2015, 3:59 p.m. UTC | #6
On Thu, 1 Oct 2015, Michael Matz wrote:

> both cases.  The catch is that strictly speaking (NaN * -1.0) needs to 
> deliver NaN, not -NaN (operations involving quiet NaNs need to provide 
> one of the input NaNs as result), and here both are not equivalent.  OTOH 
> the sign of NaNs isn't specified, so I think we could reasonably decide to 
> not care about this case (it would have to be checked if the hardware 
> multiplication even follows that rule, otherwise it's moot anyway).

"For all other operations, this standard does not specify the sign bit of 
a NaN result, even when there is only one input NaN, or when the NaN is 
produced from an invalid operation." (IEEE 754-2008, 6.3 The sign bit).  
So no need to care about this case.
Michael Matz Oct. 1, 2015, 4:09 p.m. UTC | #7
Hi,

On Thu, 1 Oct 2015, Jakub Jelinek wrote:

> But if x is a sNaN, then the multiplication will throw an exception, while
> the transformed operation will not.

Hmm, that's right, silly me.

> So perhaps it should be guarded by
> !HONOR_SNANS (TYPE_MODE (type))
> ?

That makes sense, yes.


Ciao,
Michael.
Michael Matz Oct. 1, 2015, 4:15 p.m. UTC | #8
Hi,

On Thu, 1 Oct 2015, Joseph Myers wrote:

> On Thu, 1 Oct 2015, Michael Matz wrote:
> 
> > both cases.  The catch is that strictly speaking (NaN * -1.0) needs to 
> > deliver NaN, not -NaN (operations involving quiet NaNs need to provide 
> > one of the input NaNs as result), and here both are not equivalent.  OTOH 
> > the sign of NaNs isn't specified, so I think we could reasonably decide to 
> > not care about this case (it would have to be checked if the hardware 
> > multiplication even follows that rule, otherwise it's moot anyway).
> 
> "For all other operations, this standard does not specify the sign bit of 
> a NaN result, even when there is only one input NaN, or when the NaN is 
> produced from an invalid operation." (IEEE 754-2008, 6.3 The sign bit).  
> So no need to care about this case.

Ah.  I was looking at an old version; thanks.


Ciao,
Michael.
Jakub Jelinek Oct. 1, 2015, 6:36 p.m. UTC | #9
On Thu, Oct 01, 2015 at 02:57:15PM +0100, James Greenhalgh wrote:
> 2015-10-01  James Greenhalgh  <james.greenhalgh@arm.com>
> 
> 	* match.pd (mult (COPYSIGN:s real_onep @0) @1): New simplifier.

Also, please note that
+      wide_int m = wi::min_value (TYPE_PRECISION (type), SIGNED);
+      tree tt
+     = build_nonstandard_integer_type (TYPE_PRECISION (type),
+                                       false);
+      tree mask = wide_int_to_tree (tt, m);
is really not a reliable way to determine which bit to change.
In some floating format it is not possible at all, in others it might not
be the topmost bit of the precision, or might depend on
FLOAT_WORDS_BIG_ENDIAN etc., see expand_copysign_bit and expand_copysign
for details (e.g. one has to look at fmt->signbit_rw etc.).
So, I probably agree with Andrew that it would be better optimized during
expansion.  One issue for that though is that TER stops at calls, we'd need
to special case this case.

	Jakub
Richard Biener Oct. 2, 2015, 8:18 a.m. UTC | #10
On Thu, Oct 1, 2015 at 8:36 PM, Jakub Jelinek <jakub@redhat.com> wrote:
> On Thu, Oct 01, 2015 at 02:57:15PM +0100, James Greenhalgh wrote:
>> 2015-10-01  James Greenhalgh  <james.greenhalgh@arm.com>
>>
>>       * match.pd (mult (COPYSIGN:s real_onep @0) @1): New simplifier.
>
> Also, please note that
> +      wide_int m = wi::min_value (TYPE_PRECISION (type), SIGNED);
> +      tree tt
> +     = build_nonstandard_integer_type (TYPE_PRECISION (type),
> +                                       false);
> +      tree mask = wide_int_to_tree (tt, m);
> is really not a reliable way to determine which bit to change.
> In some floating format it is not possible at all, in others it might not
> be the topmost bit of the precision, or might depend on
> FLOAT_WORDS_BIG_ENDIAN etc., see expand_copysign_bit and expand_copysign
> for details (e.g. one has to look at fmt->signbit_rw etc.).
> So, I probably agree with Andrew that it would be better optimized during
> expansion.  One issue for that though is that TER stops at calls, we'd need
> to special case this case.

I agreee with optimizing this in expansion only.  The copysign form is shorter
and it captures the high-level part of the operation better.  Say we later
constant-propagate a positive real into y then chances are high we optimize
the copysign form but not the lowered one.  Also if we ever get VRP to
handle real-type ranges it would need to decipher the sequence as well.

Richard.

>         Jakub
Jakub Jelinek Oct. 2, 2015, 9:03 a.m. UTC | #11
On Fri, Oct 02, 2015 at 10:18:01AM +0200, Richard Biener wrote:
> On Thu, Oct 1, 2015 at 8:36 PM, Jakub Jelinek <jakub@redhat.com> wrote:
> > On Thu, Oct 01, 2015 at 02:57:15PM +0100, James Greenhalgh wrote:
> >> 2015-10-01  James Greenhalgh  <james.greenhalgh@arm.com>
> >>
> >>       * match.pd (mult (COPYSIGN:s real_onep @0) @1): New simplifier.
> >
> > Also, please note that
> > +      wide_int m = wi::min_value (TYPE_PRECISION (type), SIGNED);
> > +      tree tt
> > +     = build_nonstandard_integer_type (TYPE_PRECISION (type),
> > +                                       false);
> > +      tree mask = wide_int_to_tree (tt, m);
> > is really not a reliable way to determine which bit to change.
> > In some floating format it is not possible at all, in others it might not
> > be the topmost bit of the precision, or might depend on
> > FLOAT_WORDS_BIG_ENDIAN etc., see expand_copysign_bit and expand_copysign
> > for details (e.g. one has to look at fmt->signbit_rw etc.).
> > So, I probably agree with Andrew that it would be better optimized during
> > expansion.  One issue for that though is that TER stops at calls, we'd need
> > to special case this case.
> 
> I agreee with optimizing this in expansion only.  The copysign form is shorter
> and it captures the high-level part of the operation better.  Say we later
> constant-propagate a positive real into y then chances are high we optimize
> the copysign form but not the lowered one.  Also if we ever get VRP to
> handle real-type ranges it would need to decipher the sequence as well.

But hacking TER for this special case might not be nice either, perhaps
we want an internal function that would represent this 
- CHANGESIGN (x, y) -- (x * copysign (1.0, y)) (or some better name) and
fold this say during fab pass or so, and then let expansion decide how exactly to
expand it (custom optab, or the generic tweaking of the bit, something else?).

BTW, it seems wrf also in many places uses MAX <copysign (1.0, y), 0.0>
or MIN <copysign (1.0, y), 0.0> (always in pairs), would that be also
something to optimize?

Also, the x * copysign (1.0, y) in wrf is actually x * (1/12.) * copysign (1.0, y)
(or similar - other constants), wouldn't it make more sense to optimize that
as x * copysign (1/12., y) first (at least if we can reassociate)?

	Jakub
Richard Biener Oct. 2, 2015, 9:10 a.m. UTC | #12
On Fri, Oct 2, 2015 at 11:03 AM, Jakub Jelinek <jakub@redhat.com> wrote:
> On Fri, Oct 02, 2015 at 10:18:01AM +0200, Richard Biener wrote:
>> On Thu, Oct 1, 2015 at 8:36 PM, Jakub Jelinek <jakub@redhat.com> wrote:
>> > On Thu, Oct 01, 2015 at 02:57:15PM +0100, James Greenhalgh wrote:
>> >> 2015-10-01  James Greenhalgh  <james.greenhalgh@arm.com>
>> >>
>> >>       * match.pd (mult (COPYSIGN:s real_onep @0) @1): New simplifier.
>> >
>> > Also, please note that
>> > +      wide_int m = wi::min_value (TYPE_PRECISION (type), SIGNED);
>> > +      tree tt
>> > +     = build_nonstandard_integer_type (TYPE_PRECISION (type),
>> > +                                       false);
>> > +      tree mask = wide_int_to_tree (tt, m);
>> > is really not a reliable way to determine which bit to change.
>> > In some floating format it is not possible at all, in others it might not
>> > be the topmost bit of the precision, or might depend on
>> > FLOAT_WORDS_BIG_ENDIAN etc., see expand_copysign_bit and expand_copysign
>> > for details (e.g. one has to look at fmt->signbit_rw etc.).
>> > So, I probably agree with Andrew that it would be better optimized during
>> > expansion.  One issue for that though is that TER stops at calls, we'd need
>> > to special case this case.
>>
>> I agreee with optimizing this in expansion only.  The copysign form is shorter
>> and it captures the high-level part of the operation better.  Say we later
>> constant-propagate a positive real into y then chances are high we optimize
>> the copysign form but not the lowered one.  Also if we ever get VRP to
>> handle real-type ranges it would need to decipher the sequence as well.
>
> But hacking TER for this special case might not be nice either, perhaps
> we want an internal function that would represent this
> - CHANGESIGN (x, y) -- (x * copysign (1.0, y)) (or some better name) and
> fold this say during fab pass or so, and then let expansion decide how exactly to
> expand it (custom optab, or the generic tweaking of the bit, something else?).

In the long run I wanted to make special expansions not require TER by doing
those on the GIMPLE level in a separate pass right before expansion.

> BTW, it seems wrf also in many places uses MAX <copysign (1.0, y), 0.0>
> or MIN <copysign (1.0, y), 0.0> (always in pairs), would that be also
> something to optimize?

Hmm, we'll already CSE copysign so the question is how to optimize
tem1 = MAX <x, y>; tem2= MIN <x, y>;  Turn them back into control-flow?
What would we like to end up with in assembly?

> Also, the x * copysign (1.0, y) in wrf is actually x * (1/12.) * copysign (1.0, y)
> (or similar - other constants), wouldn't it make more sense to optimize that
> as x * copysign (1/12., y) first (at least if we can reassociate)?

Yeah, I think CST * copysign (CST, ...) should constant fold to
copysign (CST', ...)
if that's always valid.  I don't think association comes into play
here but as always
you read the fine prints of the standard for FP optimziations...

Richard.

>
>         Jakub
Jakub Jelinek Oct. 2, 2015, 9:27 a.m. UTC | #13
On Fri, Oct 02, 2015 at 11:10:58AM +0200, Richard Biener wrote:
> > BTW, it seems wrf also in many places uses MAX <copysign (1.0, y), 0.0>
> > or MIN <copysign (1.0, y), 0.0> (always in pairs), would that be also
> > something to optimize?
> 
> Hmm, we'll already CSE copysign so the question is how to optimize
> tem1 = MAX <x, y>; tem2= MIN <x, y>;  Turn them back into control-flow?
> What would we like to end up with in assembly?

This wasn't a well thought thing.  Just that perhaps depending on how
copysign is expanded we might merge that with the min/max somehow.

> > Also, the x * copysign (1.0, y) in wrf is actually x * (1/12.) * copysign (1.0, y)
> > (or similar - other constants), wouldn't it make more sense to optimize that
> > as x * copysign (1/12., y) first (at least if we can reassociate)?
> 
> Yeah, I think CST * copysign (CST, ...) should constant fold to
> copysign (CST', ...)
> if that's always valid.  I don't think association comes into play
> here but as always
> you read the fine prints of the standard for FP optimziations...

The reason talking about reassoc is that we might not necessarily see
CST * copysign (CST, ...) in the IL, but something different:
  _2051 = __builtin_copysignf (1.0e+0, vel_2000);
...
  _2059 = _2051 * _2058;
  _2060 = _2059 * 1.666666753590106964111328125e-2;
is before reassoc1 and
  _2051 = __builtin_copysignf (1.0e+0, vel_2000);
...
  _2047 = _2051 * 1.666666753590106964111328125e-2;
  _2060 = _2047 * _2058;
is after reassoc1, so in this case reassoc1 worked fine, but in another spot
I see
  _2968 = __builtin_copysignf (1.0e+0, vel_2943);
...
  _2973 = _2968 * _2972;
  _2974 = _2973 * 8.3333335816860198974609375e-2;
before reassoc1 and
  _2968 = __builtin_copysignf (1.0e+0, vel_2943);
...
  _3008 = _2972 * 8.3333335816860198974609375e-2;
  _2974 = _3008 * _2968;
after reassoc1, so clearly reassoc puts those two together only randomly.
So, either we'd teach reassoc to prefer putting constant and copysign of
constant together, or even perform this optimization, or the match.pd
(or elsewhere) change would need additional smarts.

Note, I won't have time to work on this in the near future (OpenMP work
still on the plate), so if James (or anyone else?) has time for that, it
would be greatly appreciated.

	Jakub
Richard Biener Oct. 2, 2015, 9:45 a.m. UTC | #14
On Fri, Oct 2, 2015 at 11:27 AM, Jakub Jelinek <jakub@redhat.com> wrote:
> On Fri, Oct 02, 2015 at 11:10:58AM +0200, Richard Biener wrote:
>> > BTW, it seems wrf also in many places uses MAX <copysign (1.0, y), 0.0>
>> > or MIN <copysign (1.0, y), 0.0> (always in pairs), would that be also
>> > something to optimize?
>>
>> Hmm, we'll already CSE copysign so the question is how to optimize
>> tem1 = MAX <x, y>; tem2= MIN <x, y>;  Turn them back into control-flow?
>> What would we like to end up with in assembly?
>
> This wasn't a well thought thing.  Just that perhaps depending on how
> copysign is expanded we might merge that with the min/max somehow.
>
>> > Also, the x * copysign (1.0, y) in wrf is actually x * (1/12.) * copysign (1.0, y)
>> > (or similar - other constants), wouldn't it make more sense to optimize that
>> > as x * copysign (1/12., y) first (at least if we can reassociate)?
>>
>> Yeah, I think CST * copysign (CST, ...) should constant fold to
>> copysign (CST', ...)
>> if that's always valid.  I don't think association comes into play
>> here but as always
>> you read the fine prints of the standard for FP optimziations...
>
> The reason talking about reassoc is that we might not necessarily see
> CST * copysign (CST, ...) in the IL, but something different:
>   _2051 = __builtin_copysignf (1.0e+0, vel_2000);
> ...
>   _2059 = _2051 * _2058;
>   _2060 = _2059 * 1.666666753590106964111328125e-2;
> is before reassoc1 and
>   _2051 = __builtin_copysignf (1.0e+0, vel_2000);
> ...
>   _2047 = _2051 * 1.666666753590106964111328125e-2;
>   _2060 = _2047 * _2058;
> is after reassoc1, so in this case reassoc1 worked fine, but in another spot
> I see
>   _2968 = __builtin_copysignf (1.0e+0, vel_2943);
> ...
>   _2973 = _2968 * _2972;
>   _2974 = _2973 * 8.3333335816860198974609375e-2;
> before reassoc1 and
>   _2968 = __builtin_copysignf (1.0e+0, vel_2943);
> ...
>   _3008 = _2972 * 8.3333335816860198974609375e-2;
>   _2974 = _3008 * _2968;
> after reassoc1, so clearly reassoc puts those two together only randomly.
> So, either we'd teach reassoc to prefer putting constant and copysign of
> constant together, or even perform this optimization, or the match.pd
> (or elsewhere) change would need additional smarts.

I think for this case reassoc would need to assign the proper 'rank' to
the DEF with the copysign (based on the rank of its first argument).

But yes, this kind of association dependent patterns need to be handled
in reassoc itself.

> Note, I won't have time to work on this in the near future (OpenMP work
> still on the plate), so if James (or anyone else?) has time for that, it
> would be greatly appreciated.

So maybe just open an enhancement bug for now, citing the WRF use.

Richard.

>         Jakub
Jakub Jelinek Oct. 2, 2015, 9:58 a.m. UTC | #15
On Fri, Oct 02, 2015 at 11:45:08AM +0200, Richard Biener wrote:
> > Note, I won't have time to work on this in the near future (OpenMP work
> > still on the plate), so if James (or anyone else?) has time for that, it
> > would be greatly appreciated.
> 
> So maybe just open an enhancement bug for now, citing the WRF use.

PR67815.

	Jakub
Marek Polacek Oct. 2, 2015, 10:06 a.m. UTC | #16
On Fri, Oct 02, 2015 at 11:58:42AM +0200, Jakub Jelinek wrote:
> On Fri, Oct 02, 2015 at 11:45:08AM +0200, Richard Biener wrote:
> > > Note, I won't have time to work on this in the near future (OpenMP work
> > > still on the plate), so if James (or anyone else?) has time for that, it
> > > would be greatly appreciated.
> > 
> > So maybe just open an enhancement bug for now, citing the WRF use.
> 
> PR67815.

James, are you interested in this one, or would you prefer if I take it?

	Marek
James Greenhalgh Oct. 2, 2015, 10:24 a.m. UTC | #17
On Fri, Oct 02, 2015 at 11:06:53AM +0100, Marek Polacek wrote:
> On Fri, Oct 02, 2015 at 11:58:42AM +0200, Jakub Jelinek wrote:
> > On Fri, Oct 02, 2015 at 11:45:08AM +0200, Richard Biener wrote:
> > > > Note, I won't have time to work on this in the near future (OpenMP work
> > > > still on the plate), so if James (or anyone else?) has time for that, it
> > > > would be greatly appreciated.
> > > 
> > > So maybe just open an enhancement bug for now, citing the WRF use.
> > 
> > PR67815.
> 
> James, are you interested in this one, or would you prefer if I take it?

I'm happy either way.

I haven't looked at reassoc before, so I don't know my way wround it, and
I'm out of office for a wedding in the next couple of weeks (+ other
ongoing stuff), so I'm not sure I'd be able to get a prototype out before
Stage 1 closes.

If you have the free cycles for it, I'd be happy for you to take it, if
not I'll try to get to it late this month.

Thanks,
James
Marek Polacek Oct. 2, 2015, 10:29 a.m. UTC | #18
On Fri, Oct 02, 2015 at 11:24:32AM +0100, James Greenhalgh wrote:
> > > PR67815.
> > 
> > James, are you interested in this one, or would you prefer if I take it?
> 
> I'm happy either way.
> 
> I haven't looked at reassoc before, so I don't know my way wround it, and
> I'm out of office for a wedding in the next couple of weeks (+ other
> ongoing stuff), so I'm not sure I'd be able to get a prototype out before
> Stage 1 closes.

Congrats (if it's your wedding) ;).
 
> If you have the free cycles for it, I'd be happy for you to take it, if
> not I'll try to get to it late this month.

All right, I'll have a look at it next week; maybe I'll be able to come up
with something (I cringe when I see floating-point stuff).

	Marek
Richard Biener Oct. 2, 2015, 10:32 a.m. UTC | #19
On Fri, Oct 2, 2015 at 12:24 PM, James Greenhalgh
<james.greenhalgh@arm.com> wrote:
> On Fri, Oct 02, 2015 at 11:06:53AM +0100, Marek Polacek wrote:
>> On Fri, Oct 02, 2015 at 11:58:42AM +0200, Jakub Jelinek wrote:
>> > On Fri, Oct 02, 2015 at 11:45:08AM +0200, Richard Biener wrote:
>> > > > Note, I won't have time to work on this in the near future (OpenMP work
>> > > > still on the plate), so if James (or anyone else?) has time for that, it
>> > > > would be greatly appreciated.
>> > >
>> > > So maybe just open an enhancement bug for now, citing the WRF use.
>> >
>> > PR67815.
>>
>> James, are you interested in this one, or would you prefer if I take it?
>
> I'm happy either way.
>
> I haven't looked at reassoc before, so I don't know my way wround it, and
> I'm out of office for a wedding in the next couple of weeks (+ other
> ongoing stuff), so I'm not sure I'd be able to get a prototype out before
> Stage 1 closes.

Well, now that it has a  bug it will be a bugfix and thus qualify for
stage 3 ....

Richard.

> If you have the free cycles for it, I'd be happy for you to take it, if
> not I'll try to get to it late this month.
>
> Thanks,
> James
>
diff mbox

Patch

diff --git a/gcc/match.pd b/gcc/match.pd
index bd5c267..d51ad2e 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -61,6 +61,7 @@  along with GCC; see the file COPYING3.  If not see
 (define_operator_list TAN BUILT_IN_TANF BUILT_IN_TAN BUILT_IN_TANL)
 (define_operator_list COSH BUILT_IN_COSHF BUILT_IN_COSH BUILT_IN_COSHL)
 (define_operator_list CEXPI BUILT_IN_CEXPIF BUILT_IN_CEXPI BUILT_IN_CEXPIL)
+(define_operator_list COPYSIGN BUILT_IN_COPYSIGNF BUILT_IN_COPYSIGN BUILT_IN_COPYSIGNL)
 
 /* Simplifications of operations with one constant operand and
    simplifications to constants or single values.  */
@@ -2079,6 +2080,21 @@  along with GCC; see the file COPYING3.  If not see
 
 /* Simplification of math builtins.  */
 
+/* Simplify x * copysign (1.0, y) -> x ^ (y & (1 << sign_bit_position)).  */
+(simplify
+  (mult:c (COPYSIGN:s real_onep @0) @1)
+  (with
+    {
+      wide_int m = wi::min_value (TYPE_PRECISION (type), SIGNED);
+      tree tt
+	= build_nonstandard_integer_type (TYPE_PRECISION (type),
+					  false);
+      tree mask = wide_int_to_tree (tt, m);
+    }
+    (view_convert (bit_xor (view_convert:tt @1)
+			   (bit_and (view_convert:tt @0)
+				     { mask; })))))
+
 /* fold_builtin_logarithm */
 (if (flag_unsafe_math_optimizations)
 
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/copysign.c b/gcc/testsuite/gcc.dg/tree-ssa/copysign.c
new file mode 100644
index 0000000..b67f3c1
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/copysign.c
@@ -0,0 +1,22 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O -fdump-tree-gimple" } */
+
+double
+foo_d (double x, double y)
+{
+  return x * __builtin_copysign (1.0, y);
+}
+
+float
+foo_f (float x, float y)
+{
+  return x * __builtin_copysignf (1.0f, y);
+}
+
+long double
+foo_l (long double x, long double y)
+{
+  return x * __builtin_copysignl (1.0, y);
+}
+
+/* { dg-final { scan-tree-dump-not "copysign" "gimple"} } */