diff mbox

Improve detection of widening multiplication in the vectorizer

Message ID BANLkTikSzRMm_QFL6YKKN6dxNtAEx8DNqg@mail.gmail.com
State Accepted
Headers show

Commit Message

Ira Rosen June 2, 2011, 8:46 a.m. UTC
On 1 June 2011 15:14, Richard Guenther <richard.guenther@gmail.com> wrote:
> On Wed, Jun 1, 2011 at 1:37 PM, Ira Rosen <ira.rosen@linaro.org> wrote:
>> On 1 June 2011 12:42, Richard Guenther <richard.guenther@gmail.com> wrote:
>>
>>> Did you think about moving pass_optimize_widening_mul before
>>> loop optimizations?  Does that pass catch the cases you are
>>> teaching the pattern recognizer?  I think we should try to expose
>>> these more complicated instructions to loop optimizers.
>>>
>>
>> pass_optimize_widening_mul doesn't catch these cases, but I can try to
>> teach it instead of the vectorizer.
>> I am now testing
>>
>> Index: passes.c
>> ===================================================================
>> --- passes.c    (revision 174391)
>> +++ passes.c    (working copy)
>> @@ -870,6 +870,7 @@
>>       NEXT_PASS (pass_split_crit_edges);
>>       NEXT_PASS (pass_pre);
>>       NEXT_PASS (pass_sink_code);
>> +      NEXT_PASS (pass_optimize_widening_mul);
>>       NEXT_PASS (pass_tree_loop);
>>        {
>>          struct opt_pass **p = &pass_tree_loop.pass.sub;
>> @@ -934,7 +935,6 @@
>>       NEXT_PASS (pass_forwprop);
>>       NEXT_PASS (pass_phiopt);
>>       NEXT_PASS (pass_fold_builtins);
>> -      NEXT_PASS (pass_optimize_widening_mul);
>>       NEXT_PASS (pass_tail_calls);
>>       NEXT_PASS (pass_rename_ssa_copies);
>>       NEXT_PASS (pass_uncprop);
>>
>> to see how it affects other loop optimizations (vectorizer pattern
>> tests obviously fail).

Looks like it needs copy_prop and dce as well:


otherwise I get (on x86_64-suse-linux)

FAIL: gcc.target/i386/fma4-fma-2.c scan-assembler vfmaddss
FAIL: gcc.target/i386/fma4-fma-2.c scan-assembler vfmaddsd
FAIL: gcc.target/i386/fma4-fma-2.c scan-assembler vfmsubss
FAIL: gcc.target/i386/fma4-fma-2.c scan-assembler vfmsubsd
FAIL: gcc.target/i386/fma4-fma-2.c scan-assembler vfnmaddss
FAIL: gcc.target/i386/fma4-fma-2.c scan-assembler vfnmaddsd

Ira

>
> Thanks.  I would hope that we eventually can get rid of the
> pattern recognizer ... at least for SSE there is also always
> a scalar variant instruction for each vectorized one.
>
> Richard.
>

Comments

Richard Biener June 2, 2011, 9:59 a.m. UTC | #1
On Thu, Jun 2, 2011 at 10:46 AM, Ira Rosen <ira.rosen@linaro.org> wrote:
> On 1 June 2011 15:14, Richard Guenther <richard.guenther@gmail.com> wrote:
>> On Wed, Jun 1, 2011 at 1:37 PM, Ira Rosen <ira.rosen@linaro.org> wrote:
>>> On 1 June 2011 12:42, Richard Guenther <richard.guenther@gmail.com> wrote:
>>>
>>>> Did you think about moving pass_optimize_widening_mul before
>>>> loop optimizations?  Does that pass catch the cases you are
>>>> teaching the pattern recognizer?  I think we should try to expose
>>>> these more complicated instructions to loop optimizers.
>>>>
>>>
>>> pass_optimize_widening_mul doesn't catch these cases, but I can try to
>>> teach it instead of the vectorizer.
>>> I am now testing
>>>
>>> Index: passes.c
>>> ===================================================================
>>> --- passes.c    (revision 174391)
>>> +++ passes.c    (working copy)
>>> @@ -870,6 +870,7 @@
>>>       NEXT_PASS (pass_split_crit_edges);
>>>       NEXT_PASS (pass_pre);
>>>       NEXT_PASS (pass_sink_code);
>>> +      NEXT_PASS (pass_optimize_widening_mul);
>>>       NEXT_PASS (pass_tree_loop);
>>>        {
>>>          struct opt_pass **p = &pass_tree_loop.pass.sub;
>>> @@ -934,7 +935,6 @@
>>>       NEXT_PASS (pass_forwprop);
>>>       NEXT_PASS (pass_phiopt);
>>>       NEXT_PASS (pass_fold_builtins);
>>> -      NEXT_PASS (pass_optimize_widening_mul);
>>>       NEXT_PASS (pass_tail_calls);
>>>       NEXT_PASS (pass_rename_ssa_copies);
>>>       NEXT_PASS (pass_uncprop);
>>>
>>> to see how it affects other loop optimizations (vectorizer pattern
>>> tests obviously fail).
>
> Looks like it needs copy_prop and dce as well:
>
> Index: passes.c
> ===================================================================
> --- passes.c    (revision 174391)
> +++ passes.c    (working copy)
> @@ -870,6 +870,9 @@
>       NEXT_PASS (pass_split_crit_edges);
>       NEXT_PASS (pass_pre);
>       NEXT_PASS (pass_sink_code);
> +      NEXT_PASS (pass_copy_prop);
> +      NEXT_PASS (pass_dce);
> +      NEXT_PASS (pass_optimize_widening_mul);
>       NEXT_PASS (pass_tree_loop);
>        {
>          struct opt_pass **p = &pass_tree_loop.pass.sub;
> @@ -934,7 +937,6 @@
>       NEXT_PASS (pass_forwprop);
>       NEXT_PASS (pass_phiopt);
>       NEXT_PASS (pass_fold_builtins);
> -      NEXT_PASS (pass_optimize_widening_mul);
>       NEXT_PASS (pass_tail_calls);
>       NEXT_PASS (pass_rename_ssa_copies);
>       NEXT_PASS (pass_uncprop);
>
> otherwise I get (on x86_64-suse-linux)
>
> FAIL: gcc.target/i386/fma4-fma-2.c scan-assembler vfmaddss
> FAIL: gcc.target/i386/fma4-fma-2.c scan-assembler vfmaddsd
> FAIL: gcc.target/i386/fma4-fma-2.c scan-assembler vfmsubss
> FAIL: gcc.target/i386/fma4-fma-2.c scan-assembler vfmsubsd
> FAIL: gcc.target/i386/fma4-fma-2.c scan-assembler vfnmaddss
> FAIL: gcc.target/i386/fma4-fma-2.c scan-assembler vfnmaddsd

Hmm.  I would have put the pass next to the sincos pass, but yes,
in principle a copyprop & dce pass after PRE makes sense
(the loop passes likely don't run because there are no loops in
those testcases - both copyprop and dce should be scheduled
more like TODOs, or even automatically by the pass manager
via PROPs ...).  Dead code can indeed confuse those matching
passes that look for single-use vars.

I'll think about a more elegant solution for this problem.

Would you mind checking if the next-to-sincos position makes
any difference?

Thanks,
Richard.

> Ira
>
>>
>> Thanks.  I would hope that we eventually can get rid of the
>> pattern recognizer ... at least for SSE there is also always
>> a scalar variant instruction for each vectorized one.
>>
>> Richard.
>>
>
Ira Rosen June 2, 2011, 11:08 a.m. UTC | #2
On 2 June 2011 12:59, Richard Guenther <richard.guenther@gmail.com> wrote:
> On Thu, Jun 2, 2011 at 10:46 AM, Ira Rosen <ira.rosen@linaro.org> wrote:
>> On 1 June 2011 15:14, Richard Guenther <richard.guenther@gmail.com> wrote:
>>> On Wed, Jun 1, 2011 at 1:37 PM, Ira Rosen <ira.rosen@linaro.org> wrote:
>>>> On 1 June 2011 12:42, Richard Guenther <richard.guenther@gmail.com> wrote:
>>>>
>>>>> Did you think about moving pass_optimize_widening_mul before
>>>>> loop optimizations?  Does that pass catch the cases you are
>>>>> teaching the pattern recognizer?  I think we should try to expose
>>>>> these more complicated instructions to loop optimizers.
>>>>>
>>>>
>>>> pass_optimize_widening_mul doesn't catch these cases, but I can try to
>>>> teach it instead of the vectorizer.
>>>> I am now testing
>>>>
>>>> Index: passes.c
>>>> ===================================================================
>>>> --- passes.c    (revision 174391)
>>>> +++ passes.c    (working copy)
>>>> @@ -870,6 +870,7 @@
>>>>       NEXT_PASS (pass_split_crit_edges);
>>>>       NEXT_PASS (pass_pre);
>>>>       NEXT_PASS (pass_sink_code);
>>>> +      NEXT_PASS (pass_optimize_widening_mul);
>>>>       NEXT_PASS (pass_tree_loop);
>>>>        {
>>>>          struct opt_pass **p = &pass_tree_loop.pass.sub;
>>>> @@ -934,7 +935,6 @@
>>>>       NEXT_PASS (pass_forwprop);
>>>>       NEXT_PASS (pass_phiopt);
>>>>       NEXT_PASS (pass_fold_builtins);
>>>> -      NEXT_PASS (pass_optimize_widening_mul);
>>>>       NEXT_PASS (pass_tail_calls);
>>>>       NEXT_PASS (pass_rename_ssa_copies);
>>>>       NEXT_PASS (pass_uncprop);
>>>>
>>>> to see how it affects other loop optimizations (vectorizer pattern
>>>> tests obviously fail).
>>
>> Looks like it needs copy_prop and dce as well:
>>
>> Index: passes.c
>> ===================================================================
>> --- passes.c    (revision 174391)
>> +++ passes.c    (working copy)
>> @@ -870,6 +870,9 @@
>>       NEXT_PASS (pass_split_crit_edges);
>>       NEXT_PASS (pass_pre);
>>       NEXT_PASS (pass_sink_code);
>> +      NEXT_PASS (pass_copy_prop);
>> +      NEXT_PASS (pass_dce);
>> +      NEXT_PASS (pass_optimize_widening_mul);
>>       NEXT_PASS (pass_tree_loop);
>>        {
>>          struct opt_pass **p = &pass_tree_loop.pass.sub;
>> @@ -934,7 +937,6 @@
>>       NEXT_PASS (pass_forwprop);
>>       NEXT_PASS (pass_phiopt);
>>       NEXT_PASS (pass_fold_builtins);
>> -      NEXT_PASS (pass_optimize_widening_mul);
>>       NEXT_PASS (pass_tail_calls);
>>       NEXT_PASS (pass_rename_ssa_copies);
>>       NEXT_PASS (pass_uncprop);
>>
>> otherwise I get (on x86_64-suse-linux)
>>
>> FAIL: gcc.target/i386/fma4-fma-2.c scan-assembler vfmaddss
>> FAIL: gcc.target/i386/fma4-fma-2.c scan-assembler vfmaddsd
>> FAIL: gcc.target/i386/fma4-fma-2.c scan-assembler vfmsubss
>> FAIL: gcc.target/i386/fma4-fma-2.c scan-assembler vfmsubsd
>> FAIL: gcc.target/i386/fma4-fma-2.c scan-assembler vfnmaddss
>> FAIL: gcc.target/i386/fma4-fma-2.c scan-assembler vfnmaddsd
>
> Hmm.  I would have put the pass next to the sincos pass, but yes,
> in principle a copyprop & dce pass after PRE makes sense
> (the loop passes likely don't run because there are no loops in
> those testcases - both copyprop and dce should be scheduled
> more like TODOs, or even automatically by the pass manager
> via PROPs ...).  Dead code can indeed confuse those matching
> passes that look for single-use vars.
>
> I'll think about a more elegant solution for this problem.
>
> Would you mind checking if the next-to-sincos position makes
> any difference?

Before sincos we have

  D.2747_2 = __builtin_powf (a_1(D), 2.0e+0);
  D.2746_4 = D.2747_2 + c_3(D);

which is transformed by sincos to

  powmult.8_7 = a_1(D) * a_1(D);
  D.2747_2 = powmult.8_7;
  D.2746_4 = D.2747_2 + c_3(D);

but widening_mul  is confused by D.2747_2 = powmult.8_7; and it needs
both copy_prop and dce to remove it:

  powmult.8_7 = a_1(D) * a_1(D);
  D.2746_4 = c_3(D) + powmult.8_7;

So moving widening_mul next to sincos doesn't help.
Maybe gimple_expand_builtin_pow() can be changed to generate the last
version by itself?

Ira

>
> Thanks,
> Richard.
>
>> Ira
>>
>>>
>>> Thanks.  I would hope that we eventually can get rid of the
>>> pattern recognizer ... at least for SSE there is also always
>>> a scalar variant instruction for each vectorized one.
>>>
>>> Richard.
>>>
>>
>
Richard Biener June 2, 2011, 3:35 p.m. UTC | #3
On Thu, Jun 2, 2011 at 1:08 PM, Ira Rosen <ira.rosen@linaro.org> wrote:
> On 2 June 2011 12:59, Richard Guenther <richard.guenther@gmail.com> wrote:
>> On Thu, Jun 2, 2011 at 10:46 AM, Ira Rosen <ira.rosen@linaro.org> wrote:
>>> On 1 June 2011 15:14, Richard Guenther <richard.guenther@gmail.com> wrote:
>>>> On Wed, Jun 1, 2011 at 1:37 PM, Ira Rosen <ira.rosen@linaro.org> wrote:
>>>>> On 1 June 2011 12:42, Richard Guenther <richard.guenther@gmail.com> wrote:
>>>>>
>>>>>> Did you think about moving pass_optimize_widening_mul before
>>>>>> loop optimizations?  Does that pass catch the cases you are
>>>>>> teaching the pattern recognizer?  I think we should try to expose
>>>>>> these more complicated instructions to loop optimizers.
>>>>>>
>>>>>
>>>>> pass_optimize_widening_mul doesn't catch these cases, but I can try to
>>>>> teach it instead of the vectorizer.
>>>>> I am now testing
>>>>>
>>>>> Index: passes.c
>>>>> ===================================================================
>>>>> --- passes.c    (revision 174391)
>>>>> +++ passes.c    (working copy)
>>>>> @@ -870,6 +870,7 @@
>>>>>       NEXT_PASS (pass_split_crit_edges);
>>>>>       NEXT_PASS (pass_pre);
>>>>>       NEXT_PASS (pass_sink_code);
>>>>> +      NEXT_PASS (pass_optimize_widening_mul);
>>>>>       NEXT_PASS (pass_tree_loop);
>>>>>        {
>>>>>          struct opt_pass **p = &pass_tree_loop.pass.sub;
>>>>> @@ -934,7 +935,6 @@
>>>>>       NEXT_PASS (pass_forwprop);
>>>>>       NEXT_PASS (pass_phiopt);
>>>>>       NEXT_PASS (pass_fold_builtins);
>>>>> -      NEXT_PASS (pass_optimize_widening_mul);
>>>>>       NEXT_PASS (pass_tail_calls);
>>>>>       NEXT_PASS (pass_rename_ssa_copies);
>>>>>       NEXT_PASS (pass_uncprop);
>>>>>
>>>>> to see how it affects other loop optimizations (vectorizer pattern
>>>>> tests obviously fail).
>>>
>>> Looks like it needs copy_prop and dce as well:
>>>
>>> Index: passes.c
>>> ===================================================================
>>> --- passes.c    (revision 174391)
>>> +++ passes.c    (working copy)
>>> @@ -870,6 +870,9 @@
>>>       NEXT_PASS (pass_split_crit_edges);
>>>       NEXT_PASS (pass_pre);
>>>       NEXT_PASS (pass_sink_code);
>>> +      NEXT_PASS (pass_copy_prop);
>>> +      NEXT_PASS (pass_dce);
>>> +      NEXT_PASS (pass_optimize_widening_mul);
>>>       NEXT_PASS (pass_tree_loop);
>>>        {
>>>          struct opt_pass **p = &pass_tree_loop.pass.sub;
>>> @@ -934,7 +937,6 @@
>>>       NEXT_PASS (pass_forwprop);
>>>       NEXT_PASS (pass_phiopt);
>>>       NEXT_PASS (pass_fold_builtins);
>>> -      NEXT_PASS (pass_optimize_widening_mul);
>>>       NEXT_PASS (pass_tail_calls);
>>>       NEXT_PASS (pass_rename_ssa_copies);
>>>       NEXT_PASS (pass_uncprop);
>>>
>>> otherwise I get (on x86_64-suse-linux)
>>>
>>> FAIL: gcc.target/i386/fma4-fma-2.c scan-assembler vfmaddss
>>> FAIL: gcc.target/i386/fma4-fma-2.c scan-assembler vfmaddsd
>>> FAIL: gcc.target/i386/fma4-fma-2.c scan-assembler vfmsubss
>>> FAIL: gcc.target/i386/fma4-fma-2.c scan-assembler vfmsubsd
>>> FAIL: gcc.target/i386/fma4-fma-2.c scan-assembler vfnmaddss
>>> FAIL: gcc.target/i386/fma4-fma-2.c scan-assembler vfnmaddsd
>>
>> Hmm.  I would have put the pass next to the sincos pass, but yes,
>> in principle a copyprop & dce pass after PRE makes sense
>> (the loop passes likely don't run because there are no loops in
>> those testcases - both copyprop and dce should be scheduled
>> more like TODOs, or even automatically by the pass manager
>> via PROPs ...).  Dead code can indeed confuse those matching
>> passes that look for single-use vars.
>>
>> I'll think about a more elegant solution for this problem.
>>
>> Would you mind checking if the next-to-sincos position makes
>> any difference?
>
> Before sincos we have
>
>  D.2747_2 = __builtin_powf (a_1(D), 2.0e+0);
>  D.2746_4 = D.2747_2 + c_3(D);
>
> which is transformed by sincos to
>
>  powmult.8_7 = a_1(D) * a_1(D);
>  D.2747_2 = powmult.8_7;
>  D.2746_4 = D.2747_2 + c_3(D);
>
> but widening_mul  is confused by D.2747_2 = powmult.8_7; and it needs
> both copy_prop and dce to remove it:
>
>  powmult.8_7 = a_1(D) * a_1(D);
>  D.2746_4 = c_3(D) + powmult.8_7;
>
> So moving widening_mul next to sincos doesn't help.
> Maybe gimple_expand_builtin_pow() can be changed to generate the last
> version by itself?

Yeah, I guess so.  I'll have a look.

Richard.

> Ira
>
>>
>> Thanks,
>> Richard.
>>
>>> Ira
>>>
>>>>
>>>> Thanks.  I would hope that we eventually can get rid of the
>>>> pattern recognizer ... at least for SSE there is also always
>>>> a scalar variant instruction for each vectorized one.
>>>>
>>>> Richard.
>>>>
>>>
>>
>
diff mbox

Patch

Index: passes.c
===================================================================
--- passes.c    (revision 174391)
+++ passes.c    (working copy)
@@ -870,6 +870,9 @@ 
       NEXT_PASS (pass_split_crit_edges);
       NEXT_PASS (pass_pre);
       NEXT_PASS (pass_sink_code);
+      NEXT_PASS (pass_copy_prop);
+      NEXT_PASS (pass_dce);
+      NEXT_PASS (pass_optimize_widening_mul);
       NEXT_PASS (pass_tree_loop);
        {
          struct opt_pass **p = &pass_tree_loop.pass.sub;
@@ -934,7 +937,6 @@ 
       NEXT_PASS (pass_forwprop);
       NEXT_PASS (pass_phiopt);
       NEXT_PASS (pass_fold_builtins);
-      NEXT_PASS (pass_optimize_widening_mul);
       NEXT_PASS (pass_tail_calls);
       NEXT_PASS (pass_rename_ssa_copies);
       NEXT_PASS (pass_uncprop);