Message ID | BANLkTikSzRMm_QFL6YKKN6dxNtAEx8DNqg@mail.gmail.com |
---|---|
State | Accepted |
Headers | show |
On Thu, Jun 2, 2011 at 10:46 AM, Ira Rosen <ira.rosen@linaro.org> wrote: > On 1 June 2011 15:14, Richard Guenther <richard.guenther@gmail.com> wrote: >> On Wed, Jun 1, 2011 at 1:37 PM, Ira Rosen <ira.rosen@linaro.org> wrote: >>> On 1 June 2011 12:42, Richard Guenther <richard.guenther@gmail.com> wrote: >>> >>>> Did you think about moving pass_optimize_widening_mul before >>>> loop optimizations? Does that pass catch the cases you are >>>> teaching the pattern recognizer? I think we should try to expose >>>> these more complicated instructions to loop optimizers. >>>> >>> >>> pass_optimize_widening_mul doesn't catch these cases, but I can try to >>> teach it instead of the vectorizer. >>> I am now testing >>> >>> Index: passes.c >>> =================================================================== >>> --- passes.c (revision 174391) >>> +++ passes.c (working copy) >>> @@ -870,6 +870,7 @@ >>> NEXT_PASS (pass_split_crit_edges); >>> NEXT_PASS (pass_pre); >>> NEXT_PASS (pass_sink_code); >>> + NEXT_PASS (pass_optimize_widening_mul); >>> NEXT_PASS (pass_tree_loop); >>> { >>> struct opt_pass **p = &pass_tree_loop.pass.sub; >>> @@ -934,7 +935,6 @@ >>> NEXT_PASS (pass_forwprop); >>> NEXT_PASS (pass_phiopt); >>> NEXT_PASS (pass_fold_builtins); >>> - NEXT_PASS (pass_optimize_widening_mul); >>> NEXT_PASS (pass_tail_calls); >>> NEXT_PASS (pass_rename_ssa_copies); >>> NEXT_PASS (pass_uncprop); >>> >>> to see how it affects other loop optimizations (vectorizer pattern >>> tests obviously fail). > > Looks like it needs copy_prop and dce as well: > > Index: passes.c > =================================================================== > --- passes.c (revision 174391) > +++ passes.c (working copy) > @@ -870,6 +870,9 @@ > NEXT_PASS (pass_split_crit_edges); > NEXT_PASS (pass_pre); > NEXT_PASS (pass_sink_code); > + NEXT_PASS (pass_copy_prop); > + NEXT_PASS (pass_dce); > + NEXT_PASS (pass_optimize_widening_mul); > NEXT_PASS (pass_tree_loop); > { > struct opt_pass **p = &pass_tree_loop.pass.sub; > @@ -934,7 +937,6 @@ > NEXT_PASS (pass_forwprop); > NEXT_PASS (pass_phiopt); > NEXT_PASS (pass_fold_builtins); > - NEXT_PASS (pass_optimize_widening_mul); > NEXT_PASS (pass_tail_calls); > NEXT_PASS (pass_rename_ssa_copies); > NEXT_PASS (pass_uncprop); > > otherwise I get (on x86_64-suse-linux) > > FAIL: gcc.target/i386/fma4-fma-2.c scan-assembler vfmaddss > FAIL: gcc.target/i386/fma4-fma-2.c scan-assembler vfmaddsd > FAIL: gcc.target/i386/fma4-fma-2.c scan-assembler vfmsubss > FAIL: gcc.target/i386/fma4-fma-2.c scan-assembler vfmsubsd > FAIL: gcc.target/i386/fma4-fma-2.c scan-assembler vfnmaddss > FAIL: gcc.target/i386/fma4-fma-2.c scan-assembler vfnmaddsd Hmm. I would have put the pass next to the sincos pass, but yes, in principle a copyprop & dce pass after PRE makes sense (the loop passes likely don't run because there are no loops in those testcases - both copyprop and dce should be scheduled more like TODOs, or even automatically by the pass manager via PROPs ...). Dead code can indeed confuse those matching passes that look for single-use vars. I'll think about a more elegant solution for this problem. Would you mind checking if the next-to-sincos position makes any difference? Thanks, Richard. > Ira > >> >> Thanks. I would hope that we eventually can get rid of the >> pattern recognizer ... at least for SSE there is also always >> a scalar variant instruction for each vectorized one. >> >> Richard. >> >
On 2 June 2011 12:59, Richard Guenther <richard.guenther@gmail.com> wrote: > On Thu, Jun 2, 2011 at 10:46 AM, Ira Rosen <ira.rosen@linaro.org> wrote: >> On 1 June 2011 15:14, Richard Guenther <richard.guenther@gmail.com> wrote: >>> On Wed, Jun 1, 2011 at 1:37 PM, Ira Rosen <ira.rosen@linaro.org> wrote: >>>> On 1 June 2011 12:42, Richard Guenther <richard.guenther@gmail.com> wrote: >>>> >>>>> Did you think about moving pass_optimize_widening_mul before >>>>> loop optimizations? Does that pass catch the cases you are >>>>> teaching the pattern recognizer? I think we should try to expose >>>>> these more complicated instructions to loop optimizers. >>>>> >>>> >>>> pass_optimize_widening_mul doesn't catch these cases, but I can try to >>>> teach it instead of the vectorizer. >>>> I am now testing >>>> >>>> Index: passes.c >>>> =================================================================== >>>> --- passes.c (revision 174391) >>>> +++ passes.c (working copy) >>>> @@ -870,6 +870,7 @@ >>>> NEXT_PASS (pass_split_crit_edges); >>>> NEXT_PASS (pass_pre); >>>> NEXT_PASS (pass_sink_code); >>>> + NEXT_PASS (pass_optimize_widening_mul); >>>> NEXT_PASS (pass_tree_loop); >>>> { >>>> struct opt_pass **p = &pass_tree_loop.pass.sub; >>>> @@ -934,7 +935,6 @@ >>>> NEXT_PASS (pass_forwprop); >>>> NEXT_PASS (pass_phiopt); >>>> NEXT_PASS (pass_fold_builtins); >>>> - NEXT_PASS (pass_optimize_widening_mul); >>>> NEXT_PASS (pass_tail_calls); >>>> NEXT_PASS (pass_rename_ssa_copies); >>>> NEXT_PASS (pass_uncprop); >>>> >>>> to see how it affects other loop optimizations (vectorizer pattern >>>> tests obviously fail). >> >> Looks like it needs copy_prop and dce as well: >> >> Index: passes.c >> =================================================================== >> --- passes.c (revision 174391) >> +++ passes.c (working copy) >> @@ -870,6 +870,9 @@ >> NEXT_PASS (pass_split_crit_edges); >> NEXT_PASS (pass_pre); >> NEXT_PASS (pass_sink_code); >> + NEXT_PASS (pass_copy_prop); >> + NEXT_PASS (pass_dce); >> + NEXT_PASS (pass_optimize_widening_mul); >> NEXT_PASS (pass_tree_loop); >> { >> struct opt_pass **p = &pass_tree_loop.pass.sub; >> @@ -934,7 +937,6 @@ >> NEXT_PASS (pass_forwprop); >> NEXT_PASS (pass_phiopt); >> NEXT_PASS (pass_fold_builtins); >> - NEXT_PASS (pass_optimize_widening_mul); >> NEXT_PASS (pass_tail_calls); >> NEXT_PASS (pass_rename_ssa_copies); >> NEXT_PASS (pass_uncprop); >> >> otherwise I get (on x86_64-suse-linux) >> >> FAIL: gcc.target/i386/fma4-fma-2.c scan-assembler vfmaddss >> FAIL: gcc.target/i386/fma4-fma-2.c scan-assembler vfmaddsd >> FAIL: gcc.target/i386/fma4-fma-2.c scan-assembler vfmsubss >> FAIL: gcc.target/i386/fma4-fma-2.c scan-assembler vfmsubsd >> FAIL: gcc.target/i386/fma4-fma-2.c scan-assembler vfnmaddss >> FAIL: gcc.target/i386/fma4-fma-2.c scan-assembler vfnmaddsd > > Hmm. I would have put the pass next to the sincos pass, but yes, > in principle a copyprop & dce pass after PRE makes sense > (the loop passes likely don't run because there are no loops in > those testcases - both copyprop and dce should be scheduled > more like TODOs, or even automatically by the pass manager > via PROPs ...). Dead code can indeed confuse those matching > passes that look for single-use vars. > > I'll think about a more elegant solution for this problem. > > Would you mind checking if the next-to-sincos position makes > any difference? Before sincos we have D.2747_2 = __builtin_powf (a_1(D), 2.0e+0); D.2746_4 = D.2747_2 + c_3(D); which is transformed by sincos to powmult.8_7 = a_1(D) * a_1(D); D.2747_2 = powmult.8_7; D.2746_4 = D.2747_2 + c_3(D); but widening_mul is confused by D.2747_2 = powmult.8_7; and it needs both copy_prop and dce to remove it: powmult.8_7 = a_1(D) * a_1(D); D.2746_4 = c_3(D) + powmult.8_7; So moving widening_mul next to sincos doesn't help. Maybe gimple_expand_builtin_pow() can be changed to generate the last version by itself? Ira > > Thanks, > Richard. > >> Ira >> >>> >>> Thanks. I would hope that we eventually can get rid of the >>> pattern recognizer ... at least for SSE there is also always >>> a scalar variant instruction for each vectorized one. >>> >>> Richard. >>> >> >
On Thu, Jun 2, 2011 at 1:08 PM, Ira Rosen <ira.rosen@linaro.org> wrote: > On 2 June 2011 12:59, Richard Guenther <richard.guenther@gmail.com> wrote: >> On Thu, Jun 2, 2011 at 10:46 AM, Ira Rosen <ira.rosen@linaro.org> wrote: >>> On 1 June 2011 15:14, Richard Guenther <richard.guenther@gmail.com> wrote: >>>> On Wed, Jun 1, 2011 at 1:37 PM, Ira Rosen <ira.rosen@linaro.org> wrote: >>>>> On 1 June 2011 12:42, Richard Guenther <richard.guenther@gmail.com> wrote: >>>>> >>>>>> Did you think about moving pass_optimize_widening_mul before >>>>>> loop optimizations? Does that pass catch the cases you are >>>>>> teaching the pattern recognizer? I think we should try to expose >>>>>> these more complicated instructions to loop optimizers. >>>>>> >>>>> >>>>> pass_optimize_widening_mul doesn't catch these cases, but I can try to >>>>> teach it instead of the vectorizer. >>>>> I am now testing >>>>> >>>>> Index: passes.c >>>>> =================================================================== >>>>> --- passes.c (revision 174391) >>>>> +++ passes.c (working copy) >>>>> @@ -870,6 +870,7 @@ >>>>> NEXT_PASS (pass_split_crit_edges); >>>>> NEXT_PASS (pass_pre); >>>>> NEXT_PASS (pass_sink_code); >>>>> + NEXT_PASS (pass_optimize_widening_mul); >>>>> NEXT_PASS (pass_tree_loop); >>>>> { >>>>> struct opt_pass **p = &pass_tree_loop.pass.sub; >>>>> @@ -934,7 +935,6 @@ >>>>> NEXT_PASS (pass_forwprop); >>>>> NEXT_PASS (pass_phiopt); >>>>> NEXT_PASS (pass_fold_builtins); >>>>> - NEXT_PASS (pass_optimize_widening_mul); >>>>> NEXT_PASS (pass_tail_calls); >>>>> NEXT_PASS (pass_rename_ssa_copies); >>>>> NEXT_PASS (pass_uncprop); >>>>> >>>>> to see how it affects other loop optimizations (vectorizer pattern >>>>> tests obviously fail). >>> >>> Looks like it needs copy_prop and dce as well: >>> >>> Index: passes.c >>> =================================================================== >>> --- passes.c (revision 174391) >>> +++ passes.c (working copy) >>> @@ -870,6 +870,9 @@ >>> NEXT_PASS (pass_split_crit_edges); >>> NEXT_PASS (pass_pre); >>> NEXT_PASS (pass_sink_code); >>> + NEXT_PASS (pass_copy_prop); >>> + NEXT_PASS (pass_dce); >>> + NEXT_PASS (pass_optimize_widening_mul); >>> NEXT_PASS (pass_tree_loop); >>> { >>> struct opt_pass **p = &pass_tree_loop.pass.sub; >>> @@ -934,7 +937,6 @@ >>> NEXT_PASS (pass_forwprop); >>> NEXT_PASS (pass_phiopt); >>> NEXT_PASS (pass_fold_builtins); >>> - NEXT_PASS (pass_optimize_widening_mul); >>> NEXT_PASS (pass_tail_calls); >>> NEXT_PASS (pass_rename_ssa_copies); >>> NEXT_PASS (pass_uncprop); >>> >>> otherwise I get (on x86_64-suse-linux) >>> >>> FAIL: gcc.target/i386/fma4-fma-2.c scan-assembler vfmaddss >>> FAIL: gcc.target/i386/fma4-fma-2.c scan-assembler vfmaddsd >>> FAIL: gcc.target/i386/fma4-fma-2.c scan-assembler vfmsubss >>> FAIL: gcc.target/i386/fma4-fma-2.c scan-assembler vfmsubsd >>> FAIL: gcc.target/i386/fma4-fma-2.c scan-assembler vfnmaddss >>> FAIL: gcc.target/i386/fma4-fma-2.c scan-assembler vfnmaddsd >> >> Hmm. I would have put the pass next to the sincos pass, but yes, >> in principle a copyprop & dce pass after PRE makes sense >> (the loop passes likely don't run because there are no loops in >> those testcases - both copyprop and dce should be scheduled >> more like TODOs, or even automatically by the pass manager >> via PROPs ...). Dead code can indeed confuse those matching >> passes that look for single-use vars. >> >> I'll think about a more elegant solution for this problem. >> >> Would you mind checking if the next-to-sincos position makes >> any difference? > > Before sincos we have > > D.2747_2 = __builtin_powf (a_1(D), 2.0e+0); > D.2746_4 = D.2747_2 + c_3(D); > > which is transformed by sincos to > > powmult.8_7 = a_1(D) * a_1(D); > D.2747_2 = powmult.8_7; > D.2746_4 = D.2747_2 + c_3(D); > > but widening_mul is confused by D.2747_2 = powmult.8_7; and it needs > both copy_prop and dce to remove it: > > powmult.8_7 = a_1(D) * a_1(D); > D.2746_4 = c_3(D) + powmult.8_7; > > So moving widening_mul next to sincos doesn't help. > Maybe gimple_expand_builtin_pow() can be changed to generate the last > version by itself? Yeah, I guess so. I'll have a look. Richard. > Ira > >> >> Thanks, >> Richard. >> >>> Ira >>> >>>> >>>> Thanks. I would hope that we eventually can get rid of the >>>> pattern recognizer ... at least for SSE there is also always >>>> a scalar variant instruction for each vectorized one. >>>> >>>> Richard. >>>> >>> >> >
Index: passes.c =================================================================== --- passes.c (revision 174391) +++ passes.c (working copy) @@ -870,6 +870,9 @@ NEXT_PASS (pass_split_crit_edges); NEXT_PASS (pass_pre); NEXT_PASS (pass_sink_code); + NEXT_PASS (pass_copy_prop); + NEXT_PASS (pass_dce); + NEXT_PASS (pass_optimize_widening_mul); NEXT_PASS (pass_tree_loop); { struct opt_pass **p = &pass_tree_loop.pass.sub; @@ -934,7 +937,6 @@ NEXT_PASS (pass_forwprop); NEXT_PASS (pass_phiopt); NEXT_PASS (pass_fold_builtins); - NEXT_PASS (pass_optimize_widening_mul); NEXT_PASS (pass_tail_calls); NEXT_PASS (pass_rename_ssa_copies); NEXT_PASS (pass_uncprop);