diff mbox

Teach RTL ifcvt to handle multiple simple set instructions

Message ID 1441724029-3124-1-git-send-email-james.greenhalgh@arm.com
State New
Headers show

Commit Message

James Greenhalgh Sept. 8, 2015, 2:53 p.m. UTC
Hi,

RTL "noce" ifcvt will currently give up if the branches it is trying to
make conditional are too complicated. One of the conditions for "too
complicated" is that the branch sets more than one value.

One common idiom that this misses is something like:

  int d = a[i];
  int e = b[i];
  if (d > e)
    std::swap (d, e)
  [...]

Which is currently going to generate something like

  compare (d, e)
  branch.le L1
    tmp = d;
    d = e;
    e = tmp;
  L1:

In the case that this is an unpredictable branch, we can do better
with:

  compare (d, e)
  d1 = if_then_else (le, e, d)
  e1 = if_then_else (le, d, e)
  d = d1
  e = e1

Register allocation will eliminate the two trailing unconditional
assignments, and we get a neater sequence.

This patch introduces this logic to the RTL if convert passes, catching
cases where a basic block does nothing other than multiple SETs. This
helps both with the std::swap idiom above, and with pathological cases
where tree passes create new basic blocks to resolve Phi nodes, which
contain only set instructions and end up unprecdictable.

One big question I have with this patch is how I ought to write a meaningful
cost model I've used. It seems like yet another misuse of RTX costs, and
another bit of stuff for targets to carefully balance. Now, if the
relative cost of branches and conditional move instructions is not
carefully managed, you may enable or disable these optimisations. This is
probably acceptable, but I dislike adding more and more gotcha's to
target costs, as I get bitten by them hard enough as is!

Elsewhere the ifcvt cost usage is pretty lacking - esentially counting
the number of instructions which will be if-converted and comparing that
against the magic number "2". I could follow this lead and just count
the number of moves I would convert, then compare that to the branch cost,
but this feels... wrong. This makes it pretty tough to choose a "good"
number for TARGET_BRANCH_COST. This isn't helped now that higher branch
costs can mean pulling expensive instructions in to the main execution
stream.

I've picked a fairly straightforward cost model for this patch, trying to
compare the cost of each conditional move, as calculated with rtx_costs,
against COSTS_N_INSNS (branch_cost). This essentially kills the
optimisation for any target with conditional-move cost > 1. Personally, I
consider that a pretty horrible bug in this patch - but I couldn't think of
anything better to try.

As you might expect, this triggers all over the place when
TARGET_BRANCH_COST numbers are tuned high. In an AArch64 Spec2006 build,
I saw 3.9% more CSEL operations with this patch and TARGET_BRANCH_COST set
to 4. Performance is also good on AArch64 on a range of microbenchmarks
and larger workloads (after playing with the branch costs). I didn't see
any performance regression on x86_64, as you would expect given that the
cost models preclude x86_64 targets from ever hitting this optimisation.

Bootstrapped and tested on x86_64 and AArch64 with no issues, and
bootstrapped and tested with the cost model turned off, to have some
confidence that we will continue to do the right thing if any targets do
up their branch costs and start using this code.

No testcase provided, as currently I don't know of targets with a high
enough branch cost to actually trigger the optimisation.

OK?

Thanks,
James

---
gcc/

2015-09-07  James Greenhalgh  <james.greenhalgh@arm.com>

	* ifcvt.c (bb_ok_for_noce_convert_multiple_sets): New.
	(noce_convert_multiple_sets): Likewise.
	(noce_process_if_block): Call them.

Comments

Bernd Schmidt Sept. 10, 2015, 6:23 p.m. UTC | #1
On 09/08/2015 04:53 PM, James Greenhalgh wrote:
> One big question I have with this patch is how I ought to write a meaningful
> cost model I've used. It seems like yet another misuse of RTX costs, and
> another bit of stuff for targets to carefully balance. Now, if the
> relative cost of branches and conditional move instructions is not
> carefully managed, you may enable or disable these optimisations. This is
> probably acceptable, but I dislike adding more and more gotcha's to
> target costs, as I get bitten by them hard enough as is!

The code you have seems reasonable, except that for compile time it 
might make sense to not even attempt the optimization if the number of 
sets is too large. I'm not too worried about that, but maybe you could 
bail out early if your cost estimate goes too much above the branch cost.

> +      /* If we were supposed to read from an earlier write in this block,
> +	 we've changed the register allocation.  Rewire the read.  While
> +	 we are looking, also try to catch a swap idiom.  */

So this is one interesting case; do you also have to worry about others 
(such as maybe setting the same register multiple times)?

> +  /* We must have seen some sort of insn to insert, otherwise we were
> +     given an empty BB to convert, and we can't handle that.  */
> +  if (unmodified_insns.is_empty ())
> +    {
> +      end_sequence ();
> +      return FALSE;
> +    }

Looks like some of the error conditions are tested twice across the two 
new functions? I think it would be better to get rid of one copy or turn 
the second one into a gcc_assert.

 > No testcase provided, as currently I don't know of targets with a high
 > enough branch cost to actually trigger the optimisation.

Hmm, so the code would not actually be used right now? In that case I'll 
leave it to others to decide whether we want to apply it. Other than the 
points above it looks OK to me.


Bernd
Jeff Law Sept. 10, 2015, 9:11 p.m. UTC | #2
On 09/10/2015 12:23 PM, Bernd Schmidt wrote:
>
>  > No testcase provided, as currently I don't know of targets with a high
>  > enough branch cost to actually trigger the optimisation.
>
> Hmm, so the code would not actually be used right now? In that case I'll
> leave it to others to decide whether we want to apply it. Other than the
> points above it looks OK to me.
Some targets have -mbranch-cost to allow overriding the default costing. 
  visium has a branch cost of 10!  Several ports have a cost of 6 either 
unconditionally or when the branch is not well predicted.

Presumably James is more interested in the ARM/AArch64 targets ;-)

I think that's probably what James is most interested in getting some 
ideas around -- the cost model.

I think the fundamental problem is BRANCH_COST isn't actually relative 
to anything other than the default value of "1".  It doesn't directly 
correspond to COSTS_N_INSNS or anything else.  So while using 
COSTS_N_INSNS (BRANCH_COST)) would seem to make sense, it actually 
doesn't.  It's not even clear how a value of 10 relates to a value of 1 
other than it's more expensive.

ifcvt (and others) comparing to magic #s is more than a bit lame.  But 
with BRANCH_COST having no meaning relative to anything else I can see 
why Richard did things that way.

In an ideal world we'd find some mapping from BRANCH_COST that relates 
to CONST_N_INSNS.  I suspect a simple mapping doesn't necessarily exist 
and we'll likely regress targets with any simplistic mapping.  But maybe 
now is the time to address that fundamental problem and deal with the 
fallout.

jeff
Kyrylo Tkachov Sept. 11, 2015, 8:49 a.m. UTC | #3
On 10/09/15 22:11, Jeff Law wrote:
> On 09/10/2015 12:23 PM, Bernd Schmidt wrote:
>>   > No testcase provided, as currently I don't know of targets with a high
>>   > enough branch cost to actually trigger the optimisation.
>>
>> Hmm, so the code would not actually be used right now? In that case I'll
>> leave it to others to decide whether we want to apply it. Other than the
>> points above it looks OK to me.
> Some targets have -mbranch-cost to allow overriding the default costing.
>    visium has a branch cost of 10!  Several ports have a cost of 6 either
> unconditionally or when the branch is not well predicted.
>
> Presumably James is more interested in the ARM/AArch64 targets ;-)
>
> I think that's probably what James is most interested in getting some
> ideas around -- the cost model.
>
> I think the fundamental problem is BRANCH_COST isn't actually relative
> to anything other than the default value of "1".  It doesn't directly
> correspond to COSTS_N_INSNS or anything else.  So while using
> COSTS_N_INSNS (BRANCH_COST)) would seem to make sense, it actually
> doesn't.  It's not even clear how a value of 10 relates to a value of 1
> other than it's more expensive.
>
> ifcvt (and others) comparing to magic #s is more than a bit lame.  But
> with BRANCH_COST having no meaning relative to anything else I can see
> why Richard did things that way.

Out of interest, what was the intended original meaning
of branch costs if it was not to be relative to instructions?

Thanks,
Kyrill

> In an ideal world we'd find some mapping from BRANCH_COST that relates
> to CONST_N_INSNS.  I suspect a simple mapping doesn't necessarily exist
> and we'll likely regress targets with any simplistic mapping.  But maybe
> now is the time to address that fundamental problem and deal with the
> fallout.
>
> jeff
>
>
Bernd Schmidt Sept. 11, 2015, 8:53 a.m. UTC | #4
On 09/10/2015 11:11 PM, Jeff Law wrote:
> I think that's probably what James is most interested in getting some
> ideas around -- the cost model.
>
> I think the fundamental problem is BRANCH_COST isn't actually relative
> to anything other than the default value of "1".  It doesn't directly
> correspond to COSTS_N_INSNS or anything else.  So while using
> COSTS_N_INSNS (BRANCH_COST)) would seem to make sense, it actually
> doesn't.  It's not even clear how a value of 10 relates to a value of 1
> other than it's more expensive.
>
> ifcvt (and others) comparing to magic #s is more than a bit lame.  But
> with BRANCH_COST having no meaning relative to anything else I can see
> why Richard did things that way.
>
> In an ideal world we'd find some mapping from BRANCH_COST that relates
> to CONST_N_INSNS.  I suspect a simple mapping doesn't necessarily exist
> and we'll likely regress targets with any simplistic mapping.  But maybe
> now is the time to address that fundamental problem and deal with the
> fallout.

I think the right approach if we want to fix this is a new 
branch_cost_ninsns target hook (maybe with arguments taken_percentage, 
predictability), and gradually move everything to use that instead of 
BRANCH_COST.


Bernd
Ramana Radhakrishnan Sept. 11, 2015, 9:04 a.m. UTC | #5
On Fri, Sep 11, 2015 at 10:53:13AM +0200, Bernd Schmidt wrote:
> On 09/10/2015 11:11 PM, Jeff Law wrote:
> >I think that's probably what James is most interested in getting some
> >ideas around -- the cost model.
> >
> >I think the fundamental problem is BRANCH_COST isn't actually relative
> >to anything other than the default value of "1".  It doesn't directly
> >correspond to COSTS_N_INSNS or anything else.  So while using
> >COSTS_N_INSNS (BRANCH_COST)) would seem to make sense, it actually
> >doesn't.  It's not even clear how a value of 10 relates to a value of 1
> >other than it's more expensive.
> >
> >ifcvt (and others) comparing to magic #s is more than a bit lame.  But
> >with BRANCH_COST having no meaning relative to anything else I can see
> >why Richard did things that way.
> >
> >In an ideal world we'd find some mapping from BRANCH_COST that relates
> >to CONST_N_INSNS.  I suspect a simple mapping doesn't necessarily exist
> >and we'll likely regress targets with any simplistic mapping.  But maybe
> >now is the time to address that fundamental problem and deal with the
> >fallout.
> 
> I think the right approach if we want to fix this is a new
> branch_cost_ninsns target hook (maybe with arguments
> taken_percentage, predictability), and gradually move everything to
> use that instead of BRANCH_COST.

Perhaps providing backends with the entire if-then-else block along
with the above mentioned information being if converted may be another
approach, it allows the backends to analyse what cases are good to
if-convert as per the ISA or micro-architecture and what aren't.

regards
Ramana
James Greenhalgh Sept. 11, 2015, 9:56 a.m. UTC | #6
On Fri, Sep 11, 2015 at 10:04:12AM +0100, Ramana Radhakrishnan wrote:
> On Fri, Sep 11, 2015 at 10:53:13AM +0200, Bernd Schmidt wrote:
> > On 09/10/2015 11:11 PM, Jeff Law wrote:
> > >I think that's probably what James is most interested in getting some
> > >ideas around -- the cost model.
> > >
> > >I think the fundamental problem is BRANCH_COST isn't actually relative
> > >to anything other than the default value of "1".  It doesn't directly
> > >correspond to COSTS_N_INSNS or anything else.  So while using
> > >COSTS_N_INSNS (BRANCH_COST)) would seem to make sense, it actually
> > >doesn't.  It's not even clear how a value of 10 relates to a value of 1
> > >other than it's more expensive.
> > >
> > >ifcvt (and others) comparing to magic #s is more than a bit lame.  But
> > >with BRANCH_COST having no meaning relative to anything else I can see
> > >why Richard did things that way.
> > >
> > >In an ideal world we'd find some mapping from BRANCH_COST that relates
> > >to CONST_N_INSNS.  I suspect a simple mapping doesn't necessarily exist
> > >and we'll likely regress targets with any simplistic mapping.  But maybe
> > >now is the time to address that fundamental problem and deal with the
> > >fallout.
> > 
> > I think the right approach if we want to fix this is a new
> > branch_cost_ninsns target hook (maybe with arguments
> > taken_percentage, predictability), and gradually move everything to
> > use that instead of BRANCH_COST.
> 
> Perhaps providing backends with the entire if-then-else block along
> with the above mentioned information being if converted may be another
> approach, it allows the backends to analyse what cases are good to
> if-convert as per the ISA or micro-architecture and what aren't.

I'm not sure how much of this is likely to be target-dependent and how
much can just be abstracted to common ifcvt code resuing rtx_costs.

I've been sketching out a rough idea of a more applicable cost model for
RTL ifcvt, taking in to consideration what David mentioned regarding the
talks at cauldron. The question we want to ask is:

Which is preferable between:

  Before:
   (COSTS_N_INSNS cost of the compare+branch insns at the tail of the if BB.
     ??? (possibly) some factor related to BRANCH_COST)
   + weighted cost of then BB.
   + (if needed) weighted cost of else BB.

  After:
   seq_cost the candidate new sequence.
  
The weighted cost of the two BBs should mix in some idea as to the relative
probability that we execute them.

The tough part is figuring out how to (reasonably) factor in branch cost.
The reason that is tough is that BRANCH_COST is used inconsistently. Normally
it is not measured relative to anything, but is compared against magic numbers
for optimizations (each of which are really their own question to be posed
as above).

I don't have a good answer to that, nor a good answer as to what BRANCH_COST
should represent in future. The use in the compiler is sort-of consistent
with a measurement against instruction counts (i.e. a branch cost of 3 means
a branch is equivalent to 3 cheap instructions), but is sometimes just used
as a measure of expensive (a branch cost of >= 2 means that abs should be
expanded using a sequence of bit operations).

I'll look in to how the code in ifcvt starts to look with a modified cost
model and get back to you...

James
Jeff Law Sept. 11, 2015, 9:47 p.m. UTC | #7
On 09/11/2015 02:49 AM, Kyrill Tkachov wrote:
>
> On 10/09/15 22:11, Jeff Law wrote:
>> On 09/10/2015 12:23 PM, Bernd Schmidt wrote:
>>>   > No testcase provided, as currently I don't know of targets with a
>>> high
>>>   > enough branch cost to actually trigger the optimisation.
>>>
>>> Hmm, so the code would not actually be used right now? In that case I'll
>>> leave it to others to decide whether we want to apply it. Other than the
>>> points above it looks OK to me.
>> Some targets have -mbranch-cost to allow overriding the default costing.
>>    visium has a branch cost of 10!  Several ports have a cost of 6 either
>> unconditionally or when the branch is not well predicted.
>>
>> Presumably James is more interested in the ARM/AArch64 targets ;-)
>>
>> I think that's probably what James is most interested in getting some
>> ideas around -- the cost model.
>>
>> I think the fundamental problem is BRANCH_COST isn't actually relative
>> to anything other than the default value of "1".  It doesn't directly
>> correspond to COSTS_N_INSNS or anything else.  So while using
>> COSTS_N_INSNS (BRANCH_COST)) would seem to make sense, it actually
>> doesn't.  It's not even clear how a value of 10 relates to a value of 1
>> other than it's more expensive.
>>
>> ifcvt (and others) comparing to magic #s is more than a bit lame.  But
>> with BRANCH_COST having no meaning relative to anything else I can see
>> why Richard did things that way.
>
> Out of interest, what was the intended original meaning
> of branch costs if it was not to be relative to instructions?
I don't think it ever had one.  It's self-relative.  A cost of 2 is 
greater than a cost of 1.  No more, no less IIRC.   Lame?  Yes. 
Short-sighted?  Yes.  Should we try to fix it.  Yes.

If you look at how BRANCH_COST actually gets used, AFAIK it's tested 
only against "magic constants", which are themselves lame, short-sighted 
and need to be fixed.

jeff
Eric Botcazou Sept. 12, 2015, 1:53 p.m. UTC | #8
> Some targets have -mbranch-cost to allow overriding the default costing.
>   visium has a branch cost of 10!

Yeah, the GR5 variant is pipelined but has no branch prediction; moreover 
there is an additional adverse effect coming for the instructions bus...

>   Several ports have a cost of 6 either unconditionally or when the branch
>   is not well predicted.

9 for UltraSPARC3, although this should probably be lowered if predictable_p.
James Greenhalgh Sept. 25, 2015, 3:04 p.m. UTC | #9
Hi,

In relation to the patch I put up for review a few weeks ago to teach
RTL if-convert to handle multiple sets in a basic block [1], I was
asking about a sensible cost model to use. There was some consensus at
Cauldron that what should be done in this situation is to introduce a
target hook that delegates answering the question to the target.

This patch series introduces that new target hook to provide cost
decisions for the RTL ifcvt pass.

The idea is to give the target full visibility of the proposed
transformation, and allow it to respond as to whether if-conversion in that
way is profitable.

In order to preserve current behaviour across targets, we will need the
default implementation to keep to the strategy of simply comparing branch
cost against a magic number. Patch 1/3 performs this refactoring, which is
a bit hairy in some corner cases.

Patch 2/3 is a simple code move, pulling the definition of the if_info
structure used by RTL if-convert in to ifcvt.h where it can be included
by targets.

Patch 3/3 then introduces the new target hook, with the same default
behaviour as was previously in noce_is_profitable_p.

The series has been bootstrapped on ARM, AArch64 and x86_64 targets, and
I've verified with Spec2000 and Spec2006 runs that there are no code
generation differences for any of these three targets after the patch.

I also gave ultrasparc3 a quick go, from what I could see, I changed the
register allocation for the floating-point condition code registers.
Presumably this is a side effect of first constructing RTXen that I then
discard. I didn't see anything which looked like more frequent reloads or
substantial code generation changes, though I'm not familiar with the
intricacies of the Sparc condition registers :).

I've included a patch 4/3, to give an example of what a target might want
to do with this hook. It needs work for tuning and deciding how the function
should actually behave, but works if it is thought of as more of a
strawman/prototype than a patch submission.

Are parts 1, 2 and 3 OK?

Thanks,
James

[1]: https://gcc.gnu.org/ml/gcc-patches/2015-09/msg00781.html

---
[Patch ifcvt 1/3] Factor out cost calculations from noce cases

2015-09-26  James Greenhalgh  <james.greenhalgh@arm.com>

	* ifcvt.c (noce_if_info): Add a magic_number field :-(.
	(noce_is_profitable_p): New.
	(noce_try_store_flag_constants): Move cost calculation
	to after sequence generation, factor it out to noce_is_profitable_p.
	(noce_try_addcc): Likewise.
	(noce_try_store_flag_mask): Likewise.
	(noce_try_cmove): Likewise.
	(noce_try_cmove_arith): Likewise.
	(noce_try_sign_mask): Add comment regarding cost calculations.

[Patch ifcvt 2/3] Move noce_if_info in to ifcvt.h

2015-09-26  James Greenhalgh  <james.greenhalgh@arm.com>

	* ifcvt.c (noce_if_info): Move to...
	* ifcvt.h (noce_if_info): ...Here.

[Patch ifcvt 3/3] Create a new target hook for deciding profitability
    of noce if-conversion

2015-09-26  James Greenhalgh  <james.greenhalgh@arm.com>

	* target.def (costs): New hook vector.
	(ifcvt_noce_profitable_p): New hook.
	* doc/tm.texi.in: Document it.
	* doc/tm.texi: Regenerate.
	* targhooks.h (default_ifcvt_noce_profitable_p): New.
	* targhooks.c (default_ifcvt_noce_profitable_p): New.
	* ifcvt.c (noce_profitable_p): Use new target hook.

[Patch Prototype AArch64 ifcvt 4/3] Wire up the new if-convert costs
    hook for AArch64

2015-09-26  James Greenhalgh  <james.greenhalgh@arm.com>

	* config/aarch64/aarch64.c
	(aarch64_additional_branch_cost_for_probability): New.
	(aarch64_ifcvt_noce_profitable_p): Likewise.
	(TARGET_COSTS_IFCVT_NOCE_PROFITABLE_P): Likewise.
Richard Biener Sept. 29, 2015, 10:16 a.m. UTC | #10
On Fri, Sep 25, 2015 at 5:04 PM, James Greenhalgh
<james.greenhalgh@arm.com> wrote:
> Hi,
>
> In relation to the patch I put up for review a few weeks ago to teach
> RTL if-convert to handle multiple sets in a basic block [1], I was
> asking about a sensible cost model to use. There was some consensus at
> Cauldron that what should be done in this situation is to introduce a
> target hook that delegates answering the question to the target.

Err - the consensus was to _not_ add gazillion of special target hooks
but instead enhance what we have with rtx_cost so that passes can
rely on comparing before and after costs of a sequence of insns.

Richard.

> This patch series introduces that new target hook to provide cost
> decisions for the RTL ifcvt pass.
>
> The idea is to give the target full visibility of the proposed
> transformation, and allow it to respond as to whether if-conversion in that
> way is profitable.
>
> In order to preserve current behaviour across targets, we will need the
> default implementation to keep to the strategy of simply comparing branch
> cost against a magic number. Patch 1/3 performs this refactoring, which is
> a bit hairy in some corner cases.
>
> Patch 2/3 is a simple code move, pulling the definition of the if_info
> structure used by RTL if-convert in to ifcvt.h where it can be included
> by targets.
>
> Patch 3/3 then introduces the new target hook, with the same default
> behaviour as was previously in noce_is_profitable_p.
>
> The series has been bootstrapped on ARM, AArch64 and x86_64 targets, and
> I've verified with Spec2000 and Spec2006 runs that there are no code
> generation differences for any of these three targets after the patch.
>
> I also gave ultrasparc3 a quick go, from what I could see, I changed the
> register allocation for the floating-point condition code registers.
> Presumably this is a side effect of first constructing RTXen that I then
> discard. I didn't see anything which looked like more frequent reloads or
> substantial code generation changes, though I'm not familiar with the
> intricacies of the Sparc condition registers :).
>
> I've included a patch 4/3, to give an example of what a target might want
> to do with this hook. It needs work for tuning and deciding how the function
> should actually behave, but works if it is thought of as more of a
> strawman/prototype than a patch submission.
>
> Are parts 1, 2 and 3 OK?
>
> Thanks,
> James
>
> [1]: https://gcc.gnu.org/ml/gcc-patches/2015-09/msg00781.html
>
> ---
> [Patch ifcvt 1/3] Factor out cost calculations from noce cases
>
> 2015-09-26  James Greenhalgh  <james.greenhalgh@arm.com>
>
>         * ifcvt.c (noce_if_info): Add a magic_number field :-(.
>         (noce_is_profitable_p): New.
>         (noce_try_store_flag_constants): Move cost calculation
>         to after sequence generation, factor it out to noce_is_profitable_p.
>         (noce_try_addcc): Likewise.
>         (noce_try_store_flag_mask): Likewise.
>         (noce_try_cmove): Likewise.
>         (noce_try_cmove_arith): Likewise.
>         (noce_try_sign_mask): Add comment regarding cost calculations.
>
> [Patch ifcvt 2/3] Move noce_if_info in to ifcvt.h
>
> 2015-09-26  James Greenhalgh  <james.greenhalgh@arm.com>
>
>         * ifcvt.c (noce_if_info): Move to...
>         * ifcvt.h (noce_if_info): ...Here.
>
> [Patch ifcvt 3/3] Create a new target hook for deciding profitability
>     of noce if-conversion
>
> 2015-09-26  James Greenhalgh  <james.greenhalgh@arm.com>
>
>         * target.def (costs): New hook vector.
>         (ifcvt_noce_profitable_p): New hook.
>         * doc/tm.texi.in: Document it.
>         * doc/tm.texi: Regenerate.
>         * targhooks.h (default_ifcvt_noce_profitable_p): New.
>         * targhooks.c (default_ifcvt_noce_profitable_p): New.
>         * ifcvt.c (noce_profitable_p): Use new target hook.
>
> [Patch Prototype AArch64 ifcvt 4/3] Wire up the new if-convert costs
>     hook for AArch64
>
> 2015-09-26  James Greenhalgh  <james.greenhalgh@arm.com>
>
>         * config/aarch64/aarch64.c
>         (aarch64_additional_branch_cost_for_probability): New.
>         (aarch64_ifcvt_noce_profitable_p): Likewise.
>         (TARGET_COSTS_IFCVT_NOCE_PROFITABLE_P): Likewise.
James Greenhalgh Sept. 29, 2015, 2:31 p.m. UTC | #11
On Tue, Sep 29, 2015 at 11:16:37AM +0100, Richard Biener wrote:
> On Fri, Sep 25, 2015 at 5:04 PM, James Greenhalgh
> <james.greenhalgh@arm.com> wrote:
> > Hi,
> >
> > In relation to the patch I put up for review a few weeks ago to teach
> > RTL if-convert to handle multiple sets in a basic block [1], I was
> > asking about a sensible cost model to use. There was some consensus at
> > Cauldron that what should be done in this situation is to introduce a
> > target hook that delegates answering the question to the target.
> 
> Err - the consensus was to _not_ add gazillion of special target hooks
> but instead enhance what we have with rtx_cost so that passes can
> rely on comparing before and after costs of a sequence of insns.

Ah, I was not able to attend Cauldron this year, so I was trying to pick out
"consensus" from the video. Rewatching it now, I see a better phrase would
be "suggestion with some support".

Watching the video a second time, it seems your proposal is that we improve
the RTX costs infrastructure to handle sequences of Gimple/RTX. That would
get us some way to making a smart decision in if-convert, but I'm not
convinced it allows us to answer the question we are interested in.

We have the rtx for before and after, and we can generate costs for these
sequences. This allows us to calculate some weighted cost of the
instructions based on the calculated probabilities that each block is
executed. However, we are missing information on how expensive the branch
is, and we have no way to get that through an RTX-costs infrastructure.

We could add a hook to give a cost in COSTS_N_INSNS units to a branch based
on its predictability. This is difficult as COSTS_N_INSNS units can differ
depending on whether you are talking about floating-point or integer code.
By this I mean, the compiler considers a SET which costs more than
COSTS_N_INSNS (1) to be "expensive". Consequently, some targets set the cost
of both an integer SET and a floating-point SET to both be COSTS_N_INSNS (1).
In reality, these instructions may have different latency performance
characteristics. What real world quantity are we trying to invoke when we
say a branch costs the same as 3 SET instructions of any type? It certainly
isn't mispredict penalty (likely measured in cycles, not relative to the cost
of a SET instruction, which may well be completely free on modern x86
processors), nor is it the cost of executing the branch instruction which
is often constant to resolve regardless of predicted/mispredicted status.

On the other side of the equation, we want a cost for the converted
sequence. We can build a cost of the generated rtl sequence, but for
targets like AArch64 this is going to be wildly off. AArch64 will expand
(a > b) ? x : y; as a set to the CC register, followed by a conditional
move based on the CC register. Consequently, where we have multiple sets
back to back we end up with:

  set CC (a > b)
  set x1 (CC ? x : y)
  set CC (a > b)
  set x2 (CC ? x : z)
  set CC (a > b)
  set x3 (CC ? x : k)

Which we know will be simplified later to:

  set CC (a > b)
  set x1 (CC ? x : y)
  set x2 (CC ? x : z)
  set x3 (CC ? x : k)

I imagine other targets have something similar in their expansion of
mov<mode>cc (though I haven't looked).

Our comparison for if-conversion then must be:

  weighted_old_cost = (taken_probability * (then_bb_cost)
			- (1 - taken_probability) * (else_bb_cost));
  branch_cost = branch_cost_in_insns (taken_probability)
  weighted_new_cost = redundancy_factor (new_sequence) * seq_cost (new_sequence)

  profitable = weighted_new_cost <= weighted_old_cost + branch_cost

And we must define:

  branch_cost_in_insns (taken_probability)
  redundancy_factor (new_sequence)

At that point, I feel you are better giving the entire sequence to the
target and asking it to implement whatever logic is needed to return a
profitable/unprofitable analysis of the transformation.

The "redundancy_factor" in particular is pretty tough to define in a way
which makes sense outside of if_convert, without adding some pretty
detailed analysis to decide what might or might not be eliminated by
later passes. The alternative is to weight the other side of the equation
by tuning the cost of branch_cost_in_insns high. This only serves to increase
the disconnect between a real-world cost and a number to tweak to game
code generation.

If you have a different way of phrasing the if-conversion question that
avoids the two very specific hooks, I'd be happy to try taking the patches
in that direction. I don't see a way to implement this as just queries to
a costing function which does not need substantial target and pass
dependent tweaking to make behave correctly.

Thanks,
James

> > This patch series introduces that new target hook to provide cost
> > decisions for the RTL ifcvt pass.
> >
> > The idea is to give the target full visibility of the proposed
> > transformation, and allow it to respond as to whether if-conversion in that
> > way is profitable.
> >
> > In order to preserve current behaviour across targets, we will need the
> > default implementation to keep to the strategy of simply comparing branch
> > cost against a magic number. Patch 1/3 performs this refactoring, which is
> > a bit hairy in some corner cases.
> >
> > Patch 2/3 is a simple code move, pulling the definition of the if_info
> > structure used by RTL if-convert in to ifcvt.h where it can be included
> > by targets.
> >
> > Patch 3/3 then introduces the new target hook, with the same default
> > behaviour as was previously in noce_is_profitable_p.
> >
> > The series has been bootstrapped on ARM, AArch64 and x86_64 targets, and
> > I've verified with Spec2000 and Spec2006 runs that there are no code
> > generation differences for any of these three targets after the patch.
> >
> > I also gave ultrasparc3 a quick go, from what I could see, I changed the
> > register allocation for the floating-point condition code registers.
> > Presumably this is a side effect of first constructing RTXen that I then
> > discard. I didn't see anything which looked like more frequent reloads or
> > substantial code generation changes, though I'm not familiar with the
> > intricacies of the Sparc condition registers :).
> >
> > I've included a patch 4/3, to give an example of what a target might want
> > to do with this hook. It needs work for tuning and deciding how the function
> > should actually behave, but works if it is thought of as more of a
> > strawman/prototype than a patch submission.
> >
> > Are parts 1, 2 and 3 OK?
> >
> > Thanks,
> > James
> >
> > [1]: https://gcc.gnu.org/ml/gcc-patches/2015-09/msg00781.html
> >
> > ---
> > [Patch ifcvt 1/3] Factor out cost calculations from noce cases
> >
> > 2015-09-26  James Greenhalgh  <james.greenhalgh@arm.com>
> >
> >         * ifcvt.c (noce_if_info): Add a magic_number field :-(.
> >         (noce_is_profitable_p): New.
> >         (noce_try_store_flag_constants): Move cost calculation
> >         to after sequence generation, factor it out to noce_is_profitable_p.
> >         (noce_try_addcc): Likewise.
> >         (noce_try_store_flag_mask): Likewise.
> >         (noce_try_cmove): Likewise.
> >         (noce_try_cmove_arith): Likewise.
> >         (noce_try_sign_mask): Add comment regarding cost calculations.
> >
> > [Patch ifcvt 2/3] Move noce_if_info in to ifcvt.h
> >
> > 2015-09-26  James Greenhalgh  <james.greenhalgh@arm.com>
> >
> >         * ifcvt.c (noce_if_info): Move to...
> >         * ifcvt.h (noce_if_info): ...Here.
> >
> > [Patch ifcvt 3/3] Create a new target hook for deciding profitability
> >     of noce if-conversion
> >
> > 2015-09-26  James Greenhalgh  <james.greenhalgh@arm.com>
> >
> >         * target.def (costs): New hook vector.
> >         (ifcvt_noce_profitable_p): New hook.
> >         * doc/tm.texi.in: Document it.
> >         * doc/tm.texi: Regenerate.
> >         * targhooks.h (default_ifcvt_noce_profitable_p): New.
> >         * targhooks.c (default_ifcvt_noce_profitable_p): New.
> >         * ifcvt.c (noce_profitable_p): Use new target hook.
> >
> > [Patch Prototype AArch64 ifcvt 4/3] Wire up the new if-convert costs
> >     hook for AArch64
> >
> > 2015-09-26  James Greenhalgh  <james.greenhalgh@arm.com>
> >
> >         * config/aarch64/aarch64.c
> >         (aarch64_additional_branch_cost_for_probability): New.
> >         (aarch64_ifcvt_noce_profitable_p): Likewise.
> >         (TARGET_COSTS_IFCVT_NOCE_PROFITABLE_P): Likewise.
>
Mike Stump Sept. 29, 2015, 7:23 p.m. UTC | #12
On Sep 29, 2015, at 7:31 AM, James Greenhalgh <james.greenhalgh@arm.com> wrote:
> On Tue, Sep 29, 2015 at 11:16:37AM +0100, Richard Biener wrote:
>> On Fri, Sep 25, 2015 at 5:04 PM, James Greenhalgh
>> <james.greenhalgh@arm.com> wrote:
>>> 
>>> In relation to the patch I put up for review a few weeks ago to teach
>>> RTL if-convert to handle multiple sets in a basic block [1], I was
>>> asking about a sensible cost model to use. There was some consensus at
>>> Cauldron that what should be done in this situation is to introduce a
>>> target hook that delegates answering the question to the target.
>> 
>> Err - the consensus was to _not_ add gazillion of special target hooks
>> but instead enhance what we have with rtx_cost so that passes can
>> rely on comparing before and after costs of a sequence of insns.
> 
> Ah, I was not able to attend Cauldron this year, so I was trying to pick out
> "consensus" from the video. Rewatching it now, I see a better phrase would
> be "suggestion with some support”.

I’m not a big fan of rtx_cost.  To me it feels more like a crude, sledge hammer.  Now, that is the gcc way, we have a ton of these things, but would be nice to refine the tools so that the big escape hatch isn’t used as often and we have more finer grained ways of doing things.  rtx_cost should be what a code-generator generates with most new ports when they use the nice api to do a port.  The old sledge hammer wielding ports may well always define rtx_cost themselves, but, we should shoot for something better.

As a concrete example, I now have a code-generator for enum reg_class, N_REG_CLASSES, REG_CLASS_NAMES, REG_CLASS_CONTENTS, REGISTER_NAMES, FIXED_REGISTERS, CALL_USED_REGISTERS, ADDITIONAL_REGISTER_NAMES, REG_ALLOC_ORDER and more (some binutils code-gen to do with registers), and oh my, it is so much nicer to user than the original api.  If you only ever have to write once these things, fine, but, if you develop and prototype CPUs, the existing interface is, well, less than ideal.  I can do things like:

gccrclass
  rc_gprs = “GENERAL”;

r gpr[] = { rc_gprs, Fixed, Used,
                "$zero", "$sp", "$fp", "$lr" };
r gpr_sav[] = { Notfixed, Notused, alias ("$save_first"),
                "$sav1",   "$sav2",   "$sav3",   "$sav4”,

and get all the other goop I need for free.  I’d encourage people to find a way to do up an rtx_cost generator.  If you're a port maintainer, and want to redo your port to use a nicer api to do the registers, let me know.  I’d love to see progress made to rid gcc of the old crappy apis.
Richard Biener Sept. 30, 2015, 7:51 a.m. UTC | #13
On Tue, Sep 29, 2015 at 9:23 PM, Mike Stump <mikestump@comcast.net> wrote:
> On Sep 29, 2015, at 7:31 AM, James Greenhalgh <james.greenhalgh@arm.com> wrote:
>> On Tue, Sep 29, 2015 at 11:16:37AM +0100, Richard Biener wrote:
>>> On Fri, Sep 25, 2015 at 5:04 PM, James Greenhalgh
>>> <james.greenhalgh@arm.com> wrote:
>>>>
>>>> In relation to the patch I put up for review a few weeks ago to teach
>>>> RTL if-convert to handle multiple sets in a basic block [1], I was
>>>> asking about a sensible cost model to use. There was some consensus at
>>>> Cauldron that what should be done in this situation is to introduce a
>>>> target hook that delegates answering the question to the target.
>>>
>>> Err - the consensus was to _not_ add gazillion of special target hooks
>>> but instead enhance what we have with rtx_cost so that passes can
>>> rely on comparing before and after costs of a sequence of insns.
>>
>> Ah, I was not able to attend Cauldron this year, so I was trying to pick out
>> "consensus" from the video. Rewatching it now, I see a better phrase would
>> be "suggestion with some support”.
>
> I’m not a big fan of rtx_cost.  To me it feels more like a crude, sledge hammer.  Now, that is the gcc way, we have a ton of these things, but would be nice to refine the tools so that the big escape hatch isn’t used as often and we have more finer grained ways of doing things.  rtx_cost should be what a code-generator generates with most new ports when they use the nice api to do a port.  The old sledge hammer wielding ports may well always define rtx_cost themselves, but, we should shoot for something better.
>
> As a concrete example, I now have a code-generator for enum reg_class, N_REG_CLASSES, REG_CLASS_NAMES, REG_CLASS_CONTENTS, REGISTER_NAMES, FIXED_REGISTERS, CALL_USED_REGISTERS, ADDITIONAL_REGISTER_NAMES, REG_ALLOC_ORDER and more (some binutils code-gen to do with registers), and oh my, it is so much nicer to user than the original api.  If you only ever have to write once these things, fine, but, if you develop and prototype CPUs, the existing interface is, well, less than ideal.  I can do things like:
>
> gccrclass
>   rc_gprs = “GENERAL”;
>
> r gpr[] = { rc_gprs, Fixed, Used,
>                 "$zero", "$sp", "$fp", "$lr" };
> r gpr_sav[] = { Notfixed, Notused, alias ("$save_first"),
>                 "$sav1",   "$sav2",   "$sav3",   "$sav4”,
>
> and get all the other goop I need for free.  I’d encourage people to find a way to do up an rtx_cost generator.  If you're a port maintainer, and want to redo your port to use a nicer api to do the registers, let me know.  I’d love to see progress made to rid gcc of the old crappy apis.

I agree that rtx_cost isn't the nicest thing either.  But adding hooks
like my_transform_proftable_p (X *) with 'X' being pass private
data structure isn't very great design either.  In fact it's worse IMHO.

Richard.
Richard Biener Sept. 30, 2015, 8:04 a.m. UTC | #14
On Tue, Sep 29, 2015 at 4:31 PM, James Greenhalgh
<james.greenhalgh@arm.com> wrote:
> On Tue, Sep 29, 2015 at 11:16:37AM +0100, Richard Biener wrote:
>> On Fri, Sep 25, 2015 at 5:04 PM, James Greenhalgh
>> <james.greenhalgh@arm.com> wrote:
>> > Hi,
>> >
>> > In relation to the patch I put up for review a few weeks ago to teach
>> > RTL if-convert to handle multiple sets in a basic block [1], I was
>> > asking about a sensible cost model to use. There was some consensus at
>> > Cauldron that what should be done in this situation is to introduce a
>> > target hook that delegates answering the question to the target.
>>
>> Err - the consensus was to _not_ add gazillion of special target hooks
>> but instead enhance what we have with rtx_cost so that passes can
>> rely on comparing before and after costs of a sequence of insns.
>
> Ah, I was not able to attend Cauldron this year, so I was trying to pick out
> "consensus" from the video. Rewatching it now, I see a better phrase would
> be "suggestion with some support".
>
> Watching the video a second time, it seems your proposal is that we improve
> the RTX costs infrastructure to handle sequences of Gimple/RTX. That would
> get us some way to making a smart decision in if-convert, but I'm not
> convinced it allows us to answer the question we are interested in.
>
> We have the rtx for before and after, and we can generate costs for these
> sequences. This allows us to calculate some weighted cost of the
> instructions based on the calculated probabilities that each block is
> executed. However, we are missing information on how expensive the branch
> is, and we have no way to get that through an RTX-costs infrastructure.

Yeah, though during the meeting at the Cauldron I was asking on whether we
maybe want a replacement_cost hook that can assume the to-be-replaced
sequence is in the IL thus the hook can inspect insn_bb and thus get at
branch costs ...

Surely the proposed rtx_cost infrastructure enhancements will not
cover all cases
so the thing I wanted to throw in was that there was _not_ consensus that cost
should be computed by pass specific target hooks that allow the target
to inspect
pass private data.  Because that's a maintainance nightmare if you change a pass
and have to second guess 50 targets cost hook implementations.

> We could add a hook to give a cost in COSTS_N_INSNS units to a branch based
> on its predictability. This is difficult as COSTS_N_INSNS units can differ
> depending on whether you are talking about floating-point or integer code.

Yes, which is why I suggested a replacement cost ...  Of course the question is
what you feed that hook as in principle it would be nice to avoid building the
replacement RTXen until we know doing that will be necessary (the replacement
is profitable).

Maybe as soon as CFG changes are involved we need to think about adding
a BB frequency to the hook though factoring in that can be done in the passes.
What is the interesting part for the target is probably the cost of
the branch itself
as that can vary depending on predictability (which is usually very
hard to assess
at compile-time anyway).  So what about a branch_cost hook that takes
taken/not-taken probabilities as argument?

Note that the agreement during the discussion was that all costs need to be
comparable.

> By this I mean, the compiler considers a SET which costs more than
> COSTS_N_INSNS (1) to be "expensive". Consequently, some targets set the cost
> of both an integer SET and a floating-point SET to both be COSTS_N_INSNS (1).
> In reality, these instructions may have different latency performance
> characteristics. What real world quantity are we trying to invoke when we
> say a branch costs the same as 3 SET instructions of any type? It certainly
> isn't mispredict penalty (likely measured in cycles, not relative to the cost
> of a SET instruction, which may well be completely free on modern x86
> processors), nor is it the cost of executing the branch instruction which
> is often constant to resolve regardless of predicted/mispredicted status.
>
> On the other side of the equation, we want a cost for the converted
> sequence. We can build a cost of the generated rtl sequence, but for
> targets like AArch64 this is going to be wildly off. AArch64 will expand
> (a > b) ? x : y; as a set to the CC register, followed by a conditional
> move based on the CC register. Consequently, where we have multiple sets
> back to back we end up with:
>
>   set CC (a > b)
>   set x1 (CC ? x : y)
>   set CC (a > b)
>   set x2 (CC ? x : z)
>   set CC (a > b)
>   set x3 (CC ? x : k)
>
> Which we know will be simplified later to:
>
>   set CC (a > b)
>   set x1 (CC ? x : y)
>   set x2 (CC ? x : z)
>   set x3 (CC ? x : k)
>
> I imagine other targets have something similar in their expansion of
> mov<mode>cc (though I haven't looked).
>
> Our comparison for if-conversion then must be:
>
>   weighted_old_cost = (taken_probability * (then_bb_cost)
>                         - (1 - taken_probability) * (else_bb_cost));
>   branch_cost = branch_cost_in_insns (taken_probability)
>   weighted_new_cost = redundancy_factor (new_sequence) * seq_cost (new_sequence)
>
>   profitable = weighted_new_cost <= weighted_old_cost + branch_cost
>
> And we must define:
>
>   branch_cost_in_insns (taken_probability)
>   redundancy_factor (new_sequence)
>
> At that point, I feel you are better giving the entire sequence to the
> target and asking it to implement whatever logic is needed to return a
> profitable/unprofitable analysis of the transformation.

Sure, that was what was suggested at the Cauldron - rtx_cost on individual
insns (rtx cost doesn't even work on that level usually!) isn't coarse enough.
We're C++ now so we'd pass the hook an iterator which it can use to
iterate over the insns it should compute the cost of.

> The "redundancy_factor" in particular is pretty tough to define in a way
> which makes sense outside of if_convert, without adding some pretty
> detailed analysis to decide what might or might not be eliminated by
> later passes. The alternative is to weight the other side of the equation
> by tuning the cost of branch_cost_in_insns high. This only serves to increase
> the disconnect between a real-world cost and a number to tweak to game
> code generation.
>
> If you have a different way of phrasing the if-conversion question that
> avoids the two very specific hooks, I'd be happy to try taking the patches
> in that direction. I don't see a way to implement this as just queries to
> a costing function which does not need substantial target and pass
> dependent tweaking to make behave correctly.

From the patch I can't even see what the question is ;)  I only see
"is the transform profitable".

I saw from your hook implementation that it's basically a set of magic
numbers.  So I suggested to make those numbers target configurable
instead.

IMHO we shouldn't accept new cost hooks that are only implemented
by a single target.  Instead it should be proved that the hook can
improve code on multiple targets.

Richard.

> Thanks,
> James
>
>> > This patch series introduces that new target hook to provide cost
>> > decisions for the RTL ifcvt pass.
>> >
>> > The idea is to give the target full visibility of the proposed
>> > transformation, and allow it to respond as to whether if-conversion in that
>> > way is profitable.
>> >
>> > In order to preserve current behaviour across targets, we will need the
>> > default implementation to keep to the strategy of simply comparing branch
>> > cost against a magic number. Patch 1/3 performs this refactoring, which is
>> > a bit hairy in some corner cases.
>> >
>> > Patch 2/3 is a simple code move, pulling the definition of the if_info
>> > structure used by RTL if-convert in to ifcvt.h where it can be included
>> > by targets.
>> >
>> > Patch 3/3 then introduces the new target hook, with the same default
>> > behaviour as was previously in noce_is_profitable_p.
>> >
>> > The series has been bootstrapped on ARM, AArch64 and x86_64 targets, and
>> > I've verified with Spec2000 and Spec2006 runs that there are no code
>> > generation differences for any of these three targets after the patch.
>> >
>> > I also gave ultrasparc3 a quick go, from what I could see, I changed the
>> > register allocation for the floating-point condition code registers.
>> > Presumably this is a side effect of first constructing RTXen that I then
>> > discard. I didn't see anything which looked like more frequent reloads or
>> > substantial code generation changes, though I'm not familiar with the
>> > intricacies of the Sparc condition registers :).
>> >
>> > I've included a patch 4/3, to give an example of what a target might want
>> > to do with this hook. It needs work for tuning and deciding how the function
>> > should actually behave, but works if it is thought of as more of a
>> > strawman/prototype than a patch submission.
>> >
>> > Are parts 1, 2 and 3 OK?
>> >
>> > Thanks,
>> > James
>> >
>> > [1]: https://gcc.gnu.org/ml/gcc-patches/2015-09/msg00781.html
>> >
>> > ---
>> > [Patch ifcvt 1/3] Factor out cost calculations from noce cases
>> >
>> > 2015-09-26  James Greenhalgh  <james.greenhalgh@arm.com>
>> >
>> >         * ifcvt.c (noce_if_info): Add a magic_number field :-(.
>> >         (noce_is_profitable_p): New.
>> >         (noce_try_store_flag_constants): Move cost calculation
>> >         to after sequence generation, factor it out to noce_is_profitable_p.
>> >         (noce_try_addcc): Likewise.
>> >         (noce_try_store_flag_mask): Likewise.
>> >         (noce_try_cmove): Likewise.
>> >         (noce_try_cmove_arith): Likewise.
>> >         (noce_try_sign_mask): Add comment regarding cost calculations.
>> >
>> > [Patch ifcvt 2/3] Move noce_if_info in to ifcvt.h
>> >
>> > 2015-09-26  James Greenhalgh  <james.greenhalgh@arm.com>
>> >
>> >         * ifcvt.c (noce_if_info): Move to...
>> >         * ifcvt.h (noce_if_info): ...Here.
>> >
>> > [Patch ifcvt 3/3] Create a new target hook for deciding profitability
>> >     of noce if-conversion
>> >
>> > 2015-09-26  James Greenhalgh  <james.greenhalgh@arm.com>
>> >
>> >         * target.def (costs): New hook vector.
>> >         (ifcvt_noce_profitable_p): New hook.
>> >         * doc/tm.texi.in: Document it.
>> >         * doc/tm.texi: Regenerate.
>> >         * targhooks.h (default_ifcvt_noce_profitable_p): New.
>> >         * targhooks.c (default_ifcvt_noce_profitable_p): New.
>> >         * ifcvt.c (noce_profitable_p): Use new target hook.
>> >
>> > [Patch Prototype AArch64 ifcvt 4/3] Wire up the new if-convert costs
>> >     hook for AArch64
>> >
>> > 2015-09-26  James Greenhalgh  <james.greenhalgh@arm.com>
>> >
>> >         * config/aarch64/aarch64.c
>> >         (aarch64_additional_branch_cost_for_probability): New.
>> >         (aarch64_ifcvt_noce_profitable_p): Likewise.
>> >         (TARGET_COSTS_IFCVT_NOCE_PROFITABLE_P): Likewise.
>>
Mike Stump Sept. 30, 2015, 6:49 p.m. UTC | #15
On Sep 30, 2015, at 1:04 AM, Richard Biener <richard.guenther@gmail.com> wrote:
> So what about a branch_cost hook that takes taken/not-taken probabilities as argument?

So, for my port, I need to know %prediction as well to calculate cost.  I know, kinda sucks.  Or put another way, I want to explain the cost taken, predicted, not-taken, predicted, taken, mis-predicted, and not-taken-mis-predicted and let the caller sort out if the branch will be predicted or mis-predicted, as it can do the math itself and that math is target independent.
Bernd Schmidt Oct. 1, 2015, 9:37 a.m. UTC | #16
On 09/29/2015 04:31 PM, James Greenhalgh wrote:
> On the other side of the equation, we want a cost for the converted
> sequence. We can build a cost of the generated rtl sequence, but for
> targets like AArch64 this is going to be wildly off. AArch64 will expand
> (a > b) ? x : y; as a set to the CC register, followed by a conditional
> move based on the CC register. Consequently, where we have multiple sets
> back to back we end up with:
>
>    set CC (a > b)
>    set x1 (CC ? x : y)
>    set CC (a > b)
>    set x2 (CC ? x : z)
>    set CC (a > b)
>    set x3 (CC ? x : k)
>
> Which we know will be simplified later to:
>
>    set CC (a > b)
>    set x1 (CC ? x : y)
>    set x2 (CC ? x : z)
>    set x3 (CC ? x : k)

I guess the transformation you want to make is a bit unusual in that it 
generates such extra instructions. rtx_cost has problems taking such 
secondary considerations into account.

I haven't quite made up my mind about the new target hook, but I wonder 
if it might be a good idea to try and simplify the above sequence on the 
spot before calculating costs for it. (Incidentally, which pass removes 
the extra CC sets?)


Bernd
Bernd Schmidt Oct. 9, 2015, 11:28 a.m. UTC | #17
On 10/01/2015 11:37 AM, Bernd Schmidt wrote:
> On 09/29/2015 04:31 PM, James Greenhalgh wrote:
>> On the other side of the equation, we want a cost for the converted
>> sequence. We can build a cost of the generated rtl sequence, but for
>> targets like AArch64 this is going to be wildly off. AArch64 will expand
>> (a > b) ? x : y; as a set to the CC register, followed by a conditional
>> move based on the CC register. Consequently, where we have multiple sets
>> back to back we end up with:
>>
>>    set CC (a > b)
>>    set x1 (CC ? x : y)
>>    set CC (a > b)
>>    set x2 (CC ? x : z)
>>    set CC (a > b)
>>    set x3 (CC ? x : k)
>>
>> Which we know will be simplified later to:
>>
>>    set CC (a > b)
>>    set x1 (CC ? x : y)
>>    set x2 (CC ? x : z)
>>    set x3 (CC ? x : k)
>
> I guess the transformation you want to make is a bit unusual in that it
> generates such extra instructions. rtx_cost has problems taking such
> secondary considerations into account.
>
> I haven't quite made up my mind about the new target hook, but I wonder
> if it might be a good idea to try and simplify the above sequence on the
> spot before calculating costs for it. (Incidentally, which pass removes
> the extra CC sets?)

I don't know whether you've done any more work on the patch series, but 
I think I've made up my mind that optimizing the sequence before 
computing its cost would be a good thing to try first. Either with a 
better expander interface which generates the right thing immediately, 
or running something like cselib over the generated code. I think this 
would give more reliable results rather than guesstimating a redundancy 
factor in the cost function.


Bernd
Jeff Law Oct. 9, 2015, 3:28 p.m. UTC | #18
On 10/09/2015 05:28 AM, Bernd Schmidt wrote:
>
> I don't know whether you've done any more work on the patch series, but
> I think I've made up my mind that optimizing the sequence before
> computing its cost would be a good thing to try first. Either with a
> better expander interface which generates the right thing immediately,
> or running something like cselib over the generated code. I think this
> would give more reliable results rather than guesstimating a redundancy
> factor in the cost function.
Another great example of what we could do if we could give an RTL 
sequence and mini CFG to the various optimizers and have them just work 
on the sequence (making the appropriate assumptions about live-in and 
live-out objects).  Retrofitting this capability would likely be very 
hard at this point :(

Even just DCE & CSE would probably be a notable step forward.

jeff
James Greenhalgh June 2, 2016, 4:53 p.m. UTC | #19
Hi,

When I was working in the area last year, I promised to revisit the cost
model for noce if-conversion and see if I could improve the modeling. This
turned out to be more tricky than I expected.

This patch set rewrites the cost model for noce if-conversion. The goal is
to rationalise the units used in the calculations away from BRANCH_COST,
which is only defined relative to itself.

I've introduced a new target hook "rtx_branch_cost" which is defined to return
the cost of a branch in RTX cost units, suitable for comparing directly with
the calculated cost of a conditional move, or the conditionally executed
branches.

If you're looking at that and thinking it doesn't sound much different from
our current call to BRANCH_COST, you're right. This isn't as large a
departure from the existing cost model as I had originally intended. I
started out experimenting with much larger hooks (with many
parameters/pass-specific data), or hooks that passed the whole edge to
the target asking for the cost. These ended up feeling quite silly
to write in the target, and don't match with the direction discussed at
last year's cauldron. We don't want to go around leaking pass-internal
data around to back-ends. That is a path of madness as the passes change but
find that targets have invented baroque calculations to help invent a
magic number.

I tried implementing a "replacement_cost" hook, which would take before
and after code sequences and try to guess profitability, but because you
want to take edge probabilities in to consideration while trying to calculate
the costs of an if-then-else structure, the code gets hairy quickly. Worse
is that this would need duplicating across any target implementing the
hook. I found that I was constructing lots of RTX just to throw it away
again, and sometimes we were constructing RTX that would trivially be
optimised by a future pass. As a metric for if-conversion, this hook
seemed more harmful than useful for both the quality of the decision we'd
make, and for the quality of the GCC source.

As I iterated through versions of this patch set, I realised that all we
really wanted for ifcvt was a way to estimate the cost of a branch in units
that were comparable to the cost of instructions. The trouble with BRANCH_COST
wasn't that it was returning a magic number, it was just that it was returning
a magic number which had inconsistent meanings in the compiler. Otherwise,
BRANCH_COST was a straightforward, low-complexity target hook.

So the new hook simply defines the relative units that it will use and
splits off the use in ifcvt from other BRANCH_COST calls.

Having introduced the hook, and added some framework to make use of it, the
rest of the patch set works through each of the cost models in ifcvt.c,
makes them consistent, and moves them to the new hook.

This act of making the cost models consistent will cause code generation
changes on a number of targets - most notably x86_64. On x86_64 the RTX
cost of a conditional move comes out at "20" - this is far higher than
COSTS_N_INSNS (BRANCH_COST) for the x86 targets, so they lose lots
of if-conversion. The easy fix for this would be to implement the new hook.
I measured the performance impact on Spec2000 as a smoke test, it didn't
seem to harm anything, and the result was a slight (< 3%) uplift on
Spec2000FP. I'm no expert on x86_64, so I haven't taken a closer look for
the reasons.

Having worked through the patch set, I'd say it is probably a small
improvement over what we currently do, but I'm not very happy with it. I'm
posting it for comment so we can discuss any directions for costs that I
haven't thought about or prototyped. I'm also happy to drop the costs
rewrite if this seems like complexity for no benefit.

Any thoughts?

I've bootstrapped and tested the patch set on x86_64 and aarch64, but
they probably need some further polishing if we were to decide this was a
useful direction.

Thanks,
James

James Greenhalgh (6):
  [RFC: Patch 1/6] New target hook: rtx_branch_cost
  [RFC: Patch 2/6] Factor out the comparisons against magic numbers in
    ifcvt
  [RFC: Patch 3/6] Remove if_info->branch_cost
  [RFC: Patch 4/6] Modify cost model for noce_cmove_arith
  [RFC: Patch 5/6] Improve the cost model for multiple-sets
  [RFC: Patch 6/6] Remove second cost model from
    noce_try_store_flag_mask

 gcc/doc/tm.texi    |  10 +++
 gcc/doc/tm.texi.in |   2 +
 gcc/ifcvt.c        | 204 +++++++++++++++++++++++++++++++++++++++++------------
 gcc/target.def     |  14 ++++
 gcc/targhooks.c    |  10 +++
 gcc/targhooks.h    |   2 +
 6 files changed, 197 insertions(+), 45 deletions(-)
James Greenhalgh June 10, 2016, 10:45 a.m. UTC | #20
On Thu, Jun 09, 2016 at 10:58:52AM -0600, Jeff Law wrote:
> On 06/02/2016 10:53 AM, James Greenhalgh wrote:

> >Hi,

> >

> >When I was working in the area last year, I promised to revisit the cost

> >model for noce if-conversion and see if I could improve the modeling. This

> >turned out to be more tricky than I expected.

> >

> >This patch set rewrites the cost model for noce if-conversion. The goal is

> >to rationalise the units used in the calculations away from BRANCH_COST,

> >which is only defined relative to itself.

> Right.  I think we all agreed that the key weakness of BRANCH_COST

> was that its meaning is only defined relative to itself.

> 

> What we want is a costing metric that would allow us to estimate the

> cost of different forms of the computation, which might include

> branches and which may include edge probabilty information.

> 

> >

> >If you're looking at that and thinking it doesn't sound much different from

> >our current call to BRANCH_COST, you're right. This isn't as large a

> >departure from the existing cost model as I had originally intended.

> Perhaps not as large of a change as you intended, but I think you're

> hitting the key issue with BRANCH_COST.

> 

> 

> >This act of making the cost models consistent will cause code generation

> >changes on a number of targets - most notably x86_64. On x86_64 the RTX

> >cost of a conditional move comes out at "20" - this is far higher than

> >COSTS_N_INSNS (BRANCH_COST) for the x86 targets, so they lose lots

> >of if-conversion. The easy fix for this would be to implement the new hook.

> >I measured the performance impact on Spec2000 as a smoke test, it didn't

> >seem to harm anything, and the result was a slight (< 3%) uplift on

> >Spec2000FP. I'm no expert on x86_64, so I haven't taken a closer look for

> >the reasons.

> I'd be comfortable with Uros guiding the implementation of the

> target hook for x86 so that we don't take a major step backward.


The trouble I'm having with all targets is noce_try_cmove_arith and the
testcases added as part of that patch set.

The current cost model for noce_try_cmove_arith doesn't take in to
consideration the cost of a conditional move at all, it just checks the
cost of each branch, takes the maximum of that, and compares it against
COSTS_N_INSNS (BRANCH_COST).

As we move to my patch set, I want to compute the cost of the whole
if-converted sequence, which is going to include the cost of a conditional
move. On x86_64, the total cost for a simple conditional move
is COSTS_N_INSNS (5). This alone would prevent if-conversion for most
x86_64 subtargets, once you add in the max cost of one of the two branches,
you guarantee that no x86_64 target will be converting through
noce_try_cmove_arith.

From the test results I'm seeing, this holds true for other targets which
show regressions with the new cost models. Clearly my idea for the default
hook implementation is not going to fly. If more targets had implemented the
PARAM_MAX_RTL_IF_CONVERSION_INSNS hook, I could use that to guide the
default implementation, but that is currently x86_64 only. I could add a
multiplier against BRANCH_COST in the new hook, but then we'd be guaranteeing
that the "cheap" if-conversions, where we use STORE_FLAG_VALUE rather than
introducing a conditional move, always fired unless the target had a
branch_cost of 0 (this might not be a bad model actually...). Finding what
this multiplier should be will be tough, as it depends directly on the cost
of a conditional move, which targets don't expose easily, and which I don't
want to construct junk conditional moves just to find through rtx_cost.

I'll keep giving it some thought, and I'd appreciate any suggestions for the
default hook implementation.

I've respun the patch set around Bernd's suggestion, and now that we don't
check the cost model until after constructing the new sequence the patch set
looks much nicer. I'll wait a bit before sending the respin out in the hope
that I'll have a good idea for the default hook implementation to reduce
the number of performance changes I'll introduce.

Thanks,
James
diff mbox

Patch

diff --git a/gcc/ifcvt.c b/gcc/ifcvt.c
index 157a716..059bd89 100644
--- a/gcc/ifcvt.c
+++ b/gcc/ifcvt.c
@@ -2982,6 +2982,223 @@  bb_valid_for_noce_process_p (basic_block test_bb, rtx cond,
   return false;
 }
 
+/* We have something like:
+
+     if (x > y)
+       { i = a; j = b; k = c; }
+
+   Make it:
+
+     tmp_i = (x > y) ? a : i;
+     tmp_j = (x > y) ? b : j;
+     tmp_k = (x > y) ? c : k;
+     i = tmp_i; <- Should be cleaned up
+     j = tmp_j; <- Likewise.
+     k = tmp_k; <- Likewise.
+
+   Look for special cases such as use of temporary registers (for
+   example in a swap idiom).
+
+   IF_INFO contains the useful information about the block structure and
+   jump instructions.  */
+
+static int
+noce_convert_multiple_sets (struct noce_if_info *if_info)
+{
+  basic_block test_bb = if_info->test_bb;
+  basic_block then_bb = if_info->then_bb;
+  basic_block join_bb = if_info->join_bb;
+  rtx_insn *jump = if_info->jump;
+  rtx_insn *cond_earliest;
+  unsigned int cost = 0;
+  rtx_insn *insn;
+
+  start_sequence ();
+
+  /* Decompose the condition attached to the jump.  */
+  rtx cond = noce_get_condition (jump, &cond_earliest, false);
+  rtx x = XEXP (cond, 0);
+  rtx y = XEXP (cond, 1);
+  rtx_code cond_code = GET_CODE (cond);
+
+  /* The true targets for a conditional move.  */
+  vec<rtx> targets = vNULL;
+  /* The temporaries introduced to allow us to not consider register
+     overlap.  */
+  vec<rtx> temporaries = vNULL;
+  /* The insns we've emitted.  */
+  vec<rtx_insn *> unmodified_insns = vNULL;
+  unsigned count = 0;
+
+  FOR_BB_INSNS (then_bb, insn)
+    {
+      /* Skip over non-insns.  */
+      if (!active_insn_p (insn))
+	continue;
+
+      rtx set = single_set (insn);
+      gcc_checking_assert (set);
+
+      rtx target = SET_DEST (set);
+      rtx temp = gen_reg_rtx (GET_MODE (target));
+      rtx new_val = SET_SRC (set);
+      rtx old_val = target;
+
+      /* If we were supposed to read from an earlier write in this block,
+	 we've changed the register allocation.  Rewire the read.  While
+	 we are looking, also try to catch a swap idiom.  */
+      rtx candidate_rewire = new_val;
+      for (unsigned i = 0; i < count; i++)
+	{
+	  if (reg_overlap_mentioned_p (new_val, targets[i]))
+	    {
+	      /* Catch a "swap" style idiom.  */
+	      if (find_reg_note (insn, REG_DEAD, new_val) != NULL_RTX)
+		{
+		  /* The write to targets[i] is only live until the read
+		     here.  As the condition codes match, we can propagate
+		     the set to here.  */
+		   candidate_rewire
+		     = SET_SRC (single_set (unmodified_insns[i]));
+
+		   /* Discount the cost calculation by one conditional
+		      set instruction.  As we are just putting out
+		      a group of SET instructions, any will do.  */
+		   cost -= insn_rtx_cost (PATTERN (get_last_insn ()),
+					  optimize_bb_for_speed_p (test_bb));
+		}
+	      else
+		candidate_rewire = temporaries[i];
+	    }
+	}
+      new_val = candidate_rewire;
+
+      /* If we had a non-canonical conditional jump (i.e. one where
+	 the fallthrough is to the "else" case) we need to reverse
+	 the conditional select.  */
+      if (if_info->then_else_reversed)
+	std::swap (old_val, new_val);
+
+      /* Actually emit the conditional move.  */
+      rtx temp_dest = noce_emit_cmove (if_info, temp, cond_code,
+				       x, y, new_val, old_val);
+
+      /* If we failed to expand the conditional move, drop out and don't
+	 try to continue.  */
+      if (temp_dest == NULL_RTX)
+	{
+	  end_sequence ();
+	  return FALSE;
+	}
+
+      /* Track the cost of building these conditional instructions.  */
+      cost += insn_rtx_cost (PATTERN (get_last_insn ()),
+			     optimize_bb_for_speed_p (test_bb));
+
+      /* Bookkeeping.  */
+      count++;
+      targets.safe_push (target);
+      temporaries.safe_push (temp_dest);
+      unmodified_insns.safe_push (insn);
+    }
+
+  /* We must have seen some sort of insn to insert, otherwise we were
+     given an empty BB to convert, and we can't handle that.  */
+  if (unmodified_insns.is_empty ())
+    {
+      end_sequence ();
+      return FALSE;
+    }
+
+  /* Check if this is actually beneficial.  */
+  if (cost > COSTS_N_INSNS (if_info->branch_cost))
+     {
+       end_sequence ();
+       return FALSE;
+     }
+
+  /* Now fixup the assignments.  */
+  for (unsigned i = 0; i < count; i++)
+    noce_emit_move_insn (targets[i], temporaries[i]);
+
+  /* Actually emit the sequence.  */
+  rtx_insn *seq = get_insns ();
+
+  for (insn = seq; insn; insn = NEXT_INSN (insn))
+    set_used_flags (insn);
+
+  unshare_all_rtl_in_chain (seq);
+  end_sequence ();
+
+  if (!seq)
+    return FALSE;
+
+  emit_insn_before_setloc (seq, if_info->jump,
+			   INSN_LOCATION (unmodified_insns.last ()));
+
+  /* Clean up THEN_BB and the edges in and out of it.  */
+  remove_edge (find_edge (test_bb, join_bb));
+  remove_edge (find_edge (then_bb, join_bb));
+  redirect_edge_and_branch_force (single_succ_edge (test_bb), join_bb);
+  delete_basic_block (then_bb);
+  num_true_changes++;
+
+  /* Maybe merge blocks now the jump is simple enough.  */
+  if (can_merge_blocks_p (test_bb, join_bb))
+    {
+      merge_blocks (test_bb, join_bb);
+      num_true_changes++;
+    }
+
+  num_updated_if_blocks++;
+  return TRUE;
+}
+
+/* Return true iff basic block TEST_BB is comprised of only
+   (SET (REG) (REG)) insns suitable for conversion to a series
+   of conditional moves.  */
+
+static bool
+bb_ok_for_noce_convert_multiple_sets (basic_block test_bb)
+{
+  rtx_insn *insn;
+
+  /* We must have at least one real insn to convert, or there will
+     be trouble!  */
+  bool bb_is_not_empty = false;
+  FOR_BB_INSNS (test_bb, insn)
+    {
+      /* Skip over notes etc.  */
+      if (!active_insn_p (insn))
+	continue;
+
+      /* We only handle SET insns.  */
+      rtx set = single_set (insn);
+      if (set == NULL_RTX)
+	return false;
+
+      rtx dest = SET_DEST (set);
+      rtx src = SET_SRC (set);
+
+      /* We can possibly relax this, but for now only handle REG to REG
+	 moves.  This avoids any issues that might come from introducing
+	 loads/stores that might violate data-race-freedom guarantees.  */
+      if (!(REG_P (src) && REG_P (dest)))
+	return false;
+
+      /* Destination must be appropriate for a conditional write.  */
+      if (!noce_operand_ok (dest))
+	return false;
+
+      /* We must be able to conditionally move in this mode.  */
+      if (!can_conditionally_move_p (GET_MODE (dest)))
+	return false;
+
+      bb_is_not_empty = true;
+    }
+  return bb_is_not_empty;
+}
+
 /* Given a simple IF-THEN-JOIN or IF-THEN-ELSE-JOIN block, attempt to convert
    it without using conditional execution.  Return TRUE if we were successful
    at converting the block.  */
@@ -3004,12 +3221,22 @@  noce_process_if_block (struct noce_if_info *if_info)
      (1) if (...) x = a; else x = b;
      (2) x = b; if (...) x = a;
      (3) if (...) x = a;   // as if with an initial x = x.
-
+     (4) if (...) { x = a; y = b; z = c; }  // Like 3, for multiple SETS.
      The later patterns require jumps to be more expensive.
      For the if (...) x = a; else x = b; case we allow multiple insns
      inside the then and else blocks as long as their only effect is
      to calculate a value for x.
-     ??? For future expansion, look for multiple X in such patterns.  */
+     ??? For future expansion, further expand the "multiple X" rules.  */
+
+  /* First look for multiple SETS.  */
+  if (!else_bb
+      && HAVE_conditional_move
+      && !HAVE_cc0
+      && bb_ok_for_noce_convert_multiple_sets (then_bb))
+    {
+      if (noce_convert_multiple_sets (if_info))
+	return TRUE;
+    }
 
   if (! bb_valid_for_noce_process_p (then_bb, cond, &if_info->then_cost,
 				    &if_info->then_simple))