diff mbox

[Re:,RFC:,1/2,v3] New target hook: max_noce_ifcvt_seq_cost

Message ID 1469008295-28884-1-git-send-email-james.greenhalgh@arm.com
State New
Headers show

Commit Message

James Greenhalgh July 20, 2016, 9:51 a.m. UTC
Splicing replies to Bernd, Bernhard and Jeff.

Jeff, thanks for reviewing the patch set, I appreciate the ack, though I've
held off committing while I was working through Bernd's criticism of the
size cost model that this patch introduced and trying to get that right.
Sorry to cause extra reviewing work, but I have respun the patch set to
try to improve the consistency of how we're costing things, and to better
handle the size cases. I'm a bit happier with how it has turned out and I
think the approach is now a little easier to justify. Hopefully it will
still be acceptable for trunk.

There are essentially two families of cost models in this file. The true
before/after comparisons (noce_cmove_arith, noce_convert_multiple_sets),
and the "magic numbers" comparisons (noce_try_store_flag_constants,
noce_try_addcc, noce_try_store_flag_mask, noce_try_cmove). In the first
revisions of this patch set, I refactored the magic numbers comparisons,
but I didn't try to solve their "magic" as comparing two integers was
a suitably fast routine, and the comparison seemed accurate enough.

But the magic numbers are potentially inaccurate for a variable-length
instruction architecture, and given the number of times we actually manage to
spot these if-convert opportunities, the compile time overhead of moving
every cost model to a before/after comparison is probably not all that
high. Then we have everything going through one single function, making

Additionally, if we can rework most of the costs to actually calculate
the before/after costs, we can then drop the "size" case from this hook
entirely - we can just look at the size of the sequences directly rather
than asking the target to guess at an acceptable size growth.

This is good as it will completely remove magic numbers from ifcvt and
make everything dependent on a simple question to the target, when
compiling for speed; "What is the maximum cost of extra execution that
you'd like to see on the unconditional path?"

Unfortunately, disentangling this makes it harder to layout the patch set
quite as neatly as before. The changes follow the same structure, but I've
had to squash all the cost changes in to patch 2/2. Fortunately these now
look reasonably mechanical, and consequently the patch is not much more
difficult to review.

Patches 3/2 and 4/2 are not strictly needed as part of the cost model work,
but they do help the cost model by performing some simplifications early.
This reduces the chance of us rejecting if-conversion based on too many
simple moves that a future pass would have cleared up anyway. The csibe
numbers below rely on these two patches having been applied. Without them,
we get a couple of decisions wrong and some files from csibe increase
by < 3%.

On Tue, Jun 21, 2016 at 11:30:17PM +0200, Bernhard Reutner-Fischer wrote:
>

> >For the default implementation, if the parameters are not set, I just

> >multiply BRANCH_COST through by COSTS_N_INSNS (1) for size and

> >COSTS_N_INSNS (3) for speed. I know this is not ideal, but I'm still

> >short

> >of ideas on how best to form the default implementation.

>

> How bad is it in e.g. CSiBE?


I'm not completely sure I've set it up right, but these are the >0.5% size
 differences for an x86_64 compiler I built last Friday using -Os:

Smaller:

Relative size	Test name

93.33	flex-2.5.31,tables_shared
94.37	teem-1.6.0-src,src/limn/qn
97.27	teem-1.6.0-src,src/nrrd/kernel
98.31	teem-1.6.0-src,src/ten/miscTen
98.60	teem-1.6.0-src,src/ell/genmat
98.69	teem-1.6.0-src,src/nrrd/measure
99.03	teem-1.6.0-src,src/ten/mod
99.04	libpng-1.2.5,pngwtran
99.08	jpeg-6b,jdcoefct
99.14	teem-1.6.0-src,src/dye/convertDye
99.15	teem-1.6.0-src,src/ten/glyph
99.16	teem-1.6.0-src,src/bane/gkmsPvg
99.20	teem-1.6.0-src,src/limn/splineEval
99.25	teem-1.6.0-src,src/nrrd/accessors
99.28	teem-1.6.0-src,src/hest/parseHest
99.33	teem-1.6.0-src,src/limn/transform
99.40	teem-1.6.0-src,src/alan/coreAlan
99.48	teem-1.6.0-src,src/air/miscAir

Larger:

Relative size	Test name

101.43	teem-1.6.0-src,src/ten/tendEvec
101.57	teem-1.6.0-src,src/ten/tendEval

However, the total size difference is indistinguishable from noise
(< 0.08%).

Running the same experiment with an AArch64 cross compiler, I get the
following changes:

Smaller:

Relative size	Test name

97.78	libpng-1.2.5,pngrio
98.02	libpng-1.2.5,pngwio
98.82	replaypc-0.4.0.preproc,ReplayPC
99.21	lwip-0.5.3.preproc,src/core/inet
99.48	jpeg-6b,wrppm

Larger:

Relative size	Test name

100.52	jpeg-6b,wrbmp
100.82	libpng-1.2.5,pngwtran
100.91	zlib-1.1.4,infcodes

And the overall size difference was tiny (< 0.01%).

There were no >0.5% changes for the ARM port (expected as it doesn't use
noce).

I looked in to each of the regressions, and generally they occur where
we relying on a future pass to clean up after us. This is especially true
for the large x86_64 regressions, which as far as I can see are a
consequence of x86_64's floating-point conditional move expanding out to
bitwise operations. Taken individually, these look huge, but when you have
multiple conditional moves feeding each other some of the bitwise
expressions simplify and you get a size saving. We can't model that in our
cost model, and in many ways we just got lucky previously.

> s/precitable/predictable/ ?


This, and all your other comments regarding spelling and grammar have been
fixed. Thanks.

On Thu, Jun 30, 2016 at 01:58:52PM +0200, Bernd Schmidt wrote:
> On 06/21/2016 05:50 PM, James Greenhalgh wrote:

> >For the default implementation, if the parameters are not set, I just

> >multiply BRANCH_COST through by COSTS_N_INSNS (1) for size and

> >COSTS_N_INSNS (3) for speed. I know this is not ideal, but I'm still short

> >of ideas on how best to form the default implementation.

>

> Yeah, this does seem kind of arbitrary. It looks especialy odd

> considering that BRANCH_COST is likely to already vary between the

> size/speed cases. What's wrong with just multiplying through by

> CNI(1)?

>

> I'm not sure we want params for this; targets should just eventually

> upgrade their cost models.


We've used params in the past as a migration path, they are particularly
handy in this case as they allow us to override target settings when in the
testsuite.

Removing these would be easy, but I've left them in for now, as I like
the three tiered flexibility:

  * Target does nothing - hook uses BRANCH_COST,
  * Target only needs a simple model so sets params - hook uses params
  * Target thinks it can do something very smart - target implements hook

> >The new default causes some changes in generated conditional move sequences

> >for x86_64. Whether these changes are for the better or not I can't say.

>

> How about arm/aarch64? I think some benchmark results might be good to have.


The ARM port isn't very interesting as it has conditional execution and
therefore mostly uses other paths through this file.

On AArch64 I get an increase in the number of CSEL and FCSEL instructions
generated when compiling Spec2006. With the current definition of
BRANCH_COST for AArch64 we lose some important if-cvt opportunities, but
these can be restored by setting the BRANCH_COST for predictable branches
higher. With this done I see some small improvements in Spec2006, but
nothing meaningful and no regressions. This is probably exactly where I want
to be with this patch set - no change is a good thing.

> Bernhard already pointed out some issues with the patch; I'll omit these.


As mentioned above, I've fixed Bernhard's issues in this patch revision.

> >+(max_noce_ifcvt_seq_cost,

> >+ "This hook should return a value in the same units as\n\

> >+@code{TARGET_RTX_COSTS}, giving the maximum acceptable cost for\n\

> >+a sequence generated by the RTL if-conversion pass when conditional\n\

> >+execution is not available.

>

> There's still the issue that we're also replacing instructions when

> doing if-conversion. Let's say in this case,

>

>  /* Convert "if (test) x = a; else x = b", for A and B constant.

>    Also allow A = y + c1, B = y + c2, with a common y between A

>    and B.  */

>

> we're removing two assignments for the purposes of optimizing for

> size, and one assignment when considering optimization for speed.

> This needs to factor into the cost calculations somehow if we want

> to do it properly. I think we can leave the hook as-is, but maybe

> add documentation to the effect of "The caller should increase the

> limit by the cost of whatever instructions are removed in the

> transformation."


Yes, I see what you mean. Hopefully I've addressed this in this patchset
revision.

>

> >+/* Default implementation of TARGET_RTX_BRANCH_COST.  */

>

> Wrong name for the hook.


Thanks, fixed.

> >+unsigned int

> >+default_max_noce_ifcvt_seq_cost (bool speed_p, edge e)

> >+{

> >+  bool predictable_p = predictable_edge_p (e);

> >+  /* For size, some targets like to set a BRANCH_COST of zero to disable

> >+     ifcvt, continue to allow that.  Then multiply through by

> >+     COSTS_N_INSNS (1) so we're in a comparable base.  */

> >+

> >+  if (!speed_p)

> >+    return BRANCH_COST (speed_p, predictable_p) * COSTS_N_INSNS (1);

>

> Blank line before the comment would be more readable.


Fixed by virtue of removing this code.

> >+  enum compiler_param param = predictable_p

> >+			      ? PARAM_MAX_RTL_IF_CONVERSION_PREDICTABLE_COST

> >+			      : PARAM_MAX_RTL_IF_CONVERSION_UNPREDICTABLE_COST;

>

> When splitting expressions across multiple lines, wrap in parens so

> that emacs formats them automatically.


Fixed.

This patch, and all others in this series bootstrapped and tested on x86_64
and aarch64 with no issues.

OK?

Thanks,
James

---

2016-07-20  James Greenhalgh  <james.greenhalgh@arm.com>

	* target.def (max_noce_ifcvt_seq_cost): New.
	* doc/tm.texi.in (TARGET_MAX_NOCE_IFCVT_SEQ_COST): Document it.
	* doc/tm.texi: Regenerate.
	* targhooks.h (default_max_noce_ifcvt_seq_cost): New.
	* targhooks.c (default_max_noce_ifcvt_seq_cost): New.
	* params.def (PARAM_MAX_RTL_IF_CONVERSION_PREDICTABLE_COST): New.
	(PARAM_MAX_RTL_IF_CONVERSION_UNPREDICTABLE_COST): Likewise.
	* doc/invoke.texi: Document new params.

Comments

James Greenhalgh July 20, 2016, 4:39 p.m. UTC | #1
On Wed, Jul 20, 2016 at 01:41:39PM +0200, Bernd Schmidt wrote:
> On 07/20/2016 11:51 AM, James Greenhalgh wrote:

> 

> >

> >2016-07-20  James Greenhalgh  <james.greenhalgh@arm.com>

> >

> >	* target.def (max_noce_ifcvt_seq_cost): New.

> >	* doc/tm.texi.in (TARGET_MAX_NOCE_IFCVT_SEQ_COST): Document it.

> >	* doc/tm.texi: Regenerate.

> >	* targhooks.h (default_max_noce_ifcvt_seq_cost): New.

> >	* targhooks.c (default_max_noce_ifcvt_seq_cost): New.

> >	* params.def (PARAM_MAX_RTL_IF_CONVERSION_PREDICTABLE_COST): New.

> >	(PARAM_MAX_RTL_IF_CONVERSION_UNPREDICTABLE_COST): Likewise.

> >	* doc/invoke.texi: Document new params.

> 

> I think this is starting to look like a clear improvement, so I'll

> ack patches 1-3 with a few minor comments, and with the expectation

> that you'll address performance regressions on other targets if they

> occur.


I'll gladly take a look if I've caused anyone any trouble.

> Number 4 I still need to figure out.

> 

> Minor details:

> 

> >+  if (!speed_p)

> >+    {

> >+      return cost <= if_info->original_cost;

> >+    }

> 

> No braces around single statements in ifs. There's an instance of

> this in patch 4 as well.

> 

> >+  if (global_options_set.x_param_values[param])

> >+    return PARAM_VALUE (param);

> 

> How about wrapping the param value into COSTS_N_INSNS, to make the

> value of the param less dependent on compiler internals?


I did consider this, but found it hard to word for the user documentation.
I found it easier to understand when it was in the same units as
rtx_cost, particularly as the AArch64 backend prints RTX costs to most
dump files (including ce1, ce2, ce3) so comparing directly was easy for me
to grok. I think going in either direction has the potential to confuse
users, the cost metrics of the RTL passes are very tightly coupled to
compiler internals.

I don't have a strong feeling either way, just a slight preference to keep
everything in the same units as rtx_cost where I can.

Let me know if you'd rather I follow this comment. There's some precedent
to wrapping it in COSTS_N_INSNS in GCSE_UNRESTRICTED_COST, but I find this
less clear than what I've done (well, I would say that :-) ).

> In patch 4:

> 

> >+  /* Check that our new insn isn't going to clobber ORIG_OTHER_DEST.  */

> >+  bool modified_in_x = (set_tmp != NULL_RTX)

> >+			&& modified_in_p (orig_other_dest, set_tmp);

> 

> Watch line wrapping. No parens around the first subexpression (there

> are other examples of unnecessary ones in invocations of

> noce_arith_helper), but around the full one.


I'll catch these and others on commit, thanks for pointing them out.

Thanks,
James
diff mbox

Patch

diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 9a4db38..94d2b48 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -8865,6 +8865,17 @@  considered for if-conversion.  The default is 10, though the compiler will
 also use other heuristics to decide whether if-conversion is likely to be
 profitable.
 
+@item max-rtl-if-conversion-predictable-cost
+@item max-rtl-if-conversion-unpredictable-cost
+RTL if-conversion will try to remove conditional branches around a block
+and replace them with conditionally executed instructions.  These parameters
+give the maximum permissible cost for the sequence that would be generated
+by if-conversion depending on whether the branch is statically determined
+to be predictable or not.  The units for this parameter are the same as
+those for the GCC internal seq_cost metric.  The compiler will try to
+provide a reasonable default for this parameter using the BRANCH_COST
+target macro.
+
 @item max-crossjump-edges
 The maximum number of incoming edges to consider for cross-jumping.
 The algorithm used by @option{-fcrossjumping} is @math{O(N^2)} in
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index b318615..28fba6b 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -6526,6 +6526,26 @@  should probably only be given to addresses with different numbers of
 registers on machines with lots of registers.
 @end deftypefn
 
+@deftypefn {Target Hook} {unsigned int} TARGET_MAX_NOCE_IFCVT_SEQ_COST (edge @var{e})
+This hook returns a value in the same units as @code{TARGET_RTX_COSTS},
+giving the maximum acceptable cost for a sequence generated by the RTL
+if-conversion pass when conditional execution is not available.
+The RTL if-conversion pass attempts to convert conditional operations
+that would require a branch to a series of unconditional operations and
+@code{mov@var{mode}cc} insns.  This hook returns the maximum cost of the
+unconditional instructions and the @code{mov@var{mode}cc} insns.
+RTL if-conversion is cancelled if the cost of the converted sequence
+is greater than the value returned by this hook.
+
+@code{e} is the edge between the basic block containing the conditional
+branch to the basic block which would be executed if the condition
+were true.
+
+The default implementation of this hook uses the
+@code{max-rtl-if-conversion-[un]predictable} parameters if they are set,
+and uses a multiple of @code{BRANCH_COST} otherwise.
+@end deftypefn
+
 @deftypefn {Target Hook} bool TARGET_NO_SPECULATION_IN_DELAY_SLOTS_P (void)
 This predicate controls the use of the eager delay slot filler to disallow
 speculatively executed instructions being placed in delay slots.  Targets
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index 1e8423c..d2b7f41 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -4762,6 +4762,8 @@  Define this macro if a non-short-circuit operation produced by
 
 @hook TARGET_ADDRESS_COST
 
+@hook TARGET_MAX_NOCE_IFCVT_SEQ_COST
+
 @hook TARGET_NO_SPECULATION_IN_DELAY_SLOTS_P
 
 @node Scheduling
diff --git a/gcc/params.def b/gcc/params.def
index b86d592..166032e 100644
--- a/gcc/params.def
+++ b/gcc/params.def
@@ -1222,6 +1222,20 @@  DEFPARAM (PARAM_MAX_RTL_IF_CONVERSION_INSNS,
 	  "if-conversion.",
 	  10, 0, 99)
 
+DEFPARAM (PARAM_MAX_RTL_IF_CONVERSION_PREDICTABLE_COST,
+	  "max-rtl-if-conversion-predictable-cost",
+	  "Maximum permissible cost for the sequence that would be "
+	  "generated by the RTL if-conversion pass for a branch that "
+	  "is considered predictable.",
+	  20, 0, 200)
+
+DEFPARAM (PARAM_MAX_RTL_IF_CONVERSION_UNPREDICTABLE_COST,
+	  "max-rtl-if-conversion-unpredictable-cost",
+	  "Maximum permissible cost for the sequence that would be "
+	  "generated by the RTL if-conversion pass for a branch that "
+	  "is considered unpredictable.",
+	  40, 0, 200)
+
 DEFPARAM (PARAM_HSA_GEN_DEBUG_STORES,
 	  "hsa-gen-debug-stores",
 	  "Level of hsa debug stores verbosity",
diff --git a/gcc/target.def b/gcc/target.def
index a4df363..b2139ce 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -3572,6 +3572,30 @@  registers on machines with lots of registers.",
  int, (rtx address, machine_mode mode, addr_space_t as, bool speed),
  default_address_cost)
 
+/* Give a cost, in RTX Costs units, for an edge.  Like BRANCH_COST, but with
+   well defined units.  */
+DEFHOOK
+(max_noce_ifcvt_seq_cost,
+ "This hook returns a value in the same units as @code{TARGET_RTX_COSTS},\n\
+giving the maximum acceptable cost for a sequence generated by the RTL\n\
+if-conversion pass when conditional execution is not available.\n\
+The RTL if-conversion pass attempts to convert conditional operations\n\
+that would require a branch to a series of unconditional operations and\n\
+@code{mov@var{mode}cc} insns.  This hook returns the maximum cost of the\n\
+unconditional instructions and the @code{mov@var{mode}cc} insns.\n\
+RTL if-conversion is cancelled if the cost of the converted sequence\n\
+is greater than the value returned by this hook.\n\
+\n\
+@code{e} is the edge between the basic block containing the conditional\n\
+branch to the basic block which would be executed if the condition\n\
+were true.\n\
+\n\
+The default implementation of this hook uses the\n\
+@code{max-rtl-if-conversion-[un]predictable} parameters if they are set,\n\
+and uses a multiple of @code{BRANCH_COST} otherwise.",
+unsigned int, (edge e),
+default_max_noce_ifcvt_seq_cost)
+
 /* Permit speculative instructions in delay slots during delayed-branch 
    scheduling.  */
 DEFHOOK
diff --git a/gcc/targhooks.c b/gcc/targhooks.c
index 3e089e7..08136eb 100644
--- a/gcc/targhooks.c
+++ b/gcc/targhooks.c
@@ -74,6 +74,8 @@  along with GCC; see the file COPYING3.  If not see
 #include "intl.h"
 #include "opts.h"
 #include "gimplify.h"
+#include "predict.h"
+#include "params.h"
 
 
 bool
@@ -1977,4 +1979,24 @@  default_optab_supported_p (int, machine_mode, machine_mode, optimization_type)
   return true;
 }
 
+/* Default implementation of TARGET_MAX_NOCE_IFCVT_SEQ_COST.  */
+
+unsigned int
+default_max_noce_ifcvt_seq_cost (edge e)
+{
+  bool predictable_p = predictable_edge_p (e);
+
+  enum compiler_param param
+    = (predictable_p
+       ? PARAM_MAX_RTL_IF_CONVERSION_PREDICTABLE_COST
+       : PARAM_MAX_RTL_IF_CONVERSION_UNPREDICTABLE_COST);
+
+  /* If we have a parameter set, use that, otherwise take a guess using
+     BRANCH_COST.  */
+  if (global_options_set.x_param_values[param])
+    return PARAM_VALUE (param);
+  else
+    return BRANCH_COST (true, predictable_p) * COSTS_N_INSNS (3);
+}
+
 #include "gt-targhooks.h"
diff --git a/gcc/targhooks.h b/gcc/targhooks.h
index d6581cf..b7b5ba3 100644
--- a/gcc/targhooks.h
+++ b/gcc/targhooks.h
@@ -255,4 +255,6 @@  extern void default_setup_incoming_vararg_bounds (cumulative_args_t ca ATTRIBUTE
 extern bool default_optab_supported_p (int, machine_mode, machine_mode,
 				       optimization_type);
 
+extern unsigned int default_max_noce_ifcvt_seq_cost (edge);
+
 #endif /* GCC_TARGHOOKS_H */