Add support for in-order addition reduction using SVE FADDA

Message ID	87y3n4x2xf.fsf@linaro.org
State	New
Headers	From: Richard Sandiford <richard.sandiford@linaro.org> To: gcc-patches@gcc.gnu.org Subject: Add support for in-order addition reduction using SVE FADDA Date: Fri, 17 Nov 2017 16:53:00 +0000 Message-ID: <87y3n4x2xf.fsf@linaro.org>

Richard Sandiford Nov. 17, 2017, 4:53 p.m. UTC

This patch adds support for in-order floating-point addition reductions,
which are suitable even in strict IEEE mode.

Previously vect_is_simple_reduction would reject any cases that forbid
reassociation.  The idea is instead to tentatively accept them as
"FOLD_LEFT_REDUCTIONs" and only fail later if there is no target
support for them.  Although this patch only handles the particular
case of plus and minus on floating-point types, there's no reason in
principle why targets couldn't handle other cases.
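
As a rough illustration of why reassociation matters here (this is not
code from the patch; the function names are made up for the example),
the sketch below compares a strict left-to-right sum with a sum that
keeps two partial accumulators, the way a two-lane vector reduction
would.  It assumes IEEE single precision with no excess precision
(FLT_EVAL_METHOD == 0) and a build without -ffast-math.

```c
/* Hypothetical helpers for illustration only; not part of the patch.  */

/* In-order (fold-left) sum: ((init + v[0]) + v[1]) + ...  */
float
fold_left_plus (float init, const float *v, int n)
{
  float acc = init;
  for (int i = 0; i < n; ++i)
    acc = acc + v[i];
  return acc;
}

/* Reassociated sum: even and odd elements are accumulated separately
   and combined at the end, as a two-lane vector reduction would do.  */
float
reassociated_plus (float init, const float *v, int n)
{
  float lane0 = 0.0f;
  float lane1 = 0.0f;
  int i = 0;
  for (; i + 1 < n; i += 2)
    {
      lane0 += v[i];
      lane1 += v[i + 1];
    }
  if (i < n)
    lane0 += v[i];
  return init + (lane0 + lane1);
}
```

For the input { 1e8f, 1.0f, -1e8f, 1.0f } with an initial value of 0,
the in-order version returns 1.0 (the first 1.0 is absorbed by 1e8),
while the reassociated version returns 2.0.  A FOLD_LEFT_REDUCTION has
to give the first answer.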

The vect_force_simple_reduction change makes it simpler for parloops
to read the type of reduction.

Tested on aarch64-linux-gnu (with and without SVE), x86_64-linux-gnu
and powerpc64le-linux-gnu.  OK to install?

Richard


2017-11-17  Richard Sandiford  <richard.sandiford@linaro.org>
	    Alan Hayward  <alan.hayward@arm.com>
	    David Sherwood  <david.sherwood@arm.com>

gcc/
	* tree.def (FOLD_LEFT_PLUS_EXPR): New tree code.
	* doc/generic.texi (FOLD_LEFT_PLUS_EXPR): Document.
	* optabs.def (fold_left_plus_optab): New optab.
	* doc/md.texi (fold_left_plus_@var{m}): Document.
	* doc/sourcebuild.texi (vect_fold_left_plus): Document.
	* cfgexpand.c (expand_debug_expr): Handle FOLD_LEFT_PLUS_EXPR.
	* expr.c (expand_expr_real_2): Likewise.
	* fold-const.c (const_binop): Likewise.
	* optabs-tree.c (optab_for_tree_code): Likewise.
	* tree-cfg.c (verify_gimple_assign_binary): Likewise.
	* tree-inline.c (estimate_operator_cost): Likewise.
	* tree-pretty-print.c (dump_generic_node): Likewise.
	(op_code_prio): Likewise.
	(op_symbol_code): Likewise.
	* tree-vect-stmts.c (vectorizable_operation): Likewise.
	* tree-parloops.c (valid_reduction_p): New function.
	(gather_scalar_reductions): Use it.
	* tree-vectorizer.h (FOLD_LEFT_REDUCTION): New vect_reduction_type.
	(vect_finish_replace_stmt): Declare.
	* tree-vect-loop.c (fold_left_reduction_code): New function.
	(needs_fold_left_reduction_p): New function, split out from...
	(vect_is_simple_reduction): ...here.  Accept reductions that
	forbid reassociation, but give them type FOLD_LEFT_REDUCTION.
	(vect_force_simple_reduction): Also store the reduction type in
	the assignment's STMT_VINFO_REDUC_TYPE.
	(vect_model_reduction_cost): Handle FOLD_LEFT_REDUCTION.
	(merge_with_identity): New function.
	(vectorize_fold_left_reduction): Likewise.
	(vectorizable_reduction): Handle FOLD_LEFT_REDUCTION.  Leave the
	scalar phi in place for it.  Require target support and reject
	cases that would reassociate the operation.  Defer the transform
	phase to vectorize_fold_left_reduction.
	* config/aarch64/aarch64.md (UNSPEC_FADDA): New unspec.
	* config/aarch64/aarch64-sve.md (fold_left_plus_<mode>): New expander.
	(*fold_left_plus_<mode>, *pred_fold_left_plus_<mode>): New insns.

gcc/testsuite/
	* lib/target-supports.exp (check_effective_target_vect_fold_left_plus):
	New proc.
	* gcc.dg/vect/no-fast-math-vect16.c: Expect the test to pass if
	vect_fold_left_plus.
	* gcc.dg/vect/pr79920.c: Expect both loops to be vectorized if
	vect_fold_left_plus.
	* gcc.dg/vect/trapv-vect-reduc-4.c: Expect the first loop to be
	recognized as a reduction and then rejected for lack of target
	support.
	* gcc.dg/vect/vect-reduc-6.c: Expect the loop to be vectorized if
	vect_fold_left_plus.
	* gcc.target/aarch64/sve_reduc_strict_1.c: New test.
	* gcc.target/aarch64/sve_reduc_strict_1_run.c: Likewise.
	* gcc.target/aarch64/sve_reduc_strict_2.c: Likewise.
	* gcc.target/aarch64/sve_reduc_strict_2_run.c: Likewise.
	* gcc.target/aarch64/sve_reduc_strict_3.c: Likewise.
	* gcc.target/aarch64/sve_slp_13.c: Add floating-point types.
	* gfortran.dg/vect/vect-8.f90: Expect 25 loops to be vectorized if
	vect_fold_left_plus.

Richard Biener Nov. 20, 2017, 11:36 a.m. UTC | #1

On Fri, Nov 17, 2017 at 5:53 PM, Richard Sandiford
<richard.sandiford@linaro.org> wrote:
> This patch adds support for in-order floating-point addition reductions,
> which are suitable even in strict IEEE mode.
>
> Previously vect_is_simple_reduction would reject any cases that forbid
> reassociation.  The idea is instead to tentatively accept them as
> "FOLD_LEFT_REDUCTIONs" and only fail later if there is no target
> support for them.  Although this patch only handles the particular
> case of plus and minus on floating-point types, there's no reason in
> principle why targets couldn't handle other cases.
>
> The vect_force_simple_reduction change makes it simpler for parloops
> to read the type of reduction.
>
> Tested on aarch64-linux-gnu (with and without SVE), x86_64-linux-gnu
> and powerpc64le-linux-gnu.  OK to install?

I don't like adding a new tree code for this.  A new internal function
(IFN) looks more suitable to me.

Also, if there is a way to handle this correctly with target support,
I think you could implement a fallback for targets without such support,
which would increase test coverage.  It would basically boil down to
extracting all scalars from the non-reduction operand vector and
performing a series of reduction ops, keeping the reduction PHI scalar.
This would also support any reduction operator.
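
To make the suggested fallback concrete, here is a minimal sketch at the
source level rather than in GIMPLE.  It is only a model of the idea, not
what the vectorizer would emit: the function name, the fixed
vectorization factor of 4 and the use of GCC's generic vector extension
are all assumptions made for the example.  The non-reduction operand
(here a[i] * b[i]) is computed as a vector, while the accumulator stays
scalar and each lane is extracted and added in order.

```c
#include <string.h>

/* GCC/Clang generic vector type holding four floats (assumed VF of 4).  */
typedef float v4sf __attribute__ ((vector_size (16)));

#define VF 4

/* Hypothetical model of the fallback: res = res + a[i] * b[i], in order.  */
float
fold_left_plus_fallback (float res, const float *a, const float *b, int n)
{
  int i = 0;
  for (; i + VF <= n; i += VF)
    {
      v4sf va, vb;
      memcpy (&va, a + i, sizeof va);
      memcpy (&vb, b + i, sizeof vb);

      /* The non-reduction operand is computed as a vector.  */
      v4sf prod = va * vb;

      /* The reduction itself stays scalar: extract each lane and
         apply the scalar operation in the original order.  */
      for (int lane = 0; lane < VF; ++lane)
        res = res + prod[lane];
    }

  /* Scalar epilogue for the remaining elements.  */
  for (; i < n; ++i)
    res = res + a[i] * b[i];
  return res;
}
```

Because only the lane-by-lane step depends on the operator, the same
shape would work for any reduction operation.  Note that a scalar
reference loop may contract the multiply and add into a fused
operation on some targets, so bit-exact comparisons against it need
care.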

Thanks,
Richard.

> Richard
>
>
> 2017-11-17  Richard Sandiford  <richard.sandiford@linaro.org>
>             Alan Hayward  <alan.hayward@arm.com>
>             David Sherwood  <david.sherwood@arm.com>
>
> gcc/
>         * tree.def (FOLD_LEFT_PLUS_EXPR): New tree code.
>         * doc/generic.texi (FOLD_LEFT_PLUS_EXPR): Document.
>         * optabs.def (fold_left_plus_optab): New optab.
>         * doc/md.texi (fold_left_plus_@var{m}): Document.
>         * doc/sourcebuild.texi (vect_fold_left_plus): Document.
>         * cfgexpand.c (expand_debug_expr): Handle FOLD_LEFT_PLUS_EXPR.
>         * expr.c (expand_expr_real_2): Likewise.
>         * fold-const.c (const_binop): Likewise.
>         * optabs-tree.c (optab_for_tree_code): Likewise.
>         * tree-cfg.c (verify_gimple_assign_binary): Likewise.
>         * tree-inline.c (estimate_operator_cost): Likewise.
>         * tree-pretty-print.c (dump_generic_node): Likewise.
>         (op_code_prio): Likewise.
>         (op_symbol_code): Likewise.
>         * tree-vect-stmts.c (vectorizable_operation): Likewise.
>         * tree-parloops.c (valid_reduction_p): New function.
>         (gather_scalar_reductions): Use it.
>         * tree-vectorizer.h (FOLD_LEFT_REDUCTION): New vect_reduction_type.
>         (vect_finish_replace_stmt): Declare.
>         * tree-vect-loop.c (fold_left_reduction_code): New function.
>         (needs_fold_left_reduction_p): New function, split out from...
>         (vect_is_simple_reduction): ...here.  Accept reductions that
>         forbid reassociation, but give them type FOLD_LEFT_REDUCTION.
>         (vect_force_simple_reduction): Also store the reduction type in
>         the assignment's STMT_VINFO_REDUC_TYPE.
>         (vect_model_reduction_cost): Handle FOLD_LEFT_REDUCTION.
>         (merge_with_identity): New function.
>         (vectorize_fold_left_reduction): Likewise.
>         (vectorizable_reduction): Handle FOLD_LEFT_REDUCTION.  Leave the
>         scalar phi in place for it.  Require target support and reject
>         cases that would reassociate the operation.  Defer the transform
>         phase to vectorize_fold_left_reduction.
>         * config/aarch64/aarch64.md (UNSPEC_FADDA): New unspec.
>         * config/aarch64/aarch64-sve.md (fold_left_plus_<mode>): New expander.
>         (*fold_left_plus_<mode>, *pred_fold_left_plus_<mode>): New insns.
>
> gcc/testsuite/
>         * lib/target-supports.exp (check_effective_target_vect_fold_left_plus):
>         New proc.
>         * gcc.dg/vect/no-fast-math-vect16.c: Expect the test to pass if
>         vect_fold_left_plus.
>         * gcc.dg/vect/pr79920.c: Expect both loops to be vectorized if
>         vect_fold_left_plus.
>         * gcc.dg/vect/trapv-vect-reduc-4.c: Expect the first loop to be
>         recognized as a reduction and then rejected for lack of target
>         support.
>         * gcc.dg/vect/vect-reduc-6.c: Expect the loop to be vectorized if
>         vect_fold_left_plus.
>         * gcc.target/aarch64/sve_reduc_strict_1.c: New test.
>         * gcc.target/aarch64/sve_reduc_strict_1_run.c: Likewise.
>         * gcc.target/aarch64/sve_reduc_strict_2.c: Likewise.
>         * gcc.target/aarch64/sve_reduc_strict_2_run.c: Likewise.
>         * gcc.target/aarch64/sve_reduc_strict_3.c: Likewise.
>         * gcc.target/aarch64/sve_slp_13.c: Add floating-point types.
>         * gfortran.dg/vect/vect-8.f90: Expect 25 loops to be vectorized if
>         vect_fold_left_plus.
>
> Index: gcc/tree.def
> ===================================================================
> --- gcc/tree.def        2017-11-17 16:52:07.246852461 +0000
> +++ gcc/tree.def        2017-11-17 16:52:07.631930981 +0000
> @@ -1302,6 +1302,8 @@ DEFTREECODE (REDUC_AND_EXPR, "reduc_and_
>  DEFTREECODE (REDUC_IOR_EXPR, "reduc_ior_expr", tcc_unary, 1)
>  DEFTREECODE (REDUC_XOR_EXPR, "reduc_xor_expr", tcc_unary, 1)
>
> +DEFTREECODE (FOLD_LEFT_PLUS_EXPR, "fold_left_plus_expr", tcc_binary, 2)
> +
>  /* Widening dot-product.
>     The first two arguments are of type t1.
>     The third argument and the result are of type t2, such that t2 is at least
> Index: gcc/doc/generic.texi
> ===================================================================
> --- gcc/doc/generic.texi        2017-11-17 16:52:07.246852461 +0000
> +++ gcc/doc/generic.texi        2017-11-17 16:52:07.620954871 +0000
> @@ -1746,6 +1746,7 @@ a value from @code{enum annot_expr_kind}
>  @tindex REDUC_AND_EXPR
>  @tindex REDUC_IOR_EXPR
>  @tindex REDUC_XOR_EXPR
> +@tindex FOLD_LEFT_PLUS_EXPR
>
>  @table @code
>  @item VEC_DUPLICATE_EXPR
> @@ -1861,6 +1862,12 @@ the maximum element in @var{x}.  The ass
>  is unspecified; for example, @samp{REDUC_PLUS_EXPR <@var{x}>} could
>  sum floating-point @var{x} in forward order, in reverse order,
>  using a tree, or in some other way.
> +
> +@item FOLD_LEFT_PLUS_EXPR
> +This node takes two arguments: a scalar of type @var{t} and a vector
> +of @var{t}s.  It successively adds each element of the vector to the
> +scalar and returns the result.  The operation is strictly in-order:
> +there is no reassociation.
>  @end table
>
>
> Index: gcc/optabs.def
> ===================================================================
> --- gcc/optabs.def      2017-11-17 16:52:07.246852461 +0000
> +++ gcc/optabs.def      2017-11-17 16:52:07.625528250 +0000
> @@ -306,6 +306,7 @@ OPTAB_D (reduc_umin_scal_optab, "reduc_u
>  OPTAB_D (reduc_and_scal_optab,  "reduc_and_scal_$a")
>  OPTAB_D (reduc_ior_scal_optab,  "reduc_ior_scal_$a")
>  OPTAB_D (reduc_xor_scal_optab,  "reduc_xor_scal_$a")
> +OPTAB_D (fold_left_plus_optab, "fold_left_plus_$a")
>
>  OPTAB_D (extract_last_optab, "extract_last_$a")
>  OPTAB_D (fold_extract_last_optab, "fold_extract_last_$a")
> Index: gcc/doc/md.texi
> ===================================================================
> --- gcc/doc/md.texi     2017-11-17 16:52:07.246852461 +0000
> +++ gcc/doc/md.texi     2017-11-17 16:52:07.621869547 +0000
> @@ -5285,6 +5285,14 @@ has mode @var{m} and operands 0 and 1 ha
>  one element of @var{m}.  Operand 2 has the usual mask mode for vectors
>  of mode @var{m}; see @code{TARGET_VECTORIZE_GET_MASK_MODE}.
>
> +@cindex @code{fold_left_plus_@var{m}} instruction pattern
> +@item @code{fold_left_plus_@var{m}}
> +Take scalar operand 1 and successively add each element from vector
> +operand 2.  Store the result in scalar operand 0.  The vector has
> +mode @var{m} and the scalars have the mode appropriate for one
> +element of @var{m}.  The operation is strictly in-order: there is
> +no reassociation.
> +
>  @cindex @code{sdot_prod@var{m}} instruction pattern
>  @item @samp{sdot_prod@var{m}}
>  @cindex @code{udot_prod@var{m}} instruction pattern
> Index: gcc/doc/sourcebuild.texi
> ===================================================================
> --- gcc/doc/sourcebuild.texi    2017-11-17 16:52:07.246852461 +0000
> +++ gcc/doc/sourcebuild.texi    2017-11-17 16:52:07.621869547 +0000
> @@ -1580,6 +1580,9 @@ Target supports AND, IOR and XOR reducti
>
>  @item vect_fold_extract_last
>  Target supports the @code{fold_extract_last} optab.
> +
> +@item vect_fold_left_plus
> +Target supports the @code{fold_left_plus} optab.
>  @end table
>
>  @subsubsection Thread Local Storage attributes
> Index: gcc/cfgexpand.c
> ===================================================================
> --- gcc/cfgexpand.c     2017-11-17 16:52:07.246852461 +0000
> +++ gcc/cfgexpand.c     2017-11-17 16:52:07.620040195 +0000
> @@ -5072,6 +5072,7 @@ expand_debug_expr (tree exp)
>      case REDUC_AND_EXPR:
>      case REDUC_IOR_EXPR:
>      case REDUC_XOR_EXPR:
> +    case FOLD_LEFT_PLUS_EXPR:
>      case VEC_COND_EXPR:
>      case VEC_PACK_FIX_TRUNC_EXPR:
>      case VEC_PACK_SAT_EXPR:
> Index: gcc/expr.c
> ===================================================================
> --- gcc/expr.c  2017-11-17 16:52:07.246852461 +0000
> +++ gcc/expr.c  2017-11-17 16:52:07.622784222 +0000
> @@ -9438,6 +9438,28 @@ #define REDUCE_BIT_FIELD(expr)   (reduce_b
>          return target;
>        }
>
> +    case FOLD_LEFT_PLUS_EXPR:
> +      {
> +       op0 = expand_normal (treeop0);
> +       op1 = expand_normal (treeop1);
> +       this_optab = optab_for_tree_code (code, type, optab_default);
> +       machine_mode vec_mode = TYPE_MODE (TREE_TYPE (treeop1));
> +       insn_code icode = optab_handler (this_optab, vec_mode);
> +
> +       if (icode != CODE_FOR_nothing)
> +         {
> +           struct expand_operand ops[3];
> +           create_output_operand (&ops[0], target, mode);
> +           create_input_operand (&ops[1], op0, mode);
> +           create_input_operand (&ops[2], op1, vec_mode);
> +           if (maybe_expand_insn (icode, 3, ops))
> +             return ops[0].value;
> +         }
> +
> +       /* Nothing to fall back to.  */
> +       gcc_unreachable ();
> +      }
> +
>      case REDUC_MAX_EXPR:
>      case REDUC_MIN_EXPR:
>      case REDUC_PLUS_EXPR:
> Index: gcc/fold-const.c
> ===================================================================
> --- gcc/fold-const.c    2017-11-17 16:52:07.246852461 +0000
> +++ gcc/fold-const.c    2017-11-17 16:52:07.623698898 +0000
> @@ -1603,6 +1603,32 @@ const_binop (enum tree_code code, tree a
>         return NULL_TREE;
>        return build_vector_from_val (TREE_TYPE (arg1), sub);
>      }
> +
> +  if (CONSTANT_CLASS_P (arg1)
> +      && TREE_CODE (arg2) == VECTOR_CST)
> +    {
> +      tree_code subcode;
> +
> +      switch (code)
> +       {
> +       case FOLD_LEFT_PLUS_EXPR:
> +         subcode = PLUS_EXPR;
> +         break;
> +       default:
> +         return NULL_TREE;
> +       }
> +
> +      int nelts = VECTOR_CST_NELTS (arg2);
> +      tree accum = arg1;
> +      for (int i = 0; i < nelts; i++)
> +       {
> +         accum = const_binop (subcode, accum, VECTOR_CST_ELT (arg2, i));
> +         if (accum == NULL_TREE || !CONSTANT_CLASS_P (accum))
> +           return NULL_TREE;
> +       }
> +
> +      return accum;
> +    }
>    return NULL_TREE;
>  }
>
> Index: gcc/optabs-tree.c
> ===================================================================
> --- gcc/optabs-tree.c   2017-11-17 16:52:07.246852461 +0000
> +++ gcc/optabs-tree.c   2017-11-17 16:52:07.623698898 +0000
> @@ -166,6 +166,9 @@ optab_for_tree_code (enum tree_code code
>      case REDUC_XOR_EXPR:
>        return reduc_xor_scal_optab;
>
> +    case FOLD_LEFT_PLUS_EXPR:
> +      return fold_left_plus_optab;
> +
>      case VEC_WIDEN_MULT_HI_EXPR:
>        return TYPE_UNSIGNED (type) ?
>         vec_widen_umult_hi_optab : vec_widen_smult_hi_optab;
> Index: gcc/tree-cfg.c
> ===================================================================
> --- gcc/tree-cfg.c      2017-11-17 16:52:07.246852461 +0000
> +++ gcc/tree-cfg.c      2017-11-17 16:52:07.628272277 +0000
> @@ -4116,6 +4116,19 @@ verify_gimple_assign_binary (gassign *st
>        /* Continue with generic binary expression handling.  */
>        break;
>
> +    case FOLD_LEFT_PLUS_EXPR:
> +      if (!VECTOR_TYPE_P (rhs2_type)
> +         || !useless_type_conversion_p (lhs_type, TREE_TYPE (rhs2_type))
> +         || !useless_type_conversion_p (lhs_type, rhs1_type))
> +       {
> +         error ("reduction should convert from vector to element type");
> +         debug_generic_expr (lhs_type);
> +         debug_generic_expr (rhs1_type);
> +         debug_generic_expr (rhs2_type);
> +         return true;
> +       }
> +      return false;
> +
>      case VEC_SERIES_EXPR:
>        if (!useless_type_conversion_p (rhs1_type, rhs2_type))
>         {
> Index: gcc/tree-inline.c
> ===================================================================
> --- gcc/tree-inline.c   2017-11-17 16:52:07.246852461 +0000
> +++ gcc/tree-inline.c   2017-11-17 16:52:07.628272277 +0000
> @@ -3881,6 +3881,7 @@ estimate_operator_cost (enum tree_code c
>      case REDUC_AND_EXPR:
>      case REDUC_IOR_EXPR:
>      case REDUC_XOR_EXPR:
> +    case FOLD_LEFT_PLUS_EXPR:
>      case WIDEN_SUM_EXPR:
>      case WIDEN_MULT_EXPR:
>      case DOT_PROD_EXPR:
> Index: gcc/tree-pretty-print.c
> ===================================================================
> --- gcc/tree-pretty-print.c     2017-11-17 16:52:07.246852461 +0000
> +++ gcc/tree-pretty-print.c     2017-11-17 16:52:07.629186953 +0000
> @@ -3232,6 +3232,7 @@ dump_generic_node (pretty_printer *pp, t
>        break;
>
>      case VEC_SERIES_EXPR:
> +    case FOLD_LEFT_PLUS_EXPR:
>      case VEC_WIDEN_MULT_HI_EXPR:
>      case VEC_WIDEN_MULT_LO_EXPR:
>      case VEC_WIDEN_MULT_EVEN_EXPR:
> @@ -3628,6 +3629,7 @@ op_code_prio (enum tree_code code)
>      case REDUC_MAX_EXPR:
>      case REDUC_MIN_EXPR:
>      case REDUC_PLUS_EXPR:
> +    case FOLD_LEFT_PLUS_EXPR:
>      case VEC_UNPACK_HI_EXPR:
>      case VEC_UNPACK_LO_EXPR:
>      case VEC_UNPACK_FLOAT_HI_EXPR:
> @@ -3749,6 +3751,9 @@ op_symbol_code (enum tree_code code)
>      case REDUC_PLUS_EXPR:
>        return "r+";
>
> +    case FOLD_LEFT_PLUS_EXPR:
> +      return "fl+";
> +
>      case WIDEN_SUM_EXPR:
>        return "w+";
>
> Index: gcc/tree-vect-stmts.c
> ===================================================================
> --- gcc/tree-vect-stmts.c       2017-11-17 16:52:07.246852461 +0000
> +++ gcc/tree-vect-stmts.c       2017-11-17 16:52:07.631016305 +0000
> @@ -5415,6 +5415,10 @@ vectorizable_operation (gimple *stmt, gi
>
>    code = gimple_assign_rhs_code (stmt);
>
> +  /* Ignore operations that mix scalar and vector input operands.  */
> +  if (code == FOLD_LEFT_PLUS_EXPR)
> +    return false;
> +
>    /* For pointer addition, we should use the normal plus for
>       the vector addition.  */
>    if (code == POINTER_PLUS_EXPR)
> Index: gcc/tree-parloops.c
> ===================================================================
> --- gcc/tree-parloops.c 2017-11-17 16:52:07.246852461 +0000
> +++ gcc/tree-parloops.c 2017-11-17 16:52:07.629186953 +0000
> @@ -2531,6 +2531,19 @@ set_reduc_phi_uids (reduction_info **slo
>    return 1;
>  }
>
> +/* Return true if the type of reduction performed by STMT is suitable
> +   for this pass.  */
> +
> +static bool
> +valid_reduction_p (gimple *stmt)
> +{
> +  /* Parallelization would reassociate the operation, which isn't
> +     allowed for in-order reductions.  */
> +  stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
> +  vect_reduction_type reduc_type = STMT_VINFO_REDUC_TYPE (stmt_info);
> +  return reduc_type != FOLD_LEFT_REDUCTION;
> +}
> +
>  /* Detect all reductions in the LOOP, insert them into REDUCTION_LIST.  */
>
>  static void
> @@ -2564,7 +2577,7 @@ gather_scalar_reductions (loop_p loop, r
>        gimple *reduc_stmt
>         = vect_force_simple_reduction (simple_loop_info, phi,
>                                        &double_reduc, true);
> -      if (!reduc_stmt)
> +      if (!reduc_stmt || !valid_reduction_p (reduc_stmt))
>         continue;
>
>        if (double_reduc)
> @@ -2610,7 +2623,8 @@ gather_scalar_reductions (loop_p loop, r
>                 = vect_force_simple_reduction (simple_loop_info, inner_phi,
>                                                &double_reduc, true);
>               gcc_assert (!double_reduc);
> -             if (inner_reduc_stmt == NULL)
> +             if (inner_reduc_stmt == NULL
> +                 || !valid_reduction_p (inner_reduc_stmt))
>                 continue;
>
>               build_new_reduction (reduction_list, double_reduc_stmts[i], phi);
> Index: gcc/tree-vectorizer.h
> ===================================================================
> --- gcc/tree-vectorizer.h       2017-11-17 16:52:07.246852461 +0000
> +++ gcc/tree-vectorizer.h       2017-11-17 16:52:07.631016305 +0000
> @@ -74,7 +74,15 @@ enum vect_reduction_type {
>
>         for (int i = 0; i < VF; ++i)
>           res = cond[i] ? val[i] : res;  */
> -  EXTRACT_LAST_REDUCTION
> +  EXTRACT_LAST_REDUCTION,
> +
> +  /* Use a folding reduction within the loop to implement:
> +
> +       for (int i = 0; i < VF; ++i)
> +         res = res OP val[i];
> +
> +     (with no reassociation).  */
> +  FOLD_LEFT_REDUCTION
>  };
>
>  #define VECTORIZABLE_CYCLE_DEF(D) (((D) == vect_reduction_def)           \
> @@ -1389,6 +1397,7 @@ extern void vect_model_load_cost (stmt_v
>  extern unsigned record_stmt_cost (stmt_vector_for_cost *, int,
>                                   enum vect_cost_for_stmt, stmt_vec_info,
>                                   int, enum vect_cost_model_location);
> +extern void vect_finish_replace_stmt (gimple *, gimple *);
>  extern void vect_finish_stmt_generation (gimple *, gimple *,
>                                           gimple_stmt_iterator *);
>  extern bool vect_mark_stmts_to_be_vectorized (loop_vec_info);
> Index: gcc/tree-vect-loop.c
> ===================================================================
> --- gcc/tree-vect-loop.c        2017-11-17 16:52:07.246852461 +0000
> +++ gcc/tree-vect-loop.c        2017-11-17 16:52:07.630101629 +0000
> @@ -2573,6 +2573,29 @@ vect_analyze_loop (struct loop *loop, lo
>      }
>  }
>
> +/* Return true if the target supports in-order reductions for operation
> +   CODE and type TYPE.  If the target supports it, store the reduction
> +   operation in *REDUC_CODE.  */
> +
> +static bool
> +fold_left_reduction_code (tree_code code, tree type, tree_code *reduc_code)
> +{
> +  switch (code)
> +    {
> +    case PLUS_EXPR:
> +      code = FOLD_LEFT_PLUS_EXPR;
> +      break;
> +
> +    default:
> +      return false;
> +    }
> +
> +  if (!target_supports_op_p (type, code, optab_vector))
> +    return false;
> +
> +  *reduc_code = code;
> +  return true;
> +}
>
>  /* Function reduction_code_for_scalar_code
>
> @@ -2880,6 +2903,42 @@ vect_is_slp_reduction (loop_vec_info loo
>    return true;
>  }
>
> +/* Returns true if we need an in-order reduction for operation CODE
> +   on type TYPE.  NEED_WRAPPING_INTEGRAL_OVERFLOW is true if integer
> +   overflow must wrap.  */
> +
> +static bool
> +needs_fold_left_reduction_p (tree type, tree_code code,
> +                            bool need_wrapping_integral_overflow)
> +{
> +  /* CHECKME: check for !flag_finite_math_only too?  */
> +  if (SCALAR_FLOAT_TYPE_P (type))
> +    switch (code)
> +      {
> +      case MIN_EXPR:
> +      case MAX_EXPR:
> +       return false;
> +
> +      default:
> +       return !flag_associative_math;
> +      }
> +
> +  if (INTEGRAL_TYPE_P (type))
> +    {
> +      if (!operation_no_trapping_overflow (type, code))
> +       return true;
> +      if (need_wrapping_integral_overflow
> +         && !TYPE_OVERFLOW_WRAPS (type)
> +         && operation_can_overflow (code))
> +       return true;
> +      return false;
> +    }
> +
> +  if (SAT_FIXED_POINT_TYPE_P (type))
> +    return true;
> +
> +  return false;
> +}
>
>  /* Function vect_is_simple_reduction
>
> @@ -3198,58 +3257,18 @@ vect_is_simple_reduction (loop_vec_info
>        return NULL;
>      }
>
> -  /* Check that it's ok to change the order of the computation.
> +  /* Check whether it's ok to change the order of the computation.
>       Generally, when vectorizing a reduction we change the order of the
>       computation.  This may change the behavior of the program in some
>       cases, so we need to check that this is ok.  One exception is when
>       vectorizing an outer-loop: the inner-loop is executed sequentially,
>       and therefore vectorizing reductions in the inner-loop during
>       outer-loop vectorization is safe.  */
> -
> -  if (*v_reduc_type != COND_REDUCTION
> -      && check_reduction)
> -    {
> -      /* CHECKME: check for !flag_finite_math_only too?  */
> -      if (SCALAR_FLOAT_TYPE_P (type) && !flag_associative_math)
> -       {
> -         /* Changing the order of operations changes the semantics.  */
> -         if (dump_enabled_p ())
> -           report_vect_op (MSG_MISSED_OPTIMIZATION, def_stmt,
> -                       "reduction: unsafe fp math optimization: ");
> -         return NULL;
> -       }
> -      else if (INTEGRAL_TYPE_P (type))
> -       {
> -         if (!operation_no_trapping_overflow (type, code))
> -           {
> -             /* Changing the order of operations changes the semantics.  */
> -             if (dump_enabled_p ())
> -               report_vect_op (MSG_MISSED_OPTIMIZATION, def_stmt,
> -                               "reduction: unsafe int math optimization"
> -                               " (overflow traps): ");
> -             return NULL;
> -           }
> -         if (need_wrapping_integral_overflow
> -             && !TYPE_OVERFLOW_WRAPS (type)
> -             && operation_can_overflow (code))
> -           {
> -             /* Changing the order of operations changes the semantics.  */
> -             if (dump_enabled_p ())
> -               report_vect_op (MSG_MISSED_OPTIMIZATION, def_stmt,
> -                               "reduction: unsafe int math optimization"
> -                               " (overflow doesn't wrap): ");
> -             return NULL;
> -           }
> -       }
> -      else if (SAT_FIXED_POINT_TYPE_P (type))
> -       {
> -         /* Changing the order of operations changes the semantics.  */
> -         if (dump_enabled_p ())
> -         report_vect_op (MSG_MISSED_OPTIMIZATION, def_stmt,
> -                         "reduction: unsafe fixed-point math optimization: ");
> -         return NULL;
> -       }
> -    }
> +  if (check_reduction
> +      && *v_reduc_type == TREE_CODE_REDUCTION
> +      && needs_fold_left_reduction_p (type, code,
> +                                     need_wrapping_integral_overflow))
> +    *v_reduc_type = FOLD_LEFT_REDUCTION;
>
>    /* Reduction is safe. We're dealing with one of the following:
>       1) integer arithmetic and no trapv
> @@ -3513,6 +3532,7 @@ vect_force_simple_reduction (loop_vec_in
>        STMT_VINFO_REDUC_TYPE (reduc_def_info) = v_reduc_type;
>        STMT_VINFO_REDUC_DEF (reduc_def_info) = def;
>        reduc_def_info = vinfo_for_stmt (def);
> +      STMT_VINFO_REDUC_TYPE (reduc_def_info) = v_reduc_type;
>        STMT_VINFO_REDUC_DEF (reduc_def_info) = phi;
>      }
>    return def;
> @@ -4065,7 +4085,8 @@ vect_model_reduction_cost (stmt_vec_info
>
>    code = gimple_assign_rhs_code (orig_stmt);
>
> -  if (reduction_type == EXTRACT_LAST_REDUCTION)
> +  if (reduction_type == EXTRACT_LAST_REDUCTION
> +      || reduction_type == FOLD_LEFT_REDUCTION)
>      {
>        /* No extra instructions needed in the prologue.  */
>        prologue_cost = 0;
> @@ -4138,7 +4159,8 @@ vect_model_reduction_cost (stmt_vec_info
>                                           scalar_stmt, stmt_info, 0,
>                                           vect_epilogue);
>         }
> -      else if (reduction_type == EXTRACT_LAST_REDUCTION)
> +      else if (reduction_type == EXTRACT_LAST_REDUCTION
> +              || reduction_type == FOLD_LEFT_REDUCTION)
>         /* No extra instructions need in the epilogue.  */
>         ;
>        else
> @@ -5884,6 +5906,155 @@ vect_create_epilog_for_reduction (vec<tr
>      }
>  }
>
> +/* Return a vector of type VECTYPE that is equal to the vector select
> +   operation "MASK ? VEC : IDENTITY".  Insert the select statements
> +   before GSI.  */
> +
> +static tree
> +merge_with_identity (gimple_stmt_iterator *gsi, tree mask, tree vectype,
> +                    tree vec, tree identity)
> +{
> +  tree cond = make_temp_ssa_name (vectype, NULL, "cond");
> +  gimple *new_stmt = gimple_build_assign (cond, VEC_COND_EXPR,
> +                                         mask, vec, identity);
> +  gsi_insert_before (gsi, new_stmt, GSI_SAME_STMT);
> +  return cond;
> +}
> +
> +/* Perform an in-order reduction (FOLD_LEFT_REDUCTION).  STMT is the
> +   statement that sets the live-out value.  REDUC_DEF_STMT is the phi
> +   statement.  CODE is the operation performed by STMT and OPS are
> +   its scalar operands.  REDUC_INDEX is the index of the operand in
> +   OPS that is set by REDUC_DEF_STMT.  REDUC_CODE is the code that
> +   implements in-order reduction and VECTYPE_IN is the type of its
> +   vector input.  MASKS specifies the masks that should be used to
> +   control the operation in a fully-masked loop.  */
> +
> +static bool
> +vectorize_fold_left_reduction (gimple *stmt, gimple_stmt_iterator *gsi,
> +                              gimple **vec_stmt, slp_tree slp_node,
> +                              gimple *reduc_def_stmt,
> +                              tree_code code, tree_code reduc_code,
> +                              tree ops[3], tree vectype_in,
> +                              int reduc_index, vec_loop_masks *masks)
> +{
> +  stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
> +  loop_vec_info loop_vinfo = STMT_VINFO_LOOP_VINFO (stmt_info);
> +  struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> +  tree vectype_out = STMT_VINFO_VECTYPE (stmt_info);
> +  gimple *new_stmt = NULL;
> +
> +  int ncopies;
> +  if (slp_node)
> +    ncopies = 1;
> +  else
> +    ncopies = vect_get_num_copies (loop_vinfo, vectype_in);
> +
> +  gcc_assert (!nested_in_vect_loop_p (loop, stmt));
> +  gcc_assert (ncopies == 1);
> +  gcc_assert (TREE_CODE_LENGTH (code) == binary_op);
> +  gcc_assert (reduc_index == (code == MINUS_EXPR ? 0 : 1));
> +  gcc_assert (STMT_VINFO_VEC_REDUCTION_TYPE (stmt_info)
> +             == FOLD_LEFT_REDUCTION);
> +
> +  if (slp_node)
> +    gcc_assert (must_eq (TYPE_VECTOR_SUBPARTS (vectype_out),
> +                        TYPE_VECTOR_SUBPARTS (vectype_in)));
> +
> +  tree op0 = ops[1 - reduc_index];
> +
> +  int group_size = 1;
> +  gimple *scalar_dest_def;
> +  auto_vec<tree> vec_oprnds0;
> +  if (slp_node)
> +    {
> +      vect_get_vec_defs (op0, NULL_TREE, stmt, &vec_oprnds0, NULL, slp_node);
> +      group_size = SLP_TREE_SCALAR_STMTS (slp_node).length ();
> +      scalar_dest_def = SLP_TREE_SCALAR_STMTS (slp_node)[group_size - 1];
> +    }
> +  else
> +    {
> +      tree loop_vec_def0 = vect_get_vec_def_for_operand (op0, stmt);
> +      vec_oprnds0.create (1);
> +      vec_oprnds0.quick_push (loop_vec_def0);
> +      scalar_dest_def = stmt;
> +    }
> +
> +  tree scalar_dest = gimple_assign_lhs (scalar_dest_def);
> +  tree scalar_type = TREE_TYPE (scalar_dest);
> +  tree reduc_var = gimple_phi_result (reduc_def_stmt);
> +
> +  int vec_num = vec_oprnds0.length ();
> +  gcc_assert (vec_num == 1 || slp_node);
> +  tree vec_elem_type = TREE_TYPE (vectype_out);
> +  gcc_checking_assert (useless_type_conversion_p (scalar_type, vec_elem_type));
> +
> +  tree vector_identity = NULL_TREE;
> +  if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
> +    vector_identity = build_zero_cst (vectype_out);
> +
> +  int i;
> +  tree def0;
> +  FOR_EACH_VEC_ELT (vec_oprnds0, i, def0)
> +    {
> +      tree mask = NULL_TREE;
> +      if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
> +       mask = vect_get_loop_mask (gsi, masks, vec_num, vectype_in, i);
> +
> +      /* Handle MINUS by adding the negative.  */
> +      if (code == MINUS_EXPR)

> +       {

> +         tree negated = make_ssa_name (vectype_out);

> +         new_stmt = gimple_build_assign (negated, NEGATE_EXPR, def0);

> +         gsi_insert_before (gsi, new_stmt, GSI_SAME_STMT);

> +         def0 = negated;

> +       }

> +

> +      if (mask)

> +       def0 = merge_with_identity (gsi, mask, vectype_out, def0,

> +                                   vector_identity);

> +

> +      /* On the first iteration the input is simply the scalar phi

> +        result, and for subsequent iterations it is the output of

> +        the preceding operation.  */

> +      tree expr = build2 (reduc_code, scalar_type, reduc_var, def0);

> +

> +      /* For chained SLP reductions the output of the previous reduction

> +        operation serves as the input of the next. For the final statement

> +        the output cannot be a temporary - we reuse the original

> +        scalar destination of the last statement.  */

> +      if (i == vec_num - 1)

> +       reduc_var = scalar_dest;

> +      else

> +       reduc_var = vect_create_destination_var (scalar_dest, NULL);

> +      new_stmt = gimple_build_assign (reduc_var, expr);

> +

> +      if (i == vec_num - 1)

> +       {

> +         SSA_NAME_DEF_STMT (reduc_var) = new_stmt;

> +         /* For chained SLP stmt is the first statement in the group and

> +            gsi points to the last statement in the group.  For non SLP stmt

> +            points to the same location as gsi. In either case tmp_gsi and gsi

> +            should both point to the same insertion point.  */

> +         gcc_assert (scalar_dest_def == gsi_stmt (*gsi));

> +         vect_finish_replace_stmt (scalar_dest_def, new_stmt);

> +       }

> +      else

> +       {

> +         reduc_var = make_ssa_name (reduc_var, new_stmt);

> +         gimple_assign_set_lhs (new_stmt, reduc_var);

> +         vect_finish_stmt_generation (stmt, new_stmt, gsi);

> +       }

> +

> +      if (slp_node)

> +       SLP_TREE_VEC_STMTS (slp_node).quick_push (new_stmt);

> +    }

> +

> +  if (!slp_node)

> +    STMT_VINFO_VEC_STMT (stmt_info) = *vec_stmt = new_stmt;

> +

> +  return true;

> +}

>

>  /* Function is_nonwrapping_integer_induction.

>

> @@ -6063,6 +6234,12 @@ vectorizable_reduction (gimple *stmt, gi

>           return true;

>         }

>

> +      if (STMT_VINFO_REDUC_TYPE (stmt_info) == FOLD_LEFT_REDUCTION)

> +       /* Leave the scalar phi in place.  Note that checking

> +          STMT_VINFO_VEC_REDUCTION_TYPE (as below) only works

> +          for reductions involving a single statement.  */

> +       return true;

> +

>        gimple *reduc_stmt = STMT_VINFO_REDUC_DEF (stmt_info);

>        if (STMT_VINFO_IN_PATTERN_P (vinfo_for_stmt (reduc_stmt)))

>         reduc_stmt = STMT_VINFO_RELATED_STMT (vinfo_for_stmt (reduc_stmt));

> @@ -6289,6 +6466,14 @@ vectorizable_reduction (gimple *stmt, gi

>       directy used in stmt.  */

>    if (reduc_index == -1)

>      {

> +      if (STMT_VINFO_REDUC_TYPE (stmt_info) == FOLD_LEFT_REDUCTION)

> +       {

> +         if (dump_enabled_p ())

> +           dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,

> +                            "in-order reduction chain without SLP.\n");

> +         return false;

> +       }

> +

>        if (orig_stmt)

>         reduc_def_stmt = STMT_VINFO_REDUC_DEF (orig_stmt_info);

>        else

> @@ -6508,7 +6693,9 @@ vectorizable_reduction (gimple *stmt, gi

>

>    vect_reduction_type reduction_type

>      = STMT_VINFO_VEC_REDUCTION_TYPE (stmt_info);

> -  if (orig_stmt && reduction_type == TREE_CODE_REDUCTION)

> +  if (orig_stmt

> +      && (reduction_type == TREE_CODE_REDUCTION

> +         || reduction_type == FOLD_LEFT_REDUCTION))

>      {

>        /* This is a reduction pattern: get the vectype from the type of the

>           reduction variable, and get the tree-code from orig_stmt.  */

> @@ -6555,13 +6742,22 @@ vectorizable_reduction (gimple *stmt, gi

>    epilog_reduc_code = ERROR_MARK;

>

>    if (reduction_type == TREE_CODE_REDUCTION

> +      || reduction_type == FOLD_LEFT_REDUCTION

>        || reduction_type == INTEGER_INDUC_COND_REDUCTION

>        || reduction_type == CONST_COND_REDUCTION)

>      {

> -      if (reduction_code_for_scalar_code (orig_code, &epilog_reduc_code))

> +      bool have_reduc_support;

> +      if (reduction_type == FOLD_LEFT_REDUCTION)

> +       have_reduc_support = fold_left_reduction_code (orig_code, vectype_out,

> +                                                      &epilog_reduc_code);

> +      else

> +       have_reduc_support

> +         = reduction_code_for_scalar_code (orig_code, &epilog_reduc_code);

> +

> +      if (have_reduc_support)

>         {

>           reduc_optab = optab_for_tree_code (epilog_reduc_code, vectype_out,

> -                                         optab_default);

> +                                            optab_default);

>           if (!reduc_optab)

>             {

>               if (dump_enabled_p ())

> @@ -6687,6 +6883,41 @@ vectorizable_reduction (gimple *stmt, gi

>         }

>      }

>

> +  if (double_reduc && reduction_type == FOLD_LEFT_REDUCTION)

> +    {

> +      /* We can't support in-order reductions of code such as this:

> +

> +          for (int i = 0; i < n1; ++i)

> +            for (int j = 0; j < n2; ++j)

> +              l += a[j];

> +

> +        since GCC effectively transforms the loop when vectorizing:

> +

> +          for (int i = 0; i < n1 / VF; ++i)

> +            for (int j = 0; j < n2; ++j)

> +              for (int k = 0; k < VF; ++k)

> +                l += a[j];

> +

> +        which is a reassociation of the original operation.  */

> +      if (dump_enabled_p ())

> +       dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,

> +                        "in-order double reduction not supported.\n");

> +

> +      return false;

> +    }

> +

> +  if (reduction_type == FOLD_LEFT_REDUCTION

> +      && slp_node

> +      && !GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt)))

> +    {

> +      /* We cannot use in-order reductions in this case because there is

> +         an implicit reassociation of the operations involved.  */

> +      if (dump_enabled_p ())

> +        dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,

> +                        "in-order unchained SLP reductions not supported.\n");

> +      return false;

> +    }

> +

>    /* In case of widenning multiplication by a constant, we update the type

>       of the constant to be the type of the other operand.  We check that the

>       constant fits the type in the pattern recognition pass.  */

> @@ -6807,9 +7038,10 @@ vectorizable_reduction (gimple *stmt, gi

>         vect_model_reduction_cost (stmt_info, epilog_reduc_code, ncopies);

>        if (loop_vinfo && LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo))

>         {

> -         if (cond_fn == IFN_LAST

> -             || !direct_internal_fn_supported_p (cond_fn, vectype_in,

> -                                                 OPTIMIZE_FOR_SPEED))

> +         if (reduction_type != FOLD_LEFT_REDUCTION

> +             && (cond_fn == IFN_LAST

> +                 || !direct_internal_fn_supported_p (cond_fn, vectype_in,

> +                                                     OPTIMIZE_FOR_SPEED)))

>             {

>               if (dump_enabled_p ())

>                 dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,

> @@ -6844,6 +7076,11 @@ vectorizable_reduction (gimple *stmt, gi

>

>    bool masked_loop_p = LOOP_VINFO_FULLY_MASKED_P (loop_vinfo);

>

> +  if (reduction_type == FOLD_LEFT_REDUCTION)

> +    return vectorize_fold_left_reduction

> +      (stmt, gsi, vec_stmt, slp_node, reduc_def_stmt, code,

> +       epilog_reduc_code, ops, vectype_in, reduc_index, masks);

> +

>    if (reduction_type == EXTRACT_LAST_REDUCTION)

>      {

>        gcc_assert (!slp_node);

> Index: gcc/config/aarch64/aarch64.md

> ===================================================================

> --- gcc/config/aarch64/aarch64.md       2017-11-17 16:52:07.246852461 +0000

> +++ gcc/config/aarch64/aarch64.md       2017-11-17 16:52:07.620954871 +0000

> @@ -164,6 +164,7 @@ (define_c_enum "unspec" [

>      UNSPEC_STN

>      UNSPEC_INSR

>      UNSPEC_CLASTB

> +    UNSPEC_FADDA

>  ])

>

>  (define_c_enum "unspecv" [

> Index: gcc/config/aarch64/aarch64-sve.md

> ===================================================================

> --- gcc/config/aarch64/aarch64-sve.md   2017-11-17 16:52:07.246852461 +0000

> +++ gcc/config/aarch64/aarch64-sve.md   2017-11-17 16:52:07.620040195 +0000

> @@ -1574,6 +1574,45 @@ (define_insn "*reduc_<optab>_scal_<mode>

>    "<bit_reduc_op>\t%<Vetype>0, %1, %2.<Vetype>"

>  )

>

> +;; Unpredicated in-order FP reductions.

> +(define_expand "fold_left_plus_<mode>"

> +  [(set (match_operand:<VEL> 0 "register_operand")

> +       (unspec:<VEL> [(match_dup 3)

> +                      (match_operand:<VEL> 1 "register_operand")

> +                      (match_operand:SVE_F 2 "register_operand")]

> +                     UNSPEC_FADDA))]

> +  "TARGET_SVE"

> +  {

> +    operands[3] = force_reg (<VPRED>mode, CONSTM1_RTX (<VPRED>mode));

> +  }

> +)

> +

> +;; In-order FP reductions predicated with PTRUE.

> +(define_insn "*fold_left_plus_<mode>"

> +  [(set (match_operand:<VEL> 0 "register_operand" "=w")

> +       (unspec:<VEL> [(match_operand:<VPRED> 1 "register_operand" "Upl")

> +                      (match_operand:<VEL> 2 "register_operand" "0")

> +                      (match_operand:SVE_F 3 "register_operand" "w")]

> +                     UNSPEC_FADDA))]

> +  "TARGET_SVE"

> +  "fadda\t%<Vetype>0, %1, %<Vetype>0, %3.<Vetype>"

> +)

> +

> +;; Predicated form of the above in-order reduction.

> +(define_insn "*pred_fold_left_plus_<mode>"

> +  [(set (match_operand:<VEL> 0 "register_operand" "=w")

> +       (unspec:<VEL>

> +         [(match_operand:<VEL> 1 "register_operand" "0")

> +          (unspec:SVE_F

> +            [(match_operand:<VPRED> 2 "register_operand" "Upl")

> +             (match_operand:SVE_F 3 "register_operand" "w")

> +             (match_operand:SVE_F 4 "aarch64_simd_imm_zero")]

> +            UNSPEC_SEL)]

> +         UNSPEC_FADDA))]

> +  "TARGET_SVE"

> +  "fadda\t%<Vetype>0, %2, %<Vetype>0, %3.<Vetype>"

> +)

> +

>  ;; Unpredicated floating-point addition.

>  (define_expand "add<mode>3"

>    [(set (match_operand:SVE_F 0 "register_operand")

> Index: gcc/testsuite/lib/target-supports.exp

> ===================================================================

> --- gcc/testsuite/lib/target-supports.exp       2017-11-17 16:52:07.246852461 +0000

> +++ gcc/testsuite/lib/target-supports.exp       2017-11-17 16:52:07.627357602 +0000

> @@ -7180,6 +7180,12 @@ proc check_effective_target_vect_fold_ex

>      return [check_effective_target_aarch64_sve]

>  }

>

> +# Return 1 if the target supports the fold_left_plus optab.

> +

> +proc check_effective_target_vect_fold_left_plus { } {

> +    return [check_effective_target_aarch64_sve]

> +}

> +

>  # Return 1 if the target supports section-anchors

>

>  proc check_effective_target_section_anchors { } {

> Index: gcc/testsuite/gcc.dg/vect/no-fast-math-vect16.c

> ===================================================================

> --- gcc/testsuite/gcc.dg/vect/no-fast-math-vect16.c     2017-11-17 16:52:07.246852461 +0000

> +++ gcc/testsuite/gcc.dg/vect/no-fast-math-vect16.c     2017-11-17 16:52:07.625528250 +0000

> @@ -34,4 +34,4 @@ int main (void)

>  }

>

>  /* Requires fast-math.  */

> -/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { xfail *-*-* } } } */

> +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { xfail { ! vect_fold_left_plus } } } } */

> Index: gcc/testsuite/gcc.dg/vect/pr79920.c

> ===================================================================

> --- gcc/testsuite/gcc.dg/vect/pr79920.c 2017-11-17 16:52:07.246852461 +0000

> +++ gcc/testsuite/gcc.dg/vect/pr79920.c 2017-11-17 16:52:07.625528250 +0000

> @@ -41,4 +41,5 @@ int main()

>    return 0;

>  }

>

> -/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target { vect_double && { vect_perm && vect_hw_misalign } } } } } */

> +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target { { vect_double && { ! vect_fold_left_plus } } && { vect_perm && vect_hw_misalign } } } } } */

> +/* { dg-final { scan-tree-dump-times "vectorized 2 loops" 1 "vect" { target { { vect_double && vect_fold_left_plus } && { vect_perm && vect_hw_misalign } } } } } */

> Index: gcc/testsuite/gcc.dg/vect/trapv-vect-reduc-4.c

> ===================================================================

> --- gcc/testsuite/gcc.dg/vect/trapv-vect-reduc-4.c      2017-11-17 16:52:07.246852461 +0000

> +++ gcc/testsuite/gcc.dg/vect/trapv-vect-reduc-4.c      2017-11-17 16:52:07.625528250 +0000

> @@ -46,5 +46,9 @@ int main (void)

>    return 0;

>  }

>

> -/* { dg-final { scan-tree-dump-times "Detected reduction\\." 2 "vect"  } } */

> +/* 2 for the first loop.  */

> +/* { dg-final { scan-tree-dump-times "Detected reduction\\." 3 "vect" { target { ! vect_multiple_sizes } } } } */

> +/* { dg-final { scan-tree-dump "Detected reduction\\." "vect" { target vect_multiple_sizes } } } */

> +/* { dg-final { scan-tree-dump-times "not vectorized" 1 "vect" { target { ! vect_multiple_sizes } } } } */

> +/* { dg-final { scan-tree-dump "not vectorized" "vect" { target vect_multiple_sizes } } } */

>  /* { dg-final { scan-tree-dump-times "vectorized 2 loops" 1 "vect" { target { ! vect_no_int_min_max } } } } */

> Index: gcc/testsuite/gcc.dg/vect/vect-reduc-6.c

> ===================================================================

> --- gcc/testsuite/gcc.dg/vect/vect-reduc-6.c    2017-11-17 16:52:07.246852461 +0000

> +++ gcc/testsuite/gcc.dg/vect/vect-reduc-6.c    2017-11-17 16:52:07.625528250 +0000

> @@ -50,4 +50,5 @@ int main (void)

>

>  /* need -ffast-math to vectorizer these loops.  */

>  /* ARM NEON passes -ffast-math to these tests, so expect this to fail.  */

> -/* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" { xfail arm_neon_ok } } } */

> +/* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" { target { ! vect_fold_left_plus } xfail arm_neon_ok } } } */

> +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target vect_fold_left_plus } } } */

> Index: gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_1.c

> ===================================================================

> --- /dev/null   2017-11-14 14:28:07.424493901 +0000

> +++ gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_1.c       2017-11-17 16:52:07.625528250 +0000

> @@ -0,0 +1,28 @@

> +/* { dg-do compile } */

> +/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve" } */

> +

> +#define NUM_ELEMS(TYPE) ((int)(5 * (256 / sizeof (TYPE)) + 3))

> +

> +#define DEF_REDUC_PLUS(TYPE)                   \

> +  TYPE __attribute__ ((noinline, noclone))     \

> +  reduc_plus_##TYPE (TYPE *a, TYPE *b)         \

> +  {                                            \

> +    TYPE r = 0, q = 3;                         \

> +    for (int i = 0; i < NUM_ELEMS(TYPE); i++)  \

> +      {                                                \

> +       r += a[i];                              \

> +       q -= b[i];                              \

> +      }                                                \

> +    return r * q;                              \

> +  }

> +

> +#define TEST_ALL(T) \

> +  T (_Float16) \

> +  T (float) \

> +  T (double)

> +

> +TEST_ALL (DEF_REDUC_PLUS)

> +

> +/* { dg-final { scan-assembler-times {\tfadda\th[0-9]+, p[0-7], h[0-9]+, z[0-9]+\.h} 2 } } */

> +/* { dg-final { scan-assembler-times {\tfadda\ts[0-9]+, p[0-7], s[0-9]+, z[0-9]+\.s} 2 } } */

> +/* { dg-final { scan-assembler-times {\tfadda\td[0-9]+, p[0-7], d[0-9]+, z[0-9]+\.d} 2 } } */

> Index: gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_1_run.c

> ===================================================================

> --- /dev/null   2017-11-14 14:28:07.424493901 +0000

> +++ gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_1_run.c   2017-11-17 16:52:07.625528250 +0000

> @@ -0,0 +1,29 @@

> +/* { dg-do run { target { aarch64_sve_hw } } } */

> +/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve" } */

> +

> +#include "sve_reduc_strict_1.c"

> +

> +#define TEST_REDUC_PLUS(TYPE)                  \

> +  {                                            \

> +    TYPE a[NUM_ELEMS (TYPE)];                  \

> +    TYPE b[NUM_ELEMS (TYPE)];                  \

> +    TYPE r = 0, q = 3;                         \

> +    for (int i = 0; i < NUM_ELEMS (TYPE); i++) \

> +      {                                                \

> +       a[i] = (i * 0.1) * (i & 1 ? 1 : -1);    \

> +       b[i] = (i * 0.3) * (i & 1 ? 1 : -1);    \

> +       r += a[i];                              \

> +       q -= b[i];                              \

> +       asm volatile ("" ::: "memory");         \

> +      }                                                \

> +    TYPE res = reduc_plus_##TYPE (a, b);       \

> +    if (res != r * q)                          \

> +      __builtin_abort ();                      \

> +  }

> +

> +int __attribute__ ((optimize (1)))

> +main ()

> +{

> +  TEST_ALL (TEST_REDUC_PLUS);

> +  return 0;

> +}

> Index: gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_2.c

> ===================================================================

> --- /dev/null   2017-11-14 14:28:07.424493901 +0000

> +++ gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_2.c       2017-11-17 16:52:07.625528250 +0000

> @@ -0,0 +1,28 @@

> +/* { dg-do compile } */

> +/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve" } */

> +

> +#define NUM_ELEMS(TYPE) ((int) (5 * (256 / sizeof (TYPE)) + 3))

> +

> +#define DEF_REDUC_PLUS(TYPE)                                   \

> +void __attribute__ ((noinline, noclone))                       \

> +reduc_plus_##TYPE (TYPE (*restrict a)[NUM_ELEMS(TYPE)],                \

> +                  TYPE *restrict r, int n)                     \

> +{                                                              \

> +  for (int i = 0; i < n; i++)                                  \

> +    {                                                          \

> +      r[i] = 0;                                                        \

> +      for (int j = 0; j < NUM_ELEMS(TYPE); j++)                        \

> +        r[i] += a[i][j];                                       \

> +    }                                                          \

> +}

> +

> +#define TEST_ALL(T) \

> +  T (_Float16) \

> +  T (float) \

> +  T (double)

> +

> +TEST_ALL (DEF_REDUC_PLUS)

> +

> +/* { dg-final { scan-assembler-times {\tfadda\th[0-9]+, p[0-7], h[0-9]+, z[0-9]+\.h} 1 } } */

> +/* { dg-final { scan-assembler-times {\tfadda\ts[0-9]+, p[0-7], s[0-9]+, z[0-9]+\.s} 1 } } */

> +/* { dg-final { scan-assembler-times {\tfadda\td[0-9]+, p[0-7], d[0-9]+, z[0-9]+\.d} 1 } } */

> Index: gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_2_run.c

> ===================================================================

> --- /dev/null   2017-11-14 14:28:07.424493901 +0000

> +++ gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_2_run.c   2017-11-17 16:52:07.626442926 +0000

> @@ -0,0 +1,31 @@

> +/* { dg-do run { target { aarch64_sve_hw } } } */

> +/* { dg-options "-O2 -ftree-vectorize -fno-inline -march=armv8-a+sve" } */

> +

> +#include "sve_reduc_strict_2.c"

> +

> +#define NROWS 5

> +

> +#define TEST_REDUC_PLUS(TYPE)                                  \

> +  {                                                            \

> +    TYPE a[NROWS][NUM_ELEMS (TYPE)];                           \

> +    TYPE r[NROWS];                                             \

> +    TYPE expected[NROWS] = {};                                 \

> +    for (int i = 0; i < NROWS; ++i)                            \

> +      for (int j = 0; j < NUM_ELEMS (TYPE); ++j)               \

> +       {                                                       \

> +         a[i][j] = (i * 0.1 + j * 0.6) * (j & 1 ? 1 : -1);     \

> +         expected[i] += a[i][j];                               \

> +         asm volatile ("" ::: "memory");                       \

> +       }                                                       \

> +    reduc_plus_##TYPE (a, r, NROWS);                           \

> +    for (int i = 0; i < NROWS; ++i)                            \

> +      if (r[i] != expected[i])                                 \

> +       __builtin_abort ();                                     \

> +  }

> +

> +int __attribute__ ((optimize (1)))

> +main ()

> +{

> +  TEST_ALL (TEST_REDUC_PLUS);

> +  return 0;

> +}

> Index: gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_3.c

> ===================================================================

> --- /dev/null   2017-11-14 14:28:07.424493901 +0000

> +++ gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_3.c       2017-11-17 16:52:07.626442926 +0000

> @@ -0,0 +1,131 @@

> +/* { dg-do compile } */

> +/* { dg-options "-O2 -ftree-vectorize -fno-inline -march=armv8-a+sve -msve-vector-bits=256 -fdump-tree-vect-details" } */

> +

> +double mat[100][4];

> +double mat2[100][8];

> +double mat3[100][12];

> +double mat4[100][3];

> +

> +double

> +slp_reduc_plus (int n)

> +{

> +  double tmp = 0.0;

> +  for (int i = 0; i < n; i++)

> +    {

> +      tmp = tmp + mat[i][0];

> +      tmp = tmp + mat[i][1];

> +      tmp = tmp + mat[i][2];

> +      tmp = tmp + mat[i][3];

> +    }

> +  return tmp;

> +}

> +

> +double

> +slp_reduc_plus2 (int n)

> +{

> +  double tmp = 0.0;

> +  for (int i = 0; i < n; i++)

> +    {

> +      tmp = tmp + mat2[i][0];

> +      tmp = tmp + mat2[i][1];

> +      tmp = tmp + mat2[i][2];

> +      tmp = tmp + mat2[i][3];

> +      tmp = tmp + mat2[i][4];

> +      tmp = tmp + mat2[i][5];

> +      tmp = tmp + mat2[i][6];

> +      tmp = tmp + mat2[i][7];

> +    }

> +  return tmp;

> +}

> +

> +double

> +slp_reduc_plus3 (int n)

> +{

> +  double tmp = 0.0;

> +  for (int i = 0; i < n; i++)

> +    {

> +      tmp = tmp + mat3[i][0];

> +      tmp = tmp + mat3[i][1];

> +      tmp = tmp + mat3[i][2];

> +      tmp = tmp + mat3[i][3];

> +      tmp = tmp + mat3[i][4];

> +      tmp = tmp + mat3[i][5];

> +      tmp = tmp + mat3[i][6];

> +      tmp = tmp + mat3[i][7];

> +      tmp = tmp + mat3[i][8];

> +      tmp = tmp + mat3[i][9];

> +      tmp = tmp + mat3[i][10];

> +      tmp = tmp + mat3[i][11];

> +    }

> +  return tmp;

> +}

> +

> +void

> +slp_non_chained_reduc (int n, double * restrict out)

> +{

> +  for (int i = 0; i < 3; i++)

> +    out[i] = 0;

> +

> +  for (int i = 0; i < n; i++)

> +    {

> +      out[0] = out[0] + mat4[i][0];

> +      out[1] = out[1] + mat4[i][1];

> +      out[2] = out[2] + mat4[i][2];

> +    }

> +}

> +

> +/* Strict FP reductions shouldn't be used for the outer loops, only the

> +   inner loops.  */

> +

> +float

> +double_reduc1 (float (*restrict i)[16])

> +{

> +  float l = 0;

> +

> +  for (int a = 0; a < 8; a++)

> +    for (int b = 0; b < 8; b++)

> +      l += i[b][a];

> +  return l;

> +}

> +

> +float

> +double_reduc2 (float *restrict i)

> +{

> +  float l = 0;

> +

> +  for (int a = 0; a < 8; a++)

> +    for (int b = 0; b < 16; b++)

> +      {

> +        l += i[b * 4];

> +        l += i[b * 4 + 1];

> +        l += i[b * 4 + 2];

> +        l += i[b * 4 + 3];

> +      }

> +  return l;

> +}

> +

> +float

> +double_reduc3 (float *restrict i, float *restrict j)

> +{

> +  float k = 0, l = 0;

> +

> +  for (int a = 0; a < 8; a++)

> +    for (int b = 0; b < 8; b++)

> +      {

> +        k += i[b];

> +        l += j[b];

> +      }

> +  return l * k;

> +}

> +

> +/* We can't yet handle double_reduc1.  */

> +/* { dg-final { scan-assembler-times {\tfadda\ts[0-9]+, p[0-7], s[0-9]+, z[0-9]+\.s} 3 } } */

> +/* { dg-final { scan-assembler-times {\tfadda\td[0-9]+, p[0-7], d[0-9]+, z[0-9]+\.d} 9 } } */

> +/* 1 reduction each for double_reduc{1,2} and 2 for double_reduc3.  Each one

> +   is reported three times, once for SVE, once for 128-bit AdvSIMD and once

> +   for 64-bit AdvSIMD.  */

> +/* { dg-final { scan-tree-dump-times "Detected double reduction" 12 "vect" } } */

> +/* double_reduc2 has 2 reductions and slp_non_chained_reduc has 3.

> +   double_reduc1 is reported 3 times (SVE, 128-bit AdvSIMD, 64-bit AdvSIMD)

> +   before failing.  */

> +/* { dg-final { scan-tree-dump-times "Detected reduction" 12 "vect" } } */

> Index: gcc/testsuite/gcc.target/aarch64/sve_slp_13.c

> ===================================================================

> --- gcc/testsuite/gcc.target/aarch64/sve_slp_13.c       2017-11-17 16:52:07.246852461 +0000

> +++ gcc/testsuite/gcc.target/aarch64/sve_slp_13.c       2017-11-17 16:52:07.626442926 +0000

> @@ -1,5 +1,6 @@

>  /* { dg-do compile } */

> -/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve -msve-vector-bits=scalable" } */

> +/* The cost model thinks that the double loop isn't a win for SVE-128.  */

> +/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve -msve-vector-bits=scalable -fno-vect-cost-model" } */

>

>  #include <stdint.h>

>

> @@ -24,7 +25,10 @@ #define TEST_ALL(T)                          \

>    T (int32_t)                                  \

>    T (uint32_t)                                 \

>    T (int64_t)                                  \

> -  T (uint64_t)

> +  T (uint64_t)                                 \

> +  T (_Float16)                                 \

> +  T (float)                                    \

> +  T (double)

>

>  TEST_ALL (VEC_PERM)

>

> @@ -32,21 +36,25 @@ TEST_ALL (VEC_PERM)

>  /* ??? We don't treat the uint loops as SLP.  */

>  /* The loop should be fully-masked.  */

>  /* { dg-final { scan-assembler-times {\tld1b\t} 2 { xfail *-*-* } } } */

> -/* { dg-final { scan-assembler-times {\tld1h\t} 2 { xfail *-*-* } } } */

> -/* { dg-final { scan-assembler-times {\tld1w\t} 2 { xfail *-*-* } } } */

> -/* { dg-final { scan-assembler-times {\tld1w\t} 1 } } */

> -/* { dg-final { scan-assembler-times {\tld1d\t} 2 { xfail *-*-* } } } */

> -/* { dg-final { scan-assembler-times {\tld1d\t} 1 } } */

> +/* { dg-final { scan-assembler-times {\tld1h\t} 3 { xfail *-*-* } } } */

> +/* { dg-final { scan-assembler-times {\tld1w\t} 3 { xfail *-*-* } } } */

> +/* { dg-final { scan-assembler-times {\tld1w\t} 2 } } */

> +/* { dg-final { scan-assembler-times {\tld1d\t} 3 { xfail *-*-* } } } */

> +/* { dg-final { scan-assembler-times {\tld1d\t} 2 } } */

>  /* { dg-final { scan-assembler-not {\tldr} { xfail *-*-* } } } */

>

>  /* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.b} 4 { xfail *-*-* } } } */

> -/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.h} 4 { xfail *-*-* } } } */

> -/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s} 4 } } */

> -/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d} 4 } } */

> +/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.h} 6 { xfail *-*-* } } } */

> +/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s} 6 } } */

> +/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d} 6 } } */

>

>  /* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.b\n} 2 { xfail *-*-* } } } */

>  /* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.h\n} 2 { xfail *-*-* } } } */

>  /* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.s\n} 2 } } */

>  /* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.d\n} 2 } } */

> +/* { dg-final { scan-assembler-times {\tfadda\th[0-9]+, p[0-7], h[0-9]+, z[0-9]+\.h\n} 1 } } */

> +/* { dg-final { scan-assembler-times {\tfadda\ts[0-9]+, p[0-7], s[0-9]+, z[0-9]+\.s\n} 1 } } */

> +/* { dg-final { scan-assembler-times {\tfadda\td[0-9]+, p[0-7], d[0-9]+, z[0-9]+\.d\n} 1 } } */

> +/* { dg-final { scan-assembler-not {\tfadd\n} } } */

>

>  /* { dg-final { scan-assembler-not {\tuqdec} } } */

> Index: gcc/testsuite/gfortran.dg/vect/vect-8.f90

> ===================================================================

> --- gcc/testsuite/gfortran.dg/vect/vect-8.f90   2017-11-17 16:52:07.246852461 +0000

> +++ gcc/testsuite/gfortran.dg/vect/vect-8.f90   2017-11-17 16:52:07.626442926 +0000

> @@ -704,5 +704,6 @@ CALL track('KERNEL  ')

>  RETURN

>  END SUBROUTINE kernel

>

> -! { dg-final { scan-tree-dump-times "vectorized 21 loops" 1 "vect" { target { vect_intdouble_cvt } } } }

>  ! { dg-final { scan-tree-dump-times "vectorized 17 loops" 1 "vect" { target { ! vect_intdouble_cvt } } } }

> +! { dg-final { scan-tree-dump-times "vectorized 21 loops" 1 "vect" { target { vect_intdouble_cvt && { ! vect_fold_left_plus } } } } }

> +! { dg-final { scan-tree-dump-times "vectorized 25 loops" 1 "vect" { target { vect_intdouble_cvt && vect_fold_left_plus } } } }

Richard Sandiford Nov. 20, 2017, 12:54 p.m. UTC | #2

Richard Biener <richard.guenther@gmail.com> writes:
> On Fri, Nov 17, 2017 at 5:53 PM, Richard Sandiford
> <richard.sandiford@linaro.org> wrote:
>> This patch adds support for in-order floating-point addition reductions,
>> which are suitable even in strict IEEE mode.
>>
>> Previously vect_is_simple_reduction would reject any cases that forbid
>> reassociation.  The idea is instead to tentatively accept them as
>> "FOLD_LEFT_REDUCTIONs" and only fail later if there is no target
>> support for them.  Although this patch only handles the particular
>> case of plus and minus on floating-point types, there's no reason in
>> principle why targets couldn't handle other cases.
>>
>> The vect_force_simple_reduction change makes it simpler for parloops
>> to read the type of reduction.
>>
>> Tested on aarch64-linux-gnu (with and without SVE), x86_64-linux-gnu
>> and powerpc64le-linux-gnu.  OK to install?
>
> I don't like that you add a new tree code for this.  A new IFN looks more
> suitable to me.

OK.

> Also I think if there's a way to handle this correctly with target support
> you can also implement a fallback if there is no such support increasing
> test coverage.  It would basically boil down to extracting all scalars from
> the non-reduction operand vector and performing a series of reduction
> ops, keeping the reduction PHI scalar.  This would also support any
> reduction operator.

Yeah, but without target support, that's probably going to be expensive.
It's a bit like how we can implement element-by-element loads and stores
for cases that don't have target support, but had to explicitly disable
that in many cases, since the cost model was too optimistic.

I can give it a go anyway if you think it's worth it.
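
(For concreteness, the fallback being discussed could be sketched in plain
C roughly as below.  This is only an illustration: VF, the function names
and the lane array are made up here and are not part of the patch.
sum_fold_left handles VF elements per iteration but still adds them to the
scalar accumulator in lane order, so it should match the scalar loop
exactly.  sum_reassoc shows the usual partial-sum scheme, which
reassociates the additions and so can give a different result.)

```c
#include <assert.h>
#include <stddef.h>

#define VF 4  /* hypothetical vectorization factor, chosen for the example */

/* Reference semantics: a strictly in-order scalar sum.  */
double sum_in_order(const double *x, size_t n)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; i++)
        acc += x[i];
    return acc;
}

/* Fold-left fallback: load VF elements at a time, then extract each
   lane and add it to the scalar accumulator in lane order.  The
   accumulator stays scalar, so the order of additions is unchanged.  */
double sum_fold_left(const double *x, size_t n)
{
    assert(x != NULL || n == 0);
    double acc = 0.0;
    size_t i = 0;
    for (; i + VF <= n; i += VF) {
        double lanes[VF];
        for (size_t l = 0; l < VF; l++)   /* stands in for a vector load */
            lanes[l] = x[i + l];
        for (size_t l = 0; l < VF; l++)   /* extract lane, then add */
            acc += lanes[l];
    }
    for (; i < n; i++)                    /* scalar tail */
        acc += x[i];
    return acc;
}

/* Reassociating reduction: VF independent partial sums that are only
   combined after the loop.  Not valid without reassociation.  */
double sum_reassoc(const double *x, size_t n)
{
    double partial[VF] = { 0.0 };
    size_t i = 0;
    for (; i + VF <= n; i += VF)
        for (size_t l = 0; l < VF; l++)
            partial[l] += x[i + l];
    double acc = 0.0;
    for (size_t l = 0; l < VF; l++)
        acc += partial[l];
    for (; i < n; i++)
        acc += x[i];
    return acc;
}
```

With an input such as { 1e16, 1.0, -1e16, 1.0, 1.0, 1.0, 1.0, 1.0 },
sum_fold_left and sum_in_order perform the same additions in the same
order, whereas sum_reassoc groups them differently and may round
differently.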

As far as testing coverage goes: I think the SVE port is just going
to have to take the hit of being the only port that uses this stuff
for now.  The AArch64 testsuite patches test SVE assembly generation
for non-SVE targets, so it does get at least some coverage on normal
AArch64 test runs.  But obviously assembly tests only go so far...

Thanks,
Richard

Richard Biener Nov. 21, 2017, 2:50 p.m. UTC | #3

On Mon, Nov 20, 2017 at 1:54 PM, Richard Sandiford
<richard.sandiford@linaro.org> wrote:
> Richard Biener <richard.guenther@gmail.com> writes:
>> On Fri, Nov 17, 2017 at 5:53 PM, Richard Sandiford
>> <richard.sandiford@linaro.org> wrote:
>>> This patch adds support for in-order floating-point addition reductions,
>>> which are suitable even in strict IEEE mode.
>>>
>>> Previously vect_is_simple_reduction would reject any cases that forbid
>>> reassociation.  The idea is instead to tentatively accept them as
>>> "FOLD_LEFT_REDUCTIONs" and only fail later if there is no target
>>> support for them.  Although this patch only handles the particular
>>> case of plus and minus on floating-point types, there's no reason in
>>> principle why targets couldn't handle other cases.
>>>
>>> The vect_force_simple_reduction change makes it simpler for parloops
>>> to read the type of reduction.
>>>
>>> Tested on aarch64-linux-gnu (with and without SVE), x86_64-linux-gnu
>>> and powerpc64le-linux-gnu.  OK to install?
>>
>> I don't like that you add a new tree code for this.  A new IFN looks more
>> suitable to me.
>
> OK.


Thanks.  I'd like to eventually get rid of other vectorizer tree codes as well,
like the REDUC_*_EXPR, DOT_PROD_EXPR and SAD_EXPR.  IFNs
are now really the way to go for "target instructions on GIMPLE".

>> Also I think if there's a way to handle this correctly with target support
>> you can also implement a fallback if there is no such support increasing
>> test coverage.  It would basically boil down to extracting all scalars from
>> the non-reduction operand vector and performing a series of reduction
>> ops, keeping the reduction PHI scalar.  This would also support any
>> reduction operator.
>
> Yeah, but without target support, that's probably going to be expensive.
> It's a bit like how we can implement element-by-element loads and stores
> for cases that don't have target support, but had to explicitly disable
> that in many cases, since the cost model was too optimistic.


I expect that for V2DF or even V4DF it might be profitable in quite a number
of cases.  V2DF definitely.

> I can give it a go anyway if you think it's worth it.


I think it is.

Richard.

> As far as testing coverage goes: I think the SVE port is just going
> to have to take the hit of being the only port that uses this stuff
> for now.  The AArch64 testsuite patches test SVE assembly generation
> for non-SVE targets, so it does get at least some coverage on normal
> AArch64 test runs.  But obviously assembly tests only go so far...
>
> Thanks,
> Richard

Richard Sandiford Nov. 21, 2017, 4:38 p.m. UTC | #4

Richard Biener <richard.guenther@gmail.com> writes:
> On Mon, Nov 20, 2017 at 1:54 PM, Richard Sandiford
> <richard.sandiford@linaro.org> wrote:
>> Richard Biener <richard.guenther@gmail.com> writes:
>>> On Fri, Nov 17, 2017 at 5:53 PM, Richard Sandiford
>>> <richard.sandiford@linaro.org> wrote:
>>>> This patch adds support for in-order floating-point addition reductions,
>>>> which are suitable even in strict IEEE mode.
>>>>
>>>> Previously vect_is_simple_reduction would reject any cases that forbid
>>>> reassociation.  The idea is instead to tentatively accept them as
>>>> "FOLD_LEFT_REDUCTIONs" and only fail later if there is no target
>>>> support for them.  Although this patch only handles the particular
>>>> case of plus and minus on floating-point types, there's no reason in
>>>> principle why targets couldn't handle other cases.
>>>>
>>>> The vect_force_simple_reduction change makes it simpler for parloops
>>>> to read the type of reduction.
>>>>
>>>> Tested on aarch64-linux-gnu (with and without SVE), x86_64-linux-gnu
>>>> and powerpc64le-linux-gnu.  OK to install?
>>>
>>> I don't like that you add a new tree code for this.  A new IFN looks more
>>> suitable to me.
>>
>> OK.
>
> Thanks.  I'd like to eventually get rid of other vectorizer tree codes as well,
> like the REDUC_*_EXPR, DOT_PROD_EXPR and SAD_EXPR.  IFNs
> are now really the way to go for "target instructions on GIMPLE".


Glad you said that.  I ended up having to convert REDUC_*_EXPRs too,
since it was too ugly trying to support some reductions based on tree
codes and some on internal functions.  (I did try using code_helper,
but even then...)

Tested on aarch64-linux-gnu, x86_64-linux-gnu and powerpc64le-linux-gnu.
OK to install?

Thanks,
Richard

PS. This applies at the same point in the series as the FADDA patch.
I can rejig it to apply onto current trunk if that seems better.


2017-11-21  Richard Sandiford  <richard.sandiford@linaro.org>

gcc/
	* tree.def (REDUC_MAX_EXPR, REDUC_MIN_EXPR, REDUC_PLUS_EXPR)
	(REDUC_AND_EXPR, REDUC_IOR_EXPR, REDUC_XOR_EXPR): Delete.
	* doc/generic.texi (REDUC_MAX_EXPR, REDUC_MIN_EXPR, REDUC_PLUS_EXPR)
	(REDUC_AND_EXPR, REDUC_IOR_EXPR, REDUC_XOR_EXPR): Delete.
	* cfgexpand.c (expand_debug_expr): Remove handling for them.
	* expr.c (expand_expr_real_2): Likewise.
	* fold-const.c (const_unop): Likewise.
	* optabs-tree.c (optab_for_tree_code): Likewise.
	* tree-cfg.c (verify_gimple_assign_unary): Likewise.
	* tree-inline.c (estimate_operator_cost): Likewise.
	* tree-pretty-print.c (dump_generic_node): Likewise.
	(op_code_prio): Likewise.
	(op_symbol_code): Likewise.
	* internal-fn.def (DEF_INTERNAL_SIGNED_OPTAB_FN): Define.
	(IFN_REDUC_PLUS, IFN_REDUC_MAX, IFN_REDUC_MIN, IFN_REDUC_AND)
	(IFN_REDUC_IOR, IFN_REDUC_XOR): New internal functions.
	* internal-fn.c (direct_internal_fn_optab): New function.
	(direct_internal_fn_array, direct_internal_fn_supported_p)
	(internal_fn_expanders): Handle DEF_INTERNAL_SIGNED_OPTAB_FN.
	* fold-const-call.c (fold_const_reduction): New function.
	(fold_const_call): Handle CFN_REDUC_PLUS, CFN_REDUC_MAX, CFN_REDUC_MIN,
	CFN_REDUC_AND, CFN_REDUC_IOR and CFN_REDUC_XOR.
	* tree-vect-loop.c (reduction_code_for_scalar_code): Rename to...
	(reduction_fn_for_scalar_code): ...this and return an internal
	function.
	(vect_model_reduction_cost): Take an internal_fn rather than
	a tree_code.
	(vect_create_epilog_for_reduction): Likewise.  Build calls rather
	than assignments.
	(vectorizable_reduction): Use internal functions rather than tree
	codes for the reduction operation.  Update calls to the functions
	above.
	* config/aarch64/aarch64-builtins.c (aarch64_gimple_fold_builtin):
	Use calls to internal functions rather than REDUC tree codes.
	* config/aarch64/aarch64-simd.md: Update comment accordingly.
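
As a rough illustration of what DEF_INTERNAL_SIGNED_OPTAB_FN is for, here
is a toy version of the same X-macro pattern in standalone C.  It is a
sketch, not GCC code: TOY_FN_LIST, toy_fn, toy_fn_optab and the
is_unsigned flag are invented for the example, and the optab names are
simply returned as strings.  The point is that one list of entries is
expanded several times, and the "signed" kind of entry picks one of two
optabs depending on the signedness of the input.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for internal-fn.def.  PLAIN entries map to one optab;
   SIGNED entries carry a signed and an unsigned optab.  */
#define TOY_FN_LIST(PLAIN, SIGNED)                        \
  PLAIN (REDUC_PLUS, reduc_plus_scal)                     \
  SIGNED (REDUC_MAX, reduc_smax_scal, reduc_umax_scal)    \
  SIGNED (REDUC_MIN, reduc_smin_scal, reduc_umin_scal)    \
  PLAIN (REDUC_AND, reduc_and_scal)

/* First expansion: one enumerator per entry.  */
enum toy_fn {
#define AS_ENUM_PLAIN(NAME, OPTAB) TOY_##NAME,
#define AS_ENUM_SIGNED(NAME, SOPTAB, UOPTAB) TOY_##NAME,
  TOY_FN_LIST (AS_ENUM_PLAIN, AS_ENUM_SIGNED)
  TOY_LAST
};

/* Second expansion: return the name of the optab for FN.  For SIGNED
   entries the choice depends on IS_UNSIGNED.  */
const char *toy_fn_optab (enum toy_fn fn, bool is_unsigned)
{
  switch (fn)
    {
#define AS_CASE_PLAIN(NAME, OPTAB) \
    case TOY_##NAME: return #OPTAB;
#define AS_CASE_SIGNED(NAME, SOPTAB, UOPTAB) \
    case TOY_##NAME: return is_unsigned ? #UOPTAB : #SOPTAB;
      TOY_FN_LIST (AS_CASE_PLAIN, AS_CASE_SIGNED)
    case TOY_LAST:
      break;
    }
  assert (!"unknown toy_fn");
  return NULL;
}
```

For example, toy_fn_optab (TOY_REDUC_MAX, true) selects the unsigned
entry, while toy_fn_optab (TOY_REDUC_PLUS, ...) ignores the flag.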

Index: gcc/tree.def
===================================================================
--- gcc/tree.def	2017-11-21 16:31:28.695326387 +0000
+++ gcc/tree.def	2017-11-21 16:31:49.729927809 +0000
@@ -1287,21 +1287,6 @@ DEFTREECODE (OMP_CLAUSE, "omp_clause", t
    Operand 0: BODY: contains body of the transaction.  */
 DEFTREECODE (TRANSACTION_EXPR, "transaction_expr", tcc_expression, 1)
 
-/* Reduction operations.
-   Operations that take a vector of elements and "reduce" it to a scalar
-   result (e.g. summing the elements of the vector, finding the minimum over
-   the vector elements, etc).
-   Operand 0 is a vector.
-   The expression returns a scalar, with type the same as the elements of the
-   vector, holding the result of the reduction of all elements of the operand.
-   */
-DEFTREECODE (REDUC_MAX_EXPR, "reduc_max_expr", tcc_unary, 1)
-DEFTREECODE (REDUC_MIN_EXPR, "reduc_min_expr", tcc_unary, 1)
-DEFTREECODE (REDUC_PLUS_EXPR, "reduc_plus_expr", tcc_unary, 1)
-DEFTREECODE (REDUC_AND_EXPR, "reduc_and_expr", tcc_unary, 1)
-DEFTREECODE (REDUC_IOR_EXPR, "reduc_ior_expr", tcc_unary, 1)
-DEFTREECODE (REDUC_XOR_EXPR, "reduc_xor_expr", tcc_unary, 1)
-
 /* Widening dot-product.
    The first two arguments are of type t1.
    The third argument and the result are of type t2, such that t2 is at least
Index: gcc/doc/generic.texi
===================================================================
--- gcc/doc/generic.texi	2017-11-21 16:31:28.695326387 +0000
+++ gcc/doc/generic.texi	2017-11-21 16:31:49.723928786 +0000
@@ -1740,12 +1740,6 @@ a value from @code{enum annot_expr_kind}
 @tindex VEC_PACK_FIX_TRUNC_EXPR
 @tindex VEC_COND_EXPR
 @tindex SAD_EXPR
-@tindex REDUC_MAX_EXPR
-@tindex REDUC_MIN_EXPR
-@tindex REDUC_PLUS_EXPR
-@tindex REDUC_AND_EXPR
-@tindex REDUC_IOR_EXPR
-@tindex REDUC_XOR_EXPR
 
 @table @code
 @item VEC_DUPLICATE_EXPR
@@ -1846,21 +1840,6 @@ must have the same type.  The size of th
 operand must be at lease twice of the size of the vector element of the
 first and second one.  The SAD is calculated between the first and second
 operands, added to the third operand, and returned.
-
-@item REDUC_MAX_EXPR
-@itemx REDUC_MIN_EXPR
-@itemx REDUC_PLUS_EXPR
-@itemx REDUC_AND_EXPR
-@itemx REDUC_IOR_EXPR
-@itemx REDUC_XOR_EXPR
-These nodes represent operations that take a vector input and repeatedly
-apply a binary operator on pairs of elements until only one scalar remains.
-For example, @samp{REDUC_PLUS_EXPR <@var{x}>} returns the sum of
-the elements in @var{x} and @samp{REDUC_MAX_EXPR <@var{x}>} returns
-the maximum element in @var{x}.  The associativity of the operation
-is unspecified; for example, @samp{REDUC_PLUS_EXPR <@var{x}>} could
-sum floating-point @var{x} in forward order, in reverse order,
-using a tree, or in some other way.
 @end table
 
 
Index: gcc/cfgexpand.c
===================================================================
--- gcc/cfgexpand.c	2017-11-21 16:31:28.695326387 +0000
+++ gcc/cfgexpand.c	2017-11-21 16:31:49.722928949 +0000
@@ -5066,12 +5066,6 @@ expand_debug_expr (tree exp)
 
     /* Vector stuff.  For most of the codes we don't have rtl codes.  */
     case REALIGN_LOAD_EXPR:
-    case REDUC_MAX_EXPR:
-    case REDUC_MIN_EXPR:
-    case REDUC_PLUS_EXPR:
-    case REDUC_AND_EXPR:
-    case REDUC_IOR_EXPR:
-    case REDUC_XOR_EXPR:
     case VEC_COND_EXPR:
     case VEC_PACK_FIX_TRUNC_EXPR:
     case VEC_PACK_SAT_EXPR:
Index: gcc/expr.c
===================================================================
--- gcc/expr.c	2017-11-21 16:31:28.695326387 +0000
+++ gcc/expr.c	2017-11-21 16:31:49.724928624 +0000
@@ -9440,29 +9440,6 @@ #define REDUCE_BIT_FIELD(expr)	(reduce_b
         return target;
       }
 
-    case REDUC_MAX_EXPR:
-    case REDUC_MIN_EXPR:
-    case REDUC_PLUS_EXPR:
-    case REDUC_AND_EXPR:
-    case REDUC_IOR_EXPR:
-    case REDUC_XOR_EXPR:
-      {
-        op0 = expand_normal (treeop0);
-        this_optab = optab_for_tree_code (code, type, optab_default);
-        machine_mode vec_mode = TYPE_MODE (TREE_TYPE (treeop0));
-
-	struct expand_operand ops[2];
-	enum insn_code icode = optab_handler (this_optab, vec_mode);
-
-	create_output_operand (&ops[0], target, mode);
-	create_input_operand (&ops[1], op0, vec_mode);
-	expand_insn (icode, 2, ops);
-	target = ops[0].value;
-	if (GET_MODE (target) != mode)
-	  return gen_lowpart (tmode, target);
-	return target;
-      }
-
     case VEC_UNPACK_HI_EXPR:
     case VEC_UNPACK_LO_EXPR:
       {
Index: gcc/fold-const.c
===================================================================
--- gcc/fold-const.c	2017-11-21 16:31:28.695326387 +0000
+++ gcc/fold-const.c	2017-11-21 16:31:49.725928461 +0000
@@ -1866,42 +1866,6 @@ const_unop (enum tree_code code, tree ty
 	return build_vector (type, elts);
       }
 
-    case REDUC_MIN_EXPR:
-    case REDUC_MAX_EXPR:
-    case REDUC_PLUS_EXPR:
-    case REDUC_AND_EXPR:
-    case REDUC_IOR_EXPR:
-    case REDUC_XOR_EXPR:
-      {
-	unsigned int nelts, i;
-	enum tree_code subcode;
-
-	if (TREE_CODE (arg0) != VECTOR_CST)
-	  return NULL_TREE;
-	nelts = VECTOR_CST_NELTS (arg0);
-
-	switch (code)
-	  {
-	  case REDUC_MIN_EXPR: subcode = MIN_EXPR; break;
-	  case REDUC_MAX_EXPR: subcode = MAX_EXPR; break;
-	  case REDUC_PLUS_EXPR: subcode = PLUS_EXPR; break;
-	  case REDUC_AND_EXPR: subcode = BIT_AND_EXPR; break;
-	  case REDUC_IOR_EXPR: subcode = BIT_IOR_EXPR; break;
-	  case REDUC_XOR_EXPR: subcode = BIT_XOR_EXPR; break;
-	  default: gcc_unreachable ();
-	  }
-
-	tree res = VECTOR_CST_ELT (arg0, 0);
-	for (i = 1; i < nelts; i++)
-	  {
-	    res = const_binop (subcode, res, VECTOR_CST_ELT (arg0, i));
-	    if (res == NULL_TREE || !CONSTANT_CLASS_P (res))
-	      return NULL_TREE;
-	  }
-
-	return res;
-      }
-
     case VEC_DUPLICATE_EXPR:
       if (CONSTANT_CLASS_P (arg0))
 	return build_vector_from_val (type, arg0);
Index: gcc/optabs-tree.c
===================================================================
--- gcc/optabs-tree.c	2017-11-21 16:31:28.695326387 +0000
+++ gcc/optabs-tree.c	2017-11-21 16:31:49.726928298 +0000
@@ -146,26 +146,6 @@ optab_for_tree_code (enum tree_code code
     case FMA_EXPR:
       return fma_optab;
 
-    case REDUC_MAX_EXPR:
-      return TYPE_UNSIGNED (type)
-	     ? reduc_umax_scal_optab : reduc_smax_scal_optab;
-
-    case REDUC_MIN_EXPR:
-      return TYPE_UNSIGNED (type)
-	     ? reduc_umin_scal_optab : reduc_smin_scal_optab;
-
-    case REDUC_PLUS_EXPR:
-      return reduc_plus_scal_optab;
-
-    case REDUC_AND_EXPR:
-      return reduc_and_scal_optab;
-
-    case REDUC_IOR_EXPR:
-      return reduc_ior_scal_optab;
-
-    case REDUC_XOR_EXPR:
-      return reduc_xor_scal_optab;
-
     case VEC_WIDEN_MULT_HI_EXPR:
       return TYPE_UNSIGNED (type) ?
 	vec_widen_umult_hi_optab : vec_widen_smult_hi_optab;
Index: gcc/tree-cfg.c
===================================================================
--- gcc/tree-cfg.c	2017-11-21 16:31:28.695326387 +0000
+++ gcc/tree-cfg.c	2017-11-21 16:31:49.727928135 +0000
@@ -3774,21 +3774,6 @@ verify_gimple_assign_unary (gassign *stm
 
         return false;
       }
-    case REDUC_MAX_EXPR:
-    case REDUC_MIN_EXPR:
-    case REDUC_PLUS_EXPR:
-    case REDUC_AND_EXPR:
-    case REDUC_IOR_EXPR:
-    case REDUC_XOR_EXPR:
-      if (!VECTOR_TYPE_P (rhs1_type)
-	  || !useless_type_conversion_p (lhs_type, TREE_TYPE (rhs1_type)))
-        {
-	  error ("reduction should convert from vector to element type");
-	  debug_generic_expr (lhs_type);
-	  debug_generic_expr (rhs1_type);
-	  return true;
-	}
-      return false;
 
     case VEC_UNPACK_HI_EXPR:
     case VEC_UNPACK_LO_EXPR:
Index: gcc/tree-inline.c
===================================================================
--- gcc/tree-inline.c	2017-11-21 16:31:28.695326387 +0000
+++ gcc/tree-inline.c	2017-11-21 16:31:49.727928135 +0000
@@ -3875,12 +3875,6 @@ estimate_operator_cost (enum tree_code c
 
     case REALIGN_LOAD_EXPR:
 
-    case REDUC_MAX_EXPR:
-    case REDUC_MIN_EXPR:
-    case REDUC_PLUS_EXPR:
-    case REDUC_AND_EXPR:
-    case REDUC_IOR_EXPR:
-    case REDUC_XOR_EXPR:
     case WIDEN_SUM_EXPR:
     case WIDEN_MULT_EXPR:
     case DOT_PROD_EXPR:
Index: gcc/tree-pretty-print.c
===================================================================
--- gcc/tree-pretty-print.c	2017-11-21 16:31:28.695326387 +0000
+++ gcc/tree-pretty-print.c	2017-11-21 16:31:49.727928135 +0000
@@ -3252,12 +3252,6 @@ dump_generic_node (pretty_printer *pp, t
       break;
 
     case VEC_DUPLICATE_EXPR:
-    case REDUC_MAX_EXPR:
-    case REDUC_MIN_EXPR:
-    case REDUC_PLUS_EXPR:
-    case REDUC_AND_EXPR:
-    case REDUC_IOR_EXPR:
-    case REDUC_XOR_EXPR:
       pp_space (pp);
       for (str = get_tree_code_name (code); *str; str++)
 	pp_character (pp, TOUPPER (*str));
@@ -3628,9 +3622,6 @@ op_code_prio (enum tree_code code)
     case ABS_EXPR:
     case REALPART_EXPR:
     case IMAGPART_EXPR:
-    case REDUC_MAX_EXPR:
-    case REDUC_MIN_EXPR:
-    case REDUC_PLUS_EXPR:
     case VEC_UNPACK_HI_EXPR:
     case VEC_UNPACK_LO_EXPR:
     case VEC_UNPACK_FLOAT_HI_EXPR:
@@ -3749,9 +3740,6 @@ op_symbol_code (enum tree_code code)
     case PLUS_EXPR:
       return "+";
 
-    case REDUC_PLUS_EXPR:
-      return "r+";
-
     case WIDEN_SUM_EXPR:
       return "w+";
 
Index: gcc/internal-fn.def
===================================================================
--- gcc/internal-fn.def	2017-11-21 16:31:19.983714206 +0000
+++ gcc/internal-fn.def	2017-11-21 16:31:49.726928298 +0000
@@ -30,6 +30,8 @@ along with GCC; see the file COPYING3.
 
      DEF_INTERNAL_FN (NAME, FLAGS, FNSPEC)
      DEF_INTERNAL_OPTAB_FN (NAME, FLAGS, OPTAB, TYPE)
+     DEF_INTERNAL_SIGNED_OPTAB_FN (NAME, FLAGS, SELECTOR, SIGNED_OPTAB,
+				   UNSIGNED_OPTAB, TYPE)
      DEF_INTERNAL_COND_OPTAB_FN (NAME, FLAGS, OPTAB, TYPE)
      DEF_INTERNAL_FLT_FN (NAME, FLAGS, OPTAB, TYPE)
      DEF_INTERNAL_INT_FN (NAME, FLAGS, OPTAB, TYPE)
@@ -57,6 +59,12 @@ along with GCC; see the file COPYING3.
 
    - cond_binary: a conditional binary optab, such as add<mode>cc
 
+   DEF_INTERNAL_SIGNED_OPTAB_FN defines an internal function that
+   maps to one of two optabs, depending on the signedness of an input.
+   SIGNED_OPTAB and UNSIGNED_OPTAB are the optabs for signed and
+   unsigned inputs respectively, both without the trailing "_optab".
+   SELECTOR says which type in the tree_pair determines the signedness.
+
    DEF_INTERNAL_COND_OPTAB_FN defines a conditional function COND_<NAME>,
    with optab cond_<OPTAB> and type cond_<TYPE>.  All these functions
    are predicated and take the predicate as the first argument.
@@ -87,6 +95,12 @@ along with GCC; see the file COPYING3.
   DEF_INTERNAL_FN (NAME, FLAGS | ECF_LEAF, NULL)
 #endif
 
+#ifndef DEF_INTERNAL_SIGNED_OPTAB_FN
+#define DEF_INTERNAL_SIGNED_OPTAB_FN(NAME, FLAGS, SELECTOR, SIGNED_OPTAB, \
+				     UNSIGNED_OPTAB, TYPE) \
+  DEF_INTERNAL_FN (NAME, FLAGS | ECF_LEAF, NULL)
+#endif
+
 #define DEF_INTERNAL_COND_OPTAB_FN(NAME, FLAGS, OPTAB, TYPE) \
   DEF_INTERNAL_OPTAB_FN (COND_##NAME, FLAGS, cond_##OPTAB, cond_##TYPE)
 
@@ -142,6 +156,19 @@ DEF_INTERNAL_COND_OPTAB_FN (XOR, ECF_CON
 
 DEF_INTERNAL_OPTAB_FN (RSQRT, ECF_CONST, rsqrt, unary)
 
+DEF_INTERNAL_OPTAB_FN (REDUC_PLUS, ECF_CONST | ECF_NOTHROW,
+		       reduc_plus_scal, unary)
+DEF_INTERNAL_SIGNED_OPTAB_FN (REDUC_MAX, ECF_CONST | ECF_NOTHROW, first,
+			      reduc_smax_scal, reduc_umax_scal, unary)
+DEF_INTERNAL_SIGNED_OPTAB_FN (REDUC_MIN, ECF_CONST | ECF_NOTHROW, first,
+			      reduc_smin_scal, reduc_umin_scal, unary)
+DEF_INTERNAL_OPTAB_FN (REDUC_AND, ECF_CONST | ECF_NOTHROW,
+		       reduc_and_scal, unary)
+DEF_INTERNAL_OPTAB_FN (REDUC_IOR, ECF_CONST | ECF_NOTHROW,
+		       reduc_ior_scal, unary)
+DEF_INTERNAL_OPTAB_FN (REDUC_XOR, ECF_CONST | ECF_NOTHROW,
+		       reduc_xor_scal, unary)
+
 /* Extract the last active element from a vector.  */
 DEF_INTERNAL_OPTAB_FN (EXTRACT_LAST, ECF_CONST | ECF_NOTHROW,
 		       extract_last, cond_unary)
@@ -290,5 +317,6 @@ DEF_INTERNAL_FN (DIVMOD, ECF_CONST | ECF
 #undef DEF_INTERNAL_FLT_FN
 #undef DEF_INTERNAL_FLT_FLOATN_FN
 #undef DEF_INTERNAL_COND_OPTAB_FN
+#undef DEF_INTERNAL_SIGNED_OPTAB_FN
 #undef DEF_INTERNAL_OPTAB_FN
 #undef DEF_INTERNAL_FN
Index: gcc/internal-fn.c
===================================================================
--- gcc/internal-fn.c	2017-11-21 16:31:19.983714206 +0000
+++ gcc/internal-fn.c	2017-11-21 16:31:49.726928298 +0000
@@ -96,6 +96,8 @@ #define fold_extract_direct { 2, 2, fals
 const direct_internal_fn_info direct_internal_fn_array[IFN_LAST + 1] = {
 #define DEF_INTERNAL_FN(CODE, FLAGS, FNSPEC) not_direct,
 #define DEF_INTERNAL_OPTAB_FN(CODE, FLAGS, OPTAB, TYPE) TYPE##_direct,
+#define DEF_INTERNAL_SIGNED_OPTAB_FN(CODE, FLAGS, SELECTOR, SIGNED_OPTAB, \
+				     UNSIGNED_OPTAB, TYPE) TYPE##_direct,
 #include "internal-fn.def"
   not_direct
 };
@@ -2921,6 +2923,30 @@ #define direct_mask_store_lanes_optab_su
 #define direct_while_optab_supported_p convert_optab_supported_p
 #define direct_fold_extract_optab_supported_p direct_optab_supported_p
 
+/* Return the optab used by internal function FN.  */
+
+static optab
+direct_internal_fn_optab (internal_fn fn, tree_pair types)
+{
+  switch (fn)
+    {
+#define DEF_INTERNAL_FN(CODE, FLAGS, FNSPEC) \
+    case IFN_##CODE: break;
+#define DEF_INTERNAL_OPTAB_FN(CODE, FLAGS, OPTAB, TYPE) \
+    case IFN_##CODE: return OPTAB##_optab;
+#define DEF_INTERNAL_SIGNED_OPTAB_FN(CODE, FLAGS, SELECTOR, SIGNED_OPTAB, \
+				     UNSIGNED_OPTAB, TYPE)		\
+    case IFN_##CODE: return (TYPE_UNSIGNED (types.SELECTOR)		\
+			     ? UNSIGNED_OPTAB ## _optab			\
+			     : SIGNED_OPTAB ## _optab);
+#include "internal-fn.def"
+
+    case IFN_LAST:
+      break;
+    }
+  gcc_unreachable ();
+}
+
 /* Return true if FN is supported for the types in TYPES when the
    optimization type is OPT_TYPE.  The types are those associated with
    the "type0" and "type1" fields of FN's direct_internal_fn_info
@@ -2938,6 +2964,16 @@ #define DEF_INTERNAL_OPTAB_FN(CODE, FLAG
     case IFN_##CODE: \
       return direct_##TYPE##_optab_supported_p (OPTAB##_optab, types, \
 						opt_type);
+#define DEF_INTERNAL_SIGNED_OPTAB_FN(CODE, FLAGS, SELECTOR, SIGNED_OPTAB, \
+				     UNSIGNED_OPTAB, TYPE)		\
+    case IFN_##CODE:							\
+      {									\
+	optab which_optab = (TYPE_UNSIGNED (types.SELECTOR)		\
+			     ? UNSIGNED_OPTAB ## _optab			\
+			     : SIGNED_OPTAB ## _optab);			\
+	return direct_##TYPE##_optab_supported_p (which_optab, types,	\
+						  opt_type);		\
+      }
 #include "internal-fn.def"
 
     case IFN_LAST:
@@ -2977,6 +3013,15 @@ #define DEF_INTERNAL_OPTAB_FN(CODE, FLAG
   {							\
     expand_##TYPE##_optab_fn (fn, stmt, OPTAB##_optab);	\
   }
+#define DEF_INTERNAL_SIGNED_OPTAB_FN(CODE, FLAGS, SELECTOR, SIGNED_OPTAB, \
+				     UNSIGNED_OPTAB, TYPE)		\
+  static void								\
+  expand_##CODE (internal_fn fn, gcall *stmt)				\
+  {									\
+    tree_pair types = direct_internal_fn_types (fn, stmt);		\
+    optab which_optab = direct_internal_fn_optab (fn, types);		\
+    expand_##TYPE##_optab_fn (fn, stmt, which_optab);			\
+  }
 #include "internal-fn.def"
 
 /* Routines to expand each internal function, indexed by function number.
Index: gcc/fold-const-call.c
===================================================================
--- gcc/fold-const-call.c	2017-11-01 08:07:13.156996103 +0000
+++ gcc/fold-const-call.c	2017-11-21 16:31:49.724928624 +0000
@@ -583,6 +583,25 @@ fold_const_builtin_nan (tree type, tree
   return NULL_TREE;
 }
 
+/* Fold a call to IFN_REDUC_<CODE> (ARG), returning a value of type TYPE.  */
+
+static tree
+fold_const_reduction (tree type, tree arg, tree_code code)
+{
+  if (TREE_CODE (arg) != VECTOR_CST)
+    return NULL_TREE;
+
+  tree res = VECTOR_CST_ELT (arg, 0);
+  unsigned int nelts = VECTOR_CST_NELTS (arg);
+  for (unsigned int i = 1; i < nelts; i++)
+    {
+      res = const_binop (code, type, res, VECTOR_CST_ELT (arg, i));
+      if (res == NULL_TREE || !CONSTANT_CLASS_P (res))
+	return NULL_TREE;
+    }
+  return res;
+}
+
 /* Try to evaluate:
 
       *RESULT = FN (*ARG)
@@ -1148,6 +1167,24 @@ fold_const_call (combined_fn fn, tree ty
     CASE_FLT_FN_FLOATN_NX (CFN_BUILT_IN_NANS):
       return fold_const_builtin_nan (type, arg, false);
 
+    case CFN_REDUC_PLUS:
+      return fold_const_reduction (type, arg, PLUS_EXPR);
+
+    case CFN_REDUC_MAX:
+      return fold_const_reduction (type, arg, MAX_EXPR);
+
+    case CFN_REDUC_MIN:
+      return fold_const_reduction (type, arg, MIN_EXPR);
+
+    case CFN_REDUC_AND:
+      return fold_const_reduction (type, arg, BIT_AND_EXPR);
+
+    case CFN_REDUC_IOR:
+      return fold_const_reduction (type, arg, BIT_IOR_EXPR);
+
+    case CFN_REDUC_XOR:
+      return fold_const_reduction (type, arg, BIT_XOR_EXPR);
+
     default:
       return fold_const_call_1 (fn, type, arg);
     }
Index: gcc/tree-vect-loop.c
===================================================================
--- gcc/tree-vect-loop.c	2017-11-21 16:31:28.695326387 +0000
+++ gcc/tree-vect-loop.c	2017-11-21 16:31:49.728927972 +0000
@@ -2574,52 +2574,51 @@ vect_analyze_loop (struct loop *loop, lo
 }
 
 
-/* Function reduction_code_for_scalar_code
+/* Function reduction_fn_for_scalar_code
 
    Input:
    CODE - tree_code of a reduction operations.
 
    Output:
-   REDUC_CODE - the corresponding tree-code to be used to reduce the
-      vector of partial results into a single scalar result, or ERROR_MARK
+   REDUC_FN - the corresponding internal function to be used to reduce the
+      vector of partial results into a single scalar result, or IFN_LAST
       if the operation is a supported reduction operation, but does not have
-      such a tree-code.
+      such an internal function.
 
    Return FALSE if CODE currently cannot be vectorized as reduction.  */
 
 static bool
-reduction_code_for_scalar_code (enum tree_code code,
-                                enum tree_code *reduc_code)
+reduction_fn_for_scalar_code (enum tree_code code, internal_fn *reduc_fn)
 {
   switch (code)
     {
       case MAX_EXPR:
-        *reduc_code = REDUC_MAX_EXPR;
+        *reduc_fn = IFN_REDUC_MAX;
         return true;
 
       case MIN_EXPR:
-        *reduc_code = REDUC_MIN_EXPR;
+        *reduc_fn = IFN_REDUC_MIN;
         return true;
 
       case PLUS_EXPR:
-        *reduc_code = REDUC_PLUS_EXPR;
+        *reduc_fn = IFN_REDUC_PLUS;
         return true;
 
       case BIT_AND_EXPR:
-	*reduc_code = REDUC_AND_EXPR;
+	*reduc_fn = IFN_REDUC_AND;
 	return true;
 
       case BIT_IOR_EXPR:
-	*reduc_code = REDUC_IOR_EXPR;
+	*reduc_fn = IFN_REDUC_IOR;
 	return true;
 
       case BIT_XOR_EXPR:
-	*reduc_code = REDUC_XOR_EXPR;
+	*reduc_fn = IFN_REDUC_XOR;
 	return true;
 
       case MULT_EXPR:
       case MINUS_EXPR:
-        *reduc_code = ERROR_MARK;
+        *reduc_fn = IFN_LAST;
         return true;
 
       default:
@@ -4029,7 +4028,7 @@ have_whole_vector_shift (machine_mode mo
    the loop, and the epilogue code that must be generated.  */
 
 static void
-vect_model_reduction_cost (stmt_vec_info stmt_info, enum tree_code reduc_code,
+vect_model_reduction_cost (stmt_vec_info stmt_info, internal_fn reduc_fn,
 			   int ncopies)
 {
   int prologue_cost = 0, epilogue_cost = 0, inside_cost;
@@ -4097,7 +4096,7 @@ vect_model_reduction_cost (stmt_vec_info
 
   if (!loop || !nested_in_vect_loop_p (loop, orig_stmt))
     {
-      if (reduc_code != ERROR_MARK)
+      if (reduc_fn != IFN_LAST)
 	{
 	  if (reduction_type == COND_REDUCTION)
 	    {
@@ -4581,7 +4580,7 @@ get_initial_defs_for_reduction (slp_tree
      we have to generate more than one vector stmt - i.e - we need to "unroll"
      the vector stmt by a factor VF/nunits.  For more details see documentation
      in vectorizable_operation.
-   REDUC_CODE is the tree-code for the epilog reduction.
+   REDUC_FN is the internal function for the epilog reduction.
    REDUCTION_PHIS is a list of the phi-nodes that carry the reduction 
      computation.
    REDUC_INDEX is the index of the operand in the right hand side of the 
@@ -4599,7 +4598,7 @@ get_initial_defs_for_reduction (slp_tree
       The loop-latch argument is taken from VECT_DEFS - the vector of partial 
       sums.
    2. "Reduces" each vector of partial results VECT_DEFS into a single result,
-      by applying the operation specified by REDUC_CODE if available, or by 
+      by calling the function specified by REDUC_FN if available, or by
       other means (whole-vector shifts or a scalar loop).
       The function also creates a new phi node at the loop exit to preserve
       loop-closed form, as illustrated below.
@@ -4634,7 +4633,7 @@ get_initial_defs_for_reduction (slp_tree
 static void
 vect_create_epilog_for_reduction (vec<tree> vect_defs, gimple *stmt,
 				  gimple *reduc_def_stmt,
-				  int ncopies, enum tree_code reduc_code,
+				  int ncopies, internal_fn reduc_fn,
 				  vec<gimple *> reduction_phis,
                                   bool double_reduc, 
 				  slp_tree slp_node,
@@ -4885,7 +4884,7 @@ vect_create_epilog_for_reduction (vec<tr
         step 3: adjust the scalar result (s_out3) if needed.
 
         Step 1 can be accomplished using one the following three schemes:
-          (scheme 1) using reduc_code, if available.
+          (scheme 1) using reduc_fn, if available.
           (scheme 2) using whole-vector shifts, if available.
           (scheme 3) using a scalar loop. In this case steps 1+2 above are
                      combined.
@@ -4965,7 +4964,7 @@ vect_create_epilog_for_reduction (vec<tr
   exit_gsi = gsi_after_labels (exit_bb);
 
   /* 2.2 Get the relevant tree-code to use in the epilog for schemes 2,3
-         (i.e. when reduc_code is not available) and in the final adjustment
+         (i.e. when reduc_fn is not available) and in the final adjustment
 	 code (if needed).  Also get the original scalar reduction variable as
          defined in the loop.  In case STMT is a "pattern-stmt" (i.e. - it
          represents a reduction pattern), the tree-code and scalar-def are
@@ -5017,7 +5016,7 @@ vect_create_epilog_for_reduction (vec<tr
 
   /* True if we should implement SLP_REDUC using native reduction operations
      instead of scalar operations.  */
-  direct_slp_reduc = (reduc_code != ERROR_MARK
+  direct_slp_reduc = (reduc_fn != IFN_LAST
 		      && slp_reduc
 		      && !TYPE_VECTOR_SUBPARTS (vectype).is_constant ());
 
@@ -5077,7 +5076,7 @@ vect_create_epilog_for_reduction (vec<tr
     new_phi_result = PHI_RESULT (new_phis[0]);
 
   if (STMT_VINFO_VEC_REDUCTION_TYPE (stmt_info) == COND_REDUCTION
-      && reduc_code != ERROR_MARK)
+      && reduc_fn != IFN_LAST)
     {
       /* For condition reductions, we have a vector (NEW_PHI_RESULT) containing
 	 various data values where the condition matched and another vector
@@ -5115,8 +5114,9 @@ vect_create_epilog_for_reduction (vec<tr
 
       /* Find maximum value from the vector of found indexes.  */
       tree max_index = make_ssa_name (index_scalar_type);
-      gimple *max_index_stmt = gimple_build_assign (max_index, REDUC_MAX_EXPR,
-						    induction_index);
+      gcall *max_index_stmt = gimple_build_call_internal (IFN_REDUC_MAX,
+							  1, induction_index);
+      gimple_call_set_lhs (max_index_stmt, max_index);
       gsi_insert_before (&exit_gsi, max_index_stmt, GSI_SAME_STMT);
 
       /* Vector of {max_index, max_index, max_index,...}.  */
@@ -5171,13 +5171,9 @@ vect_create_epilog_for_reduction (vec<tr
 
       /* Reduce down to a scalar value.  */
       tree data_reduc = make_ssa_name (scalar_type_unsigned);
-      optab ot = optab_for_tree_code (REDUC_MAX_EXPR, vectype_unsigned,
-				      optab_default);
-      gcc_assert (optab_handler (ot, TYPE_MODE (vectype_unsigned))
-		  != CODE_FOR_nothing);
-      gimple *data_reduc_stmt = gimple_build_assign (data_reduc,
-						     REDUC_MAX_EXPR,
-						     vec_cond_cast);
+      gcall *data_reduc_stmt = gimple_build_call_internal (IFN_REDUC_MAX,
+							   1, vec_cond_cast);
+      gimple_call_set_lhs (data_reduc_stmt, data_reduc);
       gsi_insert_before (&exit_gsi, data_reduc_stmt, GSI_SAME_STMT);
 
       /* Convert the reduced value back to the result type and set as the
@@ -5189,9 +5185,9 @@ vect_create_epilog_for_reduction (vec<tr
       scalar_results.safe_push (new_temp);
     }
   else if (STMT_VINFO_VEC_REDUCTION_TYPE (stmt_info) == COND_REDUCTION
-	   && reduc_code == ERROR_MARK)
+	   && reduc_fn == IFN_LAST)
     {
-      /* Condition redution without supported REDUC_MAX_EXPR.  Generate
+      /* Condition reduction without supported IFN_REDUC_MAX.  Generate
 	 idx = 0;
          idx_val = induction_index[0];
 	 val = data_reduc[0];
@@ -5264,7 +5260,7 @@ vect_create_epilog_for_reduction (vec<tr
   /* 2.3 Create the reduction code, using one of the three schemes described
          above. In SLP we simply need to extract all the elements from the 
          vector (without reducing them), so we use scalar shifts.  */
-  else if (reduc_code != ERROR_MARK && !slp_reduc)
+  else if (reduc_fn != IFN_LAST && !slp_reduc)
     {
       tree tmp;
       tree vec_elem_type;
@@ -5279,22 +5275,27 @@ vect_create_epilog_for_reduction (vec<tr
       vec_elem_type = TREE_TYPE (TREE_TYPE (new_phi_result));
       if (!useless_type_conversion_p (scalar_type, vec_elem_type))
 	{
-          tree tmp_dest =
-	      vect_create_destination_var (scalar_dest, vec_elem_type);
-	  tmp = build1 (reduc_code, vec_elem_type, new_phi_result);
-	  epilog_stmt = gimple_build_assign (tmp_dest, tmp);
+	  tree tmp_dest
+	    = vect_create_destination_var (scalar_dest, vec_elem_type);
+	  epilog_stmt = gimple_build_call_internal (reduc_fn, 1,
+						    new_phi_result);
+	  gimple_set_lhs (epilog_stmt, tmp_dest);
 	  new_temp = make_ssa_name (tmp_dest, epilog_stmt);
-	  gimple_assign_set_lhs (epilog_stmt, new_temp);
+	  gimple_set_lhs (epilog_stmt, new_temp);
 	  gsi_insert_before (&exit_gsi, epilog_stmt, GSI_SAME_STMT);
 
-	  tmp = build1 (NOP_EXPR, scalar_type, new_temp);
+	  epilog_stmt = gimple_build_assign (new_scalar_dest, NOP_EXPR,
+					     new_temp);
 	}
       else
-	tmp = build1 (reduc_code, scalar_type, new_phi_result);
+	{
+	  epilog_stmt = gimple_build_call_internal (reduc_fn, 1,
+						    new_phi_result);
+	  gimple_set_lhs (epilog_stmt, new_scalar_dest);
+	}
 
-      epilog_stmt = gimple_build_assign (new_scalar_dest, tmp);
       new_temp = make_ssa_name (new_scalar_dest, epilog_stmt);
-      gimple_assign_set_lhs (epilog_stmt, new_temp);
+      gimple_set_lhs (epilog_stmt, new_temp);
       gsi_insert_before (&exit_gsi, epilog_stmt, GSI_SAME_STMT);
 
       if (STMT_VINFO_VEC_REDUCTION_TYPE (stmt_info)
@@ -5383,8 +5384,10 @@ vect_create_epilog_for_reduction (vec<tr
 				   sel, new_phi_result, vector_identity);
 
 	  /* Do the reduction and convert it to the appropriate type.  */
-	  tree scalar = gimple_build (&seq, reduc_code,
-				      TREE_TYPE (vectype), vec);
+	  gcall *call = gimple_build_call_internal (reduc_fn, 1, vec);
+	  tree scalar = make_ssa_name (TREE_TYPE (vectype));
+	  gimple_call_set_lhs (call, scalar);
+	  gimple_seq_add_stmt (&seq, call);
 	  scalar = gimple_convert (&seq, scalar_type, scalar);
 	  scalar_results.safe_push (scalar);
 	}
@@ -5992,10 +5995,11 @@ vectorizable_reduction (gimple *stmt, gi
   tree vectype_in = NULL_TREE;
   loop_vec_info loop_vinfo = STMT_VINFO_LOOP_VINFO (stmt_info);
   struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
-  enum tree_code code, orig_code, epilog_reduc_code;
+  enum tree_code code, orig_code;
+  internal_fn reduc_fn;
   machine_mode vec_mode;
   int op_type;
-  optab optab, reduc_optab;
+  optab optab;
   tree new_temp = NULL_TREE;
   gimple *def_stmt;
   enum vect_def_type dt, cond_reduc_dt = vect_unknown_def_type;
@@ -6552,31 +6556,23 @@ vectorizable_reduction (gimple *stmt, gi
         double_reduc = true;
     }
 
-  epilog_reduc_code = ERROR_MARK;
+  reduc_fn = IFN_LAST;
 
   if (reduction_type == TREE_CODE_REDUCTION
       || reduction_type == INTEGER_INDUC_COND_REDUCTION
       || reduction_type == CONST_COND_REDUCTION)
     {
-      if (reduction_code_for_scalar_code (orig_code, &epilog_reduc_code))
+      if (reduction_fn_for_scalar_code (orig_code, &reduc_fn))
 	{
-	  reduc_optab = optab_for_tree_code (epilog_reduc_code, vectype_out,
-                                         optab_default);
-	  if (!reduc_optab)
-	    {
-	      if (dump_enabled_p ())
-		dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-				 "no optab for reduction.\n");
-
-	      epilog_reduc_code = ERROR_MARK;
-	    }
-	  else if (optab_handler (reduc_optab, vec_mode) == CODE_FOR_nothing)
+	  if (reduc_fn != IFN_LAST
+	      && !direct_internal_fn_supported_p (reduc_fn, vectype_out,
+						  OPTIMIZE_FOR_SPEED))
 	    {
 	      if (dump_enabled_p ())
 		dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
 				 "reduc op not supported by target.\n");
 
-	      epilog_reduc_code = ERROR_MARK;
+	      reduc_fn = IFN_LAST;
 	    }
 	}
       else
@@ -6599,15 +6595,13 @@ vectorizable_reduction (gimple *stmt, gi
       cr_index_vector_type = build_vector_type (cr_index_scalar_type,
 						nunits_out);
 
-      optab = optab_for_tree_code (REDUC_MAX_EXPR, cr_index_vector_type,
-				   optab_default);
-      if (optab_handler (optab, TYPE_MODE (cr_index_vector_type))
-	  != CODE_FOR_nothing)
-	epilog_reduc_code = REDUC_MAX_EXPR;
+      if (direct_internal_fn_supported_p (IFN_REDUC_MAX, cr_index_vector_type,
+					  OPTIMIZE_FOR_SPEED))
+	reduc_fn = IFN_REDUC_MAX;
     }
 
   if (reduction_type != EXTRACT_LAST_REDUCTION
-      && epilog_reduc_code == ERROR_MARK
+      && reduc_fn == IFN_LAST
       && !nunits_out.is_constant ())
     {
       if (dump_enabled_p ())
@@ -6804,7 +6798,7 @@ vectorizable_reduction (gimple *stmt, gi
   if (!vec_stmt) /* transformation not required.  */
     {
       if (first_p)
-	vect_model_reduction_cost (stmt_info, epilog_reduc_code, ncopies);
+	vect_model_reduction_cost (stmt_info, reduc_fn, ncopies);
       if (loop_vinfo && LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo))
 	{
 	  if (cond_fn == IFN_LAST
@@ -7008,8 +7002,7 @@ vectorizable_reduction (gimple *stmt, gi
     vect_defs[0] = gimple_get_lhs (*vec_stmt);
 
   vect_create_epilog_for_reduction (vect_defs, stmt, reduc_def_stmt,
-				    epilog_copies,
-                                    epilog_reduc_code, phis,
+				    epilog_copies, reduc_fn, phis,
 				    double_reduc, slp_node, slp_node_instance,
 				    neutral_op);
 
Index: gcc/config/aarch64/aarch64-builtins.c
===================================================================
--- gcc/config/aarch64/aarch64-builtins.c	2017-11-21 16:30:57.913175994 +0000
+++ gcc/config/aarch64/aarch64-builtins.c	2017-11-21 16:31:49.722928949 +0000
@@ -1601,24 +1601,27 @@ aarch64_gimple_fold_builtin (gimple_stmt
 			? gimple_call_arg_ptr (stmt, 0)
 			: &error_mark_node);
 
-	  /* We use gimple's REDUC_(PLUS|MIN|MAX)_EXPRs for float, signed int
+	  /* We use gimple's IFN_REDUC_(PLUS|MIN|MAX)s for float, signed int
 	     and unsigned int; it will distinguish according to the types of
 	     the arguments to the __builtin.  */
 	  switch (fcode)
 	    {
 	      BUILTIN_VALL (UNOP, reduc_plus_scal_, 10)
-	        new_stmt = gimple_build_assign (gimple_call_lhs (stmt),
-						REDUC_PLUS_EXPR, args[0]);
+	        new_stmt = gimple_build_call_internal (IFN_REDUC_PLUS,
+						       1, args[0]);
+		gimple_call_set_lhs (new_stmt, gimple_call_lhs (stmt));
 		break;
 	      BUILTIN_VDQIF (UNOP, reduc_smax_scal_, 10)
 	      BUILTIN_VDQ_BHSI (UNOPU, reduc_umax_scal_, 10)
-		new_stmt = gimple_build_assign (gimple_call_lhs (stmt),
-						REDUC_MAX_EXPR, args[0]);
+	        new_stmt = gimple_build_call_internal (IFN_REDUC_MAX,
+						       1, args[0]);
+		gimple_call_set_lhs (new_stmt, gimple_call_lhs (stmt));
 		break;
 	      BUILTIN_VDQIF (UNOP, reduc_smin_scal_, 10)
 	      BUILTIN_VDQ_BHSI (UNOPU, reduc_umin_scal_, 10)
-		new_stmt = gimple_build_assign (gimple_call_lhs (stmt),
-						REDUC_MIN_EXPR, args[0]);
+	        new_stmt = gimple_build_call_internal (IFN_REDUC_MIN,
+						       1, args[0]);
+		gimple_call_set_lhs (new_stmt, gimple_call_lhs (stmt));
 		break;
 	      BUILTIN_GPF (BINOP, fmulx, 0)
 		{
Index: gcc/config/aarch64/aarch64-simd.md
===================================================================
--- gcc/config/aarch64/aarch64-simd.md	2017-11-21 16:30:57.914175832 +0000
+++ gcc/config/aarch64/aarch64-simd.md	2017-11-21 16:31:49.723928786 +0000
@@ -2338,7 +2338,7 @@ (define_insn "popcount<mode>2"
 ;; 'across lanes' max and min ops.
 
 ;; Template for outputting a scalar, so we can create __builtins which can be
-;; gimple_fold'd to the REDUC_(MAX|MIN)_EXPR tree code.  (This is FP smax/smin).
+;; gimple_fold'd to the IFN_REDUC_(MAX|MIN) function.  (This is FP smax/smin).
 (define_expand "reduc_<maxmin_uns>_scal_<mode>"
   [(match_operand:<VEL> 0 "register_operand")
    (unspec:VHSDF [(match_operand:VHSDF 1 "register_operand")]

Richard Sandiford Nov. 21, 2017, 4:45 p.m. UTC | #5

Richard Biener <richard.guenther@gmail.com> writes:
> On Mon, Nov 20, 2017 at 1:54 PM, Richard Sandiford
> <richard.sandiford@linaro.org> wrote:
>> Richard Biener <richard.guenther@gmail.com> writes:
>>> On Fri, Nov 17, 2017 at 5:53 PM, Richard Sandiford
>>> <richard.sandiford@linaro.org> wrote:
>>>> This patch adds support for in-order floating-point addition reductions,
>>>> which are suitable even in strict IEEE mode.
>>>>
>>>> Previously vect_is_simple_reduction would reject any cases that forbid
>>>> reassociation.  The idea is instead to tentatively accept them as
>>>> "FOLD_LEFT_REDUCTIONs" and only fail later if there is no target
>>>> support for them.  Although this patch only handles the particular
>>>> case of plus and minus on floating-point types, there's no reason in
>>>> principle why targets couldn't handle other cases.
>>>>
>>>> The vect_force_simple_reduction change makes it simpler for parloops
>>>> to read the type of reduction.
>>>>
>>>> Tested on aarch64-linux-gnu (with and without SVE), x86_64-linux-gnu
>>>> and powerpc64le-linux-gnu.  OK to install?
>>>
>>> I don't like that you add a new tree code for this.  A new IFN looks more
>>> suitable to me.
>>
>> OK.
>
> Thanks.  I'd like to eventually get rid of other vectorizer tree codes as well,
> like the REDUC_*_EXPR, DOT_PROD_EXPR and SAD_EXPR.  IFNs
> are now really the way to go for "target instructions on GIMPLE".
>
>>> Also I think if there's a way to handle this correctly with target support
>>> you can also implement a fallback if there is no such support increasing
>>> test coverage.  It would basically boil down to extracting all scalars from
>>> the non-reduction operand vector and performing a series of reduction
>>> ops, keeping the reduction PHI scalar.  This would also support any
>>> reduction operator.
>>
>> Yeah, but without target support, that's probably going to be expensive.
>> It's a bit like how we can implement element-by-element loads and stores
>> for cases that don't have target support, but had to explicitly disable
>> that in many cases, since the cost model was too optimistic.
>
> I expect that for V2DF or even V4DF it might be profitable in quite a number
> of cases.  V2DF definitely.
>
>> I can give it a go anyway if you think it's worth it.
>
> I think it is.


OK, here's 2/3.  It just splits out some code for reuse in 3/3.

Tested as before.
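
For readers following along, the split can be pictured with plain C arrays
standing in for GIMPLE vectors.  This is only a sketch under that assumption:
extract_elements, expand_fold_left, binop_fn and op_plus are hypothetical
names used for illustration, not the functions in the patch, which build
BIT_FIELD_REF extractions and scalar assignments instead.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a binary tree code such as PLUS_EXPR.  */
typedef double (*binop_fn) (double, double);

double
op_plus (double a, double b)
{
  return a + b;
}

/* Copy the NELTS elements of VECTOR into SCALARS, lowest element first.
   This models extracting each element into its own scalar.  */
void
extract_elements (const double *vector, size_t nelts, double *scalars)
{
  for (size_t i = 0; i < nelts; i++)
    scalars[i] = vector[i];
}

/* Successively apply OP to each element of VECTOR in left-to-right order.
   Start with *LHS if LHS is nonnull, otherwise start with the first
   element of VECTOR.  The limit of 64 elements is an assumption made
   for this sketch.  */
double
expand_fold_left (binop_fn op, const double *lhs,
		  const double *vector, size_t nelts)
{
  double scalars[64];
  assert (nelts > 0 && nelts <= 64);
  extract_elements (vector, nelts, scalars);

  size_t i = 0;
  double acc = lhs ? *lhs : scalars[i++];
  for (; i < nelts; i++)
    acc = op (acc, scalars[i]);
  return acc;
}
```

With v = {1.0, 2.0, 3.0, 4.0}, calling expand_fold_left (op_plus, NULL, v, 4)
folds the elements starting from v[0], while passing a pointer to an initial
value folds them onto that value.  That mirrors the "LHS if nonnull, otherwise
the first element" choice described for vect_expand_fold_left.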

Thanks,
Richard


2017-11-21  Richard Sandiford  <richard.sandiford@linaro.org>

gcc/
	* tree-vect-loop.c (vect_extract_elements, vect_expand_fold_left): New
	functions, split out from...
	(vect_create_epilog_for_reduction): ...here.

Index: gcc/tree-vect-loop.c
===================================================================
--- gcc/tree-vect-loop.c	2017-11-21 16:31:49.728927972 +0000
+++ gcc/tree-vect-loop.c	2017-11-21 16:43:13.061221251 +0000
@@ -4566,6 +4566,65 @@ get_initial_defs_for_reduction (slp_tree
     }
 }
 
+/* Extract all the elements of VECTOR into SCALAR_RESULTS, inserting
+   the extraction statements before GSI.  Associate the new scalar
+   SSA names with variable SCALAR_DEST.  */
+
+static void
+vect_extract_elements (gimple_stmt_iterator *gsi, vec<tree> *scalar_results,
+		       tree scalar_dest, tree vector)
+{
+  tree vectype = TREE_TYPE (vector);
+  tree scalar_type = TREE_TYPE (vectype);
+  tree bitsize = TYPE_SIZE (scalar_type);
+  unsigned HOST_WIDE_INT vec_size_in_bits = tree_to_uhwi (TYPE_SIZE (vectype));
+  unsigned HOST_WIDE_INT element_bitsize = tree_to_uhwi (bitsize);
+
+  for (unsigned HOST_WIDE_INT bit_offset = 0;
+       bit_offset < vec_size_in_bits;
+       bit_offset += element_bitsize)
+    {
+      tree bitpos = bitsize_int (bit_offset);
+      tree rhs = build3 (BIT_FIELD_REF, scalar_type, vector, bitsize, bitpos);
+
+      gassign *stmt = gimple_build_assign (scalar_dest, rhs);
+      tree new_name = make_ssa_name (scalar_dest, stmt);
+      gimple_assign_set_lhs (stmt, new_name);
+      gsi_insert_before (gsi, stmt, GSI_SAME_STMT);
+
+      scalar_results->safe_push (new_name);
+    }
+}
+
+/* Successively apply CODE to each element of VECTOR_RHS, in left-to-right
+   order.  Start with LHS if LHS is nonnull, otherwise start with the first
+   element of VECTOR_RHS.  Insert the extraction statements before GSI and
+   associate the new scalar SSA names with variable SCALAR_DEST.
+   Return the SSA name for the result.  */
+
+static tree
+vect_expand_fold_left (gimple_stmt_iterator *gsi, tree scalar_dest,
+		       tree_code code, tree lhs, tree vector_rhs)
+{
+  auto_vec<tree, 64> scalar_results;
+  vect_extract_elements (gsi, &scalar_results, scalar_dest, vector_rhs);
+  tree rhs;
+  unsigned int i;
+  FOR_EACH_VEC_ELT (scalar_results, i, rhs)
+    {
+      if (lhs == NULL_TREE)
+	lhs = rhs;
+      else
+	{
+	  gassign *stmt = gimple_build_assign (scalar_dest, code, lhs, rhs);
+	  tree new_name = make_ssa_name (scalar_dest, stmt);
+	  gimple_assign_set_lhs (stmt, new_name);
+	  gsi_insert_before (gsi, stmt, GSI_SAME_STMT);
+	  lhs = new_name;
+	}
+    }
+  return lhs;
+}
 
 /* Function vect_create_epilog_for_reduction
 
@@ -5499,52 +5558,16 @@ vect_create_epilog_for_reduction (vec<tr
           vec_size_in_bits = tree_to_uhwi (TYPE_SIZE (vectype));
           FOR_EACH_VEC_ELT (new_phis, i, new_phi)
             {
-              int bit_offset;
               if (gimple_code (new_phi) == GIMPLE_PHI)
                 vec_temp = PHI_RESULT (new_phi);
               else
                 vec_temp = gimple_assign_lhs (new_phi);
-              tree rhs = build3 (BIT_FIELD_REF, scalar_type, vec_temp, bitsize,
-                            bitsize_zero_node);
-              epilog_stmt = gimple_build_assign (new_scalar_dest, rhs);
-              new_temp = make_ssa_name (new_scalar_dest, epilog_stmt);
-              gimple_assign_set_lhs (epilog_stmt, new_temp);
-              gsi_insert_before (&exit_gsi, epilog_stmt, GSI_SAME_STMT);
-
-              /* In SLP we don't need to apply reduction operation, so we just
-                 collect s' values in SCALAR_RESULTS.  */
-              if (slp_reduc)
-                scalar_results.safe_push (new_temp);
-
-              for (bit_offset = element_bitsize;
-                   bit_offset < vec_size_in_bits;
-                   bit_offset += element_bitsize)
-                {
-                  tree bitpos = bitsize_int (bit_offset);
-                  tree rhs = build3 (BIT_FIELD_REF, scalar_type, vec_temp,
-                                     bitsize, bitpos);
-
-                  epilog_stmt = gimple_build_assign (new_scalar_dest, rhs);
-                  new_name = make_ssa_name (new_scalar_dest, epilog_stmt);
-                  gimple_assign_set_lhs (epilog_stmt, new_name);
-                  gsi_insert_before (&exit_gsi, epilog_stmt, GSI_SAME_STMT);
-
-                  if (slp_reduc)
-                    {
-                      /* In SLP we don't need to apply reduction operation, so 
-                         we just collect s' values in SCALAR_RESULTS.  */
-                      new_temp = new_name;
-                      scalar_results.safe_push (new_name);
-                    }
-                  else
-                    {
-		      epilog_stmt = gimple_build_assign (new_scalar_dest, code,
-							 new_name, new_temp);
-                      new_temp = make_ssa_name (new_scalar_dest, epilog_stmt);
-                      gimple_assign_set_lhs (epilog_stmt, new_temp);
-                      gsi_insert_before (&exit_gsi, epilog_stmt, GSI_SAME_STMT);
-                    }
-                }
+	      if (slp_reduc)
+		vect_extract_elements (&exit_gsi, &scalar_results,
+				       new_scalar_dest, vec_temp);
+	      else
+		new_temp = vect_expand_fold_left (&exit_gsi, new_scalar_dest,
+						  code, NULL_TREE, vec_temp);
             }
 
           /* The only case where we need to reduce scalar results in SLP, is

Jeff Law Nov. 21, 2017, 4:50 p.m. UTC | #6

On 11/21/2017 09:45 AM, Richard Sandiford wrote:
> Richard Biener <richard.guenther@gmail.com> writes:
>> On Mon, Nov 20, 2017 at 1:54 PM, Richard Sandiford
>> <richard.sandiford@linaro.org> wrote:
>>> Richard Biener <richard.guenther@gmail.com> writes:
>>>> On Fri, Nov 17, 2017 at 5:53 PM, Richard Sandiford
>>>> <richard.sandiford@linaro.org> wrote:
>>>>> This patch adds support for in-order floating-point addition reductions,
>>>>> which are suitable even in strict IEEE mode.
>>>>>
>>>>> Previously vect_is_simple_reduction would reject any cases that forbid
>>>>> reassociation.  The idea is instead to tentatively accept them as
>>>>> "FOLD_LEFT_REDUCTIONs" and only fail later if there is no target
>>>>> support for them.  Although this patch only handles the particular
>>>>> case of plus and minus on floating-point types, there's no reason in
>>>>> principle why targets couldn't handle other cases.
>>>>>
>>>>> The vect_force_simple_reduction change makes it simpler for parloops
>>>>> to read the type of reduction.
>>>>>
>>>>> Tested on aarch64-linux-gnu (with and without SVE), x86_64-linux-gnu
>>>>> and powerpc64le-linux-gnu.  OK to install?
>>>>
>>>> I don't like that you add a new tree code for this.  A new IFN looks more
>>>> suitable to me.
>>>
>>> OK.
>>
>> Thanks.  I'd like to eventually get rid of other vectorizer tree codes as well,
>> like the REDUC_*_EXPR, DOT_PROD_EXPR and SAD_EXPR.  IFNs
>> are now really the way to go for "target instructions on GIMPLE".
>>
>>>> Also I think if there's a way to handle this correctly with target support
>>>> you can also implement a fallback if there is no such support increasing
>>>> test coverage.  It would basically boil down to extracting all scalars from
>>>> the non-reduction operand vector and performing a series of reduction
>>>> ops, keeping the reduction PHI scalar.  This would also support any
>>>> reduction operator.
>>>
>>> Yeah, but without target support, that's probably going to be expensive.
>>> It's a bit like how we can implement element-by-element loads and stores
>>> for cases that don't have target support, but had to explicitly disable
>>> that in many cases, since the cost model was too optimistic.
>>
>> I expect that for V2DF or even V4DF it might be profitable in quite a number
>> of cases.  V2DF definitely.
>>
>>> I can give it a go anyway if you think it's worth it.
>>
>> I think it is.
>
> OK, here's 2/3.  It just splits out some code for reuse in 3/3.

[ ... ]
Is this going to obsolete any of the stuff posted to date?  I'm thinking
specifically about "Add support for bitwise reductions", but perhaps
there are others.

Jeff

Richard Sandiford Nov. 21, 2017, 5:08 p.m. UTC | #7

Richard Biener <richard.guenther@gmail.com> writes:
> On Mon, Nov 20, 2017 at 1:54 PM, Richard Sandiford
> <richard.sandiford@linaro.org> wrote:
>> Richard Biener <richard.guenther@gmail.com> writes:
>>> On Fri, Nov 17, 2017 at 5:53 PM, Richard Sandiford
>>> <richard.sandiford@linaro.org> wrote:
>>>> This patch adds support for in-order floating-point addition reductions,
>>>> which are suitable even in strict IEEE mode.
>>>>
>>>> Previously vect_is_simple_reduction would reject any cases that forbid
>>>> reassociation.  The idea is instead to tentatively accept them as
>>>> "FOLD_LEFT_REDUCTIONs" and only fail later if there is no target
>>>> support for them.  Although this patch only handles the particular
>>>> case of plus and minus on floating-point types, there's no reason in
>>>> principle why targets couldn't handle other cases.
>>>>
>>>> The vect_force_simple_reduction change makes it simpler for parloops
>>>> to read the type of reduction.
>>>>
>>>> Tested on aarch64-linux-gnu (with and without SVE), x86_64-linux-gnu
>>>> and powerpc64le-linux-gnu.  OK to install?
>>>
>>> I don't like that you add a new tree code for this.  A new IFN looks more
>>> suitable to me.
>>
>> OK.
>
> Thanks.  I'd like to eventually get rid of other vectorizer tree codes as well,
> like the REDUC_*_EXPR, DOT_PROD_EXPR and SAD_EXPR.  IFNs
> are now really the way to go for "target instructions on GIMPLE".
>
>>> Also I think if there's a way to handle this correctly with target support
>>> you can also implement a fallback if there is no such support increasing
>>> test coverage.  It would basically boil down to extracting all scalars from
>>> the non-reduction operand vector and performing a series of reduction
>>> ops, keeping the reduction PHI scalar.  This would also support any
>>> reduction operator.
>>
>> Yeah, but without target support, that's probably going to be expensive.
>> It's a bit like how we can implement element-by-element loads and stores
>> for cases that don't have target support, but had to explicitly disable
>> that in many cases, since the cost model was too optimistic.
>
> I expect that for V2DF or even V4DF it might be profitable in quite a number
> of cases.  V2DF definitely.
>
>> I can give it a go anyway if you think it's worth it.
>
> I think it is.


OK, done in the patch below.  Tested as before.
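
As a small illustration of why such reductions cannot simply be
reassociated, the sketch below compares a strictly in-order sum with a
lane-wise sum for an assumed vectorization factor of 2.  The names
in_order_sum and lanewise_sum are hypothetical, and the code is ordinary
scalar C assuming IEEE double arithmetic without -ffast-math; it is not
what the vectorizer emits.

```c
#include <assert.h>
#include <stddef.h>

/* Strictly in-order sum: res = res + val[i] for each i in turn,
   with no reassociation.  */
double
in_order_sum (const double *val, size_t n)
{
  double res = 0.0;
  for (size_t i = 0; i < n; i++)
    res = res + val[i];
  return res;
}

/* Reassociated sum for a hypothetical vectorization factor of 2:
   each lane accumulates every second element and the two lanes are
   only combined at the end.  N must be even.  */
double
lanewise_sum (const double *val, size_t n)
{
  assert (n % 2 == 0);
  double lane[2] = { 0.0, 0.0 };
  for (size_t i = 0; i < n; i += 2)
    {
      lane[0] += val[i];
      lane[1] += val[i + 1];
    }
  return lane[0] + lane[1];
}
```

For the input {1e20, 1.0, -1e20, 1.0}, the in-order sum is 1.0, because the
first 1.0 is absorbed by 1e20 before the two large values cancel, whereas the
lane-wise sum is 2.0.  An in-order reduction has to reproduce the first
result.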

Thanks,
Richard


2017-11-21  Richard Sandiford  <richard.sandiford@linaro.org>
	    Alan Hayward  <alan.hayward@arm.com>
	    David Sherwood  <david.sherwood@arm.com>

gcc/
	* optabs.def (fold_left_plus_optab): New optab.
	* doc/md.texi (fold_left_plus_@var{m}): Document.
	* internal-fn.def (IFN_FOLD_LEFT_PLUS): New internal function.
	* internal-fn.c (fold_left_direct): Define.
	(expand_fold_left_optab_fn): Likewise.
	(direct_fold_left_optab_supported_p): Likewise.
	* fold-const-call.c (fold_const_fold_left): New function.
	(fold_const_call): Use it to fold CFN_FOLD_LEFT_PLUS.
	* tree-parloops.c (valid_reduction_p): New function.
	(gather_scalar_reductions): Use it.
	* tree-vectorizer.h (FOLD_LEFT_REDUCTION): New vect_reduction_type.
	(vect_finish_replace_stmt): Declare.
	* tree-vect-loop.c (fold_left_reduction_fn): New function.
	(needs_fold_left_reduction_p): New function, split out from...
	(vect_is_simple_reduction): ...here.  Accept reductions that
	forbid reassociation, but give them type FOLD_LEFT_REDUCTION.
	(vect_force_simple_reduction): Also store the reduction type in
	the assignment's STMT_VINFO_REDUC_TYPE.
	(vect_model_reduction_cost): Handle FOLD_LEFT_REDUCTION.
	(merge_with_identity): New function.
	(vectorize_fold_left_reduction): Likewise.
	(vectorizable_reduction): Handle FOLD_LEFT_REDUCTION.  Leave the
	scalar phi in place for it.  Check for target support and reject
	cases that would reassociate the operation.  Defer the transform
	phase to vectorize_fold_left_reduction.
	* config/aarch64/aarch64.md (UNSPEC_FADDA): New unspec.
	* config/aarch64/aarch64-sve.md (fold_left_plus_<mode>): New expander.
	(*fold_left_plus_<mode>, *pred_fold_left_plus_<mode>): New insns.

gcc/testsuite/
	* gcc.dg/vect/no-fast-math-vect16.c: Expect the test to pass and
	check for a message about using in-order reductions.
	* gcc.dg/vect/pr79920.c: Expect both loops to be vectorized and
	check for a message about using in-order reductions.
	* gcc.dg/vect/trapv-vect-reduc-4.c: Expect all three loops to be
	vectorized and check for a message about using in-order reductions.
	* gcc.dg/vect/vect-reduc-6.c: Expect the loop to be vectorized and
	check for a message about using in-order reductions.
	* gcc.dg/vect/vect-reduc-in-order-1.c: New test.
	* gcc.dg/vect/vect-reduc-in-order-2.c: Likewise.
	* gcc.dg/vect/vect-reduc-in-order-3.c: Likewise.
	* gcc.dg/vect/vect-reduc-in-order-4.c: Likewise.
	* gcc.target/aarch64/sve_reduc_strict_1.c: New test.
	* gcc.target/aarch64/sve_reduc_strict_1_run.c: Likewise.
	* gcc.target/aarch64/sve_reduc_strict_2.c: Likewise.
	* gcc.target/aarch64/sve_reduc_strict_2_run.c: Likewise.
	* gcc.target/aarch64/sve_reduc_strict_3.c: Likewise.
	* gcc.target/aarch64/sve_slp_13.c: Add floating-point types.
	* gfortran.dg/vect/vect-8.f90: Expect 22 loops to be vectorized if
	vect_fold_left_plus.

Index: gcc/optabs.def
===================================================================
--- gcc/optabs.def	2017-11-21 17:06:24.670434749 +0000
+++ gcc/optabs.def	2017-11-21 17:06:25.015421374 +0000
@@ -306,6 +306,7 @@ OPTAB_D (reduc_umin_scal_optab, "reduc_u
 OPTAB_D (reduc_and_scal_optab,  "reduc_and_scal_$a")
 OPTAB_D (reduc_ior_scal_optab,  "reduc_ior_scal_$a")
 OPTAB_D (reduc_xor_scal_optab,  "reduc_xor_scal_$a")
+OPTAB_D (fold_left_plus_optab, "fold_left_plus_$a")
 
 OPTAB_D (extract_last_optab, "extract_last_$a")
 OPTAB_D (fold_extract_last_optab, "fold_extract_last_$a")
Index: gcc/doc/md.texi
===================================================================
--- gcc/doc/md.texi	2017-11-21 17:06:24.670434749 +0000
+++ gcc/doc/md.texi	2017-11-21 17:06:25.014421412 +0000
@@ -5285,6 +5285,14 @@ has mode @var{m} and operands 0 and 1 ha
 one element of @var{m}.  Operand 2 has the usual mask mode for vectors
 of mode @var{m}; see @code{TARGET_VECTORIZE_GET_MASK_MODE}.
 
+@cindex @code{fold_left_plus_@var{m}} instruction pattern
+@item @code{fold_left_plus_@var{m}}
+Take scalar operand 1 and successively add each element from vector
+operand 2.  Store the result in scalar operand 0.  The vector has
+mode @var{m} and the scalars have the mode appropriate for one
+element of @var{m}.  The operation is strictly in-order: there is
+no reassociation.
+
 @cindex @code{sdot_prod@var{m}} instruction pattern
 @item @samp{sdot_prod@var{m}}
 @cindex @code{udot_prod@var{m}} instruction pattern
Index: gcc/internal-fn.def
===================================================================
--- gcc/internal-fn.def	2017-11-21 17:06:24.670434749 +0000
+++ gcc/internal-fn.def	2017-11-21 17:06:25.015421374 +0000
@@ -59,6 +59,8 @@ along with GCC; see the file COPYING3.
 
    - cond_binary: a conditional binary optab, such as add<mode>cc
 
+   - fold_left: for scalar = FN (scalar, vector), keyed off the vector mode
+
    DEF_INTERNAL_SIGNED_OPTAB_FN defines an internal function that
    maps to one of two optabs, depending on the signedness of an input.
    SIGNED_OPTAB and UNSIGNED_OPTAB are the optabs for signed and
@@ -177,6 +179,8 @@ DEF_INTERNAL_OPTAB_FN (EXTRACT_LAST, ECF
 DEF_INTERNAL_OPTAB_FN (FOLD_EXTRACT_LAST, ECF_CONST | ECF_NOTHROW,
 		       fold_extract_last, fold_extract)
 
+DEF_INTERNAL_OPTAB_FN (FOLD_LEFT_PLUS, ECF_CONST | ECF_NOTHROW,
+		       fold_left_plus, fold_left)
 
 /* Unary math functions.  */
 DEF_INTERNAL_FLT_FN (ACOS, ECF_CONST, acos, unary)
Index: gcc/internal-fn.c
===================================================================
--- gcc/internal-fn.c	2017-11-21 17:06:24.670434749 +0000
+++ gcc/internal-fn.c	2017-11-21 17:06:25.015421374 +0000
@@ -92,6 +92,7 @@ #define cond_unary_direct { 1, 1, true }
 #define cond_binary_direct { 1, 1, true }
 #define while_direct { 0, 2, false }
 #define fold_extract_direct { 2, 2, false }
+#define fold_left_direct { 1, 1, false }
 
 const direct_internal_fn_info direct_internal_fn_array[IFN_LAST + 1] = {
 #define DEF_INTERNAL_FN(CODE, FLAGS, FNSPEC) not_direct,
@@ -2839,6 +2840,9 @@ #define expand_cond_binary_optab_fn(FN,
 #define expand_fold_extract_optab_fn(FN, STMT, OPTAB) \
   expand_direct_optab_fn (FN, STMT, OPTAB, 3)
 
+#define expand_fold_left_optab_fn(FN, STMT, OPTAB) \
+  expand_direct_optab_fn (FN, STMT, OPTAB, 2)
+
 /* RETURN_TYPE and ARGS are a return type and argument list that are
    in principle compatible with FN (which satisfies direct_internal_fn_p).
    Return the types that should be used to determine whether the
@@ -2922,6 +2926,7 @@ #define direct_store_lanes_optab_support
 #define direct_mask_store_lanes_optab_supported_p multi_vector_optab_supported_p
 #define direct_while_optab_supported_p convert_optab_supported_p
 #define direct_fold_extract_optab_supported_p direct_optab_supported_p
+#define direct_fold_left_optab_supported_p direct_optab_supported_p
 
 /* Return the optab used by internal function FN.  */
 
Index: gcc/fold-const-call.c
===================================================================
--- gcc/fold-const-call.c	2017-11-21 17:06:24.670434749 +0000
+++ gcc/fold-const-call.c	2017-11-21 17:06:25.014421412 +0000
@@ -1190,6 +1190,25 @@ fold_const_call (combined_fn fn, tree ty
     }
 }
 
+/* Fold a call to IFN_FOLD_LEFT_<CODE> (ARG0, ARG1), returning a value
+   of type TYPE.  */
+
+static tree
+fold_const_fold_left (tree type, tree arg0, tree arg1, tree_code code)
+{
+  if (TREE_CODE (arg1) != VECTOR_CST)
+    return NULL_TREE;
+
+  unsigned int nelts = VECTOR_CST_NELTS (arg1);
+  for (unsigned int i = 0; i < nelts; i++)
+    {
+      arg0 = const_binop (code, type, arg0, VECTOR_CST_ELT (arg1, i));
+      if (arg0 == NULL_TREE || !CONSTANT_CLASS_P (arg0))
+	return NULL_TREE;
+    }
+  return arg0;
+}
+
 /* Try to evaluate:
 
       *RESULT = FN (*ARG0, *ARG1)
@@ -1495,6 +1514,9 @@ fold_const_call (combined_fn fn, tree ty
 	}
       return NULL_TREE;
 
+    case CFN_FOLD_LEFT_PLUS:
+      return fold_const_fold_left (type, arg0, arg1, PLUS_EXPR);
+
     default:
       return fold_const_call_1 (fn, type, arg0, arg1);
     }
Index: gcc/tree-parloops.c
===================================================================
--- gcc/tree-parloops.c	2017-11-21 17:06:24.670434749 +0000
+++ gcc/tree-parloops.c	2017-11-21 17:06:25.017421296 +0000
@@ -2531,6 +2531,19 @@ set_reduc_phi_uids (reduction_info **slo
   return 1;
 }
 
+/* Return true if the type of reduction performed by STMT is suitable
+   for this pass.  */
+
+static bool
+valid_reduction_p (gimple *stmt)
+{
+  /* Parallelization would reassociate the operation, which isn't
+     allowed for in-order reductions.  */
+  stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
+  vect_reduction_type reduc_type = STMT_VINFO_REDUC_TYPE (stmt_info);
+  return reduc_type != FOLD_LEFT_REDUCTION;
+}
+
 /* Detect all reductions in the LOOP, insert them into REDUCTION_LIST.  */
 
 static void
@@ -2564,7 +2577,7 @@ gather_scalar_reductions (loop_p loop, r
       gimple *reduc_stmt
 	= vect_force_simple_reduction (simple_loop_info, phi,
 				       &double_reduc, true);
-      if (!reduc_stmt)
+      if (!reduc_stmt || !valid_reduction_p (reduc_stmt))
 	continue;
 
       if (double_reduc)
@@ -2610,7 +2623,8 @@ gather_scalar_reductions (loop_p loop, r
 		= vect_force_simple_reduction (simple_loop_info, inner_phi,
 					       &double_reduc, true);
 	      gcc_assert (!double_reduc);
-	      if (inner_reduc_stmt == NULL)
+	      if (inner_reduc_stmt == NULL
+		  || !valid_reduction_p (inner_reduc_stmt))
 		continue;
 
 	      build_new_reduction (reduction_list, double_reduc_stmts[i], phi);
Index: gcc/tree-vectorizer.h
===================================================================
--- gcc/tree-vectorizer.h	2017-11-21 17:06:24.670434749 +0000
+++ gcc/tree-vectorizer.h	2017-11-21 17:06:25.018421257 +0000
@@ -74,7 +74,15 @@ enum vect_reduction_type {
 
        for (int i = 0; i < VF; ++i)
          res = cond[i] ? val[i] : res;  */
-  EXTRACT_LAST_REDUCTION
+  EXTRACT_LAST_REDUCTION,
+
+  /* Use a folding reduction within the loop to implement:
+
+       for (int i = 0; i < VF; ++i)
+         res = res OP val[i];
+
+     (with no reassociation).  */
+  FOLD_LEFT_REDUCTION
 };
 
 #define VECTORIZABLE_CYCLE_DEF(D) (((D) == vect_reduction_def)           \
@@ -1389,6 +1397,7 @@ extern void vect_model_load_cost (stmt_v
 extern unsigned record_stmt_cost (stmt_vector_for_cost *, int,
 				  enum vect_cost_for_stmt, stmt_vec_info,
 				  int, enum vect_cost_model_location);
+extern void vect_finish_replace_stmt (gimple *, gimple *);
 extern void vect_finish_stmt_generation (gimple *, gimple *,
                                          gimple_stmt_iterator *);
 extern bool vect_mark_stmts_to_be_vectorized (loop_vec_info);
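
The FOLD_LEFT_REDUCTION semantics documented in tree-vectorizer.h above can
be illustrated with a small standalone sketch.  This is not part of the
patch: fold_left_sum and lanewise_sum are hypothetical helpers written only
to show why reassociation is unsafe without -fassociative-math, assuming a
target that evaluates float arithmetic in IEEE single precision.

```c
#include <assert.h>
#include <stddef.h>

/* Strict left-to-right sum: res = res OP val[i] with no reassociation,
   which is the behavior FOLD_LEFT_REDUCTION has to preserve.  */
float
fold_left_sum (float init, const float *val, size_t n)
{
  float res = init;
  for (size_t i = 0; i < n; ++i)
    res = res + val[i];
  return res;
}

/* Reassociated sum: one partial accumulator per lane, combined at the
   end.  This models an ordinary reassociating vector reduction; it is
   an illustrative helper, not GCC code.  */
float
lanewise_sum (float init, const float *val, size_t n, size_t lanes)
{
  float partial[8] = { 0 };
  assert (lanes > 0 && lanes <= 8 && n % lanes == 0);
  for (size_t i = 0; i < n; ++i)
    partial[i % lanes] += val[i];
  float res = init;
  for (size_t l = 0; l < lanes; ++l)
    res += partial[l];
  return res;
}
```

Under that assumption, for the input { 1e8f, 1.0f, -1e8f, 1.0f } the
in-order sum absorbs the first 1.0f into 1e8f and yields 1.0f, whereas the
two-lane sum cancels the large terms first and yields 2.0f.  This is the
kind of difference the patch avoids by keeping the reduction in order.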
Index: gcc/tree-vect-loop.c
===================================================================
--- gcc/tree-vect-loop.c	2017-11-21 17:06:24.670434749 +0000
+++ gcc/tree-vect-loop.c	2017-11-21 17:06:25.018421257 +0000
@@ -2573,6 +2573,22 @@ vect_analyze_loop (struct loop *loop, lo
     }
 }
 
+/* Return true if there is an in-order reduction function for CODE, storing
+   it in *REDUC_FN if so.  */
+
+static bool
+fold_left_reduction_fn (tree_code code, internal_fn *reduc_fn)
+{
+  switch (code)
+    {
+    case PLUS_EXPR:
+      *reduc_fn = IFN_FOLD_LEFT_PLUS;
+      return true;
+
+    default:
+      return false;
+    }
+}
 
 /* Function reduction_fn_for_scalar_code
 
@@ -2879,6 +2895,42 @@ vect_is_slp_reduction (loop_vec_info loo
   return true;
 }
 
+/* Return true if we need an in-order reduction for operation CODE
+   on type TYPE.  NEED_WRAPPING_INTEGRAL_OVERFLOW is true if integer
+   overflow must wrap.  */
+
+static bool
+needs_fold_left_reduction_p (tree type, tree_code code,
+			     bool need_wrapping_integral_overflow)
+{
+  /* CHECKME: check for !flag_finite_math_only too?  */
+  if (SCALAR_FLOAT_TYPE_P (type))
+    switch (code)
+      {
+      case MIN_EXPR:
+      case MAX_EXPR:
+	return false;
+
+      default:
+	return !flag_associative_math;
+      }
+
+  if (INTEGRAL_TYPE_P (type))
+    {
+      if (!operation_no_trapping_overflow (type, code))
+	return true;
+      if (need_wrapping_integral_overflow
+	  && !TYPE_OVERFLOW_WRAPS (type)
+	  && operation_can_overflow (code))
+	return true;
+      return false;
+    }
+
+  if (SAT_FIXED_POINT_TYPE_P (type))
+    return true;
+
+  return false;
+}
 
 /* Function vect_is_simple_reduction
 
@@ -3197,58 +3249,18 @@ vect_is_simple_reduction (loop_vec_info
       return NULL;
     }
 
-  /* Check that it's ok to change the order of the computation.
+  /* Check whether it's ok to change the order of the computation.
      Generally, when vectorizing a reduction we change the order of the
      computation.  This may change the behavior of the program in some
      cases, so we need to check that this is ok.  One exception is when
      vectorizing an outer-loop: the inner-loop is executed sequentially,
      and therefore vectorizing reductions in the inner-loop during
      outer-loop vectorization is safe.  */
-
-  if (*v_reduc_type != COND_REDUCTION
-      && check_reduction)
-    {
-      /* CHECKME: check for !flag_finite_math_only too?  */
-      if (SCALAR_FLOAT_TYPE_P (type) && !flag_associative_math)
-	{
-	  /* Changing the order of operations changes the semantics.  */
-	  if (dump_enabled_p ())
-	    report_vect_op (MSG_MISSED_OPTIMIZATION, def_stmt,
-			"reduction: unsafe fp math optimization: ");
-	  return NULL;
-	}
-      else if (INTEGRAL_TYPE_P (type))
-	{
-	  if (!operation_no_trapping_overflow (type, code))
-	    {
-	      /* Changing the order of operations changes the semantics.  */
-	      if (dump_enabled_p ())
-		report_vect_op (MSG_MISSED_OPTIMIZATION, def_stmt,
-				"reduction: unsafe int math optimization"
-				" (overflow traps): ");
-	      return NULL;
-	    }
-	  if (need_wrapping_integral_overflow
-	      && !TYPE_OVERFLOW_WRAPS (type)
-	      && operation_can_overflow (code))
-	    {
-	      /* Changing the order of operations changes the semantics.  */
-	      if (dump_enabled_p ())
-		report_vect_op (MSG_MISSED_OPTIMIZATION, def_stmt,
-				"reduction: unsafe int math optimization"
-				" (overflow doesn't wrap): ");
-	      return NULL;
-	    }
-	}
-      else if (SAT_FIXED_POINT_TYPE_P (type))
-	{
-	  /* Changing the order of operations changes the semantics.  */
-	  if (dump_enabled_p ())
-	  report_vect_op (MSG_MISSED_OPTIMIZATION, def_stmt,
-			  "reduction: unsafe fixed-point math optimization: ");
-	  return NULL;
-	}
-    }
+  if (check_reduction
+      && *v_reduc_type == TREE_CODE_REDUCTION
+      && needs_fold_left_reduction_p (type, code,
+				      need_wrapping_integral_overflow))
+    *v_reduc_type = FOLD_LEFT_REDUCTION;
 
   /* Reduction is safe. We're dealing with one of the following:
      1) integer arithmetic and no trapv
@@ -3512,6 +3524,7 @@ vect_force_simple_reduction (loop_vec_in
       STMT_VINFO_REDUC_TYPE (reduc_def_info) = v_reduc_type;
       STMT_VINFO_REDUC_DEF (reduc_def_info) = def;
       reduc_def_info = vinfo_for_stmt (def);
+      STMT_VINFO_REDUC_TYPE (reduc_def_info) = v_reduc_type;
       STMT_VINFO_REDUC_DEF (reduc_def_info) = phi;
     }
   return def;
@@ -4064,14 +4077,27 @@ vect_model_reduction_cost (stmt_vec_info
 
   code = gimple_assign_rhs_code (orig_stmt);
 
-  if (reduction_type == EXTRACT_LAST_REDUCTION)
+  if (reduction_type == EXTRACT_LAST_REDUCTION
+      || reduction_type == FOLD_LEFT_REDUCTION)
     {
       /* No extra instructions needed in the prologue.  */
       prologue_cost = 0;
 
-      /* Count NCOPIES FOLD_EXTRACT_LAST operations.  */
-      inside_cost = add_stmt_cost (target_cost_data, ncopies, vec_to_scalar,
-				   stmt_info, 0, vect_body);
+      if (reduction_type == EXTRACT_LAST_REDUCTION || reduc_fn != IFN_LAST)
+	/* Count one reduction-like operation per vector.  */
+	inside_cost = add_stmt_cost (target_cost_data, ncopies, vec_to_scalar,
+				     stmt_info, 0, vect_body);
+      else
+	{
+	  /* Use NELEMENTS extracts and NELEMENTS scalar ops.  */
+	  unsigned int nelements = ncopies * vect_nunits_for_cost (vectype);
+	  inside_cost = add_stmt_cost (target_cost_data, nelements,
+				       vec_to_scalar, stmt_info, 0,
+				       vect_body);
+	  inside_cost += add_stmt_cost (target_cost_data, nelements,
+					scalar_stmt, stmt_info, 0,
+					vect_body);
+	}
     }
   else
     {
@@ -4137,7 +4163,8 @@ vect_model_reduction_cost (stmt_vec_info
 					  scalar_stmt, stmt_info, 0,
 					  vect_epilogue);
 	}
-      else if (reduction_type == EXTRACT_LAST_REDUCTION)
+      else if (reduction_type == EXTRACT_LAST_REDUCTION
+	       || reduction_type == FOLD_LEFT_REDUCTION)
 	/* No extra instructions need in the epilogue.  */
 	;
       else
@@ -5910,6 +5937,160 @@ vect_create_epilog_for_reduction (vec<tr
     }
 }
 
+/* Return a vector of type VECTYPE that is equal to the vector select
+   operation "MASK ? VEC : IDENTITY".  Insert the select statements
+   before GSI.  */
+
+static tree
+merge_with_identity (gimple_stmt_iterator *gsi, tree mask, tree vectype,
+		     tree vec, tree identity)
+{
+  tree cond = make_temp_ssa_name (vectype, NULL, "cond");
+  gimple *new_stmt = gimple_build_assign (cond, VEC_COND_EXPR,
+					  mask, vec, identity);
+  gsi_insert_before (gsi, new_stmt, GSI_SAME_STMT);
+  return cond;
+}
+
+/* Perform an in-order reduction (FOLD_LEFT_REDUCTION).  STMT is the
+   statement that sets the live-out value.  REDUC_DEF_STMT is the phi
+   statement.  CODE is the operation performed by STMT and OPS are
+   its scalar operands.  REDUC_INDEX is the index of the operand in
+   OPS that is set by REDUC_DEF_STMT.  REDUC_FN is the function that
+   implements in-order reduction, or IFN_LAST if we should open-code it.
+   VECTYPE_IN is the type of the vector input.  MASKS specifies the masks
+   that should be used to control the operation in a fully-masked loop.  */
+
+static bool
+vectorize_fold_left_reduction (gimple *stmt, gimple_stmt_iterator *gsi,
+			       gimple **vec_stmt, slp_tree slp_node,
+			       gimple *reduc_def_stmt,
+			       tree_code code, internal_fn reduc_fn,
+			       tree ops[3], tree vectype_in,
+			       int reduc_index, vec_loop_masks *masks)
+{
+  stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
+  loop_vec_info loop_vinfo = STMT_VINFO_LOOP_VINFO (stmt_info);
+  struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
+  tree vectype_out = STMT_VINFO_VECTYPE (stmt_info);
+  gimple *new_stmt = NULL;
+
+  int ncopies;
+  if (slp_node)
+    ncopies = 1;
+  else
+    ncopies = vect_get_num_copies (loop_vinfo, vectype_in);
+
+  gcc_assert (!nested_in_vect_loop_p (loop, stmt));
+  gcc_assert (ncopies == 1);
+  gcc_assert (TREE_CODE_LENGTH (code) == binary_op);
+  gcc_assert (reduc_index == (code == MINUS_EXPR ? 0 : 1));
+  gcc_assert (STMT_VINFO_VEC_REDUCTION_TYPE (stmt_info)
+	      == FOLD_LEFT_REDUCTION);
+
+  if (slp_node)
+    gcc_assert (must_eq (TYPE_VECTOR_SUBPARTS (vectype_out),
+			 TYPE_VECTOR_SUBPARTS (vectype_in)));
+
+  tree op0 = ops[1 - reduc_index];
+
+  int group_size = 1;
+  gimple *scalar_dest_def;
+  auto_vec<tree> vec_oprnds0;
+  if (slp_node)
+    {
+      vect_get_vec_defs (op0, NULL_TREE, stmt, &vec_oprnds0, NULL, slp_node);
+      group_size = SLP_TREE_SCALAR_STMTS (slp_node).length ();
+      scalar_dest_def = SLP_TREE_SCALAR_STMTS (slp_node)[group_size - 1];
+    }
+  else
+    {
+      tree loop_vec_def0 = vect_get_vec_def_for_operand (op0, stmt);
+      vec_oprnds0.create (1);
+      vec_oprnds0.quick_push (loop_vec_def0);
+      scalar_dest_def = stmt;
+    }
+
+  tree scalar_dest = gimple_assign_lhs (scalar_dest_def);
+  tree scalar_type = TREE_TYPE (scalar_dest);
+  tree reduc_var = gimple_phi_result (reduc_def_stmt);
+
+  int vec_num = vec_oprnds0.length ();
+  gcc_assert (vec_num == 1 || slp_node);
+  tree vec_elem_type = TREE_TYPE (vectype_out);
+  gcc_checking_assert (useless_type_conversion_p (scalar_type, vec_elem_type));
+
+  tree vector_identity = NULL_TREE;
+  if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+    vector_identity = build_zero_cst (vectype_out);
+
+  tree scalar_dest_var = vect_create_destination_var (scalar_dest, NULL);
+  int i;
+  tree def0;
+  FOR_EACH_VEC_ELT (vec_oprnds0, i, def0)
+    {
+      tree mask = NULL_TREE;
+      if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+	mask = vect_get_loop_mask (gsi, masks, vec_num, vectype_in, i);
+
+      /* Handle MINUS by adding the negative.  */
+      if (reduc_fn != IFN_LAST && code == MINUS_EXPR)
+	{
+	  tree negated = make_ssa_name (vectype_out);
+	  new_stmt = gimple_build_assign (negated, NEGATE_EXPR, def0);
+	  gsi_insert_before (gsi, new_stmt, GSI_SAME_STMT);
+	  def0 = negated;
+	}
+
+      if (mask)
+	def0 = merge_with_identity (gsi, mask, vectype_out, def0,
+				    vector_identity);
+
+      /* On the first iteration the input is simply the scalar phi
+	 result, and for subsequent iterations it is the output of
+	 the preceding operation.  */
+      if (reduc_fn != IFN_LAST)
+	{
+	  new_stmt = gimple_build_call_internal (reduc_fn, 2, reduc_var, def0);
+	  /* For chained SLP reductions the output of the previous reduction
+	     operation serves as the input of the next.  For the final
+	     statement the output cannot be a temporary, so we reuse the
+	     original scalar destination of the last statement.  */
+	  if (i != vec_num - 1)
+	    {
+	      gimple_set_lhs (new_stmt, scalar_dest_var);
+	      reduc_var = make_ssa_name (scalar_dest_var, new_stmt);
+	      gimple_set_lhs (new_stmt, reduc_var);
+	    }
+	}
+      else
+	{
+	  reduc_var = vect_expand_fold_left (gsi, scalar_dest_var, code,
+					     reduc_var, def0);
+	  new_stmt = SSA_NAME_DEF_STMT (reduc_var);
+	  /* Remove the statement, so that we can use the same code paths
+	     as for statements that we've just created.  */
+	  gimple_stmt_iterator tmp_gsi = gsi_for_stmt (new_stmt);
+	  gsi_remove (&tmp_gsi, false);
+	}
+
+      if (i == vec_num - 1)
+	{
+	  gimple_set_lhs (new_stmt, scalar_dest);
+	  vect_finish_replace_stmt (scalar_dest_def, new_stmt);
+	}
+      else
+	vect_finish_stmt_generation (scalar_dest_def, new_stmt, gsi);
+
+      if (slp_node)
+	SLP_TREE_VEC_STMTS (slp_node).quick_push (new_stmt);
+    }
+
+  if (!slp_node)
+    STMT_VINFO_VEC_STMT (stmt_info) = *vec_stmt = new_stmt;
+
+  return true;
+}
 
 /* Function is_nonwrapping_integer_induction.
 
@@ -6090,6 +6271,12 @@ vectorizable_reduction (gimple *stmt, gi
 	  return true;
 	}
 
+      if (STMT_VINFO_REDUC_TYPE (stmt_info) == FOLD_LEFT_REDUCTION)
+	/* Leave the scalar phi in place.  Note that checking
+	   STMT_VINFO_VEC_REDUCTION_TYPE (as below) only works
+	   for reductions involving a single statement.  */
+	return true;
+
       gimple *reduc_stmt = STMT_VINFO_REDUC_DEF (stmt_info);
       if (STMT_VINFO_IN_PATTERN_P (vinfo_for_stmt (reduc_stmt)))
 	reduc_stmt = STMT_VINFO_RELATED_STMT (vinfo_for_stmt (reduc_stmt));
@@ -6316,6 +6503,14 @@ vectorizable_reduction (gimple *stmt, gi
      directy used in stmt.  */
   if (reduc_index == -1)
     {
+      if (STMT_VINFO_REDUC_TYPE (stmt_info) == FOLD_LEFT_REDUCTION)
+	{
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			     "in-order reduction chain without SLP.\n");
+	  return false;
+	}
+
       if (orig_stmt)
 	reduc_def_stmt = STMT_VINFO_REDUC_DEF (orig_stmt_info);
       else
@@ -6535,7 +6730,9 @@ vectorizable_reduction (gimple *stmt, gi
 
   vect_reduction_type reduction_type
     = STMT_VINFO_VEC_REDUCTION_TYPE (stmt_info);
-  if (orig_stmt && reduction_type == TREE_CODE_REDUCTION)
+  if (orig_stmt
+      && (reduction_type == TREE_CODE_REDUCTION
+	  || reduction_type == FOLD_LEFT_REDUCTION))
     {
       /* This is a reduction pattern: get the vectype from the type of the
          reduction variable, and get the tree-code from orig_stmt.  */
@@ -6582,10 +6779,13 @@ vectorizable_reduction (gimple *stmt, gi
   reduc_fn = IFN_LAST;
 
   if (reduction_type == TREE_CODE_REDUCTION
+      || reduction_type == FOLD_LEFT_REDUCTION
       || reduction_type == INTEGER_INDUC_COND_REDUCTION
       || reduction_type == CONST_COND_REDUCTION)
     {
-      if (reduction_fn_for_scalar_code (orig_code, &reduc_fn))
+      if (reduction_type == FOLD_LEFT_REDUCTION
+	  ? fold_left_reduction_fn (orig_code, &reduc_fn)
+	  : reduction_fn_for_scalar_code (orig_code, &reduc_fn))
 	{
 	  if (reduc_fn != IFN_LAST
 	      && !direct_internal_fn_supported_p (reduc_fn, vectype_out,
@@ -6704,6 +6904,41 @@ vectorizable_reduction (gimple *stmt, gi
 	}
     }
 
+  if (double_reduc && reduction_type == FOLD_LEFT_REDUCTION)
+    {
+      /* We can't support in-order reductions of code such as this:
+
+	   for (int i = 0; i < n1; ++i)
+	     for (int j = 0; j < n2; ++j)
+	       l += a[j];
+
+	 since GCC effectively transforms the loop when vectorizing:
+
+	   for (int i = 0; i < n1 / VF; ++i)
+	     for (int j = 0; j < n2; ++j)
+	       for (int k = 0; k < VF; ++k)
+		 l += a[j];
+
+	 which is a reassociation of the original operation.  */
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			 "in-order double reduction not supported.\n");
+
+      return false;
+    }
+
+  if (reduction_type == FOLD_LEFT_REDUCTION
+      && slp_node
+      && !GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt)))
+    {
+      /* We cannot use in-order reductions in this case because there is
+	 an implicit reassociation of the operations involved.  */
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			 "in-order unchained SLP reductions not supported.\n");
+      return false;
+    }
+
   /* In case of widenning multiplication by a constant, we update the type
      of the constant to be the type of the other operand.  We check that the
      constant fits the type in the pattern recognition pass.  */
@@ -6824,9 +7059,10 @@ vectorizable_reduction (gimple *stmt, gi
 	vect_model_reduction_cost (stmt_info, reduc_fn, ncopies);
       if (loop_vinfo && LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo))
 	{
-	  if (cond_fn == IFN_LAST
-	      || !direct_internal_fn_supported_p (cond_fn, vectype_in,
-						  OPTIMIZE_FOR_SPEED))
+	  if (reduction_type != FOLD_LEFT_REDUCTION
+	      && (cond_fn == IFN_LAST
+		  || !direct_internal_fn_supported_p (cond_fn, vectype_in,
+						      OPTIMIZE_FOR_SPEED)))
 	    {
 	      if (dump_enabled_p ())
 		dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
@@ -6846,6 +7082,10 @@ vectorizable_reduction (gimple *stmt, gi
 	    vect_record_loop_mask (loop_vinfo, masks, ncopies * vec_num,
 				   vectype_in);
 	}
+      if (dump_enabled_p ()
+	  && reduction_type == FOLD_LEFT_REDUCTION)
+	dump_printf_loc (MSG_NOTE, vect_location,
+			 "using an in-order (fold-left) reduction.\n");
       STMT_VINFO_TYPE (stmt_info) = reduc_vec_info_type;
       return true;
     }
@@ -6861,6 +7101,11 @@ vectorizable_reduction (gimple *stmt, gi
 
   bool masked_loop_p = LOOP_VINFO_FULLY_MASKED_P (loop_vinfo);
 
+  if (reduction_type == FOLD_LEFT_REDUCTION)
+    return vectorize_fold_left_reduction
+      (stmt, gsi, vec_stmt, slp_node, reduc_def_stmt, code,
+       reduc_fn, ops, vectype_in, reduc_index, masks);
+
   if (reduction_type == EXTRACT_LAST_REDUCTION)
     {
       gcc_assert (!slp_node);
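
As a reading aid for vectorize_fold_left_reduction and merge_with_identity
above, the following scalar model sketches what one masked in-order step
computes.  It is an assumption-laden illustration rather than code from the
patch: masked_fold_left_plus and its parameters are hypothetical names, the
mask is modelled as a plain bool array, and the identity is taken to be 0.0,
matching the build_zero_cst call in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Scalar model of one masked fold-left step.  NEGATE models the
   MINUS_EXPR case, which the patch handles by adding the negated
   vector.  Inactive lanes are replaced by the additive identity before
   the strictly left-to-right accumulation.  */
double
masked_fold_left_plus (double acc, const double *vec, const bool *mask,
		       size_t nunits, bool negate)
{
  assert (vec != NULL && mask != NULL);
  for (size_t i = 0; i < nunits; ++i)
    {
      double elt = negate ? -vec[i] : vec[i];
      /* Equivalent of "MASK ? VEC : IDENTITY" for a single lane.  */
      double merged = mask[i] ? elt : 0.0;
      acc = acc + merged;
    }
  return acc;
}
```

In this model the accumulator passed in plays the role of the scalar phi
result for the first vector and of the previous call's result for each
later vector in a chained SLP reduction.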
Index: gcc/config/aarch64/aarch64.md
===================================================================
--- gcc/config/aarch64/aarch64.md	2017-11-21 17:06:24.670434749 +0000
+++ gcc/config/aarch64/aarch64.md	2017-11-21 17:06:25.013421451 +0000
@@ -164,6 +164,7 @@ (define_c_enum "unspec" [
     UNSPEC_STN
     UNSPEC_INSR
     UNSPEC_CLASTB
+    UNSPEC_FADDA
 ])
 
 (define_c_enum "unspecv" [
Index: gcc/config/aarch64/aarch64-sve.md
===================================================================
--- gcc/config/aarch64/aarch64-sve.md	2017-11-21 17:06:24.670434749 +0000
+++ gcc/config/aarch64/aarch64-sve.md	2017-11-21 17:06:25.012421490 +0000
@@ -1574,6 +1574,45 @@ (define_insn "*reduc_<optab>_scal_<mode>
   "<bit_reduc_op>\t%<Vetype>0, %1, %2.<Vetype>"
 )
 
+;; Unpredicated in-order FP reductions.
+(define_expand "fold_left_plus_<mode>"
+  [(set (match_operand:<VEL> 0 "register_operand")
+	(unspec:<VEL> [(match_dup 3)
+		       (match_operand:<VEL> 1 "register_operand")
+		       (match_operand:SVE_F 2 "register_operand")]
+		      UNSPEC_FADDA))]
+  "TARGET_SVE"
+  {
+    operands[3] = force_reg (<VPRED>mode, CONSTM1_RTX (<VPRED>mode));
+  }
+)
+
+;; In-order FP reductions predicated with PTRUE.
+(define_insn "*fold_left_plus_<mode>"
+  [(set (match_operand:<VEL> 0 "register_operand" "=w")
+	(unspec:<VEL> [(match_operand:<VPRED> 1 "register_operand" "Upl")
+		       (match_operand:<VEL> 2 "register_operand" "0")
+		       (match_operand:SVE_F 3 "register_operand" "w")]
+		      UNSPEC_FADDA))]
+  "TARGET_SVE"
+  "fadda\t%<Vetype>0, %1, %<Vetype>0, %3.<Vetype>"
+)
+
+;; Predicated form of the above in-order reduction.
+(define_insn "*pred_fold_left_plus_<mode>"
+  [(set (match_operand:<VEL> 0 "register_operand" "=w")
+	(unspec:<VEL>
+	  [(match_operand:<VEL> 1 "register_operand" "0")
+	   (unspec:SVE_F
+	     [(match_operand:<VPRED> 2 "register_operand" "Upl")
+	      (match_operand:SVE_F 3 "register_operand" "w")
+	      (match_operand:SVE_F 4 "aarch64_simd_imm_zero")]
+	     UNSPEC_SEL)]
+	  UNSPEC_FADDA))]
+  "TARGET_SVE"
+  "fadda\t%<Vetype>0, %2, %<Vetype>0, %3.<Vetype>"
+)
+
 ;; Unpredicated floating-point addition.
 (define_expand "add<mode>3"
   [(set (match_operand:SVE_F 0 "register_operand")
Index: gcc/testsuite/gcc.dg/vect/no-fast-math-vect16.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/no-fast-math-vect16.c	2017-11-21 17:06:24.670434749 +0000
+++ gcc/testsuite/gcc.dg/vect/no-fast-math-vect16.c	2017-11-21 17:06:25.015421374 +0000
@@ -33,5 +33,5 @@ int main (void)
   return main1 ();
 }
 
-/* Requires fast-math.  */
-/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { xfail *-*-* } } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
+/* { dg-final { scan-tree-dump-times {using an in-order \(fold-left\) reduction} 1 "vect" } } */
Index: gcc/testsuite/gcc.dg/vect/pr79920.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/pr79920.c	2017-11-21 17:06:24.670434749 +0000
+++ gcc/testsuite/gcc.dg/vect/pr79920.c	2017-11-21 17:06:25.015421374 +0000
@@ -1,5 +1,5 @@
 /* { dg-do run } */
-/* { dg-additional-options "-O3" } */
+/* { dg-additional-options "-O3 -fno-fast-math" } */
 
 #include "tree-vect.h"
 
@@ -41,4 +41,5 @@ int main()
   return 0;
 }
 
-/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target { vect_double && { vect_perm && vect_hw_misalign } } } } } */
+/* { dg-final { scan-tree-dump-times {using an in-order \(fold-left\) reduction} 1 "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorized 2 loops" 1 "vect" { target { vect_double && { vect_perm && vect_hw_misalign } } } } } */
Index: gcc/testsuite/gcc.dg/vect/trapv-vect-reduc-4.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/trapv-vect-reduc-4.c	2017-11-21 17:06:24.670434749 +0000
+++ gcc/testsuite/gcc.dg/vect/trapv-vect-reduc-4.c	2017-11-21 17:06:25.015421374 +0000
@@ -46,5 +46,6 @@ int main (void)
   return 0;
 }
 
-/* { dg-final { scan-tree-dump-times "Detected reduction\\." 2 "vect"  } } */
-/* { dg-final { scan-tree-dump-times "vectorized 2 loops" 1 "vect" { target { ! vect_no_int_min_max } } } } */
+/* { dg-final { scan-tree-dump-times "Detected reduction\\." 3 "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorized 3 loops" 1 "vect" { target { ! vect_no_int_min_max } } } } */
+/* { dg-final { scan-tree-dump-times {using an in-order \(fold-left\) reduction} 1 "vect" } } */
Index: gcc/testsuite/gcc.dg/vect/vect-reduc-6.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/vect-reduc-6.c	2017-11-21 17:06:24.670434749 +0000
+++ gcc/testsuite/gcc.dg/vect/vect-reduc-6.c	2017-11-21 17:06:25.015421374 +0000
@@ -1,4 +1,5 @@
 /* { dg-require-effective-target vect_float } */
+/* { dg-additional-options "-fno-fast-math" } */
 
 #include <stdarg.h>
 #include "tree-vect.h"
@@ -48,6 +49,5 @@ int main (void)
   return 0;
 }
 
-/* need -ffast-math to vectorizer these loops.  */
-/* ARM NEON passes -ffast-math to these tests, so expect this to fail.  */
-/* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" { xfail arm_neon_ok } } } */
+/* { dg-final { scan-tree-dump-times {using an in-order \(fold-left\) reduction} 1 "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
Index: gcc/testsuite/gcc.dg/vect/vect-reduc-in-order-1.c
===================================================================
--- /dev/null	2017-11-20 18:51:34.589640877 +0000
+++ gcc/testsuite/gcc.dg/vect/vect-reduc-in-order-1.c	2017-11-21 17:06:25.015421374 +0000
@@ -0,0 +1,42 @@
+/* { dg-do run { xfail { { i?86-*-* x86_64-*-* } && ia32 } } } */
+/* { dg-require-effective-target vect_double } */
+/* { dg-add-options ieee } */
+/* { dg-additional-options "-fno-fast-math" } */
+
+#include "tree-vect.h"
+
+#define N (VECTOR_BITS * 17)
+
+double __attribute__ ((noinline, noclone))
+reduc_plus_double (double *a, double *b)
+{
+  double r = 0, q = 3;
+  for (int i = 0; i < N; i++)
+    {
+      r += a[i];
+      q -= b[i];
+    }
+  return r * q;
+}
+
+int __attribute__ ((optimize (1)))
+main ()
+{
+  double a[N];
+  double b[N];
+  double r = 0, q = 3;
+  for (int i = 0; i < N; i++)
+    {
+      a[i] = (i * 0.1) * (i & 1 ? 1 : -1);
+      b[i] = (i * 0.3) * (i & 1 ? 1 : -1);
+      r += a[i];
+      q -= b[i];
+      asm volatile ("" ::: "memory");
+    }
+  double res = reduc_plus_double (a, b);
+  if (res != r * q)
+    __builtin_abort ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times {using an in-order \(fold-left\) reduction} 2 "vect" } } */
Index: gcc/testsuite/gcc.dg/vect/vect-reduc-in-order-2.c
===================================================================
--- /dev/null	2017-11-20 18:51:34.589640877 +0000
+++ gcc/testsuite/gcc.dg/vect/vect-reduc-in-order-2.c	2017-11-21 17:06:25.015421374 +0000
@@ -0,0 +1,44 @@
+/* { dg-do run { xfail { { i?86-*-* x86_64-*-* } && ia32 } } } */
+/* { dg-require-effective-target vect_double } */
+/* { dg-add-options ieee } */
+/* { dg-additional-options "-fno-fast-math" } */
+
+#include "tree-vect.h"
+
+#define N (VECTOR_BITS * 17)
+
+double __attribute__ ((noinline, noclone))
+reduc_plus_double (double *restrict a, int n)
+{
+  double res = 0.0;
+  for (int i = 0; i < n; i++)
+    for (int j = 0; j < N; j++)
+      res += a[i];
+  return res;
+}
+
+int __attribute__ ((optimize (1)))
+main ()
+{
+  int n = 19;
+  double a[N];
+  double r = 0;
+  for (int i = 0; i < N; i++)
+    {
+      a[i] = (i * 0.1) * (i & 1 ? 1 : -1);
+      asm volatile ("" ::: "memory");
+    }
+  for (int i = 0; i < n; i++)
+    for (int j = 0; j < N; j++)
+      {
+	r += a[i];
+	asm volatile ("" ::: "memory");
+      }
+  double res = reduc_plus_double (a, n);
+  if (res != r)
+    __builtin_abort ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times {in-order double reduction not supported} 1 "vect" } } */
+/* { dg-final { scan-tree-dump-times {using an in-order \(fold-left\) reduction} 1 "vect" } } */
Index: gcc/testsuite/gcc.dg/vect/vect-reduc-in-order-3.c
===================================================================
--- /dev/null	2017-11-20 18:51:34.589640877 +0000
+++ gcc/testsuite/gcc.dg/vect/vect-reduc-in-order-3.c	2017-11-21 17:06:25.016421335 +0000
@@ -0,0 +1,42 @@
+/* { dg-do run { xfail { { i?86-*-* x86_64-*-* } && ia32 } } } */
+/* { dg-require-effective-target vect_double } */
+/* { dg-add-options ieee } */
+/* { dg-additional-options "-fno-fast-math" } */
+
+#include "tree-vect.h"
+
+#define N (VECTOR_BITS * 17)
+
+double __attribute__ ((noinline, noclone))
+reduc_plus_double (double *a)
+{
+  double r = 0;
+  for (int i = 0; i < N; i += 4)
+    {
+      r += a[i] * 2.0;
+      r += a[i + 1] * 3.0;
+      r += a[i + 2] * 4.0;
+      r += a[i + 3] * 5.0;
+    }
+  return r;
+}
+
+int __attribute__ ((optimize (1)))
+main ()
+{
+  double a[N];
+  double r = 0;
+  for (int i = 0; i < N; i++)
+    {
+      a[i] = (i * 0.1) * (i & 1 ? 1 : -1);
+      r += a[i] * (i % 4 + 2);
+      asm volatile ("" ::: "memory");
+    }
+  double res = reduc_plus_double (a);
+  if (res != r)
+    __builtin_abort ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times {using an in-order \(fold-left\) reduction} 1 "vect" } } */
+/* { dg-final { scan-tree-dump-times {vectorizing stmts using SLP} 1 "vect" } } */
Index: gcc/testsuite/gcc.dg/vect/vect-reduc-in-order-4.c
===================================================================
--- /dev/null	2017-11-20 18:51:34.589640877 +0000
+++ gcc/testsuite/gcc.dg/vect/vect-reduc-in-order-4.c	2017-11-21 17:06:25.016421335 +0000
@@ -0,0 +1,45 @@
+/* { dg-do run { xfail { { i?86-*-* x86_64-*-* } && ia32 } } } */
+/* { dg-require-effective-target vect_double } */
+/* { dg-add-options ieee } */
+/* { dg-additional-options "-fno-fast-math" } */
+
+#include "tree-vect.h"
+
+#define N (VECTOR_BITS * 17)
+
+double __attribute__ ((noinline, noclone))
+reduc_plus_double (double *a)
+{
+  double r1 = 0;
+  double r2 = 0;
+  double r3 = 0;
+  double r4 = 0;
+  for (int i = 0; i < N; i += 4)
+    {
+      r1 += a[i];
+      r2 += a[i + 1];
+      r3 += a[i + 2];
+      r4 += a[i + 3];
+    }
+  return r1 * r2 * r3 * r4;
+}
+
+int __attribute__ ((optimize (1)))
+main ()
+{
+  double a[N];
+  double r[4] = {};
+  for (int i = 0; i < N; i++)
+    {
+      a[i] = (i * 0.1) * (i & 1 ? 1 : -1);
+      r[i % 4] += a[i];
+      asm volatile ("" ::: "memory");
+    }
+  double res = reduc_plus_double (a);
+  if (res != r[0] * r[1] * r[2] * r[3])
+    __builtin_abort ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times {in-order unchained SLP reductions not supported} 1 "vect" } } */
+/* { dg-final { scan-tree-dump-not {vectorizing stmts using SLP} "vect" } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_1.c
===================================================================
--- /dev/null	2017-11-20 18:51:34.589640877 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_1.c	2017-11-21 17:06:25.016421335 +0000
@@ -0,0 +1,28 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve" } */
+
+#define NUM_ELEMS(TYPE) ((int) (5 * (256 / sizeof (TYPE)) + 3))
+
+#define DEF_REDUC_PLUS(TYPE)			\
+  TYPE __attribute__ ((noinline, noclone))	\
+  reduc_plus_##TYPE (TYPE *a, TYPE *b)		\
+  {						\
+    TYPE r = 0, q = 3;				\
+    for (int i = 0; i < NUM_ELEMS (TYPE); i++)	\
+      {						\
+	r += a[i];				\
+	q -= b[i];				\
+      }						\
+    return r * q;				\
+  }
+
+#define TEST_ALL(T) \
+  T (_Float16) \
+  T (float) \
+  T (double)
+
+TEST_ALL (DEF_REDUC_PLUS)
+
+/* { dg-final { scan-assembler-times {\tfadda\th[0-9]+, p[0-7], h[0-9]+, z[0-9]+\.h} 2 } } */
+/* { dg-final { scan-assembler-times {\tfadda\ts[0-9]+, p[0-7], s[0-9]+, z[0-9]+\.s} 2 } } */
+/* { dg-final { scan-assembler-times {\tfadda\td[0-9]+, p[0-7], d[0-9]+, z[0-9]+\.d} 2 } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_1_run.c
===================================================================
--- /dev/null	2017-11-20 18:51:34.589640877 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_1_run.c	2017-11-21 17:06:25.016421335 +0000
@@ -0,0 +1,29 @@
+/* { dg-do run { target { aarch64_sve_hw } } } */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve" } */
+
+#include "sve_reduc_strict_1.c"
+
+#define TEST_REDUC_PLUS(TYPE)			\
+  {						\
+    TYPE a[NUM_ELEMS (TYPE)];			\
+    TYPE b[NUM_ELEMS (TYPE)];			\
+    TYPE r = 0, q = 3;				\
+    for (int i = 0; i < NUM_ELEMS (TYPE); i++)	\
+      {						\
+	a[i] = (i * 0.1) * (i & 1 ? 1 : -1);	\
+	b[i] = (i * 0.3) * (i & 1 ? 1 : -1);	\
+	r += a[i];				\
+	q -= b[i];				\
+	asm volatile ("" ::: "memory");		\
+      }						\
+    TYPE res = reduc_plus_##TYPE (a, b);	\
+    if (res != r * q)				\
+      __builtin_abort ();			\
+  }
+
+int __attribute__ ((optimize (1)))
+main ()
+{
+  TEST_ALL (TEST_REDUC_PLUS);
+  return 0;
+}
Index: gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_2.c
===================================================================
--- /dev/null	2017-11-20 18:51:34.589640877 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_2.c	2017-11-21 17:06:25.016421335 +0000
@@ -0,0 +1,28 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve" } */
+
+#define NUM_ELEMS(TYPE) ((int) (5 * (256 / sizeof (TYPE)) + 3))
+
+#define DEF_REDUC_PLUS(TYPE)					\
+void __attribute__ ((noinline, noclone))			\
+reduc_plus_##TYPE (TYPE (*restrict a)[NUM_ELEMS (TYPE)],	\
+		   TYPE *restrict r, int n)			\
+{								\
+  for (int i = 0; i < n; i++)					\
+    {								\
+      r[i] = 0;							\
+      for (int j = 0; j < NUM_ELEMS (TYPE); j++)		\
+	r[i] += a[i][j];					\
+    }								\
+}
+
+#define TEST_ALL(T) \
+  T (_Float16) \
+  T (float) \
+  T (double)
+
+TEST_ALL (DEF_REDUC_PLUS)
+
+/* { dg-final { scan-assembler-times {\tfadda\th[0-9]+, p[0-7], h[0-9]+, z[0-9]+\.h} 1 } } */
+/* { dg-final { scan-assembler-times {\tfadda\ts[0-9]+, p[0-7], s[0-9]+, z[0-9]+\.s} 1 } } */
+/* { dg-final { scan-assembler-times {\tfadda\td[0-9]+, p[0-7], d[0-9]+, z[0-9]+\.d} 1 } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_2_run.c
===================================================================
--- /dev/null	2017-11-20 18:51:34.589640877 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_2_run.c	2017-11-21 17:06:25.016421335 +0000
@@ -0,0 +1,31 @@
+/* { dg-do run { target { aarch64_sve_hw } } } */
+/* { dg-options "-O2 -ftree-vectorize -fno-inline -march=armv8-a+sve" } */
+
+#include "sve_reduc_strict_2.c"
+
+#define NROWS 5
+
+#define TEST_REDUC_PLUS(TYPE)					\
+  {								\
+    TYPE a[NROWS][NUM_ELEMS (TYPE)];				\
+    TYPE r[NROWS];						\
+    TYPE expected[NROWS] = {};					\
+    for (int i = 0; i < NROWS; ++i)				\
+      for (int j = 0; j < NUM_ELEMS (TYPE); ++j)		\
+	{							\
+	  a[i][j] = (i * 0.1 + j * 0.6) * (j & 1 ? 1 : -1);	\
+	  expected[i] += a[i][j];				\
+	  asm volatile ("" ::: "memory");			\
+	}							\
+    reduc_plus_##TYPE (a, r, NROWS);				\
+    for (int i = 0; i < NROWS; ++i)				\
+      if (r[i] != expected[i])					\
+	__builtin_abort ();					\
+  }
+
+int __attribute__ ((optimize (1)))
+main ()
+{
+  TEST_ALL (TEST_REDUC_PLUS);
+  return 0;
+}
Index: gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_3.c
===================================================================
--- /dev/null	2017-11-20 18:51:34.589640877 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_3.c	2017-11-21 17:06:25.016421335 +0000
@@ -0,0 +1,131 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -fno-inline -march=armv8-a+sve -msve-vector-bits=256 -fdump-tree-vect-details" } */
+
+double mat[100][4];
+double mat2[100][8];
+double mat3[100][12];
+double mat4[100][3];
+
+double
+slp_reduc_plus (int n)
+{
+  double tmp = 0.0;
+  for (int i = 0; i < n; i++)
+    {
+      tmp = tmp + mat[i][0];
+      tmp = tmp + mat[i][1];
+      tmp = tmp + mat[i][2];
+      tmp = tmp + mat[i][3];
+    }
+  return tmp;
+}
+
+double
+slp_reduc_plus2 (int n)
+{
+  double tmp = 0.0;
+  for (int i = 0; i < n; i++)
+    {
+      tmp = tmp + mat2[i][0];
+      tmp = tmp + mat2[i][1];
+      tmp = tmp + mat2[i][2];
+      tmp = tmp + mat2[i][3];
+      tmp = tmp + mat2[i][4];
+      tmp = tmp + mat2[i][5];
+      tmp = tmp + mat2[i][6];
+      tmp = tmp + mat2[i][7];
+    }
+  return tmp;
+}
+
+double
+slp_reduc_plus3 (int n)
+{
+  double tmp = 0.0;
+  for (int i = 0; i < n; i++)
+    {
+      tmp = tmp + mat3[i][0];
+      tmp = tmp + mat3[i][1];
+      tmp = tmp + mat3[i][2];
+      tmp = tmp + mat3[i][3];
+      tmp = tmp + mat3[i][4];
+      tmp = tmp + mat3[i][5];
+      tmp = tmp + mat3[i][6];
+      tmp = tmp + mat3[i][7];
+      tmp = tmp + mat3[i][8];
+      tmp = tmp + mat3[i][9];
+      tmp = tmp + mat3[i][10];
+      tmp = tmp + mat3[i][11];
+    }
+  return tmp;
+}
+
+void
+slp_non_chained_reduc (int n, double * restrict out)
+{
+  for (int i = 0; i < 3; i++)
+    out[i] = 0;
+
+  for (int i = 0; i < n; i++)
+    {
+      out[0] = out[0] + mat4[i][0];
+      out[1] = out[1] + mat4[i][1];
+      out[2] = out[2] + mat4[i][2];
+    }
+}
+
+/* Strict FP reductions shouldn't be used for the outer loops, only the
+   inner loops.  */
+
+float
+double_reduc1 (float (*restrict i)[16])
+{
+  float l = 0;
+
+  for (int a = 0; a < 8; a++)
+    for (int b = 0; b < 8; b++)
+      l += i[b][a];
+  return l;
+}
+
+float
+double_reduc2 (float *restrict i)
+{
+  float l = 0;
+
+  for (int a = 0; a < 8; a++)
+    for (int b = 0; b < 16; b++)
+      {
+        l += i[b * 4];
+        l += i[b * 4 + 1];
+        l += i[b * 4 + 2];
+        l += i[b * 4 + 3];
+      }
+  return l;
+}
+
+float
+double_reduc3 (float *restrict i, float *restrict j)
+{
+  float k = 0, l = 0;
+
+  for (int a = 0; a < 8; a++)
+    for (int b = 0; b < 8; b++)
+      {
+        k += i[b];
+        l += j[b];
+      }
+  return l * k;
+}
+
+/* We can't yet handle double_reduc1.  */
+/* { dg-final { scan-assembler-times {\tfadda\ts[0-9]+, p[0-7], s[0-9]+, z[0-9]+\.s} 3 } } */
+/* { dg-final { scan-assembler-times {\tfadda\td[0-9]+, p[0-7], d[0-9]+, z[0-9]+\.d} 9 } } */
+/* 1 reduction each for double_reduc{1,2} and 2 for double_reduc3.  Each one
+   is reported three times, once for SVE, once for 128-bit AdvSIMD and once
+   for 64-bit AdvSIMD.  */
+/* { dg-final { scan-tree-dump-times "Detected double reduction" 12 "vect" } } */
+/* double_reduc2 has 2 reductions and slp_non_chained_reduc has 3.
+   double_reduc1 is reported 3 times (SVE, 128-bit AdvSIMD, 64-bit AdvSIMD)
+   before failing.  */
+/* { dg-final { scan-tree-dump-times "Detected reduction" 12 "vect" } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_slp_13.c
===================================================================
--- gcc/testsuite/gcc.target/aarch64/sve_slp_13.c	2017-11-21 17:06:24.670434749 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_slp_13.c	2017-11-21 17:06:25.016421335 +0000
@@ -1,5 +1,6 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve -msve-vector-bits=scalable" } */
+/* The cost model thinks that the double loop isn't a win for SVE-128.  */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve -msve-vector-bits=scalable -fno-vect-cost-model" } */
 
 #include <stdint.h>
 
@@ -24,7 +25,10 @@ #define TEST_ALL(T)				\
   T (int32_t)					\
   T (uint32_t)					\
   T (int64_t)					\
-  T (uint64_t)
+  T (uint64_t)					\
+  T (_Float16)					\
+  T (float)					\
+  T (double)
 
 TEST_ALL (VEC_PERM)
 
@@ -32,21 +36,25 @@ TEST_ALL (VEC_PERM)
 /* ??? We don't treat the uint loops as SLP.  */
 /* The loop should be fully-masked.  */
 /* { dg-final { scan-assembler-times {\tld1b\t} 2 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\tld1h\t} 2 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\tld1w\t} 2 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\tld1w\t} 1 } } */
-/* { dg-final { scan-assembler-times {\tld1d\t} 2 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\tld1d\t} 1 } } */
+/* { dg-final { scan-assembler-times {\tld1h\t} 3 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tld1w\t} 3 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tld1w\t} 2 } } */
+/* { dg-final { scan-assembler-times {\tld1d\t} 3 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tld1d\t} 2 } } */
 /* { dg-final { scan-assembler-not {\tldr} { xfail *-*-* } } } */
 
 /* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.b} 4 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.h} 4 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s} 4 } } */
-/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d} 4 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.h} 6 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s} 6 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d} 6 } } */
 
 /* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.b\n} 2 { xfail *-*-* } } } */
 /* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.h\n} 2 { xfail *-*-* } } } */
 /* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.s\n} 2 } } */
 /* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.d\n} 2 } } */
+/* { dg-final { scan-assembler-times {\tfadda\th[0-9]+, p[0-7], h[0-9]+, z[0-9]+\.h\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tfadda\ts[0-9]+, p[0-7], s[0-9]+, z[0-9]+\.s\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tfadda\td[0-9]+, p[0-7], d[0-9]+, z[0-9]+\.d\n} 1 } } */
+/* { dg-final { scan-assembler-not {\tfadd\n} } } */
 
 /* { dg-final { scan-assembler-not {\tuqdec} } } */
Index: gcc/testsuite/gfortran.dg/vect/vect-8.f90
===================================================================
--- gcc/testsuite/gfortran.dg/vect/vect-8.f90	2017-11-21 17:06:24.670434749 +0000
+++ gcc/testsuite/gfortran.dg/vect/vect-8.f90	2017-11-21 17:06:25.016421335 +0000
@@ -704,5 +704,5 @@ CALL track('KERNEL  ')
 RETURN
 END SUBROUTINE kernel
 
-! { dg-final { scan-tree-dump-times "vectorized 21 loops" 1 "vect" { target { vect_intdouble_cvt } } } }
+! { dg-final { scan-tree-dump-times "vectorized 22 loops" 1 "vect" { target vect_intdouble_cvt } } }
 ! { dg-final { scan-tree-dump-times "vectorized 17 loops" 1 "vect" { target { ! vect_intdouble_cvt } } } }

Richard Sandiford Nov. 21, 2017, 5:10 p.m. UTC | #8

Jeff Law <law@redhat.com> writes:
> On 11/21/2017 09:45 AM, Richard Sandiford wrote:
>> Richard Biener <richard.guenther@gmail.com> writes:
>>> On Mon, Nov 20, 2017 at 1:54 PM, Richard Sandiford
>>> <richard.sandiford@linaro.org> wrote:
>>>> Richard Biener <richard.guenther@gmail.com> writes:
>>>>> On Fri, Nov 17, 2017 at 5:53 PM, Richard Sandiford
>>>>> <richard.sandiford@linaro.org> wrote:
>>>>>> This patch adds support for in-order floating-point addition reductions,
>>>>>> which are suitable even in strict IEEE mode.
>>>>>>
>>>>>> Previously vect_is_simple_reduction would reject any cases that forbid
>>>>>> reassociation.  The idea is instead to tentatively accept them as
>>>>>> "FOLD_LEFT_REDUCTIONs" and only fail later if there is no target
>>>>>> support for them.  Although this patch only handles the particular
>>>>>> case of plus and minus on floating-point types, there's no reason in
>>>>>> principle why targets couldn't handle other cases.
>>>>>>
>>>>>> The vect_force_simple_reduction change makes it simpler for parloops
>>>>>> to read the type of reduction.
>>>>>>
>>>>>> Tested on aarch64-linux-gnu (with and without SVE), x86_64-linux-gnu
>>>>>> and powerpc64le-linux-gnu.  OK to install?
>>>>>
>>>>> I don't like that you add a new tree code for this.  A new IFN looks more
>>>>> suitable to me.
>>>>
>>>> OK.
>>>
>>> Thanks.  I'd like to eventually get rid of other vectorizer tree
>>> codes as well,
>>> like the REDUC_*_EXPR, DOT_PROD_EXPR and SAD_EXPR.  IFNs
>>> are now really the way to go for "target instructions on GIMPLE".
>>>
>>>>> Also I think if there's a way to handle this correctly with target support
>>>>> you can also implement a fallback if there is no such support increasing
>>>>> test coverage.  It would basically boil down to extracting all scalars from
>>>>> the non-reduction operand vector and performing a series of reduction
>>>>> ops, keeping the reduction PHI scalar.  This would also support any
>>>>> reduction operator.
>>>>
>>>> Yeah, but without target support, that's probably going to be expensive.
>>>> It's a bit like how we can implement element-by-element loads and stores
>>>> for cases that don't have target support, but had to explicitly disable
>>>> that in many cases, since the cost model was too optimistic.
>>>
>>> I expect that for V2DF or even V4DF it might be profitable in quite a number
>>> of cases.  V2DF definitely.
>>>
>>>> I can give it a go anyway if you think it's worth it.
>>>
>>> I think it is.
>>
>> OK, here's 2/3.  It just splits out some code for reuse in 3/3.
> [ ... ]
> Is this going to obsolete any of the stuff posted to date?  I'm thinking
> specifically about "Add support for bitwise reductions", but perhaps
> there are others.


It means that those codes go in and then come out again, yeah, although
the end result is the same.

OK, I'll redo it so that this goes first and then repost a new bitwise
patch too.

Thanks,
Richard
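
For readers following the thread, here is a minimal C sketch of why the
patch distinguishes in-order reductions from reassociated ones.  It is
not taken from the patch: the function names sum_in_order and
sum_two_lanes are hypothetical, and the two-lane split only stands in
for what a reassociating vectorizer might do.

```c
#include <stddef.h>

/* Strict left fold: (((init + a[0]) + a[1]) + a[2]) + ...
   This is the evaluation order that an in-order reduction preserves.  */
double
sum_in_order (const double *a, size_t n, double init)
{
  double res = init;
  for (size_t i = 0; i < n; i++)
    res += a[i];
  return res;
}

/* Hypothetical two-lane reassociated sum: even and odd elements are
   accumulated separately and only combined at the end.  */
double
sum_two_lanes (const double *a, size_t n, double init)
{
  double lane[2] = { init, 0.0 };
  size_t i = 0;
  for (; i + 2 <= n; i += 2)
    {
      lane[0] += a[i];
      lane[1] += a[i + 1];
    }
  /* Leftover element when N is odd.  */
  if (i < n)
    lane[0] += a[i];
  return lane[0] + lane[1];
}
```

Assuming double arithmetic is evaluated in double precision
(FLT_EVAL_METHOD == 0) and no -ffast-math, the input
{ 1e17, 1.0, -1e17, 1.0 } with init 0.0 gives 1.0 for sum_in_order but
2.0 for sum_two_lanes, because 1e17 + 1.0 rounds back to 1e17.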

Richard Sandiford Jan. 9, 2018, 3:36 p.m. UTC | #9

Ping

Richard Sandiford <richard.sandiford@linaro.org> writes:
> Richard Biener <richard.guenther@gmail.com> writes:
>> On Mon, Nov 20, 2017 at 1:54 PM, Richard Sandiford
>> <richard.sandiford@linaro.org> wrote:
>>> Richard Biener <richard.guenther@gmail.com> writes:
>>>> On Fri, Nov 17, 2017 at 5:53 PM, Richard Sandiford
>>>> <richard.sandiford@linaro.org> wrote:
>>>>> This patch adds support for in-order floating-point addition reductions,
>>>>> which are suitable even in strict IEEE mode.
>>>>>
>>>>> Previously vect_is_simple_reduction would reject any cases that forbid
>>>>> reassociation.  The idea is instead to tentatively accept them as
>>>>> "FOLD_LEFT_REDUCTIONs" and only fail later if there is no target
>>>>> support for them.  Although this patch only handles the particular
>>>>> case of plus and minus on floating-point types, there's no reason in
>>>>> principle why targets couldn't handle other cases.
>>>>>
>>>>> The vect_force_simple_reduction change makes it simpler for parloops
>>>>> to read the type of reduction.
>>>>>
>>>>> Tested on aarch64-linux-gnu (with and without SVE), x86_64-linux-gnu
>>>>> and powerpc64le-linux-gnu.  OK to install?
>>>>
>>>> I don't like that you add a new tree code for this.  A new IFN looks more
>>>> suitable to me.
>>>
>>> OK.
>>
>> Thanks.  I'd like to eventually get rid of other vectorizer tree codes as well,
>> like the REDUC_*_EXPR, DOT_PROD_EXPR and SAD_EXPR.  IFNs
>> are now really the way to go for "target instructions on GIMPLE".
>>
>>>> Also I think if there's a way to handle this correctly with target support
>>>> you can also implement a fallback if there is no such support increasing
>>>> test coverage.  It would basically boil down to extracting all scalars from
>>>> the non-reduction operand vector and performing a series of reduction
>>>> ops, keeping the reduction PHI scalar.  This would also support any
>>>> reduction operator.
>>>
>>> Yeah, but without target support, that's probably going to be expensive.
>>> It's a bit like how we can implement element-by-element loads and stores
>>> for cases that don't have target support, but had to explicitly disable
>>> that in many cases, since the cost model was too optimistic.
>>
>> I expect that for V2DF or even V4DF it might be profitable in quite a number
>> of cases.  V2DF definitely.
>>
>>> I can give it a go anyway if you think it's worth it.
>>
>> I think it is.
>
> OK, done in the patch below.  Tested as before.
>
> Thanks,
> Richard


2017-11-21  Richard Sandiford  <richard.sandiford@linaro.org>
	    Alan Hayward  <alan.hayward@arm.com>
	    David Sherwood  <david.sherwood@arm.com>

gcc/
	* optabs.def (fold_left_plus_optab): New optab.
	* doc/md.texi (fold_left_plus_@var{m}): Document.
	* internal-fn.def (IFN_FOLD_LEFT_PLUS): New internal function.
	* internal-fn.c (fold_left_direct): Define.
	(expand_fold_left_optab_fn): Likewise.
	(direct_fold_left_optab_supported_p): Likewise.
	* fold-const-call.c (fold_const_fold_left): New function.
	(fold_const_call): Use it to fold CFN_FOLD_LEFT_PLUS.
	* tree-parloops.c (valid_reduction_p): New function.
	(gather_scalar_reductions): Use it.
	* tree-vectorizer.h (FOLD_LEFT_REDUCTION): New vect_reduction_type.
	(vect_finish_replace_stmt): Declare.
	* tree-vect-loop.c (fold_left_reduction_fn): New function.
	(needs_fold_left_reduction_p): New function, split out from...
	(vect_is_simple_reduction): ...here.  Accept reductions that
	forbid reassociation, but give them type FOLD_LEFT_REDUCTION.
	(vect_force_simple_reduction): Also store the reduction type in
	the assignment's STMT_VINFO_REDUC_TYPE.
	(vect_model_reduction_cost): Handle FOLD_LEFT_REDUCTION.
	(merge_with_identity): New function.
	(vectorize_fold_left_reduction): Likewise.
	(vectorizable_reduction): Handle FOLD_LEFT_REDUCTION.  Leave the
	scalar phi in place for it.  Check for target support and reject
	cases that would reassociate the operation.  Defer the transform
	phase to vectorize_fold_left_reduction.
	* config/aarch64/aarch64.md (UNSPEC_FADDA): New unspec.
	* config/aarch64/aarch64-sve.md (fold_left_plus_<mode>): New expander.
	(*fold_left_plus_<mode>, *pred_fold_left_plus_<mode>): New insns.

gcc/testsuite/
	* gcc.dg/vect/no-fast-math-vect16.c: Expect the test to pass and
	check for a message about using in-order reductions.
	* gcc.dg/vect/pr79920.c: Expect both loops to be vectorized and
	check for a message about using in-order reductions.
	* gcc.dg/vect/trapv-vect-reduc-4.c: Expect all three loops to be
	vectorized and check for a message about using in-order reductions.
	* gcc.dg/vect/vect-reduc-6.c: Expect the loop to be vectorized and
	check for a message about using in-order reductions.
	* gcc.dg/vect/vect-reduc-in-order-1.c: New test.
	* gcc.dg/vect/vect-reduc-in-order-2.c: Likewise.
	* gcc.dg/vect/vect-reduc-in-order-3.c: Likewise.
	* gcc.dg/vect/vect-reduc-in-order-4.c: Likewise.
	* gcc.target/aarch64/sve_reduc_strict_1.c: New test.
	* gcc.target/aarch64/sve_reduc_strict_1_run.c: Likewise.
	* gcc.target/aarch64/sve_reduc_strict_2.c: Likewise.
	* gcc.target/aarch64/sve_reduc_strict_2_run.c: Likewise.
	* gcc.target/aarch64/sve_reduc_strict_3.c: Likewise.
	* gcc.target/aarch64/sve_slp_13.c: Add floating-point types.
	* gfortran.dg/vect/vect-8.f90: Expect 22 loops to be vectorized if
	vect_fold_left_plus.
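
The open-coded fallback discussed earlier in the thread (extract every
scalar from the vector operand and apply the reduction operation to a
scalar accumulator) can be pictured with the rough C sketch below.  It
is an illustration only, not the GIMPLE that the patch generates: the
v2df typedef relies on the GCC/Clang vector extension, and the names
fold_left_plus_v2df and sum_open_coded are hypothetical.

```c
#include <stddef.h>

/* Two-lane vector of doubles (GCC/Clang vector extension), standing in
   for a V2DF vector operand.  */
typedef double v2df __attribute__ ((vector_size (16)));

/* Hypothetical helper: add each lane of VEC to ACC in lane order,
   i.e. one extract and one scalar add per element.  */
static double
fold_left_plus_v2df (double acc, v2df vec)
{
  for (int lane = 0; lane < 2; lane++)
    acc += vec[lane];
  return acc;
}

/* Sum A[0..N-1] strictly in order, two elements per vector iteration.
   The accumulator stays scalar throughout, like the scalar reduction
   phi that is left in place.  */
double
sum_open_coded (const double *a, size_t n)
{
  double acc = 0.0;
  size_t i = 0;
  for (; i + 2 <= n; i += 2)
    {
      v2df vec = { a[i], a[i + 1] };
      acc = fold_left_plus_v2df (acc, vec);
    }
  /* Scalar tail for an odd N.  */
  for (; i < n; i++)
    acc += a[i];
  return acc;
}
```

Each vector costs one extract and one scalar operation per element in
this form, which is the trade-off that the thread weighs against a
single target instruction such as FADDA.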

Index: gcc/optabs.def
===================================================================
--- gcc/optabs.def	2017-11-21 17:06:24.670434749 +0000
+++ gcc/optabs.def	2017-11-21 17:06:25.015421374 +0000
@@ -306,6 +306,7 @@ OPTAB_D (reduc_umin_scal_optab, "reduc_u
 OPTAB_D (reduc_and_scal_optab,  "reduc_and_scal_$a")
 OPTAB_D (reduc_ior_scal_optab,  "reduc_ior_scal_$a")
 OPTAB_D (reduc_xor_scal_optab,  "reduc_xor_scal_$a")
+OPTAB_D (fold_left_plus_optab, "fold_left_plus_$a")
 
 OPTAB_D (extract_last_optab, "extract_last_$a")
 OPTAB_D (fold_extract_last_optab, "fold_extract_last_$a")
Index: gcc/doc/md.texi
===================================================================
--- gcc/doc/md.texi	2017-11-21 17:06:24.670434749 +0000
+++ gcc/doc/md.texi	2017-11-21 17:06:25.014421412 +0000
@@ -5285,6 +5285,14 @@ has mode @var{m} and operands 0 and 1 ha
 one element of @var{m}.  Operand 2 has the usual mask mode for vectors
 of mode @var{m}; see @code{TARGET_VECTORIZE_GET_MASK_MODE}.
 
+@cindex @code{fold_left_plus_@var{m}} instruction pattern
+@item @code{fold_left_plus_@var{m}}
+Take scalar operand 1 and successively add each element from vector
+operand 2.  Store the result in scalar operand 0.  The vector has
+mode @var{m} and the scalars have the mode appropriate for one
+element of @var{m}.  The operation is strictly in-order: there is
+no reassociation.
+
 @cindex @code{sdot_prod@var{m}} instruction pattern
 @item @samp{sdot_prod@var{m}}
 @cindex @code{udot_prod@var{m}} instruction pattern
Index: gcc/internal-fn.def
===================================================================
--- gcc/internal-fn.def	2017-11-21 17:06:24.670434749 +0000
+++ gcc/internal-fn.def	2017-11-21 17:06:25.015421374 +0000
@@ -59,6 +59,8 @@ along with GCC; see the file COPYING3.
 
    - cond_binary: a conditional binary optab, such as add<mode>cc
 
+   - fold_left: for scalar = FN (scalar, vector), keyed off the vector mode
+
    DEF_INTERNAL_SIGNED_OPTAB_FN defines an internal function that
    maps to one of two optabs, depending on the signedness of an input.
    SIGNED_OPTAB and UNSIGNED_OPTAB are the optabs for signed and
@@ -177,6 +179,8 @@ DEF_INTERNAL_OPTAB_FN (EXTRACT_LAST, ECF
 DEF_INTERNAL_OPTAB_FN (FOLD_EXTRACT_LAST, ECF_CONST | ECF_NOTHROW,
 		       fold_extract_last, fold_extract)
 
+DEF_INTERNAL_OPTAB_FN (FOLD_LEFT_PLUS, ECF_CONST | ECF_NOTHROW,
+		       fold_left_plus, fold_left)
 
 /* Unary math functions.  */
 DEF_INTERNAL_FLT_FN (ACOS, ECF_CONST, acos, unary)
Index: gcc/internal-fn.c
===================================================================
--- gcc/internal-fn.c	2017-11-21 17:06:24.670434749 +0000
+++ gcc/internal-fn.c	2017-11-21 17:06:25.015421374 +0000
@@ -92,6 +92,7 @@ #define cond_unary_direct { 1, 1, true }
 #define cond_binary_direct { 1, 1, true }
 #define while_direct { 0, 2, false }
 #define fold_extract_direct { 2, 2, false }
+#define fold_left_direct { 1, 1, false }
 
 const direct_internal_fn_info direct_internal_fn_array[IFN_LAST + 1] = {
 #define DEF_INTERNAL_FN(CODE, FLAGS, FNSPEC) not_direct,
@@ -2839,6 +2840,9 @@ #define expand_cond_binary_optab_fn(FN,
 #define expand_fold_extract_optab_fn(FN, STMT, OPTAB) \
   expand_direct_optab_fn (FN, STMT, OPTAB, 3)
 
+#define expand_fold_left_optab_fn(FN, STMT, OPTAB) \
+  expand_direct_optab_fn (FN, STMT, OPTAB, 2)
+
 /* RETURN_TYPE and ARGS are a return type and argument list that are
    in principle compatible with FN (which satisfies direct_internal_fn_p).
    Return the types that should be used to determine whether the
@@ -2922,6 +2926,7 @@ #define direct_store_lanes_optab_support
 #define direct_mask_store_lanes_optab_supported_p multi_vector_optab_supported_p
 #define direct_while_optab_supported_p convert_optab_supported_p
 #define direct_fold_extract_optab_supported_p direct_optab_supported_p
+#define direct_fold_left_optab_supported_p direct_optab_supported_p
 
 /* Return the optab used by internal function FN.  */
 
Index: gcc/fold-const-call.c
===================================================================
--- gcc/fold-const-call.c	2017-11-21 17:06:24.670434749 +0000
+++ gcc/fold-const-call.c	2017-11-21 17:06:25.014421412 +0000
@@ -1190,6 +1190,25 @@ fold_const_call (combined_fn fn, tree ty
     }
 }
 
+/* Fold a call to IFN_FOLD_LEFT_<CODE> (ARG0, ARG1), returning a value
+   of type TYPE.  */
+
+static tree
+fold_const_fold_left (tree type, tree arg0, tree arg1, tree_code code)
+{
+  if (TREE_CODE (arg1) != VECTOR_CST)
+    return NULL_TREE;
+
+  unsigned int nelts = VECTOR_CST_NELTS (arg1);
+  for (unsigned int i = 0; i < nelts; i++)
+    {
+      arg0 = const_binop (code, type, arg0, VECTOR_CST_ELT (arg1, i));
+      if (arg0 == NULL_TREE || !CONSTANT_CLASS_P (arg0))
+	return NULL_TREE;
+    }
+  return arg0;
+}
+
 /* Try to evaluate:
 
       *RESULT = FN (*ARG0, *ARG1)
@@ -1495,6 +1514,9 @@ fold_const_call (combined_fn fn, tree ty
 	}
       return NULL_TREE;
 
+    case CFN_FOLD_LEFT_PLUS:
+      return fold_const_fold_left (type, arg0, arg1, PLUS_EXPR);
+
     default:
       return fold_const_call_1 (fn, type, arg0, arg1);
     }
Index: gcc/tree-parloops.c
===================================================================
--- gcc/tree-parloops.c	2017-11-21 17:06:24.670434749 +0000
+++ gcc/tree-parloops.c	2017-11-21 17:06:25.017421296 +0000
@@ -2531,6 +2531,19 @@ set_reduc_phi_uids (reduction_info **slo
   return 1;
 }
 
+/* Return true if the type of reduction performed by STMT is suitable
+   for this pass.  */
+
+static bool
+valid_reduction_p (gimple *stmt)
+{
+  /* Parallelization would reassociate the operation, which isn't
+     allowed for in-order reductions.  */
+  stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
+  vect_reduction_type reduc_type = STMT_VINFO_REDUC_TYPE (stmt_info);
+  return reduc_type != FOLD_LEFT_REDUCTION;
+}
+
 /* Detect all reductions in the LOOP, insert them into REDUCTION_LIST.  */
 
 static void
@@ -2564,7 +2577,7 @@ gather_scalar_reductions (loop_p loop, r
       gimple *reduc_stmt
 	= vect_force_simple_reduction (simple_loop_info, phi,
 				       &double_reduc, true);
-      if (!reduc_stmt)
+      if (!reduc_stmt || !valid_reduction_p (reduc_stmt))
 	continue;
 
       if (double_reduc)
@@ -2610,7 +2623,8 @@ gather_scalar_reductions (loop_p loop, r
 		= vect_force_simple_reduction (simple_loop_info, inner_phi,
 					       &double_reduc, true);
 	      gcc_assert (!double_reduc);
-	      if (inner_reduc_stmt == NULL)
+	      if (inner_reduc_stmt == NULL
+		  || !valid_reduction_p (inner_reduc_stmt))
 		continue;
 
 	      build_new_reduction (reduction_list, double_reduc_stmts[i], phi);
Index: gcc/tree-vectorizer.h
===================================================================
--- gcc/tree-vectorizer.h	2017-11-21 17:06:24.670434749 +0000
+++ gcc/tree-vectorizer.h	2017-11-21 17:06:25.018421257 +0000
@@ -74,7 +74,15 @@ enum vect_reduction_type {
 
        for (int i = 0; i < VF; ++i)
          res = cond[i] ? val[i] : res;  */
-  EXTRACT_LAST_REDUCTION
+  EXTRACT_LAST_REDUCTION,
+
+  /* Use a folding reduction within the loop to implement:
+
+       for (int i = 0; i < VF; ++i)
+         res = res OP val[i];
+
+     (with no reassociation).  */
+  FOLD_LEFT_REDUCTION
 };
 
 #define VECTORIZABLE_CYCLE_DEF(D) (((D) == vect_reduction_def)           \
@@ -1389,6 +1397,7 @@ extern void vect_model_load_cost (stmt_v
 extern unsigned record_stmt_cost (stmt_vector_for_cost *, int,
 				  enum vect_cost_for_stmt, stmt_vec_info,
 				  int, enum vect_cost_model_location);
+extern void vect_finish_replace_stmt (gimple *, gimple *);
 extern void vect_finish_stmt_generation (gimple *, gimple *,
                                          gimple_stmt_iterator *);
 extern bool vect_mark_stmts_to_be_vectorized (loop_vec_info);
Index: gcc/tree-vect-loop.c
===================================================================
--- gcc/tree-vect-loop.c	2017-11-21 17:06:24.670434749 +0000
+++ gcc/tree-vect-loop.c	2017-11-21 17:06:25.018421257 +0000
@@ -2573,6 +2573,22 @@ vect_analyze_loop (struct loop *loop, lo
     }
 }
 
+/* Return true if there is an in-order reduction function for CODE, storing
+   it in *REDUC_FN if so.  */
+
+static bool
+fold_left_reduction_fn (tree_code code, internal_fn *reduc_fn)
+{
+  switch (code)
+    {
+    case PLUS_EXPR:
+      *reduc_fn = IFN_FOLD_LEFT_PLUS;
+      return true;
+
+    default:
+      return false;
+    }
+}
 
 /* Function reduction_fn_for_scalar_code
 
@@ -2879,6 +2895,42 @@ vect_is_slp_reduction (loop_vec_info loo
   return true;
 }
 
+/* Return true if we need an in-order reduction for operation CODE
+   on type TYPE.  NEED_WRAPPING_INTEGRAL_OVERFLOW is true if integer
+   overflow must wrap.  */
+
+static bool
+needs_fold_left_reduction_p (tree type, tree_code code,
+			     bool need_wrapping_integral_overflow)
+{
+  /* CHECKME: check for !flag_finite_math_only too?  */
+  if (SCALAR_FLOAT_TYPE_P (type))
+    switch (code)
+      {
+      case MIN_EXPR:
+      case MAX_EXPR:
+	return false;
+
+      default:
+	return !flag_associative_math;
+      }
+
+  if (INTEGRAL_TYPE_P (type))
+    {
+      if (!operation_no_trapping_overflow (type, code))
+	return true;
+      if (need_wrapping_integral_overflow
+	  && !TYPE_OVERFLOW_WRAPS (type)
+	  && operation_can_overflow (code))
+	return true;
+      return false;
+    }
+
+  if (SAT_FIXED_POINT_TYPE_P (type))
+    return true;
+
+  return false;
+}
 
 /* Function vect_is_simple_reduction
 
@@ -3197,58 +3249,18 @@ vect_is_simple_reduction (loop_vec_info
       return NULL;
     }
 
-  /* Check that it's ok to change the order of the computation.
+  /* Check whether it's ok to change the order of the computation.
      Generally, when vectorizing a reduction we change the order of the
      computation.  This may change the behavior of the program in some
      cases, so we need to check that this is ok.  One exception is when
      vectorizing an outer-loop: the inner-loop is executed sequentially,
      and therefore vectorizing reductions in the inner-loop during
      outer-loop vectorization is safe.  */
-
-  if (*v_reduc_type != COND_REDUCTION
-      && check_reduction)
-    {
-      /* CHECKME: check for !flag_finite_math_only too?  */
-      if (SCALAR_FLOAT_TYPE_P (type) && !flag_associative_math)
-	{
-	  /* Changing the order of operations changes the semantics.  */
-	  if (dump_enabled_p ())
-	    report_vect_op (MSG_MISSED_OPTIMIZATION, def_stmt,
-			"reduction: unsafe fp math optimization: ");
-	  return NULL;
-	}
-      else if (INTEGRAL_TYPE_P (type))
-	{
-	  if (!operation_no_trapping_overflow (type, code))
-	    {
-	      /* Changing the order of operations changes the semantics.  */
-	      if (dump_enabled_p ())
-		report_vect_op (MSG_MISSED_OPTIMIZATION, def_stmt,
-				"reduction: unsafe int math optimization"
-				" (overflow traps): ");
-	      return NULL;
-	    }
-	  if (need_wrapping_integral_overflow
-	      && !TYPE_OVERFLOW_WRAPS (type)
-	      && operation_can_overflow (code))
-	    {
-	      /* Changing the order of operations changes the semantics.  */
-	      if (dump_enabled_p ())
-		report_vect_op (MSG_MISSED_OPTIMIZATION, def_stmt,
-				"reduction: unsafe int math optimization"
-				" (overflow doesn't wrap): ");
-	      return NULL;
-	    }
-	}
-      else if (SAT_FIXED_POINT_TYPE_P (type))
-	{
-	  /* Changing the order of operations changes the semantics.  */
-	  if (dump_enabled_p ())
-	  report_vect_op (MSG_MISSED_OPTIMIZATION, def_stmt,
-			  "reduction: unsafe fixed-point math optimization: ");
-	  return NULL;
-	}
-    }
+  if (check_reduction
+      && *v_reduc_type == TREE_CODE_REDUCTION
+      && needs_fold_left_reduction_p (type, code,
+				      need_wrapping_integral_overflow))
+    *v_reduc_type = FOLD_LEFT_REDUCTION;
 
   /* Reduction is safe. We're dealing with one of the following:
      1) integer arithmetic and no trapv
@@ -3512,6 +3524,7 @@ vect_force_simple_reduction (loop_vec_in
       STMT_VINFO_REDUC_TYPE (reduc_def_info) = v_reduc_type;
       STMT_VINFO_REDUC_DEF (reduc_def_info) = def;
       reduc_def_info = vinfo_for_stmt (def);
+      STMT_VINFO_REDUC_TYPE (reduc_def_info) = v_reduc_type;
       STMT_VINFO_REDUC_DEF (reduc_def_info) = phi;
     }
   return def;
@@ -4064,14 +4077,27 @@ vect_model_reduction_cost (stmt_vec_info
 
   code = gimple_assign_rhs_code (orig_stmt);
 
-  if (reduction_type == EXTRACT_LAST_REDUCTION)
+  if (reduction_type == EXTRACT_LAST_REDUCTION
+      || reduction_type == FOLD_LEFT_REDUCTION)
     {
       /* No extra instructions needed in the prologue.  */
       prologue_cost = 0;
 
-      /* Count NCOPIES FOLD_EXTRACT_LAST operations.  */
-      inside_cost = add_stmt_cost (target_cost_data, ncopies, vec_to_scalar,
-				   stmt_info, 0, vect_body);
+      if (reduction_type == EXTRACT_LAST_REDUCTION || reduc_fn != IFN_LAST)
+	/* Count one reduction-like operation per vector.  */
+	inside_cost = add_stmt_cost (target_cost_data, ncopies, vec_to_scalar,
+				     stmt_info, 0, vect_body);
+      else
+	{
+	  /* Use NELEMENTS extracts and NELEMENTS scalar ops.  */
+	  unsigned int nelements = ncopies * vect_nunits_for_cost (vectype);
+	  inside_cost = add_stmt_cost (target_cost_data,  nelements,
+				       vec_to_scalar, stmt_info, 0,
+				       vect_body);
+	  inside_cost += add_stmt_cost (target_cost_data,  nelements,
+					scalar_stmt, stmt_info, 0,
+					vect_body);
+	}
     }
   else
     {
@@ -4137,7 +4163,8 @@ vect_model_reduction_cost (stmt_vec_info
 					  scalar_stmt, stmt_info, 0,
 					  vect_epilogue);
 	}
-      else if (reduction_type == EXTRACT_LAST_REDUCTION)
+      else if (reduction_type == EXTRACT_LAST_REDUCTION
+	       || reduction_type == FOLD_LEFT_REDUCTION)
 	/* No extra instructions need in the epilogue.  */
 	;
       else
@@ -5910,6 +5937,160 @@ vect_create_epilog_for_reduction (vec<tr
     }
 }
 
+/* Return a vector of type VECTYPE that is equal to the vector select
+   operation "MASK ? VEC : IDENTITY".  Insert the select statements
+   before GSI.  */
+
+static tree
+merge_with_identity (gimple_stmt_iterator *gsi, tree mask, tree vectype,
+		     tree vec, tree identity)
+{
+  tree cond = make_temp_ssa_name (vectype, NULL, "cond");
+  gimple *new_stmt = gimple_build_assign (cond, VEC_COND_EXPR,
+					  mask, vec, identity);
+  gsi_insert_before (gsi, new_stmt, GSI_SAME_STMT);
+  return cond;
+}
+
+/* Perform an in-order reduction (FOLD_LEFT_REDUCTION).  STMT is the
+   statement that sets the live-out value.  REDUC_DEF_STMT is the phi
+   statement.  CODE is the operation performed by STMT and OPS are
+   its scalar operands.  REDUC_INDEX is the index of the operand in
+   OPS that is set by REDUC_DEF_STMT.  REDUC_FN is the function that
+   implements in-order reduction, or IFN_LAST if we should open-code it.
+   VECTYPE_IN is the type of the vector input.  MASKS specifies the masks
+   that should be used to control the operation in a fully-masked loop.  */
+
+static bool
+vectorize_fold_left_reduction (gimple *stmt, gimple_stmt_iterator *gsi,
+			       gimple **vec_stmt, slp_tree slp_node,
+			       gimple *reduc_def_stmt,
+			       tree_code code, internal_fn reduc_fn,
+			       tree ops[3], tree vectype_in,
+			       int reduc_index, vec_loop_masks *masks)
+{
+  stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
+  loop_vec_info loop_vinfo = STMT_VINFO_LOOP_VINFO (stmt_info);
+  struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
+  tree vectype_out = STMT_VINFO_VECTYPE (stmt_info);
+  gimple *new_stmt = NULL;
+
+  int ncopies;
+  if (slp_node)
+    ncopies = 1;
+  else
+    ncopies = vect_get_num_copies (loop_vinfo, vectype_in);
+
+  gcc_assert (!nested_in_vect_loop_p (loop, stmt));
+  gcc_assert (ncopies == 1);
+  gcc_assert (TREE_CODE_LENGTH (code) == binary_op);
+  gcc_assert (reduc_index == (code == MINUS_EXPR ? 0 : 1));
+  gcc_assert (STMT_VINFO_VEC_REDUCTION_TYPE (stmt_info)
+	      == FOLD_LEFT_REDUCTION);
+
+  if (slp_node)
+    gcc_assert (must_eq (TYPE_VECTOR_SUBPARTS (vectype_out),
+			 TYPE_VECTOR_SUBPARTS (vectype_in)));
+
+  tree op0 = ops[1 - reduc_index];
+
+  int group_size = 1;
+  gimple *scalar_dest_def;
+  auto_vec<tree> vec_oprnds0;
+  if (slp_node)
+    {
+      vect_get_vec_defs (op0, NULL_TREE, stmt, &vec_oprnds0, NULL, slp_node);
+      group_size = SLP_TREE_SCALAR_STMTS (slp_node).length ();
+      scalar_dest_def = SLP_TREE_SCALAR_STMTS (slp_node)[group_size - 1];
+    }
+  else
+    {
+      tree loop_vec_def0 = vect_get_vec_def_for_operand (op0, stmt);
+      vec_oprnds0.create (1);
+      vec_oprnds0.quick_push (loop_vec_def0);
+      scalar_dest_def = stmt;
+    }
+
+  tree scalar_dest = gimple_assign_lhs (scalar_dest_def);
+  tree scalar_type = TREE_TYPE (scalar_dest);
+  tree reduc_var = gimple_phi_result (reduc_def_stmt);
+
+  int vec_num = vec_oprnds0.length ();
+  gcc_assert (vec_num == 1 || slp_node);
+  tree vec_elem_type = TREE_TYPE (vectype_out);
+  gcc_checking_assert (useless_type_conversion_p (scalar_type, vec_elem_type));
+
+  tree vector_identity = NULL_TREE;
+  if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+    vector_identity = build_zero_cst (vectype_out);
+
+  tree scalar_dest_var = vect_create_destination_var (scalar_dest, NULL);
+  int i;
+  tree def0;
+  FOR_EACH_VEC_ELT (vec_oprnds0, i, def0)
+    {
+      tree mask = NULL_TREE;
+      if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+	mask = vect_get_loop_mask (gsi, masks, vec_num, vectype_in, i);
+
+      /* Handle MINUS by adding the negative.  */
+      if (reduc_fn != IFN_LAST && code == MINUS_EXPR)
+	{
+	  tree negated = make_ssa_name (vectype_out);
+	  new_stmt = gimple_build_assign (negated, NEGATE_EXPR, def0);
+	  gsi_insert_before (gsi, new_stmt, GSI_SAME_STMT);
+	  def0 = negated;
+	}
+
+      if (mask)
+	def0 = merge_with_identity (gsi, mask, vectype_out, def0,
+				    vector_identity);
+
+      /* On the first iteration the input is simply the scalar phi
+	 result, and for subsequent iterations it is the output of
+	 the preceding operation.  */
+      if (reduc_fn != IFN_LAST)
+	{
+	  new_stmt = gimple_build_call_internal (reduc_fn, 2, reduc_var, def0);
+	  /* For chained SLP reductions the output of the previous reduction
+	     operation serves as the input of the next.  For the final
+	     statement the output cannot be a temporary - we reuse the
+	     original scalar destination of the last statement.  */
+	  if (i != vec_num - 1)
+	    {
+	      gimple_set_lhs (new_stmt, scalar_dest_var);
+	      reduc_var = make_ssa_name (scalar_dest_var, new_stmt);
+	      gimple_set_lhs (new_stmt, reduc_var);
+	    }
+	}
+      else
+	{
+	  reduc_var = vect_expand_fold_left (gsi, scalar_dest_var, code,
+					     reduc_var, def0);
+	  new_stmt = SSA_NAME_DEF_STMT (reduc_var);
+	  /* Remove the statement, so that we can use the same code paths
+	     as for statements that we've just created.  */
+	  gimple_stmt_iterator tmp_gsi = gsi_for_stmt (new_stmt);
+	  gsi_remove (&tmp_gsi, false);
+	}
+
+      if (i == vec_num - 1)
+	{
+	  gimple_set_lhs (new_stmt, scalar_dest);
+	  vect_finish_replace_stmt (scalar_dest_def, new_stmt);
+	}
+      else
+	vect_finish_stmt_generation (scalar_dest_def, new_stmt, gsi);
+
+      if (slp_node)
+	SLP_TREE_VEC_STMTS (slp_node).quick_push (new_stmt);
+    }
+
+  if (!slp_node)
+    STMT_VINFO_VEC_STMT (stmt_info) = *vec_stmt = new_stmt;
+
+  return true;
+}
 
 /* Function is_nonwrapping_integer_induction.
 
@@ -6090,6 +6271,12 @@ vectorizable_reduction (gimple *stmt, gi
 	  return true;
 	}
 
+      if (STMT_VINFO_REDUC_TYPE (stmt_info) == FOLD_LEFT_REDUCTION)
+	/* Leave the scalar phi in place.  Note that checking
+	   STMT_VINFO_VEC_REDUCTION_TYPE (as below) only works
+	   for reductions involving a single statement.  */
+	return true;
+
       gimple *reduc_stmt = STMT_VINFO_REDUC_DEF (stmt_info);
       if (STMT_VINFO_IN_PATTERN_P (vinfo_for_stmt (reduc_stmt)))
 	reduc_stmt = STMT_VINFO_RELATED_STMT (vinfo_for_stmt (reduc_stmt));
@@ -6316,6 +6503,14 @@ vectorizable_reduction (gimple *stmt, gi
      directy used in stmt.  */
   if (reduc_index == -1)
     {
+      if (STMT_VINFO_REDUC_TYPE (stmt_info) == FOLD_LEFT_REDUCTION)
+	{
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			     "in-order reduction chain without SLP.\n");
+	  return false;
+	}
+
       if (orig_stmt)
 	reduc_def_stmt = STMT_VINFO_REDUC_DEF (orig_stmt_info);
       else
@@ -6535,7 +6730,9 @@ vectorizable_reduction (gimple *stmt, gi
 
   vect_reduction_type reduction_type
     = STMT_VINFO_VEC_REDUCTION_TYPE (stmt_info);
-  if (orig_stmt && reduction_type == TREE_CODE_REDUCTION)
+  if (orig_stmt
+      && (reduction_type == TREE_CODE_REDUCTION
+	  || reduction_type == FOLD_LEFT_REDUCTION))
     {
       /* This is a reduction pattern: get the vectype from the type of the
          reduction variable, and get the tree-code from orig_stmt.  */
@@ -6582,10 +6779,13 @@ vectorizable_reduction (gimple *stmt, gi
   reduc_fn = IFN_LAST;
 
   if (reduction_type == TREE_CODE_REDUCTION
+      || reduction_type == FOLD_LEFT_REDUCTION
       || reduction_type == INTEGER_INDUC_COND_REDUCTION
       || reduction_type == CONST_COND_REDUCTION)
     {
-      if (reduction_fn_for_scalar_code (orig_code, &reduc_fn))
+      if (reduction_type == FOLD_LEFT_REDUCTION
+	  ? fold_left_reduction_fn (orig_code, &reduc_fn)
+	  : reduction_fn_for_scalar_code (orig_code, &reduc_fn))
 	{
 	  if (reduc_fn != IFN_LAST
 	      && !direct_internal_fn_supported_p (reduc_fn, vectype_out,
@@ -6704,6 +6904,41 @@ vectorizable_reduction (gimple *stmt, gi
 	}
     }
 
+  if (double_reduc && reduction_type == FOLD_LEFT_REDUCTION)
+    {
+      /* We can't support in-order reductions of code such as this:
+
+	   for (int i = 0; i < n1; ++i)
+	     for (int j = 0; j < n2; ++j)
+	       l += a[j];
+
+	 since GCC effectively transforms the loop when vectorizing:
+
+	   for (int i = 0; i < n1 / VF; ++i)
+	     for (int j = 0; j < n2; ++j)
+	       for (int k = 0; k < VF; ++k)
+		 l += a[j];
+
+	 which is a reassociation of the original operation.  */
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			 "in-order double reduction not supported.\n");
+
+      return false;
+    }
+
+  if (reduction_type == FOLD_LEFT_REDUCTION
+      && slp_node
+      && !GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt)))
+    {
+      /* We cannot use in-order reductions in this case because there is
+         an implicit reassociation of the operations involved.  */
+      if (dump_enabled_p ())
+        dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			 "in-order unchained SLP reductions not supported.\n");
+      return false;
+    }
+
   /* In case of widenning multiplication by a constant, we update the type
      of the constant to be the type of the other operand.  We check that the
      constant fits the type in the pattern recognition pass.  */
@@ -6824,9 +7059,10 @@ vectorizable_reduction (gimple *stmt, gi
 	vect_model_reduction_cost (stmt_info, reduc_fn, ncopies);
       if (loop_vinfo && LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo))
 	{
-	  if (cond_fn == IFN_LAST
-	      || !direct_internal_fn_supported_p (cond_fn, vectype_in,
-						  OPTIMIZE_FOR_SPEED))
+	  if (reduction_type != FOLD_LEFT_REDUCTION
+	      && (cond_fn == IFN_LAST
+		  || !direct_internal_fn_supported_p (cond_fn, vectype_in,
+						      OPTIMIZE_FOR_SPEED)))
 	    {
 	      if (dump_enabled_p ())
 		dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
@@ -6846,6 +7082,10 @@ vectorizable_reduction (gimple *stmt, gi
 	    vect_record_loop_mask (loop_vinfo, masks, ncopies * vec_num,
 				   vectype_in);
 	}
+      if (dump_enabled_p ()
+	  && reduction_type == FOLD_LEFT_REDUCTION)
+	dump_printf_loc (MSG_NOTE, vect_location,
+			 "using an in-order (fold-left) reduction.\n");
       STMT_VINFO_TYPE (stmt_info) = reduc_vec_info_type;
       return true;
     }
@@ -6861,6 +7101,11 @@ vectorizable_reduction (gimple *stmt, gi
 
   bool masked_loop_p = LOOP_VINFO_FULLY_MASKED_P (loop_vinfo);
 
+  if (reduction_type == FOLD_LEFT_REDUCTION)
+    return vectorize_fold_left_reduction
+      (stmt, gsi, vec_stmt, slp_node, reduc_def_stmt, code,
+       reduc_fn, ops, vectype_in, reduc_index, masks);
+
   if (reduction_type == EXTRACT_LAST_REDUCTION)
     {
       gcc_assert (!slp_node);
Index: gcc/config/aarch64/aarch64.md
===================================================================
--- gcc/config/aarch64/aarch64.md	2017-11-21 17:06:24.670434749 +0000
+++ gcc/config/aarch64/aarch64.md	2017-11-21 17:06:25.013421451 +0000
@@ -164,6 +164,7 @@ (define_c_enum "unspec" [
     UNSPEC_STN
     UNSPEC_INSR
     UNSPEC_CLASTB
+    UNSPEC_FADDA
 ])
 
 (define_c_enum "unspecv" [
Index: gcc/config/aarch64/aarch64-sve.md
===================================================================
--- gcc/config/aarch64/aarch64-sve.md	2017-11-21 17:06:24.670434749 +0000
+++ gcc/config/aarch64/aarch64-sve.md	2017-11-21 17:06:25.012421490 +0000
@@ -1574,6 +1574,45 @@ (define_insn "*reduc_<optab>_scal_<mode>
   "<bit_reduc_op>\t%<Vetype>0, %1, %2.<Vetype>"
 )
 
+;; Unpredicated in-order FP reductions.
+(define_expand "fold_left_plus_<mode>"
+  [(set (match_operand:<VEL> 0 "register_operand")
+	(unspec:<VEL> [(match_dup 3)
+		       (match_operand:<VEL> 1 "register_operand")
+		       (match_operand:SVE_F 2 "register_operand")]
+		      UNSPEC_FADDA))]
+  "TARGET_SVE"
+  {
+    operands[3] = force_reg (<VPRED>mode, CONSTM1_RTX (<VPRED>mode));
+  }
+)
+
+;; In-order FP reductions predicated with PTRUE.
+(define_insn "*fold_left_plus_<mode>"
+  [(set (match_operand:<VEL> 0 "register_operand" "=w")
+	(unspec:<VEL> [(match_operand:<VPRED> 1 "register_operand" "Upl")
+		       (match_operand:<VEL> 2 "register_operand" "0")
+		       (match_operand:SVE_F 3 "register_operand" "w")]
+		      UNSPEC_FADDA))]
+  "TARGET_SVE"
+  "fadda\t%<Vetype>0, %1, %<Vetype>0, %3.<Vetype>"
+)
+
+;; Predicated form of the above in-order reduction.
+(define_insn "*pred_fold_left_plus_<mode>"
+  [(set (match_operand:<VEL> 0 "register_operand" "=w")
+	(unspec:<VEL>
+	  [(match_operand:<VEL> 1 "register_operand" "0")
+	   (unspec:SVE_F
+	     [(match_operand:<VPRED> 2 "register_operand" "Upl")
+	      (match_operand:SVE_F 3 "register_operand" "w")
+	      (match_operand:SVE_F 4 "aarch64_simd_imm_zero")]
+	     UNSPEC_SEL)]
+	  UNSPEC_FADDA))]
+  "TARGET_SVE"
+  "fadda\t%<Vetype>0, %2, %<Vetype>0, %3.<Vetype>"
+)
+
 ;; Unpredicated floating-point addition.
 (define_expand "add<mode>3"
   [(set (match_operand:SVE_F 0 "register_operand")
Index: gcc/testsuite/gcc.dg/vect/no-fast-math-vect16.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/no-fast-math-vect16.c	2017-11-21 17:06:24.670434749 +0000
+++ gcc/testsuite/gcc.dg/vect/no-fast-math-vect16.c	2017-11-21 17:06:25.015421374 +0000
@@ -33,5 +33,5 @@ int main (void)
   return main1 ();
 }
 
-/* Requires fast-math.  */
-/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { xfail *-*-* } } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
+/* { dg-final { scan-tree-dump-times {using an in-order \(fold-left\) reduction} 1 "vect" } } */
Index: gcc/testsuite/gcc.dg/vect/pr79920.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/pr79920.c	2017-11-21 17:06:24.670434749 +0000
+++ gcc/testsuite/gcc.dg/vect/pr79920.c	2017-11-21 17:06:25.015421374 +0000
@@ -1,5 +1,5 @@
 /* { dg-do run } */
-/* { dg-additional-options "-O3" } */
+/* { dg-additional-options "-O3 -fno-fast-math" } */
 
 #include "tree-vect.h"
 
@@ -41,4 +41,5 @@ int main()
   return 0;
 }
 
-/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target { vect_double && { vect_perm && vect_hw_misalign } } } } } */
+/* { dg-final { scan-tree-dump-times {using an in-order \(fold-left\) reduction} 1 "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorized 2 loops" 1 "vect" { target { vect_double && { vect_perm && vect_hw_misalign } } } } } */
Index: gcc/testsuite/gcc.dg/vect/trapv-vect-reduc-4.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/trapv-vect-reduc-4.c	2017-11-21 17:06:24.670434749 +0000
+++ gcc/testsuite/gcc.dg/vect/trapv-vect-reduc-4.c	2017-11-21 17:06:25.015421374 +0000
@@ -46,5 +46,6 @@ int main (void)
   return 0;
 }
 
-/* { dg-final { scan-tree-dump-times "Detected reduction\\." 2 "vect"  } } */
-/* { dg-final { scan-tree-dump-times "vectorized 2 loops" 1 "vect" { target { ! vect_no_int_min_max } } } } */
+/* { dg-final { scan-tree-dump-times "Detected reduction\\." 3 "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorized 3 loops" 1 "vect" { target { ! vect_no_int_min_max } } } } */
+/* { dg-final { scan-tree-dump-times {using an in-order \(fold-left\) reduction} 1 "vect" } } */
Index: gcc/testsuite/gcc.dg/vect/vect-reduc-6.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/vect-reduc-6.c	2017-11-21 17:06:24.670434749 +0000
+++ gcc/testsuite/gcc.dg/vect/vect-reduc-6.c	2017-11-21 17:06:25.015421374 +0000
@@ -1,4 +1,5 @@
 /* { dg-require-effective-target vect_float } */
+/* { dg-additional-options "-fno-fast-math" } */
 
 #include <stdarg.h>
 #include "tree-vect.h"
@@ -48,6 +49,5 @@ int main (void)
   return 0;
 }
 
-/* need -ffast-math to vectorizer these loops.  */
-/* ARM NEON passes -ffast-math to these tests, so expect this to fail.  */
-/* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" { xfail arm_neon_ok } } } */
+/* { dg-final { scan-tree-dump-times {using an in-order \(fold-left\) reduction} 1 "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
Index: gcc/testsuite/gcc.dg/vect/vect-reduc-in-order-1.c
===================================================================
--- /dev/null	2017-11-20 18:51:34.589640877 +0000
+++ gcc/testsuite/gcc.dg/vect/vect-reduc-in-order-1.c	2017-11-21 17:06:25.015421374 +0000
@@ -0,0 +1,42 @@
+/* { dg-do run { xfail { { i?86-*-* x86_64-*-* } && ia32 } } } */
+/* { dg-require-effective-target vect_double } */
+/* { dg-add-options ieee } */
+/* { dg-additional-options "-fno-fast-math" } */
+
+#include "tree-vect.h"
+
+#define N (VECTOR_BITS * 17)
+
+double __attribute__ ((noinline, noclone))
+reduc_plus_double (double *a, double *b)
+{
+  double r = 0, q = 3;
+  for (int i = 0; i < N; i++)
+    {
+      r += a[i];
+      q -= b[i];
+    }
+  return r * q;
+}
+
+int __attribute__ ((optimize (1)))
+main ()
+{
+  double a[N];
+  double b[N];
+  double r = 0, q = 3;
+  for (int i = 0; i < N; i++)
+    {
+      a[i] = (i * 0.1) * (i & 1 ? 1 : -1);
+      b[i] = (i * 0.3) * (i & 1 ? 1 : -1);
+      r += a[i];
+      q -= b[i];
+      asm volatile ("" ::: "memory");
+    }
+  double res = reduc_plus_double (a, b);
+  if (res != r * q)
+    __builtin_abort ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times {using an in-order \(fold-left\) reduction} 2 "vect" } } */
Index: gcc/testsuite/gcc.dg/vect/vect-reduc-in-order-2.c
===================================================================
--- /dev/null	2017-11-20 18:51:34.589640877 +0000
+++ gcc/testsuite/gcc.dg/vect/vect-reduc-in-order-2.c	2017-11-21 17:06:25.015421374 +0000
@@ -0,0 +1,44 @@
+/* { dg-do run { xfail { { i?86-*-* x86_64-*-* } && ia32 } } } */
+/* { dg-require-effective-target vect_double } */
+/* { dg-add-options ieee } */
+/* { dg-additional-options "-fno-fast-math" } */
+
+#include "tree-vect.h"
+
+#define N (VECTOR_BITS * 17)
+
+double __attribute__ ((noinline, noclone))
+reduc_plus_double (double *restrict a, int n)
+{
+  double res = 0.0;
+  for (int i = 0; i < n; i++)
+    for (int j = 0; j < N; j++)
+      res += a[i];
+  return res;
+}
+
+int __attribute__ ((optimize (1)))
+main ()
+{
+  int n = 19;
+  double a[N];
+  double r = 0;
+  for (int i = 0; i < N; i++)
+    {
+      a[i] = (i * 0.1) * (i & 1 ? 1 : -1);
+      asm volatile ("" ::: "memory");
+    }
+  for (int i = 0; i < n; i++)
+    for (int j = 0; j < N; j++)
+      {
+	r += a[i];
+	asm volatile ("" ::: "memory");
+      }
+  double res = reduc_plus_double (a, n);
+  if (res != r)
+    __builtin_abort ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times {in-order double reduction not supported} 1 "vect" } } */
+/* { dg-final { scan-tree-dump-times {using an in-order \(fold-left\) reduction} 1 "vect" } } */
Index: gcc/testsuite/gcc.dg/vect/vect-reduc-in-order-3.c
===================================================================
--- /dev/null	2017-11-20 18:51:34.589640877 +0000
+++ gcc/testsuite/gcc.dg/vect/vect-reduc-in-order-3.c	2017-11-21 17:06:25.016421335 +0000
@@ -0,0 +1,42 @@
+/* { dg-do run { xfail { { i?86-*-* x86_64-*-* } && ia32 } } } */
+/* { dg-require-effective-target vect_double } */
+/* { dg-add-options ieee } */
+/* { dg-additional-options "-fno-fast-math" } */
+
+#include "tree-vect.h"
+
+#define N (VECTOR_BITS * 17)
+
+double __attribute__ ((noinline, noclone))
+reduc_plus_double (double *a)
+{
+  double r = 0;
+  for (int i = 0; i < N; i += 4)
+    {
+      r += a[i] * 2.0;
+      r += a[i + 1] * 3.0;
+      r += a[i + 2] * 4.0;
+      r += a[i + 3] * 5.0;
+    }
+  return r;
+}
+
+int __attribute__ ((optimize (1)))
+main ()
+{
+  double a[N];
+  double r = 0;
+  for (int i = 0; i < N; i++)
+    {
+      a[i] = (i * 0.1) * (i & 1 ? 1 : -1);
+      r += a[i] * (i % 4 + 2);
+      asm volatile ("" ::: "memory");
+    }
+  double res = reduc_plus_double (a);
+  if (res != r)
+    __builtin_abort ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times {using an in-order \(fold-left\) reduction} 1 "vect" } } */
+/* { dg-final { scan-tree-dump-times {vectorizing stmts using SLP} 1 "vect" } } */
Index: gcc/testsuite/gcc.dg/vect/vect-reduc-in-order-4.c
===================================================================
--- /dev/null	2017-11-20 18:51:34.589640877 +0000
+++ gcc/testsuite/gcc.dg/vect/vect-reduc-in-order-4.c	2017-11-21 17:06:25.016421335 +0000
@@ -0,0 +1,45 @@
+/* { dg-do run { xfail { { i?86-*-* x86_64-*-* } && ia32 } } } */
+/* { dg-require-effective-target vect_double } */
+/* { dg-add-options ieee } */
+/* { dg-additional-options "-fno-fast-math" } */
+
+#include "tree-vect.h"
+
+#define N (VECTOR_BITS * 17)
+
+double __attribute__ ((noinline, noclone))
+reduc_plus_double (double *a)
+{
+  double r1 = 0;
+  double r2 = 0;
+  double r3 = 0;
+  double r4 = 0;
+  for (int i = 0; i < N; i += 4)
+    {
+      r1 += a[i];
+      r2 += a[i + 1];
+      r3 += a[i + 2];
+      r4 += a[i + 3];
+    }
+  return r1 * r2 * r3 * r4;
+}
+
+int __attribute__ ((optimize (1)))
+main ()
+{
+  double a[N];
+  double r[4] = {};
+  for (int i = 0; i < N; i++)
+    {
+      a[i] = (i * 0.1) * (i & 1 ? 1 : -1);
+      r[i % 4] += a[i];
+      asm volatile ("" ::: "memory");
+    }
+  double res = reduc_plus_double (a);
+  if (res != r[0] * r[1] * r[2] * r[3])
+    __builtin_abort ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times {in-order unchained SLP reductions not supported} 1 "vect" } } */
+/* { dg-final { scan-tree-dump-not {vectorizing stmts using SLP} "vect" } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_1.c
===================================================================
--- /dev/null	2017-11-20 18:51:34.589640877 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_1.c	2017-11-21 17:06:25.016421335 +0000
@@ -0,0 +1,28 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve" } */
+
+#define NUM_ELEMS(TYPE) ((int)(5 * (256 / sizeof (TYPE)) + 3))
+
+#define DEF_REDUC_PLUS(TYPE)			\
+  TYPE __attribute__ ((noinline, noclone))	\
+  reduc_plus_##TYPE (TYPE *a, TYPE *b)		\
+  {						\
+    TYPE r = 0, q = 3;				\
+    for (int i = 0; i < NUM_ELEMS (TYPE); i++)	\
+      {						\
+	r += a[i];				\
+	q -= b[i];				\
+      }						\
+    return r * q;				\
+  }
+
+#define TEST_ALL(T) \
+  T (_Float16) \
+  T (float) \
+  T (double)
+
+TEST_ALL (DEF_REDUC_PLUS)
+
+/* { dg-final { scan-assembler-times {\tfadda\th[0-9]+, p[0-7], h[0-9]+, z[0-9]+\.h} 2 } } */
+/* { dg-final { scan-assembler-times {\tfadda\ts[0-9]+, p[0-7], s[0-9]+, z[0-9]+\.s} 2 } } */
+/* { dg-final { scan-assembler-times {\tfadda\td[0-9]+, p[0-7], d[0-9]+, z[0-9]+\.d} 2 } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_1_run.c
===================================================================
--- /dev/null	2017-11-20 18:51:34.589640877 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_1_run.c	2017-11-21 17:06:25.016421335 +0000
@@ -0,0 +1,29 @@
+/* { dg-do run { target { aarch64_sve_hw } } } */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve" } */
+
+#include "sve_reduc_strict_1.c"
+
+#define TEST_REDUC_PLUS(TYPE)			\
+  {						\
+    TYPE a[NUM_ELEMS (TYPE)];			\
+    TYPE b[NUM_ELEMS (TYPE)];			\
+    TYPE r = 0, q = 3;				\
+    for (int i = 0; i < NUM_ELEMS (TYPE); i++)	\
+      {						\
+	a[i] = (i * 0.1) * (i & 1 ? 1 : -1);	\
+	b[i] = (i * 0.3) * (i & 1 ? 1 : -1);	\
+	r += a[i];				\
+	q -= b[i];				\
+	asm volatile ("" ::: "memory");		\
+      }						\
+    TYPE res = reduc_plus_##TYPE (a, b);	\
+    if (res != r * q)				\
+      __builtin_abort ();			\
+  }
+
+int __attribute__ ((optimize (1)))
+main ()
+{
+  TEST_ALL (TEST_REDUC_PLUS);
+  return 0;
+}
Index: gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_2.c
===================================================================
--- /dev/null	2017-11-20 18:51:34.589640877 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_2.c	2017-11-21 17:06:25.016421335 +0000
@@ -0,0 +1,28 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve" } */
+
+#define NUM_ELEMS(TYPE) ((int) (5 * (256 / sizeof (TYPE)) + 3))
+
+#define DEF_REDUC_PLUS(TYPE)					\
+void __attribute__ ((noinline, noclone))			\
+reduc_plus_##TYPE (TYPE (*restrict a)[NUM_ELEMS (TYPE)],	\
+		   TYPE *restrict r, int n)			\
+{								\
+  for (int i = 0; i < n; i++)					\
+    {								\
+      r[i] = 0;							\
+      for (int j = 0; j < NUM_ELEMS (TYPE); j++)		\
+        r[i] += a[i][j];					\
+    }								\
+}
+
+#define TEST_ALL(T) \
+  T (_Float16) \
+  T (float) \
+  T (double)
+
+TEST_ALL (DEF_REDUC_PLUS)
+
+/* { dg-final { scan-assembler-times {\tfadda\th[0-9]+, p[0-7], h[0-9]+, z[0-9]+\.h} 1 } } */
+/* { dg-final { scan-assembler-times {\tfadda\ts[0-9]+, p[0-7], s[0-9]+, z[0-9]+\.s} 1 } } */
+/* { dg-final { scan-assembler-times {\tfadda\td[0-9]+, p[0-7], d[0-9]+, z[0-9]+\.d} 1 } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_2_run.c
===================================================================
--- /dev/null	2017-11-20 18:51:34.589640877 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_2_run.c	2017-11-21 17:06:25.016421335 +0000
@@ -0,0 +1,31 @@
+/* { dg-do run { target { aarch64_sve_hw } } } */
+/* { dg-options "-O2 -ftree-vectorize -fno-inline -march=armv8-a+sve" } */
+
+#include "sve_reduc_strict_2.c"
+
+#define NROWS 5
+
+#define TEST_REDUC_PLUS(TYPE)					\
+  {								\
+    TYPE a[NROWS][NUM_ELEMS (TYPE)];				\
+    TYPE r[NROWS];						\
+    TYPE expected[NROWS] = {};					\
+    for (int i = 0; i < NROWS; ++i)				\
+      for (int j = 0; j < NUM_ELEMS (TYPE); ++j)		\
+	{							\
+	  a[i][j] = (i * 0.1 + j * 0.6) * (j & 1 ? 1 : -1);	\
+	  expected[i] += a[i][j];				\
+	  asm volatile ("" ::: "memory");			\
+	}							\
+    reduc_plus_##TYPE (a, r, NROWS);				\
+    for (int i = 0; i < NROWS; ++i)				\
+      if (r[i] != expected[i])					\
+	__builtin_abort ();					\
+  }
+
+int __attribute__ ((optimize (1)))
+main ()
+{
+  TEST_ALL (TEST_REDUC_PLUS);
+  return 0;
+}
Index: gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_3.c
===================================================================
--- /dev/null	2017-11-20 18:51:34.589640877 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_3.c	2017-11-21 17:06:25.016421335 +0000
@@ -0,0 +1,131 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -fno-inline -march=armv8-a+sve -msve-vector-bits=256 -fdump-tree-vect-details" } */
+
+double mat[100][4];
+double mat2[100][8];
+double mat3[100][12];
+double mat4[100][3];
+
+double
+slp_reduc_plus (int n)
+{
+  double tmp = 0.0;
+  for (int i = 0; i < n; i++)
+    {
+      tmp = tmp + mat[i][0];
+      tmp = tmp + mat[i][1];
+      tmp = tmp + mat[i][2];
+      tmp = tmp + mat[i][3];
+    }
+  return tmp;
+}
+
+double
+slp_reduc_plus2 (int n)
+{
+  double tmp = 0.0;
+  for (int i = 0; i < n; i++)
+    {
+      tmp = tmp + mat2[i][0];
+      tmp = tmp + mat2[i][1];
+      tmp = tmp + mat2[i][2];
+      tmp = tmp + mat2[i][3];
+      tmp = tmp + mat2[i][4];
+      tmp = tmp + mat2[i][5];
+      tmp = tmp + mat2[i][6];
+      tmp = tmp + mat2[i][7];
+    }
+  return tmp;
+}
+
+double
+slp_reduc_plus3 (int n)
+{
+  double tmp = 0.0;
+  for (int i = 0; i < n; i++)
+    {
+      tmp = tmp + mat3[i][0];
+      tmp = tmp + mat3[i][1];
+      tmp = tmp + mat3[i][2];
+      tmp = tmp + mat3[i][3];
+      tmp = tmp + mat3[i][4];
+      tmp = tmp + mat3[i][5];
+      tmp = tmp + mat3[i][6];
+      tmp = tmp + mat3[i][7];
+      tmp = tmp + mat3[i][8];
+      tmp = tmp + mat3[i][9];
+      tmp = tmp + mat3[i][10];
+      tmp = tmp + mat3[i][11];
+    }
+  return tmp;
+}
+
+void
+slp_non_chained_reduc (int n, double * restrict out)
+{
+  for (int i = 0; i < 3; i++)
+    out[i] = 0;
+
+  for (int i = 0; i < n; i++)
+    {
+      out[0] = out[0] + mat4[i][0];
+      out[1] = out[1] + mat4[i][1];
+      out[2] = out[2] + mat4[i][2];
+    }
+}
+
+/* Strict FP reductions shouldn't be used for the outer loops, only the
+   inner loops.  */
+
+float
+double_reduc1 (float (*restrict i)[16])
+{
+  float l = 0;
+
+  for (int a = 0; a < 8; a++)
+    for (int b = 0; b < 8; b++)
+      l += i[b][a];
+  return l;
+}
+
+float
+double_reduc2 (float *restrict i)
+{
+  float l = 0;
+
+  for (int a = 0; a < 8; a++)
+    for (int b = 0; b < 16; b++)
+      {
+        l += i[b * 4];
+        l += i[b * 4 + 1];
+        l += i[b * 4 + 2];
+        l += i[b * 4 + 3];
+      }
+  return l;
+}
+
+float
+double_reduc3 (float *restrict i, float *restrict j)
+{
+  float k = 0, l = 0;
+
+  for (int a = 0; a < 8; a++)
+    for (int b = 0; b < 8; b++)
+      {
+        k += i[b];
+        l += j[b];
+      }
+  return l * k;
+}
+
+/* We can't yet handle double_reduc1.  */
+/* { dg-final { scan-assembler-times {\tfadda\ts[0-9]+, p[0-7], s[0-9]+, z[0-9]+\.s} 3 } } */
+/* { dg-final { scan-assembler-times {\tfadda\td[0-9]+, p[0-7], d[0-9]+, z[0-9]+\.d} 9 } } */
+/* 1 reduction each for double_reduc{1,2} and 2 for double_reduc3.  Each one
+   is reported three times, once for SVE, once for 128-bit AdvSIMD and once
+   for 64-bit AdvSIMD.  */
+/* { dg-final { scan-tree-dump-times "Detected double reduction" 12 "vect" } } */
+/* double_reduc2 has 2 reductions and slp_non_chained_reduc has 3.
+   double_reduc1 is reported 3 times (SVE, 128-bit AdvSIMD, 64-bit AdvSIMD)
+   before failing.  */
+/* { dg-final { scan-tree-dump-times "Detected reduction" 12 "vect" } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_slp_13.c
===================================================================
--- gcc/testsuite/gcc.target/aarch64/sve_slp_13.c	2017-11-21 17:06:24.670434749 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_slp_13.c	2017-11-21 17:06:25.016421335 +0000
@@ -1,5 +1,6 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve -msve-vector-bits=scalable" } */
+/* The cost model thinks that the double loop isn't a win for SVE-128.  */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve -msve-vector-bits=scalable -fno-vect-cost-model" } */
 
 #include <stdint.h>
 
@@ -24,7 +25,10 @@ #define TEST_ALL(T)				\
   T (int32_t)					\
   T (uint32_t)					\
   T (int64_t)					\
-  T (uint64_t)
+  T (uint64_t)					\
+  T (_Float16)					\
+  T (float)					\
+  T (double)
 
 TEST_ALL (VEC_PERM)
 
@@ -32,21 +36,25 @@ TEST_ALL (VEC_PERM)
 /* ??? We don't treat the uint loops as SLP.  */
 /* The loop should be fully-masked.  */
 /* { dg-final { scan-assembler-times {\tld1b\t} 2 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\tld1h\t} 2 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\tld1w\t} 2 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\tld1w\t} 1 } } */
-/* { dg-final { scan-assembler-times {\tld1d\t} 2 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\tld1d\t} 1 } } */
+/* { dg-final { scan-assembler-times {\tld1h\t} 3 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tld1w\t} 3 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tld1w\t} 2 } } */
+/* { dg-final { scan-assembler-times {\tld1d\t} 3 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tld1d\t} 2 } } */
 /* { dg-final { scan-assembler-not {\tldr} { xfail *-*-* } } } */
 
 /* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.b} 4 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.h} 4 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s} 4 } } */
-/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d} 4 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.h} 6 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s} 6 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d} 6 } } */
 
 /* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.b\n} 2 { xfail *-*-* } } } */
 /* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.h\n} 2 { xfail *-*-* } } } */
 /* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.s\n} 2 } } */
 /* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.d\n} 2 } } */
+/* { dg-final { scan-assembler-times {\tfadda\th[0-9]+, p[0-7], h[0-9]+, z[0-9]+\.h\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tfadda\ts[0-9]+, p[0-7], s[0-9]+, z[0-9]+\.s\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tfadda\td[0-9]+, p[0-7], d[0-9]+, z[0-9]+\.d\n} 1 } } */
+/* { dg-final { scan-assembler-not {\tfadd\n} } } */
 
 /* { dg-final { scan-assembler-not {\tuqdec} } } */
Index: gcc/testsuite/gfortran.dg/vect/vect-8.f90
===================================================================
--- gcc/testsuite/gfortran.dg/vect/vect-8.f90	2017-11-21 17:06:24.670434749 +0000
+++ gcc/testsuite/gfortran.dg/vect/vect-8.f90	2017-11-21 17:06:25.016421335 +0000
@@ -704,5 +704,5 @@ CALL track('KERNEL  ')
 RETURN
 END SUBROUTINE kernel
 
-! { dg-final { scan-tree-dump-times "vectorized 21 loops" 1 "vect" { target { vect_intdouble_cvt } } } }
+! { dg-final { scan-tree-dump-times "vectorized 22 loops" 1 "vect" { target vect_intdouble_cvt } } }
 ! { dg-final { scan-tree-dump-times "vectorized 17 loops" 1 "vect" { target { ! vect_intdouble_cvt } } } }
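
For readers following the patch, here is a minimal sketch (not part of the
patch) of what an in-order, fully-masked reduction computes, written in plain
C rather than GIMPLE or SVE.  It models the "MASK ? VEC : IDENTITY" select
done by merge_with_identity and the lane-by-lane accumulation that FADDA or
the open-coded path performs, and contrasts them with a reassociated sum.
The fixed VF of 4 and the names scalar_sum, select_lanes, fold_left_plus,
fold_left_sum and reassoc_sum are hypothetical and exist only for
illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical vectorization factor for this sketch; a real SVE vector
   length is only known at run time.  */
#define VF 4

/* Scalar reference: the strict left-to-right sum of the source loop.  */
double
scalar_sum (const double *a, size_t n)
{
  double acc = 0.0;
  for (size_t i = 0; i < n; i++)
    acc += a[i];
  return acc;
}

/* Model of "MASK ? VEC : IDENTITY": inactive lanes are replaced by the
   additive identity before the reduction.  */
static void
select_lanes (double *out, const bool *mask, const double *vec,
	      double identity)
{
  for (size_t k = 0; k < VF; k++)
    out[k] = mask[k] ? vec[k] : identity;
}

/* Model of one in-order reduction step: fold the lanes of VEC into the
   scalar accumulator ACC, starting from lane 0.  */
static double
fold_left_plus (double acc, const double *vec)
{
  for (size_t k = 0; k < VF; k++)
    acc += vec[k];
  return acc;
}

/* In-order reduction over N elements in fully-masked chunks of VF.  */
double
fold_left_sum (const double *a, size_t n)
{
  assert (a != NULL || n == 0);
  double acc = 0.0;
  for (size_t i = 0; i < n; i += VF)
    {
      bool mask[VF];
      double vec[VF], merged[VF];
      for (size_t k = 0; k < VF; k++)
	{
	  mask[k] = i + k < n;
	  /* Inactive lanes hold an arbitrary value in this model.  */
	  vec[k] = mask[k] ? a[i + k] : 123.0;
	}
      select_lanes (merged, mask, vec, 0.0);
      acc = fold_left_plus (acc, merged);
    }
  return acc;
}

/* Reassociated reduction: VF partial sums that are combined at the end.
   This is the form that strict IEEE semantics forbid.  */
double
reassoc_sum (const double *a, size_t n)
{
  double part[VF] = { 0.0 };
  for (size_t i = 0; i < n; i++)
    part[i % VF] += a[i];
  double acc = 0.0;
  for (size_t k = 0; k < VF; k++)
    acc += part[k];
  return acc;
}
```

Assuming IEEE double arithmetic without excess precision, the input
{ 1e16, 1.0, -1e16, 1.0, 1.0 } gives 2.0 for both scalar_sum and
fold_left_sum, whereas reassoc_sum gives 1.0, because the partial sums absorb
the small terms differently.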

Richard Biener Jan. 10, 2018, 1:12 p.m. UTC | #10

On Tue, Jan 9, 2018 at 4:36 PM, Richard Sandiford
<richard.sandiford@linaro.org> wrote:
> Ping


Ok.

Richard.

> Richard Sandiford <richard.sandiford@linaro.org> writes:
>> Richard Biener <richard.guenther@gmail.com> writes:
>>> On Mon, Nov 20, 2017 at 1:54 PM, Richard Sandiford
>>> <richard.sandiford@linaro.org> wrote:
>>>> Richard Biener <richard.guenther@gmail.com> writes:
>>>>> On Fri, Nov 17, 2017 at 5:53 PM, Richard Sandiford
>>>>> <richard.sandiford@linaro.org> wrote:
>>>>>> This patch adds support for in-order floating-point addition reductions,
>>>>>> which are suitable even in strict IEEE mode.
>>>>>>
>>>>>> Previously vect_is_simple_reduction would reject any cases that forbid
>>>>>> reassociation.  The idea is instead to tentatively accept them as
>>>>>> "FOLD_LEFT_REDUCTIONs" and only fail later if there is no target
>>>>>> support for them.  Although this patch only handles the particular
>>>>>> case of plus and minus on floating-point types, there's no reason in
>>>>>> principle why targets couldn't handle other cases.
>>>>>>
>>>>>> The vect_force_simple_reduction change makes it simpler for parloops
>>>>>> to read the type of reduction.
>>>>>>
>>>>>> Tested on aarch64-linux-gnu (with and without SVE), x86_64-linux-gnu
>>>>>> and powerpc64le-linux-gnu.  OK to install?
>>>>>
>>>>> I don't like that you add a new tree code for this.  A new IFN looks more
>>>>> suitable to me.
>>>>
>>>> OK.
>>>
>>> Thanks.  I'd like to eventually get rid of other vectorizer tree codes as well,
>>> like the REDUC_*_EXPR, DOT_PROD_EXPR and SAD_EXPR.  IFNs
>>> are now really the way to go for "target instructions on GIMPLE".
>>>
>>>>> Also I think if there's a way to handle this correctly with target support
>>>>> you can also implement a fallback if there is no such support increasing
>>>>> test coverage.  It would basically boil down to extracting all scalars from
>>>>> the non-reduction operand vector and performing a series of reduction
>>>>> ops, keeping the reduction PHI scalar.  This would also support any
>>>>> reduction operator.
>>>>
>>>> Yeah, but without target support, that's probably going to be expensive.
>>>> It's a bit like how we can implement element-by-element loads and stores
>>>> for cases that don't have target support, but had to explicitly disable
>>>> that in many cases, since the cost model was too optimistic.
>>>
>>> I expect that for V2DF or even V4DF it might be profitable in quite a number
>>> of cases.  V2DF definitely.
>>>
>>>> I can give it a go anyway if you think it's worth it.
>>>
>>> I think it is.
>>
>> OK, done in the patch below.  Tested as before.
>>
>> Thanks,
>> Richard
>
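
The fallback discussed above (extract every element of the vector operand and apply the scalar operation in order, keeping the reduction value scalar) can be pictured roughly as follows. This is an illustrative C model and not code from the patch: vec4d, load_vec, fold_left_plus_model and reduce_in_order are hypothetical names, and VF = 4 is chosen only to mimic a V4DF vector.

```c
#include <assert.h>
#include <stddef.h>

enum { VF = 4 };                            /* hypothetical vector length */

typedef struct { double lane[VF]; } vec4d;  /* stand-in for a V4DF vector */

/* Model of a vector load of VF consecutive doubles.  */
static vec4d load_vec(const double *p)
{
    vec4d v;
    for (int k = 0; k < VF; ++k)
        v.lane[k] = p[k];
    return v;
}

/* Open-coded fold-left: extract each lane in turn and add it to the
   scalar accumulator, i.e. (((res + v0) + v1) + v2) + v3.  */
static double fold_left_plus_model(double res, vec4d v)
{
    for (int k = 0; k < VF; ++k)
        res += v.lane[k];
    return res;
}

/* The reduction value stays scalar across iterations; only the
   non-reduction operand is loaded as a vector.  */
double reduce_in_order(double init, const double *a, size_t n)
{
    assert(n % VF == 0);   /* keep the sketch simple: no scalar tail */
    double res = init;
    for (size_t i = 0; i < n; i += VF)
        res = fold_left_plus_model(res, load_vec(a + i));
    return res;
}
```

Because every addition happens in the original order, the result matches the scalar loop bit for bit. The cost concern raised in the thread comes from the VF extracts plus VF scalar additions needed per vector.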

> 2017-11-21  Richard Sandiford  <richard.sandiford@linaro.org>
>             Alan Hayward  <alan.hayward@arm.com>
>             David Sherwood  <david.sherwood@arm.com>
>
> gcc/
>         * optabs.def (fold_left_plus_optab): New optab.
>         * doc/md.texi (fold_left_plus_@var{m}): Document.
>         * internal-fn.def (IFN_FOLD_LEFT_PLUS): New internal function.
>         * internal-fn.c (fold_left_direct): Define.
>         (expand_fold_left_optab_fn): Likewise.
>         (direct_fold_left_optab_supported_p): Likewise.
>         * fold-const-call.c (fold_const_fold_left): New function.
>         (fold_const_call): Use it to fold CFN_FOLD_LEFT_PLUS.
>         * tree-parloops.c (valid_reduction_p): New function.
>         (gather_scalar_reductions): Use it.
>         * tree-vectorizer.h (FOLD_LEFT_REDUCTION): New vect_reduction_type.
>         (vect_finish_replace_stmt): Declare.
>         * tree-vect-loop.c (fold_left_reduction_fn): New function.
>         (needs_fold_left_reduction_p): New function, split out from...
>         (vect_is_simple_reduction): ...here.  Accept reductions that
>         forbid reassociation, but give them type FOLD_LEFT_REDUCTION.
>         (vect_force_simple_reduction): Also store the reduction type in
>         the assignment's STMT_VINFO_REDUC_TYPE.
>         (vect_model_reduction_cost): Handle FOLD_LEFT_REDUCTION.
>         (merge_with_identity): New function.
>         (vectorize_fold_left_reduction): Likewise.
>         (vectorizable_reduction): Handle FOLD_LEFT_REDUCTION.  Leave the
>         scalar phi in place for it.  Check for target support and reject
>         cases that would reassociate the operation.  Defer the transform
>         phase to vectorize_fold_left_reduction.
>         * config/aarch64/aarch64.md (UNSPEC_FADDA): New unspec.
>         * config/aarch64/aarch64-sve.md (fold_left_plus_<mode>): New expander.
>         (*fold_left_plus_<mode>, *pred_fold_left_plus_<mode>): New insns.
>
> gcc/testsuite/
>         * gcc.dg/vect/no-fast-math-vect16.c: Expect the test to pass and
>         check for a message about using in-order reductions.
>         * gcc.dg/vect/pr79920.c: Expect both loops to be vectorized and
>         check for a message about using in-order reductions.
>         * gcc.dg/vect/trapv-vect-reduc-4.c: Expect all three loops to be
>         vectorized and check for a message about using in-order reductions.
>         * gcc.dg/vect/vect-reduc-6.c: Expect the loop to be vectorized and
>         check for a message about using in-order reductions.
>         * gcc.dg/vect/vect-reduc-in-order-1.c: New test.
>         * gcc.dg/vect/vect-reduc-in-order-2.c: Likewise.
>         * gcc.dg/vect/vect-reduc-in-order-3.c: Likewise.
>         * gcc.dg/vect/vect-reduc-in-order-4.c: Likewise.
>         * gcc.target/aarch64/sve_reduc_strict_1.c: New test.
>         * gcc.target/aarch64/sve_reduc_strict_1_run.c: Likewise.
>         * gcc.target/aarch64/sve_reduc_strict_2.c: Likewise.
>         * gcc.target/aarch64/sve_reduc_strict_2_run.c: Likewise.
>         * gcc.target/aarch64/sve_reduc_strict_3.c: Likewise.
>         * gcc.target/aarch64/sve_slp_13.c: Add floating-point types.
>         * gfortran.dg/vect/vect-8.f90: Expect 22 loops to be vectorized if
>         vect_intdouble_cvt.
>
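
As a rough illustration of the merge_with_identity entry: in a fully-masked loop the patch below selects "MASK ? VEC : IDENTITY", using a zero vector as the identity, before performing the in-order fold, so that inactive lanes contribute zero. A scalar C model of that idea might look like the following. The names and the fixed VF of 4 are assumptions made for the sketch and do not appear in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { VF = 4 };   /* hypothetical number of lanes */

/* Model of a masked in-order reduction: inactive lanes are first
   replaced by the additive identity (the select step), then all lanes
   are folded into the scalar accumulator from left to right.  */
double masked_fold_left_plus_model(double res, const double *v,
                                   const bool *mask)
{
    assert(v != NULL && mask != NULL);
    for (int k = 0; k < VF; ++k) {
        double elt = mask[k] ? v[k] : 0.0;   /* MASK ? VEC : IDENTITY */
        res += elt;                          /* strictly in order */
    }
    return res;
}
```

For example, folding { 1, 2, 3, 4 } into an accumulator of 10 with only the first two lanes active yields 13, the same value as a scalar loop that stops after two elements.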

> Index: gcc/optabs.def
> ===================================================================
> --- gcc/optabs.def      2017-11-21 17:06:24.670434749 +0000
> +++ gcc/optabs.def      2017-11-21 17:06:25.015421374 +0000
> @@ -306,6 +306,7 @@ OPTAB_D (reduc_umin_scal_optab, "reduc_u
>  OPTAB_D (reduc_and_scal_optab,  "reduc_and_scal_$a")
>  OPTAB_D (reduc_ior_scal_optab,  "reduc_ior_scal_$a")
>  OPTAB_D (reduc_xor_scal_optab,  "reduc_xor_scal_$a")
> +OPTAB_D (fold_left_plus_optab, "fold_left_plus_$a")
>
>  OPTAB_D (extract_last_optab, "extract_last_$a")
>  OPTAB_D (fold_extract_last_optab, "fold_extract_last_$a")

> Index: gcc/doc/md.texi
> ===================================================================
> --- gcc/doc/md.texi     2017-11-21 17:06:24.670434749 +0000
> +++ gcc/doc/md.texi     2017-11-21 17:06:25.014421412 +0000
> @@ -5285,6 +5285,14 @@ has mode @var{m} and operands 0 and 1 ha
>  one element of @var{m}.  Operand 2 has the usual mask mode for vectors
>  of mode @var{m}; see @code{TARGET_VECTORIZE_GET_MASK_MODE}.
>
> +@cindex @code{fold_left_plus_@var{m}} instruction pattern
> +@item @code{fold_left_plus_@var{m}}
> +Take scalar operand 1 and successively add each element from vector
> +operand 2.  Store the result in scalar operand 0.  The vector has
> +mode @var{m} and the scalars have the mode appropriate for one
> +element of @var{m}.  The operation is strictly in-order: there is
> +no reassociation.
> +
>  @cindex @code{sdot_prod@var{m}} instruction pattern
>  @item @samp{sdot_prod@var{m}}
>  @cindex @code{udot_prod@var{m}} instruction pattern

> Index: gcc/internal-fn.def
> ===================================================================
> --- gcc/internal-fn.def 2017-11-21 17:06:24.670434749 +0000
> +++ gcc/internal-fn.def 2017-11-21 17:06:25.015421374 +0000
> @@ -59,6 +59,8 @@ along with GCC; see the file COPYING3.
>
>     - cond_binary: a conditional binary optab, such as add<mode>cc
>
> +   - fold_left: for scalar = FN (scalar, vector), keyed off the vector mode
> +
>     DEF_INTERNAL_SIGNED_OPTAB_FN defines an internal function that
>     maps to one of two optabs, depending on the signedness of an input.
>     SIGNED_OPTAB and UNSIGNED_OPTAB are the optabs for signed and
> @@ -177,6 +179,8 @@ DEF_INTERNAL_OPTAB_FN (EXTRACT_LAST, ECF
>  DEF_INTERNAL_OPTAB_FN (FOLD_EXTRACT_LAST, ECF_CONST | ECF_NOTHROW,
>                        fold_extract_last, fold_extract)
>
> +DEF_INTERNAL_OPTAB_FN (FOLD_LEFT_PLUS, ECF_CONST | ECF_NOTHROW,
> +                      fold_left_plus, fold_left)
>
>  /* Unary math functions.  */
>  DEF_INTERNAL_FLT_FN (ACOS, ECF_CONST, acos, unary)

> Index: gcc/internal-fn.c
> ===================================================================
> --- gcc/internal-fn.c   2017-11-21 17:06:24.670434749 +0000
> +++ gcc/internal-fn.c   2017-11-21 17:06:25.015421374 +0000
> @@ -92,6 +92,7 @@ #define cond_unary_direct { 1, 1, true }
>  #define cond_binary_direct { 1, 1, true }
>  #define while_direct { 0, 2, false }
>  #define fold_extract_direct { 2, 2, false }
> +#define fold_left_direct { 1, 1, false }
>
>  const direct_internal_fn_info direct_internal_fn_array[IFN_LAST + 1] = {
>  #define DEF_INTERNAL_FN(CODE, FLAGS, FNSPEC) not_direct,
> @@ -2839,6 +2840,9 @@ #define expand_cond_binary_optab_fn(FN,
>  #define expand_fold_extract_optab_fn(FN, STMT, OPTAB) \
>    expand_direct_optab_fn (FN, STMT, OPTAB, 3)
>
> +#define expand_fold_left_optab_fn(FN, STMT, OPTAB) \
> +  expand_direct_optab_fn (FN, STMT, OPTAB, 2)
> +
>  /* RETURN_TYPE and ARGS are a return type and argument list that are
>     in principle compatible with FN (which satisfies direct_internal_fn_p).
>     Return the types that should be used to determine whether the
> @@ -2922,6 +2926,7 @@ #define direct_store_lanes_optab_support
>  #define direct_mask_store_lanes_optab_supported_p multi_vector_optab_supported_p
>  #define direct_while_optab_supported_p convert_optab_supported_p
>  #define direct_fold_extract_optab_supported_p direct_optab_supported_p
> +#define direct_fold_left_optab_supported_p direct_optab_supported_p
>
>  /* Return the optab used by internal function FN.  */
>

> Index: gcc/fold-const-call.c
> ===================================================================
> --- gcc/fold-const-call.c       2017-11-21 17:06:24.670434749 +0000
> +++ gcc/fold-const-call.c       2017-11-21 17:06:25.014421412 +0000
> @@ -1190,6 +1190,25 @@ fold_const_call (combined_fn fn, tree ty
>      }
>  }
>
> +/* Fold a call to IFN_FOLD_LEFT_<CODE> (ARG0, ARG1), returning a value
> +   of type TYPE.  */
> +
> +static tree
> +fold_const_fold_left (tree type, tree arg0, tree arg1, tree_code code)
> +{
> +  if (TREE_CODE (arg1) != VECTOR_CST)
> +    return NULL_TREE;
> +
> +  unsigned int nelts = VECTOR_CST_NELTS (arg1);
> +  for (unsigned int i = 0; i < nelts; i++)
> +    {
> +      arg0 = const_binop (code, type, arg0, VECTOR_CST_ELT (arg1, i));
> +      if (arg0 == NULL_TREE || !CONSTANT_CLASS_P (arg0))
> +       return NULL_TREE;
> +    }
> +  return arg0;
> +}
> +
>  /* Try to evaluate:
>
>        *RESULT = FN (*ARG0, *ARG1)
> @@ -1495,6 +1514,9 @@ fold_const_call (combined_fn fn, tree ty
>         }
>        return NULL_TREE;
>
> +    case CFN_FOLD_LEFT_PLUS:
> +      return fold_const_fold_left (type, arg0, arg1, PLUS_EXPR);
> +
>      default:
>        return fold_const_call_1 (fn, type, arg0, arg1);
>      }

> Index: gcc/tree-parloops.c
> ===================================================================
> --- gcc/tree-parloops.c 2017-11-21 17:06:24.670434749 +0000
> +++ gcc/tree-parloops.c 2017-11-21 17:06:25.017421296 +0000
> @@ -2531,6 +2531,19 @@ set_reduc_phi_uids (reduction_info **slo
>    return 1;
>  }
>
> +/* Return true if the type of reduction performed by STMT is suitable
> +   for this pass.  */
> +
> +static bool
> +valid_reduction_p (gimple *stmt)
> +{
> +  /* Parallelization would reassociate the operation, which isn't
> +     allowed for in-order reductions.  */
> +  stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
> +  vect_reduction_type reduc_type = STMT_VINFO_REDUC_TYPE (stmt_info);
> +  return reduc_type != FOLD_LEFT_REDUCTION;
> +}
> +
>  /* Detect all reductions in the LOOP, insert them into REDUCTION_LIST.  */
>
>  static void
> @@ -2564,7 +2577,7 @@ gather_scalar_reductions (loop_p loop, r
>        gimple *reduc_stmt
>         = vect_force_simple_reduction (simple_loop_info, phi,
>                                        &double_reduc, true);
> -      if (!reduc_stmt)
> +      if (!reduc_stmt || !valid_reduction_p (reduc_stmt))
>         continue;
>
>        if (double_reduc)
> @@ -2610,7 +2623,8 @@ gather_scalar_reductions (loop_p loop, r
>                 = vect_force_simple_reduction (simple_loop_info, inner_phi,
>                                                &double_reduc, true);
>               gcc_assert (!double_reduc);
> -             if (inner_reduc_stmt == NULL)
> +             if (inner_reduc_stmt == NULL
> +                 || !valid_reduction_p (inner_reduc_stmt))
>                 continue;
>
>               build_new_reduction (reduction_list, double_reduc_stmts[i], phi);

> Index: gcc/tree-vectorizer.h
> ===================================================================
> --- gcc/tree-vectorizer.h       2017-11-21 17:06:24.670434749 +0000
> +++ gcc/tree-vectorizer.h       2017-11-21 17:06:25.018421257 +0000
> @@ -74,7 +74,15 @@ enum vect_reduction_type {
>
>         for (int i = 0; i < VF; ++i)
>           res = cond[i] ? val[i] : res;  */
> -  EXTRACT_LAST_REDUCTION
> +  EXTRACT_LAST_REDUCTION,
> +
> +  /* Use a folding reduction within the loop to implement:
> +
> +       for (int i = 0; i < VF; ++i)
> +         res = res OP val[i];
> +
> +     (with no reassocation).  */
> +  FOLD_LEFT_REDUCTION
>  };
>
>  #define VECTORIZABLE_CYCLE_DEF(D) (((D) == vect_reduction_def)           \
> @@ -1389,6 +1397,7 @@ extern void vect_model_load_cost (stmt_v
>  extern unsigned record_stmt_cost (stmt_vector_for_cost *, int,
>                                   enum vect_cost_for_stmt, stmt_vec_info,
>                                   int, enum vect_cost_model_location);
> +extern void vect_finish_replace_stmt (gimple *, gimple *);
>  extern void vect_finish_stmt_generation (gimple *, gimple *,
>                                           gimple_stmt_iterator *);
>  extern bool vect_mark_stmts_to_be_vectorized (loop_vec_info);

> Index: gcc/tree-vect-loop.c
> ===================================================================
> --- gcc/tree-vect-loop.c        2017-11-21 17:06:24.670434749 +0000
> +++ gcc/tree-vect-loop.c        2017-11-21 17:06:25.018421257 +0000
> @@ -2573,6 +2573,22 @@ vect_analyze_loop (struct loop *loop, lo
>      }
>  }
>
> +/* Return true if there is an in-order reduction function for CODE, storing
> +   it in *REDUC_FN if so.  */
> +
> +static bool
> +fold_left_reduction_fn (tree_code code, internal_fn *reduc_fn)
> +{
> +  switch (code)
> +    {
> +    case PLUS_EXPR:
> +      *reduc_fn = IFN_FOLD_LEFT_PLUS;
> +      return true;
> +
> +    default:
> +      return false;
> +    }
> +}
>
>  /* Function reduction_fn_for_scalar_code
>
> @@ -2879,6 +2895,42 @@ vect_is_slp_reduction (loop_vec_info loo
>    return true;
>  }
>
> +/* Return true if we need an in-order reduction for operation CODE
> +   on type TYPE.  NEED_WRAPPING_INTEGRAL_OVERFLOW is true if integer
> +   overflow must wrap.  */
> +
> +static bool
> +needs_fold_left_reduction_p (tree type, tree_code code,
> +                            bool need_wrapping_integral_overflow)
> +{
> +  /* CHECKME: check for !flag_finite_math_only too?  */
> +  if (SCALAR_FLOAT_TYPE_P (type))
> +    switch (code)
> +      {
> +      case MIN_EXPR:
> +      case MAX_EXPR:
> +       return false;
> +
> +      default:
> +       return !flag_associative_math;
> +      }
> +
> +  if (INTEGRAL_TYPE_P (type))
> +    {
> +      if (!operation_no_trapping_overflow (type, code))
> +       return true;
> +      if (need_wrapping_integral_overflow
> +         && !TYPE_OVERFLOW_WRAPS (type)
> +         && operation_can_overflow (code))
> +       return true;
> +      return false;
> +    }
> +
> +  if (SAT_FIXED_POINT_TYPE_P (type))
> +    return true;
> +
> +  return false;
> +}
>
>  /* Function vect_is_simple_reduction
>
> @@ -3197,58 +3249,18 @@ vect_is_simple_reduction (loop_vec_info
>        return NULL;
>      }
>
> -  /* Check that it's ok to change the order of the computation.
> +  /* Check whether it's ok to change the order of the computation.
>       Generally, when vectorizing a reduction we change the order of the
>       computation.  This may change the behavior of the program in some
>       cases, so we need to check that this is ok.  One exception is when
>       vectorizing an outer-loop: the inner-loop is executed sequentially,
>       and therefore vectorizing reductions in the inner-loop during
>       outer-loop vectorization is safe.  */
> -
> -  if (*v_reduc_type != COND_REDUCTION
> -      && check_reduction)
> -    {
> -      /* CHECKME: check for !flag_finite_math_only too?  */
> -      if (SCALAR_FLOAT_TYPE_P (type) && !flag_associative_math)
> -       {
> -         /* Changing the order of operations changes the semantics.  */
> -         if (dump_enabled_p ())
> -           report_vect_op (MSG_MISSED_OPTIMIZATION, def_stmt,
> -                       "reduction: unsafe fp math optimization: ");
> -         return NULL;
> -       }
> -      else if (INTEGRAL_TYPE_P (type))
> -       {
> -         if (!operation_no_trapping_overflow (type, code))
> -           {
> -             /* Changing the order of operations changes the semantics.  */
> -             if (dump_enabled_p ())
> -               report_vect_op (MSG_MISSED_OPTIMIZATION, def_stmt,
> -                               "reduction: unsafe int math optimization"
> -                               " (overflow traps): ");
> -             return NULL;
> -           }
> -         if (need_wrapping_integral_overflow
> -             && !TYPE_OVERFLOW_WRAPS (type)
> -             && operation_can_overflow (code))
> -           {
> -             /* Changing the order of operations changes the semantics.  */
> -             if (dump_enabled_p ())
> -               report_vect_op (MSG_MISSED_OPTIMIZATION, def_stmt,
> -                               "reduction: unsafe int math optimization"
> -                               " (overflow doesn't wrap): ");
> -             return NULL;
> -           }
> -       }
> -      else if (SAT_FIXED_POINT_TYPE_P (type))
> -       {
> -         /* Changing the order of operations changes the semantics.  */
> -         if (dump_enabled_p ())
> -         report_vect_op (MSG_MISSED_OPTIMIZATION, def_stmt,
> -                         "reduction: unsafe fixed-point math optimization: ");
> -         return NULL;
> -       }
> -    }
> +  if (check_reduction
> +      && *v_reduc_type == TREE_CODE_REDUCTION
> +      && needs_fold_left_reduction_p (type, code,
> +                                     need_wrapping_integral_overflow))
> +    *v_reduc_type = FOLD_LEFT_REDUCTION;
>
>    /* Reduction is safe. We're dealing with one of the following:
>       1) integer arithmetic and no trapv
> @@ -3512,6 +3524,7 @@ vect_force_simple_reduction (loop_vec_in
>        STMT_VINFO_REDUC_TYPE (reduc_def_info) = v_reduc_type;
>        STMT_VINFO_REDUC_DEF (reduc_def_info) = def;
>        reduc_def_info = vinfo_for_stmt (def);
> +      STMT_VINFO_REDUC_TYPE (reduc_def_info) = v_reduc_type;
>        STMT_VINFO_REDUC_DEF (reduc_def_info) = phi;
>      }
>    return def;

> @@ -4064,14 +4077,27 @@ vect_model_reduction_cost (stmt_vec_info
>
>    code = gimple_assign_rhs_code (orig_stmt);
>
> -  if (reduction_type == EXTRACT_LAST_REDUCTION)
> +  if (reduction_type == EXTRACT_LAST_REDUCTION
> +      || reduction_type == FOLD_LEFT_REDUCTION)
>      {
>        /* No extra instructions needed in the prologue.  */
>        prologue_cost = 0;
>
> -      /* Count NCOPIES FOLD_EXTRACT_LAST operations.  */
> -      inside_cost = add_stmt_cost (target_cost_data, ncopies, vec_to_scalar,
> -                                  stmt_info, 0, vect_body);
> +      if (reduction_type == EXTRACT_LAST_REDUCTION || reduc_fn != IFN_LAST)
> +       /* Count one reduction-like operation per vector.  */
> +       inside_cost = add_stmt_cost (target_cost_data, ncopies, vec_to_scalar,
> +                                    stmt_info, 0, vect_body);
> +      else
> +       {
> +         /* Use NELEMENTS extracts and NELEMENTS scalar ops.  */
> +         unsigned int nelements = ncopies * vect_nunits_for_cost (vectype);
> +         inside_cost = add_stmt_cost (target_cost_data,  nelements,
> +                                      vec_to_scalar, stmt_info, 0,
> +                                      vect_body);
> +         inside_cost += add_stmt_cost (target_cost_data,  nelements,
> +                                       scalar_stmt, stmt_info, 0,
> +                                       vect_body);
> +       }
>      }
>    else
>      {
> @@ -4137,7 +4163,8 @@ vect_model_reduction_cost (stmt_vec_info
>                                           scalar_stmt, stmt_info, 0,
>                                           vect_epilogue);
>         }
> -      else if (reduction_type == EXTRACT_LAST_REDUCTION)
> +      else if (reduction_type == EXTRACT_LAST_REDUCTION
> +              || reduction_type == FOLD_LEFT_REDUCTION)
>         /* No extra instructions need in the epilogue.  */
>         ;
>        else

> @@ -5910,6 +5937,160 @@ vect_create_epilog_for_reduction (vec<tr
>      }
>  }
>
> +/* Return a vector of type VECTYPE that is equal to the vector select
> +   operation "MASK ? VEC : IDENTITY".  Insert the select statements
> +   before GSI.  */
> +
> +static tree
> +merge_with_identity (gimple_stmt_iterator *gsi, tree mask, tree vectype,
> +                    tree vec, tree identity)
> +{
> +  tree cond = make_temp_ssa_name (vectype, NULL, "cond");
> +  gimple *new_stmt = gimple_build_assign (cond, VEC_COND_EXPR,
> +                                         mask, vec, identity);
> +  gsi_insert_before (gsi, new_stmt, GSI_SAME_STMT);
> +  return cond;
> +}
> +
> +/* Perform an in-order reduction (FOLD_LEFT_REDUCTION).  STMT is the
> +   statement that sets the live-out value.  REDUC_DEF_STMT is the phi
> +   statement.  CODE is the operation performed by STMT and OPS are
> +   its scalar operands.  REDUC_INDEX is the index of the operand in
> +   OPS that is set by REDUC_DEF_STMT.  REDUC_FN is the function that
> +   implements in-order reduction, or IFN_LAST if we should open-code it.
> +   VECTYPE_IN is the type of the vector input.  MASKS specifies the masks
> +   that should be used to control the operation in a fully-masked loop.  */
> +
> +static bool
> +vectorize_fold_left_reduction (gimple *stmt, gimple_stmt_iterator *gsi,
> +                              gimple **vec_stmt, slp_tree slp_node,
> +                              gimple *reduc_def_stmt,
> +                              tree_code code, internal_fn reduc_fn,
> +                              tree ops[3], tree vectype_in,
> +                              int reduc_index, vec_loop_masks *masks)
> +{
> +  stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
> +  loop_vec_info loop_vinfo = STMT_VINFO_LOOP_VINFO (stmt_info);
> +  struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> +  tree vectype_out = STMT_VINFO_VECTYPE (stmt_info);
> +  gimple *new_stmt = NULL;
> +
> +  int ncopies;
> +  if (slp_node)
> +    ncopies = 1;
> +  else
> +    ncopies = vect_get_num_copies (loop_vinfo, vectype_in);
> +
> +  gcc_assert (!nested_in_vect_loop_p (loop, stmt));
> +  gcc_assert (ncopies == 1);
> +  gcc_assert (TREE_CODE_LENGTH (code) == binary_op);
> +  gcc_assert (reduc_index == (code == MINUS_EXPR ? 0 : 1));
> +  gcc_assert (STMT_VINFO_VEC_REDUCTION_TYPE (stmt_info)
> +             == FOLD_LEFT_REDUCTION);
> +
> +  if (slp_node)
> +    gcc_assert (must_eq (TYPE_VECTOR_SUBPARTS (vectype_out),
> +                        TYPE_VECTOR_SUBPARTS (vectype_in)));
> +
> +  tree op0 = ops[1 - reduc_index];
> +
> +  int group_size = 1;
> +  gimple *scalar_dest_def;
> +  auto_vec<tree> vec_oprnds0;
> +  if (slp_node)
> +    {
> +      vect_get_vec_defs (op0, NULL_TREE, stmt, &vec_oprnds0, NULL, slp_node);
> +      group_size = SLP_TREE_SCALAR_STMTS (slp_node).length ();
> +      scalar_dest_def = SLP_TREE_SCALAR_STMTS (slp_node)[group_size - 1];
> +    }
> +  else
> +    {
> +      tree loop_vec_def0 = vect_get_vec_def_for_operand (op0, stmt);
> +      vec_oprnds0.create (1);
> +      vec_oprnds0.quick_push (loop_vec_def0);
> +      scalar_dest_def = stmt;
> +    }
> +
> +  tree scalar_dest = gimple_assign_lhs (scalar_dest_def);
> +  tree scalar_type = TREE_TYPE (scalar_dest);
> +  tree reduc_var = gimple_phi_result (reduc_def_stmt);
> +
> +  int vec_num = vec_oprnds0.length ();
> +  gcc_assert (vec_num == 1 || slp_node);
> +  tree vec_elem_type = TREE_TYPE (vectype_out);
> +  gcc_checking_assert (useless_type_conversion_p (scalar_type, vec_elem_type));
> +
> +  tree vector_identity = NULL_TREE;
> +  if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
> +    vector_identity = build_zero_cst (vectype_out);
> +
> +  tree scalar_dest_var = vect_create_destination_var (scalar_dest, NULL);
> +  int i;
> +  tree def0;
> +  FOR_EACH_VEC_ELT (vec_oprnds0, i, def0)
> +    {
> +      tree mask = NULL_TREE;
> +      if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
> +       mask = vect_get_loop_mask (gsi, masks, vec_num, vectype_in, i);
> +
> +      /* Handle MINUS by adding the negative.  */
> +      if (reduc_fn != IFN_LAST && code == MINUS_EXPR)
> +       {
> +         tree negated = make_ssa_name (vectype_out);
> +         new_stmt = gimple_build_assign (negated, NEGATE_EXPR, def0);
> +         gsi_insert_before (gsi, new_stmt, GSI_SAME_STMT);
> +         def0 = negated;
> +       }
> +
> +      if (mask)
> +       def0 = merge_with_identity (gsi, mask, vectype_out, def0,
> +                                   vector_identity);
> +
> +      /* On the first iteration the input is simply the scalar phi
> +        result, and for subsequent iterations it is the output of
> +        the preceding operation.  */
> +      if (reduc_fn != IFN_LAST)
> +       {
> +         new_stmt = gimple_build_call_internal (reduc_fn, 2, reduc_var, def0);
> +         /* For chained SLP reductions the output of the previous reduction
> +            operation serves as the input of the next. For the final statement
> +            the output cannot be a temporary - we reuse the original
> +            scalar destination of the last statement.  */
> +         if (i != vec_num - 1)
> +           {
> +             gimple_set_lhs (new_stmt, scalar_dest_var);
> +             reduc_var = make_ssa_name (scalar_dest_var, new_stmt);
> +             gimple_set_lhs (new_stmt, reduc_var);
> +           }
> +       }
> +      else
> +       {
> +         reduc_var = vect_expand_fold_left (gsi, scalar_dest_var, code,
> +                                            reduc_var, def0);
> +         new_stmt = SSA_NAME_DEF_STMT (reduc_var);
> +         /* Remove the statement, so that we can use the same code paths
> +            as for statements that we've just created.  */
> +         gimple_stmt_iterator tmp_gsi = gsi_for_stmt (new_stmt);
> +         gsi_remove (&tmp_gsi, false);
> +       }
> +
> +      if (i == vec_num - 1)
> +       {
> +         gimple_set_lhs (new_stmt, scalar_dest);
> +         vect_finish_replace_stmt (scalar_dest_def, new_stmt);
> +       }
> +      else
> +       vect_finish_stmt_generation (scalar_dest_def, new_stmt, gsi);
> +
> +      if (slp_node)
> +       SLP_TREE_VEC_STMTS (slp_node).quick_push (new_stmt);
> +    }
> +
> +  if (!slp_node)
> +    STMT_VINFO_VEC_STMT (stmt_info) = *vec_stmt = new_stmt;
> +
> +  return true;
> +}
>
>  /* Function is_nonwrapping_integer_induction.
>

> @@ -6090,6 +6271,12 @@ vectorizable_reduction (gimple *stmt, gi
>           return true;
>         }
>
> +      if (STMT_VINFO_REDUC_TYPE (stmt_info) == FOLD_LEFT_REDUCTION)
> +       /* Leave the scalar phi in place.  Note that checking
> +          STMT_VINFO_VEC_REDUCTION_TYPE (as below) only works
> +          for reductions involving a single statement.  */
> +       return true;
> +
>        gimple *reduc_stmt = STMT_VINFO_REDUC_DEF (stmt_info);
>        if (STMT_VINFO_IN_PATTERN_P (vinfo_for_stmt (reduc_stmt)))
>         reduc_stmt = STMT_VINFO_RELATED_STMT (vinfo_for_stmt (reduc_stmt));
> @@ -6316,6 +6503,14 @@ vectorizable_reduction (gimple *stmt, gi
>       directy used in stmt.  */
>    if (reduc_index == -1)
>      {
> +      if (STMT_VINFO_REDUC_TYPE (stmt_info) == FOLD_LEFT_REDUCTION)
> +       {
> +         if (dump_enabled_p ())
> +           dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +                            "in-order reduction chain without SLP.\n");
> +         return false;
> +       }
> +
>        if (orig_stmt)
>         reduc_def_stmt = STMT_VINFO_REDUC_DEF (orig_stmt_info);
>        else
> @@ -6535,7 +6730,9 @@ vectorizable_reduction (gimple *stmt, gi
>
>    vect_reduction_type reduction_type
>      = STMT_VINFO_VEC_REDUCTION_TYPE (stmt_info);
> -  if (orig_stmt && reduction_type == TREE_CODE_REDUCTION)
> +  if (orig_stmt
> +      && (reduction_type == TREE_CODE_REDUCTION
> +         || reduction_type == FOLD_LEFT_REDUCTION))
>      {
>        /* This is a reduction pattern: get the vectype from the type of the
>           reduction variable, and get the tree-code from orig_stmt.  */
> @@ -6582,10 +6779,13 @@ vectorizable_reduction (gimple *stmt, gi
>    reduc_fn = IFN_LAST;
>
>    if (reduction_type == TREE_CODE_REDUCTION
> +      || reduction_type == FOLD_LEFT_REDUCTION
>        || reduction_type == INTEGER_INDUC_COND_REDUCTION
>        || reduction_type == CONST_COND_REDUCTION)
>      {
> -      if (reduction_fn_for_scalar_code (orig_code, &reduc_fn))
> +      if (reduction_type == FOLD_LEFT_REDUCTION
> +         ? fold_left_reduction_fn (orig_code, &reduc_fn)
> +         : reduction_fn_for_scalar_code (orig_code, &reduc_fn))
>         {
>           if (reduc_fn != IFN_LAST
>               && !direct_internal_fn_supported_p (reduc_fn, vectype_out,
> @@ -6704,6 +6904,41 @@ vectorizable_reduction (gimple *stmt, gi
>         }
>      }
>
> +  if (double_reduc && reduction_type == FOLD_LEFT_REDUCTION)
> +    {
> +      /* We can't support in-order reductions of code such as this:
> +
> +          for (int i = 0; i < n1; ++i)
> +            for (int j = 0; j < n2; ++j)
> +              l += a[j];
> +
> +        since GCC effectively transforms the loop when vectorizing:
> +
> +          for (int i = 0; i < n1 / VF; ++i)
> +            for (int j = 0; j < n2; ++j)
> +              for (int k = 0; k < VF; ++k)
> +                l += a[j];
> +
> +        which is a reassociation of the original operation.  */
> +      if (dump_enabled_p ())
> +       dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +                        "in-order double reduction not supported.\n");
> +
> +      return false;
> +    }
> +
> +  if (reduction_type == FOLD_LEFT_REDUCTION
> +      && slp_node
> +      && !GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt)))
> +    {
> +      /* We cannot use in-order reductions in this case because there is
> +         an implicit reassociation of the operations involved.  */
> +      if (dump_enabled_p ())
> +        dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +                        "in-order unchained SLP reductions not supported.\n");
> +      return false;
> +    }
> +
>    /* In case of widenning multiplication by a constant, we update the type
>       of the constant to be the type of the other operand.  We check that the
>       constant fits the type in the pattern recognition pass.  */
> @@ -6824,9 +7059,10 @@ vectorizable_reduction (gimple *stmt, gi
>         vect_model_reduction_cost (stmt_info, reduc_fn, ncopies);
>        if (loop_vinfo && LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo))
>         {
> -         if (cond_fn == IFN_LAST
> -             || !direct_internal_fn_supported_p (cond_fn, vectype_in,
> -                                                 OPTIMIZE_FOR_SPEED))
> +         if (reduction_type != FOLD_LEFT_REDUCTION
> +             && (cond_fn == IFN_LAST
> +                 || !direct_internal_fn_supported_p (cond_fn, vectype_in,
> +                                                     OPTIMIZE_FOR_SPEED)))
>             {
>               if (dump_enabled_p ())
>                 dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> @@ -6846,6 +7082,10 @@ vectorizable_reduction (gimple *stmt, gi
>             vect_record_loop_mask (loop_vinfo, masks, ncopies * vec_num,
>                                    vectype_in);
>         }
> +      if (dump_enabled_p ()
> +         && reduction_type == FOLD_LEFT_REDUCTION)
> +       dump_printf_loc (MSG_NOTE, vect_location,
> +                        "using an in-order (fold-left) reduction.\n");
>        STMT_VINFO_TYPE (stmt_info) = reduc_vec_info_type;
>        return true;
>      }
> @@ -6861,6 +7101,11 @@ vectorizable_reduction (gimple *stmt, gi
>
>    bool masked_loop_p = LOOP_VINFO_FULLY_MASKED_P (loop_vinfo);
>
> +  if (reduction_type == FOLD_LEFT_REDUCTION)
> +    return vectorize_fold_left_reduction
> +      (stmt, gsi, vec_stmt, slp_node, reduc_def_stmt, code,
> +       reduc_fn, ops, vectype_in, reduc_index, masks);
> +
>    if (reduction_type == EXTRACT_LAST_REDUCTION)
>      {
>        gcc_assert (!slp_node);

> Index: gcc/config/aarch64/aarch64.md

> ===================================================================

> --- gcc/config/aarch64/aarch64.md       2017-11-21 17:06:24.670434749 +0000

> +++ gcc/config/aarch64/aarch64.md       2017-11-21 17:06:25.013421451 +0000

> @@ -164,6 +164,7 @@ (define_c_enum "unspec" [

>      UNSPEC_STN

>      UNSPEC_INSR

>      UNSPEC_CLASTB

> +    UNSPEC_FADDA

>  ])

>

>  (define_c_enum "unspecv" [

> Index: gcc/config/aarch64/aarch64-sve.md

> ===================================================================

> --- gcc/config/aarch64/aarch64-sve.md   2017-11-21 17:06:24.670434749 +0000

> +++ gcc/config/aarch64/aarch64-sve.md   2017-11-21 17:06:25.012421490 +0000

> @@ -1574,6 +1574,45 @@ (define_insn "*reduc_<optab>_scal_<mode>

>    "<bit_reduc_op>\t%<Vetype>0, %1, %2.<Vetype>"

>  )

>

> +;; Unpredicated in-order FP reductions.

> +(define_expand "fold_left_plus_<mode>"

> +  [(set (match_operand:<VEL> 0 "register_operand")

> +       (unspec:<VEL> [(match_dup 3)

> +                      (match_operand:<VEL> 1 "register_operand")

> +                      (match_operand:SVE_F 2 "register_operand")]

> +                     UNSPEC_FADDA))]

> +  "TARGET_SVE"

> +  {

> +    operands[3] = force_reg (<VPRED>mode, CONSTM1_RTX (<VPRED>mode));

> +  }

> +)

> +

> +;; In-order FP reductions predicated with PTRUE.

> +(define_insn "*fold_left_plus_<mode>"

> +  [(set (match_operand:<VEL> 0 "register_operand" "=w")

> +       (unspec:<VEL> [(match_operand:<VPRED> 1 "register_operand" "Upl")

> +                      (match_operand:<VEL> 2 "register_operand" "0")

> +                      (match_operand:SVE_F 3 "register_operand" "w")]

> +                     UNSPEC_FADDA))]

> +  "TARGET_SVE"

> +  "fadda\t%<Vetype>0, %1, %<Vetype>0, %3.<Vetype>"

> +)

> +

> +;; Predicated form of the above in-order reduction.

> +(define_insn "*pred_fold_left_plus_<mode>"

> +  [(set (match_operand:<VEL> 0 "register_operand" "=w")

> +       (unspec:<VEL>

> +         [(match_operand:<VEL> 1 "register_operand" "0")

> +          (unspec:SVE_F

> +            [(match_operand:<VPRED> 2 "register_operand" "Upl")

> +             (match_operand:SVE_F 3 "register_operand" "w")

> +             (match_operand:SVE_F 4 "aarch64_simd_imm_zero")]

> +            UNSPEC_SEL)]

> +         UNSPEC_FADDA))]

> +  "TARGET_SVE"

> +  "fadda\t%<Vetype>0, %2, %<Vetype>0, %3.<Vetype>"

> +)

> +

>  ;; Unpredicated floating-point addition.

>  (define_expand "add<mode>3"

>    [(set (match_operand:SVE_F 0 "register_operand")

> Index: gcc/testsuite/gcc.dg/vect/no-fast-math-vect16.c

> ===================================================================

> --- gcc/testsuite/gcc.dg/vect/no-fast-math-vect16.c     2017-11-21 17:06:24.670434749 +0000

> +++ gcc/testsuite/gcc.dg/vect/no-fast-math-vect16.c     2017-11-21 17:06:25.015421374 +0000

> @@ -33,5 +33,5 @@ int main (void)

>    return main1 ();

>  }

>

> -/* Requires fast-math.  */

> -/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { xfail *-*-* } } } */

> +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */

> +/* { dg-final { scan-tree-dump-times {using an in-order \(fold-left\) reduction} 1 "vect" } } */

> Index: gcc/testsuite/gcc.dg/vect/pr79920.c

> ===================================================================

> --- gcc/testsuite/gcc.dg/vect/pr79920.c 2017-11-21 17:06:24.670434749 +0000

> +++ gcc/testsuite/gcc.dg/vect/pr79920.c 2017-11-21 17:06:25.015421374 +0000

> @@ -1,5 +1,5 @@

>  /* { dg-do run } */

> -/* { dg-additional-options "-O3" } */

> +/* { dg-additional-options "-O3 -fno-fast-math" } */

>

>  #include "tree-vect.h"

>

> @@ -41,4 +41,5 @@ int main()

>    return 0;

>  }

>

> -/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target { vect_double && { vect_perm && vect_hw_misalign } } } } } */

> +/* { dg-final { scan-tree-dump-times {using an in-order \(fold-left\) reduction} 1 "vect" } } */

> +/* { dg-final { scan-tree-dump-times "vectorized 2 loops" 1 "vect" { target { vect_double && { vect_perm && vect_hw_misalign } } } } } */

> Index: gcc/testsuite/gcc.dg/vect/trapv-vect-reduc-4.c

> ===================================================================

> --- gcc/testsuite/gcc.dg/vect/trapv-vect-reduc-4.c      2017-11-21 17:06:24.670434749 +0000

> +++ gcc/testsuite/gcc.dg/vect/trapv-vect-reduc-4.c      2017-11-21 17:06:25.015421374 +0000

> @@ -46,5 +46,6 @@ int main (void)

>    return 0;

>  }

>

> -/* { dg-final { scan-tree-dump-times "Detected reduction\\." 2 "vect"  } } */

> -/* { dg-final { scan-tree-dump-times "vectorized 2 loops" 1 "vect" { target { ! vect_no_int_min_max } } } } */

> +/* { dg-final { scan-tree-dump-times "Detected reduction\\." 3 "vect" } } */

> +/* { dg-final { scan-tree-dump-times "vectorized 3 loops" 1 "vect" { target { ! vect_no_int_min_max } } } } */

> +/* { dg-final { scan-tree-dump-times {using an in-order \(fold-left\) reduction} 1 "vect" } } */

> Index: gcc/testsuite/gcc.dg/vect/vect-reduc-6.c

> ===================================================================

> --- gcc/testsuite/gcc.dg/vect/vect-reduc-6.c    2017-11-21 17:06:24.670434749 +0000

> +++ gcc/testsuite/gcc.dg/vect/vect-reduc-6.c    2017-11-21 17:06:25.015421374 +0000

> @@ -1,4 +1,5 @@

>  /* { dg-require-effective-target vect_float } */

> +/* { dg-additional-options "-fno-fast-math" } */

>

>  #include <stdarg.h>

>  #include "tree-vect.h"

> @@ -48,6 +49,5 @@ int main (void)

>    return 0;

>  }

>

> -/* need -ffast-math to vectorizer these loops.  */

> -/* ARM NEON passes -ffast-math to these tests, so expect this to fail.  */

> -/* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" { xfail arm_neon_ok } } } */

> +/* { dg-final { scan-tree-dump-times {using an in-order \(fold-left\) reduction} 1 "vect" } } */

> +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */

> Index: gcc/testsuite/gcc.dg/vect/vect-reduc-in-order-1.c

> ===================================================================

> --- /dev/null   2017-11-20 18:51:34.589640877 +0000

> +++ gcc/testsuite/gcc.dg/vect/vect-reduc-in-order-1.c   2017-11-21 17:06:25.015421374 +0000

> @@ -0,0 +1,42 @@

> +/* { dg-do run { xfail { { i?86-*-* x86_64-*-* } && ia32 } } } */

> +/* { dg-require-effective-target vect_double } */

> +/* { dg-add-options ieee } */

> +/* { dg-additional-options "-fno-fast-math" } */

> +

> +#include "tree-vect.h"

> +

> +#define N (VECTOR_BITS * 17)

> +

> +double __attribute__ ((noinline, noclone))

> +reduc_plus_double (double *a, double *b)

> +{

> +  double r = 0, q = 3;

> +  for (int i = 0; i < N; i++)

> +    {

> +      r += a[i];

> +      q -= b[i];

> +    }

> +  return r * q;

> +}

> +

> +int __attribute__ ((optimize (1)))

> +main ()

> +{

> +  double a[N];

> +  double b[N];

> +  double r = 0, q = 3;

> +  for (int i = 0; i < N; i++)

> +    {

> +      a[i] = (i * 0.1) * (i & 1 ? 1 : -1);

> +      b[i] = (i * 0.3) * (i & 1 ? 1 : -1);

> +      r += a[i];

> +      q -= b[i];

> +      asm volatile ("" ::: "memory");

> +    }

> +  double res = reduc_plus_double (a, b);

> +  if (res != r * q)

> +    __builtin_abort ();

> +  return 0;

> +}

> +

> +/* { dg-final { scan-tree-dump-times {using an in-order \(fold-left\) reduction} 2 "vect" } } */

> Index: gcc/testsuite/gcc.dg/vect/vect-reduc-in-order-2.c

> ===================================================================

> --- /dev/null   2017-11-20 18:51:34.589640877 +0000

> +++ gcc/testsuite/gcc.dg/vect/vect-reduc-in-order-2.c   2017-11-21 17:06:25.015421374 +0000

> @@ -0,0 +1,44 @@

> +/* { dg-do run { xfail { { i?86-*-* x86_64-*-* } && ia32 } } } */

> +/* { dg-require-effective-target vect_double } */

> +/* { dg-add-options ieee } */

> +/* { dg-additional-options "-fno-fast-math" } */

> +

> +#include "tree-vect.h"

> +

> +#define N (VECTOR_BITS * 17)

> +

> +double __attribute__ ((noinline, noclone))

> +reduc_plus_double (double *restrict a, int n)

> +{

> +  double res = 0.0;

> +  for (int i = 0; i < n; i++)

> +    for (int j = 0; j < N; j++)

> +      res += a[i];

> +  return res;

> +}

> +

> +int __attribute__ ((optimize (1)))

> +main ()

> +{

> +  int n = 19;

> +  double a[N];

> +  double r = 0;

> +  for (int i = 0; i < N; i++)

> +    {

> +      a[i] = (i * 0.1) * (i & 1 ? 1 : -1);

> +      asm volatile ("" ::: "memory");

> +    }

> +  for (int i = 0; i < n; i++)

> +    for (int j = 0; j < N; j++)

> +      {

> +       r += a[i];

> +       asm volatile ("" ::: "memory");

> +      }

> +  double res = reduc_plus_double (a, n);

> +  if (res != r)

> +    __builtin_abort ();

> +  return 0;

> +}

> +

> +/* { dg-final { scan-tree-dump-times {in-order double reduction not supported} 1 "vect" } } */

> +/* { dg-final { scan-tree-dump-times {using an in-order \(fold-left\) reduction} 1 "vect" } } */

> Index: gcc/testsuite/gcc.dg/vect/vect-reduc-in-order-3.c

> ===================================================================

> --- /dev/null   2017-11-20 18:51:34.589640877 +0000

> +++ gcc/testsuite/gcc.dg/vect/vect-reduc-in-order-3.c   2017-11-21 17:06:25.016421335 +0000

> @@ -0,0 +1,42 @@

> +/* { dg-do run { xfail { { i?86-*-* x86_64-*-* } && ia32 } } } */

> +/* { dg-require-effective-target vect_double } */

> +/* { dg-add-options ieee } */

> +/* { dg-additional-options "-fno-fast-math" } */

> +

> +#include "tree-vect.h"

> +

> +#define N (VECTOR_BITS * 17)

> +

> +double __attribute__ ((noinline, noclone))

> +reduc_plus_double (double *a)

> +{

> +  double r = 0;

> +  for (int i = 0; i < N; i += 4)

> +    {

> +      r += a[i] * 2.0;

> +      r += a[i + 1] * 3.0;

> +      r += a[i + 2] * 4.0;

> +      r += a[i + 3] * 5.0;

> +    }

> +  return r;

> +}

> +

> +int __attribute__ ((optimize (1)))

> +main ()

> +{

> +  double a[N];

> +  double r = 0;

> +  for (int i = 0; i < N; i++)

> +    {

> +      a[i] = (i * 0.1) * (i & 1 ? 1 : -1);

> +      r += a[i] * (i % 4 + 2);

> +      asm volatile ("" ::: "memory");

> +    }

> +  double res = reduc_plus_double (a);

> +  if (res != r)

> +    __builtin_abort ();

> +  return 0;

> +}

> +

> +/* { dg-final { scan-tree-dump-times {using an in-order \(fold-left\) reduction} 1 "vect" } } */

> +/* { dg-final { scan-tree-dump-times {vectorizing stmts using SLP} 1 "vect" } } */

> Index: gcc/testsuite/gcc.dg/vect/vect-reduc-in-order-4.c

> ===================================================================

> --- /dev/null   2017-11-20 18:51:34.589640877 +0000

> +++ gcc/testsuite/gcc.dg/vect/vect-reduc-in-order-4.c   2017-11-21 17:06:25.016421335 +0000

> @@ -0,0 +1,45 @@

> +/* { dg-do run { xfail { { i?86-*-* x86_64-*-* } && ia32 } } } */

> +/* { dg-require-effective-target vect_double } */

> +/* { dg-add-options ieee } */

> +/* { dg-additional-options "-fno-fast-math" } */

> +

> +#include "tree-vect.h"

> +

> +#define N (VECTOR_BITS * 17)

> +

> +double __attribute__ ((noinline, noclone))

> +reduc_plus_double (double *a)

> +{

> +  double r1 = 0;

> +  double r2 = 0;

> +  double r3 = 0;

> +  double r4 = 0;

> +  for (int i = 0; i < N; i += 4)

> +    {

> +      r1 += a[i];

> +      r2 += a[i + 1];

> +      r3 += a[i + 2];

> +      r4 += a[i + 3];

> +    }

> +  return r1 * r2 * r3 * r4;

> +}

> +

> +int __attribute__ ((optimize (1)))

> +main ()

> +{

> +  double a[N];

> +  double r[4] = {};

> +  for (int i = 0; i < N; i++)

> +    {

> +      a[i] = (i * 0.1) * (i & 1 ? 1 : -1);

> +      r[i % 4] += a[i];

> +      asm volatile ("" ::: "memory");

> +    }

> +  double res = reduc_plus_double (a);

> +  if (res != r[0] * r[1] * r[2] * r[3])

> +    __builtin_abort ();

> +  return 0;

> +}

> +

> +/* { dg-final { scan-tree-dump-times {in-order unchained SLP reductions not supported} 1 "vect" } } */

> +/* { dg-final { scan-tree-dump-not {vectorizing stmts using SLP} "vect" } } */

> Index: gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_1.c

> ===================================================================

> --- /dev/null   2017-11-20 18:51:34.589640877 +0000

> +++ gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_1.c       2017-11-21 17:06:25.016421335 +0000

> @@ -0,0 +1,28 @@

> +/* { dg-do compile } */

> +/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve" } */

> +

> +#define NUM_ELEMS(TYPE) ((int)(5 * (256 / sizeof (TYPE)) + 3))

> +

> +#define DEF_REDUC_PLUS(TYPE)                   \

> +  TYPE __attribute__ ((noinline, noclone))     \

> +  reduc_plus_##TYPE (TYPE *a, TYPE *b)         \

> +  {                                            \

> +    TYPE r = 0, q = 3;                         \

> +    for (int i = 0; i < NUM_ELEMS (TYPE); i++) \

> +      {                                                \

> +       r += a[i];                              \

> +       q -= b[i];                              \

> +      }                                                \

> +    return r * q;                              \

> +  }

> +

> +#define TEST_ALL(T) \

> +  T (_Float16) \

> +  T (float) \

> +  T (double)

> +

> +TEST_ALL (DEF_REDUC_PLUS)

> +

> +/* { dg-final { scan-assembler-times {\tfadda\th[0-9]+, p[0-7], h[0-9]+, z[0-9]+\.h} 2 } } */

> +/* { dg-final { scan-assembler-times {\tfadda\ts[0-9]+, p[0-7], s[0-9]+, z[0-9]+\.s} 2 } } */

> +/* { dg-final { scan-assembler-times {\tfadda\td[0-9]+, p[0-7], d[0-9]+, z[0-9]+\.d} 2 } } */

> Index: gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_1_run.c

> ===================================================================

> --- /dev/null   2017-11-20 18:51:34.589640877 +0000

> +++ gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_1_run.c   2017-11-21 17:06:25.016421335 +0000

> @@ -0,0 +1,29 @@

> +/* { dg-do run { target { aarch64_sve_hw } } } */

> +/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve" } */

> +

> +#include "sve_reduc_strict_1.c"

> +

> +#define TEST_REDUC_PLUS(TYPE)                  \

> +  {                                            \

> +    TYPE a[NUM_ELEMS (TYPE)];                  \

> +    TYPE b[NUM_ELEMS (TYPE)];                  \

> +    TYPE r = 0, q = 3;                         \

> +    for (int i = 0; i < NUM_ELEMS (TYPE); i++) \

> +      {                                                \

> +       a[i] = (i * 0.1) * (i & 1 ? 1 : -1);    \

> +       b[i] = (i * 0.3) * (i & 1 ? 1 : -1);    \

> +       r += a[i];                              \

> +       q -= b[i];                              \

> +       asm volatile ("" ::: "memory");         \

> +      }                                                \

> +    TYPE res = reduc_plus_##TYPE (a, b);       \

> +    if (res != r * q)                          \

> +      __builtin_abort ();                      \

> +  }

> +

> +int __attribute__ ((optimize (1)))

> +main ()

> +{

> +  TEST_ALL (TEST_REDUC_PLUS);

> +  return 0;

> +}

> Index: gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_2.c

> ===================================================================

> --- /dev/null   2017-11-20 18:51:34.589640877 +0000

> +++ gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_2.c       2017-11-21 17:06:25.016421335 +0000

> @@ -0,0 +1,28 @@

> +/* { dg-do compile } */

> +/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve" } */

> +

> +#define NUM_ELEMS(TYPE) ((int) (5 * (256 / sizeof (TYPE)) + 3))

> +

> +#define DEF_REDUC_PLUS(TYPE)                                   \

> +void __attribute__ ((noinline, noclone))                       \

> +reduc_plus_##TYPE (TYPE (*restrict a)[NUM_ELEMS (TYPE)],       \

> +                  TYPE *restrict r, int n)                     \

> +{                                                              \

> +  for (int i = 0; i < n; i++)                                  \

> +    {                                                          \

> +      r[i] = 0;                                                        \

> +      for (int j = 0; j < NUM_ELEMS (TYPE); j++)               \

> +        r[i] += a[i][j];                                       \

> +    }                                                          \

> +}

> +

> +#define TEST_ALL(T) \

> +  T (_Float16) \

> +  T (float) \

> +  T (double)

> +

> +TEST_ALL (DEF_REDUC_PLUS)

> +

> +/* { dg-final { scan-assembler-times {\tfadda\th[0-9]+, p[0-7], h[0-9]+, z[0-9]+\.h} 1 } } */

> +/* { dg-final { scan-assembler-times {\tfadda\ts[0-9]+, p[0-7], s[0-9]+, z[0-9]+\.s} 1 } } */

> +/* { dg-final { scan-assembler-times {\tfadda\td[0-9]+, p[0-7], d[0-9]+, z[0-9]+\.d} 1 } } */

> Index: gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_2_run.c

> ===================================================================

> --- /dev/null   2017-11-20 18:51:34.589640877 +0000

> +++ gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_2_run.c   2017-11-21 17:06:25.016421335 +0000

> @@ -0,0 +1,31 @@

> +/* { dg-do run { target { aarch64_sve_hw } } } */

> +/* { dg-options "-O2 -ftree-vectorize -fno-inline -march=armv8-a+sve" } */

> +

> +#include "sve_reduc_strict_2.c"

> +

> +#define NROWS 5

> +

> +#define TEST_REDUC_PLUS(TYPE)                                  \

> +  {                                                            \

> +    TYPE a[NROWS][NUM_ELEMS (TYPE)];                           \

> +    TYPE r[NROWS];                                             \

> +    TYPE expected[NROWS] = {};                                 \

> +    for (int i = 0; i < NROWS; ++i)                            \

> +      for (int j = 0; j < NUM_ELEMS (TYPE); ++j)               \

> +       {                                                       \

> +         a[i][j] = (i * 0.1 + j * 0.6) * (j & 1 ? 1 : -1);     \

> +         expected[i] += a[i][j];                               \

> +         asm volatile ("" ::: "memory");                       \

> +       }                                                       \

> +    reduc_plus_##TYPE (a, r, NROWS);                           \

> +    for (int i = 0; i < NROWS; ++i)                            \

> +      if (r[i] != expected[i])                                 \

> +       __builtin_abort ();                                     \

> +  }

> +

> +int __attribute__ ((optimize (1)))

> +main ()

> +{

> +  TEST_ALL (TEST_REDUC_PLUS);

> +  return 0;

> +}

> Index: gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_3.c

> ===================================================================

> --- /dev/null   2017-11-20 18:51:34.589640877 +0000

> +++ gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_3.c       2017-11-21 17:06:25.016421335 +0000

> @@ -0,0 +1,131 @@

> +/* { dg-do compile } */

> +/* { dg-options "-O2 -ftree-vectorize -fno-inline -march=armv8-a+sve -msve-vector-bits=256 -fdump-tree-vect-details" } */

> +

> +double mat[100][4];

> +double mat2[100][8];

> +double mat3[100][12];

> +double mat4[100][3];

> +

> +double

> +slp_reduc_plus (int n)

> +{

> +  double tmp = 0.0;

> +  for (int i = 0; i < n; i++)

> +    {

> +      tmp = tmp + mat[i][0];

> +      tmp = tmp + mat[i][1];

> +      tmp = tmp + mat[i][2];

> +      tmp = tmp + mat[i][3];

> +    }

> +  return tmp;

> +}

> +

> +double

> +slp_reduc_plus2 (int n)

> +{

> +  double tmp = 0.0;

> +  for (int i = 0; i < n; i++)

> +    {

> +      tmp = tmp + mat2[i][0];

> +      tmp = tmp + mat2[i][1];

> +      tmp = tmp + mat2[i][2];

> +      tmp = tmp + mat2[i][3];

> +      tmp = tmp + mat2[i][4];

> +      tmp = tmp + mat2[i][5];

> +      tmp = tmp + mat2[i][6];

> +      tmp = tmp + mat2[i][7];

> +    }

> +  return tmp;

> +}

> +

> +double

> +slp_reduc_plus3 (int n)

> +{

> +  double tmp = 0.0;

> +  for (int i = 0; i < n; i++)

> +    {

> +      tmp = tmp + mat3[i][0];

> +      tmp = tmp + mat3[i][1];

> +      tmp = tmp + mat3[i][2];

> +      tmp = tmp + mat3[i][3];

> +      tmp = tmp + mat3[i][4];

> +      tmp = tmp + mat3[i][5];

> +      tmp = tmp + mat3[i][6];

> +      tmp = tmp + mat3[i][7];

> +      tmp = tmp + mat3[i][8];

> +      tmp = tmp + mat3[i][9];

> +      tmp = tmp + mat3[i][10];

> +      tmp = tmp + mat3[i][11];

> +    }

> +  return tmp;

> +}

> +

> +void

> +slp_non_chained_reduc (int n, double * restrict out)

> +{

> +  for (int i = 0; i < 3; i++)

> +    out[i] = 0;

> +

> +  for (int i = 0; i < n; i++)

> +    {

> +      out[0] = out[0] + mat4[i][0];

> +      out[1] = out[1] + mat4[i][1];

> +      out[2] = out[2] + mat4[i][2];

> +    }

> +}

> +

> +/* Strict FP reductions shouldn't be used for the outer loops, only the

> +   inner loops.  */

> +

> +float

> +double_reduc1 (float (*restrict i)[16])

> +{

> +  float l = 0;

> +

> +  for (int a = 0; a < 8; a++)

> +    for (int b = 0; b < 8; b++)

> +      l += i[b][a];

> +  return l;

> +}

> +

> +float

> +double_reduc2 (float *restrict i)

> +{

> +  float l = 0;

> +

> +  for (int a = 0; a < 8; a++)

> +    for (int b = 0; b < 16; b++)

> +      {

> +        l += i[b * 4];

> +        l += i[b * 4 + 1];

> +        l += i[b * 4 + 2];

> +        l += i[b * 4 + 3];

> +      }

> +  return l;

> +}

> +

> +float

> +double_reduc3 (float *restrict i, float *restrict j)

> +{

> +  float k = 0, l = 0;

> +

> +  for (int a = 0; a < 8; a++)

> +    for (int b = 0; b < 8; b++)

> +      {

> +        k += i[b];

> +        l += j[b];

> +      }

> +  return l * k;

> +}

> +

> +/* We can't yet handle double_reduc1.  */

> +/* { dg-final { scan-assembler-times {\tfadda\ts[0-9]+, p[0-7], s[0-9]+, z[0-9]+\.s} 3 } } */

> +/* { dg-final { scan-assembler-times {\tfadda\td[0-9]+, p[0-7], d[0-9]+, z[0-9]+\.d} 9 } } */

> +/* 1 reduction each for double_reduc{1,2} and 2 for double_reduc3.  Each one

> +   is reported three times, once for SVE, once for 128-bit AdvSIMD and once

> +   for 64-bit AdvSIMD.  */

> +/* { dg-final { scan-tree-dump-times "Detected double reduction" 12 "vect" } } */

> +/* double_reduc2 has 2 reductions and slp_non_chained_reduc has 3.

> +   double_reduc1 is reported 3 times (SVE, 128-bit AdvSIMD, 64-bit AdvSIMD)

> +   before failing.  */

> +/* { dg-final { scan-tree-dump-times "Detected reduction" 12 "vect" } } */

> Index: gcc/testsuite/gcc.target/aarch64/sve_slp_13.c

> ===================================================================

> --- gcc/testsuite/gcc.target/aarch64/sve_slp_13.c       2017-11-21 17:06:24.670434749 +0000

> +++ gcc/testsuite/gcc.target/aarch64/sve_slp_13.c       2017-11-21 17:06:25.016421335 +0000

> @@ -1,5 +1,6 @@

>  /* { dg-do compile } */

> -/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve -msve-vector-bits=scalable" } */

> +/* The cost model thinks that the double loop isn't a win for SVE-128.  */

> +/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve -msve-vector-bits=scalable -fno-vect-cost-model" } */

>

>  #include <stdint.h>

>

> @@ -24,7 +25,10 @@ #define TEST_ALL(T)                          \

>    T (int32_t)                                  \

>    T (uint32_t)                                 \

>    T (int64_t)                                  \

> -  T (uint64_t)

> +  T (uint64_t)                                 \

> +  T (_Float16)                                 \

> +  T (float)                                    \

> +  T (double)

>

>  TEST_ALL (VEC_PERM)

>

> @@ -32,21 +36,25 @@ TEST_ALL (VEC_PERM)

>  /* ??? We don't treat the uint loops as SLP.  */

>  /* The loop should be fully-masked.  */

>  /* { dg-final { scan-assembler-times {\tld1b\t} 2 { xfail *-*-* } } } */

> -/* { dg-final { scan-assembler-times {\tld1h\t} 2 { xfail *-*-* } } } */

> -/* { dg-final { scan-assembler-times {\tld1w\t} 2 { xfail *-*-* } } } */

> -/* { dg-final { scan-assembler-times {\tld1w\t} 1 } } */

> -/* { dg-final { scan-assembler-times {\tld1d\t} 2 { xfail *-*-* } } } */

> -/* { dg-final { scan-assembler-times {\tld1d\t} 1 } } */

> +/* { dg-final { scan-assembler-times {\tld1h\t} 3 { xfail *-*-* } } } */

> +/* { dg-final { scan-assembler-times {\tld1w\t} 3 { xfail *-*-* } } } */

> +/* { dg-final { scan-assembler-times {\tld1w\t} 2 } } */

> +/* { dg-final { scan-assembler-times {\tld1d\t} 3 { xfail *-*-* } } } */

> +/* { dg-final { scan-assembler-times {\tld1d\t} 2 } } */

>  /* { dg-final { scan-assembler-not {\tldr} { xfail *-*-* } } } */

>

>  /* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.b} 4 { xfail *-*-* } } } */

> -/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.h} 4 { xfail *-*-* } } } */

> -/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s} 4 } } */

> -/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d} 4 } } */

> +/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.h} 6 { xfail *-*-* } } } */

> +/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s} 6 } } */

> +/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d} 6 } } */

>

>  /* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.b\n} 2 { xfail *-*-* } } } */

>  /* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.h\n} 2 { xfail *-*-* } } } */

>  /* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.s\n} 2 } } */

>  /* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.d\n} 2 } } */

> +/* { dg-final { scan-assembler-times {\tfadda\th[0-9]+, p[0-7], h[0-9]+, z[0-9]+\.h\n} 1 } } */

> +/* { dg-final { scan-assembler-times {\tfadda\ts[0-9]+, p[0-7], s[0-9]+, z[0-9]+\.s\n} 1 } } */

> +/* { dg-final { scan-assembler-times {\tfadda\td[0-9]+, p[0-7], d[0-9]+, z[0-9]+\.d\n} 1 } } */

> +/* { dg-final { scan-assembler-not {\tfadd\n} } } */

>

>  /* { dg-final { scan-assembler-not {\tuqdec} } } */

> Index: gcc/testsuite/gfortran.dg/vect/vect-8.f90

> ===================================================================

> --- gcc/testsuite/gfortran.dg/vect/vect-8.f90   2017-11-21 17:06:24.670434749 +0000

> +++ gcc/testsuite/gfortran.dg/vect/vect-8.f90   2017-11-21 17:06:25.016421335 +0000

> @@ -704,5 +704,5 @@ CALL track('KERNEL  ')

>  RETURN

>  END SUBROUTINE kernel

>

> -! { dg-final { scan-tree-dump-times "vectorized 21 loops" 1 "vect" { target { vect_intdouble_cvt } } } }

> +! { dg-final { scan-tree-dump-times "vectorized 22 loops" 1 "vect" { target vect_intdouble_cvt } } }

>  ! { dg-final { scan-tree-dump-times "vectorized 17 loops" 1 "vect" { target { ! vect_intdouble_cvt } } } }
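As background to the double-reduction comment in vectorizable_reduction, here is a minimal standalone C sketch of why reassociating a floating-point sum can change the result, which is why an in-order (fold-left) reduction is needed when -ffast-math is off. The function names sum_in_order and sum_two_lanes and the input values are illustrative assumptions, not part of the patch. The second function only models what a two-lane vectorized reduction would compute.

```c
#include <stddef.h>

/* Strict left-to-right sum: the order a scalar loop (and an in-order
   reduction) must preserve.  */
double
sum_in_order (const double *a, size_t n)
{
  double r = 0.0;
  for (size_t i = 0; i < n; i++)
    r += a[i];
  return r;
}

/* Hypothetical model of a two-lane vector reduction: even and odd
   elements accumulate separately and are combined at the end.  This
   reassociates the additions.  N is assumed to be even.  */
double
sum_two_lanes (const double *a, size_t n)
{
  double lane[2] = { 0.0, 0.0 };
  for (size_t i = 0; i < n; i += 2)
    {
      lane[0] += a[i];
      lane[1] += a[i + 1];
    }
  return lane[0] + lane[1];
}
```

With the input { 1e30, 1.0, -1e30, 1.0 }, sum_in_order returns 1.0 because the first 1.0 is absorbed when added to 1e30, whereas sum_two_lanes returns 2.0 because both 1.0 values land in the same lane. A strict-IEEE compilation has to produce the first answer.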
