From patchwork Fri Nov 17 16:53:00 2017
X-Patchwork-Submitter: Richard Sandiford
X-Patchwork-Id: 119188
From: Richard Sandiford <richard.sandiford@linaro.org>
To: gcc-patches@gcc.gnu.org
Subject: Add support for in-order addition reduction using SVE FADDA
Date: Fri, 17 Nov 2017 16:53:00 +0000
Message-ID: <87y3n4x2xf.fsf@linaro.org>

This patch adds support for in-order floating-point addition reductions,
which are suitable even in strict IEEE mode.  Previously
vect_is_simple_reduction would reject any case that forbids
reassociation.  The idea is instead to tentatively accept such cases as
"FOLD_LEFT_REDUCTIONs" and only fail later if there is no target support
for them.  Although this patch only handles the particular case of plus
and minus on floating-point types, there's no reason in principle why
targets couldn't handle other cases.

The vect_force_simple_reduction change makes it simpler for parloops
to read the type of a reduction.

Tested on aarch64-linux-gnu (with and without SVE), x86_64-linux-gnu
and powerpc64le-linux-gnu.  OK to install?

Richard
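To make the effect concrete, here is the kind of loop this enables and a
scalar model of the new operation's semantics.  This is an illustrative
sketch only (the function names are invented for the example), not code
taken from the patch:

#include <stddef.h>

/* Without -fassociative-math, this reduction must be evaluated
   strictly in order: res = ((res + a[0]) + a[1]) + ...  Previously
   that meant the loop could not be vectorized at all; with this patch
   it can be vectorized on targets that provide fold_left_plus
   (SVE implements it with FADDA).  */
double
strict_sum (const double *a, size_t n)
{
  double res = 0.0;
  for (size_t i = 0; i < n; ++i)
    res += a[i];
  return res;
}

/* Scalar model of what one fold_left_plus operation computes for a
   single vector of VF elements: the accumulator is threaded through
   every element, so no reassociation takes place.  */
double
fold_left_plus_model (double acc, const double *vec, size_t vf)
{
  for (size_t i = 0; i < vf; ++i)
    acc = acc + vec[i];
  return acc;
}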
2017-11-17  Richard Sandiford
	    Alan Hayward
	    David Sherwood

gcc/
	* tree.def (FOLD_LEFT_PLUS_EXPR): New tree code.
	* doc/generic.texi (FOLD_LEFT_PLUS_EXPR): Document.
	* optabs.def (fold_left_plus_optab): New optab.
	* doc/md.texi (fold_left_plus_@var{m}): Document.
	* doc/sourcebuild.texi (vect_fold_left_plus): Document.
	* cfgexpand.c (expand_debug_expr): Handle FOLD_LEFT_PLUS_EXPR.
	* expr.c (expand_expr_real_2): Likewise.
	* fold-const.c (const_binop): Likewise.
	* optabs-tree.c (optab_for_tree_code): Likewise.
	* tree-cfg.c (verify_gimple_assign_binary): Likewise.
	* tree-inline.c (estimate_operator_cost): Likewise.
	* tree-pretty-print.c (dump_generic_node): Likewise.
	(op_code_prio): Likewise.
	(op_symbol_code): Likewise.
	* tree-vect-stmts.c (vectorizable_operation): Likewise.
	* tree-parloops.c (valid_reduction_p): New function.
	(gather_scalar_reductions): Use it.
	* tree-vectorizer.h (FOLD_LEFT_REDUCTION): New vect_reduction_type.
	(vect_finish_replace_stmt): Declare.
	* tree-vect-loop.c (fold_left_reduction_code): New function.
	(needs_fold_left_reduction_p): New function, split out from...
	(vect_is_simple_reduction): ...here.  Accept reductions that
	forbid reassociation, but give them type FOLD_LEFT_REDUCTION.
	(vect_force_simple_reduction): Also store the reduction type in
	the assignment's STMT_VINFO_REDUC_TYPE.
	(vect_model_reduction_cost): Handle FOLD_LEFT_REDUCTION.
	(merge_with_identity): New function.
	(vectorize_fold_left_reduction): Likewise.
	(vectorizable_reduction): Handle FOLD_LEFT_REDUCTION.  Leave the
	scalar phi in place for it.  Require target support and reject
	cases that would reassociate the operation.  Defer the transform
	phase to vectorize_fold_left_reduction.
	* config/aarch64/aarch64.md (UNSPEC_FADDA): New unspec.
	* config/aarch64/aarch64-sve.md (fold_left_plus_<mode>): New expander.
	(*fold_left_plus_<mode>, *pred_fold_left_plus_<mode>): New insns.

gcc/testsuite/
	* lib/target-supports.exp
	(check_effective_target_vect_fold_left_plus): New proc.
	* gcc.dg/vect/no-fast-math-vect16.c: Expect the test to pass if
	vect_fold_left_plus.
	* gcc.dg/vect/pr79920.c: Expect both loops to be vectorized if
	vect_fold_left_plus.
	* gcc.dg/vect/trapv-vect-reduc-4.c: Expect the first loop to be
	recognized as a reduction and then rejected for lack of target
	support.
	* gcc.dg/vect/vect-reduc-6.c: Expect the loop to be vectorized if
	vect_fold_left_plus.
	* gcc.target/aarch64/sve_reduc_strict_1.c: New test.
	* gcc.target/aarch64/sve_reduc_strict_1_run.c: Likewise.
	* gcc.target/aarch64/sve_reduc_strict_2.c: Likewise.
	* gcc.target/aarch64/sve_reduc_strict_2_run.c: Likewise.
	* gcc.target/aarch64/sve_reduc_strict_3.c: Likewise.
	* gcc.target/aarch64/sve_slp_13.c: Add floating-point types.
	* gfortran.dg/vect/vect-8.f90: Expect 25 loops to be vectorized
	if vect_fold_left_plus.

Index: gcc/tree.def
===================================================================
--- gcc/tree.def	2017-11-17 16:52:07.246852461 +0000
+++ gcc/tree.def	2017-11-17 16:52:07.631930981 +0000
@@ -1302,6 +1302,8 @@ DEFTREECODE (REDUC_AND_EXPR, "reduc_and_
 DEFTREECODE (REDUC_IOR_EXPR, "reduc_ior_expr", tcc_unary, 1)
 DEFTREECODE (REDUC_XOR_EXPR, "reduc_xor_expr", tcc_unary, 1)
 
+DEFTREECODE (FOLD_LEFT_PLUS_EXPR, "fold_left_plus_expr", tcc_binary, 2)
+
 /* Widening dot-product.
    The first two arguments are of type t1.
    The third argument and the result are of type t2, such that t2 is at least
Index: gcc/doc/generic.texi
===================================================================
--- gcc/doc/generic.texi	2017-11-17 16:52:07.246852461 +0000
+++ gcc/doc/generic.texi	2017-11-17 16:52:07.620954871 +0000
@@ -1746,6 +1746,7 @@ a value from @code{enum annot_expr_kind}
 @tindex REDUC_AND_EXPR
 @tindex REDUC_IOR_EXPR
 @tindex REDUC_XOR_EXPR
+@tindex FOLD_LEFT_PLUS_EXPR
 
 @table @code
 @item VEC_DUPLICATE_EXPR
@@ -1861,6 +1862,12 @@ the maximum element in @var{x}.  The ass
 is unspecified; for example, @samp{REDUC_PLUS_EXPR <@var{x}>} could
 sum floating-point @var{x} in forward order, in reverse order, using
 a tree, or in some other way.
+
+@item FOLD_LEFT_PLUS_EXPR
+This node takes two arguments: a scalar of type @var{t} and a vector
+of @var{t}s.  It successively adds each element of the vector to the
+scalar and returns the result.  The operation is strictly in-order:
+there is no reassociation.
 @end table
Index: gcc/optabs.def
===================================================================
--- gcc/optabs.def	2017-11-17 16:52:07.246852461 +0000
+++ gcc/optabs.def	2017-11-17 16:52:07.625528250 +0000
@@ -306,6 +306,7 @@ OPTAB_D (reduc_umin_scal_optab, "reduc_u
 OPTAB_D (reduc_and_scal_optab,  "reduc_and_scal_$a")
 OPTAB_D (reduc_ior_scal_optab,  "reduc_ior_scal_$a")
 OPTAB_D (reduc_xor_scal_optab,  "reduc_xor_scal_$a")
+OPTAB_D (fold_left_plus_optab, "fold_left_plus_$a")
 
 OPTAB_D (extract_last_optab, "extract_last_$a")
 OPTAB_D (fold_extract_last_optab, "fold_extract_last_$a")
Index: gcc/doc/md.texi
===================================================================
--- gcc/doc/md.texi	2017-11-17 16:52:07.246852461 +0000
+++ gcc/doc/md.texi	2017-11-17 16:52:07.621869547 +0000
@@ -5285,6 +5285,14 @@ has mode @var{m} and operands 0 and 1 ha
 one element of @var{m}.  Operand 2 has the usual mask mode for vectors
 of mode @var{m}; see @code{TARGET_VECTORIZE_GET_MASK_MODE}.
 
+@cindex @code{fold_left_plus_@var{m}} instruction pattern
+@item @code{fold_left_plus_@var{m}}
+Take scalar operand 1 and successively add each element from vector
+operand 2.  Store the result in scalar operand 0.  The vector has
+mode @var{m} and the scalars have the mode appropriate for one
+element of @var{m}.  The operation is strictly in-order: there is
+no reassociation.
+
 @cindex @code{sdot_prod@var{m}} instruction pattern
 @item @samp{sdot_prod@var{m}}
 @cindex @code{udot_prod@var{m}} instruction pattern
Index: gcc/doc/sourcebuild.texi
===================================================================
--- gcc/doc/sourcebuild.texi	2017-11-17 16:52:07.246852461 +0000
+++ gcc/doc/sourcebuild.texi	2017-11-17 16:52:07.621869547 +0000
@@ -1580,6 +1580,9 @@ Target supports AND, IOR and XOR reducti
 
 @item vect_fold_extract_last
 Target supports the @code{fold_extract_last} optab.
+
+@item vect_fold_left_plus
+Target supports the @code{fold_left_plus} optab.
 @end table
 
 @subsubsection Thread Local Storage attributes
Index: gcc/cfgexpand.c
===================================================================
--- gcc/cfgexpand.c	2017-11-17 16:52:07.246852461 +0000
+++ gcc/cfgexpand.c	2017-11-17 16:52:07.620040195 +0000
@@ -5072,6 +5072,7 @@ expand_debug_expr (tree exp)
     case REDUC_AND_EXPR:
     case REDUC_IOR_EXPR:
     case REDUC_XOR_EXPR:
+    case FOLD_LEFT_PLUS_EXPR:
     case VEC_COND_EXPR:
     case VEC_PACK_FIX_TRUNC_EXPR:
     case VEC_PACK_SAT_EXPR:
Index: gcc/expr.c
===================================================================
--- gcc/expr.c	2017-11-17 16:52:07.246852461 +0000
+++ gcc/expr.c	2017-11-17 16:52:07.622784222 +0000
@@ -9438,6 +9438,28 @@ #define REDUCE_BIT_FIELD(expr) (reduce_b
 	return target;
       }
 
+    case FOLD_LEFT_PLUS_EXPR:
+      {
+	op0 = expand_normal (treeop0);
+	op1 = expand_normal (treeop1);
+	this_optab = optab_for_tree_code (code, type, optab_default);
+	machine_mode vec_mode = TYPE_MODE (TREE_TYPE (treeop1));
+	insn_code icode = optab_handler (this_optab, vec_mode);
+
+	if (icode != CODE_FOR_nothing)
+	  {
+	    struct expand_operand ops[3];
+	    create_output_operand (&ops[0], target, mode);
+	    create_input_operand (&ops[1], op0, mode);
+	    create_input_operand (&ops[2], op1, vec_mode);
+	    if (maybe_expand_insn (icode, 3, ops))
+	      return ops[0].value;
+	  }
+
+	/* Nothing to fall back to.  */
+	gcc_unreachable ();
+      }
+
     case REDUC_MAX_EXPR:
     case REDUC_MIN_EXPR:
     case REDUC_PLUS_EXPR:
Index: gcc/fold-const.c
===================================================================
--- gcc/fold-const.c	2017-11-17 16:52:07.246852461 +0000
+++ gcc/fold-const.c	2017-11-17 16:52:07.623698898 +0000
@@ -1603,6 +1603,32 @@ const_binop (enum tree_code code, tree a
 	return NULL_TREE;
       return build_vector_from_val (TREE_TYPE (arg1), sub);
     }
+
+  if (CONSTANT_CLASS_P (arg1)
+      && TREE_CODE (arg2) == VECTOR_CST)
+    {
+      tree_code subcode;
+
+      switch (code)
+	{
+	case FOLD_LEFT_PLUS_EXPR:
+	  subcode = PLUS_EXPR;
+	  break;
+	default:
+	  return NULL_TREE;
+	}
+
+      int nelts = VECTOR_CST_NELTS (arg2);
+      tree accum = arg1;
+      for (int i = 0; i < nelts; i++)
+	{
+	  accum = const_binop (subcode, accum, VECTOR_CST_ELT (arg2, i));
+	  if (accum == NULL_TREE || !CONSTANT_CLASS_P (accum))
+	    return NULL_TREE;
+	}
+
+      return accum;
+    }
 
   return NULL_TREE;
 }
Index: gcc/optabs-tree.c
===================================================================
--- gcc/optabs-tree.c	2017-11-17 16:52:07.246852461 +0000
+++ gcc/optabs-tree.c	2017-11-17 16:52:07.623698898 +0000
@@ -166,6 +166,9 @@ optab_for_tree_code (enum tree_code code
     case REDUC_XOR_EXPR:
       return reduc_xor_scal_optab;
 
+    case FOLD_LEFT_PLUS_EXPR:
+      return fold_left_plus_optab;
+
     case VEC_WIDEN_MULT_HI_EXPR:
       return TYPE_UNSIGNED (type) ?
	     vec_widen_umult_hi_optab : vec_widen_smult_hi_optab;
Index: gcc/tree-cfg.c
===================================================================
--- gcc/tree-cfg.c	2017-11-17 16:52:07.246852461 +0000
+++ gcc/tree-cfg.c	2017-11-17 16:52:07.628272277 +0000
@@ -4116,6 +4116,19 @@ verify_gimple_assign_binary (gassign *st
       /* Continue with generic binary expression handling.  */
       break;
 
+    case FOLD_LEFT_PLUS_EXPR:
+      if (!VECTOR_TYPE_P (rhs2_type)
+	  || !useless_type_conversion_p (lhs_type, TREE_TYPE (rhs2_type))
+	  || !useless_type_conversion_p (lhs_type, rhs1_type))
+	{
+	  error ("reduction should convert from vector to element type");
+	  debug_generic_expr (lhs_type);
+	  debug_generic_expr (rhs1_type);
+	  debug_generic_expr (rhs2_type);
+	  return true;
+	}
+      return false;
+
     case VEC_SERIES_EXPR:
       if (!useless_type_conversion_p (rhs1_type, rhs2_type))
	{
Index: gcc/tree-inline.c
===================================================================
--- gcc/tree-inline.c	2017-11-17 16:52:07.246852461 +0000
+++ gcc/tree-inline.c	2017-11-17 16:52:07.628272277 +0000
@@ -3881,6 +3881,7 @@ estimate_operator_cost (enum tree_code c
     case REDUC_AND_EXPR:
     case REDUC_IOR_EXPR:
     case REDUC_XOR_EXPR:
+    case FOLD_LEFT_PLUS_EXPR:
     case WIDEN_SUM_EXPR:
     case WIDEN_MULT_EXPR:
     case DOT_PROD_EXPR:
Index: gcc/tree-pretty-print.c
===================================================================
--- gcc/tree-pretty-print.c	2017-11-17 16:52:07.246852461 +0000
+++ gcc/tree-pretty-print.c	2017-11-17 16:52:07.629186953 +0000
@@ -3232,6 +3232,7 @@ dump_generic_node (pretty_printer *pp, t
       break;
 
     case VEC_SERIES_EXPR:
+    case FOLD_LEFT_PLUS_EXPR:
    case VEC_WIDEN_MULT_HI_EXPR:
     case VEC_WIDEN_MULT_LO_EXPR:
     case VEC_WIDEN_MULT_EVEN_EXPR:
@@ -3628,6 +3629,7 @@ op_code_prio (enum tree_code code)
     case REDUC_MAX_EXPR:
     case REDUC_MIN_EXPR:
     case REDUC_PLUS_EXPR:
+    case FOLD_LEFT_PLUS_EXPR:
     case VEC_UNPACK_HI_EXPR:
     case VEC_UNPACK_LO_EXPR:
     case VEC_UNPACK_FLOAT_HI_EXPR:
@@ -3749,6 +3751,9 @@ op_symbol_code (enum tree_code code)
     case REDUC_PLUS_EXPR:
       return "r+";
 
+    case FOLD_LEFT_PLUS_EXPR:
+      return "fl+";
+
     case WIDEN_SUM_EXPR:
       return "w+";
Index: gcc/tree-vect-stmts.c
===================================================================
--- gcc/tree-vect-stmts.c	2017-11-17 16:52:07.246852461 +0000
+++ gcc/tree-vect-stmts.c	2017-11-17 16:52:07.631016305 +0000
@@ -5415,6 +5415,10 @@ vectorizable_operation (gimple *stmt, gi
 
   code = gimple_assign_rhs_code (stmt);
 
+  /* Ignore operations that mix scalar and vector input operands.  */
+  if (code == FOLD_LEFT_PLUS_EXPR)
+    return false;
+
   /* For pointer addition, we should use the normal plus for the vector
      addition.  */
   if (code == POINTER_PLUS_EXPR)
Index: gcc/tree-parloops.c
===================================================================
--- gcc/tree-parloops.c	2017-11-17 16:52:07.246852461 +0000
+++ gcc/tree-parloops.c	2017-11-17 16:52:07.629186953 +0000
@@ -2531,6 +2531,19 @@ set_reduc_phi_uids (reduction_info **slo
   return 1;
 }
 
+/* Return true if the type of reduction performed by STMT is suitable
+   for this pass.  */
+
+static bool
+valid_reduction_p (gimple *stmt)
+{
+  /* Parallelization would reassociate the operation, which isn't
+     allowed for in-order reductions.  */
+  stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
+  vect_reduction_type reduc_type = STMT_VINFO_REDUC_TYPE (stmt_info);
+  return reduc_type != FOLD_LEFT_REDUCTION;
+}
+
 /* Detect all reductions in the LOOP, insert them into REDUCTION_LIST.  */
 
 static void
@@ -2564,7 +2577,7 @@ gather_scalar_reductions (loop_p loop, r
       gimple *reduc_stmt
	= vect_force_simple_reduction (simple_loop_info, phi,
				       &double_reduc, true);
-      if (!reduc_stmt)
+      if (!reduc_stmt || !valid_reduction_p (reduc_stmt))
	continue;
 
       if (double_reduc)
@@ -2610,7 +2623,8 @@ gather_scalar_reductions (loop_p loop, r
	    = vect_force_simple_reduction (simple_loop_info, inner_phi,
					   &double_reduc, true);
	  gcc_assert (!double_reduc);
-	  if (inner_reduc_stmt == NULL)
+	  if (inner_reduc_stmt == NULL
+	      || !valid_reduction_p (inner_reduc_stmt))
	    continue;
 
	  build_new_reduction (reduction_list, double_reduc_stmts[i], phi);
Index: gcc/tree-vectorizer.h
===================================================================
--- gcc/tree-vectorizer.h	2017-11-17 16:52:07.246852461 +0000
+++ gcc/tree-vectorizer.h	2017-11-17 16:52:07.631016305 +0000
@@ -74,7 +74,15 @@ enum vect_reduction_type {
 
        for (int i = 0; i < VF; ++i)
         res = cond[i] ? val[i] : res;  */
-  EXTRACT_LAST_REDUCTION
+  EXTRACT_LAST_REDUCTION,
+
+  /* Use a folding reduction within the loop to implement:
+
+       for (int i = 0; i < VF; ++i)
+	 res = res OP val[i];
+
+     (with no reassociation).  */
+  FOLD_LEFT_REDUCTION
 };
 
 #define VECTORIZABLE_CYCLE_DEF(D) (((D) == vect_reduction_def) \
@@ -1389,6 +1397,7 @@ extern void vect_model_load_cost (stmt_v
 extern unsigned record_stmt_cost (stmt_vector_for_cost *, int,
				  enum vect_cost_for_stmt, stmt_vec_info,
				  int, enum vect_cost_model_location);
+extern void vect_finish_replace_stmt (gimple *, gimple *);
 extern void vect_finish_stmt_generation (gimple *, gimple *,
					 gimple_stmt_iterator *);
 extern bool vect_mark_stmts_to_be_vectorized (loop_vec_info);
Index: gcc/tree-vect-loop.c
===================================================================
--- gcc/tree-vect-loop.c	2017-11-17 16:52:07.246852461 +0000
+++ gcc/tree-vect-loop.c	2017-11-17 16:52:07.630101629 +0000
@@ -2573,6 +2573,29 @@ vect_analyze_loop (struct loop *loop, lo
     }
 }
 
+/* Return true if the target supports in-order reductions for operation
+   CODE and type TYPE.  If the target supports it, store the reduction
+   operation in *REDUC_CODE.  */
+
+static bool
+fold_left_reduction_code (tree_code code, tree type, tree_code *reduc_code)
+{
+  switch (code)
+    {
+    case PLUS_EXPR:
+      code = FOLD_LEFT_PLUS_EXPR;
+      break;
+
+    default:
+      return false;
+    }
+
+  if (!target_supports_op_p (type, code, optab_vector))
+    return false;
+
+  *reduc_code = code;
+  return true;
+}
 
 /* Function reduction_code_for_scalar_code
@@ -2880,6 +2903,42 @@ vect_is_slp_reduction (loop_vec_info loo
   return true;
 }
 
+/* Returns true if we need an in-order reduction for operation CODE
+   on type TYPE.  NEED_WRAPPING_INTEGRAL_OVERFLOW is true if integer
+   overflow must wrap.  */
+
+static bool
+needs_fold_left_reduction_p (tree type, tree_code code,
+			     bool need_wrapping_integral_overflow)
+{
+  /* CHECKME: check for !flag_finite_math_only too?  */
+  if (SCALAR_FLOAT_TYPE_P (type))
+    switch (code)
+      {
+      case MIN_EXPR:
+      case MAX_EXPR:
+	return false;
+
+      default:
+	return !flag_associative_math;
+      }
+
+  if (INTEGRAL_TYPE_P (type))
+    {
+      if (!operation_no_trapping_overflow (type, code))
+	return true;
+      if (need_wrapping_integral_overflow
+	  && !TYPE_OVERFLOW_WRAPS (type)
+	  && operation_can_overflow (code))
+	return true;
+      return false;
+    }
+
+  if (SAT_FIXED_POINT_TYPE_P (type))
+    return true;
+
+  return false;
+}
 
 /* Function vect_is_simple_reduction
@@ -3198,58 +3257,18 @@ vect_is_simple_reduction (loop_vec_info
       return NULL;
     }
 
-  /* Check that it's ok to change the order of the computation.
+  /* Check whether it's ok to change the order of the computation.
      Generally, when vectorizing a reduction we change the order of the
      computation.  This may change the behavior of the program in some
      cases, so we need to check that this is ok.  One exception is when
      vectorizing an outer-loop: the inner-loop is executed sequentially,
      and therefore vectorizing reductions in the inner-loop during
      outer-loop vectorization is safe.  */
-
-  if (*v_reduc_type != COND_REDUCTION
-      && check_reduction)
-    {
-      /* CHECKME: check for !flag_finite_math_only too?  */
-      if (SCALAR_FLOAT_TYPE_P (type) && !flag_associative_math)
-	{
-	  /* Changing the order of operations changes the semantics.  */
-	  if (dump_enabled_p ())
-	    report_vect_op (MSG_MISSED_OPTIMIZATION, def_stmt,
-			    "reduction: unsafe fp math optimization: ");
-	  return NULL;
-	}
-      else if (INTEGRAL_TYPE_P (type))
-	{
-	  if (!operation_no_trapping_overflow (type, code))
-	    {
-	      /* Changing the order of operations changes the semantics.  */
-	      if (dump_enabled_p ())
-		report_vect_op (MSG_MISSED_OPTIMIZATION, def_stmt,
-				"reduction: unsafe int math optimization"
-				" (overflow traps): ");
-	      return NULL;
-	    }
-	  if (need_wrapping_integral_overflow
-	      && !TYPE_OVERFLOW_WRAPS (type)
-	      && operation_can_overflow (code))
-	    {
-	      /* Changing the order of operations changes the semantics.  */
-	      if (dump_enabled_p ())
-		report_vect_op (MSG_MISSED_OPTIMIZATION, def_stmt,
-				"reduction: unsafe int math optimization"
-				" (overflow doesn't wrap): ");
-	      return NULL;
-	    }
-	}
-      else if (SAT_FIXED_POINT_TYPE_P (type))
-	{
-	  /* Changing the order of operations changes the semantics.  */
-	  if (dump_enabled_p ())
-	    report_vect_op (MSG_MISSED_OPTIMIZATION, def_stmt,
-			    "reduction: unsafe fixed-point math optimization: ");
-	  return NULL;
-	}
-    }
+  if (check_reduction
+      && *v_reduc_type == TREE_CODE_REDUCTION
+      && needs_fold_left_reduction_p (type, code,
+				      need_wrapping_integral_overflow))
+    *v_reduc_type = FOLD_LEFT_REDUCTION;
 
   /* Reduction is safe.  We're dealing with one of the following:
      1) integer arithmetic and no trapv
@@ -3513,6 +3532,7 @@ vect_force_simple_reduction (loop_vec_in
       STMT_VINFO_REDUC_TYPE (reduc_def_info) = v_reduc_type;
       STMT_VINFO_REDUC_DEF (reduc_def_info) = def;
       reduc_def_info = vinfo_for_stmt (def);
+      STMT_VINFO_REDUC_TYPE (reduc_def_info) = v_reduc_type;
       STMT_VINFO_REDUC_DEF (reduc_def_info) = phi;
     }
   return def;
@@ -4065,7 +4085,8 @@ vect_model_reduction_cost (stmt_vec_info
 
   code = gimple_assign_rhs_code (orig_stmt);
 
-  if (reduction_type == EXTRACT_LAST_REDUCTION)
+  if (reduction_type == EXTRACT_LAST_REDUCTION
+      || reduction_type == FOLD_LEFT_REDUCTION)
     {
       /* No extra instructions needed in the prologue.  */
       prologue_cost = 0;
@@ -4138,7 +4159,8 @@ vect_model_reduction_cost (stmt_vec_info
				scalar_stmt, stmt_info, 0, vect_epilogue);
     }
-  else if (reduction_type == EXTRACT_LAST_REDUCTION)
+  else if (reduction_type == EXTRACT_LAST_REDUCTION
+	   || reduction_type == FOLD_LEFT_REDUCTION)
     /* No extra instructions need in the epilogue.  */
     ;
   else
@@ -5884,6 +5906,155 @@ vect_create_epilog_for_reduction (vec
+/* Return a vector of type VECTYPE that is equal to the vector select
+   operation "MASK ? VEC : IDENTITY".  Insert the select statements
+   before GSI.  */
+
+static tree
+merge_with_identity (gimple_stmt_iterator *gsi, tree mask, tree vectype,
+		     tree vec, tree identity)
+{
+  tree cond = make_temp_ssa_name (vectype, NULL, "cond");
+  gimple *new_stmt = gimple_build_assign (cond, VEC_COND_EXPR,
+					  mask, vec, identity);
+  gsi_insert_before (gsi, new_stmt, GSI_SAME_STMT);
+  return cond;
+}
+
+/* Perform an in-order reduction (FOLD_LEFT_REDUCTION).  STMT is the
+   statement that performs reduction operation CODE and REDUC_DEF_STMT
+   is the phi statement.  REDUC_CODE is the fold-left tree code for CODE,
+   OPS are the operands of STMT, VECTYPE_IN is the type of the vector
+   input, REDUC_INDEX is the index of the reduction operand in OPS and
+   MASKS contains the masks of a fully-masked loop.  */
+
+static bool
+vectorize_fold_left_reduction (gimple *stmt, gimple_stmt_iterator *gsi,
+			       gimple **vec_stmt, slp_tree slp_node,
+			       gimple *reduc_def_stmt,
+			       tree_code code, tree_code reduc_code,
+			       tree ops[3], tree vectype_in,
+			       int reduc_index, vec_loop_masks *masks)
+{
+  stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
+  loop_vec_info loop_vinfo = STMT_VINFO_LOOP_VINFO (stmt_info);
+  tree vectype_out = STMT_VINFO_VECTYPE (stmt_info);
+  gimple *new_stmt = NULL;
+  tree op0 = ops[1 - reduc_index];
+
+  int group_size = 1;
+  gimple *scalar_dest_def;
+  auto_vec<tree> vec_oprnds0;
+  if (slp_node)
+    {
+      vect_get_vec_defs (op0, NULL_TREE, stmt, &vec_oprnds0, NULL, slp_node);
+      group_size = SLP_TREE_SCALAR_STMTS (slp_node).length ();
+      scalar_dest_def = SLP_TREE_SCALAR_STMTS (slp_node)[group_size - 1];
+    }
+  else
+    {
+      tree loop_vec_def0 = vect_get_vec_def_for_operand (op0, stmt);
+      vec_oprnds0.create (1);
+      vec_oprnds0.quick_push (loop_vec_def0);
+      scalar_dest_def = stmt;
+    }
+
+  tree scalar_dest = gimple_assign_lhs (scalar_dest_def);
+  tree scalar_type = TREE_TYPE (scalar_dest);
+  tree reduc_var = gimple_phi_result (reduc_def_stmt);
+
+  int vec_num = vec_oprnds0.length ();
+  gcc_assert (vec_num == 1 || slp_node);
+  tree vec_elem_type = TREE_TYPE (vectype_out);
+  gcc_checking_assert (useless_type_conversion_p (scalar_type, vec_elem_type));
+
+  tree vector_identity = NULL_TREE;
+  if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+    vector_identity = build_zero_cst (vectype_out);
+
+  int i;
+  tree def0;
+  FOR_EACH_VEC_ELT (vec_oprnds0, i, def0)
+    {
+      tree mask = NULL_TREE;
+      if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+	mask = vect_get_loop_mask (gsi, masks, vec_num, vectype_in, i);
+
+      /* Handle MINUS by adding the negative.  */
+      if (code == MINUS_EXPR)
+	{
+	  tree negated = make_ssa_name (vectype_out);
+	  new_stmt = gimple_build_assign (negated, NEGATE_EXPR, def0);
+	  gsi_insert_before (gsi, new_stmt, GSI_SAME_STMT);
+	  def0 = negated;
+	}
+
+      if (mask)
+	def0 = merge_with_identity (gsi, mask, vectype_out, def0,
+				    vector_identity);
+
+      /* On the first iteration the input is simply the scalar phi
+	 result, and for subsequent iterations it is the output of
+	 the preceding operation.  */
+      tree expr = build2 (reduc_code, scalar_type, reduc_var, def0);
+
+      /* For chained SLP reductions the output of the previous reduction
+	 operation serves as the input of the next.  For the final statement
+	 the output cannot be a temporary - we reuse the original
+	 scalar destination of the last statement.  */
+      if (i == vec_num - 1)
+	reduc_var = scalar_dest;
+      else
+	reduc_var = vect_create_destination_var (scalar_dest, NULL);
+      new_stmt = gimple_build_assign (reduc_var, expr);
+
+      if (i == vec_num - 1)
+	{
+	  SSA_NAME_DEF_STMT (reduc_var) = new_stmt;
+	  /* For chained SLP stmt is the first statement in the group and
+	     gsi points to the last statement in the group.  For non SLP stmt
+	     points to the same location as gsi.  In either case tmp_gsi and gsi
+	     should both point to the same insertion point.  */
+	  gcc_assert (scalar_dest_def == gsi_stmt (*gsi));
+	  vect_finish_replace_stmt (scalar_dest_def, new_stmt);
+	}
+      else
+	{
+	  reduc_var = make_ssa_name (reduc_var, new_stmt);
+	  gimple_assign_set_lhs (new_stmt, reduc_var);
+	  vect_finish_stmt_generation (stmt, new_stmt, gsi);
+	}
+
+      if (slp_node)
+	SLP_TREE_VEC_STMTS (slp_node).quick_push (new_stmt);
+    }
+
+  if (!slp_node)
+    STMT_VINFO_VEC_STMT (stmt_info) = *vec_stmt = new_stmt;
+
+  return true;
+}
 
 /* Function is_nonwrapping_integer_induction.
@@ -6063,6 +6234,12 @@ vectorizable_reduction (gimple *stmt, gi
       return true;
     }
 
+  if (STMT_VINFO_REDUC_TYPE (stmt_info) == FOLD_LEFT_REDUCTION)
+    /* Leave the scalar phi in place.  Note that checking
+       STMT_VINFO_VEC_REDUCTION_TYPE (as below) only works
+       for reductions involving a single statement.  */
+    return true;
+
   gimple *reduc_stmt = STMT_VINFO_REDUC_DEF (stmt_info);
   if (STMT_VINFO_IN_PATTERN_P (vinfo_for_stmt (reduc_stmt)))
     reduc_stmt = STMT_VINFO_RELATED_STMT (vinfo_for_stmt (reduc_stmt));
@@ -6289,6 +6466,14 @@ vectorizable_reduction (gimple *stmt, gi
      directy used in stmt.  */
   if (reduc_index == -1)
     {
+      if (STMT_VINFO_REDUC_TYPE (stmt_info) == FOLD_LEFT_REDUCTION)
+	{
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			     "in-order reduction chain without SLP.\n");
+	  return false;
+	}
+
       if (orig_stmt)
	reduc_def_stmt = STMT_VINFO_REDUC_DEF (orig_stmt_info);
       else
@@ -6508,7 +6693,9 @@ vectorizable_reduction (gimple *stmt, gi
   vect_reduction_type reduction_type
     = STMT_VINFO_VEC_REDUCTION_TYPE (stmt_info);
-  if (orig_stmt && reduction_type == TREE_CODE_REDUCTION)
+  if (orig_stmt
+      && (reduction_type == TREE_CODE_REDUCTION
+	  || reduction_type == FOLD_LEFT_REDUCTION))
     {
       /* This is a reduction pattern: get the vectype from the type of the
	  reduction variable, and get the tree-code from orig_stmt.  */
@@ -6555,13 +6742,22 @@ vectorizable_reduction (gimple *stmt, gi
   epilog_reduc_code = ERROR_MARK;
 
   if (reduction_type == TREE_CODE_REDUCTION
+      || reduction_type == FOLD_LEFT_REDUCTION
       || reduction_type == INTEGER_INDUC_COND_REDUCTION
       || reduction_type == CONST_COND_REDUCTION)
     {
-      if (reduction_code_for_scalar_code (orig_code, &epilog_reduc_code))
+      bool have_reduc_support;
+      if (reduction_type == FOLD_LEFT_REDUCTION)
+	have_reduc_support = fold_left_reduction_code (orig_code, vectype_out,
+						       &epilog_reduc_code);
+      else
+	have_reduc_support
+	  = reduction_code_for_scalar_code (orig_code, &epilog_reduc_code);
+
+      if (have_reduc_support)
	{
	  reduc_optab = optab_for_tree_code (epilog_reduc_code, vectype_out,
-					     optab_default);
+					     optab_default);
	  if (!reduc_optab)
	    {
	      if (dump_enabled_p ())
@@ -6687,6 +6883,41 @@ vectorizable_reduction (gimple *stmt, gi
	}
     }
 
+  if (double_reduc && reduction_type == FOLD_LEFT_REDUCTION)
+    {
+      /* We can't support in-order reductions of code such as this:
+
+	   for (int i = 0; i < n1; ++i)
+	     for (int j = 0; j < n2; ++j)
+	       l += a[j];
+
+	 since GCC effectively transforms the loop when vectorizing:
+
+	   for (int i = 0; i < n1 / VF; ++i)
+	     for (int j = 0; j < n2; ++j)
+	       for (int k = 0; k < VF; ++k)
+		 l += a[j];
+
+	 which is a reassociation of the original operation.  */
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			 "in-order double reduction not supported.\n");
+
+      return false;
+    }
+
+  if (reduction_type == FOLD_LEFT_REDUCTION
+      && slp_node
+      && !GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt)))
+    {
+      /* We cannot handle in-order reductions in this case because there is
+	 an implicit reassociation of the operations involved.  */
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			 "in-order unchained SLP reductions not supported.\n");
+      return false;
+    }
+
   /* In case of widenning multiplication by a constant, we update the type
      of the constant to be the type of the other operand.  We check that the
      constant fits the type in the pattern recognition pass.  */
@@ -6807,9 +7038,10 @@ vectorizable_reduction (gimple *stmt, gi
       vect_model_reduction_cost (stmt_info, epilog_reduc_code, ncopies);
       if (loop_vinfo && LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo))
	{
-	  if (cond_fn == IFN_LAST
-	      || !direct_internal_fn_supported_p (cond_fn, vectype_in,
-						  OPTIMIZE_FOR_SPEED))
+	  if (reduction_type != FOLD_LEFT_REDUCTION
+	      && (cond_fn == IFN_LAST
+		  || !direct_internal_fn_supported_p (cond_fn, vectype_in,
+						      OPTIMIZE_FOR_SPEED)))
	    {
	      if (dump_enabled_p ())
		dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
@@ -6844,6 +7076,11 @@ vectorizable_reduction (gimple *stmt, gi
 
   bool masked_loop_p = LOOP_VINFO_FULLY_MASKED_P (loop_vinfo);
 
+  if (reduction_type == FOLD_LEFT_REDUCTION)
+    return vectorize_fold_left_reduction
+      (stmt, gsi, vec_stmt, slp_node, reduc_def_stmt, code,
+       epilog_reduc_code, ops, vectype_in, reduc_index, masks);
+
   if (reduction_type == EXTRACT_LAST_REDUCTION)
     {
       gcc_assert (!slp_node);
Index: gcc/config/aarch64/aarch64.md
===================================================================
--- gcc/config/aarch64/aarch64.md	2017-11-17 16:52:07.246852461 +0000
+++ gcc/config/aarch64/aarch64.md	2017-11-17 16:52:07.620954871 +0000
@@ -164,6 +164,7 @@ (define_c_enum "unspec" [
     UNSPEC_STN
     UNSPEC_INSR
     UNSPEC_CLASTB
+    UNSPEC_FADDA
 ])
 
 (define_c_enum "unspecv" [
Index: gcc/config/aarch64/aarch64-sve.md
===================================================================
--- gcc/config/aarch64/aarch64-sve.md	2017-11-17 16:52:07.246852461 +0000
+++ gcc/config/aarch64/aarch64-sve.md	2017-11-17 16:52:07.620040195 +0000
@@ -1574,6 +1574,45 @@ (define_insn "*reduc_<optab>_scal_<mode>
   "<bit_reduc_op>\t%<Vetype>0, %1, %2.<Vetype>"
 )
 
+;; Unpredicated in-order FP reductions.
+(define_expand "fold_left_plus_<mode>"
+  [(set (match_operand:<VEL> 0 "register_operand")
+	(unspec:<VEL> [(match_dup 3)
+		       (match_operand:<VEL> 1 "register_operand")
+		       (match_operand:SVE_F 2 "register_operand")]
+		      UNSPEC_FADDA))]
+  "TARGET_SVE"
+  {
+    operands[3] = force_reg (<VPRED>mode, CONSTM1_RTX (<VPRED>mode));
+  }
+)
+
+;; In-order FP reductions predicated with PTRUE.
+(define_insn "*fold_left_plus_<mode>"
+  [(set (match_operand:<VEL> 0 "register_operand" "=w")
+	(unspec:<VEL> [(match_operand:<VPRED> 1 "register_operand" "Upl")
+		       (match_operand:<VEL> 2 "register_operand" "0")
+		       (match_operand:SVE_F 3 "register_operand" "w")]
+		      UNSPEC_FADDA))]
+  "TARGET_SVE"
+  "fadda\t%<Vetype>0, %1, %<Vetype>0, %3.<Vetype>"
+)
+
+;; Predicated form of the above in-order reduction.
+(define_insn "*pred_fold_left_plus_<mode>"
+  [(set (match_operand:<VEL> 0 "register_operand" "=w")
+	(unspec:<VEL>
+	  [(match_operand:<VEL> 1 "register_operand" "0")
+	   (unspec:SVE_F
+	     [(match_operand:<VPRED> 2 "register_operand" "Upl")
+	      (match_operand:SVE_F 3 "register_operand" "w")
+	      (match_operand:SVE_F 4 "aarch64_simd_imm_zero")]
+	     UNSPEC_SEL)]
+	  UNSPEC_FADDA))]
+  "TARGET_SVE"
+  "fadda\t%<Vetype>0, %2, %<Vetype>0, %3.<Vetype>"
+)
+
 ;; Unpredicated floating-point addition.
 (define_expand "add<mode>3"
   [(set (match_operand:SVE_F 0 "register_operand")
Index: gcc/testsuite/lib/target-supports.exp
===================================================================
--- gcc/testsuite/lib/target-supports.exp	2017-11-17 16:52:07.246852461 +0000
+++ gcc/testsuite/lib/target-supports.exp	2017-11-17 16:52:07.627357602 +0000
@@ -7180,6 +7180,12 @@ proc check_effective_target_vect_fold_ex
     return [check_effective_target_aarch64_sve]
 }
 
+# Return 1 if the target supports the fold_left_plus optab.
+
+proc check_effective_target_vect_fold_left_plus { } {
+    return [check_effective_target_aarch64_sve]
+}
+
 # Return 1 if the target supports section-anchors
 
 proc check_effective_target_section_anchors { } {
Index: gcc/testsuite/gcc.dg/vect/no-fast-math-vect16.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/no-fast-math-vect16.c	2017-11-17 16:52:07.246852461 +0000
+++ gcc/testsuite/gcc.dg/vect/no-fast-math-vect16.c	2017-11-17 16:52:07.625528250 +0000
@@ -34,4 +34,4 @@ int main (void)
 }
 
 /* Requires fast-math.  */
-/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { xfail *-*-* } } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { xfail { ! vect_fold_left_plus } } } } */
Index: gcc/testsuite/gcc.dg/vect/pr79920.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/pr79920.c	2017-11-17 16:52:07.246852461 +0000
+++ gcc/testsuite/gcc.dg/vect/pr79920.c	2017-11-17 16:52:07.625528250 +0000
@@ -41,4 +41,5 @@ int main()
   return 0;
 }
 
-/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target { vect_double && { vect_perm && vect_hw_misalign } } } } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target { { vect_double && { ! vect_fold_left_plus } } && { vect_perm && vect_hw_misalign } } } } } */
+/* { dg-final { scan-tree-dump-times "vectorized 2 loops" 1 "vect" { target { { vect_double && vect_fold_left_plus } && { vect_perm && vect_hw_misalign } } } } } */
Index: gcc/testsuite/gcc.dg/vect/trapv-vect-reduc-4.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/trapv-vect-reduc-4.c	2017-11-17 16:52:07.246852461 +0000
+++ gcc/testsuite/gcc.dg/vect/trapv-vect-reduc-4.c	2017-11-17 16:52:07.625528250 +0000
@@ -46,5 +46,9 @@ int main (void)
   return 0;
 }
 
-/* { dg-final { scan-tree-dump-times "Detected reduction\\." 2 "vect" } } */
+/* 2 for the first loop.  */
+/* { dg-final { scan-tree-dump-times "Detected reduction\\." 3 "vect" { target { ! vect_multiple_sizes } } } } */
+/* { dg-final { scan-tree-dump "Detected reduction\\." "vect" { target vect_multiple_sizes } } } */
+/* { dg-final { scan-tree-dump-times "not vectorized" 1 "vect" { target { ! vect_multiple_sizes } } } } */
+/* { dg-final { scan-tree-dump "not vectorized" "vect" { target vect_multiple_sizes } } } */
 /* { dg-final { scan-tree-dump-times "vectorized 2 loops" 1 "vect" { target { ! vect_no_int_min_max } } } } */
Index: gcc/testsuite/gcc.dg/vect/vect-reduc-6.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/vect-reduc-6.c	2017-11-17 16:52:07.246852461 +0000
+++ gcc/testsuite/gcc.dg/vect/vect-reduc-6.c	2017-11-17 16:52:07.625528250 +0000
@@ -50,4 +50,5 @@ int main (void)
 
 /* need -ffast-math to vectorizer these loops.  */
 /* ARM NEON passes -ffast-math to these tests, so expect this to fail.  */
-/* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" { xfail arm_neon_ok } } } */
+/* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" { target { ! vect_fold_left_plus } xfail arm_neon_ok } } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target vect_fold_left_plus } } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_1.c
===================================================================
--- /dev/null	2017-11-14 14:28:07.424493901 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_1.c	2017-11-17 16:52:07.625528250 +0000
@@ -0,0 +1,28 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve" } */
+
+#define NUM_ELEMS(TYPE) ((int)(5 * (256 / sizeof (TYPE)) + 3))
+
+#define DEF_REDUC_PLUS(TYPE)			\
+  TYPE __attribute__ ((noinline, noclone))	\
+  reduc_plus_##TYPE (TYPE *a, TYPE *b)		\
+  {						\
+    TYPE r = 0, q = 3;				\
+    for (int i = 0; i < NUM_ELEMS(TYPE); i++)	\
+      {						\
+	r += a[i];				\
+	q -= b[i];				\
+      }						\
+    return r * q;				\
+  }
+
+#define TEST_ALL(T) \
+  T (_Float16) \
+  T (float) \
+  T (double)
+
+TEST_ALL (DEF_REDUC_PLUS)
+
+/* { dg-final { scan-assembler-times {\tfadda\th[0-9]+, p[0-7], h[0-9]+, z[0-9]+\.h} 2 } } */
+/* { dg-final { scan-assembler-times {\tfadda\ts[0-9]+, p[0-7], s[0-9]+, z[0-9]+\.s} 2 } } */
+/* { dg-final { scan-assembler-times {\tfadda\td[0-9]+, p[0-7], d[0-9]+, z[0-9]+\.d} 2 } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_1_run.c
===================================================================
--- /dev/null	2017-11-14 14:28:07.424493901 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_1_run.c	2017-11-17 16:52:07.625528250 +0000
@@ -0,0 +1,29 @@
+/* { dg-do run { target { aarch64_sve_hw } } } */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve" } */
+
+#include "sve_reduc_strict_1.c"
+
+#define TEST_REDUC_PLUS(TYPE)				\
+  {							\
+    TYPE a[NUM_ELEMS (TYPE)];				\
+    TYPE b[NUM_ELEMS (TYPE)];				\
+    TYPE r = 0, q = 3;					\
+    for (int i = 0; i < NUM_ELEMS (TYPE); i++)		\
+      {							\
+	a[i] = (i * 0.1) * (i & 1 ? 1 : -1);		\
+	b[i] = (i * 0.3) * (i & 1 ? 1 : -1);		\
+	r += a[i];					\
+	q -= b[i];					\
+	asm volatile ("" ::: "memory");			\
+      }							\
+    TYPE res = reduc_plus_##TYPE (a, b);		\
+    if (res != r * q)					\
+      __builtin_abort ();				\
+  }
+
+int __attribute__ ((optimize (1)))
+main ()
+{
+  TEST_ALL (TEST_REDUC_PLUS);
+  return 0;
+}
Index: gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_2.c
===================================================================
--- /dev/null	2017-11-14 14:28:07.424493901 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_2.c	2017-11-17 16:52:07.625528250 +0000
@@ -0,0 +1,28 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve" } */
+
+#define NUM_ELEMS(TYPE) ((int) (5 * (256 / sizeof (TYPE)) + 3))
+
+#define DEF_REDUC_PLUS(TYPE)					\
+void __attribute__ ((noinline, noclone))			\
+reduc_plus_##TYPE (TYPE (*restrict a)[NUM_ELEMS(TYPE)],		\
+		   TYPE *restrict r, int n)			\
+{								\
+  for (int i = 0; i < n; i++)					\
+    {								\
+      r[i] = 0;							\
+      for (int j = 0; j < NUM_ELEMS(TYPE); j++)			\
+	r[i] += a[i][j];					\
+    }								\
+}
+
+#define TEST_ALL(T) \
+  T (_Float16) \
+  T (float) \
+  T (double)
+
+TEST_ALL (DEF_REDUC_PLUS)
+
+/* { dg-final { scan-assembler-times {\tfadda\th[0-9]+, p[0-7], h[0-9]+, z[0-9]+\.h} 1 } } */
+/* { dg-final { scan-assembler-times {\tfadda\ts[0-9]+, p[0-7], s[0-9]+, z[0-9]+\.s} 1 } } */
+/* { dg-final { scan-assembler-times {\tfadda\td[0-9]+, p[0-7], d[0-9]+, z[0-9]+\.d} 1 } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_2_run.c
===================================================================
--- /dev/null	2017-11-14 14:28:07.424493901 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_2_run.c	2017-11-17 16:52:07.626442926 +0000
@@ -0,0 +1,31 @@
+/* { dg-do run { target { aarch64_sve_hw } } } */
+/* { dg-options "-O2 -ftree-vectorize -fno-inline -march=armv8-a+sve" } */
+
+#include "sve_reduc_strict_2.c"
+
+#define NROWS 5
+
+#define TEST_REDUC_PLUS(TYPE)					\
+  {								\
+    TYPE a[NROWS][NUM_ELEMS (TYPE)];				\
+    TYPE r[NROWS];						\
+    TYPE expected[NROWS] = {};					\
+    for (int i = 0; i < NROWS; ++i)				\
+      for (int j = 0; j < NUM_ELEMS (TYPE); ++j)		\
+	{							\
+	  a[i][j] = (i * 0.1 + j * 0.6) * (j & 1 ? 1 : -1);	\
+	  expected[i] += a[i][j];				\
+	  asm volatile ("" ::: "memory");			\
+	}							\
+    reduc_plus_##TYPE (a, r, NROWS);				\
+    for (int i = 0; i < NROWS; ++i)				\
+      if (r[i] != expected[i])					\
+	__builtin_abort ();					\
+  }
+
+int __attribute__ ((optimize (1)))
+main ()
+{
+  TEST_ALL (TEST_REDUC_PLUS);
+  return 0;
+}
Index: gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_3.c
===================================================================
--- /dev/null	2017-11-14 14:28:07.424493901 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_3.c	2017-11-17 16:52:07.626442926 +0000
@@ -0,0 +1,131 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -fno-inline -march=armv8-a+sve -msve-vector-bits=256 -fdump-tree-vect-details" } */
+
+double mat[100][4];
+double mat2[100][8];
+double mat3[100][12];
+double mat4[100][3];
+
+double
+slp_reduc_plus (int n)
+{
+  double tmp = 0.0;
+  for (int i = 0; i < n; i++)
+    {
+      tmp = tmp + mat[i][0];
+      tmp = tmp + mat[i][1];
+      tmp = tmp + mat[i][2];
+      tmp = tmp + mat[i][3];
+    }
+  return tmp;
+}
+
+double
+slp_reduc_plus2 (int n)
+{
+  double tmp = 0.0;
+  for (int i = 0; i < n; i++)
+    {
+      tmp = tmp + mat2[i][0];
+      tmp = tmp + mat2[i][1];
+      tmp = tmp + mat2[i][2];
+      tmp = tmp + mat2[i][3];
+      tmp = tmp + mat2[i][4];
+      tmp = tmp + mat2[i][5];
+      tmp = tmp + mat2[i][6];
+      tmp = tmp + mat2[i][7];
+    }
+  return tmp;
+}
+
+double
+slp_reduc_plus3 (int n)
+{
+  double tmp = 0.0;
+  for (int i = 0; i < n; i++)
+    {
+      tmp = tmp + mat3[i][0];
+      tmp = tmp + mat3[i][1];
+      tmp = tmp + mat3[i][2];
+      tmp = tmp + mat3[i][3];
+      tmp = tmp + mat3[i][4];
+      tmp = tmp + mat3[i][5];
+      tmp = tmp + mat3[i][6];
+      tmp = tmp + mat3[i][7];
+      tmp = tmp + mat3[i][8];
+      tmp = tmp + mat3[i][9];
+      tmp = tmp + mat3[i][10];
+      tmp = tmp + mat3[i][11];
+    }
+  return tmp;
+}
+
+void
+slp_non_chained_reduc (int n, double * restrict out)
+{
+  for (int i = 0; i < 3; i++)
+    out[i] = 0;
+
+  for (int i = 0; i < n; i++)
+    {
+      out[0] = out[0] + mat4[i][0];
+      out[1] = out[1] + mat4[i][1];
+      out[2] = out[2] + mat4[i][2];
+    }
+}
+
+/* Strict FP reductions shouldn't be used for the outer loops, only the
+   inner loops.  */
+
+float
+double_reduc1 (float (*restrict i)[16])
+{
+  float l = 0;
+
+  for (int a = 0; a < 8; a++)
+    for (int b = 0; b < 8; b++)
+      l += i[b][a];
+  return l;
+}
+
+float
+double_reduc2 (float *restrict i)
+{
+  float l = 0;
+
+  for (int a = 0; a < 8; a++)
+    for (int b = 0; b < 16; b++)
+      {
+	l += i[b * 4];
+	l += i[b * 4 + 1];
+	l += i[b * 4 + 2];
+	l += i[b * 4 + 3];
+      }
+  return l;
+}
+
+float
+double_reduc3 (float *restrict i, float *restrict j)
+{
+  float k = 0, l = 0;
+
+  for (int a = 0; a < 8; a++)
+    for (int b = 0; b < 8; b++)
+      {
+	k += i[b];
+	l += j[b];
+      }
+  return l * k;
+}
+
+/* We can't yet handle double_reduc1.  */
+/* { dg-final { scan-assembler-times {\tfadda\ts[0-9]+, p[0-7], s[0-9]+, z[0-9]+\.s} 3 } } */
+/* { dg-final { scan-assembler-times {\tfadda\td[0-9]+, p[0-7], d[0-9]+, z[0-9]+\.d} 9 } } */
+/* 1 reduction each for double_reduc{1,2} and 2 for double_reduc3.  Each one
+   is reported three times, once for SVE, once for 128-bit AdvSIMD and once
+   for 64-bit AdvSIMD.  */
+/* { dg-final { scan-tree-dump-times "Detected double reduction" 12 "vect" } } */
+/* double_reduc2 has 2 reductions and slp_non_chained_reduc has 3.
+   double_reduc1 is reported 3 times (SVE, 128-bit AdvSIMD, 64-bit AdvSIMD)
+   before failing.  */
+/* { dg-final { scan-tree-dump-times "Detected reduction" 12 "vect" } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_slp_13.c
===================================================================
--- gcc/testsuite/gcc.target/aarch64/sve_slp_13.c	2017-11-17 16:52:07.246852461 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_slp_13.c	2017-11-17 16:52:07.626442926 +0000
@@ -1,5 +1,6 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve -msve-vector-bits=scalable" } */
+/* The cost model thinks that the double loop isn't a win for SVE-128.  */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve -msve-vector-bits=scalable -fno-vect-cost-model" } */
 
 #include <stdint.h>
 
@@ -24,7 +25,10 @@ #define TEST_ALL(T)			\
   T (int32_t)				\
   T (uint32_t)				\
   T (int64_t)				\
-  T (uint64_t)
+  T (uint64_t)				\
+  T (_Float16)				\
+  T (float)				\
+  T (double)
 
 TEST_ALL (VEC_PERM)
 
@@ -32,21 +36,25 @@ TEST_ALL (VEC_PERM)
 /* ??? We don't treat the uint loops as SLP.  */
 /* The loop should be fully-masked.  */
 /* { dg-final { scan-assembler-times {\tld1b\t} 2 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\tld1h\t} 2 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\tld1w\t} 2 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\tld1w\t} 1 } } */
-/* { dg-final { scan-assembler-times {\tld1d\t} 2 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\tld1d\t} 1 } } */
+/* { dg-final { scan-assembler-times {\tld1h\t} 3 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tld1w\t} 3 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tld1w\t} 2 } } */
+/* { dg-final { scan-assembler-times {\tld1d\t} 3 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tld1d\t} 2 } } */
 /* { dg-final { scan-assembler-not {\tldr} { xfail *-*-* } } } */
 
 /* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.b} 4 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.h} 4 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s} 4 } } */
-/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d} 4 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.h} 6 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s} 6 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d} 6 } } */
 
 /* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.b\n} 2 { xfail *-*-* } } } */
 /* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.h\n} 2 { xfail *-*-* } } } */
 /* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.s\n} 2 } } */
 /* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.d\n} 2 } } */
+/* { dg-final { scan-assembler-times {\tfadda\th[0-9]+, p[0-7], h[0-9]+, z[0-9]+\.h\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tfadda\ts[0-9]+, p[0-7], s[0-9]+, z[0-9]+\.s\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tfadda\td[0-9]+, p[0-7], d[0-9]+, z[0-9]+\.d\n} 1 } } */
+/* { dg-final { scan-assembler-not {\tfadd\n} } } */
 
 /* { dg-final { scan-assembler-not {\tuqdec} } } */
Index: gcc/testsuite/gfortran.dg/vect/vect-8.f90
===================================================================
--- gcc/testsuite/gfortran.dg/vect/vect-8.f90	2017-11-17 16:52:07.246852461 +0000
+++ gcc/testsuite/gfortran.dg/vect/vect-8.f90	2017-11-17 16:52:07.626442926 +0000
@@ -704,5 +704,6 @@ CALL track('KERNEL  ')
 RETURN
 END SUBROUTINE kernel
-! { dg-final { scan-tree-dump-times "vectorized 21 loops" 1 "vect" { target { vect_intdouble_cvt } } } }
 ! { dg-final { scan-tree-dump-times "vectorized 17 loops" 1 "vect" { target { ! vect_intdouble_cvt } } } }
+! { dg-final { scan-tree-dump-times "vectorized 21 loops" 1 "vect" { target { vect_intdouble_cvt && { ! vect_fold_left_plus } } } } }
+! { dg-final { scan-tree-dump-times "vectorized 25 loops" 1 "vect" { target { vect_intdouble_cvt && vect_fold_left_plus } } } }
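
As a closing illustration of the fully-masked case handled above: before
the in-order fold, inactive lanes are merged with the additive identity,
which is what merge_with_identity arranges via build_zero_cst.  A scalar
sketch with invented names (not code from the patch), assuming the mask
is modelled as VF booleans:

#include <stdbool.h>
#include <stddef.h>

/* Scalar model of one masked fold-left step: lanes whose mask bit is
   clear contribute the identity (0.0 for addition), so they do not
   perturb the strictly in-order accumulation.  */
double
masked_fold_left_plus (double acc, const double *vec,
		       const bool *mask, size_t vf)
{
  for (size_t i = 0; i < vf; ++i)
    {
      double lane = mask[i] ? vec[i] : 0.0;  /* merge with identity */
      acc = acc + lane;                      /* in-order accumulation */
    }
  return acc;
}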