From patchwork Fri Nov 17 15:38:21 2017
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Richard Sandiford
X-Patchwork-Id: 119168
From: Richard Sandiford
To: gcc-patches@gcc.gnu.org
Mail-Followup-To: gcc-patches@gcc.gnu.org, richard.sandiford@linaro.org
Subject: Use single-iteration epilogues when peeling for gaps
Date: Fri, 17 Nov 2017 15:38:21 +0000
Message-ID: <87k1yoykya.fsf@linaro.org>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/25.3 (gnu/linux)
MIME-Version: 1.0

This patch adds support for fully-masking loops that require peeling
for gaps.  It peels exactly one scalar iteration and uses the masked
loop to handle the rest.  Previously we would fall back on using a
standard unmasked loop instead.

Tested on aarch64-linux-gnu (with and without SVE), x86_64-linux-gnu
and powerpc64le-linux-gnu.  OK to install?

Richard


2017-11-17  Richard Sandiford
	    Alan Hayward
	    David Sherwood

gcc/
	* tree-vect-loop-manip.c (vect_gen_scalar_loop_niters): Replace
	vfm1 with a bound_epilog parameter.
	(vect_do_peeling): Update calls accordingly, and move the prologue
	call earlier in the function.  Treat the base bound_epilog as 0 for
	fully-masked loops and retain vf - 1 for other loops.  Add 1 to
	this base when peeling for gaps.
	* tree-vect-loop.c (vect_analyze_loop_2): Allow peeling for gaps
	with fully-masked loops.
	(vect_estimate_min_profitable_iters): Handle the single peeled
	iteration in that case.

gcc/testsuite/
	* gcc.target/aarch64/sve_struct_vect_18.c: Check the number of
	branches.
	* gcc.target/aarch64/sve_struct_vect_19.c: Likewise.
	* gcc.target/aarch64/sve_struct_vect_20.c: New test.
	* gcc.target/aarch64/sve_struct_vect_20_run.c: Likewise.
	* gcc.target/aarch64/sve_struct_vect_21.c: Likewise.
	* gcc.target/aarch64/sve_struct_vect_21_run.c: Likewise.
	* gcc.target/aarch64/sve_struct_vect_22.c: Likewise.
	* gcc.target/aarch64/sve_struct_vect_22_run.c: Likewise.
	* gcc.target/aarch64/sve_struct_vect_23.c: Likewise.
	* gcc.target/aarch64/sve_struct_vect_23_run.c: Likewise.

Index: gcc/tree-vect-loop-manip.c
===================================================================
--- gcc/tree-vect-loop-manip.c	2017-11-17 15:36:46.119499244 +0000
+++ gcc/tree-vect-loop-manip.c	2017-11-17 15:36:46.354499238 +0000
@@ -596,8 +596,9 @@ vect_set_loop_masks_directly (struct loo
 
 /* Make LOOP iterate NITERS times using masking and WHILE_ULT calls.
    LOOP_VINFO describes the vectorization of LOOP.  NITERS is the
-   number of iterations of the original scalar loop.  NITERS_MAYBE_ZERO
-   and FINAL_IV are as for vect_set_loop_condition.
+   number of iterations of the original scalar loop that should be
+   handled by the vector loop.  NITERS_MAYBE_ZERO and FINAL_IV are
+   as for vect_set_loop_condition.
 
    Insert the branch-back condition before LOOP_COND_GSI and return the
    final gcond.  */
@@ -1812,23 +1813,24 @@ vect_build_loop_niters (loop_vec_info lo
 /* Calculate the number of iterations above which vectorized loop will be
    preferred than scalar loop.  NITERS_PROLOG is the number of iterations
    of prolog loop.  If it's integer const, the integer number is also passed
-   in INT_NITERS_PROLOG.  BOUND_PROLOG is the upper bound (included) of
-   number of iterations of prolog loop.  VFM1 is vector factor minus one.
-   If CHECK_PROFITABILITY is true, TH is the threshold below which scalar
-   (rather than vectorized) loop will be executed.  This function stores
-   upper bound (included) of the result in BOUND_SCALAR.  */
+   in INT_NITERS_PROLOG.  BOUND_PROLOG is the upper bound (inclusive) of the
+   number of iterations of the prolog loop.  BOUND_EPILOG is the corresponding
+   value for the epilog loop.
If CHECK_PROFITABILITY is true, TH is the + threshold below which the scalar (rather than vectorized) loop will be + executed. This function stores the upper bound (inclusive) of the result + in BOUND_SCALAR. */ static tree vect_gen_scalar_loop_niters (tree niters_prolog, int int_niters_prolog, - int bound_prolog, poly_int64 vfm1, int th, + int bound_prolog, poly_int64 bound_epilog, int th, poly_uint64 *bound_scalar, bool check_profitability) { tree type = TREE_TYPE (niters_prolog); tree niters = fold_build2 (PLUS_EXPR, type, niters_prolog, - build_int_cst (type, vfm1)); + build_int_cst (type, bound_epilog)); - *bound_scalar = vfm1 + bound_prolog; + *bound_scalar = bound_prolog + bound_epilog; if (check_profitability) { /* TH indicates the minimum niters of vectorized loop, while we @@ -1837,18 +1839,18 @@ vect_gen_scalar_loop_niters (tree niters /* Peeling for constant times. */ if (int_niters_prolog >= 0) { - *bound_scalar = upper_bound (int_niters_prolog + vfm1, th); + *bound_scalar = upper_bound (int_niters_prolog + bound_epilog, th); return build_int_cst (type, *bound_scalar); } - /* Peeling for unknown times. Note BOUND_PROLOG is the upper - bound (inlcuded) of niters of prolog loop. */ - if (must_ge (th, vfm1 + bound_prolog)) + /* Peeling an unknown number of times. Note that both BOUND_PROLOG + and BOUND_EPILOG are inclusive upper bounds. */ + if (must_ge (th, bound_prolog + bound_epilog)) { *bound_scalar = th; return build_int_cst (type, th); } /* Need to do runtime comparison. */ - else if (may_gt (th, vfm1)) + else if (may_gt (th, bound_epilog)) { *bound_scalar = upper_bound (*bound_scalar, th); return fold_build2 (MAX_EXPR, type, @@ -2381,14 +2383,20 @@ vect_do_peeling (loop_vec_info loop_vinf tree type = TREE_TYPE (niters), guard_cond; basic_block guard_bb, guard_to; profile_probability prob_prolog, prob_vector, prob_epilog; - int bound_prolog = 0; - poly_uint64 bound_scalar = 0; int estimated_vf; int prolog_peeling = 0; if (!vect_use_loop_mask_for_alignment_p (loop_vinfo)) prolog_peeling = LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo); - bool epilog_peeling = (LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo) - || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)); + + poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo); + poly_uint64 bound_epilog = 0; + if (!LOOP_VINFO_FULLY_MASKED_P (loop_vinfo) + && LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo)) + bound_epilog += vf - 1; + if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)) + bound_epilog += 1; + bool epilog_peeling = may_ne (bound_epilog, 0U); + poly_uint64 bound_scalar = bound_epilog; if (!prolog_peeling && !epilog_peeling) return NULL; @@ -2399,7 +2407,6 @@ vect_do_peeling (loop_vec_info loop_vinf estimated_vf = 3; prob_prolog = prob_epilog = profile_probability::guessed_always () .apply_scale (estimated_vf - 1, estimated_vf); - poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo); struct loop *prolog, *epilog = NULL, *loop = LOOP_VINFO_LOOP (loop_vinfo); struct loop *first_loop = loop; @@ -2414,14 +2421,29 @@ vect_do_peeling (loop_vec_info loop_vinf } initialize_original_copy_tables (); + /* Record the anchor bb at which the guard should be placed if the scalar + loop might be preferred. */ + basic_block anchor = loop_preheader_edge (loop)->src; + + /* Generate the number of iterations for the prolog loop. We do this here + so that we can also get the upper bound on the number of iterations. 
*/ + tree niters_prolog; + int bound_prolog = 0; + if (prolog_peeling) + niters_prolog = vect_gen_prolog_loop_niters (loop_vinfo, anchor, + &bound_prolog); + else + niters_prolog = build_int_cst (type, 0); + /* Prolog loop may be skipped. */ bool skip_prolog = (prolog_peeling != 0); /* Skip to epilog if scalar loop may be preferred. It's only needed when we peel for epilog loop and when it hasn't been checked with loop versioning. */ - bool skip_vector = ((!LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo) - && !LOOP_REQUIRES_VERSIONING (loop_vinfo)) - || !vf.is_constant ()); + bool skip_vector = (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo) + ? may_lt (LOOP_VINFO_INT_NITERS (loop_vinfo), + bound_prolog + bound_epilog) + : !LOOP_REQUIRES_VERSIONING (loop_vinfo)); /* Epilog loop must be executed if the number of iterations for epilog loop is known at compile time, otherwise we need to add a check at the end of vector loop and skip to the end of epilog loop. */ @@ -2432,9 +2454,6 @@ vect_do_peeling (loop_vec_info loop_vinf if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)) skip_epilog = false; - /* Record the anchor bb at which guard should be placed if scalar loop - may be preferred. */ - basic_block anchor = loop_preheader_edge (loop)->src; if (skip_vector) { split_edge (loop_preheader_edge (loop)); @@ -2452,7 +2471,6 @@ vect_do_peeling (loop_vec_info loop_vinf } } - tree niters_prolog = build_int_cst (type, 0); source_location loop_loc = find_loop_location (loop); struct loop *scalar_loop = LOOP_VINFO_SCALAR_LOOP (loop_vinfo); if (prolog_peeling) @@ -2476,9 +2494,7 @@ vect_do_peeling (loop_vec_info loop_vinf first_loop = prolog; reset_original_copy_tables (); - /* Generate and update the number of iterations for prolog loop. */ - niters_prolog = vect_gen_prolog_loop_niters (loop_vinfo, anchor, - &bound_prolog); + /* Update the number of iterations for prolog loop. */ tree step_prolog = build_one_cst (TREE_TYPE (niters_prolog)); vect_set_loop_condition (prolog, NULL, niters_prolog, step_prolog, NULL_TREE, false); @@ -2553,10 +2569,8 @@ vect_do_peeling (loop_vec_info loop_vinf if (skip_vector) { /* Additional epilogue iteration is peeled if gap exists. */ - bool peel_for_gaps = LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo); tree t = vect_gen_scalar_loop_niters (niters_prolog, prolog_peeling, - bound_prolog, - peel_for_gaps ? vf : vf - 1, + bound_prolog, bound_epilog, th, &bound_scalar, check_profitability); /* Build guard against NITERSM1 since NITERS may overflow. */ @@ -2640,14 +2654,12 @@ vect_do_peeling (loop_vec_info loop_vinf else slpeel_update_phi_nodes_for_lcssa (epilog); - unsigned HOST_WIDE_INT bound1, bound2; - if (vf.is_constant (&bound1) && bound_scalar.is_constant (&bound2)) + unsigned HOST_WIDE_INT bound; + if (bound_scalar.is_constant (&bound)) { - bound1 -= LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) ? 1 : 2; - if (bound2) - /* We share epilog loop with scalar version loop. */ - bound1 = MAX (bound1, bound2 - 1); - record_niter_bound (epilog, bound1, false, true); + gcc_assert (bound != 0); + /* -1 to convert loop iterations to latch iterations. 
*/ + record_niter_bound (epilog, bound - 1, false, true); } delete_update_ssa (); Index: gcc/tree-vect-loop.c =================================================================== --- gcc/tree-vect-loop.c 2017-11-17 15:36:46.119499244 +0000 +++ gcc/tree-vect-loop.c 2017-11-17 15:36:46.355499238 +0000 @@ -2257,16 +2257,6 @@ vect_analyze_loop_2 (loop_vec_info loop_ return false; } - if (LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo) - && LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)) - { - LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo) = false; - if (dump_enabled_p ()) - dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, - "can't use a fully-masked loop because peeling for" - " gaps is required.\n"); - } - /* Decide whether to use a fully-masked loop for this vectorization factor. */ LOOP_VINFO_FULLY_MASKED_P (loop_vinfo) @@ -3702,6 +3692,23 @@ vect_estimate_min_profitable_iters (loop { peel_iters_prologue = 0; peel_iters_epilogue = 0; + + if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)) + { + /* We need to peel exactly one iteration. */ + peel_iters_epilogue += 1; + stmt_info_for_cost *si; + int j; + FOR_EACH_VEC_ELT (LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo), + j, si) + { + struct _stmt_vec_info *stmt_info + = si->stmt ? vinfo_for_stmt (si->stmt) : NULL; + (void) add_stmt_cost (target_cost_data, si->count, + si->kind, stmt_info, si->misalign, + vect_epilogue); + } + } } else if (npeel < 0) { Index: gcc/testsuite/gcc.target/aarch64/sve_struct_vect_18.c =================================================================== --- gcc/testsuite/gcc.target/aarch64/sve_struct_vect_18.c 2017-11-17 15:36:46.119499244 +0000 +++ gcc/testsuite/gcc.target/aarch64/sve_struct_vect_18.c 2017-11-17 15:36:46.353499239 +0000 @@ -42,3 +42,6 @@ TEST (test) /* { dg-final { scan-assembler-times {\tstr\ts} 1 } } */ /* { dg-final { scan-assembler-times {\tldr\td} 2 } } */ /* { dg-final { scan-assembler-times {\tstr\td} 1 } } */ + +/* The only branches should be in the vectorized loop. */ +/* { dg-final { scan-assembler-times {\tb[a-z]+\t} 4 } } */ Index: gcc/testsuite/gcc.target/aarch64/sve_struct_vect_19.c =================================================================== --- gcc/testsuite/gcc.target/aarch64/sve_struct_vect_19.c 2017-11-17 15:36:46.119499244 +0000 +++ gcc/testsuite/gcc.target/aarch64/sve_struct_vect_19.c 2017-11-17 15:36:46.353499239 +0000 @@ -40,3 +40,8 @@ TEST (test) /* { dg-final { scan-assembler-times {\tstr\ts} 1 } } */ /* { dg-final { scan-assembler-times {\tldr\td} 2 } } */ /* { dg-final { scan-assembler-times {\tstr\td} 1 } } */ + +/* Each function should have three branches: one directly to the exit + (n <= 0), one to the single scalar epilogue iteration (n == 1), + and one branch-back for the vectorized loop. 
*/ +/* { dg-final { scan-assembler-times {\tb[a-z]+\t} 12 } } */ Index: gcc/testsuite/gcc.target/aarch64/sve_struct_vect_20.c =================================================================== --- /dev/null 2017-11-14 14:28:07.424493901 +0000 +++ gcc/testsuite/gcc.target/aarch64/sve_struct_vect_20.c 2017-11-17 15:36:46.353499239 +0000 @@ -0,0 +1,47 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve" } */ + +#define N 2000 + +#define TEST_LOOP(NAME, TYPE) \ + void __attribute__ ((noinline, noclone)) \ + NAME (TYPE *restrict dest, TYPE *restrict src) \ + { \ + for (int i = 0; i < N; ++i) \ + dest[i] += src[i * 2]; \ + } + +#define TEST(NAME) \ + TEST_LOOP (NAME##_i8, signed char) \ + TEST_LOOP (NAME##_i16, unsigned short) \ + TEST_LOOP (NAME##_f32, float) \ + TEST_LOOP (NAME##_f64, double) + +TEST (test) + +/* Check the vectorized loop. */ +/* { dg-final { scan-assembler-times {\tld1b\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tld2b\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tst1b\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tld1h\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tld2h\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tst1h\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tld1w\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tld2w\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tst1w\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tld1d\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tld2d\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tst1d\t} 1 } } */ + +/* Check the scalar tail. */ +/* { dg-final { scan-assembler-times {\tldrb\tw} 2 } } */ +/* { dg-final { scan-assembler-times {\tstrb\tw} 1 } } */ +/* { dg-final { scan-assembler-times {\tldrh\tw} 2 } } */ +/* { dg-final { scan-assembler-times {\tstrh\tw} 1 } } */ +/* { dg-final { scan-assembler-times {\tldr\ts} 2 } } */ +/* { dg-final { scan-assembler-times {\tstr\ts} 1 } } */ +/* { dg-final { scan-assembler-times {\tldr\td} 2 } } */ +/* { dg-final { scan-assembler-times {\tstr\td} 1 } } */ + +/* The only branches should be in the vectorized loop. 
*/ +/* { dg-final { scan-assembler-times {\tb[a-z]+\t} 4 } } */ Index: gcc/testsuite/gcc.target/aarch64/sve_struct_vect_20_run.c =================================================================== --- /dev/null 2017-11-14 14:28:07.424493901 +0000 +++ gcc/testsuite/gcc.target/aarch64/sve_struct_vect_20_run.c 2017-11-17 15:36:46.353499239 +0000 @@ -0,0 +1,36 @@ +/* { dg-do run { target aarch64_sve_hw } } */ +/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve" } */ + +#include "sve_struct_vect_20.c" + +#undef TEST_LOOP +#define TEST_LOOP(NAME, TYPE) \ + { \ + TYPE out[N]; \ + TYPE in[N * 2]; \ + for (int i = 0; i < N; ++i) \ + { \ + out[i] = i * 7 / 2; \ + asm volatile ("" ::: "memory"); \ + } \ + for (int i = 0; i < N * 2; ++i) \ + { \ + in[i] = i * 9 / 2; \ + asm volatile ("" ::: "memory"); \ + } \ + NAME (out, in); \ + for (int i = 0; i < N; ++i) \ + { \ + TYPE expected = i * 7 / 2 + in[i * 2]; \ + if (out[i] != expected) \ + __builtin_abort (); \ + asm volatile ("" ::: "memory"); \ + } \ + } + +int __attribute__ ((optimize (1))) +main (void) +{ + TEST (test); + return 0; +} Index: gcc/testsuite/gcc.target/aarch64/sve_struct_vect_21.c =================================================================== --- /dev/null 2017-11-14 14:28:07.424493901 +0000 +++ gcc/testsuite/gcc.target/aarch64/sve_struct_vect_21.c 2017-11-17 15:36:46.353499239 +0000 @@ -0,0 +1,47 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve" } */ + +#define TEST_LOOP(NAME, TYPE) \ + void __attribute__ ((noinline, noclone)) \ + NAME (TYPE *restrict dest, TYPE *restrict src, int n) \ + { \ + for (int i = 0; i < n; ++i) \ + dest[i] += src[i * 2]; \ + } + +#define TEST(NAME) \ + TEST_LOOP (NAME##_i8, signed char) \ + TEST_LOOP (NAME##_i16, unsigned short) \ + TEST_LOOP (NAME##_f32, float) \ + TEST_LOOP (NAME##_f64, double) + +TEST (test) + +/* Check the vectorized loop. */ +/* { dg-final { scan-assembler-times {\tld1b\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tld2b\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tst1b\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tld1h\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tld2h\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tst1h\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tld1w\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tld2w\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tst1w\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tld1d\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tld2d\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tst1d\t} 1 } } */ + +/* Check the scalar tail. */ +/* { dg-final { scan-assembler-times {\tldrb\tw} 2 } } */ +/* { dg-final { scan-assembler-times {\tstrb\tw} 1 } } */ +/* { dg-final { scan-assembler-times {\tldrh\tw} 2 } } */ +/* { dg-final { scan-assembler-times {\tstrh\tw} 1 } } */ +/* { dg-final { scan-assembler-times {\tldr\ts} 2 } } */ +/* { dg-final { scan-assembler-times {\tstr\ts} 1 } } */ +/* { dg-final { scan-assembler-times {\tldr\td} 2 } } */ +/* { dg-final { scan-assembler-times {\tstr\td} 1 } } */ + +/* Each function should have three branches: one directly to the exit + (n <= 0), one to the single scalar epilogue iteration (n == 1), + and one branch-back for the vectorized loop. 
*/ +/* { dg-final { scan-assembler-times {\tb[a-z]+\t} 12 } } */ Index: gcc/testsuite/gcc.target/aarch64/sve_struct_vect_21_run.c =================================================================== --- /dev/null 2017-11-14 14:28:07.424493901 +0000 +++ gcc/testsuite/gcc.target/aarch64/sve_struct_vect_21_run.c 2017-11-17 15:36:46.353499239 +0000 @@ -0,0 +1,45 @@ +/* { dg-do run { target aarch64_sve_hw } } */ +/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve" } */ + +#include "sve_struct_vect_21.c" + +#define N 1000 + +#undef TEST_LOOP +#define TEST_LOOP(NAME, TYPE) \ + { \ + TYPE out[N]; \ + TYPE in[N * 2]; \ + int counts[] = { 0, 1, N - 1 }; \ + for (int j = 0; j < 3; ++j) \ + { \ + int count = counts[j]; \ + for (int i = 0; i < N; ++i) \ + { \ + out[i] = i * 7 / 2; \ + asm volatile ("" ::: "memory"); \ + } \ + for (int i = 0; i < N * 2; ++i) \ + { \ + in[i] = i * 9 / 2; \ + asm volatile ("" ::: "memory"); \ + } \ + NAME (out, in, count); \ + for (int i = 0; i < N; ++i) \ + { \ + TYPE expected = i * 7 / 2; \ + if (i < count) \ + expected += in[i * 2]; \ + if (out[i] != expected) \ + __builtin_abort (); \ + asm volatile ("" ::: "memory"); \ + } \ + } \ + } + +int __attribute__ ((optimize (1))) +main (void) +{ + TEST (test); + return 0; +} Index: gcc/testsuite/gcc.target/aarch64/sve_struct_vect_22.c =================================================================== --- /dev/null 2017-11-14 14:28:07.424493901 +0000 +++ gcc/testsuite/gcc.target/aarch64/sve_struct_vect_22.c 2017-11-17 15:36:46.353499239 +0000 @@ -0,0 +1,47 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve" } */ + +#define N 2000 + +#define TEST_LOOP(NAME, TYPE) \ + void __attribute__ ((noinline, noclone)) \ + NAME (TYPE *restrict dest, TYPE *restrict src) \ + { \ + for (int i = 0; i < N; ++i) \ + dest[i] += src[i * 4]; \ + } + +#define TEST(NAME) \ + TEST_LOOP (NAME##_i8, signed char) \ + TEST_LOOP (NAME##_i16, unsigned short) \ + TEST_LOOP (NAME##_f32, float) \ + TEST_LOOP (NAME##_f64, double) + +TEST (test) + +/* Check the vectorized loop. */ +/* { dg-final { scan-assembler-times {\tld1b\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tld4b\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tst1b\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tld1h\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tld4h\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tst1h\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tld1w\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tld4w\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tst1w\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tld1d\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tld4d\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tst1d\t} 1 } } */ + +/* Check the scalar tail. */ +/* { dg-final { scan-assembler-times {\tldrb\tw} 2 } } */ +/* { dg-final { scan-assembler-times {\tstrb\tw} 1 } } */ +/* { dg-final { scan-assembler-times {\tldrh\tw} 2 } } */ +/* { dg-final { scan-assembler-times {\tstrh\tw} 1 } } */ +/* { dg-final { scan-assembler-times {\tldr\ts} 2 } } */ +/* { dg-final { scan-assembler-times {\tstr\ts} 1 } } */ +/* { dg-final { scan-assembler-times {\tldr\td} 2 } } */ +/* { dg-final { scan-assembler-times {\tstr\td} 1 } } */ + +/* The only branches should be in the vectorized loop. 
*/ +/* { dg-final { scan-assembler-times {\tb[a-z]+\t} 4 } } */ Index: gcc/testsuite/gcc.target/aarch64/sve_struct_vect_22_run.c =================================================================== --- /dev/null 2017-11-14 14:28:07.424493901 +0000 +++ gcc/testsuite/gcc.target/aarch64/sve_struct_vect_22_run.c 2017-11-17 15:36:46.354499238 +0000 @@ -0,0 +1,36 @@ +/* { dg-do run { target aarch64_sve_hw } } */ +/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve" } */ + +#include "sve_struct_vect_22.c" + +#undef TEST_LOOP +#define TEST_LOOP(NAME, TYPE) \ + { \ + TYPE out[N]; \ + TYPE in[N * 4]; \ + for (int i = 0; i < N; ++i) \ + { \ + out[i] = i * 7 / 2; \ + asm volatile ("" ::: "memory"); \ + } \ + for (int i = 0; i < N * 4; ++i) \ + { \ + in[i] = i * 9 / 2; \ + asm volatile ("" ::: "memory"); \ + } \ + NAME (out, in); \ + for (int i = 0; i < N; ++i) \ + { \ + TYPE expected = i * 7 / 2 + in[i * 4]; \ + if (out[i] != expected) \ + __builtin_abort (); \ + asm volatile ("" ::: "memory"); \ + } \ + } + +int __attribute__ ((optimize (1))) +main (void) +{ + TEST (test); + return 0; +} Index: gcc/testsuite/gcc.target/aarch64/sve_struct_vect_23.c =================================================================== --- /dev/null 2017-11-14 14:28:07.424493901 +0000 +++ gcc/testsuite/gcc.target/aarch64/sve_struct_vect_23.c 2017-11-17 15:36:46.354499238 +0000 @@ -0,0 +1,47 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve" } */ + +#define TEST_LOOP(NAME, TYPE) \ + void __attribute__ ((noinline, noclone)) \ + NAME (TYPE *restrict dest, TYPE *restrict src, int n) \ + { \ + for (int i = 0; i < n; ++i) \ + dest[i] += src[i * 4]; \ + } + +#define TEST(NAME) \ + TEST_LOOP (NAME##_i8, signed char) \ + TEST_LOOP (NAME##_i16, unsigned short) \ + TEST_LOOP (NAME##_f32, float) \ + TEST_LOOP (NAME##_f64, double) + +TEST (test) + +/* Check the vectorized loop. */ +/* { dg-final { scan-assembler-times {\tld1b\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tld4b\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tst1b\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tld1h\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tld4h\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tst1h\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tld1w\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tld4w\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tst1w\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tld1d\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tld4d\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tst1d\t} 1 } } */ + +/* Check the scalar tail. */ +/* { dg-final { scan-assembler-times {\tldrb\tw} 2 } } */ +/* { dg-final { scan-assembler-times {\tstrb\tw} 1 } } */ +/* { dg-final { scan-assembler-times {\tldrh\tw} 2 } } */ +/* { dg-final { scan-assembler-times {\tstrh\tw} 1 } } */ +/* { dg-final { scan-assembler-times {\tldr\ts} 2 } } */ +/* { dg-final { scan-assembler-times {\tstr\ts} 1 } } */ +/* { dg-final { scan-assembler-times {\tldr\td} 2 } } */ +/* { dg-final { scan-assembler-times {\tstr\td} 1 } } */ + +/* Each function should have three branches: one directly to the exit + (n <= 0), one to the single scalar epilogue iteration (n == 1), + and one branch-back for the vectorized loop. 
*/ +/* { dg-final { scan-assembler-times {\tb[a-z]+\t} 12 } } */ Index: gcc/testsuite/gcc.target/aarch64/sve_struct_vect_23_run.c =================================================================== --- /dev/null 2017-11-14 14:28:07.424493901 +0000 +++ gcc/testsuite/gcc.target/aarch64/sve_struct_vect_23_run.c 2017-11-17 15:36:46.354499238 +0000 @@ -0,0 +1,45 @@ +/* { dg-do run { target aarch64_sve_hw } } */ +/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve" } */ + +#include "sve_struct_vect_23.c" + +#define N 1000 + +#undef TEST_LOOP +#define TEST_LOOP(NAME, TYPE) \ + { \ + TYPE out[N]; \ + TYPE in[N * 4]; \ + int counts[] = { 0, 1, N - 1 }; \ + for (int j = 0; j < 3; ++j) \ + { \ + int count = counts[j]; \ + for (int i = 0; i < N; ++i) \ + { \ + out[i] = i * 7 / 2; \ + asm volatile ("" ::: "memory"); \ + } \ + for (int i = 0; i < N * 4; ++i) \ + { \ + in[i] = i * 9 / 2; \ + asm volatile ("" ::: "memory"); \ + } \ + NAME (out, in, count); \ + for (int i = 0; i < N; ++i) \ + { \ + TYPE expected = i * 7 / 2; \ + if (i < count) \ + expected += in[i * 4]; \ + if (out[i] != expected) \ + __builtin_abort (); \ + asm volatile ("" ::: "memory"); \ + } \ + } \ + } + +int __attribute__ ((optimize (1))) +main (void) +{ + TEST (test); + return 0; +}
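For reference (not part of the patch): a rough sketch of the kind of loop
the new tests exercise and of why peeling for gaps is needed.  The function
below is a made-up example in the style of the sve_struct_vect_20.c test.

    #define N 2000

    /* Only every second element of SRC is used, so the group of two
       elements loaded by the vectorizer (ld2 on SVE) contains a "gap".
       For i == N - 1 that grouped load would also fetch src[2 * N - 1],
       one element beyond the last element the scalar loop reads
       (src[2 * (N - 1)]).  Peeling for gaps therefore peels the final
       scalar iteration, and with this patch the remaining N - 1
       iterations can still be handled by a fully-masked loop.  */
    void
    example (int *restrict dest, int *restrict src)
    {
      for (int i = 0; i < N; ++i)
        dest[i] += src[i * 2];
    }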