
Allow the number of iterations to be smaller than VF

Message ID: 87d14hym7l.fsf@linaro.org
State: New
Series: Allow the number of iterations to be smaller than VF

Commit Message

Richard Sandiford Nov. 17, 2017, 3:11 p.m. UTC
Fully-masked loops can be profitable even if the iteration
count is smaller than the vectorisation factor.  In this case
we're effectively doing a complete unroll followed by SLP.
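
As an illustration (my sketch, not part of the patch), consider a loop
of the shape used by the new sve_miniloop tests below.  It runs for
only 3 iterations, fewer than the vectorisation factor for most vector
modes, yet a fully-masked loop handles it in a single predicated
vector iteration:

void
add3 (int *__restrict a, int *__restrict b)
{
  /* Trip count 3 is smaller than the VF, so this loop used to be
     rejected with "iteration count smaller than vectorization
     factor".  With full masking, one vector iteration whose predicate
     enables only the first three elements does all the work.  */
  for (int i = 0; i < 3; ++i)
    a[i] += b[i];
}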

The documentation for min-vect-loop-bound says that the
default value is 0, but actually the default and minimum
were 1.  We need it to be 0 for this case since the parameter
counts a whole number of vector iterations.
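
To make the arithmetic concrete (my reading of the new
vect_analyze_loop_costing code below, not a quote from the patch): the
parameter is multiplied by the assumed VF, so with
--param min-vect-loop-bound=2 and an assumed VF of 8 the loop must run
for at least max (2 * 8, min_profitable_iters) scalar iterations to be
vectorized.  A 3-iteration fully-masked loop can therefore only survive
this check if the parameter is allowed to be 0 and the profitability
estimate itself is small enough.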

Tested on aarch64-linux-gnu (with and without SVE), x86_64-linux-gnu
and powerpc64le-linux-gnu.  OK to install?

Richard


2017-11-17  Richard Sandiford  <richard.sandiford@linaro.org>
	    Alan Hayward  <alan.hayward@arm.com>
	    David Sherwood  <david.sherwood@arm.com>

gcc/
	* doc/sourcebuild.texi (vect_fully_masked): Document.
	* params.def (PARAM_MIN_VECT_LOOP_BOUND): Change minimum and
	default value to 0.
	* tree-vect-loop.c (vect_analyze_loop_costing): New function,
	split out from...
	(vect_analyze_loop_2): ...here. Don't check the vectorization
	factor against the number of loop iterations if the loop is
	fully-masked.

gcc/testsuite/
	* lib/target-supports.exp (check_effective_target_vect_fully_masked):
	New proc.
	* gcc.dg/vect/slp-3.c: Expect all loops to be vectorized if
	vect_fully_masked.
	* gcc.target/aarch64/sve_loop_add_4.c: New test.
	* gcc.target/aarch64/sve_loop_add_4_run.c: Likewise.
	* gcc.target/aarch64/sve_loop_add_5.c: Likewise.
	* gcc.target/aarch64/sve_loop_add_5_run.c: Likewise.
	* gcc.target/aarch64/sve_miniloop_1.c: Likewise.
	* gcc.target/aarch64/sve_miniloop_2.c: Likewise.

Comments

Jeff Law Nov. 20, 2017, 12:12 a.m. UTC | #1
On 11/17/2017 08:11 AM, Richard Sandiford wrote:
> Fully-masked loops can be profitable even if the iteration
> count is smaller than the vectorisation factor.  In this case
> we're effectively doing a complete unroll followed by SLP.
> 
> The documentation for min-vect-loop-bound says that the
> default value is 0, but actually the default and minimum
> were 1.  We need it to be 0 for this case since the parameter
> counts a whole number of vector iterations.
> 
> Tested on aarch64-linux-gnu (with and without SVE), x86_64-linux-gnu
> and powerpc64le-linux-gnu.  OK to install?
> 
> Richard
> 
> 
> 2017-11-17  Richard Sandiford  <richard.sandiford@linaro.org>
> 	    Alan Hayward  <alan.hayward@arm.com>
> 	    David Sherwood  <david.sherwood@arm.com>
> 
> gcc/
> 	* doc/sourcebuild.texi (vect_fully_masked): Document.
> 	* params.def (PARAM_MIN_VECT_LOOP_BOUND): Change minimum and
> 	default value to 0.
> 	* tree-vect-loop.c (vect_analyze_loop_costing): New function,
> 	split out from...
> 	(vect_analyze_loop_2): ...here. Don't check the vectorization
> 	factor against the number of loop iterations if the loop is
> 	fully-masked.
> 
> gcc/testsuite/
> 	* lib/target-supports.exp (check_effective_target_vect_fully_masked):
> 	New proc.
> 	* gcc.dg/vect/slp-3.c: Expect all loops to be vectorized if
> 	vect_fully_masked.
> 	* gcc.target/aarch64/sve_loop_add_4.c: New test.
> 	* gcc.target/aarch64/sve_loop_add_4_run.c: Likewise.
> 	* gcc.target/aarch64/sve_loop_add_5.c: Likewise.
> 	* gcc.target/aarch64/sve_loop_add_5_run.c: Likewise.
> 	* gcc.target/aarch64/sve_miniloop_1.c: Likewise.
> 	* gcc.target/aarch64/sve_miniloop_2.c: Likewise.

OK.
Jeff
James Greenhalgh Jan. 7, 2018, 8:51 p.m. UTC | #2
On Mon, Nov 20, 2017 at 12:12:38AM +0000, Jeff Law wrote:
> On 11/17/2017 08:11 AM, Richard Sandiford wrote:
> > Fully-masked loops can be profitable even if the iteration
> > count is smaller than the vectorisation factor.  In this case
> > we're effectively doing a complete unroll followed by SLP.
> > 
> > The documentation for min-vect-loop-bound says that the
> > default value is 0, but actually the default and minimum
> > were 1.  We need it to be 0 for this case since the parameter
> > counts a whole number of vector iterations.
> > 
> > Tested on aarch64-linux-gnu (with and without SVE), x86_64-linux-gnu
> > and powerpc64le-linux-gnu.  OK to install?
> > 
> > Richard
> > 
> > 
> > 2017-11-17  Richard Sandiford  <richard.sandiford@linaro.org>
> > 	    Alan Hayward  <alan.hayward@arm.com>
> > 	    David Sherwood  <david.sherwood@arm.com>
> > 
> > gcc/
> > 	* doc/sourcebuild.texi (vect_fully_masked): Document.
> > 	* params.def (PARAM_MIN_VECT_LOOP_BOUND): Change minimum and
> > 	default value to 0.
> > 	* tree-vect-loop.c (vect_analyze_loop_costing): New function,
> > 	split out from...
> > 	(vect_analyze_loop_2): ...here. Don't check the vectorization
> > 	factor against the number of loop iterations if the loop is
> > 	fully-masked.
> > 
> > gcc/testsuite/
> > 	* lib/target-supports.exp (check_effective_target_vect_fully_masked):
> > 	New proc.
> > 	* gcc.dg/vect/slp-3.c: Expect all loops to be vectorized if
> > 	vect_fully_masked.
> > 	* gcc.target/aarch64/sve_loop_add_4.c: New test.
> > 	* gcc.target/aarch64/sve_loop_add_4_run.c: Likewise.
> > 	* gcc.target/aarch64/sve_loop_add_5.c: Likewise.
> > 	* gcc.target/aarch64/sve_loop_add_5_run.c: Likewise.
> > 	* gcc.target/aarch64/sve_miniloop_1.c: Likewise.
> > 	* gcc.target/aarch64/sve_miniloop_2.c: Likewise.
> OK.
> Jeff
> 

The AArch64 tests are OK.

James
Christophe Lyon Jan. 15, 2018, 10:13 a.m. UTC | #3
On 7 January 2018 at 21:51, James Greenhalgh <james.greenhalgh@arm.com> wrote:
> On Mon, Nov 20, 2017 at 12:12:38AM +0000, Jeff Law wrote:
>> On 11/17/2017 08:11 AM, Richard Sandiford wrote:
>> > Fully-masked loops can be profitable even if the iteration
>> > count is smaller than the vectorisation factor.  In this case
>> > we're effectively doing a complete unroll followed by SLP.
>> >
>> > The documentation for min-vect-loop-bound says that the
>> > default value is 0, but actually the default and minimum
>> > were 1.  We need it to be 0 for this case since the parameter
>> > counts a whole number of vector iterations.
>> >
>> > Tested on aarch64-linux-gnu (with and without SVE), x86_64-linux-gnu
>> > and powerpc64le-linux-gnu.  OK to install?
>> >
>> > Richard
>> >
>> >
>> > 2017-11-17  Richard Sandiford  <richard.sandiford@linaro.org>
>> >         Alan Hayward  <alan.hayward@arm.com>
>> >         David Sherwood  <david.sherwood@arm.com>
>> >
>> > gcc/
>> >     * doc/sourcebuild.texi (vect_fully_masked): Document.
>> >     * params.def (PARAM_MIN_VECT_LOOP_BOUND): Change minimum and
>> >     default value to 0.
>> >     * tree-vect-loop.c (vect_analyze_loop_costing): New function,
>> >     split out from...
>> >     (vect_analyze_loop_2): ...here. Don't check the vectorization
>> >     factor against the number of loop iterations if the loop is
>> >     fully-masked.
>> >
>> > gcc/testsuite/
>> >     * lib/target-supports.exp (check_effective_target_vect_fully_masked):
>> >     New proc.
>> >     * gcc.dg/vect/slp-3.c: Expect all loops to be vectorized if
>> >     vect_fully_masked.
>> >     * gcc.target/aarch64/sve_loop_add_4.c: New test.
>> >     * gcc.target/aarch64/sve_loop_add_4_run.c: Likewise.
>> >     * gcc.target/aarch64/sve_loop_add_5.c: Likewise.
>> >     * gcc.target/aarch64/sve_loop_add_5_run.c: Likewise.
>> >     * gcc.target/aarch64/sve_miniloop_1.c: Likewise.
>> >     * gcc.target/aarch64/sve_miniloop_2.c: Likewise.
>> OK.
>> Jeff
>
> The AArch64 tests are OK.
>

I've reported the failures on aarch64-none-elf -mabi=ilp32 in:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83849

Christophe

> James
>

Patch

Index: gcc/doc/sourcebuild.texi
===================================================================
--- gcc/doc/sourcebuild.texi	2017-11-17 15:09:28.740330131 +0000
+++ gcc/doc/sourcebuild.texi	2017-11-17 15:09:28.967330125 +0000
@@ -1403,6 +1403,10 @@  Target supports hardware vectors of @cod
 @item vect_long_long
 Target supports hardware vectors of @code{long long}.
 
+@item vect_fully_masked
+Target supports fully-masked (also known as fully-predicated) loops,
+so that vector loops can handle partial as well as full vectors.
+
 @item vect_masked_store
 Target supports vector masked stores.
 
Index: gcc/params.def
===================================================================
--- gcc/params.def	2017-11-17 15:09:28.740330131 +0000
+++ gcc/params.def	2017-11-17 15:09:28.967330125 +0000
@@ -139,7 +139,7 @@  DEFPARAM (PARAM_MAX_VARIABLE_EXPANSIONS,
 DEFPARAM (PARAM_MIN_VECT_LOOP_BOUND,
 	  "min-vect-loop-bound",
 	  "If -ftree-vectorize is used, the minimal loop bound of a loop to be considered for vectorization.",
-	  1, 1, 0)
+	  0, 0, 0)
 
 /* The maximum number of instructions to consider when looking for an
    instruction to fill a delay slot.  If more than this arbitrary
Index: gcc/tree-vect-loop.c
===================================================================
--- gcc/tree-vect-loop.c	2017-11-17 15:09:28.740330131 +0000
+++ gcc/tree-vect-loop.c	2017-11-17 15:09:28.969330125 +0000
@@ -1893,6 +1893,101 @@  vect_analyze_loop_operations (loop_vec_i
   return true;
 }
 
+/* Analyze the cost of the loop described by LOOP_VINFO.  Decide if it
+   is worthwhile to vectorize.  Return 1 if definitely yes, 0 if
+   definitely no, or -1 if it's worth retrying.  */
+
+static int
+vect_analyze_loop_costing (loop_vec_info loop_vinfo)
+{
+  struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
+  unsigned int assumed_vf = vect_vf_for_cost (loop_vinfo);
+
+  /* Only fully-masked loops can have iteration counts less than the
+     vectorization factor.  */
+  if (!LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+    {
+      HOST_WIDE_INT max_niter;
+
+      if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
+	max_niter = LOOP_VINFO_INT_NITERS (loop_vinfo);
+      else
+	max_niter = max_stmt_executions_int (loop);
+
+      if (max_niter != -1
+	  && (unsigned HOST_WIDE_INT) max_niter < assumed_vf)
+	{
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			     "not vectorized: iteration count smaller than "
+			     "vectorization factor.\n");
+	  return 0;
+	}
+    }
+
+  int min_profitable_iters, min_profitable_estimate;
+  vect_estimate_min_profitable_iters (loop_vinfo, &min_profitable_iters,
+				      &min_profitable_estimate);
+
+  if (min_profitable_iters < 0)
+    {
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			 "not vectorized: vectorization not profitable.\n");
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			 "not vectorized: vector version will never be "
+			 "profitable.\n");
+      return -1;
+    }
+
+  int min_scalar_loop_bound = (PARAM_VALUE (PARAM_MIN_VECT_LOOP_BOUND)
+			       * assumed_vf);
+
+  /* Use the cost model only if it is more conservative than user specified
+     threshold.  */
+  unsigned int th = (unsigned) MAX (min_scalar_loop_bound,
+				    min_profitable_iters);
+
+  LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo) = th;
+
+  if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
+      && LOOP_VINFO_INT_NITERS (loop_vinfo) < th)
+    {
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			 "not vectorized: vectorization not profitable.\n");
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_NOTE, vect_location,
+			 "not vectorized: iteration count smaller than user "
+			 "specified loop bound parameter or minimum profitable "
+			 "iterations (whichever is more conservative).\n");
+      return 0;
+    }
+
+  HOST_WIDE_INT estimated_niter = estimated_stmt_executions_int (loop);
+  if (estimated_niter == -1)
+    estimated_niter = likely_max_stmt_executions_int (loop);
+  if (estimated_niter != -1
+      && ((unsigned HOST_WIDE_INT) estimated_niter
+	  < MAX (th, (unsigned) min_profitable_estimate)))
+    {
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			 "not vectorized: estimated iteration count too "
+			 "small.\n");
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_NOTE, vect_location,
+			 "not vectorized: estimated iteration count smaller "
+			 "than specified loop bound parameter or minimum "
+			 "profitable iterations (whichever is more "
+			 "conservative).\n");
+      return -1;
+    }
+
+  return 1;
+}
+
 
 /* Function vect_analyze_loop_2.
 
@@ -1903,6 +1998,7 @@  vect_analyze_loop_operations (loop_vec_i
 vect_analyze_loop_2 (loop_vec_info loop_vinfo, bool &fatal)
 {
   bool ok;
+  int res;
   unsigned int max_vf = MAX_VECTORIZATION_FACTOR;
   poly_uint64 min_vf = 2;
   unsigned int n_stmts = 0;
@@ -2060,9 +2156,7 @@  vect_analyze_loop_2 (loop_vec_info loop_
   vect_compute_single_scalar_iteration_cost (loop_vinfo);
 
   poly_uint64 saved_vectorization_factor = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
-  HOST_WIDE_INT estimated_niter;
   unsigned th;
-  int min_scalar_loop_bound;
 
   /* Check the SLP opportunities in the loop, analyze and build SLP trees.  */
   ok = vect_analyze_slp (loop_vinfo, n_stmts);
@@ -2092,7 +2186,6 @@  vect_analyze_loop_2 (loop_vec_info loop_
   /* Now the vectorization factor is final.  */
   poly_uint64 vectorization_factor = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
   gcc_assert (must_ne (vectorization_factor, 0U));
-  unsigned int assumed_vf = vect_vf_for_cost (loop_vinfo);
 
   if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo) && dump_enabled_p ())
     {
@@ -2105,17 +2198,6 @@  vect_analyze_loop_2 (loop_vec_info loop_
 
   HOST_WIDE_INT max_niter
     = likely_max_stmt_executions_int (LOOP_VINFO_LOOP (loop_vinfo));
-  if ((LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
-       && (LOOP_VINFO_INT_NITERS (loop_vinfo) < assumed_vf))
-      || (max_niter != -1
-	  && (unsigned HOST_WIDE_INT) max_niter < assumed_vf))
-    {
-      if (dump_enabled_p ())
-	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-			 "not vectorized: iteration count smaller than "
-			 "vectorization factor.\n");
-      return false;
-    }
 
   /* Analyze the alignment of the data-refs in the loop.
      Fail if a data reference is found that cannot be vectorized.  */
@@ -2229,65 +2311,16 @@  vect_analyze_loop_2 (loop_vec_info loop_
 	}
     }
 
-  /* Analyze cost.  Decide if worth while to vectorize.  */
-  int min_profitable_estimate, min_profitable_iters;
-  vect_estimate_min_profitable_iters (loop_vinfo, &min_profitable_iters,
-				      &min_profitable_estimate);
-
-  if (min_profitable_iters < 0)
-    {
-      if (dump_enabled_p ())
-	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-			 "not vectorized: vectorization not profitable.\n");
-      if (dump_enabled_p ())
-	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-			 "not vectorized: vector version will never be "
-			 "profitable.\n");
-      goto again;
-    }
-
-  min_scalar_loop_bound = (PARAM_VALUE (PARAM_MIN_VECT_LOOP_BOUND)
-			   * assumed_vf);
-
-  /* Use the cost model only if it is more conservative than user specified
-     threshold.  */
-  th = (unsigned) MAX (min_scalar_loop_bound, min_profitable_iters);
-
-  LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo) = th;
-
-  if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
-      && LOOP_VINFO_INT_NITERS (loop_vinfo) < th)
-    {
-      if (dump_enabled_p ())
-	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-			 "not vectorized: vectorization not profitable.\n");
-      if (dump_enabled_p ())
-        dump_printf_loc (MSG_NOTE, vect_location,
-			 "not vectorized: iteration count smaller than user "
-			 "specified loop bound parameter or minimum profitable "
-			 "iterations (whichever is more conservative).\n");
-      goto again;
-    }
-
-  estimated_niter
-    = estimated_stmt_executions_int (LOOP_VINFO_LOOP (loop_vinfo));
-  if (estimated_niter == -1)
-    estimated_niter = max_niter;
-  if (estimated_niter != -1
-      && ((unsigned HOST_WIDE_INT) estimated_niter
-          < MAX (th, (unsigned) min_profitable_estimate)))
+  /* Check the costings of the loop make vectorizing worthwhile.  */
+  res = vect_analyze_loop_costing (loop_vinfo);
+  if (res < 0)
+    goto again;
+  if (!res)
     {
       if (dump_enabled_p ())
 	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-			 "not vectorized: estimated iteration count too "
-                         "small.\n");
-      if (dump_enabled_p ())
-        dump_printf_loc (MSG_NOTE, vect_location,
-			 "not vectorized: estimated iteration count smaller "
-                         "than specified loop bound parameter or minimum "
-                         "profitable iterations (whichever is more "
-                         "conservative).\n");
-      goto again;
+			 "Loop costings not worthwhile.\n");
+      return false;
     }
 
   /* Decide whether we need to create an epilogue loop to handle
@@ -3869,7 +3902,6 @@  vect_estimate_min_profitable_iters (loop
 			      * assumed_vf
 			      - vec_inside_cost * peel_iters_prologue
 			      - vec_inside_cost * peel_iters_epilogue);
-
       if (min_profitable_iters <= 0)
         min_profitable_iters = 0;
       else
Index: gcc/testsuite/lib/target-supports.exp
===================================================================
--- gcc/testsuite/lib/target-supports.exp	2017-11-17 15:09:28.740330131 +0000
+++ gcc/testsuite/lib/target-supports.exp	2017-11-17 15:09:28.968330125 +0000
@@ -6434,6 +6434,12 @@  proc check_effective_target_vect_natural
     return $et_vect_natural_alignment
 }
 
+# Return true if fully-masked loops are supported.
+
+proc check_effective_target_vect_fully_masked { } {
+    return [check_effective_target_aarch64_sve]
+}
+
 # Return 1 if the target doesn't prefer any alignment beyond element
 # alignment during vectorization.
 
Index: gcc/testsuite/gcc.dg/vect/slp-3.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/slp-3.c	2017-11-17 15:09:28.740330131 +0000
+++ gcc/testsuite/gcc.dg/vect/slp-3.c	2017-11-17 15:09:28.967330125 +0000
@@ -141,6 +141,8 @@  int main (void)
   return 0;
 }
 
-/* { dg-final { scan-tree-dump-times "vectorized 3 loops" 1 "vect" } } */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 3 "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorized 3 loops" 1 "vect" { target { ! vect_fully_masked } } } } */
+/* { dg-final { scan-tree-dump-times "vectorized 4 loops" 1 "vect" { target vect_fully_masked } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 3 "vect" { target { ! vect_fully_masked } } } }*/
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 4 "vect" { target vect_fully_masked } } } */
   
Index: gcc/testsuite/gcc.target/aarch64/sve_loop_add_4.c
===================================================================
--- /dev/null	2017-11-14 14:28:07.424493901 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_loop_add_4.c	2017-11-17 15:09:28.967330125 +0000
@@ -0,0 +1,96 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve -msve-vector-bits=scalable" } */
+
+#include <stdint.h>
+
+#define LOOP(TYPE, NAME, STEP)					\
+  __attribute__((noinline, noclone))				\
+  void								\
+  test_##TYPE##_##NAME (TYPE *dst, TYPE base, int count)	\
+  {								\
+    for (int i = 0; i < count; ++i, base += STEP)		\
+      dst[i] += base;						\
+  }
+
+#define TEST_TYPE(T, TYPE) \
+  T (TYPE, m17, -17) \
+  T (TYPE, m16, -16) \
+  T (TYPE, m15, -15) \
+  T (TYPE, m1, -1) \
+  T (TYPE, 1, 1) \
+  T (TYPE, 15, 15) \
+  T (TYPE, 16, 16) \
+  T (TYPE, 17, 17)
+
+#define TEST_ALL(T) \
+  TEST_TYPE (T, int8_t) \
+  TEST_TYPE (T, int16_t) \
+  TEST_TYPE (T, int32_t) \
+  TEST_TYPE (T, int64_t)
+
+TEST_ALL (LOOP)
+
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.b, w[0-9]+, #-16\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.b, w[0-9]+, #-15\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.b, w[0-9]+, #1\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.b, w[0-9]+, #15\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.b, w[0-9]+, w[0-9]+\n} 3 } } */
+/* { dg-final { scan-assembler-times {\tld1b\tz[0-9]+\.b, p[0-7]+/z, \[x[0-9]+, x[0-9]+\]} 8 } } */
+/* { dg-final { scan-assembler-times {\tst1b\tz[0-9]+\.b, p[0-7]+, \[x[0-9]+, x[0-9]+\]} 8 } } */
+/* { dg-final { scan-assembler-times {\tincb\tx[0-9]+\n} 8 } } */
+
+/* { dg-final { scan-assembler-not {\tdecb\tz[0-9]+\.b} } } */
+/* We don't need to increment the vector IV for steps -16 and 16, since the
+   increment is always a multiple of 256.  */
+/* { dg-final { scan-assembler-times {\tadd\tz[0-9]+\.b, z[0-9]+\.b, z[0-9]+\.b\n} 14 } } */
+
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.h, w[0-9]+, #-16\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.h, w[0-9]+, #-15\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.h, w[0-9]+, #1\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.h, w[0-9]+, #15\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.h, w[0-9]+, w[0-9]+\n} 3 } } */
+/* { dg-final { scan-assembler-times {\tld1h\tz[0-9]+\.h, p[0-7]+/z, \[x[0-9]+, x[0-9]+, lsl 1\]} 8 } } */
+/* { dg-final { scan-assembler-times {\tst1h\tz[0-9]+\.h, p[0-7]+, \[x[0-9]+, x[0-9]+, lsl 1\]} 8 } } */
+/* { dg-final { scan-assembler-times {\tincb\tx[0-9]+\n} 8 } } */
+
+/* { dg-final { scan-assembler-times {\tdech\tz[0-9]+\.h, all, mul #16\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tdech\tz[0-9]+\.h, all, mul #15\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tdech\tz[0-9]+\.h\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tinch\tz[0-9]+\.h\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tinch\tz[0-9]+\.h, all, mul #15\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tinch\tz[0-9]+\.h, all, mul #16\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tadd\tz[0-9]+\.h, z[0-9]+\.h, z[0-9]+\.h\n} 10 } } */
+
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.s, w[0-9]+, #-16\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.s, w[0-9]+, #-15\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.s, w[0-9]+, #1\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.s, w[0-9]+, #15\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.s, w[0-9]+, w[0-9]+\n} 3 } } */
+/* { dg-final { scan-assembler-times {\tld1w\tz[0-9]+\.s, p[0-7]+/z, \[x[0-9]+, x[0-9]+, lsl 2\]} 8 } } */
+/* { dg-final { scan-assembler-times {\tst1w\tz[0-9]+\.s, p[0-7]+, \[x[0-9]+, x[0-9]+, lsl 2\]} 8 } } */
+/* { dg-final { scan-assembler-times {\tincw\tx[0-9]+\n} 8 } } */
+
+/* { dg-final { scan-assembler-times {\tdecw\tz[0-9]+\.s, all, mul #16\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tdecw\tz[0-9]+\.s, all, mul #15\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tdecw\tz[0-9]+\.s\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tincw\tz[0-9]+\.s\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tincw\tz[0-9]+\.s, all, mul #15\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tincw\tz[0-9]+\.s, all, mul #16\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tadd\tz[0-9]+\.s, z[0-9]+\.s, z[0-9]+\.s\n} 10 } } */
+
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.d, x[0-9]+, #-16\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.d, x[0-9]+, #-15\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.d, x[0-9]+, #1\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.d, x[0-9]+, #15\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.d, x[0-9]+, x[0-9]+\n} 3 } } */
+/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]+/z, \[x[0-9]+, x[0-9]+, lsl 3\]} 8 } } */
+/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7]+, \[x[0-9]+, x[0-9]+, lsl 3\]} 8 } } */
+/* { dg-final { scan-assembler-times {\tincd\tx[0-9]+\n} 8 } } */
+
+/* { dg-final { scan-assembler-times {\tdecd\tz[0-9]+\.d, all, mul #16\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tdecd\tz[0-9]+\.d, all, mul #15\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tdecd\tz[0-9]+\.d\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tincd\tz[0-9]+\.d\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tincd\tz[0-9]+\.d, all, mul #15\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tincd\tz[0-9]+\.d, all, mul #16\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tadd\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 10 } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_loop_add_4_run.c
===================================================================
--- /dev/null	2017-11-14 14:28:07.424493901 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_loop_add_4_run.c	2017-11-17 15:09:28.967330125 +0000
@@ -0,0 +1,30 @@ 
+/* { dg-do run { target aarch64_sve_hw } } */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve" } */
+
+#include "sve_loop_add_4.c"
+
+#define N 131
+#define BASE 41
+
+#define TEST_LOOP(TYPE, NAME, STEP)				\
+  {								\
+    TYPE a[N];							\
+    for (int i = 0; i < N; ++i)					\
+      {								\
+	a[i] = i * i + i % 5;					\
+	asm volatile ("" ::: "memory");				\
+      }								\
+    test_##TYPE##_##NAME (a, BASE, N);				\
+    for (int i = 0; i < N; ++i)					\
+      {								\
+	TYPE expected = i * i + i % 5 + BASE + i * STEP;	\
+	if (a[i] != expected)					\
+	  __builtin_abort ();					\
+      }								\
+  }
+
+int __attribute__ ((optimize (1)))
+main (void)
+{
+  TEST_ALL (TEST_LOOP)
+}
Index: gcc/testsuite/gcc.target/aarch64/sve_loop_add_5.c
===================================================================
--- /dev/null	2017-11-14 14:28:07.424493901 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_loop_add_5.c	2017-11-17 15:09:28.967330125 +0000
@@ -0,0 +1,54 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve -msve-vector-bits=256" } */
+
+#include "sve_loop_add_4.c"
+
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.b, w[0-9]+, #-16\n} 1 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.b, w[0-9]+, #-15\n} 1 { xfail *-*-* }  } } */
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.b, w[0-9]+, #1\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.b, w[0-9]+, #15\n} 1 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.b, w[0-9]+, w[0-9]+\n} 3 { xfail *-*-* }  } } */
+/* { dg-final { scan-assembler-times {\tld1b\tz[0-9]+\.b, p[0-7]+/z, \[x[0-9]+, x[0-9]+\]} 8 } } */
+/* { dg-final { scan-assembler-times {\tst1b\tz[0-9]+\.b, p[0-7]+, \[x[0-9]+, x[0-9]+\]} 8 } } */
+
+/* The induction vector is invariant for steps of -16 and 16.  */
+/* { dg-final { scan-assembler-not {\tsub\tz[0-9]+\.b, z[0-9]+\.b, #} } } */
+/* { dg-final { scan-assembler-times {\tadd\tz[0-9]+\.b, z[0-9]+\.b, #} 6 } } */
+/* { dg-final { scan-assembler-times {\tadd\tz[0-9]+\.b, z[0-9]+\.b, z[0-9]+\.b\n} 8 } } */
+
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.h, w[0-9]+, #-16\n} 1 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.h, w[0-9]+, #-15\n} 1 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.h, w[0-9]+, #1\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.h, w[0-9]+, #15\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.h, w[0-9]+, w[0-9]+\n} 3 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tld1h\tz[0-9]+\.h, p[0-7]+/z, \[x[0-9]+, x[0-9]+, lsl 1\]} 8 } } */
+/* { dg-final { scan-assembler-times {\tst1h\tz[0-9]+\.h, p[0-7]+, \[x[0-9]+, x[0-9]+, lsl 1\]} 8 } } */
+
+/* The (-)17 * 16 is out of range.  */
+/* { dg-final { scan-assembler-times {\tsub\tz[0-9]+\.h, z[0-9]+\.h, #} 2 } } */
+/* { dg-final { scan-assembler-times {\tadd\tz[0-9]+\.h, z[0-9]+\.h, #} 4 } } */
+/* { dg-final { scan-assembler-times {\tadd\tz[0-9]+\.h, z[0-9]+\.h, z[0-9]+\.h\n} 10 } } */
+
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.s, w[0-9]+, #-16\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.s, w[0-9]+, #-15\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.s, w[0-9]+, #1\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.s, w[0-9]+, #15\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.s, w[0-9]+, w[0-9]+\n} 3 } } */
+/* { dg-final { scan-assembler-times {\tld1w\tz[0-9]+\.s, p[0-7]+/z, \[x[0-9]+, x[0-9]+, lsl 2\]} 8 } } */
+/* { dg-final { scan-assembler-times {\tst1w\tz[0-9]+\.s, p[0-7]+, \[x[0-9]+, x[0-9]+, lsl 2\]} 8 } } */
+
+/* { dg-final { scan-assembler-times {\tsub\tz[0-9]+\.s, z[0-9]+\.s, #} 4 } } */
+/* { dg-final { scan-assembler-times {\tadd\tz[0-9]+\.s, z[0-9]+\.s, #} 4 } } */
+/* { dg-final { scan-assembler-times {\tadd\tz[0-9]+\.s, z[0-9]+\.s, z[0-9]+\.s\n} 8 } } */
+
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.d, x[0-9]+, #-16\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.d, x[0-9]+, #-15\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.d, x[0-9]+, #1\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.d, x[0-9]+, #15\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.d, x[0-9]+, x[0-9]+\n} 3 } } */
+/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]+/z, \[x[0-9]+, x[0-9]+, lsl 3\]} 8 } } */
+/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7]+, \[x[0-9]+, x[0-9]+, lsl 3\]} 8 } } */
+
+/* { dg-final { scan-assembler-times {\tsub\tz[0-9]+\.d, z[0-9]+\.d, #} 4 } } */
+/* { dg-final { scan-assembler-times {\tadd\tz[0-9]+\.d, z[0-9]+\.d, #} 4 } } */
+/* { dg-final { scan-assembler-times {\tadd\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 8 } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_loop_add_5_run.c
===================================================================
--- /dev/null	2017-11-14 14:28:07.424493901 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_loop_add_5_run.c	2017-11-17 15:09:28.967330125 +0000
@@ -0,0 +1,5 @@ 
+/* { dg-do run { target aarch64_sve_hw } } */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve" } */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve -msve-vector-bits=256" { target aarch64_sve256_hw } } */
+
+#include "sve_loop_add_4_run.c"
Index: gcc/testsuite/gcc.target/aarch64/sve_miniloop_1.c
===================================================================
--- /dev/null	2017-11-14 14:28:07.424493901 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_miniloop_1.c	2017-11-17 15:09:28.967330125 +0000
@@ -0,0 +1,23 @@ 
+/* { dg-do assemble } */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve --save-temps" } */
+
+void loop (int * __restrict__ a, int * __restrict__ b, int * __restrict__ c,
+	   int * __restrict__ d, int * __restrict__ e, int * __restrict__ f,
+	   int * __restrict__ g, int * __restrict__ h)
+{
+  int i = 0;
+  for (i = 0; i < 3; i++)
+    {
+      a[i] += i;
+      b[i] += i;
+      c[i] += i;
+      d[i] += i;
+      e[i] += i;
+      f[i] += a[i] + 7;
+      g[i] += b[i] - 3;
+      h[i] += c[i] + 3;
+    }
+}
+
+/* { dg-final { scan-assembler-times {\tld1w\tz[0-9]+\.s, } 8 } } */
+/* { dg-final { scan-assembler-times {\tst1w\tz[0-9]+\.s, } 8 } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_miniloop_2.c
===================================================================
--- /dev/null	2017-11-14 14:28:07.424493901 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_miniloop_2.c	2017-11-17 15:09:28.967330125 +0000
@@ -0,0 +1,7 @@ 
+/* { dg-do assemble } */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve --save-temps -msve-vector-bits=256" } */
+
+#include "sve_miniloop_1.c"
+
+/* { dg-final { scan-assembler-times {\tld1w\tz[0-9]+\.s, } 8 } } */
+/* { dg-final { scan-assembler-times {\tst1w\tz[0-9]+\.s, } 8 } } */
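
For reference, a hypothetical test (not part of the patch; file name
and selectors are illustrative only) could key off the new
vect_fully_masked effective target in the same way as the slp-3.c
change above:

/* { dg-do compile } */
/* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect-details" } */

void
f (int *__restrict a, int *__restrict b)
{
  /* Fewer iterations than the vectorisation factor.  */
  for (int i = 0; i < 3; ++i)
    a[i] += b[i];
}

/* Expect vectorization only where fully-masked loops are supported.  */
/* { dg-final { scan-tree-dump "vectorized 1 loops" "vect" { target vect_fully_masked } } } */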