Add support for vectorising live-out values using SVE LASTB

Message ID 87wp2px70o.fsf@linaro.org
State New

Commit Message

Richard Sandiford Nov. 17, 2017, 3:24 p.m. UTC
This patch uses the SVE LASTB instruction to optimise cases in which
a value produced by the final scalar iteration of a vectorised loop is
live outside the loop.  Previously this situation would stop us from
using a fully-masked loop.
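
As a rough illustration only (not code from the patch), the effect can be
modelled in plain C.  The sketch below assumes a hypothetical fixed lane
count VL and uses made-up helper names (model_extract_last and
model_masked_loop); real SVE vector lengths are only known at run time,
and the vectoriser works on gimple rather than on arrays like these.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical fixed lane count, chosen only for this sketch.  */
#define VL 4

/* Scalar model of EXTRACT_LAST / LASTB: return the element in the last
   active lane of VEC, or the final lane if no lane is active.  */
static uint32_t
model_extract_last (const bool mask[VL], const uint32_t vec[VL])
{
  for (int i = VL - 1; i >= 0; --i)
    if (mask[i])
      return vec[i];
  return vec[VL - 1];
}

/* Model of a fully-masked version of:

     for (int j = 0; j < n; ++j) { last = x[j]; x[j] = last * value; }
     return last;

   N must be positive, since the scalar loop leaves LAST undefined
   otherwise.  */
static uint32_t
model_masked_loop (uint32_t *x, int n, uint32_t value)
{
  assert (n > 0);
  bool mask[VL] = { false };
  uint32_t vec[VL] = { 0 };
  for (int base = 0; base < n; base += VL)
    for (int i = 0; i < VL; ++i)
      {
	/* Like WHILELO: lane I is active iff BASE + I < N.  */
	mask[i] = base + i < n;
	/* Masked load: inactive lanes read as zero.  */
	vec[i] = mask[i] ? x[base + i] : 0;
	/* Masked store: inactive lanes leave memory untouched.  */
	if (mask[i])
	  x[base + i] = vec[i] * value;
      }
  /* MASK and VEC now hold the state of the final vector iteration.  */
  return model_extract_last (mask, vec);
}
```

With n == 7 and VL == 4, the final iteration has three active lanes, so
the model returns the element loaded from x[6], which is the value the
scalar loop would leave in "last".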

Tested on aarch64-linux-gnu (with and without SVE), x86_64-linux-gnu
and powerpc64le-linux-gnu.  OK to install?

Richard


2017-11-17  Richard Sandiford  <richard.sandiford@linaro.org>
	    Alan Hayward  <alan.hayward@arm.com>
	    David Sherwood  <david.sherwood@arm.com>

gcc/
	* doc/md.texi (extract_last_@var{m}): Document.
	* optabs.def (extract_last_optab): New optab.
	* internal-fn.def (EXTRACT_LAST): New internal function.
	* internal-fn.c (cond_unary_direct): New macro.
	(expand_cond_unary_optab_fn): Likewise.
	(direct_cond_unary_optab_supported_p): Likewise.
	* tree-vect-loop.c (vectorizable_live_operation): Allow fully-masked
	loops using EXTRACT_LAST.
	* config/aarch64/aarch64-sve.md (aarch64_sve_lastb<mode>): Rename to...
	(extract_last_<mode>): ...this optab.
	(vec_extract<mode><Vel>): Update accordingly.

gcc/testsuite/
	* gcc.target/aarch64/sve_live_1.c: New test.
	* gcc.target/aarch64/sve_live_1_run.c: Likewise.

Comments

Jeff Law Dec. 13, 2017, 4:36 p.m. UTC | #1
On 11/17/2017 08:24 AM, Richard Sandiford wrote:
> This patch uses the SVE LASTB instruction to optimise cases in which
> a value produced by the final scalar iteration of a vectorised loop is
> live outside the loop.  Previously this situation would stop us from
> using a fully-masked loop.
>
> Tested on aarch64-linux-gnu (with and without SVE), x86_64-linux-gnu
> and powerpc64le-linux-gnu.  OK to install?
>
> Richard
>
>
> 2017-11-17  Richard Sandiford  <richard.sandiford@linaro.org>
> 	    Alan Hayward  <alan.hayward@arm.com>
> 	    David Sherwood  <david.sherwood@arm.com>
>
> gcc/
> 	* doc/md.texi (extract_last_@var{m}): Document.
> 	* optabs.def (extract_last_optab): New optab.
> 	* internal-fn.def (EXTRACT_LAST): New internal function.
> 	* internal-fn.c (cond_unary_direct): New macro.
> 	(expand_cond_unary_optab_fn): Likewise.
> 	(direct_cond_unary_optab_supported_p): Likewise.
> 	* tree-vect-loop.c (vectorizable_live_operation): Allow fully-masked
> 	loops using EXTRACT_LAST.
> 	* config/aarch64/aarch64-sve.md (aarch64_sve_lastb<mode>): Rename to...
> 	(extract_last_<mode>): ...this optab.
> 	(vec_extract<mode><Vel>): Update accordingly.
>
> gcc/testsuite/
> 	* gcc.target/aarch64/sve_live_1.c: New test.
> 	* gcc.target/aarch64/sve_live_1_run.c: Likewise.

Like the last patch, I didn't look at the aarch64 bits.  The generic
bits are OK.

jeff

James Greenhalgh Jan. 7, 2018, 8:37 p.m. UTC | #2
On Wed, Dec 13, 2017 at 04:36:47PM +0000, Jeff Law wrote:
> On 11/17/2017 08:24 AM, Richard Sandiford wrote:
> > This patch uses the SVE LASTB instruction to optimise cases in which
> > a value produced by the final scalar iteration of a vectorised loop is
> > live outside the loop.  Previously this situation would stop us from
> > using a fully-masked loop.
> >
> > Tested on aarch64-linux-gnu (with and without SVE), x86_64-linux-gnu
> > and powerpc64le-linux-gnu.  OK to install?
> >
> > Richard
> >
> >
> > 2017-11-17  Richard Sandiford  <richard.sandiford@linaro.org>
> > 	    Alan Hayward  <alan.hayward@arm.com>
> > 	    David Sherwood  <david.sherwood@arm.com>
> >
> > gcc/
> > 	* doc/md.texi (extract_last_@var{m}): Document.
> > 	* optabs.def (extract_last_optab): New optab.
> > 	* internal-fn.def (EXTRACT_LAST): New internal function.
> > 	* internal-fn.c (cond_unary_direct): New macro.
> > 	(expand_cond_unary_optab_fn): Likewise.
> > 	(direct_cond_unary_optab_supported_p): Likewise.
> > 	* tree-vect-loop.c (vectorizable_live_operation): Allow fully-masked
> > 	loops using EXTRACT_LAST.
> > 	* config/aarch64/aarch64-sve.md (aarch64_sve_lastb<mode>): Rename to...
> > 	(extract_last_<mode>): ...this optab.
> > 	(vec_extract<mode><Vel>): Update accordingly.
> >
> > gcc/testsuite/
> > 	* gcc.target/aarch64/sve_live_1.c: New test.
> > 	* gcc.target/aarch64/sve_live_1_run.c: Likewise.
> Like the last patch, I didn't look at the aarch64 bits.  The generic
> bits are OK.


OK for the AArch64 parts.

Thanks,
James
Patch

Index: gcc/doc/md.texi
===================================================================
--- gcc/doc/md.texi	2017-11-17 15:23:55.844062247 +0000
+++ gcc/doc/md.texi	2017-11-17 15:23:56.035829132 +0000
@@ -5268,6 +5268,14 @@  of a vector of mode @var{m}.  Operand 1
 is the scalar result.  The mode of the scalar result is the same as one
 element of @var{m}.
 
+@cindex @code{extract_last_@var{m}} instruction pattern
+@item @code{extract_last_@var{m}}
+Find the last set bit in mask operand 1 and extract the associated element
+of vector operand 2.  Store the result in scalar operand 0.  Operand 2
+has vector mode @var{m} while operand 0 has the mode appropriate for one
+element of @var{m}.  Operand 1 has the usual mask mode for vectors of mode
+@var{m}; see @code{TARGET_VECTORIZE_GET_MASK_MODE}.
+
 @cindex @code{sdot_prod@var{m}} instruction pattern
 @item @samp{sdot_prod@var{m}}
 @cindex @code{udot_prod@var{m}} instruction pattern
Index: gcc/optabs.def
===================================================================
--- gcc/optabs.def	2017-11-17 15:23:55.844062247 +0000
+++ gcc/optabs.def	2017-11-17 15:23:56.035829132 +0000
@@ -307,6 +307,8 @@  OPTAB_D (reduc_and_scal_optab,  "reduc_a
 OPTAB_D (reduc_ior_scal_optab,  "reduc_ior_scal_$a")
 OPTAB_D (reduc_xor_scal_optab,  "reduc_xor_scal_$a")
 
+OPTAB_D (extract_last_optab, "extract_last_$a")
+
 OPTAB_D (sdot_prod_optab, "sdot_prod$I$a")
 OPTAB_D (ssum_widen_optab, "widen_ssum$I$a3")
 OPTAB_D (udot_prod_optab, "udot_prod$I$a")
Index: gcc/internal-fn.def
===================================================================
--- gcc/internal-fn.def	2017-11-17 15:23:55.844062247 +0000
+++ gcc/internal-fn.def	2017-11-17 15:23:56.035829132 +0000
@@ -142,6 +142,10 @@  DEF_INTERNAL_COND_OPTAB_FN (XOR, ECF_CON
 
 DEF_INTERNAL_OPTAB_FN (RSQRT, ECF_CONST, rsqrt, unary)
 
+/* Extract the last active element from a vector.  */
+DEF_INTERNAL_OPTAB_FN (EXTRACT_LAST, ECF_CONST | ECF_NOTHROW,
+		       extract_last, cond_unary)
+
 /* Unary math functions.  */
 DEF_INTERNAL_FLT_FN (ACOS, ECF_CONST, acos, unary)
 DEF_INTERNAL_FLT_FN (ASIN, ECF_CONST, asin, unary)
Index: gcc/internal-fn.c
===================================================================
--- gcc/internal-fn.c	2017-11-17 15:23:55.844062247 +0000
+++ gcc/internal-fn.c	2017-11-17 15:23:56.035829132 +0000
@@ -88,6 +88,7 @@  #define store_lanes_direct { 0, 0, false
 #define mask_store_lanes_direct { 0, 0, false }
 #define unary_direct { 0, 0, true }
 #define binary_direct { 0, 0, true }
+#define cond_unary_direct { 1, 1, true }
 #define cond_binary_direct { 1, 1, true }
 #define while_direct { 0, 2, false }
 
@@ -2826,6 +2827,9 @@  #define expand_unary_optab_fn(FN, STMT,
 #define expand_binary_optab_fn(FN, STMT, OPTAB) \
   expand_direct_optab_fn (FN, STMT, OPTAB, 2)
 
+#define expand_cond_unary_optab_fn(FN, STMT, OPTAB) \
+  expand_direct_optab_fn (FN, STMT, OPTAB, 2)
+
 #define expand_cond_binary_optab_fn(FN, STMT, OPTAB) \
   expand_direct_optab_fn (FN, STMT, OPTAB, 3)
 
@@ -2902,6 +2906,7 @@  multi_vector_optab_supported_p (convert_
 
 #define direct_unary_optab_supported_p direct_optab_supported_p
 #define direct_binary_optab_supported_p direct_optab_supported_p
+#define direct_cond_unary_optab_supported_p direct_optab_supported_p
 #define direct_cond_binary_optab_supported_p direct_optab_supported_p
 #define direct_mask_load_optab_supported_p direct_optab_supported_p
 #define direct_load_lanes_optab_supported_p multi_vector_optab_supported_p
Index: gcc/tree-vect-loop.c
===================================================================
--- gcc/tree-vect-loop.c	2017-11-17 15:23:55.844062247 +0000
+++ gcc/tree-vect-loop.c	2017-11-17 15:23:56.036742308 +0000
@@ -7643,16 +7643,43 @@  vectorizable_live_operation (gimple *stm
 
   if (!vec_stmt)
     {
+      /* No transformation required.  */
       if (LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo))
 	{
-	  if (dump_enabled_p ())
-	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-			     "can't use a fully-masked loop because "
-			     "a value is live outside the loop.\n");
-	  LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo) = false;
+	  if (!direct_internal_fn_supported_p (IFN_EXTRACT_LAST, vectype,
+					       OPTIMIZE_FOR_SPEED))
+	    {
+	      if (dump_enabled_p ())
+		dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+				 "can't use a fully-masked loop because "
+				 "the target doesn't support extract last "
+				 "reduction.\n");
+	      LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo) = false;
+	    }
+	  else if (slp_node)
+	    {
+	      if (dump_enabled_p ())
+		dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+				 "can't use a fully-masked loop because an "
+				 "SLP statement is live after the loop.\n");
+	      LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo) = false;
+	    }
+	  else if (ncopies > 1)
+	    {
+	      if (dump_enabled_p ())
+		dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+				 "can't use a fully-masked loop because"
+				 " ncopies is greater than 1.\n");
+	      LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo) = false;
+	    }
+	  else
+	    {
+	      gcc_assert (ncopies == 1 && !slp_node);
+	      vect_record_loop_mask (loop_vinfo,
+				     &LOOP_VINFO_MASKS (loop_vinfo),
+				     1, vectype);
+	    }
 	}
-
-      /* No transformation required.  */
       return true;
     }
 
@@ -7686,6 +7713,8 @@  vectorizable_live_operation (gimple *stm
     {
       enum vect_def_type dt = STMT_VINFO_DEF_TYPE (stmt_info);
       vec_lhs = vect_get_vec_def_for_operand_1 (stmt, dt);
+      gcc_checking_assert (ncopies == 1
+			   || !LOOP_VINFO_FULLY_MASKED_P (loop_vinfo));
 
       /* For multiple copies, get the last copy.  */
       for (int i = 1; i < ncopies; ++i)
@@ -7696,15 +7725,39 @@  vectorizable_live_operation (gimple *stm
       bitstart = int_const_binop (MINUS_EXPR, vec_bitsize, bitsize);
     }
 
-  /* Create a new vectorized stmt for the uses of STMT and insert outside the
-     loop.  */
   gimple_seq stmts = NULL;
-  tree bftype = TREE_TYPE (vectype);
-  if (VECTOR_BOOLEAN_TYPE_P (vectype))
-    bftype = build_nonstandard_integer_type (tree_to_uhwi (bitsize), 1);
-  tree new_tree = build3 (BIT_FIELD_REF, bftype, vec_lhs, bitsize, bitstart);
-  new_tree = force_gimple_operand (fold_convert (lhs_type, new_tree), &stmts,
-				   true, NULL_TREE);
+  tree new_tree;
+  if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+    {
+      /* Emit:
+
+	   SCALAR_RES = EXTRACT_LAST <MASK, VEC_LHS>
+
+	 where VEC_LHS is the vectorized live-out result and MASK is
+	 the loop mask for the final iteration.  */
+      gcc_assert (ncopies == 1 && !slp_node);
+      tree scalar_type = TREE_TYPE (STMT_VINFO_VECTYPE (stmt_info));
+      tree scalar_res = make_ssa_name (scalar_type);
+      tree mask = vect_get_loop_mask (gsi, &LOOP_VINFO_MASKS (loop_vinfo),
+				      1, vectype, 0);
+      gcall *new_stmt = gimple_build_call_internal (IFN_EXTRACT_LAST,
+						    2, mask, vec_lhs);
+      gimple_call_set_lhs (new_stmt, scalar_res);
+      gimple_seq_add_stmt (&stmts, new_stmt);
+
+      /* Convert the extracted vector element to the required scalar type.  */
+      new_tree = gimple_convert (&stmts, lhs_type, scalar_res);
+    }
+  else
+    {
+      tree bftype = TREE_TYPE (vectype);
+      if (VECTOR_BOOLEAN_TYPE_P (vectype))
+	bftype = build_nonstandard_integer_type (tree_to_uhwi (bitsize), 1);
+      new_tree = build3 (BIT_FIELD_REF, bftype, vec_lhs, bitsize, bitstart);
+      new_tree = force_gimple_operand (fold_convert (lhs_type, new_tree),
+				       &stmts, true, NULL_TREE);
+    }
+
   if (stmts)
     gsi_insert_seq_on_edge_immediate (single_exit (loop), stmts);
 
Index: gcc/config/aarch64/aarch64-sve.md
===================================================================
--- gcc/config/aarch64/aarch64-sve.md	2017-11-17 15:23:55.844062247 +0000
+++ gcc/config/aarch64/aarch64-sve.md	2017-11-17 15:23:56.034915957 +0000
@@ -345,8 +345,7 @@  (define_expand "vec_extract<mode><Vel>"
 	/* The last element can be extracted with a LASTB and a false
 	   predicate.  */
 	rtx sel = force_reg (<VPRED>mode, CONST0_RTX (<VPRED>mode));
-	emit_insn (gen_aarch64_sve_lastb<mode> (operands[0], sel,
-						operands[1]));
+	emit_insn (gen_extract_last_<mode> (operands[0], sel, operands[1]));
 	DONE;
       }
     if (!CONST_INT_P (operands[2]))
@@ -365,8 +364,7 @@  (define_expand "vec_extract<mode><Vel>"
 	emit_insn (gen_vec_cmp<v_int_equiv><vpred> (sel, cmp, series, zero));
 
 	/* Select the element using LASTB.  */
-	emit_insn (gen_aarch64_sve_lastb<mode> (operands[0], sel,
-						operands[1]));
+	emit_insn (gen_extract_last_<mode> (operands[0], sel, operands[1]));
 	DONE;
       }
   }
@@ -431,7 +429,7 @@  (define_insn "*vec_extract<mode><Vel>_ex
 
 ;; Extract the last active element of operand 1 into operand 0.
 ;; If no elements are active, extract the last inactive element instead.
-(define_insn "aarch64_sve_lastb<mode>"
+(define_insn "extract_last_<mode>"
   [(set (match_operand:<VEL> 0 "register_operand" "=r, w")
 	(unspec:<VEL>
 	  [(match_operand:<VPRED> 1 "register_operand" "Upl, Upl")
Index: gcc/testsuite/gcc.target/aarch64/sve_live_1.c
===================================================================
--- /dev/null	2017-11-14 14:28:07.424493901 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_live_1.c	2017-11-17 15:23:56.035829132 +0000
@@ -0,0 +1,41 @@ 
+/* { dg-do assemble } */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve --save-temps" } */
+
+#include <stdint.h>
+
+#define EXTRACT_LAST(TYPE)			\
+  TYPE __attribute__ ((noinline, noclone))	\
+  test_##TYPE (TYPE *x, int n, TYPE value)	\
+  {						\
+    TYPE last;					\
+    for (int j = 0; j < n; ++j)			\
+      {						\
+	last = x[j];				\
+	x[j] = last * value;			\
+      }						\
+    return last;				\
+  }
+
+#define TEST_ALL(T)				\
+  T (uint8_t)					\
+  T (uint16_t)					\
+  T (uint32_t)					\
+  T (uint64_t)					\
+  T (_Float16)					\
+  T (float)					\
+  T (double)
+
+TEST_ALL (EXTRACT_LAST)
+
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7].b, } 2 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7].h, } 4 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7].s, } 4 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7].d, } 4 } } */
+
+/* { dg-final { scan-assembler-times {\tlastb\tw[0-9]+, p[0-7], z[0-9]+\.b\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tlastb\tw[0-9]+, p[0-7], z[0-9]+\.h\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tlastb\tw[0-9]+, p[0-7], z[0-9]+\.s\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tlastb\tx[0-9]+, p[0-7], z[0-9]+\.d\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tlastb\th[0-9]+, p[0-7], z[0-9]+\.h\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tlastb\ts[0-9]+, p[0-7], z[0-9]+\.s\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tlastb\td[0-9]+, p[0-7], z[0-9]+\.d\n} 1 } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_live_1_run.c
===================================================================
--- /dev/null	2017-11-14 14:28:07.424493901 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_live_1_run.c	2017-11-17 15:23:56.035829132 +0000
@@ -0,0 +1,35 @@ 
+/* { dg-do run { target { aarch64_sve_hw } } } */
+/* { dg-options "-O2 -ftree-vectorize -fno-inline -march=armv8-a+sve" } */
+
+#include "sve_live_1.c"
+
+#define N 107
+#define OP 70
+
+#define TEST_LOOP(TYPE)				\
+  {						\
+    TYPE a[N];					\
+    for (int i = 0; i < N; ++i)			\
+      {						\
+	a[i] = i * 2 + (i % 3);			\
+	asm volatile ("" ::: "memory");		\
+      }						\
+    TYPE expected = a[N - 1];			\
+    TYPE res = test_##TYPE (a, N, OP);		\
+    if (res != expected)			\
+      __builtin_abort ();			\
+    for (int i = 0; i < N; ++i)			\
+      {						\
+	TYPE old = i * 2 + (i % 3);		\
+	if (a[i] != (TYPE) (old * (TYPE) OP))	\
+	  __builtin_abort ();			\
+	asm volatile ("" ::: "memory");		\
+      }						\
+  }
+
+int __attribute__ ((optimize (1)))
+main (void)
+{
+  TEST_ALL (TEST_LOOP);
+  return 0;
+}