[ARM] Implement division using vrecpe, vrecps

Message ID	CAAgBjMkZheX1ZczGwg9dZy21YfdNg64n4uvXA=PQZrdV_MxDNw@mail.gmail.com
State	New
Headers	show Delivered-To: patch@linaro.org Received-SPF: pass (google.com: domain of gcc-patches-return-488347-patch=linaro.org@gcc.gnu.org designates 209.132.180.131 as permitted sender) client-ip=209.132.180.131; DomainKey-Signature: a=rsa-sha1; c=nofws; d=gcc.gnu.org; h=list-id :list-unsubscribe:list-archive:list-post:list-help:sender :mime-version:from:date:message-id:subject:to:content-type; q= dns; s=default; b=wVRrRb7K+bCz8SMfn+/MH7UnnkpYygAa3D7W6pko9400pd Is/xVlP0KfCeQ40k8BY+UbenOe7oFQf+jkcunugpSd4jsQ//354AoGeZjPlvgKSN R9+AJGPXudqKMO2lokF1HJsoMDlwCGhOt3nEyk1NFE+dRvUFeYV45kFWhfr6o= Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk Sender: gcc-patches-owner@gcc.gnu.org MIME-Version: 1.0 From: Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> Date: Fri, 26 Oct 2018 10:34:02 +0530 Message-ID: <CAAgBjMkZheX1ZczGwg9dZy21YfdNg64n4uvXA=PQZrdV_MxDNw@mail.gmail.com> Subject: [ARM] Implement division using vrecpe, vrecps To: gcc Patches <gcc-patches@gcc.gnu.org>, Kyrill Tkachov <kyrylo.tkachov@arm.com>, Ramana Radhakrishnan <ramana.radhakrishnan@arm.com> Content-Type: multipart/mixed; boundary="00000000000090754a05791aa608"
Series	[ARM] Implement division using vrecpe, vrecps \| expand [ARM] Implement division using vrecpe, vrecps

Message ID

CAAgBjMkZheX1ZczGwg9dZy21YfdNg64n4uvXA=PQZrdV_MxDNw@mail.gmail.com

State

New

Headers

Received-SPF: pass (google.com: domain of
	gcc-patches-return-488347-patch=linaro.org@gcc.gnu.org
	designates 209.132.180.131 as permitted sender)
	client-ip=209.132.180.131; 
DomainKey-Signature: a=rsa-sha1; c=nofws; d=gcc.gnu.org; h=list-id
	:list-unsubscribe:list-archive:list-post:list-help:sender
	:mime-version:from:date:message-id:subject:to:content-type; q=
	dns; s=default; b=wVRrRb7K+bCz8SMfn+/MH7UnnkpYygAa3D7W6pko9400pd
	Is/xVlP0KfCeQ40k8BY+UbenOe7oFQf+jkcunugpSd4jsQ//354AoGeZjPlvgKSN
	R9+AJGPXudqKMO2lokF1HJsoMDlwCGhOt3nEyk1NFE+dRvUFeYV45kFWhfr6o=
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
Sender: gcc-patches-owner@gcc.gnu.org
MIME-Version: 1.0
From: Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org>
Date: Fri, 26 Oct 2018 10:34:02 +0530
Message-ID: <CAAgBjMkZheX1ZczGwg9dZy21YfdNg64n4uvXA=PQZrdV_MxDNw@mail.gmail.com>
Subject: [ARM] Implement division using vrecpe, vrecps
To: gcc Patches <gcc-patches@gcc.gnu.org>,
	Kyrill Tkachov <kyrylo.tkachov@arm.com>,
	Ramana Radhakrishnan <ramana.radhakrishnan@arm.com>
Content-Type: multipart/mixed; boundary="00000000000090754a05791aa608"

Series

[ARM] Implement division using vrecpe, vrecps | expand

Commit Message

Prathamesh Kulkarni Oct. 26, 2018, 5:04 a.m. UTC

Hi,
This is a rebased version of patch that adds a pattern to neon.md for
implementing division with multiplication by reciprocal using
vrecpe/vrecps with -funsafe-math-optimizations excluding -Os.
The newly added test-cases are not vectorized on armeb target with
-O2. I posted the analysis for that here:
https://gcc.gnu.org/ml/gcc-patches/2016-05/msg01765.html

Briefly, the difference between little and big-endian vectorizer is in
arm_builtin_support_vector_misalignment() which calls
default_builtin_support_vector_misalignment() for big-endian case, and
that returns false because
movmisalign_optab does not exist for V2SF mode. This isn't observed
with -O3 because loop peeling for alignment gets enabled.

It seems that the test cases in patch appear unsupported on armeb,
after r221677 thus this patch requires no changes to
target-supports.exp to adjust for armeb (unlike last time which
stalled the patch).

Bootstrap+tested on arm-linux-gnueabihf.
Cross-tested on arm*-*-* variants.
OK for trunk ?

Thanks,
Prathamesh
2018-10-26  Prathamesh Kulkarni  <prathamesh.kulkarni@linaro.org>

	* config/arm/neon.md (div<mode>3): New pattern.

testsuite/
	* gcc.target/arm/neon-vect-div-1.c: New test.
	* gcc.target/arm/neon-vect-div-2.c: Likewise.

Comments

Prathamesh Kulkarni Nov. 2, 2018, 9:39 a.m. UTC | #1

On Fri, 26 Oct 2018 at 10:34, Prathamesh Kulkarni
<prathamesh.kulkarni@linaro.org> wrote:
>

> Hi,

> This is a rebased version of patch that adds a pattern to neon.md for

> implementing division with multiplication by reciprocal using

> vrecpe/vrecps with -funsafe-math-optimizations excluding -Os.

> The newly added test-cases are not vectorized on armeb target with

> -O2. I posted the analysis for that here:

> https://gcc.gnu.org/ml/gcc-patches/2016-05/msg01765.html

>

> Briefly, the difference between little and big-endian vectorizer is in

> arm_builtin_support_vector_misalignment() which calls

> default_builtin_support_vector_misalignment() for big-endian case, and

> that returns false because

> movmisalign_optab does not exist for V2SF mode. This isn't observed

> with -O3 because loop peeling for alignment gets enabled.

>

> It seems that the test cases in patch appear unsupported on armeb,

> after r221677 thus this patch requires no changes to

> target-supports.exp to adjust for armeb (unlike last time which

> stalled the patch).

>

> Bootstrap+tested on arm-linux-gnueabihf.

> Cross-tested on arm*-*-* variants.

> OK for trunk ?

ping: https://gcc.gnu.org/ml/gcc-patches/2018-10/msg01645.html

Thanks,
Prathamesh
>

> Thanks,

> Prathamesh

Ramana Radhakrishnan Nov. 5, 2018, 1:52 p.m. UTC | #2

On 26/10/2018 06:04, Prathamesh Kulkarni wrote:
> Hi,

> This is a rebased version of patch that adds a pattern to neon.md for

> implementing division with multiplication by reciprocal using

> vrecpe/vrecps with -funsafe-math-optimizations excluding -Os.

> The newly added test-cases are not vectorized on armeb target with

> -O2. I posted the analysis for that here:

> https://gcc.gnu.org/ml/gcc-patches/2016-05/msg01765.html

> 

> Briefly, the difference between little and big-endian vectorizer is in

> arm_builtin_support_vector_misalignment() which calls

> default_builtin_support_vector_misalignment() for big-endian case, and

> that returns false because

> movmisalign_optab does not exist for V2SF mode. This isn't observed

> with -O3 because loop peeling for alignment gets enabled.

> 

> It seems that the test cases in patch appear unsupported on armeb,

> after r221677 thus this patch requires no changes to

> target-supports.exp to adjust for armeb (unlike last time which

> stalled the patch).

> 

> Bootstrap+tested on arm-linux-gnueabihf.

> Cross-tested on arm*-*-* variants.

> OK for trunk ?

> 

> Thanks,

> Prathamesh

> 

> 

> tcwg-319-3.txt

> 

> 2018-10-26  Prathamesh Kulkarni<prathamesh.kulkarni@linaro.org>

> 

> 	* config/arm/neon.md (div<mode>3): New pattern.

> 

> testsuite/

> 	* gcc.target/arm/neon-vect-div-1.c: New test.

> 	* gcc.target/arm/neon-vect-div-2.c: Likewise.

> 

> diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md

> index 5aeee4b08c1..25ed45d381a 100644

> --- a/gcc/config/arm/neon.md

> +++ b/gcc/config/arm/neon.md

> @@ -620,6 +620,38 @@

>                       (const_string "neon_mul_<V_elem_ch><q>")))]

>   )

>   

> +/* Perform division using multiply-by-reciprocal.

> +   Reciprocal is calculated using Newton-Raphson method.

> +   Enabled with -funsafe-math-optimizations -freciprocal-math

> +   and disabled for -Os since it increases code size .  */ > +

> +(define_expand "div<mode>3"

> +  [(set (match_operand:VCVTF 0 "s_register_operand" "=w")

> +        (div:VCVTF (match_operand:VCVTF 1 "s_register_operand" "w")

> +		  (match_operand:VCVTF 2 "s_register_operand" "w")))]

> +  "TARGET_NEON && !optimize_size

> +   && flag_unsafe_math_optimizations && flag_reciprocal_math"


I would prefer this to be more granular than 
flag_unsafe_math_optimization && flag_reciprocal_math which really is 
flag_reciprocal_math as it is turned on by default with 
funsafe-math-optimizations.

I think this should really be just flag_reciprocal_math.


Otherwise ok.

regards
Ramana




> +  {

> +    rtx rec = gen_reg_rtx (<MODE>mode);

> +    rtx vrecps_temp = gen_reg_rtx (<MODE>mode);

> +

> +    /* Reciprocal estimate.  */

> +    emit_insn (gen_neon_vrecpe<mode> (rec, operands[2]));

> +

> +    /* Perform 2 iterations of newton-raphson method.  */

> +    for (int i = 0; i < 2; i++)

> +      {

> +	emit_insn (gen_neon_vrecps<mode> (vrecps_temp, rec, operands[2]));

> +	emit_insn (gen_mul<mode>3 (rec, rec, vrecps_temp));

> +      }

> +

> +    /* We now have reciprocal in rec, perform operands[0] = operands[1] * rec.  */

> +    emit_insn (gen_mul<mode>3 (operands[0], operands[1], rec));

> +    DONE;

> +  }

> +)

> +

> +

>   (define_insn "mul<mode>3add<mode>_neon"

>     [(set (match_operand:VDQW 0 "s_register_operand" "=w")

>           (plus:VDQW (mult:VDQW (match_operand:VDQW 2 "s_register_operand" "w")

> diff --git a/gcc/testsuite/gcc.target/arm/neon-vect-div-1.c b/gcc/testsuite/gcc.target/arm/neon-vect-div-1.c

> new file mode 100644

> index 00000000000..50d04b4175b

> --- /dev/null

> +++ b/gcc/testsuite/gcc.target/arm/neon-vect-div-1.c

> @@ -0,0 +1,16 @@

> +/* Test pattern div<mode>3.  */

> +/* { dg-do compile } */

> +/* { dg-require-effective-target arm_neon_ok } */

> +/* { dg-require-effective-target vect_hw_misalign } */

> +/* { dg-options "-O2 -ftree-vectorize -funsafe-math-optimizations -fdump-tree-vect-details" } */

> +/* { dg-add-options arm_neon } */

> +

> +void

> +foo (int len, float * __restrict p, float *__restrict x)

> +{

> +  len = len & ~31;

> +  for (int i = 0; i < len; i++)

> +    p[i] = p[i] / x[i];

> +}

> +

> +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */

> diff --git a/gcc/testsuite/gcc.target/arm/neon-vect-div-2.c b/gcc/testsuite/gcc.target/arm/neon-vect-div-2.c

> new file mode 100644

> index 00000000000..606f54b4e0e

> --- /dev/null

> +++ b/gcc/testsuite/gcc.target/arm/neon-vect-div-2.c

> @@ -0,0 +1,16 @@

> +/* Test pattern div<mode>3.  */

> +/* { dg-do compile } */

> +/* { dg-require-effective-target arm_neon_ok } */

> +/* { dg-require-effective-target vect_hw_misalign } */

> +/* { dg-options "-O3 -ftree-vectorize -funsafe-math-optimizations -fdump-tree-vect-details -fno-reciprocal-math" } */

> +/* { dg-add-options arm_neon } */

> +

> +void

> +foo (int len, float * __restrict p, float *__restrict x)

> +{

> +  len = len & ~31;

> +  for (int i = 0; i < len; i++)

> +    p[i] = p[i] / x[i];

> +}

> +

> +/* { dg-final { scan-tree-dump-not "vectorized 1 loops" "vect" } } */

>

Prathamesh Kulkarni Nov. 9, 2018, 6:43 a.m. UTC | #3

On Mon, 5 Nov 2018 at 19:22, Ramana Radhakrishnan
<Ramana.Radhakrishnan@arm.com> wrote:
>

> On 26/10/2018 06:04, Prathamesh Kulkarni wrote:

> > Hi,

> > This is a rebased version of patch that adds a pattern to neon.md for

> > implementing division with multiplication by reciprocal using

> > vrecpe/vrecps with -funsafe-math-optimizations excluding -Os.

> > The newly added test-cases are not vectorized on armeb target with

> > -O2. I posted the analysis for that here:

> > https://gcc.gnu.org/ml/gcc-patches/2016-05/msg01765.html

> >

> > Briefly, the difference between little and big-endian vectorizer is in

> > arm_builtin_support_vector_misalignment() which calls

> > default_builtin_support_vector_misalignment() for big-endian case, and

> > that returns false because

> > movmisalign_optab does not exist for V2SF mode. This isn't observed

> > with -O3 because loop peeling for alignment gets enabled.

> >

> > It seems that the test cases in patch appear unsupported on armeb,

> > after r221677 thus this patch requires no changes to

> > target-supports.exp to adjust for armeb (unlike last time which

> > stalled the patch).

> >

> > Bootstrap+tested on arm-linux-gnueabihf.

> > Cross-tested on arm*-*-* variants.

> > OK for trunk ?

> >

> > Thanks,

> > Prathamesh

> >

> >

> > tcwg-319-3.txt

> >

> > 2018-10-26  Prathamesh Kulkarni<prathamesh.kulkarni@linaro.org>

> >

> >       * config/arm/neon.md (div<mode>3): New pattern.

> >

> > testsuite/

> >       * gcc.target/arm/neon-vect-div-1.c: New test.

> >       * gcc.target/arm/neon-vect-div-2.c: Likewise.

> >

> > diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md

> > index 5aeee4b08c1..25ed45d381a 100644

> > --- a/gcc/config/arm/neon.md

> > +++ b/gcc/config/arm/neon.md

> > @@ -620,6 +620,38 @@

> >                       (const_string "neon_mul_<V_elem_ch><q>")))]

> >   )

> >

> > +/* Perform division using multiply-by-reciprocal.

> > +   Reciprocal is calculated using Newton-Raphson method.

> > +   Enabled with -funsafe-math-optimizations -freciprocal-math

> > +   and disabled for -Os since it increases code size .  */ > +

> > +(define_expand "div<mode>3"

> > +  [(set (match_operand:VCVTF 0 "s_register_operand" "=w")

> > +        (div:VCVTF (match_operand:VCVTF 1 "s_register_operand" "w")

> > +               (match_operand:VCVTF 2 "s_register_operand" "w")))]

> > +  "TARGET_NEON && !optimize_size

> > +   && flag_unsafe_math_optimizations && flag_reciprocal_math"

>

> I would prefer this to be more granular than

> flag_unsafe_math_optimization && flag_reciprocal_math which really is

> flag_reciprocal_math as it is turned on by default with

> funsafe-math-optimizations.

>

> I think this should really be just flag_reciprocal_math.

>

>

> Otherwise ok.

Thanks, committed it in r265948 after removing check for
flag_unsafe_math_optimizations.

Regards,
Prathamesh
>

> regards

> Ramana

>

>

>

>

> > +  {

> > +    rtx rec = gen_reg_rtx (<MODE>mode);

> > +    rtx vrecps_temp = gen_reg_rtx (<MODE>mode);

> > +

> > +    /* Reciprocal estimate.  */

> > +    emit_insn (gen_neon_vrecpe<mode> (rec, operands[2]));

> > +

> > +    /* Perform 2 iterations of newton-raphson method.  */

> > +    for (int i = 0; i < 2; i++)

> > +      {

> > +     emit_insn (gen_neon_vrecps<mode> (vrecps_temp, rec, operands[2]));

> > +     emit_insn (gen_mul<mode>3 (rec, rec, vrecps_temp));

> > +      }

> > +

> > +    /* We now have reciprocal in rec, perform operands[0] = operands[1] * rec.  */

> > +    emit_insn (gen_mul<mode>3 (operands[0], operands[1], rec));

> > +    DONE;

> > +  }

> > +)

> > +

> > +

> >   (define_insn "mul<mode>3add<mode>_neon"

> >     [(set (match_operand:VDQW 0 "s_register_operand" "=w")

> >           (plus:VDQW (mult:VDQW (match_operand:VDQW 2 "s_register_operand" "w")

> > diff --git a/gcc/testsuite/gcc.target/arm/neon-vect-div-1.c b/gcc/testsuite/gcc.target/arm/neon-vect-div-1.c

> > new file mode 100644

> > index 00000000000..50d04b4175b

> > --- /dev/null

> > +++ b/gcc/testsuite/gcc.target/arm/neon-vect-div-1.c

> > @@ -0,0 +1,16 @@

> > +/* Test pattern div<mode>3.  */

> > +/* { dg-do compile } */

> > +/* { dg-require-effective-target arm_neon_ok } */

> > +/* { dg-require-effective-target vect_hw_misalign } */

> > +/* { dg-options "-O2 -ftree-vectorize -funsafe-math-optimizations -fdump-tree-vect-details" } */

> > +/* { dg-add-options arm_neon } */

> > +

> > +void

> > +foo (int len, float * __restrict p, float *__restrict x)

> > +{

> > +  len = len & ~31;

> > +  for (int i = 0; i < len; i++)

> > +    p[i] = p[i] / x[i];

> > +}

> > +

> > +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */

> > diff --git a/gcc/testsuite/gcc.target/arm/neon-vect-div-2.c b/gcc/testsuite/gcc.target/arm/neon-vect-div-2.c

> > new file mode 100644

> > index 00000000000..606f54b4e0e

> > --- /dev/null

> > +++ b/gcc/testsuite/gcc.target/arm/neon-vect-div-2.c

> > @@ -0,0 +1,16 @@

> > +/* Test pattern div<mode>3.  */

> > +/* { dg-do compile } */

> > +/* { dg-require-effective-target arm_neon_ok } */

> > +/* { dg-require-effective-target vect_hw_misalign } */

> > +/* { dg-options "-O3 -ftree-vectorize -funsafe-math-optimizations -fdump-tree-vect-details -fno-reciprocal-math" } */

> > +/* { dg-add-options arm_neon } */

> > +

> > +void

> > +foo (int len, float * __restrict p, float *__restrict x)

> > +{

> > +  len = len & ~31;

> > +  for (int i = 0; i < len; i++)

> > +    p[i] = p[i] / x[i];

> > +}

> > +

> > +/* { dg-final { scan-tree-dump-not "vectorized 1 loops" "vect" } } */

> >

>

diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md
index 5aeee4b08c1..25ed45d381a 100644
--- a/gcc/config/arm/neon.md
+++ b/gcc/config/arm/neon.md
@@ -620,6 +620,38 @@ 
                     (const_string "neon_mul_<V_elem_ch><q>")))]
 )
 
+/* Perform division using multiply-by-reciprocal.
+   Reciprocal is calculated using Newton-Raphson method.
+   Enabled with -funsafe-math-optimizations -freciprocal-math
+   and disabled for -Os since it increases code size .  */
+
+(define_expand "div<mode>3"
+  [(set (match_operand:VCVTF 0 "s_register_operand" "=w")
+        (div:VCVTF (match_operand:VCVTF 1 "s_register_operand" "w")
+		  (match_operand:VCVTF 2 "s_register_operand" "w")))]
+  "TARGET_NEON && !optimize_size
+   && flag_unsafe_math_optimizations && flag_reciprocal_math"
+  {
+    rtx rec = gen_reg_rtx (<MODE>mode);
+    rtx vrecps_temp = gen_reg_rtx (<MODE>mode);
+
+    /* Reciprocal estimate.  */
+    emit_insn (gen_neon_vrecpe<mode> (rec, operands[2]));
+
+    /* Perform 2 iterations of newton-raphson method.  */
+    for (int i = 0; i < 2; i++)
+      {
+	emit_insn (gen_neon_vrecps<mode> (vrecps_temp, rec, operands[2]));
+	emit_insn (gen_mul<mode>3 (rec, rec, vrecps_temp));
+      }
+
+    /* We now have reciprocal in rec, perform operands[0] = operands[1] * rec.  */
+    emit_insn (gen_mul<mode>3 (operands[0], operands[1], rec));
+    DONE;
+  }
+)
+
+
 (define_insn "mul<mode>3add<mode>_neon"
   [(set (match_operand:VDQW 0 "s_register_operand" "=w")
         (plus:VDQW (mult:VDQW (match_operand:VDQW 2 "s_register_operand" "w")
diff --git a/gcc/testsuite/gcc.target/arm/neon-vect-div-1.c b/gcc/testsuite/gcc.target/arm/neon-vect-div-1.c
new file mode 100644
index 00000000000..50d04b4175b
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/neon-vect-div-1.c
@@ -0,0 +1,16 @@ 
+/* Test pattern div<mode>3.  */
+/* { dg-do compile } */
+/* { dg-require-effective-target arm_neon_ok } */
+/* { dg-require-effective-target vect_hw_misalign } */
+/* { dg-options "-O2 -ftree-vectorize -funsafe-math-optimizations -fdump-tree-vect-details" } */
+/* { dg-add-options arm_neon } */
+
+void
+foo (int len, float * __restrict p, float *__restrict x)
+{
+  len = len & ~31;
+  for (int i = 0; i < len; i++)
+    p[i] = p[i] / x[i];
+}
+
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
diff --git a/gcc/testsuite/gcc.target/arm/neon-vect-div-2.c b/gcc/testsuite/gcc.target/arm/neon-vect-div-2.c
new file mode 100644
index 00000000000..606f54b4e0e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/neon-vect-div-2.c
@@ -0,0 +1,16 @@ 
+/* Test pattern div<mode>3.  */
+/* { dg-do compile } */
+/* { dg-require-effective-target arm_neon_ok } */
+/* { dg-require-effective-target vect_hw_misalign } */
+/* { dg-options "-O3 -ftree-vectorize -funsafe-math-optimizations -fdump-tree-vect-details -fno-reciprocal-math" } */
+/* { dg-add-options arm_neon } */
+
+void
+foo (int len, float * __restrict p, float *__restrict x)
+{
+  len = len & ~31;
+  for (int i = 0; i < len; i++)
+    p[i] = p[i] / x[i];
+}
+
+/* { dg-final { scan-tree-dump-not "vectorized 1 loops" "vect" } } */