diff mbox

[ARM] implement division using vrecpe/vrecps with -funsafe-math-optimizations

Message ID CAAgBjMk0Hdask2JU8xs4fj_Ai1e0ggxB+h3ayb=NOGQBYJ8ccQ@mail.gmail.com
State Superseded
Headers show

Commit Message

Prathamesh Kulkarni July 29, 2015, 10:09 a.m. UTC
Hi,
This patch tries to implement division with multiplication by
reciprocal using vrecpe/vrecps
with -funsafe-math-optimizations and -freciprocal-math enabled.
Tested on arm-none-linux-gnueabihf using qemu.
OK for trunk ?

Thank you,
Prathamesh
2015-07-28  Prathamesh Kulkarni  <prathamesh.kulkarni@linaro.org>
	    Charles Baylis  <charles.baylis@linaro.org>

	* config/arm/neon.md (div<mode>3): New pattern.

testsuite/
	* gcc.target/arm/vect-div-1.c: New test-case.
	* gcc.target/arm/vect-div-2.c: Likewise.

Comments

Charles Baylis July 31, 2015, 12:23 p.m. UTC | #1
On 31 July 2015 at 10:34, Ramana Radhakrishnan
<ramana.radhakrishnan@foss.arm.com> wrote:
> I've tried this in the past and never been convinced that 2 iterations are enough to get to stability with this given that the results are only precise for 8 bits / iteration. Thus I've always believed you need 3 iterations rather than 2 at which point I've never been sure that it's worth it. So the testing that you've done with this currently is not enough for this to go into the tree.

My understanding is that 2 iterations is sufficient for single
precision floating point (although not for double precision), because
each iteration of Newton-Raphson doubles the number of bits of
accuracy.

I haven't worked through the maths myself, but
    https://en.wikipedia.org/wiki/Division_algorithm#Newton.E2.80.93Raphson_division
says
    "This squaring of the error at each iteration step — the so-called
    quadratic convergence of Newton–Raphson's method — has the
    effect that the number of correct digits in the result roughly
    doubles for every iteration, a property that becomes extremely
    valuable when the numbers involved have many digits"

Therefore:
vrecpe -> 8 bits of accuracy
+1 iteration -> 16 bits of accuracy
+2 iterations -> 32 bits of accuracy (but in reality limited to
precision of 32bit float)

Since 32 bits is much more accuracy than the 24 bits of precision in a
single precision FP value, 2 iterations should be sufficient.

> I'd like this to be tested on a couple of different AArch32 implementations with a wider range of inputs to verify that the results are acceptable as well as running something like SPEC2k(6) with atleast one iteration to ensure correctness.

I can't argue with confirming theory matches practice :)

Some corner cases (eg numbers around FLT_MAX, FLT_MIN etc) may result
in denormals or out of range values during the reciprocal calculation
which could result in answers which are less accurate than the typical
case but I think that is acceptable with -ffast-math.

Charles
Prathamesh Kulkarni Jan. 17, 2016, 9:06 a.m. UTC | #2
On 31 July 2015 at 15:04, Ramana Radhakrishnan
<ramana.radhakrishnan@foss.arm.com> wrote:
>

>

> On 29/07/15 11:09, Prathamesh Kulkarni wrote:

>> Hi,

>> This patch tries to implement division with multiplication by

>> reciprocal using vrecpe/vrecps

>> with -funsafe-math-optimizations and -freciprocal-math enabled.

>> Tested on arm-none-linux-gnueabihf using qemu.

>> OK for trunk ?

>>

>> Thank you,

>> Prathamesh

>>

>

> I've tried this in the past and never been convinced that 2 iterations are enough to get to stability with this given that the results are only precise for 8 bits / iteration. Thus I've always believed you need 3 iterations rather than 2 at which point I've never been sure that it's worth it. So the testing that you've done with this currently is not enough for this to go into the tree.

>

> I'd like this to be tested on a couple of different AArch32 implementations with a wider range of inputs to verify that the results are acceptable as well as running something like SPEC2k(6) with atleast one iteration to ensure correctness.

Hi,
I got results of SPEC2k6 fp benchmarks:
a15: +0.64% overall, 481.wrf: +6.46%
a53: +0.21% overall, 416.gamess: -1.39%, 481.wrf: +6.76%
a57: +0.35% overall, 481.wrf: +3.84%
The other benchmarks had (almost) identical results.

Thanks,
Prathamesh
>

>

> moving on to the patches.

>

>> diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md

>> index 654d9d5..28c2e2a 100644

>> --- a/gcc/config/arm/neon.md

>> +++ b/gcc/config/arm/neon.md

>> @@ -548,6 +548,32 @@

>>                      (const_string "neon_mul_<V_elem_ch><q>")))]

>>  )

>>

>

> Please add a comment here.

>

>> +(define_expand "div<mode>3"

>> +  [(set (match_operand:VCVTF 0 "s_register_operand" "=w")

>> +        (div:VCVTF (match_operand:VCVTF 1 "s_register_operand" "w")

>> +               (match_operand:VCVTF 2 "s_register_operand" "w")))]

>

> I want to double check that this doesn't collide with Alan's patches for FP16 especially if he reuses the VCVTF iterator for all the vcvt f16 cases.

>

>> +  "TARGET_NEON && flag_unsafe_math_optimizations && flag_reciprocal_math"

>> +  {

>> +    rtx rec = gen_reg_rtx (<MODE>mode);

>> +    rtx vrecps_temp = gen_reg_rtx (<MODE>mode);

>> +

>> +    /* Reciprocal estimate */

>> +    emit_insn (gen_neon_vrecpe<mode> (rec, operands[2]));

>> +

>> +    /* Perform 2 iterations of Newton-Raphson method for better accuracy */

>> +    for (int i = 0; i < 2; i++)

>> +      {

>> +     emit_insn (gen_neon_vrecps<mode> (vrecps_temp, rec, operands[2]));

>> +     emit_insn (gen_mul<mode>3 (rec, rec, vrecps_temp));

>> +      }

>> +

>> +    /* We now have reciprocal in rec, perform operands[0] = operands[1] * rec */

>> +    emit_insn (gen_mul<mode>3 (operands[0], operands[1], rec));

>> +    DONE;

>> +  }

>> +)

>> +

>> +

>>  (define_insn "mul<mode>3add<mode>_neon"

>>    [(set (match_operand:VDQW 0 "s_register_operand" "=w")

>>          (plus:VDQW (mult:VDQW (match_operand:VDQW 2 "s_register_operand" "w")

>> diff --git a/gcc/testsuite/gcc.target/arm/vect-div-1.c b/gcc/testsuite/gcc.target/arm/vect-div-1.c

>> new file mode 100644

>> index 0000000..e562ef3

>> --- /dev/null

>> +++ b/gcc/testsuite/gcc.target/arm/vect-div-1.c

>> @@ -0,0 +1,14 @@

>> +/* { dg-do compile } */

>> +/* { dg-require-effective-target arm_v8_neon_ok } */

>> +/* { dg-options "-O2 -funsafe-math-optimizations -ftree-vectorize -fdump-tree-vect-all" } */

>> +/* { dg-add-options arm_v8_neon } */

>

> No this is wrong.

>

> What is armv8 specific about this test ? This is just like another test that is for Neon. vrecpe / vrecps are not instructions that were introduced in the v8 version of the architecture. They've existed in the base Neon instruction set. The code generation above in the patterns will be enabled when TARGET_NEON is true which can happen when -mfpu=neon -mfloat-abi={softfp/hard} is true.

>

>> +

>> +void

>> +foo (int len, float * __restrict p, float *__restrict x)

>> +{

>> +  len = len & ~31;

>> +  for (int i = 0; i < len; i++)

>> +    p[i] = p[i] / x[i];

>> +}

>> +

>> +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */

>> diff --git a/gcc/testsuite/gcc.target/arm/vect-div-2.c b/gcc/testsuite/gcc.target/arm/vect-div-2.c

>> new file mode 100644

>> index 0000000..8e15d0a

>> --- /dev/null

>> +++ b/gcc/testsuite/gcc.target/arm/vect-div-2.c

>> @@ -0,0 +1,14 @@

>> +/* { dg-do compile } */

>> +/* { dg-require-effective-target arm_v8_neon_ok } */

>

> And likewise.

>

>> +/* { dg-options "-O2 -funsafe-math-optimizations -fno-reciprocal-math -ftree-vectorize -fdump-tree-vect-all" } */

>> +/* { dg-add-options arm_v8_neon } */

>> +

>> +void

>> +foo (int len, float * __restrict p, float *__restrict x)

>> +{

>> +  len = len & ~31;

>> +  for (int i = 0; i < len; i++)

>> +    p[i] = p[i] / x[i];

>> +}

>> +

>> +/* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" } } */

>

>

> regards

> Ramana
diff mbox

Patch

diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md
index 654d9d5..28c2e2a 100644
--- a/gcc/config/arm/neon.md
+++ b/gcc/config/arm/neon.md
@@ -548,6 +548,32 @@ 
                     (const_string "neon_mul_<V_elem_ch><q>")))]
 )
 
+(define_expand "div<mode>3"
+  [(set (match_operand:VCVTF 0 "s_register_operand" "=w")
+        (div:VCVTF (match_operand:VCVTF 1 "s_register_operand" "w")
+		  (match_operand:VCVTF 2 "s_register_operand" "w")))]
+  "TARGET_NEON && flag_unsafe_math_optimizations && flag_reciprocal_math"
+  {
+    rtx rec = gen_reg_rtx (<MODE>mode);
+    rtx vrecps_temp = gen_reg_rtx (<MODE>mode);
+
+    /* Reciprocal estimate */
+    emit_insn (gen_neon_vrecpe<mode> (rec, operands[2]));
+
+    /* Perform 2 iterations of Newton-Raphson method for better accuracy */
+    for (int i = 0; i < 2; i++)
+      {
+	emit_insn (gen_neon_vrecps<mode> (vrecps_temp, rec, operands[2]));
+	emit_insn (gen_mul<mode>3 (rec, rec, vrecps_temp));
+      }
+
+    /* We now have reciprocal in rec, perform operands[0] = operands[1] * rec */
+    emit_insn (gen_mul<mode>3 (operands[0], operands[1], rec));
+    DONE;
+  }
+)
+
+
 (define_insn "mul<mode>3add<mode>_neon"
   [(set (match_operand:VDQW 0 "s_register_operand" "=w")
         (plus:VDQW (mult:VDQW (match_operand:VDQW 2 "s_register_operand" "w")
diff --git a/gcc/testsuite/gcc.target/arm/vect-div-1.c b/gcc/testsuite/gcc.target/arm/vect-div-1.c
new file mode 100644
index 0000000..e562ef3
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/vect-div-1.c
@@ -0,0 +1,14 @@ 
+/* { dg-do compile } */
+/* { dg-require-effective-target arm_v8_neon_ok } */
+/* { dg-options "-O2 -funsafe-math-optimizations -ftree-vectorize -fdump-tree-vect-all" } */
+/* { dg-add-options arm_v8_neon } */
+
+void
+foo (int len, float * __restrict p, float *__restrict x)
+{
+  len = len & ~31;
+  for (int i = 0; i < len; i++)
+    p[i] = p[i] / x[i];
+}
+
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
diff --git a/gcc/testsuite/gcc.target/arm/vect-div-2.c b/gcc/testsuite/gcc.target/arm/vect-div-2.c
new file mode 100644
index 0000000..8e15d0a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/vect-div-2.c
@@ -0,0 +1,14 @@ 
+/* { dg-do compile } */
+/* { dg-require-effective-target arm_v8_neon_ok } */
+/* { dg-options "-O2 -funsafe-math-optimizations -fno-reciprocal-math -ftree-vectorize -fdump-tree-vect-all" } */
+/* { dg-add-options arm_v8_neon } */
+
+void
+foo (int len, float * __restrict p, float *__restrict x)
+{
+  len = len & ~31;
+  for (int i = 0; i < len; i++)
+    p[i] = p[i] / x[i];
+}
+
+/* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" } } */