From patchwork Fri Feb  5 13:10:29 2016
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org>
X-Patchwork-Id: 61270
Delivered-To: patch@linaro.org
Received: by 10.112.43.199 with SMTP id y7csp1081984lbl;
 Fri, 5 Feb 2016 05:10:53 -0800 (PST)
X-Received: by 10.98.87.20 with SMTP id l20mr19813401pfb.70.1454677853290;
 Fri, 05 Feb 2016 05:10:53 -0800 (PST)
Return-Path: <gcc-patches-return-420867-patch=linaro.org@gcc.gnu.org>
Received: from sourceware.org (server1.sourceware.org. [209.132.180.131])
 by mx.google.com with ESMTPS id
 r83si23933222pfb.124.2016.02.05.05.10.52 for <patch@linaro.org>
 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128);
 Fri, 05 Feb 2016 05:10:53 -0800 (PST)
Received-SPF: pass (google.com: domain of
 gcc-patches-return-420867-patch=linaro.org@gcc.gnu.org
 designates 209.132.180.131 as permitted sender)
 client-ip=209.132.180.131; 
Authentication-Results: mx.google.com; spf=pass (google.com: domain of
 gcc-patches-return-420867-patch=linaro.org@gcc.gnu.org
 designates 209.132.180.131 as permitted sender)
 smtp.mailfrom=gcc-patches-return-420867-patch=linaro.org@gcc.gnu.org;
 dkim=pass header.i=@gcc.gnu.org
DomainKey-Signature: a=rsa-sha1; c=nofws; d=gcc.gnu.org; h=list-id
 :list-unsubscribe:list-archive:list-post:list-help:sender
 :mime-version:in-reply-to:references:date:message-id:subject
 :from:to:cc:content-type; q=dns; s=default; b=Dc9VKNBvsfuDY72shK
 KR+az8M8pw2WLDohTkxmc89Cf+758/cAf3+dseKKZDDWfD8W89L6scar2GCvLzUc
 GRvulHmUqyoifxdmMDMxrhVu+pdY8beF6XgLXZ0gM0yb1qA/HehvcAugfe2UKhck
 vTduRBmNkYe/epFoltdgnVjOU=
DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=gcc.gnu.org; h=list-id
 :list-unsubscribe:list-archive:list-post:list-help:sender
 :mime-version:in-reply-to:references:date:message-id:subject
 :from:to:cc:content-type; s=default; bh=R715n0velRuoAClQjQr3QUPE
 yO4=; b=WkeKNrx2VX0Hf5UkEdoxRRXkuSJXO85EDFcDDqrz2ji/+mfxGKwW50Ik
 ZW/7ZFmLvuxhF5KeQJRI1q3V9XRk4z+HO9qKow6ZQrwCA1bFRMqXjcTx3ppBxUhQ
 nwWsdzG6JBjEWXhyIt40tP25wcfKx5xNRFQ5QQ8PHYuN5m7T71I=
Received: (qmail 80009 invoked by alias); 5 Feb 2016 13:10:36 -0000
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-patches.gcc.gnu.org>
List-Unsubscribe: <mailto:gcc-patches-unsubscribe-patch=linaro.org@gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-help@gcc.gnu.org>
Sender: gcc-patches-owner@gcc.gnu.org
Delivered-To: mailing list gcc-patches@gcc.gnu.org
Received: (qmail 79997 invoked by uid 89); 5 Feb 2016 13:10:35 -0000
Authentication-Results: sourceware.org; auth=none
X-Virus-Found: No
X-Spam-SWARE-Status: No, score=-2.3 required=5.0 tests=AWL, BAYES_00,
 RCVD_IN_DNSWL_LOW, SPF_PASS autolearn=ham version=3.3.2
 spammy=prathamesh.kulkarni@linaro.org,
 ramana.radhakrishnan@foss.arm.com, prathameshkulkarnilinaroorg,
 ramanaradhakrishnanfossarmcom
X-HELO: mail-ig0-f174.google.com
Received: from mail-ig0-f174.google.com (HELO mail-ig0-f174.google.com)
 (209.85.213.174) by sourceware.org
 (qpsmtpd/0.93/v0.84-503-g423c35a) with (AES128-GCM-SHA256
 encrypted) ESMTPS; Fri, 05 Feb 2016 13:10:32 +0000
Received: by mail-ig0-f174.google.com with SMTP id rs20so39786270igc.0 for
 <gcc-patches@gcc.gnu.org>; Fri, 05 Feb 2016 05:10:31 -0800 (PST)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net;
 s=20130820;
 h=x-gm-message-state:mime-version:in-reply-to:references:date
 :message-id:subject:from:to:cc:content-type;
 bh=DyAHJVtN5l+8dFbgLo7ptjAr0B463xmHPbo+K8Kt+TY=;
 b=Hnpc1X/YE6uJEggPaEHZsMz+0dkRZTOPjIz78ya1KXTnCbZ06BQKNEuKxQxva1jj/I
 BNALyxc2sEf6PC5KjuYT/dZ2WZOW1EFLzgQANrfiupxPNYr7/M8Vir/zT1OQRiQn/Kjv
 t6B0UiWkGOlZ420nFIePgtdYNh9rzo96m/FNGvrqMn3GZ/uQhStDB48JqCWEPGYRBg4Y
 bFGqZ4WraB84ga0D8RbouKw2ZRIdJdxOQi5WByeh/ZyksKTmtN+T38CNaxyELXbJWDka
 tfqGsPIFweSAMCbQADFIhVq7odHhyk3WnC7G1nOwejqUh7Bu4bEC8EKmz0tYLDMM+kKQ
 CCiw==
X-Gm-Message-State: AG10YORj3Nk6SntEfqaaKzg3GaToSXmh5kVZk5OVqaiLVlvIRI9bJgXaZHgwBvCQbkY1Q36DCaNPAJirx0LAjrD2
MIME-Version: 1.0
X-Received: by 10.50.150.97 with SMTP id uh1mr15086357igb.39.1454677829785;
 Fri, 05 Feb 2016 05:10:29 -0800 (PST)
Received: by 10.36.199.66 with HTTP; Fri, 5 Feb 2016 05:10:29 -0800 (PST)
In-Reply-To: <CAJA7tRaf8sZHmzHCt0n8rGJKt4bQycfSU+ABmb8rbcKb1VFQ=A@mail.gmail.com>
References: <CAAgBjMk0Hdask2JU8xs4fj_Ai1e0ggxB+h3ayb=NOGQBYJ8ccQ@mail.gmail.com>	<55BB4127.5050202@foss.arm.com>	<CAAgBjMnteaOO=JFifLD1CG7s-JE5r=65wcQm+HBwSMtpwg3ReA@mail.gmail.com>	<CAJA7tRaf8sZHmzHCt0n8rGJKt4bQycfSU+ABmb8rbcKb1VFQ=A@mail.gmail.com>
Date: Fri, 5 Feb 2016 18:40:29 +0530
Message-ID: <CAAgBjM=X8rXnDnNV4-sWC_0fCt9hcvuOx2-hJ=vwsAb1oBq22Q@mail.gmail.com>
Subject: Re: [ARM] implement division using vrecpe/vrecps with
 -funsafe-math-optimizations
From: Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org>
To: Ramana Radhakrishnan <ramana.gcc@googlemail.com>
Cc: Ramana Radhakrishnan <ramana.radhakrishnan@foss.arm.com>,
 gcc Patches <gcc-patches@gcc.gnu.org>,
 Charles Baylis <charles.baylis@linaro.org>
X-IsSubscribed: yes

On 4 February 2016 at 16:31, Ramana Radhakrishnan
<ramana.gcc@googlemail.com> wrote:
> On Sun, Jan 17, 2016 at 9:06 AM, Prathamesh Kulkarni
> <prathamesh.kulkarni@linaro.org> wrote:
>> On 31 July 2015 at 15:04, Ramana Radhakrishnan
>> <ramana.radhakrishnan@foss.arm.com> wrote:
>>>
>>>
>>> On 29/07/15 11:09, Prathamesh Kulkarni wrote:
>>>> Hi,
>>>> This patch tries to implement division with multiplication by
>>>> reciprocal using vrecpe/vrecps
>>>> with -funsafe-math-optimizations and -freciprocal-math enabled.
>>>> Tested on arm-none-linux-gnueabihf using qemu.
>>>> OK for trunk ?
>>>>
>>>> Thank you,
>>>> Prathamesh
>>>>
>>>
>>> I've tried this in the past and never been convinced that 2 iterations are enough to get to stability with this given that the results are only precise for 8 bits / iteration. Thus I've always believed you need 3 iterations rather than 2 at which point I've never been sure that it's worth it. So the testing that you've done with this currently is not enough for this to go into the tree.
>>>
>>> I'd like this to be tested on a couple of different AArch32 implementations with a wider range of inputs to verify that the results are acceptable as well as running something like SPEC2k(6) with atleast one iteration to ensure correctness.
>> Hi,
>> I got results of SPEC2k6 fp benchmarks:
>> a15: +0.64% overall, 481.wrf: +6.46%
>> a53: +0.21% overall, 416.gamess: -1.39%, 481.wrf: +6.76%
>> a57: +0.35% overall, 481.wrf: +3.84%
>> The other benchmarks had (almost) identical results.
>
> Thanks for the benchmarking results -  Please repost the patch with
> the changes that I had requested in my previous review - given it is
> now stage4 , I would rather queue changes like this for stage1 now.
Hi,
Please find the updated patch attached.
It passes testsuite for arm-none-linux-gnueabi, arm-none-linux-gnueabihf and
arm-none-eabi.
However the test-case added in the patch (neon-vect-div-1.c) fails to
get vectorized at -O2
for armeb-none-linux-gnueabihf.
Charles suggested me to try with -O3, which worked.
It appears the test-case fails to get vectorized with
-fvect-cost-model=cheap (which is default enabled at -O2)
and passes for -fno-vect-cost-model / -fvect-cost-model=dynamic

I can't figure out why it fails -fvect-cost-model=cheap.
>From the vect dump (attached):
neon-vect-div-1.c:12:3: note: Setting misalignment to -1.
neon-vect-div-1.c:12:3: note: not vectorized: unsupported unaligned load.*_9

Thanks,
Prathamesh
>
> Thanks,
> Ramana
>
>>
>> Thanks,
>> Prathamesh
>>>
>>>
>>> moving on to the patches.
>>>
>>>> diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md
>>>> index 654d9d5..28c2e2a 100644
>>>> --- a/gcc/config/arm/neon.md
>>>> +++ b/gcc/config/arm/neon.md
>>>> @@ -548,6 +548,32 @@
>>>>                      (const_string "neon_mul_<V_elem_ch><q>")))]
>>>>  )
>>>>
>>>
>>> Please add a comment here.
>>>
>>>> +(define_expand "div<mode>3"
>>>> +  [(set (match_operand:VCVTF 0 "s_register_operand" "=w")
>>>> +        (div:VCVTF (match_operand:VCVTF 1 "s_register_operand" "w")
>>>> +               (match_operand:VCVTF 2 "s_register_operand" "w")))]
>>>
>>> I want to double check that this doesn't collide with Alan's patches for FP16 especially if he reuses the VCVTF iterator for all the vcvt f16 cases.
>>>
>>>> +  "TARGET_NEON && flag_unsafe_math_optimizations && flag_reciprocal_math"
>>>> +  {
>>>> +    rtx rec = gen_reg_rtx (<MODE>mode);
>>>> +    rtx vrecps_temp = gen_reg_rtx (<MODE>mode);
>>>> +
>>>> +    /* Reciprocal estimate */
>>>> +    emit_insn (gen_neon_vrecpe<mode> (rec, operands[2]));
>>>> +
>>>> +    /* Perform 2 iterations of Newton-Raphson method for better accuracy */
>>>> +    for (int i = 0; i < 2; i++)
>>>> +      {
>>>> +     emit_insn (gen_neon_vrecps<mode> (vrecps_temp, rec, operands[2]));
>>>> +     emit_insn (gen_mul<mode>3 (rec, rec, vrecps_temp));
>>>> +      }
>>>> +
>>>> +    /* We now have reciprocal in rec, perform operands[0] = operands[1] * rec */
>>>> +    emit_insn (gen_mul<mode>3 (operands[0], operands[1], rec));
>>>> +    DONE;
>>>> +  }
>>>> +)
>>>> +
>>>> +
>>>>  (define_insn "mul<mode>3add<mode>_neon"
>>>>    [(set (match_operand:VDQW 0 "s_register_operand" "=w")
>>>>          (plus:VDQW (mult:VDQW (match_operand:VDQW 2 "s_register_operand" "w")
>>>> diff --git a/gcc/testsuite/gcc.target/arm/vect-div-1.c b/gcc/testsuite/gcc.target/arm/vect-div-1.c
>>>> new file mode 100644
>>>> index 0000000..e562ef3
>>>> --- /dev/null
>>>> +++ b/gcc/testsuite/gcc.target/arm/vect-div-1.c
>>>> @@ -0,0 +1,14 @@
>>>> +/* { dg-do compile } */
>>>> +/* { dg-require-effective-target arm_v8_neon_ok } */
>>>> +/* { dg-options "-O2 -funsafe-math-optimizations -ftree-vectorize -fdump-tree-vect-all" } */
>>>> +/* { dg-add-options arm_v8_neon } */
>>>
>>> No this is wrong.
>>>
>>> What is armv8 specific about this test ? This is just like another test that is for Neon. vrecpe / vrecps are not instructions that were introduced in the v8 version of the architecture. They've existed in the base Neon instruction set. The code generation above in the patterns will be enabled when TARGET_NEON is true which can happen when -mfpu=neon -mfloat-abi={softfp/hard} is true.
>>>
>>>> +
>>>> +void
>>>> +foo (int len, float * __restrict p, float *__restrict x)
>>>> +{
>>>> +  len = len & ~31;
>>>> +  for (int i = 0; i < len; i++)
>>>> +    p[i] = p[i] / x[i];
>>>> +}
>>>> +
>>>> +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
>>>> diff --git a/gcc/testsuite/gcc.target/arm/vect-div-2.c b/gcc/testsuite/gcc.target/arm/vect-div-2.c
>>>> new file mode 100644
>>>> index 0000000..8e15d0a
>>>> --- /dev/null
>>>> +++ b/gcc/testsuite/gcc.target/arm/vect-div-2.c
>>>> @@ -0,0 +1,14 @@
>>>> +/* { dg-do compile } */
>>>> +/* { dg-require-effective-target arm_v8_neon_ok } */
>>>
>>> And likewise.
>>>
>>>> +/* { dg-options "-O2 -funsafe-math-optimizations -fno-reciprocal-math -ftree-vectorize -fdump-tree-vect-all" } */
>>>> +/* { dg-add-options arm_v8_neon } */
>>>> +
>>>> +void
>>>> +foo (int len, float * __restrict p, float *__restrict x)
>>>> +{
>>>> +  len = len & ~31;
>>>> +  for (int i = 0; i < len; i++)
>>>> +    p[i] = p[i] / x[i];
>>>> +}
>>>> +
>>>> +/* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" } } */
>>>
>>>
>>> regards
>>> Ramana

diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md
index 55b61eb..7eafee4 100644
--- a/gcc/config/arm/neon.md
+++ b/gcc/config/arm/neon.md
@@ -578,6 +578,38 @@
                     (const_string "neon_mul_<V_elem_ch><q>")))]
 )
 
+/* Perform division using multiply-by-reciprocal. 
+   Reciprocal is calculated using Newton-Raphson method.
+   Enabled with -funsafe-math-optimizations -freciprocal-math
+   and disabled for -Os since it increases code size .  */
+
+(define_expand "div<mode>3"
+  [(set (match_operand:VCVTF 0 "s_register_operand" "=w")
+        (div:VCVTF (match_operand:VCVTF 1 "s_register_operand" "w")
+		  (match_operand:VCVTF 2 "s_register_operand" "w")))]
+  "TARGET_NEON && !optimize_size
+   && flag_unsafe_math_optimizations && flag_reciprocal_math"
+  {
+    rtx rec = gen_reg_rtx (<MODE>mode);
+    rtx vrecps_temp = gen_reg_rtx (<MODE>mode);
+
+    /* Reciprocal estimate.  */
+    emit_insn (gen_neon_vrecpe<mode> (rec, operands[2]));
+
+    /* Perform 2 iterations of newton-raphson method.  */
+    for (int i = 0; i < 2; i++)
+      {
+	emit_insn (gen_neon_vrecps<mode> (vrecps_temp, rec, operands[2]));
+	emit_insn (gen_mul<mode>3 (rec, rec, vrecps_temp));
+      }
+
+    /* We now have reciprocal in rec, perform operands[0] = operands[1] * rec.  */
+    emit_insn (gen_mul<mode>3 (operands[0], operands[1], rec));
+    DONE;
+  }
+)
+
+
 (define_insn "mul<mode>3add<mode>_neon"
   [(set (match_operand:VDQW 0 "s_register_operand" "=w")
         (plus:VDQW (mult:VDQW (match_operand:VDQW 2 "s_register_operand" "w")
diff --git a/gcc/testsuite/gcc.target/arm/neon-vect-div-1.c b/gcc/testsuite/gcc.target/arm/neon-vect-div-1.c
new file mode 100644
index 0000000..dae724a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/neon-vect-div-1.c
@@ -0,0 +1,16 @@
+/* Test pattern div<mode>3.  */
+
+/* { dg-do compile } */
+/* { dg-require-effective-target arm_neon_ok } */
+/* { dg-options "-O3 -funsafe-math-optimizations -fdump-tree-vect-all" } */
+/* { dg-add-options arm_neon } */
+
+void
+foo (int len, float * __restrict p, float *__restrict x)
+{
+  len = len & ~31;
+  for (int i = 0; i < len; i++)
+    p[i] = p[i] / x[i];
+}
+
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
diff --git a/gcc/testsuite/gcc.target/arm/neon-vect-div-2.c b/gcc/testsuite/gcc.target/arm/neon-vect-div-2.c
new file mode 100644
index 0000000..0450b70
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/neon-vect-div-2.c
@@ -0,0 +1,16 @@
+/* Test pattern div<mode>3.  */
+
+/* { dg-do compile } */
+/* { dg-require-effective-target arm_neon_ok } */
+/* { dg-options "-O3 -funsafe-math-optimizations -fno-reciprocal-math -fdump-tree-vect-all" } */
+/* { dg-add-options arm_neon } */
+
+void
+foo (int len, float * __restrict p, float *__restrict x)
+{
+  len = len & ~31;
+  for (int i = 0; i < len; i++)
+    p[i] = p[i] / x[i];
+}
+
+/* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" } } */