diff mbox

[ARM] Post-indexed addressing for NEON memory access

Message ID CADnVucAaG=uAZyxQGvyf5bqrmW8JfhfjCp84uCpVnf+=Tois5w@mail.gmail.com
State New
Headers show

Commit Message

Charles Baylis June 2, 2014, 4:47 p.m. UTC
This patch adds support for post-indexed addressing for NEON structure
memory accesses.

For example VLD1.8 {d0}, [r0], r1


Bootstrapped and checked on arm-unknown-gnueabihf using Qemu.

Ok for trunk?


gcc/Changelog:

2014-06-02  Charles Baylis  <charles.baylis@linaro.org>

        * config/arm/arm.c (neon_vector_mem_operand): Allow register
        POST_MODIFY for neon loads and stores.
        (arm_print_operand): Output post-index register for neon loads and
        stores.

Comments

Ramana Radhakrishnan June 5, 2014, 6:27 a.m. UTC | #1
On Mon, Jun 2, 2014 at 5:47 PM, Charles Baylis
<charles.baylis@linaro.org> wrote:
> This patch adds support for post-indexed addressing for NEON structure
> memory accesses.
>
> For example VLD1.8 {d0}, [r0], r1
>
>
> Bootstrapped and checked on arm-unknown-gnueabihf using Qemu.
>
> Ok for trunk?

This looks like a reasonable start but this work doesn't look complete
to me yet.

Can you also look at the impact on performance of a range of
benchmarks especially a popular embedded one to see how this behaves
unless you have already done so ?

POST_INC, POST_MODIFY usually have a funny way of biting you with
either ivopts or the way in which address costs work. I think there
maybe further tweaks needed but for a first step I'd like to know what
the performance impact is.

I would also suggest running this through clyon's neon intrinsics
testsuite to see if that catches any issues especially with the large
vector modes.

regards
Ramana

>
>
> gcc/Changelog:
>
> 2014-06-02  Charles Baylis  <charles.baylis@linaro.org>
>
>         * config/arm/arm.c (neon_vector_mem_operand): Allow register
>         POST_MODIFY for neon loads and stores.
>         (arm_print_operand): Output post-index register for neon loads and
>         stores.
Charles Baylis June 17, 2014, 3:03 p.m. UTC | #2
On 5 June 2014 07:27, Ramana Radhakrishnan <ramana.gcc@googlemail.com> wrote:
> On Mon, Jun 2, 2014 at 5:47 PM, Charles Baylis
> <charles.baylis@linaro.org> wrote:
>> This patch adds support for post-indexed addressing for NEON structure
>> memory accesses.
>>
>> For example VLD1.8 {d0}, [r0], r1
>>
>>
>> Bootstrapped and checked on arm-unknown-gnueabihf using Qemu.
>>
>> Ok for trunk?
>
> This looks like a reasonable start but this work doesn't look complete
> to me yet.
>
> Can you also look at the impact on performance of a range of
> benchmarks especially a popular embedded one to see how this behaves
> unless you have already done so ?

I ran a popular suite of embedded benchmarks, and there is no impact
at all on Chromebook (including with the additional attached patch)

The patch was developed to address a performance issue with a new
version of libvpx which uses intrinsics instead of NEON assembler. The
patch results in a 3% improvement for VP8 decode.

> POST_INC, POST_MODIFY usually have a funny way of biting you with
> either ivopts or the way in which address costs work. I think there
> maybe further tweaks needed but for a first step I'd like to know what
> the performance impact is.

> I would also suggest running this through clyon's neon intrinsics
> testsuite to see if that catches any issues especially with the large
> vector modes.

No issues found in clyon's tests.

Your mention of larger vector modes prompted me to check that the
patch has the desired result with them. In fact, the costs are
estimated incorrectly which means the post_modify pattern is not used.
The attached patch fixes that. (used in combination with my original
patch)


2014-06-15  Charles Baylis  <charles.bayls@linaro.org>

        * config/arm/arm.c (arm_new_rtx_costs): Reduce cost for mem with
        embedded side effects.
Ramana Radhakrishnan June 18, 2014, 10:01 a.m. UTC | #3
On Mon, Jun 2, 2014 at 5:47 PM, Charles Baylis
<charles.baylis@linaro.org> wrote:
> This patch adds support for post-indexed addressing for NEON structure
> memory accesses.
>
> For example VLD1.8 {d0}, [r0], r1
>
>
> Bootstrapped and checked on arm-unknown-gnueabihf using Qemu.
>
> Ok for trunk?

This is OK.

Ramana
>
>
> gcc/Changelog:
>
> 2014-06-02  Charles Baylis  <charles.baylis@linaro.org>
>
>         * config/arm/arm.c (neon_vector_mem_operand): Allow register
>         POST_MODIFY for neon loads and stores.
>         (arm_print_operand): Output post-index register for neon loads and
>         stores.
Ramana Radhakrishnan June 18, 2014, 10:06 a.m. UTC | #4
On Tue, Jun 17, 2014 at 4:03 PM, Charles Baylis
<charles.baylis@linaro.org> wrote:
> On 5 June 2014 07:27, Ramana Radhakrishnan <ramana.gcc@googlemail.com> wrote:
>> On Mon, Jun 2, 2014 at 5:47 PM, Charles Baylis
>> <charles.baylis@linaro.org> wrote:
>>> This patch adds support for post-indexed addressing for NEON structure
>>> memory accesses.
>>>
>>> For example VLD1.8 {d0}, [r0], r1
>>>
>>>
>>> Bootstrapped and checked on arm-unknown-gnueabihf using Qemu.
>>>
>>> Ok for trunk?
>>
>> This looks like a reasonable start but this work doesn't look complete
>> to me yet.
>>
>> Can you also look at the impact on performance of a range of
>> benchmarks especially a popular embedded one to see how this behaves
>> unless you have already done so ?
>
> I ran a popular suite of embedded benchmarks, and there is no impact
> at all on Chromebook (including with the additional attached patch)

Thanks for the due diligence

>
> The patch was developed to address a performance issue with a new
> version of libvpx which uses intrinsics instead of NEON assembler. The
> patch results in a 3% improvement for VP8 decode.

Good - 3% not to be sneezed at.

>
>> POST_INC, POST_MODIFY usually have a funny way of biting you with
>> either ivopts or the way in which address costs work. I think there
>> maybe further tweaks needed but for a first step I'd like to know what
>> the performance impact is.
>
>> I would also suggest running this through clyon's neon intrinsics
>> testsuite to see if that catches any issues especially with the large
>> vector modes.

Thanks.

>
> No issues found in clyon's tests.

Please keep an eye out for any regressions.

>
> Your mention of larger vector modes prompted me to check that the
> patch has the desired result with them. In fact, the costs are
> estimated incorrectly which means the post_modify pattern is not used.
> The attached patch fixes that. (used in combination with my original
> patch)
>
>
> 2014-06-15  Charles Baylis  <charles.bayls@linaro.org>
>
>         * config/arm/arm.c (arm_new_rtx_costs): Reduce cost for mem with
>         embedded side effects.

I'm not too thrilled with putting in more special cases that are not
table driven in there. Can you file a PR with some testcases that show
this so that we don't forget and CC me on it please ?


Ramana
Charles Baylis June 18, 2014, 1:49 p.m. UTC | #5
On 18 June 2014 11:01, Ramana Radhakrishnan <ramana.gcc@googlemail.com> wrote:
> On Mon, Jun 2, 2014 at 5:47 PM, Charles Baylis
> <charles.baylis@linaro.org> wrote:
>> This patch adds support for post-indexed addressing for NEON structure
>> memory accesses.
>>
>> For example VLD1.8 {d0}, [r0], r1
>>
>>
>> Bootstrapped and checked on arm-unknown-gnueabihf using Qemu.
>>
>> Ok for trunk?
>
> This is OK.

Committed as r211783.
Charles Baylis June 18, 2014, 2:30 p.m. UTC | #6
On 18 June 2014 11:06, Ramana Radhakrishnan <ramana.gcc@googlemail.com> wrote:
>> 2014-06-15  Charles Baylis  <charles.bayls@linaro.org>
>>
>>         * config/arm/arm.c (arm_new_rtx_costs): Reduce cost for mem with
>>         embedded side effects.
>
> I'm not too thrilled with putting in more special cases that are not
> table driven in there. Can you file a PR with some testcases that show
> this so that we don't forget and CC me on it please ?

I created PR61551 and CC'd.
Charles Baylis June 19, 2015, 6:04 p.m. UTC | #7
On 18 June 2014 at 11:06, Ramana Radhakrishnan
<ramana.gcc@googlemail.com> wrote:
> On Tue, Jun 17, 2014 at 4:03 PM, Charles Baylis
> <charles.baylis@linaro.org> wrote:
>> Your mention of larger vector modes prompted me to check that the
>> patch has the desired result with them. In fact, the costs are
>> estimated incorrectly which means the post_modify pattern is not used.
>> The attached patch fixes that. (used in combination with my original
>> patch)
>>
>>
>> 2014-06-15  Charles Baylis  <charles.bayls@linaro.org>
>>
>>         * config/arm/arm.c (arm_new_rtx_costs): Reduce cost for mem with
>>         embedded side effects.
>
> I'm not too thrilled with putting in more special cases that are not
> table driven in there. Can you file a PR with some testcases that show
> this so that we don't forget and CC me on it please ?

I created https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61551 at the time.

I've come back to look at this again and would like to fix it in this
release cycle. I still don't really understand what you mean by
table-driven in this context. Do you still hold this view, and if so,
could you describe what you'd like to see instead of this patch?
diff mbox

Patch

From a8e0bdbceab00d5e5b655611965d3975ba74365c Mon Sep 17 00:00:00 2001
From: Charles Baylis <charles.baylis@linaro.org>
Date: Tue, 6 May 2014 15:23:46 +0100
Subject: [PATCH] post-indexed addressing for vld/vst

2014-05-09  Charles Baylis  <charles.baylis@linaro.org>

	* config/arm/arm.c (neon_vector_mem_operand): Allow register
	POST_MODIFY for neon loads and stores.
	(arm_print_operand): Output post-index register for neon loads and
	stores.
---
 gcc/config/arm/arm.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
index 1117bd4..6ab02ef 100644
--- a/gcc/config/arm/arm.c
+++ b/gcc/config/arm/arm.c
@@ -12786,7 +12786,11 @@  neon_vector_mem_operand (rtx op, int type, bool strict)
       || (type == 0 && GET_CODE (ind) == PRE_DEC))
     return arm_address_register_rtx_p (XEXP (ind, 0), 0);
 
-  /* FIXME: vld1 allows register post-modify.  */
+  /* Allow post-increment by register for VLDn */
+  if (type == 2 && GET_CODE (ind) == POST_MODIFY
+      && GET_CODE (XEXP (ind, 1)) == PLUS
+      && REG_P (XEXP (XEXP (ind, 1), 1)))
+     return true;
 
   /* Match:
      (plus (reg)
@@ -21816,6 +21820,7 @@  arm_print_operand (FILE *stream, rtx x, int code)
       {
 	rtx addr;
 	bool postinc = FALSE;
+	rtx postinc_reg = NULL;
 	unsigned align, memsize, align_bits;
 
 	gcc_assert (MEM_P (x));
@@ -21825,6 +21830,11 @@  arm_print_operand (FILE *stream, rtx x, int code)
 	    postinc = 1;
 	    addr = XEXP (addr, 0);
 	  }
+	if (GET_CODE (addr) == POST_MODIFY)
+	  {
+	    postinc_reg = XEXP( XEXP (addr, 1), 1);
+	    addr = XEXP (addr, 0);
+	  }
 	asm_fprintf (stream, "[%r", REGNO (addr));
 
 	/* We know the alignment of this access, so we can emit a hint in the
@@ -21850,6 +21860,8 @@  arm_print_operand (FILE *stream, rtx x, int code)
 
 	if (postinc)
 	  fputs("!", stream);
+	if (postinc_reg)
+	  asm_fprintf (stream, ", %r", REGNO (postinc_reg));
       }
       return;
 
-- 
1.9.1