[ARM] Use vector wide add for mixed-mode adds

Message ID	560D0567.40207@linaro.org
State	New
Headers
Return-Path: <patchwork-forward+bncBDIIVBVZ6QLRBAMLWSYAKGQEV2X25YI@linaro.org>
Received-SPF: pass (google.com: domain of patch+caf_=patchwork-forward=linaro.org@linaro.org designates 2a00:1450:4010:c04::22d as permitted sender) client-ip=2a00:1450:4010:c04::22d;
Received-SPF: pass (google.com: domain of gcc-patches-return-408808-patch=linaro.org@gcc.gnu.org designates 209.132.180.131 as permitted sender) client-ip=209.132.180.131;
Mailing-List: list patchwork-forward@linaro.org; contact patchwork-forward+owners@linaro.org
Precedence: list
Sender: gcc-patches-owner@gcc.gnu.org
Message-ID: <560D0567.40207@linaro.org>
Date: Thu, 01 Oct 2015 03:05:27 -0700
From: Michael Collison <michael.collison@linaro.org>
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.4.0
MIME-Version: 1.0
To: Kyrill Tkachov <kyrylo.tkachov@arm.com>, GCC Patches <gcc-patches@gcc.gnu.org>, Ramana Radhakrishnan <Ramana.Radhakrishnan@arm.com>
Subject: Re: [ARM] Use vector wide add for mixed-mode adds
References: <5601E9B9.5060600@linaro.org> <560267B4.5070809@arm.com>
In-Reply-To: <560267B4.5070809@arm.com>
Content-Type: multipart/mixed; boundary="------------020404050702030908040405"

Commit Message

Michael Collison Oct. 1, 2015, 10:05 a.m. UTC

Kyrill,

I have modified the patch to address your comments. I also modified
check_effective_target_vect_widen_sum_hi_to_si_pattern in
target-supports.exp to indicate that ARM NEON supports a vector widening
sum of HImode to SImode. This resolved several testsuite failures.

Tested successfully on arm-none-eabi and arm-none-linux-gnueabihf. On
armeb-none-linux-gnueabihf, four related execution tests fail, and only
with -flto:

gcc.dg/vect/vect-outer-4f.c -flto -ffat-lto-objects execution test
gcc.dg/vect/vect-outer-4g.c -flto -ffat-lto-objects execution test
gcc.dg/vect/vect-outer-4k.c -flto -ffat-lto-objects execution test
gcc.dg/vect/vect-outer-4l.c -flto -ffat-lto-objects execution test


I am debugging but have not tracked down the root cause yet. Feedback?

2015-07-22  Michael Collison  <michael.collison@linaro.org>

     * config/arm/neon.md (widen_<us>sum<mode>): New patterns
     where mode is VQI to improve mixed mode vectorization.
     * config/arm/neon.md (vec_sel_widen_ssum_lo<VQI:mode><VW:mode>3): New
     define_insn to match low half of signed vaddw.
     * config/arm/neon.md (vec_sel_widen_ssum_hi<VQI:mode><VW:mode>3): New
     define_insn to match high half of signed vaddw.
     * config/arm/neon.md (vec_sel_widen_usum_lo<VQI:mode><VW:mode>3): New
     define_insn to match low half of unsigned vaddw.
     * config/arm/neon.md (vec_sel_widen_usum_hi<VQI:mode><VW:mode>3): New
     define_insn to match high half of unsigned vaddw.
     * testsuite/gcc.target/arm/neon-vaddws16.c: New test.
     * testsuite/gcc.target/arm/neon-vaddws32.c: New test.
     * testsuite/gcc.target/arm/neon-vaddwu16.c: New test.
     * testsuite/gcc.target/arm/neon-vaddwu32.c: New test.
     * testsuite/gcc.target/arm/neon-vaddwu8.c: New test.
     * testsuite/lib/target-supports.exp
     (check_effective_target_vect_widen_sum_hi_to_si_pattern): Indicate
     that arm neon supports vector widen sum of HImode to SImode.
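
To make the intent of the new patterns concrete, here is a minimal scalar sketch in plain C of what a vectorized widening sum computes for eight 16-bit elements at a time. The low half and the high half of each input vector are sign-extended and added into four 32-bit accumulator lanes (the role played by the two vaddw.s16 instructions), and the lanes are reduced at the end. The helper names (widen_add_half, widen_sum_s16, scalar_sum_s16) are hypothetical, and the code models the arithmetic only, not the RTL or the register allocation.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of one vaddw.s16: sign-extend four 16-bit elements and add
   them lane by lane into four 32-bit accumulators.  */
static void
widen_add_half (int32_t acc[4], const int16_t half[4])
{
  for (int lane = 0; lane < 4; lane++)
    acc[lane] += (int32_t) half[lane];
}

/* Model of the vectorized loop.  LEN must be a multiple of 8.  */
int32_t
widen_sum_s16 (const int16_t *x, size_t len)
{
  int32_t acc[4] = { 0, 0, 0, 0 };

  for (size_t i = 0; i < len; i += 8)
    {
      widen_add_half (acc, x + i);      /* low half, elements 0..3 */
      widen_add_half (acc, x + i + 4);  /* high half, elements 4..7 */
    }

  /* Horizontal reduction of the accumulator lanes.  */
  return acc[0] + acc[1] + acc[2] + acc[3];
}

/* Scalar reference, equivalent to the loop in t6.  */
int32_t
scalar_sum_s16 (const int16_t *x, size_t len)
{
  int32_t result = 0;

  for (size_t i = 0; i < len; i++)
    result += x[i];
  return result;
}
```

Because each accumulator lane receives one element from the low half and one from the high half, the reduced total matches the scalar loop whichever half is added first.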

On 09/23/2015 01:49 AM, Kyrill Tkachov wrote:
> Hi Michael,
>
> On 23/09/15 00:52, Michael Collison wrote:
>> This is a modified version of the previous patch that removes the
>> documentation and read-md.c fixes. These patches have been submitted
>> separately and approved.
>>
>> This patch is designed to address code that was not being vectorized due
>> to missing widening patterns in the ARM backend. Code such as:
>>
>> int t6(int len, void * dummy, short * __restrict x)
>> {
>>     len = len & ~31;
>>     int result = 0;
>>     __asm volatile ("");
>>     for (int i = 0; i < len; i++)
>>       result += x[i];
>>     return result;
>> }
>>
>> Validated on arm-none-eabi, arm-none-linux-gnueabi,
>> arm-none-linux-gnueabihf, and armeb-none-linux-gnueabihf.
>>
>> 2015-09-22  Michael Collison <michael.collison@linaro.org>
>>
>>       * config/arm/neon.md (widen_<us>sum<mode>): New patterns
>>       where mode is VQI to improve mixed mode add vectorization.
>>
>
> Please list all the new define_expands and define_insns
> in the changelog. Also, please add a ChangeLog entry for
> the testsuite additions.
>
> The approach looks ok to me with a few comments on some
> parts of the patch itself.
>
>
> +(define_insn "vec_sel_widen_ssum_hi<VQI:mode><VW:mode>3"
> +  [(set (match_operand:<VW:V_widen> 0 "s_register_operand" "=w")
> +    (plus:<VW:V_widen> (sign_extend:<VW:V_widen> (vec_select:VW (match_operand:VQI 1 "s_register_operand" "%w")
> +                           (match_operand:VQI 2 "vect_par_constant_high" "")))
> +                (match_operand:<VW:V_widen> 3 "s_register_operand" "0")))]
> +  "TARGET_NEON"
> +  "vaddw.<V_s_elem>\t%q0, %q3, %f1"
> +  [(set_attr "type" "neon_add_widen")
> +  (set_attr "length" "8")]
> +)
>
>
> This is a single instruction, and it has a length of 4, so no need to 
> override the length attribute.
> Same with the other define_insns in this patch.
>
>
> diff --git a/gcc/testsuite/gcc.target/arm/neon-vaddws16.c b/gcc/testsuite/gcc.target/arm/neon-vaddws16.c
> new file mode 100644
> index 0000000..ed10669
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/arm/neon-vaddws16.c
> @@ -0,0 +1,21 @@
> +/* { dg-do compile } */
> +/* { dg-require-effective-target arm_neon_hw } */
>
> The arm_neon_hw check is usually used when you want to run the tests.
> Since this is a compile-only test you just need arm_neon_ok.
>
> +/* { dg-add-options arm_neon_ok } */
> +/* { dg-options "-O3" } */
> +
> +
> +int
> +t6(int len, void * dummy, short * __restrict x)
> +{
> +  len = len & ~31;
> +  int result = 0;
> +  __asm volatile ("");
> +  for (int i = 0; i < len; i++)
> +    result += x[i];
> +  return result;
> +}
> +
> +/* { dg-final { scan-assembler "vaddw\.s16" } } */
> +
> +
> +
>
> Stray trailing newlines. Similar comments for the other testcases.
>
> Thanks,
> Kyrill
>
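
A sketch of how the header of a compile-only test might look once the review comments quoted above are applied. It assumes the pairing commonly used in the GCC testsuite, where the arm_neon_ok effective-target check goes with dg-add-options arm_neon; that pairing is an assumption here, not something stated in the thread, and this is an illustration rather than the committed test. The directives live in comments, so the file is also plain C.

```c
/* { dg-do compile } */
/* { dg-require-effective-target arm_neon_ok } */
/* { dg-options "-O3" } */
/* { dg-add-options arm_neon } */

int
t6 (int len, void *dummy, short *__restrict x)
{
  /* Round the trip count down to a multiple of 32.  */
  len = len & ~31;
  int result = 0;
  __asm volatile ("");
  for (int i = 0; i < len; i++)
    result += x[i];
  return result;
}

/* { dg-final { scan-assembler "vaddw\.s16" } } */
```

The file ends directly after the dg-final line, with no trailing blank lines.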

Comments

Kyrylo Tkachov Oct. 8, 2015, 11:02 a.m. UTC | #1

Hi Michael,

On 01/10/15 11:05, Michael Collison wrote:
> Kyrill,
>
> I have modified the patch to address your comments. I also modified
> check_effective_target_vect_widen_sum_hi_to_si_pattern in
> target-supports.exp to
> indicate that arm neon supports vector widen sum of HImode to SImode.
> This resolved
> several test suite failures.
>
> Successfully tested on arm-none-eabi, arm-none-linux-gnueabihf. I have
> four related execution failure
> tests on armeb-none-linux-gnueabihf with -flto only.
>
> gcc.dg/vect/vect-outer-4f.c -flto -ffat-lto-objects execution test
> gcc.dg/vect/vect-outer-4g.c -flto -ffat-lto-objects execution test
> gcc.dg/vect/vect-outer-4k.c -flto -ffat-lto-objects execution test
> gcc.dg/vect/vect-outer-4l.c -flto -ffat-lto-objects execution test

We'd want to get to the bottom of these before committing.
Does codegen before and after the patch show anything?
When it comes to big-endian and NEON, the fiddly parts are
usually lane numbers. Do you need to select the proper lanes with
ENDIAN_LANE_N like Charles in his patch at:
https://gcc.gnu.org/ml/gcc-patches/2015-10/msg00656.html?

Thanks,
Kyrill

>
> I am debugging but have not tracked down the root cause yet. Feedback?
>
> 2015-07-22  Michael Collison  <michael.collison@linaro.org>
>
>       * config/arm/neon.md (widen_<us>sum<mode>): New patterns
>       where mode is VQI to improve mixed mode vectorization.
>       * config/arm/neon.md (vec_sel_widen_ssum_lo<VQI:mode><VW:mode>3): New
>       define_insn to match low half of signed vaddw.
>       * config/arm/neon.md (vec_sel_widen_ssum_hi<VQI:mode><VW:mode>3): New
>       define_insn to match high half of signed vaddw.
>       * config/arm/neon.md (vec_sel_widen_usum_lo<VQI:mode><VW:mode>3): New
>       define_insn to match low half of unsigned vaddw.
>       * config/arm/neon.md (vec_sel_widen_usum_hi<VQI:mode><VW:mode>3): New
>       define_insn to match high half of unsigned vaddw.
>       * testsuite/gcc.target/arm/neon-vaddws16.c: New test.
>       * testsuite/gcc.target/arm/neon-vaddws32.c: New test.
>       * testsuite/gcc.target/arm/neon-vaddwu16.c: New test.
>       * testsuite/gcc.target/arm/neon-vaddwu32.c: New test.
>       * testsuite/gcc.target/arm/neon-vaddwu8.c: New test.
>       * testsuite/lib/target-supports.exp
>       (check_effective_target_vect_widen_sum_hi_to_si_pattern): Indicate
>       that arm neon supports vector widen sum of HImode to SImode.

Note that the testsuite changes should have their own ChangeLog entry
with the paths there starting relative to gcc/testsuite/
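
On the lane-number question raised earlier in this reply, here is a minimal sketch of the kind of remapping an ENDIAN_LANE_N-style macro performs. It assumes the simple model in which lane n of an nunits-element vector is mirrored to nunits - 1 - n on big-endian targets. The function names are hypothetical; the real macro lives in the GCC back end and takes a machine mode rather than an element count.

```c
#include <stdbool.h>

/* Hypothetical stand-in for an ENDIAN_LANE_N-style mapping:
   mirror the lane index on big-endian, leave it alone otherwise.  */
int
endian_lane_n (int nunits, int n, bool big_endian)
{
  return big_endian ? nunits - 1 - n : n;
}

/* Fill OUT with the lane numbers that select the low half of an
   NUNITS-element vector after endian remapping.  OUT must have room
   for NUNITS / 2 entries.  */
void
low_half_lanes (int nunits, bool big_endian, int *out)
{
  for (int i = 0; i < nunits / 2; i++)
    out[i] = endian_lane_n (nunits, i, big_endian);
}
```

Under this model the "low half" selector for an 8-element vector is lanes 0..3 on little-endian and lanes 7..4 on big-endian, which is why a parallel built from plain GEN_INT (i) values may need adjusting.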


Michael Collison Oct. 20, 2015, 7:54 a.m. UTC | #2

Hi Kyrill,

Since your email I have done the following:

1. Added ENDIAN_LANE_N to the define_expand patterns for big-endian
targets. The big-endian changes produced no change in the test results.
I still have several execution failures when targeting big endian with
LTO enabled.

2. I diff'd the RTL dumps from a big-endian compiler with LTO enabled
and disabled. I also examined the assembly language and there are no
differences except for the .ascii directives.

I want to ask a question about existing patterns in neon.md that use
vec_select with all the lanes, as my example does: why are the
following patterns not matched if the target is big endian?


(define_insn "neon_vec_unpack<US>_lo_<mode>"
   [(set (match_operand:<V_unpack> 0 "register_operand" "=w")
         (SE:<V_unpack> (vec_select:<V_HALF>
               (match_operand:VU 1 "register_operand" "w")
               (match_operand:VU 2 "vect_par_constant_low" ""))))]
   "TARGET_NEON && !BYTES_BIG_ENDIAN"
   "vmovl.<US><V_sz_elem> %q0, %e1"
   [(set_attr "type" "neon_shift_imm_long")]
)

(define_insn "neon_vec_unpack<US>_hi_<mode>"
   [(set (match_operand:<V_unpack> 0 "register_operand" "=w")
         (SE:<V_unpack> (vec_select:<V_HALF>
               (match_operand:VU 1 "register_operand" "w")
               (match_operand:VU 2 "vect_par_constant_high" ""))))]
   "TARGET_NEON && !BYTES_BIG_ENDIAN"
   "vmovl.<US><V_sz_elem> %q0, %f1"
   [(set_attr "type" "neon_shift_imm_long")]
)

These patterns are similar to the new patterns I am adding, so I am
wondering whether my patterns should also exclude BYTES_BIG_ENDIAN.


Charles Baylis Oct. 21, 2015, 3:05 p.m. UTC | #3

On 20 October 2015 at 08:54, Michael Collison
<michael.collison@linaro.org> wrote:
> I want to ask a question about existing patterns in neon.md that use
> vec_select with all the lanes, as my example does: why are the following
> patterns not matched if the target is big endian?

> (define_insn "neon_vec_unpack<US>_lo_<mode>"
>   [(set (match_operand:<V_unpack> 0 "register_operand" "=w")
>         (SE:<V_unpack> (vec_select:<V_HALF>
>               (match_operand:VU 1 "register_operand" "w")
>               (match_operand:VU 2 "vect_par_constant_low" ""))))]
>   "TARGET_NEON && !BYTES_BIG_ENDIAN"
>   "vmovl.<US><V_sz_elem> %q0, %e1"
>   [(set_attr "type" "neon_shift_imm_long")]
> )
>
> (define_insn "neon_vec_unpack<US>_hi_<mode>"
>   [(set (match_operand:<V_unpack> 0 "register_operand" "=w")
>         (SE:<V_unpack> (vec_select:<V_HALF>
>               (match_operand:VU 1 "register_operand" "w")
>               (match_operand:VU 2 "vect_par_constant_high" ""))))]
>   "TARGET_NEON && !BYTES_BIG_ENDIAN"
>   "vmovl.<US><V_sz_elem> %q0, %f1"
>   [(set_attr "type" "neon_shift_imm_long")]
> )
>
> These patterns are similar to the new patterns I am adding and I am
> wondering if my patterns should exclude BYTES_BIG_ENDIAN?

These patterns use %e and %f to access the low and high part of the
input operand - so %e is used to match the use of _lo in the pattern
name, and vect_par_constant_low, and %f with _hi and
vect_par_constant_high. For big-endian, the use of %e and %f would
need to be swapped.

Looking at the patch you posted last month (possibly not the latest version?):

This is a pattern which is supposed to act on the low part of the
input vector, hence _lo in the name:
+(define_insn "vec_sel_widen_ssum_lo<VQI:mode><VW:mode>3"
+  [(set (match_operand:<VW:V_widen> 0 "s_register_operand" "=w")
+	(plus:<VW:V_widen> (sign_extend:<VW:V_widen> (vec_select:VW (match_operand:VQI 1 "s_register_operand" "%w")
+						   (match_operand:VQI 2 "vect_par_constant_low" "")))
+		        (match_operand:<VW:V_widen> 3 "s_register_operand" "0")))]
+  "TARGET_NEON"
+  "vaddw.<V_s_elem>\t%q0, %q3, %e1"

Here, using %e1 carries an implicit assumption that the low part of
the input vector is in the lowest numbered of the pair of D registers,
which is only true on little-endian.

This is a bit ugly (and untested), but perhaps replacing the output
template with something like this would fix the problem:
{
    return BYTES_BIG_ENDIAN ? "vaddw.<V_s_elem>\t%q0, %q3, %f1"
                            : "vaddw.<V_s_elem>\t%q0, %q3, %e1";
}

+  [(set_attr "type" "neon_add_widen")
+  (set_attr "length" "8")]
+)

Similarly here: the pattern is _hi and the register is %f1, so the same
assumption applies in mirror image:

+(define_insn "vec_sel_widen_ssum_hi<VQI:mode><VW:mode>3"
+  [(set (match_operand:<VW:V_widen> 0 "s_register_operand" "=w")
+	(plus:<VW:V_widen> (sign_extend:<VW:V_widen> (vec_select:VW (match_operand:VQI 1 "s_register_operand" "%w")
+						   (match_operand:VQI 2 "vect_par_constant_high" "")))
+		        (match_operand:<VW:V_widen> 3 "s_register_operand" "0")))]
+  "TARGET_NEON"
+  "vaddw.<V_s_elem>\t%q0, %q3, %f1"
+  [(set_attr "type" "neon_add_widen")
+  (set_attr "length" "8")]
+)

However, as far as I can see, there isn't an endianness dependency in
widen_ssum<mode>3/widen_usum<mode>3 because both halves of the vector
are used and added together.


Hope this helps
Charles
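
The suggestion in this message can be sketched outside the machine description as plain C. The sketch relies on the fact that a Q register qN overlaps the D registers d(2N) and d(2N+1), with %e naming the lower-numbered one and %f the higher-numbered one. The helper names are hypothetical, the element type is fixed to s16 for brevity, and, like the suggestion itself, this has not been tested against the compiler.

```c
#include <stdbool.h>
#include <string.h>

/* A Q register qN overlaps d(2N) and d(2N+1).  Modifier 'e' selects
   the lower-numbered D register, 'f' the higher-numbered one.  */
int
d_reg_for_modifier (int qreg, char modifier)
{
  return 2 * qreg + (modifier == 'f' ? 1 : 0);
}

/* Output template for the "_lo" pattern: the low half of the vector
   is in the lower-numbered D register only on little-endian.  */
const char *
vaddw_lo_template (bool big_endian)
{
  return big_endian ? "vaddw.s16\t%q0, %q3, %f1"
                    : "vaddw.s16\t%q0, %q3, %e1";
}

/* Output template for the "_hi" pattern: the mirror image.  */
const char *
vaddw_hi_template (bool big_endian)
{
  return big_endian ? "vaddw.s16\t%q0, %q3, %e1"
                    : "vaddw.s16\t%q0, %q3, %f1";
}

/* True if TMPL reads the higher-numbered D register of operand 1.  */
bool
template_uses_high_dreg (const char *tmpl)
{
  return strstr (tmpl, "%f1") != NULL;
}
```

Selecting the template at output time keeps a single pattern per half, at the cost of a conditional in each define_insn; the alternative seen in the existing vmovl patterns is to disable the pattern with !BYTES_BIG_ENDIAN.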

diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md
index 654d9d5..b3485f1 100644
--- a/gcc/config/arm/neon.md
+++ b/gcc/config/arm/neon.md
@@ -1174,6 +1174,55 @@ 
 
 ;; Widening operations
 
+(define_expand "widen_ssum<mode>3"
+  [(set (match_operand:<V_double_width> 0 "s_register_operand" "")
+	(plus:<V_double_width> (sign_extend:<V_double_width> (match_operand:VQI 1 "s_register_operand" ""))
+			       (match_operand:<V_double_width> 2 "s_register_operand" "")))]
+  "TARGET_NEON"
+  {
+    int i;
+    int half_elem = <V_mode_nunits>/2;
+    rtvec v1 = rtvec_alloc (half_elem);
+    rtvec v2 = rtvec_alloc (half_elem);
+    rtx p1, p2;
+
+    for (i = 0; i < half_elem; i++)
+      RTVEC_ELT (v1, i) = GEN_INT (i);
+    p1 = gen_rtx_PARALLEL (GET_MODE (operands[1]), v1);
+
+    for (i = half_elem; i < <V_mode_nunits>; i++)
+      RTVEC_ELT (v2, i - half_elem) = GEN_INT (i);
+    p2 = gen_rtx_PARALLEL (GET_MODE (operands[1]), v2);
+
+    if (operands[0] != operands[2])
+      emit_move_insn (operands[0], operands[2]);
+
+    emit_insn (gen_vec_sel_widen_ssum_lo<mode><V_half>3 (operands[0], operands[1], p1, operands[0]));
+    emit_insn (gen_vec_sel_widen_ssum_hi<mode><V_half>3 (operands[0], operands[1], p2, operands[0]));
+    DONE;
+  }
+)
+
+(define_insn "vec_sel_widen_ssum_lo<VQI:mode><VW:mode>3"
+  [(set (match_operand:<VW:V_widen> 0 "s_register_operand" "=w")
+	(plus:<VW:V_widen> (sign_extend:<VW:V_widen> (vec_select:VW (match_operand:VQI 1 "s_register_operand" "%w")
+						   (match_operand:VQI 2 "vect_par_constant_low" "")))
+		        (match_operand:<VW:V_widen> 3 "s_register_operand" "0")))]
+  "TARGET_NEON"
+  "vaddw.<V_s_elem>\t%q0, %q3, %e1"
+  [(set_attr "type" "neon_add_widen")]
+)
+
+(define_insn "vec_sel_widen_ssum_hi<VQI:mode><VW:mode>3"
+  [(set (match_operand:<VW:V_widen> 0 "s_register_operand" "=w")
+	(plus:<VW:V_widen> (sign_extend:<VW:V_widen> (vec_select:VW (match_operand:VQI 1 "s_register_operand" "%w")
+						   (match_operand:VQI 2 "vect_par_constant_high" "")))
+		        (match_operand:<VW:V_widen> 3 "s_register_operand" "0")))]
+  "TARGET_NEON"
+  "vaddw.<V_s_elem>\t%q0, %q3, %f1"
+  [(set_attr "type" "neon_add_widen")]
+)
+
 (define_insn "widen_ssum<mode>3"
   [(set (match_operand:<V_widen> 0 "s_register_operand" "=w")
 	(plus:<V_widen> (sign_extend:<V_widen>
@@ -1184,6 +1233,55 @@ 
   [(set_attr "type" "neon_add_widen")]
 )
 
+(define_expand "widen_usum<mode>3"
+  [(set (match_operand:<V_double_width> 0 "s_register_operand" "")
+	(plus:<V_double_width> (zero_extend:<V_double_width> (match_operand:VQI 1 "s_register_operand" ""))
+			       (match_operand:<V_double_width> 2 "s_register_operand" "")))]
+  "TARGET_NEON"
+  {
+    int i;
+    int half_elem = <V_mode_nunits>/2;
+    rtvec v1 = rtvec_alloc (half_elem);
+    rtvec v2 = rtvec_alloc (half_elem);
+    rtx p1, p2;
+
+    for (i = 0; i < half_elem; i++)
+      RTVEC_ELT (v1, i) = GEN_INT (i);
+    p1 = gen_rtx_PARALLEL (GET_MODE (operands[1]), v1);
+
+    for (i = half_elem; i < <V_mode_nunits>; i++)
+      RTVEC_ELT (v2, i - half_elem) = GEN_INT (i);
+    p2 = gen_rtx_PARALLEL (GET_MODE (operands[1]), v2);
+
+    if (operands[0] != operands[2])
+      emit_move_insn (operands[0], operands[2]);
+
+    emit_insn (gen_vec_sel_widen_usum_lo<mode><V_half>3 (operands[0], operands[1], p1, operands[0]));
+    emit_insn (gen_vec_sel_widen_usum_hi<mode><V_half>3 (operands[0], operands[1], p2, operands[0]));
+    DONE;
+  }
+)
+
+(define_insn "vec_sel_widen_usum_lo<VQI:mode><VW:mode>3"
+  [(set (match_operand:<VW:V_widen> 0 "s_register_operand" "=w")
+	(plus:<VW:V_widen> (zero_extend:<VW:V_widen> (vec_select:VW (match_operand:VQI 1 "s_register_operand" "%w")
+						   (match_operand:VQI 2 "vect_par_constant_low" "")))
+		        (match_operand:<VW:V_widen> 3 "s_register_operand" "0")))]
+  "TARGET_NEON"
+  "vaddw.<V_u_elem>\t%q0, %q3, %e1"
+  [(set_attr "type" "neon_add_widen")]
+)
+
+(define_insn "vec_sel_widen_usum_hi<VQI:mode><VW:mode>3"
+  [(set (match_operand:<VW:V_widen> 0 "s_register_operand" "=w")
+	(plus:<VW:V_widen> (zero_extend:<VW:V_widen> (vec_select:VW (match_operand:VQI 1 "s_register_operand" "%w")
+						   (match_operand:VQI 2 "vect_par_constant_high" "")))
+		        (match_operand:<VW:V_widen> 3 "s_register_operand" "0")))]
+  "TARGET_NEON"
+  "vaddw.<V_u_elem>\t%q0, %q3, %f1"
+  [(set_attr "type" "neon_add_widen")]
+)
+
 (define_insn "widen_usum<mode>3"
   [(set (match_operand:<V_widen> 0 "s_register_operand" "=w")
 	(plus:<V_widen> (zero_extend:<V_widen>
@@ -5347,7 +5445,7 @@ 
  [(set (match_operand:<V_unpack> 0 "register_operand" "=w")
        (mult:<V_unpack> (SE:<V_unpack> (vec_select:<V_HALF>
 			   (match_operand:VU 1 "register_operand" "w") 
-                           (match_operand:VU 2 "vect_par_constant_low" "")))
+					(match_operand:VU 2 "vect_par_constant_low" "")))
  		        (SE:<V_unpack> (vec_select:<V_HALF>
                            (match_operand:VU 3 "register_operand" "w") 
                            (match_dup 2)))))]
diff --git a/gcc/testsuite/gcc.target/arm/neon-vaddws16.c b/gcc/testsuite/gcc.target/arm/neon-vaddws16.c
new file mode 100644
index 0000000..96c657e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/neon-vaddws16.c
@@ -0,0 +1,18 @@ 
+/* { dg-do compile } */
+/* { dg-require-effective-target arm_neon_hw } */
+/* { dg-add-options arm_neon_ok } */
+/* { dg-options "-O3" } */
+
+
+int 
+t6(int len, void * dummy, short * __restrict x)
+{
+  len = len & ~31;
+  int result = 0;
+  __asm volatile ("");
+  for (int i = 0; i < len; i++)
+    result += x[i];
+  return result;
+}
+
+/* { dg-final { scan-assembler "vaddw\.s16" } } */
diff --git a/gcc/testsuite/gcc.target/arm/neon-vaddws32.c b/gcc/testsuite/gcc.target/arm/neon-vaddws32.c
new file mode 100644
index 0000000..1bfdc13
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/neon-vaddws32.c
@@ -0,0 +1,17 @@ 
+/* { dg-do compile } */
+/* { dg-require-effective-target arm_neon_hw } */
+/* { dg-add-options arm_neon_ok } */
+/* { dg-options "-O3" } */
+
+int 
+t6(int len, void * dummy, int * __restrict x)
+{
+  len = len & ~31;
+  long long result = 0;
+  __asm volatile ("");
+  for (int i = 0; i < len; i++)
+    result += x[i];
+  return result;
+}
+
+/* { dg-final { scan-assembler "vaddw\.s32" } } */
diff --git a/gcc/testsuite/gcc.target/arm/neon-vaddwu16.c b/gcc/testsuite/gcc.target/arm/neon-vaddwu16.c
new file mode 100644
index 0000000..98f8768
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/neon-vaddwu16.c
@@ -0,0 +1,18 @@ 
+/* { dg-do compile } */
+/* { dg-require-effective-target arm_neon_hw } */
+/* { dg-add-options arm_neon_ok } */
+/* { dg-options "-O3" } */
+
+
+int 
+t6(int len, void * dummy, unsigned short * __restrict x)
+{
+  len = len & ~31;
+  unsigned int result = 0;
+  __asm volatile ("");
+  for (int i = 0; i < len; i++)
+    result += x[i];
+  return result;
+}
+
+/* { dg-final { scan-assembler "vaddw\.u16" } } */
diff --git a/gcc/testsuite/gcc.target/arm/neon-vaddwu32.c b/gcc/testsuite/gcc.target/arm/neon-vaddwu32.c
new file mode 100644
index 0000000..4a72a39
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/neon-vaddwu32.c
@@ -0,0 +1,17 @@ 
+/* { dg-do compile } */
+/* { dg-require-effective-target arm_neon_hw } */
+/* { dg-add-options arm_neon_ok } */
+/* { dg-options "-O3" } */
+
+int 
+t6(int len, void * dummy, unsigned int * __restrict x)
+{
+  len = len & ~31;
+  unsigned long long result = 0;
+  __asm volatile ("");
+  for (int i = 0; i < len; i++)
+    result += x[i];
+  return result;
+}
+
+/* { dg-final { scan-assembler "vaddw\.u32" } } */
diff --git a/gcc/testsuite/gcc.target/arm/neon-vaddwu8.c b/gcc/testsuite/gcc.target/arm/neon-vaddwu8.c
new file mode 100644
index 0000000..9c9c68a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/neon-vaddwu8.c
@@ -0,0 +1,18 @@ 
+/* { dg-do compile } */
+/* { dg-require-effective-target arm_neon_hw } */
+/* { dg-add-options arm_neon_ok } */
+/* { dg-options "-O3" } */
+
+
+int 
+t6(int len, void * dummy, char * __restrict x)
+{
+  len = len & ~31;
+  unsigned short result = 0;
+  __asm volatile ("");
+  for (int i = 0; i < len; i++)
+    result += x[i];
+  return result;
+}
+
+/* { dg-final { scan-assembler "vaddw\.u8" } } */
diff --git a/gcc/testsuite/lib/target-supports.exp b/gcc/testsuite/lib/target-supports.exp
index 1988301..5530edc 100644
--- a/gcc/testsuite/lib/target-supports.exp
+++ b/gcc/testsuite/lib/target-supports.exp
@@ -3838,6 +3838,7 @@  proc check_effective_target_vect_widen_sum_hi_to_si_pattern { } {
     } else {
         set et_vect_widen_sum_hi_to_si_pattern_saved 0
         if { [istarget powerpc*-*-*]
+	     || [check_effective_target_arm_neon_ok]
              || [istarget ia64-*-*] } {
             set et_vect_widen_sum_hi_to_si_pattern_saved 1
         }
-- 
1.9.1