
[ARM] stop changing signedness in PROMOTE_MODE

Message ID CABXYE2VXjy7=5Y=c1TCxLE8KuwLtwBYBhTB24xrWDvWAeiBwbQ@mail.gmail.com
State New

Commit Message

Jim Wilson June 30, 2015, 1:15 a.m. UTC
This is my suggested fix for PR 65932, which is a linux kernel
miscompile with gcc-5.1.

The problem here is caused by a chain of events.  The first is that
the relatively new eipa_sra pass creates fake parameters that behave
slightly differently than normal parameters.  The second is that the
optimizer creates phi nodes that copy local variables to fake
parameters and/or vice versa.  The third is that the out-of-ssa pass
assumes that it can emit simple move instructions for these phi nodes.
And the fourth is that the ARM port has a PROMOTE_MODE macro that
forces QImode and HImode to unsigned, but a
TARGET_PROMOTE_FUNCTION_MODE hook that does not.  So signed char and
short parameters have different in-register representations from local
variables, and require a conversion when copying between them, a
conversion that the out-of-ssa pass can't easily emit.
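
As an illustration only (this is not the actual PR 65932 reproducer;
all of the names below are hypothetical), the kind of shape involved
looks roughly like this:

/* Hypothetical sketch: a signed sub-word value flows between a
   parameter and a local through a phi node.  IPA-SRA may turn the
   parameter of the static function into a "fake" scalar parameter.  */

extern int use (int);

static int
worker (signed char param)
{
  signed char local;    /* zero-extended in a register per PROMOTE_MODE  */

  if (use (param))
    local = param;      /* the parameter is sign-extended per the
                           TARGET_PROMOTE_FUNCTION_MODE hook, so this
                           "copy" actually needs a conversion  */
  else
    local = -1;

  return use (local);   /* phi merges the two definitions of local  */
}

int
caller (signed char c)
{
  return worker (c);
}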

Ultimately, I think this is a problem in the arm backend.  It should
not have a PROMOTE_MODE macro that is changing the sign of char and
short local variables.  I also think that we should merge the
PROMOTE_MODE macro with the TARGET_PROMOTE_FUNCTION_MODE hook to
prevent this from happening again.
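
To make the disagreement concrete: the current arm.h macro (quoted
from the patch below) widens QImode/HImode to SImode and also forces
UNSIGNEDP, while the arm.c hook widens the mode but leaves the
signedness alone.  The hook body is paraphrased here as a sketch and
may not match the arm.c source exactly:

#define PROMOTE_MODE(MODE, UNSIGNEDP, TYPE)     \
  if (GET_MODE_CLASS (MODE) == MODE_INT         \
      && GET_MODE_SIZE (MODE) < 4)              \
    {                                           \
      if (MODE == QImode)                       \
        UNSIGNEDP = 1;                          \
      else if (MODE == HImode)                  \
        UNSIGNEDP = 1;                          \
      (MODE) = SImode;                          \
    }

/* Sketch of the TARGET_PROMOTE_FUNCTION_MODE hook: sub-word integer
   arguments and return values are widened to SImode, but *punsignedp
   is left untouched, so parameters keep the signedness of their type.  */
static machine_mode
arm_promote_function_mode (const_tree type ATTRIBUTE_UNUSED,
                           machine_mode mode,
                           int *punsignedp ATTRIBUTE_UNUSED,
                           const_tree fntype ATTRIBUTE_UNUSED,
                           int for_return ATTRIBUTE_UNUSED)
{
  if (GET_MODE_CLASS (mode) == MODE_INT
      && GET_MODE_SIZE (mode) < 4)
    return SImode;

  return mode;
}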

I see four general problems with the current ARM PROMOTE_MODE definition.
1) Unsigned char is only faster for armv5 and earlier, before the sxtb
instruction was added.  It is a loss for armv6 and later.
2) Unsigned short was only faster for targets that don't support
unaligned accesses.  Support for these targets was removed a while
ago, and this PROMOTE_MODE hunk should have been removed at the same
time.  It was accidentally left behind.
3) TARGET_PROMOTE_FUNCTION_MODE used to be a boolean hook; when it was
converted to a function, the PROMOTE_MODE code was copied without the
UNSIGNEDP changes.  Thus it is only an accident that
TARGET_PROMOTE_FUNCTION_MODE and PROMOTE_MODE disagree.  Changing
TARGET_PROMOTE_FUNCTION_MODE is an ABI change, so only PROMOTE_MODE
changes to resolve the difference are safe.
4) There is a general principle that you should only change signedness
in PROMOTE_MODE if the hardware forces it, as otherwise this results
in extra conversion instructions that make code slower.  The mips64
hardware for instance requires that 32-bit values be sign-extended
regardless of type, and instructions may trap if this is not true.
However, it has a set of 32-bit instructions that operate on these
values, and hence no conversions are required.  There is no similar
case on ARM. Thus the conversions are unnecessary and unwise.  This
can be seen in the testcases where gcc emits both a zero-extend and a
sign-extend inside a loop, as the sign-extend is required for a
compare, and the zero-extend is required by PROMOTE_MODE.
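
As a generic sketch of what point 4 means (an illustration, not the
literal mips.h source): a port whose hardware keeps sub-word values
sign-extended in wider registers would clear UNSIGNEDP rather than set
it, so no extra conversion instructions are needed:

#define PROMOTE_MODE(MODE, UNSIGNEDP, TYPE)             \
  if (GET_MODE_CLASS (MODE) == MODE_INT                 \
      && GET_MODE_SIZE (MODE) < UNITS_PER_WORD)         \
    {                                                   \
      /* Hardware sign-extends regardless of type.  */  \
      (UNSIGNEDP) = 0;                                  \
      (MODE) = word_mode;                               \
    }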

My change was tested with an arm bootstrap, make check, and SPEC
CPU2000 run.  The original poster verified that this gives a linux
kernel that boots correctly.

The PROMOTE_MODE change causes 3 testsuite testcases to fail.  These
are tests to verify that smulbb and/or smlabb are generated.
Eliminating the unnecessary sign conversions causes us to get better
code that doesn't include the smulbb and smlabb instructions.  I had
to modify the testcases to get them to emit the desired instructions.
With the testcase changes there are no additional testsuite failures,
though I'm concerned that these testcases with the changes may be
fragile, and future changes may break them again.

If there are ARM parts where smulbb/smlabb are faster than mul/mla,
then maybe we should try to add new patterns to get the instructions
emitted again for the unmodified testcases.

Jim

Comments

Jim Wilson July 7, 2015, 4:29 p.m. UTC | #1
On Tue, Jul 7, 2015 at 8:07 AM, Jeff Law <law@redhat.com> wrote:
> On 06/29/2015 07:15 PM, Jim Wilson wrote:
> So if these "copies" require a  conversion, then isn't it fundamentally
> wrong to have a PHI node which copies between them?  That would seem to
> implicate the eipa_sra pass as needing to be aware of these promotions and
> avoid having these objects with different representations appearing on the
> lhs/rhs of a PHI node.

My years at Cisco didn't give me a chance to work on the SSA passes,
so I don't know much about how they work.  But looking at this, I see
that PHI nodes are eventually handled by emit_partition_copy in
tree-outof-ssa.c, which calls convert_to_mode, so it appears that
conversions between different (closely related?) types are OK in a PHI
node.  The problem in this case is that we have the exact same type
for the src and dest.  The only difference is that the ARM forces
sign-extension for signed sub-word parameters and zero-extension for
signed sub-word locals.  Thus to detect the need for a conversion, you
have to have the decls, and we don't have them here.  There is also
the problem that all of the SUBREG_PROMOTED_* stuff is in expand_expr
and friends, which aren't used by the cfglayout/tree-outof-ssa.c code
for PHI nodes.  So we need to copy some of the SUBREG_PROMOTED_*
handling into cfglayout/tree-outof-ssa, or modify them to call
expand_expr for PHI nodes, and I haven't had any luck getting that to
work yet. I still need to learn more about the code to figure out if
this is possible.

I also think that the ARM handling of PROMOTE_MODE is wrong.  Treating
a signed sub-word as unsigned can lead to inefficient code.  This
part of the problem is much easier for me to fix.  It may be hard to
convince ARM maintainers that it should be changed though.  I need
more time to work on this too.

I haven't looked at trying to forbid the optimizer from creating PHI
nodes connecting parameters to locals.  That just sounds like a
strange thing to forbid, and seems likely to result in worse code by
disabling too many optimizations.  But maybe it is the right solution
if neither of the other two options work.  This should only be done
when PROMOTE_MODE disagrees with TARGET_PROMOTE_FUNCTION_MODE, forcing
the copy to require a conversion.

Jim
Jim Wilson July 7, 2015, 6:25 p.m. UTC | #2
On Thu, Jul 2, 2015 at 2:07 AM, Richard Earnshaw
<Richard.Earnshaw@foss.arm.com> wrote:
> Not quite, ARM state still has more flexible addressing modes for
> unsigned byte loads than for signed byte loads.  It's even worse with
> thumb1 where some signed loads have no single-register addressing mode
> (ie you have to copy zero into another register to use as an index
> before doing the load).

I wasn't aware of the load address problem.  That was something I
hadn't considered, and I will have to look at it.  Load is just one
instruction though.  For most other instructions, a zero-extend
results in less efficient code, because it then forces a sign-extend
before a signed operation.  The fact that parameters and locals are
handled differently, which requires conversions when copying between
them, results in more inefficient code.  And changing
TARGET_PROMOTE_FUNCTION_MODE is an ABI change, and hence unwise, so
changing PROMOTE_MODE is the safer option.

Consider this testcase:
extern signed short gs;

short
sub (void)
{
  signed short s = gs;
  int i;

  for (i = 0; i < 10; i++)
    {
      s += 1;
      if (s > 10) break;
    }

  return s;
}

The inner loop ends up as
.L3:
        adds    r3, r3, #1
        mov     r0, r1
        uxth    r3, r3
        sxth    r2, r3
        cmp     r2, #10
        bgt     .L8
        cmp     r2, r1
        bne     .L3
        bx      lr

We need the sign-extension for the compare.  We need the
zero-extension for the loop-carried dependency.  We have two
extensions in every loop iteration, plus some extra register usage and
register movement.  We get better code for this example if we aren't
forcing signed shorts to be zero-extended via PROMOTE_MODE.

The lack of a reg+immediate address mode for ldrs[bh] in thumb1 does
look like a problem though.  But this means the difference between
generating
    movs r2, #0
    ldrsh r3, [r3, r2]
with my patch, or
    ldrh r3, [r3]
    lsls r2, r3, #16
    asrs r2, r2, #16
without my patch.  It isn't clear which sequence is better.  The
sign-extends in the second sequence can sometimes be optimized away,
and sometimes they can't be optimized away.  Similarly, in the first
sequence, loading zero into a reg can sometimes be optimized, and
sometimes it can't.  There is also no guarantee that you get the first
sequence with the patch or the second sequence without the patch.
There is a splitter for ldrsh, so you can get the second pattern
sometimes with the patch.  Similarly, it might be possible to get the
first pattern without the patch in some cases, though I don't have one
at the moment.

Jim
Jim Wilson July 10, 2015, 3:34 p.m. UTC | #3
On Wed, Jul 8, 2015 at 3:54 PM, Jeff Law <law@redhat.com> wrote:
> On 07/07/2015 10:29 AM, Jim Wilson wrote:
> This is critically important as various parts of the compiler will take a
> degenerate PHI node and propagate the RHS of the PHI into the uses of the
> LHS of the PHI -- without doing any conversions.

I think this is OK, because tree-outof-ssa does send code in basic
blocks through expand_expr, which will emit conversions if necessary.
It is only the conversion of PHI nodes to RTL that is the problem, as
it doesn't use expand_expr, and hence doesn't get the
SUBREG_PROMOTED_P conversions.

Jim
Christophe Lyon Feb. 17, 2016, 10:03 a.m. UTC | #4
On 15 February 2016 at 12:32, Kyrill Tkachov
<kyrylo.tkachov@foss.arm.com> wrote:
>
> On 04/02/16 08:58, Ramana Radhakrishnan wrote:
>>
>> On Tue, Jun 30, 2015 at 2:15 AM, Jim Wilson <jim.wilson@linaro.org> wrote:
>>>
>>> This is my suggested fix for PR 65932, which is a linux kernel
>>> miscompile with gcc-5.1.
>>>
>>> The problem here is caused by a chain of events.  The first is that
>>> the relatively new eipa_sra pass creates fake parameters that behave
>>> slightly differently than normal parameters.  The second is that the
>>> optimizer creates phi nodes that copy local variables to fake
>>> parameters and/or vice versa.  The third is that the out-of-ssa pass
>>> assumes that it can emit simple move instructions for these phi nodes.
>>> And the fourth is that the ARM port has a PROMOTE_MODE macro that
>>> forces QImode and HImode to unsigned, but a
>>> TARGET_PROMOTE_FUNCTION_MODE hook that does not.  So signed char and
>>> short parameters have different in-register representations from local
>>> variables, and require a conversion when copying between them, a
>>> conversion that the out-of-ssa pass can't easily emit.
>>>
>>> Ultimately, I think this is a problem in the arm backend.  It should
>>> not have a PROMOTE_MODE macro that is changing the sign of char and
>>> short local variables.  I also think that we should merge the
>>> PROMOTE_MODE macro with the TARGET_PROMOTE_FUNCTION_MODE hook to
>>> prevent this from happening again.
>>>
>>> I see four general problems with the current ARM PROMOTE_MODE definition.
>>> 1) Unsigned char is only faster for armv5 and earlier, before the sxtb
>>> instruction was added.  It is a loss for armv6 and later.
>>> 2) Unsigned short was only faster for targets that don't support
>>> unaligned accesses.  Support for these targets was removed a while
>>> ago, and this PROMOTE_MODE hunk should have been removed at the same
>>> time.  It was accidentally left behind.
>>> 3) TARGET_PROMOTE_FUNCTION_MODE used to be a boolean hook; when it was
>>> converted to a function, the PROMOTE_MODE code was copied without the
>>> UNSIGNEDP changes.  Thus it is only an accident that
>>> TARGET_PROMOTE_FUNCTION_MODE and PROMOTE_MODE disagree.  Changing
>>> TARGET_PROMOTE_FUNCTION_MODE is an ABI change, so only PROMOTE_MODE
>>> changes to resolve the difference are safe.
>>> 4) There is a general principle that you should only change signedness
>>> in PROMOTE_MODE if the hardware forces it, as otherwise this results
>>> in extra conversion instructions that make code slower.  The mips64
>>> hardware for instance requires that 32-bit values be sign-extended
>>> regardless of type, and instructions may trap if this is not true.
>>> However, it has a set of 32-bit instructions that operate on these
>>> values, and hence no conversions are required.  There is no similar
>>> case on ARM. Thus the conversions are unnecessary and unwise.  This
>>> can be seen in the testcases where gcc emits both a zero-extend and a
>>> sign-extend inside a loop, as the sign-extend is required for a
>>> compare, and the zero-extend is required by PROMOTE_MODE.
>>
>> Given Kyrill's testing with the patch and the reasonably detailed
>> check of the effects of code generation changes - The arm.h hunk is ok
>> - I do think we should make this explicit in the documentation that
>> PROMOTE_MODE and TARGET_PROMOTE_FUNCTION_MODE should agree and
>> better still maybe put in a checking assert for the same in the
>> mid-end but that could be the subject of a follow-up patch.
>>
>> Ok to apply just the arm.h hunk as I think Kyrill has taken care of
>> the testsuite fallout separately.
>
> Hi all,
>
> I'd like to backport the arm.h hunk from this (r233130) to the GCC 5
> branch. As the CSE patch from my series had some fallout on x86_64
> due to a deficiency in the AVX patterns that is too invasive to fix
> at this stage (and presumably backport), I'd like to just backport
> this arm.h fix and adjust the tests to XFAIL the fallout that comes
> with not applying the CSE patch. The attached patch does that.
>
> The code quality fallout on code outside the testsuite is not
> that great. The SPEC benchmarks are not affected by not applying
> the CSE change, and only a single sequence in a popular embedded benchmark
> shows some degradation for -mtune=cortex-a9 in the same way as the
> wmul-1.c and wmul-2.c tests.
>
> I think that's a fair tradeoff for fixing the wrong code bug on that branch.
>
> Ok to backport r233130 and the attached testsuite patch to the GCC 5 branch?
>
> Thanks,
> Kyrill
>
> 2016-02-15  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>
>     PR target/65932
>     * gcc.target/arm/wmul-1.c: Add -mtune=cortex-a9 to dg-options.
>     xfail the scan-assembler test.
>     * gcc.target/arm/wmul-2.c: Likewise.
>     * gcc.target/arm/wmul-3.c: Simplify test to generate a single smulbb.
>

Hi Kyrill,

I've noticed that wmul-3 still fails on the gcc-5 branch when forcing GCC
configuration to:
--with-cpu cortex-a5 --with-fpu vfpv3-d16-fp16
(target arm-none-linux-gnueabihf)

The generated code is:
        sxth    r0, r0
        sxth    r1, r1
        mul     r0, r1, r0
instead of
        smulbb  r0, r1, r0
on trunk.

I guess we don't worry?

>
>>
>> regards
>> Ramana
>>
>>> My change was tested with an arm bootstrap, make check, and SPEC
>>> CPU2000 run.  The original poster verified that this gives a linux
>>> kernel that boots correctly.
>>>
>>> The PROMOTE_MODE change causes 3 testsuite testcases to fail.  These
>>> are tests to verify that smulbb and/or smlabb are generated.
>>> Eliminating the unnecessary sign conversions causes us to get better
>>> code that doesn't include the smulbb and smlabb instructions.  I had
>>> to modify the testcases to get them to emit the desired instructions.
>>> With the testcase changes there are no additional testsuite failures,
>>> though I'm concerned that these testcases with the changes may be
>>> fragile, and future changes may break them again.
>>
>>> If there are ARM parts where smulbb/smlabb are faster than mul/mla,
>>> then maybe we should try to add new patterns to get the instructions
>>> emitted again for the unmodified testcases.
>>>
>>> Jim
>

Patch

gcc/
2015-06-29  Jim Wilson  <jim.wilson@linaro.org>

	PR target/65932
	* config/arm/arm.h (PROMOTE_MODE): Don't set UNSIGNEDP for QImode and
	HImode.

gcc/testsuite/
2015-06-29  Jim Wilson  <jim.wilson@linaro.org>

	PR target/65932
	* gcc.target/arm/wmul-1.c (mac): Change a and b to int pointers.  Cast
	multiply operands to short.
	* gcc.target/arm/wmul-2.c (vec_mpy): Cast multiply result to short.
	* gcc.target/arm/wmul-3.c (mac): Cast multiply results to short.

Index: config/arm/arm.h
===================================================================
--- config/arm/arm.h	(revision 224672)
+++ config/arm/arm.h	(working copy)
@@ -523,16 +523,10 @@  extern int arm_arch_crc;
    type, but kept valid in the wider mode.  The signedness of the
    extension may differ from that of the type.  */
 
-/* It is far faster to zero extend chars than to sign extend them */
-
 #define PROMOTE_MODE(MODE, UNSIGNEDP, TYPE)	\
   if (GET_MODE_CLASS (MODE) == MODE_INT		\
       && GET_MODE_SIZE (MODE) < 4)      	\
     {						\
-      if (MODE == QImode)			\
-	UNSIGNEDP = 1;				\
-      else if (MODE == HImode)			\
-	UNSIGNEDP = 1;				\
       (MODE) = SImode;				\
     }
 
Index: testsuite/gcc.target/arm/wmul-1.c
===================================================================
--- testsuite/gcc.target/arm/wmul-1.c	(revision 224672)
+++ testsuite/gcc.target/arm/wmul-1.c	(working copy)
@@ -2,14 +2,14 @@ 
 /* { dg-require-effective-target arm_dsp } */
 /* { dg-options "-O1 -fexpensive-optimizations" } */
 
-int mac(const short *a, const short *b, int sqr, int *sum)
+int mac(const int *a, const int *b, int sqr, int *sum)
 {
   int i;
   int dotp = *sum;
 
   for (i = 0; i < 150; i++) {
-    dotp += b[i] * a[i];
-    sqr += b[i] * b[i];
+    dotp += (short)b[i] * (short)a[i];
+    sqr += (short)b[i] * (short)b[i];
   }
 
   *sum = dotp;
Index: testsuite/gcc.target/arm/wmul-2.c
===================================================================
--- testsuite/gcc.target/arm/wmul-2.c	(revision 224672)
+++ testsuite/gcc.target/arm/wmul-2.c	(working copy)
@@ -7,7 +7,7 @@  void vec_mpy(int y[], const short x[], s
  int i;
 
  for (i = 0; i < 150; i++)
-   y[i] += ((scaler * x[i]) >> 31);
+   y[i] += ((short)(scaler * x[i]) >> 31);
 }
 
 /* { dg-final { scan-assembler-times "smulbb" 1 } } */
Index: testsuite/gcc.target/arm/wmul-3.c
===================================================================
--- testsuite/gcc.target/arm/wmul-3.c	(revision 224672)
+++ testsuite/gcc.target/arm/wmul-3.c	(working copy)
@@ -8,8 +8,8 @@  int mac(const short *a, const short *b,
   int dotp = *sum;
 
   for (i = 0; i < 150; i++) {
-    dotp -= b[i] * a[i];
-    sqr -= b[i] * b[i];
+    dotp -= (short)(b[i] * a[i]);
+    sqr -= (short)(b[i] * b[i]);
   }
 
   *sum = dotp;