
[ARM] Further improve stack usage on sha512 (PR 77308)

Message ID AM4PR0701MB2162B1578B69466A7FDEC294E4840@AM4PR0701MB2162.eurprd07.prod.outlook.com
State New

Commit Message

Bernd Edlinger Dec. 8, 2016, 7:50 p.m. UTC
Hi Wilco,

On 11/30/16 18:01, Bernd Edlinger wrote:
> I attached the completely untested follow-up patch now, but I would
> like to post that one again for review, after I applied my current
> patch, which is still waiting for final review (please feel pinged!).
>
> This is really exciting...

When testing the follow-up patch I discovered a single regression
in gcc.dg/fixed-point/convert-sat.c that was caused by a mis-compilation
of the libgcc function __gnu_satfractdasq.
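
For context, __gnu_satfractdasq performs a saturating fixed-point conversion; the essential step is clamping a wide intermediate value into a narrower signed range, which is why the compiled code below compares the high part against -1 and loads the bound 0x7fffffff. The following is only a hedged C sketch of that clamping idea (the real routine is generated from macros in libgcc's fixed-bit.c and is not this code):

```c
#include <stdint.h>

/* Hedged sketch of the saturation step: clamp a 64-bit intermediate
   into the signed 32-bit range.  The bounds correspond to the
   constants (0x7fffffff and -1 for the high part) visible in the RTL
   dump; this only illustrates the 64-bit compares that the cmpdi
   patterns have to implement.  */
static int32_t saturate_to_si(int64_t x)
{
    if (x > INT32_MAX)
        return INT32_MAX;   /* positive saturation: 0x7fffffff */
    if (x < INT32_MIN)
        return INT32_MIN;   /* negative saturation */
    return (int32_t)x;
}
```

A mis-compiled 64-bit compare in this clamping sequence is exactly the kind of bug that convert-sat.c exposes.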

I think it triggered a latent bug in the carryin_compare patterns.

Everything is as expected until reload.  First comes what is left over
of a split cmpdi_insn, followed by a former cmpdi_unsigned that is
reached if the branch is not taken.

(insn 109 10 110 2 (set (reg:CC 100 cc)
         (compare:CC (reg:SI 0 r0 [orig:124 _10 ] [124])
             (const_int 0 [0]))) 
"../../../gcc-trunk/libgcc/fixed-bit.c":785 196 {*arm_cmpsi_insn}
      (nil))
(insn 110 109 13 2 (parallel [
             (set (reg:CC 100 cc)
                 (compare:CC (reg:SI 1 r1 [orig:125 _10+4 ] [125])
                     (const_int -1 [0xffffffffffffffff])))
             (set (reg:SI 3 r3 [123])
                 (minus:SI (plus:SI (reg:SI 1 r1 [orig:125 _10+4 ] [125])
                         (const_int -1 [0xffffffffffffffff]))
                     (ltu:SI (reg:CC_C 100 cc)
                         (const_int 0 [0]))))
         ]) "../../../gcc-trunk/libgcc/fixed-bit.c":785 32 
{*subsi3_carryin_compare_const}
      (nil))
(jump_insn 13 110 31 2 (set (pc)
         (if_then_else (ge (reg:CC_NCV 100 cc)
                 (const_int 0 [0]))
             (label_ref:SI 102)
             (pc))) "../../../gcc-trunk/libgcc/fixed-bit.c":785 204 
{arm_cond_branch}
      (int_list:REG_BR_PROB 6400 (nil))

(note 31 13 97 3 [bb 3] NOTE_INSN_BASIC_BLOCK)
(note 97 31 114 3 NOTE_INSN_DELETED)
(insn 114 97 113 3 (set (reg:SI 2 r2 [orig:127+4 ] [127])
         (const_int -1 [0xffffffffffffffff])) 
"../../../gcc-trunk/libgcc/fixed-bit.c":831 630 {*arm_movsi_vfp}
      (expr_list:REG_EQUIV (const_int -1 [0xffffffffffffffff])
         (nil)))
(insn 113 114 107 3 (set (reg:SI 3 r3 [126])
         (const_int 2147483647 [0x7fffffff])) 
"../../../gcc-trunk/libgcc/fixed-bit.c":831 630 {*arm_movsi_vfp}
      (expr_list:REG_EQUIV (const_int 2147483647 [0x7fffffff])
         (nil)))
(insn 107 113 108 3 (set (reg:CC 100 cc)
         (compare:CC (reg:SI 1 r1 [orig:125 _10+4 ] [125])
             (reg:SI 2 r2 [orig:127+4 ] [127]))) 
"../../../gcc-trunk/libgcc/fixed-bit.c":831 196 {*arm_cmpsi_insn}
      (nil))


Note that the CC register is not really set as implied by insn 110,
because the C flag depends on the comparison of r1 with the constant
-1 (0xFFFFFFFF) and on the carry flag from insn 109.  Therefore in the
postreload pass insn 107 appears to be unnecessary, as it should
compute exactly the same CC value as insn 110, i.e. one that does not
depend on the previous CC flag.  I think all carryin_compare patterns
are wrong because they do not describe the true value of the CC reg.

I think the CC reg actually depends on the left, right and CC-in
values: as in my new version of the patch, it must be described as a
computation in DI mode, where no overflow can occur.
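
The dependence can be modeled in plain C (a hedged sketch, not GCC or libgcc code): the carry flag produced by an sbcs instruction is a function of both operands and the incoming carry, which is exactly what the zero_extend:DI / plus:DI compare in the patched patterns expresses.

```c
#include <stdint.h>

/* Hypothetical model of the carry (borrow) flag after ARM
   "sbcs rd, rn, op2": the flag is set iff no borrow occurred, i.e.
   iff (u64)rn >= (u64)op2 + !carry_in.  Computed in 64 bits, the
   right-hand side cannot overflow, matching the DImode formulation
   in the patch.  */
static inline int sbcs_carry_out(uint32_t rn, uint32_t op2, int carry_in)
{
    uint64_t rhs = (uint64_t)op2 + (carry_in ? 0 : 1);
    return (uint64_t)rn >= rhs;
}
```

With equal operands, sbcs_carry_out(5, 5, 1) is 1 while sbcs_carry_out(5, 5, 0) is 0: the same operand pair yields different flags for different carry-in, so a pattern that describes the CC register as a plain compare of the two operands lets postreload wrongly delete a later genuine compare, as happened to insn 107 above.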

I attached an update of the follow-up patch which is not yet adjusted
for your pending negdi patch.  Reg-testing is not yet done, but the
mis-compilation in libgcc is fixed at least.

What do you think?


Thanks
Bernd.

Comments

Richard Earnshaw (lists) Jan. 11, 2017, 4:55 p.m. UTC | #1
On 08/12/16 19:50, Bernd Edlinger wrote:
> Hi Wilco,
>
> when testing the follow-up patch I discovered a single regression
> in gcc.dg/fixed-point/convert-sat.c that was caused by a mis-compilation
> of the libgcc function __gnu_satfractdasq.
> [...]
> What do you think?

Sorry for the delay getting around to this.

I just tried this patch and found that it doesn't apply.  Furthermore,
there's not enough context in the rejected hunks for me to be certain
which patterns you're trying to fix up.

Could you do an update please?

R.

Bernd Edlinger Jan. 11, 2017, 5:18 p.m. UTC | #2
On 01/11/17 17:55, Richard Earnshaw (lists) wrote:
>
> Sorry for the delay getting around to this.
>
> I just tried this patch and found that it doesn't apply.  Furthermore,
> there's not enough context in the rejected hunks for me to be certain
> which patterns you're trying to fix up.
>
> Could you do an update please?
>

Sure, I just gave up pinging, as we are rather late in stage 3
already...

So the current status is this:

I have the invalid code issue here; it is independent of the
optimization issues:

[PATCH, ARM] correctly encode the CC reg data flow
https://gcc.gnu.org/ml/gcc-patches/2016-12/msg01562.html

Then I have the patch for splitting the most important
64bit patterns here:

[PATCH, ARM] Further improve stack usage on sha512 (PR 77308)
https://gcc.gnu.org/ml/gcc-patches/2016-11/msg02796.html

and the follow-up patch that triggered the invalid code here:

[PATCH, ARM] Further improve stack usage in sha512, part 2 (PR 77308)
https://gcc.gnu.org/ml/gcc-patches/2016-12/msg01563.html

In the last part I initially had this hunk:
-    operands[2] = gen_lowpart (SImode, operands[2]);
+    if (can_create_pseudo_p ())
+      operands[2] = gen_reg_rtx (SImode);
+    else
+      operands[2] = gen_lowpart (SImode, operands[2]);

As Wilco pointed out, the else part is superfluous, so I have already
removed the gen_lowpart fallback locally.

All three parts should apply to trunk, only the last part
depends on both earlier patches.


Thanks
Bernd.

Patch

2016-12-08  Bernd Edlinger  <bernd.edlinger@hotmail.de>

	PR target/77308
	* config/arm/arm.md (subdi3_compare1, subsi3_carryin_compare,
	subsi3_carryin_compare_const, negdi2_compare): Fix the CC reg dataflow.
	(*arm_negdi2, *arm_cmpdi_unsigned): Split early except for
	TARGET_NEON and TARGET_IWMMXT.
	(*arm_cmpdi_insn): Split early except for
	TARGET_NEON and TARGET_IWMMXT.  Fix the CC reg dataflow.
	* config/arm/thumb2.md (*thumb2_negdi2): Split early except for
	TARGET_NEON and TARGET_IWMMXT.

testsuite:
2016-12-08  Bernd Edlinger  <bernd.edlinger@hotmail.de>

	PR target/77308
	* gcc.target/arm/pr77308-2.c: New test.

--- gcc/config/arm/arm.md.orig	2016-12-08 16:01:43.290595127 +0100
+++ gcc/config/arm/arm.md	2016-12-08 19:04:22.251065848 +0100
@@ -1086,8 +1086,8 @@ 
 })
 
 (define_insn_and_split "subdi3_compare1"
-  [(set (reg:CC CC_REGNUM)
-	(compare:CC
+  [(set (reg:CC_NCV CC_REGNUM)
+	(compare:CC_NCV
 	  (match_operand:DI 1 "register_operand" "r")
 	  (match_operand:DI 2 "register_operand" "r")))
    (set (match_operand:DI 0 "register_operand" "=&r")
@@ -1098,10 +1098,15 @@ 
   [(parallel [(set (reg:CC CC_REGNUM)
 		   (compare:CC (match_dup 1) (match_dup 2)))
 	      (set (match_dup 0) (minus:SI (match_dup 1) (match_dup 2)))])
-   (parallel [(set (reg:CC CC_REGNUM)
-		   (compare:CC (match_dup 4) (match_dup 5)))
-	     (set (match_dup 3) (minus:SI (minus:SI (match_dup 4) (match_dup 5))
-			       (ltu:SI (reg:CC_C CC_REGNUM) (const_int 0))))])]
+   (parallel [(set (reg:CC_C CC_REGNUM)
+		   (compare:CC_C
+		     (zero_extend:DI (match_dup 4))
+		     (plus:DI
+		       (zero_extend:DI (match_dup 5))
+		       (ltu:DI (reg:CC_C CC_REGNUM) (const_int 0)))))
+	      (set (match_dup 3)
+		   (minus:SI (minus:SI (match_dup 4) (match_dup 5))
+			     (ltu:SI (reg:CC_C CC_REGNUM) (const_int 0))))])]
   {
     operands[3] = gen_highpart (SImode, operands[0]);
     operands[0] = gen_lowpart (SImode, operands[0]);
@@ -1156,13 +1161,15 @@ 
 )
 
 (define_insn "*subsi3_carryin_compare"
-  [(set (reg:CC CC_REGNUM)
-        (compare:CC (match_operand:SI 1 "s_register_operand" "r")
-                    (match_operand:SI 2 "s_register_operand" "r")))
+  [(set (reg:CC_C CC_REGNUM)
+	(compare:CC_C
+	  (zero_extend:DI (match_operand:SI 1 "s_register_operand" "r"))
+	  (plus:DI
+	    (zero_extend:DI (match_operand:SI 2 "s_register_operand" "r"))
+	    (ltu:DI (reg:CC_C CC_REGNUM) (const_int 0)))))
    (set (match_operand:SI 0 "s_register_operand" "=r")
-        (minus:SI (minus:SI (match_dup 1)
-                            (match_dup 2))
-                  (ltu:SI (reg:CC_C CC_REGNUM) (const_int 0))))]
+	(minus:SI (minus:SI (match_dup 1) (match_dup 2))
+		  (ltu:SI (reg:CC_C CC_REGNUM) (const_int 0))))]
   "TARGET_32BIT"
   "sbcs\\t%0, %1, %2"
   [(set_attr "conds" "set")
@@ -1170,12 +1177,14 @@ 
 )
 
 (define_insn "*subsi3_carryin_compare_const"
-  [(set (reg:CC CC_REGNUM)
-        (compare:CC (match_operand:SI 1 "reg_or_int_operand" "r")
-                    (match_operand:SI 2 "arm_not_operand" "K")))
+  [(set (reg:CC_C CC_REGNUM)
+	(compare:CC_C
+	  (zero_extend:DI (match_operand:SI 1 "reg_or_int_operand" "r"))
+	  (plus:DI
+	    (zero_extend:DI (match_operand:SI 2 "arm_not_operand" "K"))
+	    (ltu:DI (reg:CC_C CC_REGNUM) (const_int 0)))))
    (set (match_operand:SI 0 "s_register_operand" "=r")
-        (minus:SI (plus:SI (match_dup 1)
-                           (match_dup 2))
+        (minus:SI (plus:SI (match_dup 1) (match_dup 2))
                   (ltu:SI (reg:CC_C CC_REGNUM) (const_int 0))))]
   "TARGET_32BIT"
   "sbcs\\t%0, %1, #%B2"
@@ -4684,8 +4693,8 @@ 
 
 
 (define_insn_and_split "negdi2_compare"
-  [(set (reg:CC CC_REGNUM)
-	(compare:CC
+  [(set (reg:CC_NCV CC_REGNUM)
+	(compare:CC_NCV
 	  (const_int 0)
 	  (match_operand:DI 1 "register_operand" "0,r")))
    (set (match_operand:DI 0 "register_operand" "=r,&r")
@@ -4697,8 +4706,12 @@ 
 		   (compare:CC (const_int 0) (match_dup 1)))
 	      (set (match_dup 0) (minus:SI (const_int 0)
 					   (match_dup 1)))])
-   (parallel [(set (reg:CC CC_REGNUM)
-		   (compare:CC (const_int 0) (match_dup 3)))
+   (parallel [(set (reg:CC_C CC_REGNUM)
+		   (compare:CC_C
+		     (const_int 0)
+		     (plus:DI
+		       (zero_extend:DI (match_dup 3))
+		       (ltu:DI (reg:CC_C CC_REGNUM) (const_int 0)))))
 	     (set (match_dup 2)
 		  (minus:SI
 		   (minus:SI (const_int 0) (match_dup 3))
@@ -4738,7 +4751,7 @@ 
    (clobber (reg:CC CC_REGNUM))]
   "TARGET_ARM"
   "#"   ; "rsbs\\t%Q0, %Q1, #0\;rsc\\t%R0, %R1, #0"
-  "&& reload_completed"
+  "&& ((!TARGET_NEON && !TARGET_IWMMXT) || reload_completed)"
   [(parallel [(set (reg:CC CC_REGNUM)
 		   (compare:CC (const_int 0) (match_dup 1)))
 	      (set (match_dup 0) (minus:SI (const_int 0) (match_dup 1)))])
@@ -4756,12 +4769,14 @@ 
 )
 
 (define_insn "*negsi2_carryin_compare"
-  [(set (reg:CC CC_REGNUM)
-	(compare:CC (const_int 0)
-		    (match_operand:SI 1 "s_register_operand" "r")))
+  [(set (reg:CC_C CC_REGNUM)
+	(compare:CC_C
+	  (const_int 0)
+	  (plus:DI
+	    (zero_extend:DI (match_operand:SI 1 "s_register_operand" "r"))
+	    (ltu:DI (reg:CC_C CC_REGNUM) (const_int 0)))))
    (set (match_operand:SI 0 "s_register_operand" "=r")
-	(minus:SI (minus:SI (const_int 0)
-			    (match_dup 1))
+	(minus:SI (minus:SI (const_int 0) (match_dup 1))
 		  (ltu:SI (reg:CC_C CC_REGNUM) (const_int 0))))]
   "TARGET_ARM"
   "rscs\\t%0, %1, #0"
@@ -7432,14 +7447,17 @@ 
    (clobber (match_scratch:SI 2 "=r"))]
   "TARGET_32BIT"
   "#"   ; "cmp\\t%Q0, %Q1\;sbcs\\t%2, %R0, %R1"
-  "&& reload_completed"
+  "&& ((!TARGET_NEON && !TARGET_IWMMXT) || reload_completed)"
   [(set (reg:CC CC_REGNUM)
-        (compare:CC (match_dup 0) (match_dup 1)))
-   (parallel [(set (reg:CC CC_REGNUM)
-                   (compare:CC (match_dup 3) (match_dup 4)))
-              (set (match_dup 2)
-                   (minus:SI (match_dup 5)
-                            (ltu:SI (reg:CC_C CC_REGNUM) (const_int 0))))])]
+	(compare:CC (match_dup 0) (match_dup 1)))
+   (parallel [(set (reg:CC_C CC_REGNUM)
+		   (compare:CC_C
+		     (zero_extend:DI (match_dup 3))
+		     (plus:DI (zero_extend:DI (match_dup 4))
+			      (ltu:DI (reg:CC_C CC_REGNUM) (const_int 0)))))
+	      (set (match_dup 2)
+		   (minus:SI (match_dup 5)
+			     (ltu:SI (reg:CC_C CC_REGNUM) (const_int 0))))])]
   {
     operands[3] = gen_highpart (SImode, operands[0]);
     operands[0] = gen_lowpart (SImode, operands[0]);
@@ -7456,7 +7474,10 @@ 
         operands[5] = gen_rtx_MINUS (SImode, operands[3], operands[4]);
       }
     operands[1] = gen_lowpart (SImode, operands[1]);
-    operands[2] = gen_lowpart (SImode, operands[2]);
+    if (can_create_pseudo_p ())
+      operands[2] = gen_reg_rtx (SImode);
+    else
+      operands[2] = gen_lowpart (SImode, operands[2]);
   }
   [(set_attr "conds" "set")
    (set_attr "length" "8")
@@ -7470,7 +7491,7 @@ 
 
   "TARGET_32BIT"
   "#"   ; "cmp\\t%R0, %R1\;it eq\;cmpeq\\t%Q0, %Q1"
-  "&& reload_completed"
+  "&& ((!TARGET_NEON && !TARGET_IWMMXT) || reload_completed)"
   [(set (reg:CC CC_REGNUM)
         (compare:CC (match_dup 2) (match_dup 3)))
    (cond_exec (eq:SI (reg:CC CC_REGNUM) (const_int 0))
--- gcc/config/arm/thumb2.md.orig	2016-12-08 16:00:59.017597265 +0100
+++ gcc/config/arm/thumb2.md	2016-12-08 16:02:38.591592456 +0100
@@ -132,7 +132,7 @@ 
    (clobber (reg:CC CC_REGNUM))]
   "TARGET_THUMB2"
   "#" ; negs\\t%Q0, %Q1\;sbc\\t%R0, %R1, %R1, lsl #1
-  "&& reload_completed"
+  "&& (!TARGET_NEON || reload_completed)"
   [(parallel [(set (reg:CC CC_REGNUM)
 		   (compare:CC (const_int 0) (match_dup 1)))
 	      (set (match_dup 0) (minus:SI (const_int 0) (match_dup 1)))])
--- /dev/null	2016-12-08 15:50:45.426271450 +0100
+++ gcc/testsuite/gcc.target/arm/pr77308-2.c	2016-12-08 16:02:38.591592456 +0100
@@ -0,0 +1,169 @@ 
+/* { dg-do compile } */
+/* { dg-options "-Os -Wstack-usage=2500" } */
+
+/* This is a modified SHA-512 algorithm with 64-bit compare and negate
+   operations at the Sigma blocks.  It improves the test coverage of the
+   cmpdi and negdi2 patterns.  Unlike the original test case, these insns
+   can reach the reload pass, which may result in large stack usage.  */
+
+#define SHA_LONG64 unsigned long long
+#define U64(C)     C##ULL
+
+#define SHA_LBLOCK      16
+#define SHA512_CBLOCK   (SHA_LBLOCK*8)
+
+typedef struct SHA512state_st {
+    SHA_LONG64 h[8];
+    SHA_LONG64 Nl, Nh;
+    union {
+        SHA_LONG64 d[SHA_LBLOCK];
+        unsigned char p[SHA512_CBLOCK];
+    } u;
+    unsigned int num, md_len;
+} SHA512_CTX;
+
+static const SHA_LONG64 K512[80] = {
+    U64(0x428a2f98d728ae22), U64(0x7137449123ef65cd),
+    U64(0xb5c0fbcfec4d3b2f), U64(0xe9b5dba58189dbbc),
+    U64(0x3956c25bf348b538), U64(0x59f111f1b605d019),
+    U64(0x923f82a4af194f9b), U64(0xab1c5ed5da6d8118),
+    U64(0xd807aa98a3030242), U64(0x12835b0145706fbe),
+    U64(0x243185be4ee4b28c), U64(0x550c7dc3d5ffb4e2),
+    U64(0x72be5d74f27b896f), U64(0x80deb1fe3b1696b1),
+    U64(0x9bdc06a725c71235), U64(0xc19bf174cf692694),
+    U64(0xe49b69c19ef14ad2), U64(0xefbe4786384f25e3),
+    U64(0x0fc19dc68b8cd5b5), U64(0x240ca1cc77ac9c65),
+    U64(0x2de92c6f592b0275), U64(0x4a7484aa6ea6e483),
+    U64(0x5cb0a9dcbd41fbd4), U64(0x76f988da831153b5),
+    U64(0x983e5152ee66dfab), U64(0xa831c66d2db43210),
+    U64(0xb00327c898fb213f), U64(0xbf597fc7beef0ee4),
+    U64(0xc6e00bf33da88fc2), U64(0xd5a79147930aa725),
+    U64(0x06ca6351e003826f), U64(0x142929670a0e6e70),
+    U64(0x27b70a8546d22ffc), U64(0x2e1b21385c26c926),
+    U64(0x4d2c6dfc5ac42aed), U64(0x53380d139d95b3df),
+    U64(0x650a73548baf63de), U64(0x766a0abb3c77b2a8),
+    U64(0x81c2c92e47edaee6), U64(0x92722c851482353b),
+    U64(0xa2bfe8a14cf10364), U64(0xa81a664bbc423001),
+    U64(0xc24b8b70d0f89791), U64(0xc76c51a30654be30),
+    U64(0xd192e819d6ef5218), U64(0xd69906245565a910),
+    U64(0xf40e35855771202a), U64(0x106aa07032bbd1b8),
+    U64(0x19a4c116b8d2d0c8), U64(0x1e376c085141ab53),
+    U64(0x2748774cdf8eeb99), U64(0x34b0bcb5e19b48a8),
+    U64(0x391c0cb3c5c95a63), U64(0x4ed8aa4ae3418acb),
+    U64(0x5b9cca4f7763e373), U64(0x682e6ff3d6b2b8a3),
+    U64(0x748f82ee5defb2fc), U64(0x78a5636f43172f60),
+    U64(0x84c87814a1f0ab72), U64(0x8cc702081a6439ec),
+    U64(0x90befffa23631e28), U64(0xa4506cebde82bde9),
+    U64(0xbef9a3f7b2c67915), U64(0xc67178f2e372532b),
+    U64(0xca273eceea26619c), U64(0xd186b8c721c0c207),
+    U64(0xeada7dd6cde0eb1e), U64(0xf57d4f7fee6ed178),
+    U64(0x06f067aa72176fba), U64(0x0a637dc5a2c898a6),
+    U64(0x113f9804bef90dae), U64(0x1b710b35131c471b),
+    U64(0x28db77f523047d84), U64(0x32caab7b40c72493),
+    U64(0x3c9ebe0a15c9bebc), U64(0x431d67c49c100d4c),
+    U64(0x4cc5d4becb3e42b6), U64(0x597f299cfc657e2a),
+    U64(0x5fcb6fab3ad6faec), U64(0x6c44198c4a475817)
+};
+
+#define B(x,j)    (((SHA_LONG64)(*(((const unsigned char *)(&x))+j)))<<((7-j)*8))
+#define PULL64(x) (B(x,0)|B(x,1)|B(x,2)|B(x,3)|B(x,4)|B(x,5)|B(x,6)|B(x,7))
+#define ROTR(x,s)       (((x)>>s) | (x)<<(64-s))
+#define Sigma0(x)       (ROTR((x),28) ^ ROTR((x),34) ^ (ROTR((x),39) == (x)) ? -(x) : (x))
+#define Sigma1(x)       (ROTR((x),14) ^ ROTR(-(x),18) ^ ((long long)ROTR((x),41) < (long long)(x)) ? -(x) : (x))
+#define sigma0(x)       (ROTR((x),1)  ^ ROTR((x),8)  ^ (((x)>>7) > (x)) ? -(x) : (x))
+#define sigma1(x)       (ROTR((x),19) ^ ROTR((x),61) ^ ((long long)((x)>>6) < (long long)(x)) ? -(x) : (x))
+#define Ch(x,y,z)       (((x) & (y)) ^ ((~(x)) & (z)))
+#define Maj(x,y,z)      (((x) & (y)) ^ ((x) & (z)) ^ ((y) & (z)))
+
+#define ROUND_00_15(i,a,b,c,d,e,f,g,h)          do {    \
+        T1 += h + Sigma1(e) + Ch(e,f,g) + K512[i];      \
+        h = Sigma0(a) + Maj(a,b,c);                     \
+        d += T1;        h += T1;                } while (0)
+#define ROUND_16_80(i,j,a,b,c,d,e,f,g,h,X)      do {    \
+        s0 = X[(j+1)&0x0f];     s0 = sigma0(s0);        \
+        s1 = X[(j+14)&0x0f];    s1 = sigma1(s1);        \
+        T1 = X[(j)&0x0f] += s0 + s1 + X[(j+9)&0x0f];    \
+        ROUND_00_15(i+j,a,b,c,d,e,f,g,h);               } while (0)
+void sha512_block_data_order(SHA512_CTX *ctx, const void *in,
+                                    unsigned int num)
+{
+    const SHA_LONG64 *W = in;
+    SHA_LONG64 a, b, c, d, e, f, g, h, s0, s1, T1;
+    SHA_LONG64 X[16];
+    int i;
+
+    while (num--) {
+
+        a = ctx->h[0];
+        b = ctx->h[1];
+        c = ctx->h[2];
+        d = ctx->h[3];
+        e = ctx->h[4];
+        f = ctx->h[5];
+        g = ctx->h[6];
+        h = ctx->h[7];
+
+        T1 = X[0] = PULL64(W[0]);
+        ROUND_00_15(0, a, b, c, d, e, f, g, h);
+        T1 = X[1] = PULL64(W[1]);
+        ROUND_00_15(1, h, a, b, c, d, e, f, g);
+        T1 = X[2] = PULL64(W[2]);
+        ROUND_00_15(2, g, h, a, b, c, d, e, f);
+        T1 = X[3] = PULL64(W[3]);
+        ROUND_00_15(3, f, g, h, a, b, c, d, e);
+        T1 = X[4] = PULL64(W[4]);
+        ROUND_00_15(4, e, f, g, h, a, b, c, d);
+        T1 = X[5] = PULL64(W[5]);
+        ROUND_00_15(5, d, e, f, g, h, a, b, c);
+        T1 = X[6] = PULL64(W[6]);
+        ROUND_00_15(6, c, d, e, f, g, h, a, b);
+        T1 = X[7] = PULL64(W[7]);
+        ROUND_00_15(7, b, c, d, e, f, g, h, a);
+        T1 = X[8] = PULL64(W[8]);
+        ROUND_00_15(8, a, b, c, d, e, f, g, h);
+        T1 = X[9] = PULL64(W[9]);
+        ROUND_00_15(9, h, a, b, c, d, e, f, g);
+        T1 = X[10] = PULL64(W[10]);
+        ROUND_00_15(10, g, h, a, b, c, d, e, f);
+        T1 = X[11] = PULL64(W[11]);
+        ROUND_00_15(11, f, g, h, a, b, c, d, e);
+        T1 = X[12] = PULL64(W[12]);
+        ROUND_00_15(12, e, f, g, h, a, b, c, d);
+        T1 = X[13] = PULL64(W[13]);
+        ROUND_00_15(13, d, e, f, g, h, a, b, c);
+        T1 = X[14] = PULL64(W[14]);
+        ROUND_00_15(14, c, d, e, f, g, h, a, b);
+        T1 = X[15] = PULL64(W[15]);
+        ROUND_00_15(15, b, c, d, e, f, g, h, a);
+
+        for (i = 16; i < 80; i += 16) {
+            ROUND_16_80(i, 0, a, b, c, d, e, f, g, h, X);
+            ROUND_16_80(i, 1, h, a, b, c, d, e, f, g, X);
+            ROUND_16_80(i, 2, g, h, a, b, c, d, e, f, X);
+            ROUND_16_80(i, 3, f, g, h, a, b, c, d, e, X);
+            ROUND_16_80(i, 4, e, f, g, h, a, b, c, d, X);
+            ROUND_16_80(i, 5, d, e, f, g, h, a, b, c, X);
+            ROUND_16_80(i, 6, c, d, e, f, g, h, a, b, X);
+            ROUND_16_80(i, 7, b, c, d, e, f, g, h, a, X);
+            ROUND_16_80(i, 8, a, b, c, d, e, f, g, h, X);
+            ROUND_16_80(i, 9, h, a, b, c, d, e, f, g, X);
+            ROUND_16_80(i, 10, g, h, a, b, c, d, e, f, X);
+            ROUND_16_80(i, 11, f, g, h, a, b, c, d, e, X);
+            ROUND_16_80(i, 12, e, f, g, h, a, b, c, d, X);
+            ROUND_16_80(i, 13, d, e, f, g, h, a, b, c, X);
+            ROUND_16_80(i, 14, c, d, e, f, g, h, a, b, X);
+            ROUND_16_80(i, 15, b, c, d, e, f, g, h, a, X);
+        }
+
+        ctx->h[0] += a;
+        ctx->h[1] += b;
+        ctx->h[2] += c;
+        ctx->h[3] += d;
+        ctx->h[4] += e;
+        ctx->h[5] += f;
+        ctx->h[6] += g;
+        ctx->h[7] += h;
+
+        W += SHA_LBLOCK;
+    }
+}
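
For readers following the negdi2 splitter above: the RSBS/RSC pair it emits negates a 64-bit value one word at a time, propagating the borrow from the low word into the high word through the carry flag.  A minimal portable-C sketch of that arithmetic (illustrative only, not part of the patch; `negdi2_by_words` is a made-up name):

```c
#include <stdint.h>

/* Negate a 64-bit value using only 32-bit operations, mirroring the
   "rsbs %Q0, %Q1, #0 ; rsc %R0, %R1, #0" sequence: the low word is
   negated first, and the borrow it produces is subtracted from the
   negated high word.  On ARM the C flag holds the *inverse* of the
   borrow after a subtraction, which is why RSC subtracts NOT(C).  */
static uint64_t negdi2_by_words(uint64_t x)
{
    uint32_t lo = (uint32_t)x;
    uint32_t hi = (uint32_t)(x >> 32);

    uint32_t nlo    = 0u - lo;        /* rsbs %Q0, %Q1, #0            */
    uint32_t borrow = (lo != 0);      /* borrow out of the low word   */
    uint32_t nhi    = 0u - hi - borrow; /* rsc %R0, %R1, #0           */

    return ((uint64_t)nhi << 32) | nlo;
}
```

This is why the split patterns must keep the flag-setting low-word operation and the carry-consuming high-word operation adjacent: anything clobbering CC between them destroys the borrow.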