From patchwork Mon Nov 28 19:42:17 2016
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Bernd Edlinger
X-Patchwork-Id: 84442
Delivered-To: mailing list gcc-patches@gcc.gnu.org
From: Bernd Edlinger
To: Ramana Radhakrishnan
CC: GCC Patches, Kyrill Tkachov, Richard Earnshaw, Wilco Dijkstra
Subject: Re: [PATCH, ARM] Further improve stack usage on sha512 (PR 77308)
Date: Mon, 28 Nov 2016 19:42:17 +0000

On 11/25/16 12:30, Ramana Radhakrishnan wrote:
> On Sun, Nov 6, 2016 at 2:18 PM, Bernd Edlinger wrote:
>> Hi!
>>
>> This improves the stack usage on the sha512 test case for the case
>> without hardware fpu and without iwmmxt by splitting all di-mode
>> patterns right while expanding which is similar to what the shift-pattern
>> does. It does nothing in the case iwmmxt and fpu=neon or vfp as well as
>> thumb1.
>>
>
> I would go further and do this in the absence of Neon, the VFP unit
> being there doesn't help with DImode operations i.e. we do not have 64
> bit integer arithmetic instructions without Neon. The main reason why
> we have the DImode patterns split so late is to give a chance for
> folks who want to do 64 bit arithmetic in Neon a chance to make this
> work as well as support some of the 64 bit Neon intrinsics which IIRC
> map down to these instructions. Doing this just for soft-float doesn't
> improve the default case only. I don't usually test iwmmxt and I'm not
> sure who has the ability to do so, thus keeping this restriction for
> iwMMX is fine.
>
>

Yes I understand, thanks for pointing that out.

I was not aware that iwmmxt exists at all, but I noticed that most
64-bit expansions work completely differently there, and would break if
we split the pattern early.

I can however only look at the assembler output for iwmmxt, and make
sure that the stack usage does not get worse.

Thus the new version of the patch leaves only thumb1, neon and iwmmxt
as they are: around 1570 (thumb1), 2300 (neon) and 2200 (iwmmxt) bytes
of stack for the test cases, while vfp and soft-float come out at
around 270 bytes of stack usage.

>> It reduces the stack usage from 2300 to near optimal 272 bytes (!).
>>
>> Note this also splits many ldrd/strd instructions and therefore I will
>> post a followup-patch that mitigates this effect by enabling the ldrd/strd
>> peephole optimization after the necessary reg-testing.
>>
>>
>> Bootstrapped and reg-tested on arm-linux-gnueabihf.
>
> What do you mean by arm-linux-gnueabihf - when folks say that I
> interpret it as --with-arch=armv7-a --with-float=hard
> --with-fpu=vfpv3-d16 or (--with-fpu=neon).
>
> If you've really bootstrapped and regtested it on armhf, doesn't this
> patch as it stand have no effect there i.e. no change ?
> arm-linux-gnueabihf usually means to me someone has configured with
> --with-float=hard, so there are no regressions in the hard float ABI
> case,
>

I know it proves little. When I say arm-linux-gnueabihf
I do in fact mean --enable-languages=all,ada,go,obj-c++
--with-arch=armv7-a --with-tune=cortex-a9 --with-fpu=vfpv3-d16
--with-float=hard.

My main interest in the stack usage is of course not because of linux,
but because of eCos where we have very small task stacks and in fact
no fpu support by the O/S at all, so that patch is exactly what we need.


Bootstrapped and reg-tested on arm-linux-gnueabihf
Is it OK for trunk?


Thanks
Bernd.

2016-11-25 Bernd Edlinger

        PR target/77308
        * config/arm/arm.md (*arm_adddi3, *arm_subdi3): Split early except for
        TARGET_NEON and TARGET_IWMMXT.
        (anddi3, iordi3, xordi3, one_cmpldi2): Split while expanding except for
        TARGET_NEON and TARGET_IWMMXT.
        (*one_cmpldi2_insn): Moved the body of one_cmpldi2 here.

testsuite:
2016-11-25 Bernd Edlinger

        PR target/77308
        * gcc.target/arm/pr77308-1.c: New test.

Index: gcc/config/arm/arm.md
===================================================================
--- gcc/config/arm/arm.md (revision 242875)
+++ gcc/config/arm/arm.md (working copy)
@@ -467,7 +467,7 @@
    (clobber (reg:CC CC_REGNUM))]
   "TARGET_32BIT && !TARGET_NEON"
   "#"
-  "TARGET_32BIT && reload_completed
+  "TARGET_32BIT && ((!TARGET_NEON && !TARGET_IWMMXT) || reload_completed)
    && ! (TARGET_NEON && IS_VFP_REGNUM (REGNO (operands[0])))"
   [(parallel [(set (reg:CC_C CC_REGNUM)
                    (compare:CC_C (plus:SI (match_dup 1) (match_dup 2))
@@ -1272,7 +1272,7 @@
    (clobber (reg:CC CC_REGNUM))]
   "TARGET_32BIT && !TARGET_NEON"
   "#" ; "subs\\t%Q0, %Q1, %Q2\;sbc\\t%R0, %R1, %R2"
-  "&& reload_completed"
+  "&& (!TARGET_IWMMXT || reload_completed)"
   [(parallel [(set (reg:CC CC_REGNUM)
                    (compare:CC (match_dup 1) (match_dup 2)))
               (set (match_dup 0) (minus:SI (match_dup 1) (match_dup 2)))])
@@ -2258,7 +2258,24 @@
         (and:DI (match_operand:DI 1 "s_register_operand" "")
                 (match_operand:DI 2 "neon_inv_logic_op2" "")))]
   "TARGET_32BIT"
-  ""
+  "
+  if (!TARGET_NEON && !TARGET_IWMMXT)
+    {
+      rtx low = simplify_gen_binary (AND, SImode,
+                                     gen_lowpart (SImode, operands[1]),
+                                     gen_lowpart (SImode, operands[2]));
+      rtx high = simplify_gen_binary (AND, SImode,
+                                      gen_highpart (SImode, operands[1]),
+                                      gen_highpart_mode (SImode, DImode,
+                                                         operands[2]));
+
+      emit_insn (gen_rtx_SET (gen_lowpart (SImode, operands[0]), low));
+      emit_insn (gen_rtx_SET (gen_highpart (SImode, operands[0]), high));
+
+      DONE;
+    }
+  /* Otherwise expand pattern as above. */
+  "
 )
 
 (define_insn_and_split "*anddi3_insn"
@@ -3131,7 +3148,24 @@
         (ior:DI (match_operand:DI 1 "s_register_operand" "")
                 (match_operand:DI 2 "neon_logic_op2" "")))]
   "TARGET_32BIT"
-  ""
+  "
+  if (!TARGET_NEON && !TARGET_IWMMXT)
+    {
+      rtx low = simplify_gen_binary (IOR, SImode,
+                                     gen_lowpart (SImode, operands[1]),
+                                     gen_lowpart (SImode, operands[2]));
+      rtx high = simplify_gen_binary (IOR, SImode,
+                                      gen_highpart (SImode, operands[1]),
+                                      gen_highpart_mode (SImode, DImode,
+                                                         operands[2]));
+
+      emit_insn (gen_rtx_SET (gen_lowpart (SImode, operands[0]), low));
+      emit_insn (gen_rtx_SET (gen_highpart (SImode, operands[0]), high));
+
+      DONE;
+    }
+  /* Otherwise expand pattern as above. */
+  "
 )
 
 (define_insn_and_split "*iordi3_insn"
@@ -3312,7 +3346,24 @@
         (xor:DI (match_operand:DI 1 "s_register_operand" "")
                 (match_operand:DI 2 "arm_xordi_operand" "")))]
   "TARGET_32BIT"
-  ""
+  "
+  if (!TARGET_NEON && !TARGET_IWMMXT)
+    {
+      rtx low = simplify_gen_binary (XOR, SImode,
+                                     gen_lowpart (SImode, operands[1]),
+                                     gen_lowpart (SImode, operands[2]));
+      rtx high = simplify_gen_binary (XOR, SImode,
+                                      gen_highpart (SImode, operands[1]),
+                                      gen_highpart_mode (SImode, DImode,
+                                                         operands[2]));
+
+      emit_insn (gen_rtx_SET (gen_lowpart (SImode, operands[0]), low));
+      emit_insn (gen_rtx_SET (gen_highpart (SImode, operands[0]), high));
+
+      DONE;
+    }
+  /* Otherwise expand pattern as above. */
+  "
 )
 
 (define_insn_and_split "*xordi3_insn"
@@ -5022,7 +5073,31 @@
   "TARGET_32BIT && TARGET_HARD_FLOAT && TARGET_VFP_DOUBLE"
   "")
 
-(define_insn_and_split "one_cmpldi2"
+(define_expand "one_cmpldi2"
+  [(set (match_operand:DI 0 "s_register_operand" "")
+        (not:DI (match_operand:DI 1 "s_register_operand" "")))]
+  "TARGET_32BIT"
+  "
+  if (!TARGET_NEON && !TARGET_IWMMXT)
+    {
+      rtx low = simplify_gen_unary (NOT, SImode,
+                                    gen_lowpart (SImode, operands[1]),
+                                    SImode);
+      rtx high = simplify_gen_unary (NOT, SImode,
+                                     gen_highpart_mode (SImode, DImode,
+                                                        operands[1]),
+                                     SImode);
+
+      emit_insn (gen_rtx_SET (gen_lowpart (SImode, operands[0]), low));
+      emit_insn (gen_rtx_SET (gen_highpart (SImode, operands[0]), high));
+
+      DONE;
+    }
+  /* Otherwise expand pattern as above. */
+  "
+)
+
+(define_insn_and_split "*one_cmpldi2_insn"
   [(set (match_operand:DI 0 "s_register_operand" "=w,&r,&r,?w")
         (not:DI (match_operand:DI 1 "s_register_operand" " w, 0, r, w")))]
   "TARGET_32BIT"
Index: gcc/testsuite/gcc.target/arm/pr77308-1.c
===================================================================
--- gcc/testsuite/gcc.target/arm/pr77308-1.c (revision 0)
+++ gcc/testsuite/gcc.target/arm/pr77308-1.c (working copy)
@@ -0,0 +1,169 @@
+/* { dg-do compile } */
+/* { dg-options "-Os -Wstack-usage=2500" } */
+
+/* This is a modified algorithm with bit-not "~" at the Sigma-blocks.
+   It improves the test coverage of one_cmpldi2 and subdi3 patterns.
+   Unlike the original test case these insns can reach the reload pass,
+   which may result in large stack usage. */
+
+#define SHA_LONG64 unsigned long long
+#define U64(C) C##ULL
+
+#define SHA_LBLOCK 16
+#define SHA512_CBLOCK (SHA_LBLOCK*8)
+
+typedef struct SHA512state_st {
+    SHA_LONG64 h[8];
+    SHA_LONG64 Nl, Nh;
+    union {
+        SHA_LONG64 d[SHA_LBLOCK];
+        unsigned char p[SHA512_CBLOCK];
+    } u;
+    unsigned int num, md_len;
+} SHA512_CTX;
+
+static const SHA_LONG64 K512[80] = {
+    U64(0x428a2f98d728ae22), U64(0x7137449123ef65cd),
+    U64(0xb5c0fbcfec4d3b2f), U64(0xe9b5dba58189dbbc),
+    U64(0x3956c25bf348b538), U64(0x59f111f1b605d019),
+    U64(0x923f82a4af194f9b), U64(0xab1c5ed5da6d8118),
+    U64(0xd807aa98a3030242), U64(0x12835b0145706fbe),
+    U64(0x243185be4ee4b28c), U64(0x550c7dc3d5ffb4e2),
+    U64(0x72be5d74f27b896f), U64(0x80deb1fe3b1696b1),
+    U64(0x9bdc06a725c71235), U64(0xc19bf174cf692694),
+    U64(0xe49b69c19ef14ad2), U64(0xefbe4786384f25e3),
+    U64(0x0fc19dc68b8cd5b5), U64(0x240ca1cc77ac9c65),
+    U64(0x2de92c6f592b0275), U64(0x4a7484aa6ea6e483),
+    U64(0x5cb0a9dcbd41fbd4), U64(0x76f988da831153b5),
+    U64(0x983e5152ee66dfab), U64(0xa831c66d2db43210),
+    U64(0xb00327c898fb213f), U64(0xbf597fc7beef0ee4),
+    U64(0xc6e00bf33da88fc2), U64(0xd5a79147930aa725),
+    U64(0x06ca6351e003826f), U64(0x142929670a0e6e70),
+    U64(0x27b70a8546d22ffc), U64(0x2e1b21385c26c926),
+    U64(0x4d2c6dfc5ac42aed), U64(0x53380d139d95b3df),
+    U64(0x650a73548baf63de), U64(0x766a0abb3c77b2a8),
+    U64(0x81c2c92e47edaee6), U64(0x92722c851482353b),
+    U64(0xa2bfe8a14cf10364), U64(0xa81a664bbc423001),
+    U64(0xc24b8b70d0f89791), U64(0xc76c51a30654be30),
+    U64(0xd192e819d6ef5218), U64(0xd69906245565a910),
+    U64(0xf40e35855771202a), U64(0x106aa07032bbd1b8),
+    U64(0x19a4c116b8d2d0c8), U64(0x1e376c085141ab53),
+    U64(0x2748774cdf8eeb99), U64(0x34b0bcb5e19b48a8),
+    U64(0x391c0cb3c5c95a63), U64(0x4ed8aa4ae3418acb),
+    U64(0x5b9cca4f7763e373), U64(0x682e6ff3d6b2b8a3),
+    U64(0x748f82ee5defb2fc), U64(0x78a5636f43172f60),
+    U64(0x84c87814a1f0ab72), U64(0x8cc702081a6439ec),
+    U64(0x90befffa23631e28), U64(0xa4506cebde82bde9),
+    U64(0xbef9a3f7b2c67915), U64(0xc67178f2e372532b),
+    U64(0xca273eceea26619c), U64(0xd186b8c721c0c207),
+    U64(0xeada7dd6cde0eb1e), U64(0xf57d4f7fee6ed178),
+    U64(0x06f067aa72176fba), U64(0x0a637dc5a2c898a6),
+    U64(0x113f9804bef90dae), U64(0x1b710b35131c471b),
+    U64(0x28db77f523047d84), U64(0x32caab7b40c72493),
+    U64(0x3c9ebe0a15c9bebc), U64(0x431d67c49c100d4c),
+    U64(0x4cc5d4becb3e42b6), U64(0x597f299cfc657e2a),
+    U64(0x5fcb6fab3ad6faec), U64(0x6c44198c4a475817)
+};
+
+#define B(x,j) (((SHA_LONG64)(*(((const unsigned char *)(&x))+j)))<<((7-j)*8))
+#define PULL64(x) (B(x,0)|B(x,1)|B(x,2)|B(x,3)|B(x,4)|B(x,5)|B(x,6)|B(x,7))
+#define ROTR(x,s) (((x)>>s) | (x)<<(64-s))
+#define Sigma0(x) ~(ROTR((x),28) ^ ROTR((x),34) ^ ROTR((x),39))
+#define Sigma1(x) ~(ROTR((x),14) ^ ROTR((x),18) ^ ROTR((x),41))
+#define sigma0(x) ~(ROTR((x),1) ^ ROTR((x),8) ^ ((x)>>7))
+#define sigma1(x) ~(ROTR((x),19) ^ ROTR((x),61) ^ ((x)>>6))
+#define Ch(x,y,z) (((x) & (y)) ^ ((~(x)) & (z)))
+#define Maj(x,y,z) (((x) & (y)) ^ ((x) & (z)) ^ ((y) & (z)))
+
+#define ROUND_00_15(i,a,b,c,d,e,f,g,h) do { \
+        T1 += h + Sigma1(e) + Ch(e,f,g) + K512[i]; \
+        h = Sigma0(a) + Maj(a,b,c); \
+        d += T1; h += T1; } while (0)
+#define ROUND_16_80(i,j,a,b,c,d,e,f,g,h,X) do { \
+        s0 = X[(j+1)&0x0f]; s0 = sigma0(s0); \
+        s1 = X[(j+14)&0x0f]; s1 = sigma1(s1); \
+        T1 = X[(j)&0x0f] += s0 + s1 + X[(j+9)&0x0f]; \
+        ROUND_00_15(i+j,a,b,c,d,e,f,g,h); } while (0)
+void sha512_block_data_order(SHA512_CTX *ctx, const void *in,
+                             unsigned int num)
+{
+    const SHA_LONG64 *W = in;
+    SHA_LONG64 a, b, c, d, e, f, g, h, s0, s1, T1;
+    SHA_LONG64 X[16];
+    int i;
+
+    while (num--) {
+
+        a = ctx->h[0];
+        b = ctx->h[1];
+        c = ctx->h[2];
+        d = ctx->h[3];
+        e = ctx->h[4];
+        f = ctx->h[5];
+        g = ctx->h[6];
+        h = ctx->h[7];
+
+        T1 = X[0] = PULL64(W[0]);
+        ROUND_00_15(0, a, b, c, d, e, f, g, h);
+        T1 = X[1] = PULL64(W[1]);
+        ROUND_00_15(1, h, a, b, c, d, e, f, g);
+        T1 = X[2] = PULL64(W[2]);
+        ROUND_00_15(2, g, h, a, b, c, d, e, f);
+        T1 = X[3] = PULL64(W[3]);
+        ROUND_00_15(3, f, g, h, a, b, c, d, e);
+        T1 = X[4] = PULL64(W[4]);
+        ROUND_00_15(4, e, f, g, h, a, b, c, d);
+        T1 = X[5] = PULL64(W[5]);
+        ROUND_00_15(5, d, e, f, g, h, a, b, c);
+        T1 = X[6] = PULL64(W[6]);
+        ROUND_00_15(6, c, d, e, f, g, h, a, b);
+        T1 = X[7] = PULL64(W[7]);
+        ROUND_00_15(7, b, c, d, e, f, g, h, a);
+        T1 = X[8] = PULL64(W[8]);
+        ROUND_00_15(8, a, b, c, d, e, f, g, h);
+        T1 = X[9] = PULL64(W[9]);
+        ROUND_00_15(9, h, a, b, c, d, e, f, g);
+        T1 = X[10] = PULL64(W[10]);
+        ROUND_00_15(10, g, h, a, b, c, d, e, f);
+        T1 = X[11] = PULL64(W[11]);
+        ROUND_00_15(11, f, g, h, a, b, c, d, e);
+        T1 = X[12] = PULL64(W[12]);
+        ROUND_00_15(12, e, f, g, h, a, b, c, d);
+        T1 = X[13] = PULL64(W[13]);
+        ROUND_00_15(13, d, e, f, g, h, a, b, c);
+        T1 = X[14] = PULL64(W[14]);
+        ROUND_00_15(14, c, d, e, f, g, h, a, b);
+        T1 = X[15] = PULL64(W[15]);
+        ROUND_00_15(15, b, c, d, e, f, g, h, a);
+
+        for (i = 16; i < 80; i += 16) {
+            ROUND_16_80(i, 0, a, b, c, d, e, f, g, h, X);
+            ROUND_16_80(i, 1, h, a, b, c, d, e, f, g, X);
+            ROUND_16_80(i, 2, g, h, a, b, c, d, e, f, X);
+            ROUND_16_80(i, 3, f, g, h, a, b, c, d, e, X);
+            ROUND_16_80(i, 4, e, f, g, h, a, b, c, d, X);
+            ROUND_16_80(i, 5, d, e, f, g, h, a, b, c, X);
+            ROUND_16_80(i, 6, c, d, e, f, g, h, a, b, X);
+            ROUND_16_80(i, 7, b, c, d, e, f, g, h, a, X);
+            ROUND_16_80(i, 8, a, b, c, d, e, f, g, h, X);
+            ROUND_16_80(i, 9, h, a, b, c, d, e, f, g, X);
+            ROUND_16_80(i, 10, g, h, a, b, c, d, e, f, X);
+            ROUND_16_80(i, 11, f, g, h, a, b, c, d, e, X);
+            ROUND_16_80(i, 12, e, f, g, h, a, b, c, d, X);
+            ROUND_16_80(i, 13, d, e, f, g, h, a, b, c, X);
+            ROUND_16_80(i, 14, c, d, e, f, g, h, a, b, X);
+            ROUND_16_80(i, 15, b, c, d, e, f, g, h, a, X);
+        }
+
+        ctx->h[0] += a;
+        ctx->h[1] += b;
+        ctx->h[2] += c;
+        ctx->h[3] += d;
+        ctx->h[4] += e;
+        ctx->h[5] += f;
+        ctx->h[6] += g;
+        ctx->h[7] += h;
+
+        W += SHA_LBLOCK;
+    }
+}
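
For readers who do not work with the expanders every day, here is a
minimal source-level sketch of what the anddi3, iordi3, xordi3 and
one_cmpldi2 changes rely on. It is not GCC code: the type di_pair and
the helpers di_split, di_join, di_and, di_ior, di_xor and di_not are
hypothetical names made up for this illustration. The point is only
that these four operations never pass information between the low and
the high word, so each 32-bit half can be computed on its own.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical illustration type: a 64-bit value held as two 32-bit
   words, roughly how a DImode value occupies a core register pair.  */
typedef struct { uint32_t lo; uint32_t hi; } di_pair;

/* Take the low and high 32-bit words of a 64-bit value.  */
static di_pair di_split (uint64_t x)
{
  di_pair p = { (uint32_t) x, (uint32_t) (x >> 32) };
  return p;
}

/* Reassemble the 64-bit value from its two words.  */
static uint64_t di_join (di_pair p)
{
  return ((uint64_t) p.hi << 32) | p.lo;
}

/* Each logical operation handles the two words independently.  */
static di_pair di_and (di_pair a, di_pair b)
{
  di_pair r = { a.lo & b.lo, a.hi & b.hi };
  return r;
}

static di_pair di_ior (di_pair a, di_pair b)
{
  di_pair r = { a.lo | b.lo, a.hi | b.hi };
  return r;
}

static di_pair di_xor (di_pair a, di_pair b)
{
  di_pair r = { a.lo ^ b.lo, a.hi ^ b.hi };
  return r;
}

static di_pair di_not (di_pair a)
{
  di_pair r = { ~a.lo, ~a.hi };
  return r;
}
```

Addition and subtraction are different in that a carry or borrow passes
from the low word to the high word, so they cannot be written as two
independent word operations in this way.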