[v2,4/4] zinc: ChaCha20 x86_64 implementation

From: Jason A. Donenfeld <Jason@zx2c4.com>

This ports SSSE3, AVX-2, AVX-512F, and AVX-512VL implementations for
ChaCha20. The AVX-512F implementation is disabled on Skylake, due to
throttling, and the VL ymm implementation is used instead. These come
from Andy Polyakov's implementation, with the following modifications
from Samuel Neves:

  - Some cosmetic changes, like renaming labels to .Lname, constants,
    and other Linux conventions.

  - CPU feature checking is done in C by the glue code, so that has been
    removed from the assembly.

  - Eliminate translating certain instructions, such as pshufb, palignr,
    vprotd, etc., to .byte directives. That translation was meant for
    compatibility with ancient toolchains, but it is presumably
    unnecessary here, since the build system already checks what GNU as
    can assemble.

  - When aligning the stack, the original code saved %rsp to %r9.
    To keep objtool happy, we instead use the DRAP idiom and save %rsp
    to %r10:

      leaq    8(%rsp),%r10
      ... code here ...
      leaq    -8(%r10),%rsp

  - The original code assumes the stack comes aligned to 16 bytes. This
    is not necessarily the case, and to avoid crashes,
    `andq $-alignment, %rsp` was added in the prolog of a few functions.

  - The original hardcodes returns as .byte 0xf3,0xc3, aka "rep ret".
    We replace this with "ret". "rep ret" was meant to help with AMD K8
    chips, cf. http://repzret.org/p/repzret. It makes no sense to
    continue to use this kludge for code that won't even run on ancient
    AMD chips.
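
As a rough illustration of feature checking done in C rather than in
the assembly, the selection could look like the sketch below. This is
not the glue code from this patch: the struct, the enum, the function
name chacha20_pick_impl and the avx512f_throttles flag are all
hypothetical, and the priority order is an assumption based only on the
description above (AVX-512F is skipped where it throttles, and the
AVX-512VL ymm code is used instead).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical feature summary filled in by the caller. Real glue code
 * would query the CPU itself; that part is not modeled here.
 */
struct chacha20_cpu_features {
	bool ssse3;
	bool avx2;
	bool avx512f;
	bool avx512vl;
	bool avx512f_throttles;	/* assumption: set on Skylake parts */
};

/* Hypothetical identifiers for the available code paths. */
enum chacha20_impl {
	CHACHA20_IMPL_GENERIC,
	CHACHA20_IMPL_SSSE3,
	CHACHA20_IMPL_AVX2,
	CHACHA20_IMPL_AVX512VL,
	CHACHA20_IMPL_AVX512F,
};

/* Pick the widest usable implementation (assumed priority order). */
static enum chacha20_impl
chacha20_pick_impl(const struct chacha20_cpu_features *f)
{
	assert(f != NULL);

	/* Use the zmm code only where it does not cause throttling. */
	if (f->avx512f && !f->avx512f_throttles)
		return CHACHA20_IMPL_AVX512F;
	/* Otherwise prefer the AVX-512VL ymm variant. */
	if (f->avx512vl)
		return CHACHA20_IMPL_AVX512VL;
	if (f->avx2)
		return CHACHA20_IMPL_AVX2;
	if (f->ssse3)
		return CHACHA20_IMPL_SSSE3;
	/* No SIMD support detected: fall back to generic C code. */
	return CHACHA20_IMPL_GENERIC;
}
```

With every flag set, including avx512f_throttles, the sketch returns
CHACHA20_IMPL_AVX512VL, which mirrors the Skylake case described above.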

Cycle counts on a Core i7 6700HQ using the AVX-2 codepath, comparing
this implementation ("new") to the implementation in the current crypto
API ("old"):

size	old	new
----	----	----
0	62	52
16	414	376
32	410	400
48	414	422
64	362	356
80	714	666
96	714	700
112	712	718
128	692	646
144	1042	674
160	1042	694
176	1042	726
192	1018	650
208	1366	686
224	1366	696
240	1366	722
256	640	656
272	988	1246
288	988	1276
304	992	1296
320	972	1222
336	1318	1256
352	1318	1276
368	1316	1294
384	1294	1218
400	1642	1258
416	1642	1282
432	1642	1302
448	1628	1224
464	1970	1258
480	1970	1280
496	1970	1300
512	656	676
528	1010	1290
544	1010	1306
560	1010	1332
576	986	1254
592	1340	1284
608	1334	1310
624	1340	1334
640	1314	1254
656	1664	1282
672	1674	1306
688	1662	1336
704	1638	1250
720	1992	1292
736	1994	1308
752	1988	1334
768	1252	1254
784	1596	1290
800	1596	1314
816	1596	1330
832	1576	1256
848	1922	1286
864	1922	1314
880	1926	1338
896	1898	1258
912	2248	1288
928	2248	1320
944	2248	1338
960	2226	1268
976	2574	1288
992	2576	1312
1008	2574	1340

Cycle counts on a Xeon Gold 5120 using the AVX-512 codepath:

size	old	new
----	----	----
0	64	54
16	386	372
32	388	396
48	388	420
64	366	350
80	708	666
96	708	692
112	706	736
128	692	648
144	1036	682
160	1036	708
176	1036	730
192	1016	658
208	1360	684
224	1362	708
240	1360	732
256	644	500
272	990	526
288	988	556
304	988	576
320	972	500
336	1314	532
352	1316	558
368	1318	578
384	1308	506
400	1644	532
416	1644	556
432	1644	594
448	1624	508
464	1970	534
480	1970	556
496	1968	582
512	660	624
528	1016	682
544	1016	702
560	1018	728
576	998	654
592	1344	680
608	1344	708
624	1344	730
640	1326	654
656	1670	686
672	1670	708
688	1670	732
704	1652	658
720	1998	682
736	1998	710
752	1996	734
768	1256	662
784	1606	688
800	1606	714
816	1606	736
832	1584	660
848	1948	688
864	1950	714
880	1948	736
896	1912	688
912	2258	718
928	2258	744
944	2256	768
960	2238	692
976	2584	718
992	2584	744
1008	2584	770
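
To make the two tables easier to compare, the small sketch below
computes the old/new speedup and the cycles per byte for a few rows.
The numbers are copied from the tables above; the struct and function
names are hypothetical, and the size column is presumed to be in bytes.

```c
#include <assert.h>

/* One table row: input size (presumably bytes) and both cycle counts. */
struct cycle_sample {
	unsigned int size;
	unsigned int old_cycles;
	unsigned int new_cycles;
};

/* Selected rows from the Core i7 6700HQ (AVX-2) table. */
static const struct cycle_sample i7_6700hq_avx2[] = {
	{ 64, 362, 356 },
	{ 256, 640, 656 },
	{ 512, 656, 676 },
	{ 1008, 2574, 1340 },
};

/* Selected rows from the Xeon Gold 5120 (AVX-512) table. */
static const struct cycle_sample xeon_5120_avx512[] = {
	{ 64, 366, 350 },
	{ 256, 644, 500 },
	{ 512, 660, 624 },
	{ 1008, 2584, 770 },
};

/* Ratio above 1.0 means the new code needs fewer cycles. */
static double speedup(const struct cycle_sample *s)
{
	assert(s->new_cycles > 0);
	return (double)s->old_cycles / (double)s->new_cycles;
}

/* Average cost per input byte for a given cycle count. */
static double cycles_per_byte(unsigned int cycles, unsigned int size)
{
	assert(size > 0);
	return (double)cycles / (double)size;
}
```

For the 1008 row this gives a ratio of roughly 1.9 on the Core i7 and
roughly 3.4 on the Xeon. For the 256 and 512 rows on the Core i7 the
ratio is slightly below 1.0, since the new code takes a few more cycles
there (656 vs. 640 and 676 vs. 656).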

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: x86@kernel.org
Cc: Jean-Philippe Aumasson <jeanphilippe.aumasson@gmail.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: kernel-hardening@lists.openwall.com
Cc: linux-crypto@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

---

 arch/x86/crypto/Makefile                 |    3 
 arch/x86/crypto/chacha20-avx2-x86_64.S   | 1026 ------------
 arch/x86/crypto/chacha20-ssse3-x86_64.S  |  761 --------
 arch/x86/crypto/chacha20-zinc-x86_64.S   | 2632 +++++++++++++++++++++++++++++++
 arch/x86/crypto/chacha20_glue.c          |  119 -
 include/crypto/chacha20.h                |    4 
 lib/zinc/chacha20/chacha20-x86_64-glue.c |    3 
 7 files changed, 2692 insertions(+), 1856 deletions(-)

Message ID	E1gOz9U-000678-EL@gondobar
State	New
Headers	Delivered-To: patch@linaro.org
	Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67;
	Subject: [v2 PATCH 4/4] zinc: ChaCha20 x86_64 implementation
	References: <20181120060217.t4nccaqpwnxkl4tx@gondor.apana.org.au>
	To: "Jason A. Donenfeld" <Jason@zx2c4.com>, Eric Biggers <ebiggers@kernel.org>, Ard Biesheuvel <ard.biesheuvel@linaro.org>, Linux Crypto Mailing List <linux-crypto@vger.kernel.org>, linux-fscrypt@vger.kernel.org, linux-arm-kernel@lists.infradead.org, LKML <linux-kernel@vger.kernel.org>, Paul Crowley <paulcrowley@google.com>, Greg Kaiser <gkaiser@google.com>, Samuel Neves <samuel.c.p.neves@gmail.com>, Tomer Ashur <tomer.ashur@esat.kuleuven.be>, Martin Willi <martin@strongswan.org>
	Message-Id: <E1gOz9U-000678-EL@gondobar>
	From: Herbert Xu <herbert@gondor.apana.org.au>
	Date: Tue, 20 Nov 2018 14:04:48 +0800
	Sender: linux-kernel-owner@vger.kernel.org
	Precedence: bulk
Series	None: [v2,2/4] zinc: ChaCha20 generic C implementation and selftest; [v2,4/4] zinc: ChaCha20 x86_64 implementation
