From patchwork Sat Oct 6 02:56:42 2018
X-Patchwork-Submitter: "Jason A. Donenfeld" <Jason@zx2c4.com>
X-Patchwork-Id: 148298
From: "Jason A. Donenfeld" <Jason@zx2c4.com>
To: linux-kernel@vger.kernel.org, netdev@vger.kernel.org,
    davem@davemloft.net, gregkh@linuxfoundation.org
Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
Subject: [PATCH net-next v7 01/28] ARM: makefile: use ARMv3M mode for RiscPC
Date: Sat, 6 Oct 2018 04:56:42 +0200
Message-Id: <20181006025709.4019-2-Jason@zx2c4.com>
In-Reply-To: <20181006025709.4019-1-Jason@zx2c4.com>
References: <20181006025709.4019-1-Jason@zx2c4.com>

The purpose of CONFIG_CPU_32v3 is to avoid ldrh/strh on the RiscPC,
which is pretty much an ARMv4 device, except its bus will choke on the
half-words. The way to make the C compiler not output ldrh/strh is with
-march=armv3, which doesn't support them in the ISA. However, this
prevents certain cryptography code from working that uses instructions
like umull. Fortunately there's also -march=armv3m that does support
those, making it possible to continue assembling optimized cryptography
routines for our beloved RiscPC.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
---

Notes:
    This commit has been submitted to the proper ARM tree and is working
    its way upstream. It's included in this series here so that kbuild
    0-day bot doesn't get too nervous about RiscPC, but is already
    entering the tree through arm-next.
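To make that concrete, a small example of the kind of code at stake (an
illustration, not part of the patch; the function name widening_mul is
invented): a widening 32x32->64 multiply is exactly what lowers to umull.

    #include <stdint.h>

    /* Built with -march=armv3m or later, GCC can emit a single umull
     * instruction here; with plain -march=armv3 the ISA has no long
     * multiplies, so the compiler must expand this into a sequence of
     * 32-bit multiplies and adds (or a library helper) instead. */
    uint64_t widening_mul(uint32_t a, uint32_t b)
    {
            return (uint64_t)a * b;
    }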
 arch/arm/Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

-- 
2.19.0

diff --git a/arch/arm/Makefile b/arch/arm/Makefile
index d1516f85f25d..7fd4bcaf0721 100644
--- a/arch/arm/Makefile
+++ b/arch/arm/Makefile
@@ -74,7 +74,7 @@ endif
 arch-$(CONFIG_CPU_32v5)         =-D__LINUX_ARM_ARCH__=5 $(call cc-option,-march=armv5te,-march=armv4t)
 arch-$(CONFIG_CPU_32v4T)        =-D__LINUX_ARM_ARCH__=4 -march=armv4t
 arch-$(CONFIG_CPU_32v4)         =-D__LINUX_ARM_ARCH__=4 -march=armv4
-arch-$(CONFIG_CPU_32v3)         =-D__LINUX_ARM_ARCH__=3 -march=armv3
+arch-$(CONFIG_CPU_32v3)         =-D__LINUX_ARM_ARCH__=3 -march=armv3m
 
 # Evaluate arch cc-option calls now
 arch-y := $(arch-y)

From patchwork Sat Oct 6 02:56:43 2018
X-Patchwork-Submitter: "Jason A. Donenfeld" <Jason@zx2c4.com>
X-Patchwork-Id: 148300
From: "Jason A. Donenfeld" <Jason@zx2c4.com>
To: linux-kernel@vger.kernel.org, netdev@vger.kernel.org,
    davem@davemloft.net, gregkh@linuxfoundation.org
Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>, Samuel Neves,
    Andy Lutomirski, Thomas Gleixner, linux-arch@vger.kernel.org
Subject: [PATCH net-next v7 02/28] asm: simd context helper API
Date: Sat, 6 Oct 2018 04:56:43 +0200
Message-Id: <20181006025709.4019-3-Jason@zx2c4.com>
In-Reply-To: <20181006025709.4019-1-Jason@zx2c4.com>
References: <20181006025709.4019-1-Jason@zx2c4.com>

Sometimes it's useful to amortize calls to XSAVE/XRSTOR and the related
FPU/SIMD functions over a number of calls, because FPU restoration is
quite expensive. This adds a simple header for carrying out this
pattern:

    simd_context_t simd_context;

    simd_get(&simd_context);
    while ((item = get_item_from_queue()) != NULL) {
        encrypt_item(item, &simd_context);
        simd_relax(&simd_context);
    }
    simd_put(&simd_context);

The relaxation step ensures that we don't trample over preemption, and
the get/put API should be a familiar paradigm in the kernel.

On the other end, code that actually wants to use SIMD instructions can
accept this as a parameter and check it via:

    void encrypt_item(struct item *item, simd_context_t *simd_context)
    {
        if (item->len > LARGE_FOR_SIMD && simd_use(simd_context))
            wild_simd_code(item);
        else
            boring_scalar_code(item);
    }

The actual XSAVE happens during simd_use (and only on the first time),
so that if the context is never actually used, no performance penalty
is hit.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Samuel Neves
Cc: Andy Lutomirski
Cc: Thomas Gleixner
Cc: Greg KH
Cc: linux-arch@vger.kernel.org
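As a quick summary of how these state flags drive the FPU save/restore
calls, here is the lifecycle on x86. This trace is an editorial sketch
derived from the implementation in the diff below, not text from the
original mail:

    /*
     * simd_get(&ctx)   -> ctx = HAVE_FULL_SIMD if irq_fpu_usable(),
     *                     else HAVE_NO_SIMD; no XSAVE happens yet.
     * simd_use(&ctx)   -> first call with HAVE_FULL_SIMD runs
     *                     kernel_fpu_begin() and sets HAVE_SIMD_IN_USE;
     *                     later calls just return true.
     * simd_relax(&ctx) -> with CONFIG_PREEMPT, if SIMD is in use and
     *                     need_resched() is true: simd_put() then
     *                     simd_get(), opening a preemption window.
     * simd_put(&ctx)   -> kernel_fpu_end() if HAVE_SIMD_IN_USE was set,
     *                     then ctx = HAVE_NO_SIMD.
     */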
---
 arch/alpha/include/asm/Kbuild      |  1 +
 arch/arc/include/asm/Kbuild        |  1 +
 arch/arm/include/asm/Kbuild        |  1 -
 arch/arm/include/asm/simd.h        | 63 ++++++++++++++++++++++++++++++
 arch/arm64/include/asm/simd.h      | 51 +++++++++++++++++++++---
 arch/c6x/include/asm/Kbuild        |  1 +
 arch/h8300/include/asm/Kbuild      |  1 +
 arch/hexagon/include/asm/Kbuild    |  1 +
 arch/ia64/include/asm/Kbuild       |  1 +
 arch/m68k/include/asm/Kbuild       |  1 +
 arch/microblaze/include/asm/Kbuild |  1 +
 arch/mips/include/asm/Kbuild       |  1 +
 arch/nds32/include/asm/Kbuild      |  1 +
 arch/nios2/include/asm/Kbuild      |  1 +
 arch/openrisc/include/asm/Kbuild   |  1 +
 arch/parisc/include/asm/Kbuild     |  1 +
 arch/powerpc/include/asm/Kbuild    |  1 +
 arch/riscv/include/asm/Kbuild      |  1 +
 arch/s390/include/asm/Kbuild       |  1 +
 arch/sh/include/asm/Kbuild         |  1 +
 arch/sparc/include/asm/Kbuild      |  1 +
 arch/um/include/asm/Kbuild         |  1 +
 arch/unicore32/include/asm/Kbuild  |  1 +
 arch/x86/include/asm/simd.h        | 44 ++++++++++++++++++++-
 arch/xtensa/include/asm/Kbuild     |  1 +
 include/asm-generic/simd.h         | 20 ++++++++++
 include/linux/simd.h               | 32 +++++++++++++++
 27 files changed, 224 insertions(+), 8 deletions(-)
 create mode 100644 arch/arm/include/asm/simd.h
 create mode 100644 include/linux/simd.h

-- 
2.19.0

diff --git a/arch/alpha/include/asm/Kbuild b/arch/alpha/include/asm/Kbuild
index 0580cb8c84b2..220dfd170d45 100644
--- a/arch/alpha/include/asm/Kbuild
+++ b/arch/alpha/include/asm/Kbuild
@@ -13,3 +13,4 @@ generic-y += sections.h
 generic-y += trace_clock.h
 generic-y += current.h
 generic-y += kprobes.h
+generic-y += simd.h
diff --git a/arch/arc/include/asm/Kbuild b/arch/arc/include/asm/Kbuild
index feed50ce89fa..a7f4255f1649 100644
--- a/arch/arc/include/asm/Kbuild
+++ b/arch/arc/include/asm/Kbuild
@@ -22,6 +22,7 @@ generic-y += parport.h
 generic-y += pci.h
 generic-y += percpu.h
 generic-y += preempt.h
+generic-y += simd.h
 generic-y += topology.h
 generic-y += trace_clock.h
 generic-y += user.h
diff --git a/arch/arm/include/asm/Kbuild b/arch/arm/include/asm/Kbuild
index 1d66db9c9db5..ebdc9eeb8d39 100644
--- a/arch/arm/include/asm/Kbuild
+++ b/arch/arm/include/asm/Kbuild
@@ -16,7 +16,6 @@ generic-y += rwsem.h
 generic-y += seccomp.h
 generic-y += segment.h
 generic-y += serial.h
-generic-y += simd.h
 generic-y += sizes.h
 generic-y += timex.h
 generic-y += trace_clock.h
diff --git a/arch/arm/include/asm/simd.h b/arch/arm/include/asm/simd.h
new file mode 100644
index 000000000000..264ed84b41d8
--- /dev/null
+++ b/arch/arm/include/asm/simd.h
@@ -0,0 +1,63 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * Copyright (C) 2015-2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ */
+
+#include <linux/simd.h>
+#ifndef _ASM_SIMD_H
+#define _ASM_SIMD_H
+
+#ifdef CONFIG_KERNEL_MODE_NEON
+#include <asm/neon.h>
+
+static __must_check inline bool may_use_simd(void)
+{
+        return !in_nmi() && !in_irq() && !in_serving_softirq();
+}
+
+static inline void simd_get(simd_context_t *ctx)
+{
+        *ctx = may_use_simd() ? HAVE_FULL_SIMD : HAVE_NO_SIMD;
+}
+
+static inline void simd_put(simd_context_t *ctx)
+{
+        if (*ctx & HAVE_SIMD_IN_USE)
+                kernel_neon_end();
+        *ctx = HAVE_NO_SIMD;
+}
+
+static __must_check inline bool simd_use(simd_context_t *ctx)
+{
+        if (!(*ctx & HAVE_FULL_SIMD))
+                return false;
+        if (*ctx & HAVE_SIMD_IN_USE)
+                return true;
+        kernel_neon_begin();
+        *ctx |= HAVE_SIMD_IN_USE;
+        return true;
+}
+
+#else
+
+static __must_check inline bool may_use_simd(void)
+{
+        return false;
+}
+
+static inline void simd_get(simd_context_t *ctx)
+{
+        *ctx = HAVE_NO_SIMD;
+}
+
+static inline void simd_put(simd_context_t *ctx)
+{
+}
+
+static __must_check inline bool simd_use(simd_context_t *ctx)
+{
+        return false;
+}
+#endif
+
+#endif /* _ASM_SIMD_H */
diff --git a/arch/arm64/include/asm/simd.h b/arch/arm64/include/asm/simd.h
index 6495cc51246f..a45ff1600040 100644
--- a/arch/arm64/include/asm/simd.h
+++ b/arch/arm64/include/asm/simd.h
@@ -1,11 +1,10 @@
-/*
- * Copyright (C) 2017 Linaro Ltd.
+/* SPDX-License-Identifier: GPL-2.0
  *
- * This program is free software; you can redistribute it and/or modify it
- * under the terms of the GNU General Public License version 2 as published
- * by the Free Software Foundation.
+ * Copyright (C) 2017 Linaro Ltd.
+ * Copyright (C) 2015-2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
  */
 
+#include <linux/simd.h>
 #ifndef __ASM_SIMD_H
 #define __ASM_SIMD_H
 
@@ -16,6 +15,8 @@
 #include <linux/types.h>
 
 #ifdef CONFIG_KERNEL_MODE_NEON
+#include <asm/neon.h>
+#include <asm/fpsimd.h>
 
 DECLARE_PER_CPU(bool, kernel_neon_busy);
 
@@ -40,9 +41,47 @@ static __must_check inline bool may_use_simd(void)
                !this_cpu_read(kernel_neon_busy);
 }
 
+static inline void simd_get(simd_context_t *ctx)
+{
+        *ctx = may_use_simd() ? HAVE_FULL_SIMD : HAVE_NO_SIMD;
+}
+
+static inline void simd_put(simd_context_t *ctx)
+{
+        if (*ctx & HAVE_SIMD_IN_USE)
+                kernel_neon_end();
+        *ctx = HAVE_NO_SIMD;
+}
+
+static __must_check inline bool simd_use(simd_context_t *ctx)
+{
+        if (!(*ctx & HAVE_FULL_SIMD))
+                return false;
+        if (*ctx & HAVE_SIMD_IN_USE)
+                return true;
+        kernel_neon_begin();
+        *ctx |= HAVE_SIMD_IN_USE;
+        return true;
+}
+
 #else /* ! CONFIG_KERNEL_MODE_NEON */
 
-static __must_check inline bool may_use_simd(void) {
+static __must_check inline bool may_use_simd(void)
+{
+        return false;
+}
+
+static inline void simd_get(simd_context_t *ctx)
+{
+        *ctx = HAVE_NO_SIMD;
+}
+
+static inline void simd_put(simd_context_t *ctx)
+{
+}
+
+static __must_check inline bool simd_use(simd_context_t *ctx)
+{
         return false;
 }
diff --git a/arch/c6x/include/asm/Kbuild b/arch/c6x/include/asm/Kbuild
index 33a2c94fed0d..7543c38f7ade 100644
--- a/arch/c6x/include/asm/Kbuild
+++ b/arch/c6x/include/asm/Kbuild
@@ -30,6 +30,7 @@ generic-y += pgalloc.h
 generic-y += preempt.h
 generic-y += segment.h
 generic-y += serial.h
+generic-y += simd.h
 generic-y += tlbflush.h
 generic-y += topology.h
 generic-y += trace_clock.h
diff --git a/arch/h8300/include/asm/Kbuild b/arch/h8300/include/asm/Kbuild
index a5d0b2991f47..1fcef25ee19d 100644
--- a/arch/h8300/include/asm/Kbuild
+++ b/arch/h8300/include/asm/Kbuild
@@ -39,6 +39,7 @@ generic-y += preempt.h
 generic-y += scatterlist.h
 generic-y += sections.h
 generic-y += serial.h
+generic-y += simd.h
 generic-y += sizes.h
 generic-y += spinlock.h
 generic-y += timex.h
diff --git a/arch/hexagon/include/asm/Kbuild b/arch/hexagon/include/asm/Kbuild
index dd2fd9c0d292..217d4695fd8a 100644
--- a/arch/hexagon/include/asm/Kbuild
+++ b/arch/hexagon/include/asm/Kbuild
@@ -29,6 +29,7 @@ generic-y += rwsem.h
 generic-y += sections.h
 generic-y += segment.h
 generic-y += serial.h
+generic-y += simd.h
 generic-y += sizes.h
 generic-y += topology.h
 generic-y += trace_clock.h
diff --git a/arch/ia64/include/asm/Kbuild b/arch/ia64/include/asm/Kbuild
index 557bbc8ba9f5..41c5ebdf79e5 100644
--- a/arch/ia64/include/asm/Kbuild
+++ b/arch/ia64/include/asm/Kbuild
@@ -4,6 +4,7 @@ generic-y += irq_work.h
 generic-y += mcs_spinlock.h
 generic-y += mm-arch-hooks.h
 generic-y += preempt.h
+generic-y += simd.h
 generic-y += trace_clock.h
 generic-y += vtime.h
 generic-y += word-at-a-time.h
diff --git a/arch/m68k/include/asm/Kbuild b/arch/m68k/include/asm/Kbuild
index a4b8d3331a9e..73898dd1a4d0 100644
--- a/arch/m68k/include/asm/Kbuild
+++ b/arch/m68k/include/asm/Kbuild
@@ -19,6 +19,7 @@ generic-y += mm-arch-hooks.h
 generic-y += percpu.h
 generic-y += preempt.h
 generic-y += sections.h
+generic-y += simd.h
 generic-y += spinlock.h
 generic-y += topology.h
 generic-y += trace_clock.h
diff --git a/arch/microblaze/include/asm/Kbuild b/arch/microblaze/include/asm/Kbuild
index 569ba9e670c1..7a877eea99d3 100644
--- a/arch/microblaze/include/asm/Kbuild
+++ b/arch/microblaze/include/asm/Kbuild
@@ -25,6 +25,7 @@ generic-y += parport.h
 generic-y += percpu.h
 generic-y += preempt.h
 generic-y += serial.h
+generic-y += simd.h
 generic-y += syscalls.h
 generic-y += topology.h
 generic-y += trace_clock.h
diff --git a/arch/mips/include/asm/Kbuild b/arch/mips/include/asm/Kbuild
index 58351e48421e..e8868e0fb2c3 100644
--- a/arch/mips/include/asm/Kbuild
+++ b/arch/mips/include/asm/Kbuild
@@ -16,6 +16,7 @@ generic-y += qrwlock.h
 generic-y += qspinlock.h
 generic-y += sections.h
 generic-y += segment.h
+generic-y += simd.h
 generic-y += trace_clock.h
 generic-y += unaligned.h
 generic-y += user.h
diff --git a/arch/nds32/include/asm/Kbuild b/arch/nds32/include/asm/Kbuild
index dbc4e5422550..fb2f113716ce 100644
--- a/arch/nds32/include/asm/Kbuild
+++ b/arch/nds32/include/asm/Kbuild
@@ -46,6 +46,7 @@ generic-y += sections.h
 generic-y += segment.h
 generic-y += serial.h
 generic-y += shmbuf.h
+generic-y += simd.h
 generic-y += sizes.h
 generic-y += stat.h
 generic-y += switch_to.h
diff --git a/arch/nios2/include/asm/Kbuild b/arch/nios2/include/asm/Kbuild
index 8fde4fa2c34f..571a9d9ad107 100644
--- a/arch/nios2/include/asm/Kbuild
+++ b/arch/nios2/include/asm/Kbuild
@@ -33,6 +33,7 @@ generic-y += preempt.h
 generic-y += sections.h
 generic-y += segment.h
 generic-y += serial.h
+generic-y += simd.h
 generic-y += spinlock.h
 generic-y += topology.h
 generic-y += trace_clock.h
diff --git a/arch/openrisc/include/asm/Kbuild b/arch/openrisc/include/asm/Kbuild
index eb87cd8327c8..b6231211bbad 100644
--- a/arch/openrisc/include/asm/Kbuild
+++ b/arch/openrisc/include/asm/Kbuild
@@ -34,6 +34,7 @@ generic-y += qrwlock_types.h
 generic-y += qrwlock.h
 generic-y += sections.h
 generic-y += segment.h
+generic-y += simd.h
 generic-y += string.h
 generic-y += switch_to.h
 generic-y += topology.h
diff --git a/arch/parisc/include/asm/Kbuild b/arch/parisc/include/asm/Kbuild
index 2013d639e735..97970b4d05ab 100644
--- a/arch/parisc/include/asm/Kbuild
+++ b/arch/parisc/include/asm/Kbuild
@@ -17,6 +17,7 @@ generic-y += percpu.h
 generic-y += preempt.h
 generic-y += seccomp.h
 generic-y += segment.h
+generic-y += simd.h
 generic-y += topology.h
 generic-y += trace_clock.h
 generic-y += user.h
diff --git a/arch/powerpc/include/asm/Kbuild b/arch/powerpc/include/asm/Kbuild
index 3196d227e351..2337190aaf69 100644
--- a/arch/powerpc/include/asm/Kbuild
+++ b/arch/powerpc/include/asm/Kbuild
@@ -8,3 +8,4 @@ generic-y += preempt.h
 generic-y += rwsem.h
 generic-y += vtime.h
 generic-y += msi.h
+generic-y += simd.h
diff --git a/arch/riscv/include/asm/Kbuild b/arch/riscv/include/asm/Kbuild
index efdbe311e936..438a11d9c47a 100644
--- a/arch/riscv/include/asm/Kbuild
+++ b/arch/riscv/include/asm/Kbuild
@@ -46,6 +46,7 @@ generic-y += setup.h
 generic-y += shmbuf.h
 generic-y += shmparam.h
 generic-y += signal.h
+generic-y += simd.h
 generic-y += socket.h
 generic-y += sockios.h
 generic-y += stat.h
diff --git a/arch/s390/include/asm/Kbuild b/arch/s390/include/asm/Kbuild
index e3239772887a..3744c4c61fb5 100644
--- a/arch/s390/include/asm/Kbuild
+++ b/arch/s390/include/asm/Kbuild
@@ -22,6 +22,7 @@ generic-y += mcs_spinlock.h
 generic-y += mm-arch-hooks.h
 generic-y += preempt.h
 generic-y += rwsem.h
+generic-y += simd.h
 generic-y += trace_clock.h
 generic-y += unaligned.h
 generic-y += word-at-a-time.h
diff --git a/arch/sh/include/asm/Kbuild b/arch/sh/include/asm/Kbuild
index 6a5609a55965..8e64ff35a933 100644
--- a/arch/sh/include/asm/Kbuild
+++ b/arch/sh/include/asm/Kbuild
@@ -16,6 +16,7 @@ generic-y += percpu.h
 generic-y += preempt.h
 generic-y += rwsem.h
 generic-y += serial.h
+generic-y += simd.h
 generic-y += sizes.h
 generic-y += trace_clock.h
 generic-y += xor.h
diff --git a/arch/sparc/include/asm/Kbuild b/arch/sparc/include/asm/Kbuild
index 410b263ef5c8..72b9e08fb350 100644
--- a/arch/sparc/include/asm/Kbuild
+++ b/arch/sparc/include/asm/Kbuild
@@ -17,5 +17,6 @@ generic-y += msi.h
 generic-y += preempt.h
 generic-y += rwsem.h
 generic-y += serial.h
+generic-y += simd.h
 generic-y += trace_clock.h
 generic-y += word-at-a-time.h
diff --git a/arch/um/include/asm/Kbuild b/arch/um/include/asm/Kbuild
index b10dde6cb793..8c2bfa6e0494 100644
--- a/arch/um/include/asm/Kbuild
+++ b/arch/um/include/asm/Kbuild
@@ -22,6 +22,7 @@ generic-y += param.h
 generic-y += pci.h
 generic-y += percpu.h
 generic-y += preempt.h
+generic-y += simd.h
 generic-y += switch_to.h
 generic-y += topology.h
 generic-y += trace_clock.h
diff --git a/arch/unicore32/include/asm/Kbuild b/arch/unicore32/include/asm/Kbuild
index bfc7abe77905..98a908720bbd 100644
--- a/arch/unicore32/include/asm/Kbuild
+++ b/arch/unicore32/include/asm/Kbuild
@@ -27,6 +27,7 @@ generic-y += preempt.h
 generic-y += sections.h
 generic-y += segment.h
 generic-y += serial.h
+generic-y += simd.h
 generic-y += sizes.h
 generic-y += syscalls.h
 generic-y += topology.h
diff --git a/arch/x86/include/asm/simd.h b/arch/x86/include/asm/simd.h
index a341c878e977..4aad7f158dcb 100644
--- a/arch/x86/include/asm/simd.h
+++ b/arch/x86/include/asm/simd.h
@@ -1,4 +1,11 @@
-/* SPDX-License-Identifier: GPL-2.0 */
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * Copyright (C) 2015-2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ */
+
+#include <linux/simd.h>
+#ifndef _ASM_SIMD_H
+#define _ASM_SIMD_H
 
 #include <asm/fpu/api.h>
 
@@ -10,3 +17,38 @@ static __must_check inline bool may_use_simd(void)
 {
         return irq_fpu_usable();
 }
+
+static inline void simd_get(simd_context_t *ctx)
+{
+#if !defined(CONFIG_UML)
+        *ctx = may_use_simd() ? HAVE_FULL_SIMD : HAVE_NO_SIMD;
+#else
+        *ctx = HAVE_NO_SIMD;
+#endif
+}
+
+static inline void simd_put(simd_context_t *ctx)
+{
+#if !defined(CONFIG_UML)
+        if (*ctx & HAVE_SIMD_IN_USE)
+                kernel_fpu_end();
+#endif
+        *ctx = HAVE_NO_SIMD;
+}
+
+static __must_check inline bool simd_use(simd_context_t *ctx)
+{
+#if !defined(CONFIG_UML)
+        if (!(*ctx & HAVE_FULL_SIMD))
+                return false;
+        if (*ctx & HAVE_SIMD_IN_USE)
+                return true;
+        kernel_fpu_begin();
+        *ctx |= HAVE_SIMD_IN_USE;
+        return true;
+#else
+        return false;
+#endif
+}
+
+#endif /* _ASM_SIMD_H */
diff --git a/arch/xtensa/include/asm/Kbuild b/arch/xtensa/include/asm/Kbuild
index 82c756431b49..7950f359649d 100644
--- a/arch/xtensa/include/asm/Kbuild
+++ b/arch/xtensa/include/asm/Kbuild
@@ -24,6 +24,7 @@ generic-y += percpu.h
 generic-y += preempt.h
 generic-y += rwsem.h
 generic-y += sections.h
+generic-y += simd.h
 generic-y += topology.h
 generic-y += trace_clock.h
 generic-y += word-at-a-time.h
diff --git a/include/asm-generic/simd.h b/include/asm-generic/simd.h
index d0343d58a74a..b3dd61ac010e 100644
--- a/include/asm-generic/simd.h
+++ b/include/asm-generic/simd.h
@@ -1,5 +1,9 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 
+#include <linux/simd.h>
+#ifndef _ASM_SIMD_H
+#define _ASM_SIMD_H
+
 #include <linux/hardirq.h>
 
 /*
@@ -13,3 +17,19 @@ static __must_check inline bool may_use_simd(void)
 {
         return !in_interrupt();
 }
+
+static inline void simd_get(simd_context_t *ctx)
+{
+        *ctx = HAVE_NO_SIMD;
+}
+
+static inline void simd_put(simd_context_t *ctx)
+{
+}
+
+static __must_check inline bool simd_use(simd_context_t *ctx)
+{
+        return false;
+}
+
+#endif /* _ASM_SIMD_H */
diff --git a/include/linux/simd.h b/include/linux/simd.h
new file mode 100644
index 000000000000..4e0b8a9bdc14
--- /dev/null
+++ b/include/linux/simd.h
@@ -0,0 +1,32 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * Copyright (C) 2015-2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ */
+
+#ifndef _SIMD_H
+#define _SIMD_H
+
+typedef enum {
+        HAVE_NO_SIMD = 1 << 0,
+        HAVE_FULL_SIMD = 1 << 1,
+        HAVE_SIMD_IN_USE = 1 << 31
+} simd_context_t;
+
+#define DONT_USE_SIMD ((simd_context_t []){ HAVE_NO_SIMD })
+
+#include <linux/sched.h>
+#include <asm/simd.h>
+
+static inline bool simd_relax(simd_context_t *ctx)
+{
+#ifdef CONFIG_PREEMPT
+        if ((*ctx & HAVE_SIMD_IN_USE) && need_resched()) {
+                simd_put(ctx);
+                simd_get(ctx);
+                return true;
+        }
+#endif
+        return false;
+}
+
+#endif /* _SIMD_H */
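One detail worth calling out from include/linux/simd.h above: DONT_USE_SIMD
is a compound-literal array, so it decays to a valid simd_context_t pointer.
A caller with no context of its own can pass it directly and will always get
the scalar code path, since simd_use() never sees HAVE_FULL_SIMD in it. A
hedged sketch (chacha20()'s signature comes from patch 04 below; the
variables chacha20_state, dst, src, and len are assumed):

    /* One-shot encryption without managing a SIMD context: simd_use()
     * on DONT_USE_SIMD returns false, so the generic scalar
     * implementation is always taken. */
    chacha20(&chacha20_state, dst, src, len, DONT_USE_SIMD);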
Donenfeld" X-Patchwork-Id: 148302 Delivered-To: patch@linaro.org Received: by 2002:a2e:8595:0:0:0:0:0 with SMTP id b21-v6csp1157314lji; Fri, 5 Oct 2018 19:57:48 -0700 (PDT) X-Google-Smtp-Source: ACcGV61yXiFlwaqVOe7xLC59VcHU64qAHH7fYI+UPFXNy6WlhT0rVU9svpGbGB90b2xUPt14MUqy X-Received: by 2002:a63:ec4b:: with SMTP id r11-v6mr12389404pgj.295.1538794667846; Fri, 05 Oct 2018 19:57:47 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1538794667; cv=none; d=google.com; s=arc-20160816; b=iQx1zfATi5hczHaZ7CEdOdXiwRYJrPvMRrCJ+JJoK0JC0d8Ovf2nns8LerbVgQV0Ap POJJgNQJV8e0mQuNqf+ZPG4vLNbqEG3Qo26rESFJ+Bs3hytgDOFXClJhxf4c25EJPD0C Ywk4mcxklCBjWJrS6jSrdihHziRiBXhRo6Ut1aA/FP2SGeA9Z8BIvlbNWVj1utcRmuvg wTLsMi1bYHy2F7iyq2sq3OY2d/6u2RgdY8fxGHEY6gxzfEUZ6NqyRXt/VIEJEvjXXDuQ iM6M9FgunLNK1AOmJeBx1CZNJeI9TXVciqbXFMnI+XxpuQf+uTwWCRtyRsf8V2+45GJr RSVw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from :dkim-signature; bh=QItzX0eyHOILWwkGCERLg+zvwMi0B1uW8CyX/Ell7XA=; b=L8EZgkgg4iH/KisB5I5XBjW0TquqXbKM+8ETbcN65kW0pUhSiuB69Sl6cqtz92K0AE DAgag09Ok4Oi489yOdPgIjC8pGtLsWsNoROZ1KA2jB9bWzKaiZq9pex8xqlXMCXgjjdJ 9RwqRrF/4D54AX4zYMlemu3qM0Uf5gkfDSQ9EqCpu6qapj7HEecLkWdESV/J3pdbOdtn ezpr7EdsX4u+hFkI8aAzUzToCrKc00WM47ARtiRPu+NlHKeFRXanQKk60hBbEaF1pvH0 dZVYU7yFsX1xCkdSQUQnzV+p9fqY8LzVHOqx6D6tytcBW8wHwlkUJGS4iq1K9QdjhXyy KXMA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@zx2c4.com header.s=mail header.b=BGKv4SvG; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=zx2c4.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. 
From: "Jason A. Donenfeld" <Jason@zx2c4.com>
To: linux-kernel@vger.kernel.org, netdev@vger.kernel.org,
    davem@davemloft.net, gregkh@linuxfoundation.org
Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>, Samuel Neves,
    Jean-Philippe Aumasson, Andy Lutomirski, Andrew Morton,
    Linus Torvalds, kernel-hardening@lists.openwall.com,
    linux-crypto@vger.kernel.org
Subject: [PATCH net-next v7 04/28] zinc: ChaCha20 generic C implementation
 and selftest
Date: Sat, 6 Oct 2018 04:56:45 +0200
Message-Id: <20181006025709.4019-5-Jason@zx2c4.com>
In-Reply-To: <20181006025709.4019-1-Jason@zx2c4.com>
References: <20181006025709.4019-1-Jason@zx2c4.com>

This implements the ChaCha20 permutation as a single C statement, by way
of the comma operator, which the compiler is able to simplify
terrifically.

Information: https://cr.yp.to/chacha.html

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Samuel Neves
Cc: Jean-Philippe Aumasson
Cc: Andy Lutomirski
Cc: Greg KH
Cc: Andrew Morton
Cc: Linus Torvalds
Cc: kernel-hardening@lists.openwall.com
Cc: linux-crypto@vger.kernel.org
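For readers unfamiliar with the comma-operator trick the message refers to:
each round macro in the diff below expands to an expression rather than a
statement block, so rounds can be chained with commas and the whole
twenty-round permutation remains a single C statement that the optimizer
flattens into straight-line code. A stripped-down sketch of the pattern,
with invented names (QR_HALF, TWO_HALVES, toy_rounds) and only a fragment
of a real quarter round:

    #include <stdint.h>

    static inline uint32_t rol32(uint32_t v, int n)
    {
            return (v << n) | (v >> (32 - n));
    }

    /* An expression, not a block: usable inside a larger comma chain. */
    #define QR_HALF(x, a, b, d) \
            (x[a] += x[b], x[d] = rol32(x[d] ^ x[a], 16))

    /* Two chained applications still form one expression. */
    #define TWO_HALVES(x) (QR_HALF(x, 0, 4, 12), QR_HALF(x, 1, 5, 13))

    void toy_rounds(uint32_t x[16])
    {
            TWO_HALVES(x); /* one statement, however many rounds chained */
    }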
---
 include/zinc/chacha20.h      |   70 +
 lib/zinc/Kconfig             |    4 +
 lib/zinc/Makefile            |    3 +
 lib/zinc/chacha20/chacha20.c |  179 +++
 lib/zinc/selftest/chacha20.c | 2698 ++++++++++++++++++++++++++++++++++
 lib/zinc/selftest/run.h      |   49 +
 6 files changed, 3003 insertions(+)
 create mode 100644 include/zinc/chacha20.h
 create mode 100644 lib/zinc/chacha20/chacha20.c
 create mode 100644 lib/zinc/selftest/chacha20.c
 create mode 100644 lib/zinc/selftest/run.h

-- 
2.19.0

diff --git a/include/zinc/chacha20.h b/include/zinc/chacha20.h
new file mode 100644
index 000000000000..0b98bd6946ae
--- /dev/null
+++ b/include/zinc/chacha20.h
@@ -0,0 +1,70 @@
+/* SPDX-License-Identifier: GPL-2.0 OR MIT */
+/*
+ * Copyright (C) 2015-2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ */
+
+#ifndef _ZINC_CHACHA20_H
+#define _ZINC_CHACHA20_H
+
+#include <asm/unaligned.h>
+#include <linux/simd.h>
+#include <linux/kernel.h>
+#include <linux/types.h>
+
+enum {
+        CHACHA20_NONCE_SIZE = 16,
+        CHACHA20_KEY_SIZE = 32,
+        CHACHA20_KEY_WORDS = CHACHA20_KEY_SIZE / sizeof(u32),
+        CHACHA20_BLOCK_SIZE = 64,
+        CHACHA20_BLOCK_WORDS = CHACHA20_BLOCK_SIZE / sizeof(u32),
+        HCHACHA20_NONCE_SIZE = CHACHA20_NONCE_SIZE,
+        HCHACHA20_KEY_SIZE = CHACHA20_KEY_SIZE
+};
+
+enum { /* expand 32-byte k */
+        CHACHA20_CONSTANT_EXPA = 0x61707865U,
+        CHACHA20_CONSTANT_ND_3 = 0x3320646eU,
+        CHACHA20_CONSTANT_2_BY = 0x79622d32U,
+        CHACHA20_CONSTANT_TE_K = 0x6b206574U
+};
+
+struct chacha20_ctx {
+        union {
+                u32 state[16];
+                struct {
+                        u32 constant[4];
+                        u32 key[8];
+                        u32 counter[4];
+                };
+        };
+};
+
+static inline void chacha20_init(struct chacha20_ctx *ctx,
+                                 const u8 key[CHACHA20_KEY_SIZE],
+                                 const u64 nonce)
+{
+        ctx->constant[0] = CHACHA20_CONSTANT_EXPA;
+        ctx->constant[1] = CHACHA20_CONSTANT_ND_3;
+        ctx->constant[2] = CHACHA20_CONSTANT_2_BY;
+        ctx->constant[3] = CHACHA20_CONSTANT_TE_K;
+        ctx->key[0] = get_unaligned_le32(key + 0);
+        ctx->key[1] = get_unaligned_le32(key + 4);
+        ctx->key[2] = get_unaligned_le32(key + 8);
+        ctx->key[3] = get_unaligned_le32(key + 12);
+        ctx->key[4] = get_unaligned_le32(key + 16);
+        ctx->key[5] = get_unaligned_le32(key + 20);
+        ctx->key[6] = get_unaligned_le32(key + 24);
+        ctx->key[7] = get_unaligned_le32(key + 28);
+        ctx->counter[0] = 0;
+        ctx->counter[1] = 0;
+        ctx->counter[2] = nonce & U32_MAX;
+        ctx->counter[3] = nonce >> 32;
+}
+
+void chacha20(struct chacha20_ctx *ctx, u8 *dst, const u8 *src, u32 len,
+              simd_context_t *simd_context);
+
+void hchacha20(u32 derived_key[CHACHA20_KEY_WORDS],
+               const u8 nonce[HCHACHA20_NONCE_SIZE],
+               const u8 key[HCHACHA20_KEY_SIZE], simd_context_t *simd_context);
+
+#endif /* _ZINC_CHACHA20_H */
diff --git a/lib/zinc/Kconfig b/lib/zinc/Kconfig
index 90e066ea93a0..d271be37cecb 100644
--- a/lib/zinc/Kconfig
+++ b/lib/zinc/Kconfig
@@ -1,3 +1,7 @@
+config ZINC_CHACHA20
+        tristate
+        select CRYPTO_ALGAPI
+
 config ZINC_SELFTEST
         bool "Zinc cryptography library self-tests"
         help
diff --git a/lib/zinc/Makefile b/lib/zinc/Makefile
index a61c80d676cb..3d80144d55a6 100644
--- a/lib/zinc/Makefile
+++ b/lib/zinc/Makefile
@@ -1,3 +1,6 @@
 ccflags-y := -O2
 ccflags-y += -D'pr_fmt(fmt)="zinc: " fmt'
 ccflags-$(CONFIG_ZINC_DEBUG) += -DDEBUG
+
+zinc_chacha20-y := chacha20/chacha20.o
+obj-$(CONFIG_ZINC_CHACHA20) += zinc_chacha20.o
diff --git a/lib/zinc/chacha20/chacha20.c b/lib/zinc/chacha20/chacha20.c
new file mode 100644
index 000000000000..03209c15d1ca
--- /dev/null
+++ b/lib/zinc/chacha20/chacha20.c
@@ -0,0 +1,179 @@
+// SPDX-License-Identifier: GPL-2.0 OR MIT
+/*
+ * Copyright (C) 2015-2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ *
+ * Implementation of the ChaCha20 stream cipher.
+ *
+ * Information: https://cr.yp.to/chacha.html
+ */
+
+#include <zinc/chacha20.h>
+#include "../selftest/run.h"
+
+#include <asm/unaligned.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <crypto/algapi.h> // For crypto_xor_cpy.
+
+static bool *const chacha20_nobs[] __initconst = { };
+static void __init chacha20_fpu_init(void)
+{
+}
+static inline bool chacha20_arch(struct chacha20_ctx *ctx, u8 *dst,
+                                 const u8 *src, size_t len,
+                                 simd_context_t *simd_context)
+{
+        return false;
+}
+static inline bool hchacha20_arch(u32 derived_key[CHACHA20_KEY_WORDS],
+                                  const u8 nonce[HCHACHA20_NONCE_SIZE],
+                                  const u8 key[HCHACHA20_KEY_SIZE],
+                                  simd_context_t *simd_context)
+{
+        return false;
+}
+
+#define QUARTER_ROUND(x, a, b, c, d) ( \
+        x[a] += x[b], \
+        x[d] = rol32((x[d] ^ x[a]), 16), \
+        x[c] += x[d], \
+        x[b] = rol32((x[b] ^ x[c]), 12), \
+        x[a] += x[b], \
+        x[d] = rol32((x[d] ^ x[a]), 8), \
+        x[c] += x[d], \
+        x[b] = rol32((x[b] ^ x[c]), 7) \
+)
+
+#define C(i, j) (i * 4 + j)
+
+#define DOUBLE_ROUND(x) ( \
+        /* Column Round */ \
+        QUARTER_ROUND(x, C(0, 0), C(1, 0), C(2, 0), C(3, 0)), \
+        QUARTER_ROUND(x, C(0, 1), C(1, 1), C(2, 1), C(3, 1)), \
+        QUARTER_ROUND(x, C(0, 2), C(1, 2), C(2, 2), C(3, 2)), \
+        QUARTER_ROUND(x, C(0, 3), C(1, 3), C(2, 3), C(3, 3)), \
+        /* Diagonal Round */ \
+        QUARTER_ROUND(x, C(0, 0), C(1, 1), C(2, 2), C(3, 3)), \
+        QUARTER_ROUND(x, C(0, 1), C(1, 2), C(2, 3), C(3, 0)), \
+        QUARTER_ROUND(x, C(0, 2), C(1, 3), C(2, 0), C(3, 1)), \
+        QUARTER_ROUND(x, C(0, 3), C(1, 0), C(2, 1), C(3, 2)) \
+)
+
+#define TWENTY_ROUNDS(x) ( \
+        DOUBLE_ROUND(x), \
+        DOUBLE_ROUND(x), \
+        DOUBLE_ROUND(x), \
+        DOUBLE_ROUND(x), \
+        DOUBLE_ROUND(x), \
+        DOUBLE_ROUND(x), \
+        DOUBLE_ROUND(x), \
+        DOUBLE_ROUND(x), \
+        DOUBLE_ROUND(x), \
+        DOUBLE_ROUND(x) \
+)
+
+static void chacha20_block_generic(struct chacha20_ctx *ctx, __le32 *stream)
+{
+        u32 x[CHACHA20_BLOCK_WORDS];
+        int i;
+
+        for (i = 0; i < ARRAY_SIZE(x); ++i)
+                x[i] = ctx->state[i];
+
+        TWENTY_ROUNDS(x);
+
+        for (i = 0; i < ARRAY_SIZE(x); ++i)
+                stream[i] = cpu_to_le32(x[i] + ctx->state[i]);
+
+        ctx->counter[0] += 1;
+}
+
+static void chacha20_generic(struct chacha20_ctx *ctx, u8 *out, const u8 *in,
+                             u32 len)
+{
+        __le32 buf[CHACHA20_BLOCK_WORDS];
+
+        while (len >= CHACHA20_BLOCK_SIZE) {
+                chacha20_block_generic(ctx, buf);
+                crypto_xor_cpy(out, in, (u8 *)buf, CHACHA20_BLOCK_SIZE);
+                len -= CHACHA20_BLOCK_SIZE;
+                out += CHACHA20_BLOCK_SIZE;
+                in += CHACHA20_BLOCK_SIZE;
+        }
+        if (len) {
+                chacha20_block_generic(ctx, buf);
+                crypto_xor_cpy(out, in, (u8 *)buf, len);
+        }
+}
+
+void chacha20(struct chacha20_ctx *ctx, u8 *dst, const u8 *src, u32 len,
+              simd_context_t *simd_context)
+{
+        if (!chacha20_arch(ctx, dst, src, len, simd_context))
+                chacha20_generic(ctx, dst, src, len);
+}
+EXPORT_SYMBOL(chacha20);
+
+static void hchacha20_generic(u32 derived_key[CHACHA20_KEY_WORDS],
+                              const u8 nonce[HCHACHA20_NONCE_SIZE],
+                              const u8 key[HCHACHA20_KEY_SIZE])
+{
+        u32 x[] = { CHACHA20_CONSTANT_EXPA,
+                    CHACHA20_CONSTANT_ND_3,
+                    CHACHA20_CONSTANT_2_BY,
+                    CHACHA20_CONSTANT_TE_K,
+                    get_unaligned_le32(key + 0),
+                    get_unaligned_le32(key + 4),
+                    get_unaligned_le32(key + 8),
+                    get_unaligned_le32(key + 12),
+                    get_unaligned_le32(key + 16),
+                    get_unaligned_le32(key + 20),
+                    get_unaligned_le32(key + 24),
+                    get_unaligned_le32(key + 28),
+                    get_unaligned_le32(nonce + 0),
+                    get_unaligned_le32(nonce + 4),
+                    get_unaligned_le32(nonce + 8),
+                    get_unaligned_le32(nonce + 12)
+        };
+
+        TWENTY_ROUNDS(x);
+
+        memcpy(derived_key + 0, x + 0, sizeof(u32) * 4);
+        memcpy(derived_key + 4, x + 12, sizeof(u32) * 4);
+}
+
+/* Derived key should be 32-bit aligned */
+void hchacha20(u32 derived_key[CHACHA20_KEY_WORDS],
+               const u8 nonce[HCHACHA20_NONCE_SIZE],
+               const u8 key[HCHACHA20_KEY_SIZE], simd_context_t *simd_context)
+{
+        if (!hchacha20_arch(derived_key, nonce, key, simd_context))
+                hchacha20_generic(derived_key, nonce, key);
+}
+EXPORT_SYMBOL(hchacha20);
+
+#include "../selftest/chacha20.c"
+
+static bool nosimd __initdata = false;
+
+static int __init mod_init(void)
+{
+        if (!nosimd)
+                chacha20_fpu_init();
+        if (!selftest_run("chacha20", chacha20_selftest, chacha20_nobs,
+                          ARRAY_SIZE(chacha20_nobs)))
+                return -ENOTRECOVERABLE;
+        return 0;
+}
+
+static void __exit mod_exit(void)
+{
+}
+
+module_param(nosimd, bool, 0);
+module_init(mod_init);
+module_exit(mod_exit);
+MODULE_LICENSE("GPL v2");
+MODULE_DESCRIPTION("ChaCha20 stream cipher");
+MODULE_AUTHOR("Jason A. Donenfeld <Jason@zx2c4.com>");
diff --git a/lib/zinc/selftest/chacha20.c b/lib/zinc/selftest/chacha20.c
new file mode 100644
index 000000000000..b8c9c709071d
--- /dev/null
+++ b/lib/zinc/selftest/chacha20.c
@@ -0,0 +1,2698 @@
+// SPDX-License-Identifier: GPL-2.0 OR MIT
+/*
+ * Copyright (C) 2015-2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ */
+
+struct chacha20_testvec {
+        const u8 *input, *output, *key;
+        u64 nonce;
+        size_t ilen;
+};
+
+struct hchacha20_testvec {
+        u8 key[HCHACHA20_KEY_SIZE];
+        u8 nonce[HCHACHA20_NONCE_SIZE];
+        u8 output[CHACHA20_KEY_SIZE];
+};
+
+/* These test vectors are generated by reference implementations and are
+ * designed to check chacha20 implementation block handling, as well as from
+ * the draft-arciszewski-xchacha-01 document.
+ */ + +static const u8 input01[] __initconst = { }; +static const u8 output01[] __initconst = { }; +static const u8 key01[] __initconst = { + 0x09, 0xf4, 0xe8, 0x57, 0x10, 0xf2, 0x12, 0xc3, + 0xc6, 0x91, 0xc4, 0x09, 0x97, 0x46, 0xef, 0xfe, + 0x02, 0x00, 0xe4, 0x5c, 0x82, 0xed, 0x16, 0xf3, + 0x32, 0xbe, 0xec, 0x7a, 0xe6, 0x68, 0x12, 0x26 +}; +enum { nonce01 = 0x3834e2afca3c66d3ULL }; + +static const u8 input02[] __initconst = { + 0x9d +}; +static const u8 output02[] __initconst = { + 0x94 +}; +static const u8 key02[] __initconst = { + 0x8c, 0x01, 0xac, 0xaf, 0x62, 0x63, 0x56, 0x7a, + 0xad, 0x23, 0x4c, 0x58, 0x29, 0x29, 0xbe, 0xab, + 0xe9, 0xf8, 0xdf, 0x6c, 0x8c, 0x74, 0x4d, 0x7d, + 0x13, 0x94, 0x10, 0x02, 0x3d, 0x8e, 0x9f, 0x94 +}; +enum { nonce02 = 0x5d1b3bfdedd9f73aULL }; + +static const u8 input03[] __initconst = { + 0x04, 0x16 +}; +static const u8 output03[] __initconst = { + 0x92, 0x07 +}; +static const u8 key03[] __initconst = { + 0x22, 0x0c, 0x79, 0x2c, 0x38, 0x51, 0xbe, 0x99, + 0xa9, 0x59, 0x24, 0x50, 0xef, 0x87, 0x38, 0xa6, + 0xa0, 0x97, 0x20, 0xcb, 0xb4, 0x0c, 0x94, 0x67, + 0x1f, 0x98, 0xdc, 0xc4, 0x83, 0xbc, 0x35, 0x4d +}; +enum { nonce03 = 0x7a3353ad720a3e2eULL }; + +static const u8 input04[] __initconst = { + 0xc7, 0xcc, 0xd0 +}; +static const u8 output04[] __initconst = { + 0xd8, 0x41, 0x80 +}; +static const u8 key04[] __initconst = { + 0x81, 0x5e, 0x12, 0x01, 0xc4, 0x36, 0x15, 0x03, + 0x11, 0xa0, 0xe9, 0x86, 0xbb, 0x5a, 0xdc, 0x45, + 0x7d, 0x5e, 0x98, 0xf8, 0x06, 0x76, 0x1c, 0xec, + 0xc0, 0xf7, 0xca, 0x4e, 0x99, 0xd9, 0x42, 0x38 +}; +enum { nonce04 = 0x6816e2fc66176da2ULL }; + +static const u8 input05[] __initconst = { + 0x48, 0xf1, 0x31, 0x5f +}; +static const u8 output05[] __initconst = { + 0x48, 0xf7, 0x13, 0x67 +}; +static const u8 key05[] __initconst = { + 0x3f, 0xd6, 0xb6, 0x5e, 0x2f, 0xda, 0x82, 0x39, + 0x97, 0x06, 0xd3, 0x62, 0x4f, 0xbd, 0xcb, 0x9b, + 0x1d, 0xe6, 0x4a, 0x76, 0xab, 0xdd, 0x14, 0x50, + 0x59, 0x21, 0xe3, 0xb2, 0xc7, 0x95, 0xbc, 0x45 +}; +enum { nonce05 = 0xc41a7490e228cc42ULL }; + +static const u8 input06[] __initconst = { + 0xae, 0xa2, 0x85, 0x1d, 0xc8 +}; +static const u8 output06[] __initconst = { + 0xfa, 0xff, 0x45, 0x6b, 0x6f +}; +static const u8 key06[] __initconst = { + 0x04, 0x8d, 0xea, 0x67, 0x20, 0x78, 0xfb, 0x8f, + 0x49, 0x80, 0x35, 0xb5, 0x7b, 0xe4, 0x31, 0x74, + 0x57, 0x43, 0x3a, 0x64, 0x64, 0xb9, 0xe6, 0x23, + 0x4d, 0xfe, 0xb8, 0x7b, 0x71, 0x4d, 0x9d, 0x21 +}; +enum { nonce06 = 0x251366db50b10903ULL }; + +static const u8 input07[] __initconst = { + 0x1a, 0x32, 0x85, 0xb6, 0xe8, 0x52 +}; +static const u8 output07[] __initconst = { + 0xd3, 0x5f, 0xf0, 0x07, 0x69, 0xec +}; +static const u8 key07[] __initconst = { + 0xbf, 0x2d, 0x42, 0x99, 0x97, 0x76, 0x04, 0xad, + 0xd3, 0x8f, 0x6e, 0x6a, 0x34, 0x85, 0xaf, 0x81, + 0xef, 0x36, 0x33, 0xd5, 0x43, 0xa2, 0xaa, 0x08, + 0x0f, 0x77, 0x42, 0x83, 0x58, 0xc5, 0x42, 0x2a +}; +enum { nonce07 = 0xe0796da17dba9b58ULL }; + +static const u8 input08[] __initconst = { + 0x40, 0xae, 0xcd, 0xe4, 0x3d, 0x22, 0xe0 +}; +static const u8 output08[] __initconst = { + 0xfd, 0x8a, 0x9f, 0x3d, 0x05, 0xc9, 0xd3 +}; +static const u8 key08[] __initconst = { + 0xdc, 0x3f, 0x41, 0xe3, 0x23, 0x2a, 0x8d, 0xf6, + 0x41, 0x2a, 0xa7, 0x66, 0x05, 0x68, 0xe4, 0x7b, + 0xc4, 0x58, 0xd6, 0xcc, 0xdf, 0x0d, 0xc6, 0x25, + 0x1b, 0x61, 0x32, 0x12, 0x4e, 0xf1, 0xe6, 0x29 +}; +enum { nonce08 = 0xb1d2536d9e159832ULL }; + +static const u8 input09[] __initconst = { + 0xba, 0x1d, 0x14, 0x16, 0x9f, 0x83, 0x67, 0x24 +}; +static const u8 output09[] 
__initconst = { + 0x7c, 0xe3, 0x78, 0x1d, 0xa2, 0xe7, 0xe9, 0x39 +}; +static const u8 key09[] __initconst = { + 0x17, 0x55, 0x90, 0x52, 0xa4, 0xce, 0x12, 0xae, + 0xd4, 0xfd, 0xd4, 0xfb, 0xd5, 0x18, 0x59, 0x50, + 0x4e, 0x51, 0x99, 0x32, 0x09, 0x31, 0xfc, 0xf7, + 0x27, 0x10, 0x8e, 0xa2, 0x4b, 0xa5, 0xf5, 0x62 +}; +enum { nonce09 = 0x495fc269536d003ULL }; + +static const u8 input10[] __initconst = { + 0x09, 0xfd, 0x3c, 0x0b, 0x3d, 0x0e, 0xf3, 0x9d, + 0x27 +}; +static const u8 output10[] __initconst = { + 0xdc, 0xe4, 0x33, 0x60, 0x0c, 0x07, 0xcb, 0x51, + 0x6b +}; +static const u8 key10[] __initconst = { + 0x4e, 0x00, 0x72, 0x37, 0x0f, 0x52, 0x4d, 0x6f, + 0x37, 0x50, 0x3c, 0xb3, 0x51, 0x81, 0x49, 0x16, + 0x7e, 0xfd, 0xb1, 0x51, 0x72, 0x2e, 0xe4, 0x16, + 0x68, 0x5c, 0x5b, 0x8a, 0xc3, 0x90, 0x70, 0x04 +}; +enum { nonce10 = 0x1ad9d1114d88cbbdULL }; + +static const u8 input11[] __initconst = { + 0x70, 0x18, 0x52, 0x85, 0xba, 0x66, 0xff, 0x2c, + 0x9a, 0x46 +}; +static const u8 output11[] __initconst = { + 0xf5, 0x2a, 0x7a, 0xfd, 0x31, 0x7c, 0x91, 0x41, + 0xb1, 0xcf +}; +static const u8 key11[] __initconst = { + 0x48, 0xb4, 0xd0, 0x7c, 0x88, 0xd1, 0x96, 0x0d, + 0x80, 0x33, 0xb4, 0xd5, 0x31, 0x9a, 0x88, 0xca, + 0x14, 0xdc, 0xf0, 0xa8, 0xf3, 0xac, 0xb8, 0x47, + 0x75, 0x86, 0x7c, 0x88, 0x50, 0x11, 0x43, 0x40 +}; +enum { nonce11 = 0x47c35dd1f4f8aa4fULL }; + +static const u8 input12[] __initconst = { + 0x9e, 0x8e, 0x3d, 0x2a, 0x05, 0xfd, 0xe4, 0x90, + 0x24, 0x1c, 0xd3 +}; +static const u8 output12[] __initconst = { + 0x97, 0x72, 0x40, 0x9f, 0xc0, 0x6b, 0x05, 0x33, + 0x42, 0x7e, 0x28 +}; +static const u8 key12[] __initconst = { + 0xee, 0xff, 0x33, 0x33, 0xe0, 0x28, 0xdf, 0xa2, + 0xb6, 0x5e, 0x25, 0x09, 0x52, 0xde, 0xa5, 0x9c, + 0x8f, 0x95, 0xa9, 0x03, 0x77, 0x0f, 0xbe, 0xa1, + 0xd0, 0x7d, 0x73, 0x2f, 0xf8, 0x7e, 0x51, 0x44 +}; +enum { nonce12 = 0xc22d044dc6ea4af3ULL }; + +static const u8 input13[] __initconst = { + 0x9c, 0x16, 0xa2, 0x22, 0x4d, 0xbe, 0x04, 0x9a, + 0xb3, 0xb5, 0xc6, 0x58 +}; +static const u8 output13[] __initconst = { + 0xf0, 0x81, 0xdb, 0x6d, 0xa3, 0xe9, 0xb2, 0xc6, + 0x32, 0x50, 0x16, 0x9f +}; +static const u8 key13[] __initconst = { + 0x96, 0xb3, 0x01, 0xd2, 0x7a, 0x8c, 0x94, 0x09, + 0x4f, 0x58, 0xbe, 0x80, 0xcc, 0xa9, 0x7e, 0x2d, + 0xad, 0x58, 0x3b, 0x63, 0xb8, 0x5c, 0x17, 0xce, + 0xbf, 0x43, 0x33, 0x7a, 0x7b, 0x82, 0x28, 0x2f +}; +enum { nonce13 = 0x2a5d05d88cd7b0daULL }; + +static const u8 input14[] __initconst = { + 0x57, 0x4f, 0xaa, 0x30, 0xe6, 0x23, 0x50, 0x86, + 0x91, 0xa5, 0x60, 0x96, 0x2b +}; +static const u8 output14[] __initconst = { + 0x6c, 0x1f, 0x3b, 0x42, 0xb6, 0x2f, 0xf0, 0xbd, + 0x76, 0x60, 0xc7, 0x7e, 0x8d +}; +static const u8 key14[] __initconst = { + 0x22, 0x85, 0xaf, 0x8f, 0xa3, 0x53, 0xa0, 0xc4, + 0xb5, 0x75, 0xc0, 0xba, 0x30, 0x92, 0xc3, 0x32, + 0x20, 0x5a, 0x8f, 0x7e, 0x93, 0xda, 0x65, 0x18, + 0xd1, 0xf6, 0x9a, 0x9b, 0x8f, 0x85, 0x30, 0xe6 +}; +enum { nonce14 = 0xf9946c166aa4475fULL }; + +static const u8 input15[] __initconst = { + 0x89, 0x81, 0xc7, 0xe2, 0x00, 0xac, 0x52, 0x70, + 0xa4, 0x79, 0xab, 0xeb, 0x74, 0xf7 +}; +static const u8 output15[] __initconst = { + 0xb4, 0xd0, 0xa9, 0x9d, 0x15, 0x5f, 0x48, 0xd6, + 0x00, 0x7e, 0x4c, 0x77, 0x5a, 0x46 +}; +static const u8 key15[] __initconst = { + 0x0a, 0x66, 0x36, 0xca, 0x5d, 0x82, 0x23, 0xb6, + 0xe4, 0x9b, 0xad, 0x5e, 0xd0, 0x7f, 0xf6, 0x7a, + 0x7b, 0x03, 0xa7, 0x4c, 0xfd, 0xec, 0xd5, 0xa1, + 0xfc, 0x25, 0x54, 0xda, 0x5a, 0x5c, 0xf0, 0x2c +}; +enum { nonce15 = 0x9ab2b87a35e772c8ULL }; + +static const u8 input16[] 
__initconst = { + 0x5f, 0x09, 0xc0, 0x8b, 0x1e, 0xde, 0xca, 0xd9, + 0xb7, 0x5c, 0x23, 0xc9, 0x55, 0x1e, 0xcf +}; +static const u8 output16[] __initconst = { + 0x76, 0x9b, 0x53, 0xf3, 0x66, 0x88, 0x28, 0x60, + 0x98, 0x80, 0x2c, 0xa8, 0x80, 0xa6, 0x48 +}; +static const u8 key16[] __initconst = { + 0x80, 0xb5, 0x51, 0xdf, 0x17, 0x5b, 0xb0, 0xef, + 0x8b, 0x5b, 0x2e, 0x3e, 0xc5, 0xe3, 0xa5, 0x86, + 0xac, 0x0d, 0x8e, 0x32, 0x90, 0x9d, 0x82, 0x27, + 0xf1, 0x23, 0x26, 0xc3, 0xea, 0x55, 0xb6, 0x63 +}; +enum { nonce16 = 0xa82e9d39e4d02ef5ULL }; + +static const u8 input17[] __initconst = { + 0x87, 0x0b, 0x36, 0x71, 0x7c, 0xb9, 0x0b, 0x80, + 0x4d, 0x77, 0x5c, 0x4f, 0xf5, 0x51, 0x0e, 0x1a +}; +static const u8 output17[] __initconst = { + 0xf1, 0x12, 0x4a, 0x8a, 0xd9, 0xd0, 0x08, 0x67, + 0x66, 0xd7, 0x34, 0xea, 0x32, 0x3b, 0x54, 0x0e +}; +static const u8 key17[] __initconst = { + 0xfb, 0x71, 0x5f, 0x3f, 0x7a, 0xc0, 0x9a, 0xc8, + 0xc8, 0xcf, 0xe8, 0xbc, 0xfb, 0x09, 0xbf, 0x89, + 0x6a, 0xef, 0xd5, 0xe5, 0x36, 0x87, 0x14, 0x76, + 0x00, 0xb9, 0x32, 0x28, 0xb2, 0x00, 0x42, 0x53 +}; +enum { nonce17 = 0x229b87e73d557b96ULL }; + +static const u8 input18[] __initconst = { + 0x38, 0x42, 0xb5, 0x37, 0xb4, 0x3d, 0xfe, 0x59, + 0x38, 0x68, 0x88, 0xfa, 0x89, 0x8a, 0x5f, 0x90, + 0x3c +}; +static const u8 output18[] __initconst = { + 0xac, 0xad, 0x14, 0xe8, 0x7e, 0xd7, 0xce, 0x96, + 0x3d, 0xb3, 0x78, 0x85, 0x22, 0x5a, 0xcb, 0x39, + 0xd4 +}; +static const u8 key18[] __initconst = { + 0xe1, 0xc1, 0xa8, 0xe0, 0x91, 0xe7, 0x38, 0x66, + 0x80, 0x17, 0x12, 0x3c, 0x5e, 0x2d, 0xbb, 0xea, + 0xeb, 0x6c, 0x8b, 0xc8, 0x1b, 0x6f, 0x7c, 0xea, + 0x50, 0x57, 0x23, 0x1e, 0x65, 0x6f, 0x6d, 0x81 +}; +enum { nonce18 = 0xfaf5fcf8f30e57a9ULL }; + +static const u8 input19[] __initconst = { + 0x1c, 0x4a, 0x30, 0x26, 0xef, 0x9a, 0x32, 0xa7, + 0x8f, 0xe5, 0xc0, 0x0f, 0x30, 0x3a, 0xbf, 0x38, + 0x54, 0xba +}; +static const u8 output19[] __initconst = { + 0x57, 0x67, 0x54, 0x4f, 0x31, 0xd6, 0xef, 0x35, + 0x0b, 0xd9, 0x52, 0xa7, 0x46, 0x7d, 0x12, 0x17, + 0x1e, 0xe3 +}; +static const u8 key19[] __initconst = { + 0x5a, 0x79, 0xc1, 0xea, 0x33, 0xb3, 0xc7, 0x21, + 0xec, 0xf8, 0xcb, 0xd2, 0x58, 0x96, 0x23, 0xd6, + 0x4d, 0xed, 0x2f, 0xdf, 0x8a, 0x79, 0xe6, 0x8b, + 0x38, 0xa3, 0xc3, 0x7a, 0x33, 0xda, 0x02, 0xc7 +}; +enum { nonce19 = 0x2b23b61840429604ULL }; + +static const u8 input20[] __initconst = { + 0xab, 0xe9, 0x32, 0xbb, 0x35, 0x17, 0xe0, 0x60, + 0x80, 0xb1, 0x27, 0xdc, 0xe6, 0x62, 0x9e, 0x0c, + 0x77, 0xf4, 0x50 +}; +static const u8 output20[] __initconst = { + 0x54, 0x6d, 0xaa, 0xfc, 0x08, 0xfb, 0x71, 0xa8, + 0xd6, 0x1d, 0x7d, 0xf3, 0x45, 0x10, 0xb5, 0x4c, + 0xcc, 0x4b, 0x45 +}; +static const u8 key20[] __initconst = { + 0xa3, 0xfd, 0x3d, 0xa9, 0xeb, 0xea, 0x2c, 0x69, + 0xcf, 0x59, 0x38, 0x13, 0x5b, 0xa7, 0x53, 0x8f, + 0x5e, 0xa2, 0x33, 0x86, 0x4c, 0x75, 0x26, 0xaf, + 0x35, 0x12, 0x09, 0x71, 0x81, 0xea, 0x88, 0x66 +}; +enum { nonce20 = 0x7459667a8fadff58ULL }; + +static const u8 input21[] __initconst = { + 0xa6, 0x82, 0x21, 0x23, 0xad, 0x27, 0x3f, 0xc6, + 0xd7, 0x16, 0x0d, 0x6d, 0x24, 0x15, 0x54, 0xc5, + 0x96, 0x72, 0x59, 0x8a +}; +static const u8 output21[] __initconst = { + 0x5f, 0x34, 0x32, 0xea, 0x06, 0xd4, 0x9e, 0x01, + 0xdc, 0x32, 0x32, 0x40, 0x66, 0x73, 0x6d, 0x4a, + 0x6b, 0x12, 0x20, 0xe8 +}; +static const u8 key21[] __initconst = { + 0x96, 0xfd, 0x13, 0x23, 0xa9, 0x89, 0x04, 0xe6, + 0x31, 0xa5, 0x2c, 0xc1, 0x40, 0xd5, 0x69, 0x5c, + 0x32, 0x79, 0x56, 0xe0, 0x29, 0x93, 0x8f, 0xe8, + 0x5f, 0x65, 0x53, 0x7f, 0xc1, 0xe9, 0xaf, 0xaf +}; +enum { 
nonce21 = 0xba8defee9d8e13b5ULL }; + +static const u8 input22[] __initconst = { + 0xb8, 0x32, 0x1a, 0x81, 0xd8, 0x38, 0x89, 0x5a, + 0xb0, 0x05, 0xbe, 0xf4, 0xd2, 0x08, 0xc6, 0xee, + 0x79, 0x7b, 0x3a, 0x76, 0x59 +}; +static const u8 output22[] __initconst = { + 0xb7, 0xba, 0xae, 0x80, 0xe4, 0x9f, 0x79, 0x84, + 0x5a, 0x48, 0x50, 0x6d, 0xcb, 0xd0, 0x06, 0x0c, + 0x15, 0x63, 0xa7, 0x5e, 0xbd +}; +static const u8 key22[] __initconst = { + 0x0f, 0x35, 0x3d, 0xeb, 0x5f, 0x0a, 0x82, 0x0d, + 0x24, 0x59, 0x71, 0xd8, 0xe6, 0x2d, 0x5f, 0xe1, + 0x7e, 0x0c, 0xae, 0xf6, 0xdc, 0x2c, 0xc5, 0x4a, + 0x38, 0x88, 0xf2, 0xde, 0xd9, 0x5f, 0x76, 0x7c +}; +enum { nonce22 = 0xe77f1760e9f5e192ULL }; + +static const u8 input23[] __initconst = { + 0x4b, 0x1e, 0x79, 0x99, 0xcf, 0xef, 0x64, 0x4b, + 0xb0, 0x66, 0xae, 0x99, 0x2e, 0x68, 0x97, 0xf5, + 0x5d, 0x9b, 0x3f, 0x7a, 0xa9, 0xd9 +}; +static const u8 output23[] __initconst = { + 0x5f, 0xa4, 0x08, 0x39, 0xca, 0xfa, 0x2b, 0x83, + 0x5d, 0x95, 0x70, 0x7c, 0x2e, 0xd4, 0xae, 0xfa, + 0x45, 0x4a, 0x77, 0x7f, 0xa7, 0x65 +}; +static const u8 key23[] __initconst = { + 0x4a, 0x06, 0x83, 0x64, 0xaa, 0xe3, 0x38, 0x32, + 0x28, 0x5d, 0xa4, 0xb2, 0x5a, 0xee, 0xcf, 0x8e, + 0x19, 0x67, 0xf1, 0x09, 0xe8, 0xc9, 0xf6, 0x40, + 0x02, 0x6d, 0x0b, 0xde, 0xfa, 0x81, 0x03, 0xb1 +}; +enum { nonce23 = 0x9b3f349158709849ULL }; + +static const u8 input24[] __initconst = { + 0xc6, 0xfc, 0x47, 0x5e, 0xd8, 0xed, 0xa9, 0xe5, + 0x4f, 0x82, 0x79, 0x35, 0xee, 0x3e, 0x7e, 0x3e, + 0x35, 0x70, 0x6e, 0xfa, 0x6d, 0x08, 0xe8 +}; +static const u8 output24[] __initconst = { + 0x3b, 0xc5, 0xf8, 0xc2, 0xbf, 0x2b, 0x90, 0x33, + 0xa6, 0xae, 0xf5, 0x5a, 0x65, 0xb3, 0x3d, 0xe1, + 0xcd, 0x5f, 0x55, 0xfa, 0xe7, 0xa5, 0x4a +}; +static const u8 key24[] __initconst = { + 0x00, 0x24, 0xc3, 0x65, 0x5f, 0xe6, 0x31, 0xbb, + 0x6d, 0xfc, 0x20, 0x7b, 0x1b, 0xa8, 0x96, 0x26, + 0x55, 0x21, 0x62, 0x25, 0x7e, 0xba, 0x23, 0x97, + 0xc9, 0xb8, 0x53, 0xa8, 0xef, 0xab, 0xad, 0x61 +}; +enum { nonce24 = 0x13ee0b8f526177c3ULL }; + +static const u8 input25[] __initconst = { + 0x33, 0x07, 0x16, 0xb1, 0x34, 0x33, 0x67, 0x04, + 0x9b, 0x0a, 0xce, 0x1b, 0xe9, 0xde, 0x1a, 0xec, + 0xd0, 0x55, 0xfb, 0xc6, 0x33, 0xaf, 0x2d, 0xe3 +}; +static const u8 output25[] __initconst = { + 0x05, 0x93, 0x10, 0xd1, 0x58, 0x6f, 0x68, 0x62, + 0x45, 0xdb, 0x91, 0xae, 0x70, 0xcf, 0xd4, 0x5f, + 0xee, 0xdf, 0xd5, 0xba, 0x9e, 0xde, 0x68, 0xe6 +}; +static const u8 key25[] __initconst = { + 0x83, 0xa9, 0x4f, 0x5d, 0x74, 0xd5, 0x91, 0xb3, + 0xc9, 0x97, 0x19, 0x15, 0xdb, 0x0d, 0x0b, 0x4a, + 0x3d, 0x55, 0xcf, 0xab, 0xb2, 0x05, 0x21, 0x35, + 0x45, 0x50, 0xeb, 0xf8, 0xf5, 0xbf, 0x36, 0x35 +}; +enum { nonce25 = 0x7c6f459e49ebfebcULL }; + +static const u8 input26[] __initconst = { + 0xc2, 0xd4, 0x7a, 0xa3, 0x92, 0xe1, 0xac, 0x46, + 0x1a, 0x15, 0x38, 0xc9, 0xb5, 0xfd, 0xdf, 0x84, + 0x38, 0xbc, 0x6b, 0x1d, 0xb0, 0x83, 0x43, 0x04, + 0x39 +}; +static const u8 output26[] __initconst = { + 0x7f, 0xde, 0xd6, 0x87, 0xcc, 0x34, 0xf4, 0x12, + 0xae, 0x55, 0xa5, 0x89, 0x95, 0x29, 0xfc, 0x18, + 0xd8, 0xc7, 0x7c, 0xd3, 0xcb, 0x85, 0x95, 0x21, + 0xd2 +}; +static const u8 key26[] __initconst = { + 0xe4, 0xd0, 0x54, 0x1d, 0x7d, 0x47, 0xa8, 0xc1, + 0x08, 0xca, 0xe2, 0x42, 0x52, 0x95, 0x16, 0x43, + 0xa3, 0x01, 0x23, 0x03, 0xcc, 0x3b, 0x81, 0x78, + 0x23, 0xcc, 0xa7, 0x36, 0xd7, 0xa0, 0x97, 0x8d +}; +enum { nonce26 = 0x524401012231683ULL }; + +static const u8 input27[] __initconst = { + 0x0d, 0xb0, 0xcf, 0xec, 0xfc, 0x38, 0x9d, 0x9d, + 0x89, 0x00, 0x96, 0xf2, 0x79, 0x8a, 0xa1, 0x8d, + 0x32, 0x5e, 0xc6, 
0x12, 0x22, 0xec, 0xf6, 0x52, + 0xc1, 0x0b +}; +static const u8 output27[] __initconst = { + 0xef, 0xe1, 0xf2, 0x67, 0x8e, 0x2c, 0x00, 0x9f, + 0x1d, 0x4c, 0x66, 0x1f, 0x94, 0x58, 0xdc, 0xbb, + 0xb9, 0x11, 0x8f, 0x74, 0xfd, 0x0e, 0x14, 0x01, + 0xa8, 0x21 +}; +static const u8 key27[] __initconst = { + 0x78, 0x71, 0xa4, 0xe6, 0xb2, 0x95, 0x44, 0x12, + 0x81, 0xaa, 0x7e, 0x94, 0xa7, 0x8d, 0x44, 0xea, + 0xc4, 0xbc, 0x01, 0xb7, 0x9e, 0xf7, 0x82, 0x9e, + 0x3b, 0x23, 0x9f, 0x31, 0xdd, 0xb8, 0x0d, 0x18 +}; +enum { nonce27 = 0xd58fe0e58fb254d6ULL }; + +static const u8 input28[] __initconst = { + 0xaa, 0xb7, 0xaa, 0xd9, 0xa8, 0x91, 0xd7, 0x8a, + 0x97, 0x9b, 0xdb, 0x7c, 0x47, 0x2b, 0xdb, 0xd2, + 0xda, 0x77, 0xb1, 0xfa, 0x2d, 0x12, 0xe3, 0xe9, + 0xc4, 0x7f, 0x54 +}; +static const u8 output28[] __initconst = { + 0x87, 0x84, 0xa9, 0xa6, 0xad, 0x8f, 0xe6, 0x0f, + 0x69, 0xf8, 0x21, 0xc3, 0x54, 0x95, 0x0f, 0xb0, + 0x4e, 0xc7, 0x02, 0xe4, 0x04, 0xb0, 0x6c, 0x42, + 0x8c, 0x63, 0xe3 +}; +static const u8 key28[] __initconst = { + 0x12, 0x23, 0x37, 0x95, 0x04, 0xb4, 0x21, 0xe8, + 0xbc, 0x65, 0x46, 0x7a, 0xf4, 0x01, 0x05, 0x3f, + 0xb1, 0x34, 0x73, 0xd2, 0x49, 0xbf, 0x6f, 0x20, + 0xbd, 0x23, 0x58, 0x5f, 0xd1, 0x73, 0x57, 0xa6 +}; +enum { nonce28 = 0x3a04d51491eb4e07ULL }; + +static const u8 input29[] __initconst = { + 0x55, 0xd0, 0xd4, 0x4b, 0x17, 0xc8, 0xc4, 0x2b, + 0xc0, 0x28, 0xbd, 0x9d, 0x65, 0x4d, 0xaf, 0x77, + 0x72, 0x7c, 0x36, 0x68, 0xa7, 0xb6, 0x87, 0x4d, + 0xb9, 0x27, 0x25, 0x6c +}; +static const u8 output29[] __initconst = { + 0x0e, 0xac, 0x4c, 0xf5, 0x12, 0xb5, 0x56, 0xa5, + 0x00, 0x9a, 0xd6, 0xe5, 0x1a, 0x59, 0x2c, 0xf6, + 0x42, 0x22, 0xcf, 0x23, 0x98, 0x34, 0x29, 0xac, + 0x6e, 0xe3, 0x37, 0x6d +}; +static const u8 key29[] __initconst = { + 0xda, 0x9d, 0x05, 0x0c, 0x0c, 0xba, 0x75, 0xb9, + 0x9e, 0xb1, 0x8d, 0xd9, 0x73, 0x26, 0x2c, 0xa9, + 0x3a, 0xb5, 0xcb, 0x19, 0x49, 0xa7, 0x4f, 0xf7, + 0x64, 0x35, 0x23, 0x20, 0x2a, 0x45, 0x78, 0xc7 +}; +enum { nonce29 = 0xc25ac9982431cbfULL }; + +static const u8 input30[] __initconst = { + 0x4e, 0xd6, 0x85, 0xbb, 0xe7, 0x99, 0xfa, 0x04, + 0x33, 0x24, 0xfd, 0x75, 0x18, 0xe3, 0xd3, 0x25, + 0xcd, 0xca, 0xae, 0x00, 0xbe, 0x52, 0x56, 0x4a, + 0x31, 0xe9, 0x4f, 0xae, 0x8a +}; +static const u8 output30[] __initconst = { + 0x30, 0x36, 0x32, 0xa2, 0x3c, 0xb6, 0xf9, 0xf9, + 0x76, 0x70, 0xad, 0xa6, 0x10, 0x41, 0x00, 0x4a, + 0xfa, 0xce, 0x1b, 0x86, 0x05, 0xdb, 0x77, 0x96, + 0xb3, 0xb7, 0x8f, 0x61, 0x24 +}; +static const u8 key30[] __initconst = { + 0x49, 0x35, 0x4c, 0x15, 0x98, 0xfb, 0xc6, 0x57, + 0x62, 0x6d, 0x06, 0xc3, 0xd4, 0x79, 0x20, 0x96, + 0x05, 0x2a, 0x31, 0x63, 0xc0, 0x44, 0x42, 0x09, + 0x13, 0x13, 0xff, 0x1b, 0xc8, 0x63, 0x1f, 0x0b +}; +enum { nonce30 = 0x4967f9c08e41568bULL }; + +static const u8 input31[] __initconst = { + 0x91, 0x04, 0x20, 0x47, 0x59, 0xee, 0xa6, 0x0f, + 0x04, 0x75, 0xc8, 0x18, 0x95, 0x44, 0x01, 0x28, + 0x20, 0x6f, 0x73, 0x68, 0x66, 0xb5, 0x03, 0xb3, + 0x58, 0x27, 0x6e, 0x7a, 0x76, 0xb8 +}; +static const u8 output31[] __initconst = { + 0xe8, 0x03, 0x78, 0x9d, 0x13, 0x15, 0x98, 0xef, + 0x64, 0x68, 0x12, 0x41, 0xb0, 0x29, 0x94, 0x0c, + 0x83, 0x35, 0x46, 0xa9, 0x74, 0xe1, 0x75, 0xf0, + 0xb6, 0x96, 0xc3, 0x6f, 0xd7, 0x70 +}; +static const u8 key31[] __initconst = { + 0xef, 0xcd, 0x5a, 0x4a, 0xf4, 0x7e, 0x6a, 0x3a, + 0x11, 0x88, 0x72, 0x94, 0xb8, 0xae, 0x84, 0xc3, + 0x66, 0xe0, 0xde, 0x4b, 0x00, 0xa5, 0xd6, 0x2d, + 0x50, 0xb7, 0x28, 0xff, 0x76, 0x57, 0x18, 0x1f +}; +enum { nonce31 = 0xcb6f428fa4192e19ULL }; + +static const u8 input32[] __initconst = { + 
0x90, 0x06, 0x50, 0x4b, 0x98, 0x14, 0x30, 0xf1, + 0xb8, 0xd7, 0xf0, 0xa4, 0x3e, 0x4e, 0xd8, 0x00, + 0xea, 0xdb, 0x4f, 0x93, 0x05, 0xef, 0x02, 0x71, + 0x1a, 0xcd, 0xa3, 0xb1, 0xae, 0xd3, 0x18 +}; +static const u8 output32[] __initconst = { + 0xcb, 0x4a, 0x37, 0x3f, 0xea, 0x40, 0xab, 0x86, + 0xfe, 0xcc, 0x07, 0xd5, 0xdc, 0xb2, 0x25, 0xb6, + 0xfd, 0x2a, 0x72, 0xbc, 0x5e, 0xd4, 0x75, 0xff, + 0x71, 0xfc, 0xce, 0x1e, 0x6f, 0x22, 0xc1 +}; +static const u8 key32[] __initconst = { + 0xfc, 0x6d, 0xc3, 0x80, 0xce, 0xa4, 0x31, 0xa1, + 0xcc, 0xfa, 0x9d, 0x10, 0x0b, 0xc9, 0x11, 0x77, + 0x34, 0xdb, 0xad, 0x1b, 0xc4, 0xfc, 0xeb, 0x79, + 0x91, 0xda, 0x59, 0x3b, 0x0d, 0xb1, 0x19, 0x3b +}; +enum { nonce32 = 0x88551bf050059467ULL }; + +static const u8 input33[] __initconst = { + 0x88, 0x94, 0x71, 0x92, 0xe8, 0xd7, 0xf9, 0xbd, + 0x55, 0xe3, 0x22, 0xdb, 0x99, 0x51, 0xfb, 0x50, + 0xbf, 0x82, 0xb5, 0x70, 0x8b, 0x2b, 0x6a, 0x03, + 0x37, 0xa0, 0xc6, 0x19, 0x5d, 0xc9, 0xbc, 0xcc +}; +static const u8 output33[] __initconst = { + 0xb6, 0x17, 0x51, 0xc8, 0xea, 0x8a, 0x14, 0xdc, + 0x23, 0x1b, 0xd4, 0xed, 0xbf, 0x50, 0xb9, 0x38, + 0x00, 0xc2, 0x3f, 0x78, 0x3d, 0xbf, 0xa0, 0x84, + 0xef, 0x45, 0xb2, 0x7d, 0x48, 0x7b, 0x62, 0xa7 +}; +static const u8 key33[] __initconst = { + 0xb9, 0x8f, 0x6a, 0xad, 0xb4, 0x6f, 0xb5, 0xdc, + 0x48, 0xfa, 0x43, 0x57, 0x62, 0x97, 0xef, 0x89, + 0x4c, 0x5a, 0x7b, 0x67, 0xb8, 0x9d, 0xf0, 0x42, + 0x2b, 0x8f, 0xf3, 0x18, 0x05, 0x2e, 0x48, 0xd0 +}; +enum { nonce33 = 0x31f16488fe8447f5ULL }; + +static const u8 input34[] __initconst = { + 0xda, 0x2b, 0x3d, 0x63, 0x9e, 0x4f, 0xc2, 0xb8, + 0x7f, 0xc2, 0x1a, 0x8b, 0x0d, 0x95, 0x65, 0x55, + 0x52, 0xba, 0x51, 0x51, 0xc0, 0x61, 0x9f, 0x0a, + 0x5d, 0xb0, 0x59, 0x8c, 0x64, 0x6a, 0xab, 0xf5, + 0x57 +}; +static const u8 output34[] __initconst = { + 0x5c, 0xf6, 0x62, 0x24, 0x8c, 0x45, 0xa3, 0x26, + 0xd0, 0xe4, 0x88, 0x1c, 0xed, 0xc4, 0x26, 0x58, + 0xb5, 0x5d, 0x92, 0xc4, 0x17, 0x44, 0x1c, 0xb8, + 0x2c, 0xf3, 0x55, 0x7e, 0xd6, 0xe5, 0xb3, 0x65, + 0xa8 +}; +static const u8 key34[] __initconst = { + 0xde, 0xd1, 0x27, 0xb7, 0x7c, 0xfa, 0xa6, 0x78, + 0x39, 0x80, 0xdf, 0xb7, 0x46, 0xac, 0x71, 0x26, + 0xd0, 0x2a, 0x56, 0x79, 0x12, 0xeb, 0x26, 0x37, + 0x01, 0x0d, 0x30, 0xe0, 0xe3, 0x66, 0xb2, 0xf4 +}; +enum { nonce34 = 0x92d0d9b252c24149ULL }; + +static const u8 input35[] __initconst = { + 0x3a, 0x15, 0x5b, 0x75, 0x6e, 0xd0, 0x52, 0x20, + 0x6c, 0x82, 0xfa, 0xce, 0x5b, 0xea, 0xf5, 0x43, + 0xc1, 0x81, 0x7c, 0xb2, 0xac, 0x16, 0x3f, 0xd3, + 0x5a, 0xaf, 0x55, 0x98, 0xf4, 0xc6, 0xba, 0x71, + 0x25, 0x8b +}; +static const u8 output35[] __initconst = { + 0xb3, 0xaf, 0xac, 0x6d, 0x4d, 0xc7, 0x68, 0x56, + 0x50, 0x5b, 0x69, 0x2a, 0xe5, 0x90, 0xf9, 0x5f, + 0x99, 0x88, 0xff, 0x0c, 0xa6, 0xb1, 0x83, 0xd6, + 0x80, 0xa6, 0x1b, 0xde, 0x94, 0xa4, 0x2c, 0xc3, + 0x74, 0xfa +}; +static const u8 key35[] __initconst = { + 0xd8, 0x24, 0xe2, 0x06, 0xd7, 0x7a, 0xce, 0x81, + 0x52, 0x72, 0x02, 0x69, 0x89, 0xc4, 0xe9, 0x53, + 0x3b, 0x08, 0x5f, 0x98, 0x1e, 0x1b, 0x99, 0x6e, + 0x28, 0x17, 0x6d, 0xba, 0xc0, 0x96, 0xf9, 0x3c +}; +enum { nonce35 = 0x7baf968c4c8e3a37ULL }; + +static const u8 input36[] __initconst = { + 0x31, 0x5d, 0x4f, 0xe3, 0xac, 0xad, 0x17, 0xa6, + 0xb5, 0x01, 0xe2, 0xc6, 0xd4, 0x7e, 0xc4, 0x80, + 0xc0, 0x59, 0x72, 0xbb, 0x4b, 0x74, 0x6a, 0x41, + 0x0f, 0x9c, 0xf6, 0xca, 0x20, 0xb3, 0x73, 0x07, + 0x6b, 0x02, 0x2a +}; +static const u8 output36[] __initconst = { + 0xf9, 0x09, 0x92, 0x94, 0x7e, 0x31, 0xf7, 0x53, + 0xe8, 0x8a, 0x5b, 0x20, 0xef, 0x9b, 0x45, 0x81, + 0xba, 0x5e, 0x45, 0x63, 
0xc1, 0xc7, 0x9e, 0x06, + 0x0e, 0xd9, 0x62, 0x8e, 0x96, 0xf9, 0xfa, 0x43, + 0x4d, 0xd4, 0x28 +}; +static const u8 key36[] __initconst = { + 0x13, 0x30, 0x4c, 0x06, 0xae, 0x18, 0xde, 0x03, + 0x1d, 0x02, 0x40, 0xf5, 0xbb, 0x19, 0xe3, 0x88, + 0x41, 0xb1, 0x29, 0x15, 0x97, 0xc2, 0x69, 0x3f, + 0x32, 0x2a, 0x0c, 0x8b, 0xcf, 0x83, 0x8b, 0x6c +}; +enum { nonce36 = 0x226d251d475075a0ULL }; + +static const u8 input37[] __initconst = { + 0x10, 0x18, 0xbe, 0xfd, 0x66, 0xc9, 0x77, 0xcc, + 0x43, 0xe5, 0x46, 0x0b, 0x08, 0x8b, 0xae, 0x11, + 0x86, 0x15, 0xc2, 0xf6, 0x45, 0xd4, 0x5f, 0xd6, + 0xb6, 0x5f, 0x9f, 0x3e, 0x97, 0xb7, 0xd4, 0xad, + 0x0b, 0xe8, 0x31, 0x94 +}; +static const u8 output37[] __initconst = { + 0x03, 0x2c, 0x1c, 0xee, 0xc6, 0xdd, 0xed, 0x38, + 0x80, 0x6d, 0x84, 0x16, 0xc3, 0xc2, 0x04, 0x63, + 0xcd, 0xa7, 0x6e, 0x36, 0x8b, 0xed, 0x78, 0x63, + 0x95, 0xfc, 0x69, 0x7a, 0x3f, 0x8d, 0x75, 0x6b, + 0x6c, 0x26, 0x56, 0x4d +}; +static const u8 key37[] __initconst = { + 0xac, 0x84, 0x4d, 0xa9, 0x29, 0x49, 0x3c, 0x39, + 0x7f, 0xd9, 0xa6, 0x01, 0xf3, 0x7e, 0xfa, 0x4a, + 0x14, 0x80, 0x22, 0x74, 0xf0, 0x29, 0x30, 0x2d, + 0x07, 0x21, 0xda, 0xc0, 0x4d, 0x70, 0x56, 0xa2 +}; +enum { nonce37 = 0x167823ce3b64925aULL }; + +static const u8 input38[] __initconst = { + 0x30, 0x8f, 0xfa, 0x24, 0x29, 0xb1, 0xfb, 0xce, + 0x31, 0x62, 0xdc, 0xd0, 0x46, 0xab, 0xe1, 0x31, + 0xd9, 0xae, 0x60, 0x0d, 0xca, 0x0a, 0x49, 0x12, + 0x3d, 0x92, 0xe9, 0x91, 0x67, 0x12, 0x62, 0x18, + 0x89, 0xe2, 0xf9, 0x1c, 0xcc +}; +static const u8 output38[] __initconst = { + 0x56, 0x9c, 0xc8, 0x7a, 0xc5, 0x98, 0xa3, 0x0f, + 0xba, 0xd5, 0x3e, 0xe1, 0xc9, 0x33, 0x64, 0x33, + 0xf0, 0xd5, 0xf7, 0x43, 0x66, 0x0e, 0x08, 0x9a, + 0x6e, 0x09, 0xe4, 0x01, 0x0d, 0x1e, 0x2f, 0x4b, + 0xed, 0x9c, 0x08, 0x8c, 0x03 +}; +static const u8 key38[] __initconst = { + 0x77, 0x52, 0x2a, 0x23, 0xf1, 0xc5, 0x96, 0x2b, + 0x89, 0x4f, 0x3e, 0xf3, 0xff, 0x0e, 0x94, 0xce, + 0xf1, 0xbd, 0x53, 0xf5, 0x77, 0xd6, 0x9e, 0x47, + 0x49, 0x3d, 0x16, 0x64, 0xff, 0x95, 0x42, 0x42 +}; +enum { nonce38 = 0xff629d7b82cef357ULL }; + +static const u8 input39[] __initconst = { + 0x38, 0x26, 0x27, 0xd0, 0xc2, 0xf5, 0x34, 0xba, + 0xda, 0x0f, 0x1c, 0x1c, 0x9a, 0x70, 0xe5, 0x8a, + 0x78, 0x2d, 0x8f, 0x9a, 0xbf, 0x89, 0x6a, 0xfd, + 0xd4, 0x9c, 0x33, 0xf1, 0xb6, 0x89, 0x16, 0xe3, + 0x6a, 0x00, 0xfa, 0x3a, 0x0f, 0x26 +}; +static const u8 output39[] __initconst = { + 0x0f, 0xaf, 0x91, 0x6d, 0x9c, 0x99, 0xa4, 0xf7, + 0x3b, 0x9d, 0x9a, 0x98, 0xca, 0xbb, 0x50, 0x48, + 0xee, 0xcb, 0x5d, 0xa1, 0x37, 0x2d, 0x36, 0x09, + 0x2a, 0xe2, 0x1c, 0x3d, 0x98, 0x40, 0x1c, 0x16, + 0x56, 0xa7, 0x98, 0xe9, 0x7d, 0x2b +}; +static const u8 key39[] __initconst = { + 0x6e, 0x83, 0x15, 0x4d, 0xf8, 0x78, 0xa8, 0x0e, + 0x71, 0x37, 0xd4, 0x6e, 0x28, 0x5c, 0x06, 0xa1, + 0x2d, 0x6c, 0x72, 0x7a, 0xfd, 0xf8, 0x65, 0x1a, + 0xb8, 0xe6, 0x29, 0x7b, 0xe5, 0xb3, 0x23, 0x79 +}; +enum { nonce39 = 0xa4d8c491cf093e9dULL }; + +static const u8 input40[] __initconst = { + 0x8f, 0x32, 0x7c, 0x40, 0x37, 0x95, 0x08, 0x00, + 0x00, 0xfe, 0x2f, 0x95, 0x20, 0x12, 0x40, 0x18, + 0x5e, 0x7e, 0x5e, 0x99, 0xee, 0x8d, 0x91, 0x7d, + 0x50, 0x7d, 0x21, 0x45, 0x27, 0xe1, 0x7f, 0xd4, + 0x73, 0x10, 0xe1, 0x33, 0xbc, 0xf8, 0xdd +}; +static const u8 output40[] __initconst = { + 0x78, 0x7c, 0xdc, 0x55, 0x2b, 0xd9, 0x2b, 0x3a, + 0xdd, 0x56, 0x11, 0x52, 0xd3, 0x2e, 0xe0, 0x0d, + 0x23, 0x20, 0x8a, 0xf1, 0x4f, 0xee, 0xf1, 0x68, + 0xf6, 0xdc, 0x53, 0xcf, 0x17, 0xd4, 0xf0, 0x6c, + 0xdc, 0x80, 0x5f, 0x1c, 0xa4, 0x91, 0x05 +}; +static const u8 key40[] __initconst = { + 0x0d, 
0x86, 0xbf, 0x8a, 0xba, 0x9e, 0x39, 0x91, + 0xa8, 0xe7, 0x22, 0xf0, 0x0c, 0x43, 0x18, 0xe4, + 0x1f, 0xb0, 0xaf, 0x8a, 0x34, 0x31, 0xf4, 0x41, + 0xf0, 0x89, 0x85, 0xca, 0x5d, 0x05, 0x3b, 0x94 +}; +enum { nonce40 = 0xae7acc4f5986439eULL }; + +static const u8 input41[] __initconst = { + 0x20, 0x5f, 0xc1, 0x83, 0x36, 0x02, 0x76, 0x96, + 0xf0, 0xbf, 0x8e, 0x0e, 0x1a, 0xd1, 0xc7, 0x88, + 0x18, 0xc7, 0x09, 0xc4, 0x15, 0xd9, 0x4f, 0x5e, + 0x1f, 0xb3, 0xb4, 0x6d, 0xcb, 0xa0, 0xd6, 0x8a, + 0x3b, 0x40, 0x8e, 0x80, 0xf1, 0xe8, 0x8f, 0x5f +}; +static const u8 output41[] __initconst = { + 0x0b, 0xd1, 0x49, 0x9a, 0x9d, 0xe8, 0x97, 0xb8, + 0xd1, 0xeb, 0x90, 0x62, 0x37, 0xd2, 0x99, 0x15, + 0x67, 0x6d, 0x27, 0x93, 0xce, 0x37, 0x65, 0xa2, + 0x94, 0x88, 0xd6, 0x17, 0xbc, 0x1c, 0x6e, 0xa2, + 0xcc, 0xfb, 0x81, 0x0e, 0x30, 0x60, 0x5a, 0x6f +}; +static const u8 key41[] __initconst = { + 0x36, 0x27, 0x57, 0x01, 0x21, 0x68, 0x97, 0xc7, + 0x00, 0x67, 0x7b, 0xe9, 0x0f, 0x55, 0x49, 0xbb, + 0x92, 0x18, 0x98, 0xf5, 0x5e, 0xbc, 0xe7, 0x5a, + 0x9d, 0x3d, 0xc7, 0xbd, 0x59, 0xec, 0x82, 0x8e +}; +enum { nonce41 = 0x5da05e4c8dfab464ULL }; + +static const u8 input42[] __initconst = { + 0xca, 0x30, 0xcd, 0x63, 0xf0, 0x2d, 0xf1, 0x03, + 0x4d, 0x0d, 0xf2, 0xf7, 0x6f, 0xae, 0xd6, 0x34, + 0xea, 0xf6, 0x13, 0xcf, 0x1c, 0xa0, 0xd0, 0xe8, + 0xa4, 0x78, 0x80, 0x3b, 0x1e, 0xa5, 0x32, 0x4c, + 0x73, 0x12, 0xd4, 0x6a, 0x94, 0xbc, 0xba, 0x80, + 0x5e +}; +static const u8 output42[] __initconst = { + 0xec, 0x3f, 0x18, 0x31, 0xc0, 0x7b, 0xb5, 0xe2, + 0xad, 0xf3, 0xec, 0xa0, 0x16, 0x9d, 0xef, 0xce, + 0x05, 0x65, 0x59, 0x9d, 0x5a, 0xca, 0x3e, 0x13, + 0xb9, 0x5d, 0x5d, 0xb5, 0xeb, 0xae, 0xc0, 0x87, + 0xbb, 0xfd, 0xe7, 0xe4, 0x89, 0x5b, 0xd2, 0x6c, + 0x56 +}; +static const u8 key42[] __initconst = { + 0x7c, 0x6b, 0x7e, 0x77, 0xcc, 0x8c, 0x1b, 0x03, + 0x8b, 0x2a, 0xb3, 0x7c, 0x5a, 0x73, 0xcc, 0xac, + 0xdd, 0x53, 0x54, 0x0c, 0x85, 0xed, 0xcd, 0x47, + 0x24, 0xc1, 0xb8, 0x9b, 0x2e, 0x41, 0x92, 0x36 +}; +enum { nonce42 = 0xe4d7348b09682c9cULL }; + +static const u8 input43[] __initconst = { + 0x52, 0xf2, 0x4b, 0x7c, 0xe5, 0x58, 0xe8, 0xd2, + 0xb7, 0xf3, 0xa1, 0x29, 0x68, 0xa2, 0x50, 0x50, + 0xae, 0x9c, 0x1b, 0xe2, 0x67, 0x77, 0xe2, 0xdb, + 0x85, 0x55, 0x7e, 0x84, 0x8a, 0x12, 0x3c, 0xb6, + 0x2e, 0xed, 0xd3, 0xec, 0x47, 0x68, 0xfa, 0x52, + 0x46, 0x9d +}; +static const u8 output43[] __initconst = { + 0x1b, 0xf0, 0x05, 0xe4, 0x1c, 0xd8, 0x74, 0x9a, + 0xf0, 0xee, 0x00, 0x54, 0xce, 0x02, 0x83, 0x15, + 0xfb, 0x23, 0x35, 0x78, 0xc3, 0xda, 0x98, 0xd8, + 0x9d, 0x1b, 0xb2, 0x51, 0x82, 0xb0, 0xff, 0xbe, + 0x05, 0xa9, 0xa4, 0x04, 0xba, 0xea, 0x4b, 0x73, + 0x47, 0x6e +}; +static const u8 key43[] __initconst = { + 0xeb, 0xec, 0x0e, 0xa1, 0x65, 0xe2, 0x99, 0x46, + 0xd8, 0x54, 0x8c, 0x4a, 0x93, 0xdf, 0x6d, 0xbf, + 0x93, 0x34, 0x94, 0x57, 0xc9, 0x12, 0x9d, 0x68, + 0x05, 0xc5, 0x05, 0xad, 0x5a, 0xc9, 0x2a, 0x3b +}; +enum { nonce43 = 0xe14f6a902b7827fULL }; + +static const u8 input44[] __initconst = { + 0x3e, 0x22, 0x3e, 0x8e, 0xcd, 0x18, 0xe2, 0xa3, + 0x8d, 0x8b, 0x38, 0xc3, 0x02, 0xa3, 0x31, 0x48, + 0xc6, 0x0e, 0xec, 0x99, 0x51, 0x11, 0x6d, 0x8b, + 0x32, 0x35, 0x3b, 0x08, 0x58, 0x76, 0x25, 0x30, + 0xe2, 0xfc, 0xa2, 0x46, 0x7d, 0x6e, 0x34, 0x87, + 0xac, 0x42, 0xbf +}; +static const u8 output44[] __initconst = { + 0x08, 0x92, 0x58, 0x02, 0x1a, 0xf4, 0x1f, 0x3d, + 0x38, 0x7b, 0x6b, 0xf6, 0x84, 0x07, 0xa3, 0x19, + 0x17, 0x2a, 0xed, 0x57, 0x1c, 0xf9, 0x55, 0x37, + 0x4e, 0xf4, 0x68, 0x68, 0x82, 0x02, 0x4f, 0xca, + 0x21, 0x00, 0xc6, 0x66, 0x79, 0x53, 0x19, 0xef, + 0x7f, 0xdd, 
0x74 +}; +static const u8 key44[] __initconst = { + 0x73, 0xb6, 0x3e, 0xf4, 0x57, 0x52, 0xa6, 0x43, + 0x51, 0xd8, 0x25, 0x00, 0xdb, 0xb4, 0x52, 0x69, + 0xd6, 0x27, 0x49, 0xeb, 0x9b, 0xf1, 0x7b, 0xa0, + 0xd6, 0x7c, 0x9c, 0xd8, 0x95, 0x03, 0x69, 0x26 +}; +enum { nonce44 = 0xf5e6dc4f35ce24e5ULL }; + +static const u8 input45[] __initconst = { + 0x55, 0x76, 0xc0, 0xf1, 0x74, 0x03, 0x7a, 0x6d, + 0x14, 0xd8, 0x36, 0x2c, 0x9f, 0x9a, 0x59, 0x7a, + 0x2a, 0xf5, 0x77, 0x84, 0x70, 0x7c, 0x1d, 0x04, + 0x90, 0x45, 0xa4, 0xc1, 0x5e, 0xdd, 0x2e, 0x07, + 0x18, 0x34, 0xa6, 0x85, 0x56, 0x4f, 0x09, 0xaf, + 0x2f, 0x83, 0xe1, 0xc6 +}; +static const u8 output45[] __initconst = { + 0x22, 0x46, 0xe4, 0x0b, 0x3a, 0x55, 0xcc, 0x9b, + 0xf0, 0xc0, 0x53, 0xcd, 0x95, 0xc7, 0x57, 0x6c, + 0x77, 0x46, 0x41, 0x72, 0x07, 0xbf, 0xa8, 0xe5, + 0x68, 0x69, 0xd8, 0x1e, 0x45, 0xc1, 0xa2, 0x50, + 0xa5, 0xd1, 0x62, 0xc9, 0x5a, 0x7d, 0x08, 0x14, + 0xae, 0x44, 0x16, 0xb9 +}; +static const u8 key45[] __initconst = { + 0x41, 0xf3, 0x88, 0xb2, 0x51, 0x25, 0x47, 0x02, + 0x39, 0xe8, 0x15, 0x3a, 0x22, 0x78, 0x86, 0x0b, + 0xf9, 0x1e, 0x8d, 0x98, 0xb2, 0x22, 0x82, 0xac, + 0x42, 0x94, 0xde, 0x64, 0xf0, 0xfd, 0xb3, 0x6c +}; +enum { nonce45 = 0xf51a582daf4aa01aULL }; + +static const u8 input46[] __initconst = { + 0xf6, 0xff, 0x20, 0xf9, 0x26, 0x7e, 0x0f, 0xa8, + 0x6a, 0x45, 0x5a, 0x91, 0x73, 0xc4, 0x4c, 0x63, + 0xe5, 0x61, 0x59, 0xca, 0xec, 0xc0, 0x20, 0x35, + 0xbc, 0x9f, 0x58, 0x9c, 0x5e, 0xa1, 0x17, 0x46, + 0xcc, 0xab, 0x6e, 0xd0, 0x4f, 0x24, 0xeb, 0x05, + 0x4d, 0x40, 0x41, 0xe0, 0x9d +}; +static const u8 output46[] __initconst = { + 0x31, 0x6e, 0x63, 0x3f, 0x9c, 0xe6, 0xb1, 0xb7, + 0xef, 0x47, 0x46, 0xd7, 0xb1, 0x53, 0x42, 0x2f, + 0x2c, 0xc8, 0x01, 0xae, 0x8b, 0xec, 0x42, 0x2c, + 0x6b, 0x2c, 0x9c, 0xb2, 0xf0, 0x29, 0x06, 0xa5, + 0xcd, 0x7e, 0xc7, 0x3a, 0x38, 0x98, 0x8a, 0xde, + 0x03, 0x29, 0x14, 0x8f, 0xf9 +}; +static const u8 key46[] __initconst = { + 0xac, 0xa6, 0x44, 0x4a, 0x0d, 0x42, 0x10, 0xbc, + 0xd3, 0xc9, 0x8e, 0x9e, 0x71, 0xa3, 0x1c, 0x14, + 0x9d, 0x65, 0x0d, 0x49, 0x4d, 0x8c, 0xec, 0x46, + 0xe1, 0x41, 0xcd, 0xf5, 0xfc, 0x82, 0x75, 0x34 +}; +enum { nonce46 = 0x25f85182df84dec5ULL }; + +static const u8 input47[] __initconst = { + 0xa1, 0xd2, 0xf2, 0x52, 0x2f, 0x79, 0x50, 0xb2, + 0x42, 0x29, 0x5b, 0x44, 0x20, 0xf9, 0xbd, 0x85, + 0xb7, 0x65, 0x77, 0x86, 0xce, 0x3e, 0x1c, 0xe4, + 0x70, 0x80, 0xdd, 0x72, 0x07, 0x48, 0x0f, 0x84, + 0x0d, 0xfd, 0x97, 0xc0, 0xb7, 0x48, 0x9b, 0xb4, + 0xec, 0xff, 0x73, 0x14, 0x99, 0xe4 +}; +static const u8 output47[] __initconst = { + 0xe5, 0x3c, 0x78, 0x66, 0x31, 0x1e, 0xd6, 0xc4, + 0x9e, 0x71, 0xb3, 0xd7, 0xd5, 0xad, 0x84, 0xf2, + 0x78, 0x61, 0x77, 0xf8, 0x31, 0xf0, 0x13, 0xad, + 0x66, 0xf5, 0x31, 0x7d, 0xeb, 0xdf, 0xaf, 0xcb, + 0xac, 0x28, 0x6c, 0xc2, 0x9e, 0xe7, 0x78, 0xa2, + 0xa2, 0x58, 0xce, 0x84, 0x76, 0x70 +}; +static const u8 key47[] __initconst = { + 0x05, 0x7f, 0xc0, 0x7f, 0x37, 0x20, 0x71, 0x02, + 0x3a, 0xe7, 0x20, 0x5a, 0x0a, 0x8f, 0x79, 0x5a, + 0xfe, 0xbb, 0x43, 0x4d, 0x2f, 0xcb, 0xf6, 0x9e, + 0xa2, 0x97, 0x00, 0xad, 0x0d, 0x51, 0x7e, 0x17 +}; +enum { nonce47 = 0xae707c60f54de32bULL }; + +static const u8 input48[] __initconst = { + 0x80, 0x93, 0x77, 0x2e, 0x8d, 0xe8, 0xe6, 0xc1, + 0x27, 0xe6, 0xf2, 0x89, 0x5b, 0x33, 0x62, 0x18, + 0x80, 0x6e, 0x17, 0x22, 0x8e, 0x83, 0x31, 0x40, + 0x8f, 0xc9, 0x5c, 0x52, 0x6c, 0x0e, 0xa5, 0xe9, + 0x6c, 0x7f, 0xd4, 0x6a, 0x27, 0x56, 0x99, 0xce, + 0x8d, 0x37, 0x59, 0xaf, 0xc0, 0x0e, 0xe1 +}; +static const u8 output48[] __initconst = { + 0x02, 0xa4, 0x2e, 0x33, 0xb7, 
0x7c, 0x2b, 0x9a, + 0x18, 0x5a, 0xba, 0x53, 0x38, 0xaf, 0x00, 0xeb, + 0xd8, 0x3d, 0x02, 0x77, 0x43, 0x45, 0x03, 0x91, + 0xe2, 0x5e, 0x4e, 0xeb, 0x50, 0xd5, 0x5b, 0xe0, + 0xf3, 0x33, 0xa7, 0xa2, 0xac, 0x07, 0x6f, 0xeb, + 0x3f, 0x6c, 0xcd, 0xf2, 0x6c, 0x61, 0x64 +}; +static const u8 key48[] __initconst = { + 0xf3, 0x79, 0xe7, 0xf8, 0x0e, 0x02, 0x05, 0x6b, + 0x83, 0x1a, 0xe7, 0x86, 0x6b, 0xe6, 0x8f, 0x3f, + 0xd3, 0xa3, 0xe4, 0x6e, 0x29, 0x06, 0xad, 0xbc, + 0xe8, 0x33, 0x56, 0x39, 0xdf, 0xb0, 0xe2, 0xfe +}; +enum { nonce48 = 0xd849b938c6569da0ULL }; + +static const u8 input49[] __initconst = { + 0x89, 0x3b, 0x88, 0x9e, 0x7b, 0x38, 0x16, 0x9f, + 0xa1, 0x28, 0xf6, 0xf5, 0x23, 0x74, 0x28, 0xb0, + 0xdf, 0x6c, 0x9e, 0x8a, 0x71, 0xaf, 0xed, 0x7a, + 0x39, 0x21, 0x57, 0x7d, 0x31, 0x6c, 0xee, 0x0d, + 0x11, 0x8d, 0x41, 0x9a, 0x5f, 0xb7, 0x27, 0x40, + 0x08, 0xad, 0xc6, 0xe0, 0x00, 0x43, 0x9e, 0xae +}; +static const u8 output49[] __initconst = { + 0x4d, 0xfd, 0xdb, 0x4c, 0x77, 0xc1, 0x05, 0x07, + 0x4d, 0x6d, 0x32, 0xcb, 0x2e, 0x0e, 0xff, 0x65, + 0xc9, 0x27, 0xeb, 0xa9, 0x46, 0x5b, 0xab, 0x06, + 0xe6, 0xb6, 0x5a, 0x1e, 0x00, 0xfb, 0xcf, 0xe4, + 0xb9, 0x71, 0x40, 0x10, 0xef, 0x12, 0x39, 0xf0, + 0xea, 0x40, 0xb8, 0x9a, 0xa2, 0x85, 0x38, 0x48 +}; +static const u8 key49[] __initconst = { + 0xe7, 0x10, 0x40, 0xd9, 0x66, 0xc0, 0xa8, 0x6d, + 0xa3, 0xcc, 0x8b, 0xdd, 0x93, 0xf2, 0x6e, 0xe0, + 0x90, 0x7f, 0xd0, 0xf4, 0x37, 0x0c, 0x8b, 0x9b, + 0x4c, 0x4d, 0xe6, 0xf2, 0x1f, 0xe9, 0x95, 0x24 +}; +enum { nonce49 = 0xf269817bdae01bc0ULL }; + +static const u8 input50[] __initconst = { + 0xda, 0x5b, 0x60, 0xcd, 0xed, 0x58, 0x8e, 0x7f, + 0xae, 0xdd, 0xc8, 0x2e, 0x16, 0x90, 0xea, 0x4b, + 0x0c, 0x74, 0x14, 0x35, 0xeb, 0xee, 0x2c, 0xff, + 0x46, 0x99, 0x97, 0x6e, 0xae, 0xa7, 0x8e, 0x6e, + 0x38, 0xfe, 0x63, 0xe7, 0x51, 0xd9, 0xaa, 0xce, + 0x7b, 0x1e, 0x7e, 0x5d, 0xc0, 0xe8, 0x10, 0x06, + 0x14 +}; +static const u8 output50[] __initconst = { + 0xe4, 0xe5, 0x86, 0x1b, 0x66, 0x19, 0xac, 0x49, + 0x1c, 0xbd, 0xee, 0x03, 0xaf, 0x11, 0xfc, 0x1f, + 0x6a, 0xd2, 0x50, 0x5c, 0xea, 0x2c, 0xa5, 0x75, + 0xfd, 0xb7, 0x0e, 0x80, 0x8f, 0xed, 0x3f, 0x31, + 0x47, 0xac, 0x67, 0x43, 0xb8, 0x2e, 0xb4, 0x81, + 0x6d, 0xe4, 0x1e, 0xb7, 0x8b, 0x0c, 0x53, 0xa9, + 0x26 +}; +static const u8 key50[] __initconst = { + 0xd7, 0xb2, 0x04, 0x76, 0x30, 0xcc, 0x38, 0x45, + 0xef, 0xdb, 0xc5, 0x86, 0x08, 0x61, 0xf0, 0xee, + 0x6d, 0xd8, 0x22, 0x04, 0x8c, 0xfb, 0xcb, 0x37, + 0xa6, 0xfb, 0x95, 0x22, 0xe1, 0x87, 0xb7, 0x6f +}; +enum { nonce50 = 0x3b44d09c45607d38ULL }; + +static const u8 input51[] __initconst = { + 0xa9, 0x41, 0x02, 0x4b, 0xd7, 0xd5, 0xd1, 0xf1, + 0x21, 0x55, 0xb2, 0x75, 0x6d, 0x77, 0x1b, 0x86, + 0xa9, 0xc8, 0x90, 0xfd, 0xed, 0x4a, 0x7b, 0x6c, + 0xb2, 0x5f, 0x9b, 0x5f, 0x16, 0xa1, 0x54, 0xdb, + 0xd6, 0x3f, 0x6a, 0x7f, 0x2e, 0x51, 0x9d, 0x49, + 0x5b, 0xa5, 0x0e, 0xf9, 0xfb, 0x2a, 0x38, 0xff, + 0x20, 0x8c +}; +static const u8 output51[] __initconst = { + 0x18, 0xf7, 0x88, 0xc1, 0x72, 0xfd, 0x90, 0x4b, + 0xa9, 0x2d, 0xdb, 0x47, 0xb0, 0xa5, 0xc4, 0x37, + 0x01, 0x95, 0xc4, 0xb1, 0xab, 0xc5, 0x5b, 0xcd, + 0xe1, 0x97, 0x78, 0x13, 0xde, 0x6a, 0xff, 0x36, + 0xce, 0xa4, 0x67, 0xc5, 0x4a, 0x45, 0x2b, 0xd9, + 0xff, 0x8f, 0x06, 0x7c, 0x63, 0xbb, 0x83, 0x17, + 0xb4, 0x6b +}; +static const u8 key51[] __initconst = { + 0x82, 0x1a, 0x79, 0xab, 0x9a, 0xb5, 0x49, 0x6a, + 0x30, 0x6b, 0x99, 0x19, 0x11, 0xc7, 0xa2, 0xf4, + 0xca, 0x55, 0xb9, 0xdd, 0xe7, 0x2f, 0xe7, 0xc1, + 0xdd, 0x27, 0xad, 0x80, 0xf2, 0x56, 0xad, 0xf3 +}; +enum { nonce51 = 0xe93aff94ca71a4a6ULL }; + +static 
const u8 input52[] __initconst = { + 0x89, 0xdd, 0xf3, 0xfa, 0xb6, 0xc1, 0xaa, 0x9a, + 0xc8, 0xad, 0x6b, 0x00, 0xa1, 0x65, 0xea, 0x14, + 0x55, 0x54, 0x31, 0x8f, 0xf0, 0x03, 0x84, 0x51, + 0x17, 0x1e, 0x0a, 0x93, 0x6e, 0x79, 0x96, 0xa3, + 0x2a, 0x85, 0x9c, 0x89, 0xf8, 0xd1, 0xe2, 0x15, + 0x95, 0x05, 0xf4, 0x43, 0x4d, 0x6b, 0xf0, 0x71, + 0x3b, 0x3e, 0xba +}; +static const u8 output52[] __initconst = { + 0x0c, 0x42, 0x6a, 0xb3, 0x66, 0x63, 0x5d, 0x2c, + 0x9f, 0x3d, 0xa6, 0x6e, 0xc7, 0x5f, 0x79, 0x2f, + 0x50, 0xe3, 0xd6, 0x07, 0x56, 0xa4, 0x2b, 0x2d, + 0x8d, 0x10, 0xc0, 0x6c, 0xa2, 0xfc, 0x97, 0xec, + 0x3f, 0x5c, 0x8d, 0x59, 0xbe, 0x84, 0xf1, 0x3e, + 0x38, 0x47, 0x4f, 0x75, 0x25, 0x66, 0x88, 0x14, + 0x03, 0xdd, 0xde +}; +static const u8 key52[] __initconst = { + 0x4f, 0xb0, 0x27, 0xb6, 0xdd, 0x24, 0x0c, 0xdb, + 0x6b, 0x71, 0x2e, 0xac, 0xfc, 0x3f, 0xa6, 0x48, + 0x5d, 0xd5, 0xff, 0x53, 0xb5, 0x62, 0xf1, 0xe0, + 0x93, 0xfe, 0x39, 0x4c, 0x9f, 0x03, 0x11, 0xa7 +}; +enum { nonce52 = 0xed8becec3bdf6f25ULL }; + +static const u8 input53[] __initconst = { + 0x68, 0xd1, 0xc7, 0x74, 0x44, 0x1c, 0x84, 0xde, + 0x27, 0x27, 0x35, 0xf0, 0x18, 0x0b, 0x57, 0xaa, + 0xd0, 0x1a, 0xd3, 0x3b, 0x5e, 0x5c, 0x62, 0x93, + 0xd7, 0x6b, 0x84, 0x3b, 0x71, 0x83, 0x77, 0x01, + 0x3e, 0x59, 0x45, 0xf4, 0x77, 0x6c, 0x6b, 0xcb, + 0x88, 0x45, 0x09, 0x1d, 0xc6, 0x45, 0x6e, 0xdc, + 0x6e, 0x51, 0xb8, 0x28 +}; +static const u8 output53[] __initconst = { + 0xc5, 0x90, 0x96, 0x78, 0x02, 0xf5, 0xc4, 0x3c, + 0xde, 0xd4, 0xd4, 0xc6, 0xa7, 0xad, 0x12, 0x47, + 0x45, 0xce, 0xcd, 0x8c, 0x35, 0xcc, 0xa6, 0x9e, + 0x5a, 0xc6, 0x60, 0xbb, 0xe3, 0xed, 0xec, 0x68, + 0x3f, 0x64, 0xf7, 0x06, 0x63, 0x9c, 0x8c, 0xc8, + 0x05, 0x3a, 0xad, 0x32, 0x79, 0x8b, 0x45, 0x96, + 0x93, 0x73, 0x4c, 0xe0 +}; +static const u8 key53[] __initconst = { + 0x42, 0x4b, 0x20, 0x81, 0x49, 0x50, 0xe9, 0xc2, + 0x43, 0x69, 0x36, 0xe7, 0x68, 0xae, 0xd5, 0x7e, + 0x42, 0x1a, 0x1b, 0xb4, 0x06, 0x4d, 0xa7, 0x17, + 0xb5, 0x31, 0xd6, 0x0c, 0xb0, 0x5c, 0x41, 0x0b +}; +enum { nonce53 = 0xf44ce1931fbda3d7ULL }; + +static const u8 input54[] __initconst = { + 0x7b, 0xf6, 0x8b, 0xae, 0xc0, 0xcb, 0x10, 0x8e, + 0xe8, 0xd8, 0x2e, 0x3b, 0x14, 0xba, 0xb4, 0xd2, + 0x58, 0x6b, 0x2c, 0xec, 0xc1, 0x81, 0x71, 0xb4, + 0xc6, 0xea, 0x08, 0xc5, 0xc9, 0x78, 0xdb, 0xa2, + 0xfa, 0x44, 0x50, 0x9b, 0xc8, 0x53, 0x8d, 0x45, + 0x42, 0xe7, 0x09, 0xc4, 0x29, 0xd8, 0x75, 0x02, + 0xbb, 0xb2, 0x78, 0xcf, 0xe7 +}; +static const u8 output54[] __initconst = { + 0xaf, 0x2c, 0x83, 0x26, 0x6e, 0x7f, 0xa6, 0xe9, + 0x03, 0x75, 0xfe, 0xfe, 0x87, 0x58, 0xcf, 0xb5, + 0xbc, 0x3c, 0x9d, 0xa1, 0x6e, 0x13, 0xf1, 0x0f, + 0x9e, 0xbc, 0xe0, 0x54, 0x24, 0x32, 0xce, 0x95, + 0xe6, 0xa5, 0x59, 0x3d, 0x24, 0x1d, 0x8f, 0xb1, + 0x74, 0x6c, 0x56, 0xe7, 0x96, 0xc1, 0x91, 0xc8, + 0x2d, 0x0e, 0xb7, 0x51, 0x10 +}; +static const u8 key54[] __initconst = { + 0x00, 0x68, 0x74, 0xdc, 0x30, 0x9e, 0xe3, 0x52, + 0xa9, 0xae, 0xb6, 0x7c, 0xa1, 0xdc, 0x12, 0x2d, + 0x98, 0x32, 0x7a, 0x77, 0xe1, 0xdd, 0xa3, 0x76, + 0x72, 0x34, 0x83, 0xd8, 0xb7, 0x69, 0xba, 0x77 +}; +enum { nonce54 = 0xbea57d79b798b63aULL }; + +static const u8 input55[] __initconst = { + 0xb5, 0xf4, 0x2f, 0xc1, 0x5e, 0x10, 0xa7, 0x4e, + 0x74, 0x3d, 0xa3, 0x96, 0xc0, 0x4d, 0x7b, 0x92, + 0x8f, 0xdb, 0x2d, 0x15, 0x52, 0x6a, 0x95, 0x5e, + 0x40, 0x81, 0x4f, 0x70, 0x73, 0xea, 0x84, 0x65, + 0x3d, 0x9a, 0x4e, 0x03, 0x95, 0xf8, 0x5d, 0x2f, + 0x07, 0x02, 0x13, 0x13, 0xdd, 0x82, 0xe6, 0x3b, + 0xe1, 0x5f, 0xb3, 0x37, 0x9b, 0x88 +}; +static const u8 output55[] __initconst = { + 0xc1, 0x88, 0xbd, 0x92, 0x77, 
0xad, 0x7c, 0x5f, + 0xaf, 0xa8, 0x57, 0x0e, 0x40, 0x0a, 0xdc, 0x70, + 0xfb, 0xc6, 0x71, 0xfd, 0xc4, 0x74, 0x60, 0xcc, + 0xa0, 0x89, 0x8e, 0x99, 0xf0, 0x06, 0xa6, 0x7c, + 0x97, 0x42, 0x21, 0x81, 0x6a, 0x07, 0xe7, 0xb3, + 0xf7, 0xa5, 0x03, 0x71, 0x50, 0x05, 0x63, 0x17, + 0xa9, 0x46, 0x0b, 0xff, 0x30, 0x78 +}; +static const u8 key55[] __initconst = { + 0x19, 0x8f, 0xe7, 0xd7, 0x6b, 0x7f, 0x6f, 0x69, + 0x86, 0x91, 0x0f, 0xa7, 0x4a, 0x69, 0x8e, 0x34, + 0xf3, 0xdb, 0xde, 0xaf, 0xf2, 0x66, 0x1d, 0x64, + 0x97, 0x0c, 0xcf, 0xfa, 0x33, 0x84, 0xfd, 0x0c +}; +enum { nonce55 = 0x80aa3d3e2c51ef06ULL }; + +static const u8 input56[] __initconst = { + 0x6b, 0xe9, 0x73, 0x42, 0x27, 0x5e, 0x12, 0xcd, + 0xaa, 0x45, 0x12, 0x8b, 0xb3, 0xe6, 0x54, 0x33, + 0x31, 0x7d, 0xe2, 0x25, 0xc6, 0x86, 0x47, 0x67, + 0x86, 0x83, 0xe4, 0x46, 0xb5, 0x8f, 0x2c, 0xbb, + 0xe4, 0xb8, 0x9f, 0xa2, 0xa4, 0xe8, 0x75, 0x96, + 0x92, 0x51, 0x51, 0xac, 0x8e, 0x2e, 0x6f, 0xfc, + 0xbd, 0x0d, 0xa3, 0x9f, 0x16, 0x55, 0x3e +}; +static const u8 output56[] __initconst = { + 0x42, 0x99, 0x73, 0x6c, 0xd9, 0x4b, 0x16, 0xe5, + 0x18, 0x63, 0x1a, 0xd9, 0x0e, 0xf1, 0x15, 0x2e, + 0x0f, 0x4b, 0xe4, 0x5f, 0xa0, 0x4d, 0xde, 0x9f, + 0xa7, 0x18, 0xc1, 0x0c, 0x0b, 0xae, 0x55, 0xe4, + 0x89, 0x18, 0xa4, 0x78, 0x9d, 0x25, 0x0d, 0xd5, + 0x94, 0x0f, 0xf9, 0x78, 0xa3, 0xa6, 0xe9, 0x9e, + 0x2c, 0x73, 0xf0, 0xf7, 0x35, 0xf3, 0x2b +}; +static const u8 key56[] __initconst = { + 0x7d, 0x12, 0xad, 0x51, 0xd5, 0x6f, 0x8f, 0x96, + 0xc0, 0x5d, 0x9a, 0xd1, 0x7e, 0x20, 0x98, 0x0e, + 0x3c, 0x0a, 0x67, 0x6b, 0x1b, 0x88, 0x69, 0xd4, + 0x07, 0x8c, 0xaf, 0x0f, 0x3a, 0x28, 0xe4, 0x5d +}; +enum { nonce56 = 0x70f4c372fb8b5984ULL }; + +static const u8 input57[] __initconst = { + 0x28, 0xa3, 0x06, 0xe8, 0xe7, 0x08, 0xb9, 0xef, + 0x0d, 0x63, 0x15, 0x99, 0xb2, 0x78, 0x7e, 0xaf, + 0x30, 0x50, 0xcf, 0xea, 0xc9, 0x91, 0x41, 0x2f, + 0x3b, 0x38, 0x70, 0xc4, 0x87, 0xb0, 0x3a, 0xee, + 0x4a, 0xea, 0xe3, 0x83, 0x68, 0x8b, 0xcf, 0xda, + 0x04, 0xa5, 0xbd, 0xb2, 0xde, 0x3c, 0x55, 0x13, + 0xfe, 0x96, 0xad, 0xc1, 0x61, 0x1b, 0x98, 0xde +}; +static const u8 output57[] __initconst = { + 0xf4, 0x44, 0xe9, 0xd2, 0x6d, 0xc2, 0x5a, 0xe9, + 0xfd, 0x7e, 0x41, 0x54, 0x3f, 0xf4, 0x12, 0xd8, + 0x55, 0x0d, 0x12, 0x9b, 0xd5, 0x2e, 0x95, 0xe5, + 0x77, 0x42, 0x3f, 0x2c, 0xfb, 0x28, 0x9d, 0x72, + 0x6d, 0x89, 0x82, 0x27, 0x64, 0x6f, 0x0d, 0x57, + 0xa1, 0x25, 0xa3, 0x6b, 0x88, 0x9a, 0xac, 0x0c, + 0x76, 0x19, 0x90, 0xe2, 0x50, 0x5a, 0xf8, 0x12 +}; +static const u8 key57[] __initconst = { + 0x08, 0x26, 0xb8, 0xac, 0xf3, 0xa5, 0xc6, 0xa3, + 0x7f, 0x09, 0x87, 0xf5, 0x6c, 0x5a, 0x85, 0x6c, + 0x3d, 0xbd, 0xde, 0xd5, 0x87, 0xa3, 0x98, 0x7a, + 0xaa, 0x40, 0x3e, 0xf7, 0xff, 0x44, 0x5d, 0xee +}; +enum { nonce57 = 0xc03a6130bf06b089ULL }; + +static const u8 input58[] __initconst = { + 0x82, 0xa5, 0x38, 0x6f, 0xaa, 0xb4, 0xaf, 0xb2, + 0x42, 0x01, 0xa8, 0x39, 0x3f, 0x15, 0x51, 0xa8, + 0x11, 0x1b, 0x93, 0xca, 0x9c, 0xa0, 0x57, 0x68, + 0x8f, 0xdb, 0x68, 0x53, 0x51, 0x6d, 0x13, 0x22, + 0x12, 0x9b, 0xbd, 0x33, 0xa8, 0x52, 0x40, 0x57, + 0x80, 0x9b, 0x98, 0xef, 0x56, 0x70, 0x11, 0xfa, + 0x36, 0x69, 0x7d, 0x15, 0x48, 0xf9, 0x3b, 0xeb, + 0x42 +}; +static const u8 output58[] __initconst = { + 0xff, 0x3a, 0x74, 0xc3, 0x3e, 0x44, 0x64, 0x4d, + 0x0e, 0x5f, 0x9d, 0xa8, 0xdb, 0xbe, 0x12, 0xef, + 0xba, 0x56, 0x65, 0x50, 0x76, 0xaf, 0xa4, 0x4e, + 0x01, 0xc1, 0xd3, 0x31, 0x14, 0xe2, 0xbe, 0x7b, + 0xa5, 0x67, 0xb4, 0xe3, 0x68, 0x40, 0x9c, 0xb0, + 0xb1, 0x78, 0xef, 0x49, 0x03, 0x0f, 0x2d, 0x56, + 0xb4, 0x37, 0xdb, 0xbc, 0x2d, 0x68, 0x1c, 0x3c, + 0xf1 
+}; +static const u8 key58[] __initconst = { + 0x7e, 0xf1, 0x7c, 0x20, 0x65, 0xed, 0xcd, 0xd7, + 0x57, 0xe8, 0xdb, 0x90, 0x87, 0xdb, 0x5f, 0x63, + 0x3d, 0xdd, 0xb8, 0x2b, 0x75, 0x8e, 0x04, 0xb5, + 0xf4, 0x12, 0x79, 0xa9, 0x4d, 0x42, 0x16, 0x7f +}; +enum { nonce58 = 0x92838183f80d2f7fULL }; + +static const u8 input59[] __initconst = { + 0x37, 0xf1, 0x9d, 0xdd, 0xd7, 0x08, 0x9f, 0x13, + 0xc5, 0x21, 0x82, 0x75, 0x08, 0x9e, 0x25, 0x16, + 0xb1, 0xd1, 0x71, 0x42, 0x28, 0x63, 0xac, 0x47, + 0x71, 0x54, 0xb1, 0xfc, 0x39, 0xf0, 0x61, 0x4f, + 0x7c, 0x6d, 0x4f, 0xc8, 0x33, 0xef, 0x7e, 0xc8, + 0xc0, 0x97, 0xfc, 0x1a, 0x61, 0xb4, 0x87, 0x6f, + 0xdd, 0x5a, 0x15, 0x7b, 0x1b, 0x95, 0x50, 0x94, + 0x1d, 0xba +}; +static const u8 output59[] __initconst = { + 0x73, 0x67, 0xc5, 0x07, 0xbb, 0x57, 0x79, 0xd5, + 0xc9, 0x04, 0xdd, 0x88, 0xf3, 0x86, 0xe5, 0x70, + 0x49, 0x31, 0xe0, 0xcc, 0x3b, 0x1d, 0xdf, 0xb0, + 0xaf, 0xf4, 0x2d, 0xe0, 0x06, 0x10, 0x91, 0x8d, + 0x1c, 0xcf, 0x31, 0x0b, 0xf6, 0x73, 0xda, 0x1c, + 0xf0, 0x17, 0x52, 0x9e, 0x20, 0x2e, 0x9f, 0x8c, + 0xb3, 0x59, 0xce, 0xd4, 0xd3, 0xc1, 0x81, 0xe9, + 0x11, 0x36 +}; +static const u8 key59[] __initconst = { + 0xbd, 0x07, 0xd0, 0x53, 0x2c, 0xb3, 0xcc, 0x3f, + 0xc4, 0x95, 0xfd, 0xe7, 0x81, 0xb3, 0x29, 0x99, + 0x05, 0x45, 0xd6, 0x95, 0x25, 0x0b, 0x72, 0xd3, + 0xcd, 0xbb, 0x73, 0xf8, 0xfa, 0xc0, 0x9b, 0x7a +}; +enum { nonce59 = 0x4a0db819b0d519e2ULL }; + +static const u8 input60[] __initconst = { + 0x58, 0x4e, 0xdf, 0x94, 0x3c, 0x76, 0x0a, 0x79, + 0x47, 0xf1, 0xbe, 0x88, 0xd3, 0xba, 0x94, 0xd8, + 0xe2, 0x8f, 0xe3, 0x2f, 0x2f, 0x74, 0x82, 0x55, + 0xc3, 0xda, 0xe2, 0x4e, 0x2c, 0x8c, 0x45, 0x1d, + 0x72, 0x8f, 0x54, 0x41, 0xb5, 0xb7, 0x69, 0xe4, + 0xdc, 0xd2, 0x36, 0x21, 0x5c, 0x28, 0x52, 0xf7, + 0x98, 0x8e, 0x72, 0xa7, 0x6d, 0x57, 0xed, 0xdc, + 0x3c, 0xe6, 0x6a +}; +static const u8 output60[] __initconst = { + 0xda, 0xaf, 0xb5, 0xe3, 0x30, 0x65, 0x5c, 0xb1, + 0x48, 0x08, 0x43, 0x7b, 0x9e, 0xd2, 0x6a, 0x62, + 0x56, 0x7c, 0xad, 0xd9, 0xe5, 0xf6, 0x09, 0x71, + 0xcd, 0xe6, 0x05, 0x6b, 0x3f, 0x44, 0x3a, 0x5c, + 0xf6, 0xf8, 0xd7, 0xce, 0x7d, 0xd1, 0xe0, 0x4f, + 0x88, 0x15, 0x04, 0xd8, 0x20, 0xf0, 0x3e, 0xef, + 0xae, 0xa6, 0x27, 0xa3, 0x0e, 0xfc, 0x18, 0x90, + 0x33, 0xcd, 0xd3 +}; +static const u8 key60[] __initconst = { + 0xbf, 0xfd, 0x25, 0xb5, 0xb2, 0xfc, 0x78, 0x0c, + 0x8e, 0xb9, 0x57, 0x2f, 0x26, 0x4a, 0x7e, 0x71, + 0xcc, 0xf2, 0xe0, 0xfd, 0x24, 0x11, 0x20, 0x23, + 0x57, 0x00, 0xff, 0x80, 0x11, 0x0c, 0x1e, 0xff +}; +enum { nonce60 = 0xf18df56fdb7954adULL }; + +static const u8 input61[] __initconst = { + 0xb0, 0xf3, 0x06, 0xbc, 0x22, 0xae, 0x49, 0x40, + 0xae, 0xff, 0x1b, 0x31, 0xa7, 0x98, 0xab, 0x1d, + 0xe7, 0x40, 0x23, 0x18, 0x4f, 0xab, 0x8e, 0x93, + 0x82, 0xf4, 0x56, 0x61, 0xfd, 0x2b, 0xcf, 0xa7, + 0xc4, 0xb4, 0x0a, 0xf4, 0xcb, 0xc7, 0x8c, 0x40, + 0x57, 0xac, 0x0b, 0x3e, 0x2a, 0x0a, 0x67, 0x83, + 0x50, 0xbf, 0xec, 0xb0, 0xc7, 0xf1, 0x32, 0x26, + 0x98, 0x80, 0x33, 0xb4 +}; +static const u8 output61[] __initconst = { + 0x9d, 0x23, 0x0e, 0xff, 0xcc, 0x7c, 0xd5, 0xcf, + 0x1a, 0xb8, 0x59, 0x1e, 0x92, 0xfd, 0x7f, 0xca, + 0xca, 0x3c, 0x18, 0x81, 0xde, 0xfa, 0x59, 0xc8, + 0x6f, 0x9c, 0x24, 0x3f, 0x3a, 0xe6, 0x0b, 0xb4, + 0x34, 0x48, 0x69, 0xfc, 0xb6, 0xea, 0xb2, 0xde, + 0x9f, 0xfd, 0x92, 0x36, 0x18, 0x98, 0x99, 0xaa, + 0x65, 0xe2, 0xea, 0xf4, 0xb1, 0x47, 0x8e, 0xb0, + 0xe7, 0xd4, 0x7a, 0x2c +}; +static const u8 key61[] __initconst = { + 0xd7, 0xfd, 0x9b, 0xbd, 0x8f, 0x65, 0x0d, 0x00, + 0xca, 0xa1, 0x6c, 0x85, 0x85, 0xa4, 0x6d, 0xf1, + 0xb1, 0x68, 0x0c, 0x8b, 0x5d, 0x37, 0x72, 
0xd0, + 0xd8, 0xd2, 0x25, 0xab, 0x9f, 0x7b, 0x7d, 0x95 +}; +enum { nonce61 = 0xd82caf72a9c4864fULL }; + +static const u8 input62[] __initconst = { + 0x10, 0x77, 0xf3, 0x2f, 0xc2, 0x50, 0xd6, 0x0c, + 0xba, 0xa8, 0x8d, 0xce, 0x0d, 0x58, 0x9e, 0x87, + 0xb1, 0x59, 0x66, 0x0a, 0x4a, 0xb3, 0xd8, 0xca, + 0x0a, 0x6b, 0xf8, 0xc6, 0x2b, 0x3f, 0x8e, 0x09, + 0xe0, 0x0a, 0x15, 0x85, 0xfe, 0xaa, 0xc6, 0xbd, + 0x30, 0xef, 0xe4, 0x10, 0x78, 0x03, 0xc1, 0xc7, + 0x8a, 0xd9, 0xde, 0x0b, 0x51, 0x07, 0xc4, 0x7b, + 0xe2, 0x2e, 0x36, 0x3a, 0xc2 +}; +static const u8 output62[] __initconst = { + 0xa0, 0x0c, 0xfc, 0xc1, 0xf6, 0xaf, 0xc2, 0xb8, + 0x5c, 0xef, 0x6e, 0xf3, 0xce, 0x15, 0x48, 0x05, + 0xb5, 0x78, 0x49, 0x51, 0x1f, 0x9d, 0xf4, 0xbf, + 0x2f, 0x53, 0xa2, 0xd1, 0x15, 0x20, 0x82, 0x6b, + 0xd2, 0x22, 0x6c, 0x4e, 0x14, 0x87, 0xe3, 0xd7, + 0x49, 0x45, 0x84, 0xdb, 0x5f, 0x68, 0x60, 0xc4, + 0xb3, 0xe6, 0x3f, 0xd1, 0xfc, 0xa5, 0x73, 0xf3, + 0xfc, 0xbb, 0xbe, 0xc8, 0x9d +}; +static const u8 key62[] __initconst = { + 0x6e, 0xc9, 0xaf, 0xce, 0x35, 0xb9, 0x86, 0xd1, + 0xce, 0x5f, 0xd9, 0xbb, 0xd5, 0x1f, 0x7c, 0xcd, + 0xfe, 0x19, 0xaa, 0x3d, 0xea, 0x64, 0xc1, 0x28, + 0x40, 0xba, 0xa1, 0x28, 0xcd, 0x40, 0xb6, 0xf2 +}; +enum { nonce62 = 0xa1c0c265f900cde8ULL }; + +static const u8 input63[] __initconst = { + 0x7a, 0x70, 0x21, 0x2c, 0xef, 0xa6, 0x36, 0xd4, + 0xe0, 0xab, 0x8c, 0x25, 0x73, 0x34, 0xc8, 0x94, + 0x6c, 0x81, 0xcb, 0x19, 0x8d, 0x5a, 0x49, 0xaa, + 0x6f, 0xba, 0x83, 0x72, 0x02, 0x5e, 0xf5, 0x89, + 0xce, 0x79, 0x7e, 0x13, 0x3d, 0x5b, 0x98, 0x60, + 0x5d, 0xd9, 0xfb, 0x15, 0x93, 0x4c, 0xf3, 0x51, + 0x49, 0x55, 0xd1, 0x58, 0xdd, 0x7e, 0x6d, 0xfe, + 0xdd, 0x84, 0x23, 0x05, 0xba, 0xe9 +}; +static const u8 output63[] __initconst = { + 0x20, 0xb3, 0x5c, 0x03, 0x03, 0x78, 0x17, 0xfc, + 0x3b, 0x35, 0x30, 0x9a, 0x00, 0x18, 0xf5, 0xc5, + 0x06, 0x53, 0xf5, 0x04, 0x24, 0x9d, 0xd1, 0xb2, + 0xac, 0x5a, 0xb6, 0x2a, 0xa5, 0xda, 0x50, 0x00, + 0xec, 0xff, 0xa0, 0x7a, 0x14, 0x7b, 0xe4, 0x6b, + 0x63, 0xe8, 0x66, 0x86, 0x34, 0xfd, 0x74, 0x44, + 0xa2, 0x50, 0x97, 0x0d, 0xdc, 0xc3, 0x84, 0xf8, + 0x71, 0x02, 0x31, 0x95, 0xed, 0x54 +}; +static const u8 key63[] __initconst = { + 0x7d, 0x64, 0xb4, 0x12, 0x81, 0xe4, 0xe6, 0x8f, + 0xcc, 0xe7, 0xd1, 0x1f, 0x70, 0x20, 0xfd, 0xb8, + 0x3a, 0x7d, 0xa6, 0x53, 0x65, 0x30, 0x5d, 0xe3, + 0x1a, 0x44, 0xbe, 0x62, 0xed, 0x90, 0xc4, 0xd1 +}; +enum { nonce63 = 0xe8e849596c942276ULL }; + +static const u8 input64[] __initconst = { + 0x84, 0xf8, 0xda, 0x87, 0x23, 0x39, 0x60, 0xcf, + 0xc5, 0x50, 0x7e, 0xc5, 0x47, 0x29, 0x7c, 0x05, + 0xc2, 0xb4, 0xf4, 0xb2, 0xec, 0x5d, 0x48, 0x36, + 0xbf, 0xfc, 0x06, 0x8c, 0xf2, 0x0e, 0x88, 0xe7, + 0xc9, 0xc5, 0xa4, 0xa2, 0x83, 0x20, 0xa1, 0x6f, + 0x37, 0xe5, 0x2d, 0xa1, 0x72, 0xa1, 0x19, 0xef, + 0x05, 0x42, 0x08, 0xf2, 0x57, 0x47, 0x31, 0x1e, + 0x17, 0x76, 0x13, 0xd3, 0xcc, 0x75, 0x2c +}; +static const u8 output64[] __initconst = { + 0xcb, 0xec, 0x90, 0x88, 0xeb, 0x31, 0x69, 0x20, + 0xa6, 0xdc, 0xff, 0x76, 0x98, 0xb0, 0x24, 0x49, + 0x7b, 0x20, 0xd9, 0xd1, 0x1b, 0xe3, 0x61, 0xdc, + 0xcf, 0x51, 0xf6, 0x70, 0x72, 0x33, 0x28, 0x94, + 0xac, 0x73, 0x18, 0xcf, 0x93, 0xfd, 0xca, 0x08, + 0x0d, 0xa2, 0xb9, 0x57, 0x1e, 0x51, 0xb6, 0x07, + 0x5c, 0xc1, 0x13, 0x64, 0x1d, 0x18, 0x6f, 0xe6, + 0x0b, 0xb7, 0x14, 0x03, 0x43, 0xb6, 0xaf +}; +static const u8 key64[] __initconst = { + 0xbf, 0x82, 0x65, 0xe4, 0x50, 0xf9, 0x5e, 0xea, + 0x28, 0x91, 0xd1, 0xd2, 0x17, 0x7c, 0x13, 0x7e, + 0xf5, 0xd5, 0x6b, 0x06, 0x1c, 0x20, 0xc2, 0x82, + 0xa1, 0x7a, 0xa2, 0x14, 0xa1, 0xb0, 0x54, 0x58 +}; +enum { nonce64 = 
0xe57c5095aa5723c9ULL }; + +static const u8 input65[] __initconst = { + 0x1c, 0xfb, 0xd3, 0x3f, 0x85, 0xd7, 0xba, 0x7b, + 0xae, 0xb1, 0xa5, 0xd2, 0xe5, 0x40, 0xce, 0x4d, + 0x3e, 0xab, 0x17, 0x9d, 0x7d, 0x9f, 0x03, 0x98, + 0x3f, 0x9f, 0xc8, 0xdd, 0x36, 0x17, 0x43, 0x5c, + 0x34, 0xd1, 0x23, 0xe0, 0x77, 0xbf, 0x35, 0x5d, + 0x8f, 0xb1, 0xcb, 0x82, 0xbb, 0x39, 0x69, 0xd8, + 0x90, 0x45, 0x37, 0xfd, 0x98, 0x25, 0xf7, 0x5b, + 0xce, 0x06, 0x43, 0xba, 0x61, 0xa8, 0x47, 0xb9 +}; +static const u8 output65[] __initconst = { + 0x73, 0xa5, 0x68, 0xab, 0x8b, 0xa5, 0xc3, 0x7e, + 0x74, 0xf8, 0x9d, 0xf5, 0x93, 0x6e, 0xf2, 0x71, + 0x6d, 0xde, 0x82, 0xc5, 0x40, 0xa0, 0x46, 0xb3, + 0x9a, 0x78, 0xa8, 0xf7, 0xdf, 0xb1, 0xc3, 0xdd, + 0x8d, 0x90, 0x00, 0x68, 0x21, 0x48, 0xe8, 0xba, + 0x56, 0x9f, 0x8f, 0xe7, 0xa4, 0x4d, 0x36, 0x55, + 0xd0, 0x34, 0x99, 0xa6, 0x1c, 0x4c, 0xc1, 0xe2, + 0x65, 0x98, 0x14, 0x8e, 0x6a, 0x05, 0xb1, 0x2b +}; +static const u8 key65[] __initconst = { + 0xbd, 0x5c, 0x8a, 0xb0, 0x11, 0x29, 0xf3, 0x00, + 0x7a, 0x78, 0x32, 0x63, 0x34, 0x00, 0xe6, 0x7d, + 0x30, 0x54, 0xde, 0x37, 0xda, 0xc2, 0xc4, 0x3d, + 0x92, 0x6b, 0x4c, 0xc2, 0x92, 0xe9, 0x9e, 0x2a +}; +enum { nonce65 = 0xf654a3031de746f2ULL }; + +static const u8 input66[] __initconst = { + 0x4b, 0x27, 0x30, 0x8f, 0x28, 0xd8, 0x60, 0x46, + 0x39, 0x06, 0x49, 0xea, 0x1b, 0x71, 0x26, 0xe0, + 0x99, 0x2b, 0xd4, 0x8f, 0x64, 0x64, 0xcd, 0xac, + 0x1d, 0x78, 0x88, 0x90, 0xe1, 0x5c, 0x24, 0x4b, + 0xdc, 0x2d, 0xb7, 0xee, 0x3a, 0xe6, 0x86, 0x2c, + 0x21, 0xe4, 0x2b, 0xfc, 0xe8, 0x19, 0xca, 0x65, + 0xe7, 0xdd, 0x6f, 0x52, 0xb3, 0x11, 0xe1, 0xe2, + 0xbf, 0xe8, 0x70, 0xe3, 0x0d, 0x45, 0xb8, 0xa5, + 0x20, 0xb7, 0xb5, 0xaf, 0xff, 0x08, 0xcf, 0x23, + 0x65, 0xdf, 0x8d, 0xc3, 0x31, 0xf3, 0x1e, 0x6a, + 0x58, 0x8d, 0xcc, 0x45, 0x16, 0x86, 0x1f, 0x31, + 0x5c, 0x27, 0xcd, 0xc8, 0x6b, 0x19, 0x1e, 0xec, + 0x44, 0x75, 0x63, 0x97, 0xfd, 0x79, 0xf6, 0x62, + 0xc5, 0xba, 0x17, 0xc7, 0xab, 0x8f, 0xbb, 0xed, + 0x85, 0x2a, 0x98, 0x79, 0x21, 0xec, 0x6e, 0x4d, + 0xdc, 0xfa, 0x72, 0x52, 0xba, 0xc8, 0x4c +}; +static const u8 output66[] __initconst = { + 0x76, 0x5b, 0x2c, 0xa7, 0x62, 0xb9, 0x08, 0x4a, + 0xc6, 0x4a, 0x92, 0xc3, 0xbb, 0x10, 0xb3, 0xee, + 0xff, 0xb9, 0x07, 0xc7, 0x27, 0xcb, 0x1e, 0xcf, + 0x58, 0x6f, 0xa1, 0x64, 0xe8, 0xf1, 0x4e, 0xe1, + 0xef, 0x18, 0x96, 0xab, 0x97, 0x28, 0xd1, 0x7c, + 0x71, 0x6c, 0xd1, 0xe2, 0xfa, 0xd9, 0x75, 0xcb, + 0xeb, 0xea, 0x0c, 0x86, 0x82, 0xd8, 0xf4, 0xcc, + 0xea, 0xa3, 0x00, 0xfa, 0x82, 0xd2, 0xcd, 0xcb, + 0xdb, 0x63, 0x28, 0xe2, 0x82, 0xe9, 0x01, 0xed, + 0x31, 0xe6, 0x71, 0x45, 0x08, 0x89, 0x8a, 0x23, + 0xa8, 0xb5, 0xc2, 0xe2, 0x9f, 0xe9, 0xb8, 0x9a, + 0xc4, 0x79, 0x6d, 0x71, 0x52, 0x61, 0x74, 0x6c, + 0x1b, 0xd7, 0x65, 0x6d, 0x03, 0xc4, 0x1a, 0xc0, + 0x50, 0xba, 0xd6, 0xc9, 0x43, 0x50, 0xbe, 0x09, + 0x09, 0x8a, 0xdb, 0xaa, 0x76, 0x4e, 0x3b, 0x61, + 0x3c, 0x7c, 0x44, 0xe7, 0xdb, 0x10, 0xa7 +}; +static const u8 key66[] __initconst = { + 0x88, 0xdf, 0xca, 0x68, 0xaf, 0x4f, 0xb3, 0xfd, + 0x6e, 0xa7, 0x95, 0x35, 0x8a, 0xe8, 0x37, 0xe8, + 0xc8, 0x55, 0xa2, 0x2a, 0x6d, 0x77, 0xf8, 0x93, + 0x7a, 0x41, 0xf3, 0x7b, 0x95, 0xdf, 0x89, 0xf5 +}; +enum { nonce66 = 0x1024b4fdd415cf82ULL }; + +static const u8 input67[] __initconst = { + 0xd4, 0x2e, 0xfa, 0x92, 0xe9, 0x29, 0x68, 0xb7, + 0x54, 0x2c, 0xf7, 0xa4, 0x2d, 0xb7, 0x50, 0xb5, + 0xc5, 0xb2, 0x9d, 0x17, 0x5e, 0x0a, 0xca, 0x37, + 0xbf, 0x60, 0xae, 0xd2, 0x98, 0xe9, 0xfa, 0x59, + 0x67, 0x62, 0xe6, 0x43, 0x0c, 0x77, 0x80, 0x82, + 0x33, 0x61, 0xa3, 0xff, 0xc1, 0xa0, 0x8f, 0x56, + 0xbc, 0xec, 0x65, 0x43, 0x88, 
0xa5, 0xff, 0x51, + 0x64, 0x30, 0xee, 0x34, 0xb7, 0x5c, 0x28, 0x68, + 0xc3, 0x52, 0xd2, 0xac, 0x78, 0x2a, 0xa6, 0x10, + 0xb8, 0xb2, 0x4c, 0x80, 0x4f, 0x99, 0xb2, 0x36, + 0x94, 0x8f, 0x66, 0xcb, 0xa1, 0x91, 0xed, 0x06, + 0x42, 0x6d, 0xc1, 0xae, 0x55, 0x93, 0xdd, 0x93, + 0x9e, 0x88, 0x34, 0x7f, 0x98, 0xeb, 0xbe, 0x61, + 0xf9, 0xa9, 0x0f, 0xd9, 0xc4, 0x87, 0xd5, 0xef, + 0xcc, 0x71, 0x8c, 0x0e, 0xce, 0xad, 0x02, 0xcf, + 0xa2, 0x61, 0xdf, 0xb1, 0xfe, 0x3b, 0xdc, 0xc0, + 0x58, 0xb5, 0x71, 0xa1, 0x83, 0xc9, 0xb4, 0xaf, + 0x9d, 0x54, 0x12, 0xcd, 0xea, 0x06, 0xd6, 0x4e, + 0xe5, 0x27, 0x0c, 0xc3, 0xbb, 0xa8, 0x0a, 0x81, + 0x75, 0xc3, 0xc9, 0xd4, 0x35, 0x3e, 0x53, 0x9f, + 0xaa, 0x20, 0xc0, 0x68, 0x39, 0x2c, 0x96, 0x39, + 0x53, 0x81, 0xda, 0x07, 0x0f, 0x44, 0xa5, 0x47, + 0x0e, 0xb3, 0x87, 0x0d, 0x1b, 0xc1, 0xe5, 0x41, + 0x35, 0x12, 0x58, 0x96, 0x69, 0x8a, 0x1a, 0xa3, + 0x9d, 0x3d, 0xd4, 0xb1, 0x8e, 0x1f, 0x96, 0x87, + 0xda, 0xd3, 0x19, 0xe2, 0xb1, 0x3a, 0x19, 0x74, + 0xa0, 0x00, 0x9f, 0x4d, 0xbc, 0xcb, 0x0c, 0xe9, + 0xec, 0x10, 0xdf, 0x2a, 0x88, 0xdc, 0x30, 0x51, + 0x46, 0x56, 0x53, 0x98, 0x6a, 0x26, 0x14, 0x05, + 0x54, 0x81, 0x55, 0x0b, 0x3c, 0x85, 0xdd, 0x33, + 0x81, 0x11, 0x29, 0x82, 0x46, 0x35, 0xe1, 0xdb, + 0x59, 0x7b +}; +static const u8 output67[] __initconst = { + 0x64, 0x6c, 0xda, 0x7f, 0xd4, 0xa9, 0x2a, 0x5e, + 0x22, 0xae, 0x8d, 0x67, 0xdb, 0xee, 0xfd, 0xd0, + 0x44, 0x80, 0x17, 0xb2, 0xe3, 0x87, 0xad, 0x57, + 0x15, 0xcb, 0x88, 0x64, 0xc0, 0xf1, 0x49, 0x3d, + 0xfa, 0xbe, 0xa8, 0x9f, 0x12, 0xc3, 0x57, 0x56, + 0x70, 0xa5, 0xc5, 0x6b, 0xf1, 0xab, 0xd5, 0xde, + 0x77, 0x92, 0x6a, 0x56, 0x03, 0xf5, 0x21, 0x0d, + 0xb6, 0xc4, 0xcc, 0x62, 0x44, 0x3f, 0xb1, 0xc1, + 0x61, 0x41, 0x90, 0xb2, 0xd5, 0xb8, 0xf3, 0x57, + 0xfb, 0xc2, 0x6b, 0x25, 0x58, 0xc8, 0x45, 0x20, + 0x72, 0x29, 0x6f, 0x9d, 0xb5, 0x81, 0x4d, 0x2b, + 0xb2, 0x89, 0x9e, 0x91, 0x53, 0x97, 0x1c, 0xd9, + 0x3d, 0x79, 0xdc, 0x14, 0xae, 0x01, 0x73, 0x75, + 0xf0, 0xca, 0xd5, 0xab, 0x62, 0x5c, 0x7a, 0x7d, + 0x3f, 0xfe, 0x22, 0x7d, 0xee, 0xe2, 0xcb, 0x76, + 0x55, 0xec, 0x06, 0xdd, 0x41, 0x47, 0x18, 0x62, + 0x1d, 0x57, 0xd0, 0xd6, 0xb6, 0x0f, 0x4b, 0xfc, + 0x79, 0x19, 0xf4, 0xd6, 0x37, 0x86, 0x18, 0x1f, + 0x98, 0x0d, 0x9e, 0x15, 0x2d, 0xb6, 0x9a, 0x8a, + 0x8c, 0x80, 0x22, 0x2f, 0x82, 0xc4, 0xc7, 0x36, + 0xfa, 0xfa, 0x07, 0xbd, 0xc2, 0x2a, 0xe2, 0xea, + 0x93, 0xc8, 0xb2, 0x90, 0x33, 0xf2, 0xee, 0x4b, + 0x1b, 0xf4, 0x37, 0x92, 0x13, 0xbb, 0xe2, 0xce, + 0xe3, 0x03, 0xcf, 0x07, 0x94, 0xab, 0x9a, 0xc9, + 0xff, 0x83, 0x69, 0x3a, 0xda, 0x2c, 0xd0, 0x47, + 0x3d, 0x6c, 0x1a, 0x60, 0x68, 0x47, 0xb9, 0x36, + 0x52, 0xdd, 0x16, 0xef, 0x6c, 0xbf, 0x54, 0x11, + 0x72, 0x62, 0xce, 0x8c, 0x9d, 0x90, 0xa0, 0x25, + 0x06, 0x92, 0x3e, 0x12, 0x7e, 0x1a, 0x1d, 0xe5, + 0xa2, 0x71, 0xce, 0x1c, 0x4c, 0x6a, 0x7c, 0xdc, + 0x3d, 0xe3, 0x6e, 0x48, 0x9d, 0xb3, 0x64, 0x7d, + 0x78, 0x40 +}; +static const u8 key67[] __initconst = { + 0xa9, 0x20, 0x75, 0x89, 0x7e, 0x37, 0x85, 0x48, + 0xa3, 0xfb, 0x7b, 0xe8, 0x30, 0xa7, 0xe3, 0x6e, + 0xa6, 0xc1, 0x71, 0x17, 0xc1, 0x6c, 0x9b, 0xc2, + 0xde, 0xf0, 0xa7, 0x19, 0xec, 0xce, 0xc6, 0x53 +}; +enum { nonce67 = 0x4adc4d1f968c8a10ULL }; + +static const u8 input68[] __initconst = { + 0x99, 0xae, 0x72, 0xfb, 0x16, 0xe1, 0xf1, 0x59, + 0x43, 0x15, 0x4e, 0x33, 0xa0, 0x95, 0xe7, 0x6c, + 0x74, 0x24, 0x31, 0xca, 0x3b, 0x2e, 0xeb, 0xd7, + 0x11, 0xd8, 0xe0, 0x56, 0x92, 0x91, 0x61, 0x57, + 0xe2, 0x82, 0x9f, 0x8f, 0x37, 0xf5, 0x3d, 0x24, + 0x92, 0x9d, 0x87, 0x00, 0x8d, 0x89, 0xe0, 0x25, + 0x8b, 0xe4, 0x20, 0x5b, 0x8a, 0x26, 0x2c, 0x61, + 0x78, 0xb0, 0xa6, 0x3e, 
0x82, 0x18, 0xcf, 0xdc, + 0x2d, 0x24, 0xdd, 0x81, 0x42, 0xc4, 0x95, 0xf0, + 0x48, 0x60, 0x71, 0xe3, 0xe3, 0xac, 0xec, 0xbe, + 0x98, 0x6b, 0x0c, 0xb5, 0x6a, 0xa9, 0xc8, 0x79, + 0x23, 0x2e, 0x38, 0x0b, 0x72, 0x88, 0x8c, 0xe7, + 0x71, 0x8b, 0x36, 0xe3, 0x58, 0x3d, 0x9c, 0xa0, + 0xa2, 0xea, 0xcf, 0x0c, 0x6a, 0x6c, 0x64, 0xdf, + 0x97, 0x21, 0x8f, 0x93, 0xfb, 0xba, 0xf3, 0x5a, + 0xd7, 0x8f, 0xa6, 0x37, 0x15, 0x50, 0x43, 0x02, + 0x46, 0x7f, 0x93, 0x46, 0x86, 0x31, 0xe2, 0xaa, + 0x24, 0xa8, 0x26, 0xae, 0xe6, 0xc0, 0x05, 0x73, + 0x0b, 0x4f, 0x7e, 0xed, 0x65, 0xeb, 0x56, 0x1e, + 0xb6, 0xb3, 0x0b, 0xc3, 0x0e, 0x31, 0x95, 0xa9, + 0x18, 0x4d, 0xaf, 0x38, 0xd7, 0xec, 0xc6, 0x44, + 0x72, 0x77, 0x4e, 0x25, 0x4b, 0x25, 0xdd, 0x1e, + 0x8c, 0xa2, 0xdf, 0xf6, 0x2a, 0x97, 0x1a, 0x88, + 0x2c, 0x8a, 0x5d, 0xfe, 0xe8, 0xfb, 0x35, 0xe8, + 0x0f, 0x2b, 0x7a, 0x18, 0x69, 0x43, 0x31, 0x1d, + 0x38, 0x6a, 0x62, 0x95, 0x0f, 0x20, 0x4b, 0xbb, + 0x97, 0x3c, 0xe0, 0x64, 0x2f, 0x52, 0xc9, 0x2d, + 0x4d, 0x9d, 0x54, 0x04, 0x3d, 0xc9, 0xea, 0xeb, + 0xd0, 0x86, 0x52, 0xff, 0x42, 0xe1, 0x0d, 0x7a, + 0xad, 0x88, 0xf9, 0x9b, 0x1e, 0x5e, 0x12, 0x27, + 0x95, 0x3e, 0x0c, 0x2c, 0x13, 0x00, 0x6f, 0x8e, + 0x93, 0x69, 0x0e, 0x01, 0x8c, 0xc1, 0xfd, 0xb3 +}; +static const u8 output68[] __initconst = { + 0x26, 0x3e, 0xf2, 0xb1, 0xf5, 0xef, 0x81, 0xa4, + 0xb7, 0x42, 0xd4, 0x26, 0x18, 0x4b, 0xdd, 0x6a, + 0x47, 0x15, 0xcb, 0x0e, 0x57, 0xdb, 0xa7, 0x29, + 0x7e, 0x7b, 0x3f, 0x47, 0x89, 0x57, 0xab, 0xea, + 0x14, 0x7b, 0xcf, 0x37, 0xdb, 0x1c, 0xe1, 0x11, + 0x77, 0xae, 0x2e, 0x4c, 0xd2, 0x08, 0x3f, 0xa6, + 0x62, 0x86, 0xa6, 0xb2, 0x07, 0xd5, 0x3f, 0x9b, + 0xdc, 0xc8, 0x50, 0x4b, 0x7b, 0xb9, 0x06, 0xe6, + 0xeb, 0xac, 0x98, 0x8c, 0x36, 0x0c, 0x1e, 0xb2, + 0xc8, 0xfb, 0x24, 0x60, 0x2c, 0x08, 0x17, 0x26, + 0x5b, 0xc8, 0xc2, 0xdf, 0x9c, 0x73, 0x67, 0x4a, + 0xdb, 0xcf, 0xd5, 0x2c, 0x2b, 0xca, 0x24, 0xcc, + 0xdb, 0xc9, 0xa8, 0xf2, 0x5d, 0x67, 0xdf, 0x5c, + 0x62, 0x0b, 0x58, 0xc0, 0x83, 0xde, 0x8b, 0xf6, + 0x15, 0x0a, 0xd6, 0x32, 0xd8, 0xf5, 0xf2, 0x5f, + 0x33, 0xce, 0x7e, 0xab, 0x76, 0xcd, 0x14, 0x91, + 0xd8, 0x41, 0x90, 0x93, 0xa1, 0xaf, 0xf3, 0x45, + 0x6c, 0x1b, 0x25, 0xbd, 0x48, 0x51, 0x6d, 0x15, + 0x47, 0xe6, 0x23, 0x50, 0x32, 0x69, 0x1e, 0xb5, + 0x94, 0xd3, 0x97, 0xba, 0xd7, 0x37, 0x4a, 0xba, + 0xb9, 0xcd, 0xfb, 0x96, 0x9a, 0x90, 0xe0, 0x37, + 0xf8, 0xdf, 0x91, 0x6c, 0x62, 0x13, 0x19, 0x21, + 0x4b, 0xa9, 0xf1, 0x12, 0x66, 0xe2, 0x74, 0xd7, + 0x81, 0xa0, 0x74, 0x8d, 0x7e, 0x7e, 0xc9, 0xb1, + 0x69, 0x8f, 0xed, 0xb3, 0xf6, 0x97, 0xcd, 0x72, + 0x78, 0x93, 0xd3, 0x54, 0x6b, 0x43, 0xac, 0x29, + 0xb4, 0xbc, 0x7d, 0xa4, 0x26, 0x4b, 0x7b, 0xab, + 0xd6, 0x67, 0x22, 0xff, 0x03, 0x92, 0xb6, 0xd4, + 0x96, 0x94, 0x5a, 0xe5, 0x02, 0x35, 0x77, 0xfa, + 0x3f, 0x54, 0x1d, 0xdd, 0x35, 0x39, 0xfe, 0x03, + 0xdd, 0x8e, 0x3c, 0x8c, 0xc2, 0x69, 0x2a, 0xb1, + 0xb7, 0xb3, 0xa1, 0x89, 0x84, 0xea, 0x16, 0xe2 +}; +static const u8 key68[] __initconst = { + 0xd2, 0x49, 0x7f, 0xd7, 0x49, 0x66, 0x0d, 0xb3, + 0x5a, 0x7e, 0x3c, 0xfc, 0x37, 0x83, 0x0e, 0xf7, + 0x96, 0xd8, 0xd6, 0x33, 0x79, 0x2b, 0x84, 0x53, + 0x06, 0xbc, 0x6c, 0x0a, 0x55, 0x84, 0xfe, 0xab +}; +enum { nonce68 = 0x6a6df7ff0a20de06ULL }; + +static const u8 input69[] __initconst = { + 0xf9, 0x18, 0x4c, 0xd2, 0x3f, 0xf7, 0x22, 0xd9, + 0x58, 0xb6, 0x3b, 0x38, 0x69, 0x79, 0xf4, 0x71, + 0x5f, 0x38, 0x52, 0x1f, 0x17, 0x6f, 0x6f, 0xd9, + 0x09, 0x2b, 0xfb, 0x67, 0xdc, 0xc9, 0xe8, 0x4a, + 0x70, 0x9f, 0x2e, 0x3c, 0x06, 0xe5, 0x12, 0x20, + 0x25, 0x29, 0xd0, 0xdc, 0x81, 0xc5, 0xc6, 0x0f, + 0xd2, 0xa8, 0x81, 0x15, 0x98, 0xb2, 0x71, 
0x5a, + 0x9a, 0xe9, 0xfb, 0xaf, 0x0e, 0x5f, 0x8a, 0xf3, + 0x16, 0x4a, 0x47, 0xf2, 0x5c, 0xbf, 0xda, 0x52, + 0x9a, 0xa6, 0x36, 0xfd, 0xc6, 0xf7, 0x66, 0x00, + 0xcc, 0x6c, 0xd4, 0xb3, 0x07, 0x6d, 0xeb, 0xfe, + 0x92, 0x71, 0x25, 0xd0, 0xcf, 0x9c, 0xe8, 0x65, + 0x45, 0x10, 0xcf, 0x62, 0x74, 0x7d, 0xf2, 0x1b, + 0x57, 0xa0, 0xf1, 0x6b, 0xa4, 0xd5, 0xfa, 0x12, + 0x27, 0x5a, 0xf7, 0x99, 0xfc, 0xca, 0xf3, 0xb8, + 0x2c, 0x8b, 0xba, 0x28, 0x74, 0xde, 0x8f, 0x78, + 0xa2, 0x8c, 0xaf, 0x89, 0x4b, 0x05, 0xe2, 0xf3, + 0xf8, 0xd2, 0xef, 0xac, 0xa4, 0xc4, 0xe2, 0xe2, + 0x36, 0xbb, 0x5e, 0xae, 0xe6, 0x87, 0x3d, 0x88, + 0x9f, 0xb8, 0x11, 0xbb, 0xcf, 0x57, 0xce, 0xd0, + 0xba, 0x62, 0xf4, 0xf8, 0x9b, 0x95, 0x04, 0xc9, + 0xcf, 0x01, 0xe9, 0xf1, 0xc8, 0xc6, 0x22, 0xa4, + 0xf2, 0x8b, 0x2f, 0x24, 0x0a, 0xf5, 0x6e, 0xb7, + 0xd4, 0x2c, 0xb6, 0xf7, 0x5c, 0x97, 0x61, 0x0b, + 0xd9, 0xb5, 0x06, 0xcd, 0xed, 0x3e, 0x1f, 0xc5, + 0xb2, 0x6c, 0xa3, 0xea, 0xb8, 0xad, 0xa6, 0x42, + 0x88, 0x7a, 0x52, 0xd5, 0x64, 0xba, 0xb5, 0x20, + 0x10, 0xa0, 0x0f, 0x0d, 0xea, 0xef, 0x5a, 0x9b, + 0x27, 0xb8, 0xca, 0x20, 0x19, 0x6d, 0xa8, 0xc4, + 0x46, 0x04, 0xb3, 0xe8, 0xf8, 0x66, 0x1b, 0x0a, + 0xce, 0x76, 0x5d, 0x59, 0x58, 0x05, 0xee, 0x3e, + 0x3c, 0x86, 0x5b, 0x49, 0x1c, 0x72, 0x18, 0x01, + 0x62, 0x92, 0x0f, 0x3e, 0xd1, 0x57, 0x5e, 0x20, + 0x7b, 0xfb, 0x4d, 0x3c, 0xc5, 0x35, 0x43, 0x2f, + 0xb0, 0xc5, 0x7c, 0xe4, 0xa2, 0x84, 0x13, 0x77 +}; +static const u8 output69[] __initconst = { + 0xbb, 0x4a, 0x7f, 0x7c, 0xd5, 0x2f, 0x89, 0x06, + 0xec, 0x20, 0xf1, 0x9a, 0x11, 0x09, 0x14, 0x2e, + 0x17, 0x50, 0xf9, 0xd5, 0xf5, 0x48, 0x7c, 0x7a, + 0x55, 0xc0, 0x57, 0x03, 0xe3, 0xc4, 0xb2, 0xb7, + 0x18, 0x47, 0x95, 0xde, 0xaf, 0x80, 0x06, 0x3c, + 0x5a, 0xf2, 0xc3, 0x53, 0xe3, 0x29, 0x92, 0xf8, + 0xff, 0x64, 0x85, 0xb9, 0xf7, 0xd3, 0x80, 0xd2, + 0x0c, 0x5d, 0x7b, 0x57, 0x0c, 0x51, 0x79, 0x86, + 0xf3, 0x20, 0xd2, 0xb8, 0x6e, 0x0c, 0x5a, 0xce, + 0xeb, 0x88, 0x02, 0x8b, 0x82, 0x1b, 0x7f, 0xf5, + 0xde, 0x7f, 0x48, 0x48, 0xdf, 0xa0, 0x55, 0xc6, + 0x0c, 0x22, 0xa1, 0x80, 0x8d, 0x3b, 0xcb, 0x40, + 0x2d, 0x3d, 0x0b, 0xf2, 0xe0, 0x22, 0x13, 0x99, + 0xe1, 0xa7, 0x27, 0x68, 0x31, 0xe1, 0x24, 0x5d, + 0xd2, 0xee, 0x16, 0xc1, 0xd7, 0xa8, 0x14, 0x19, + 0x23, 0x72, 0x67, 0x27, 0xdc, 0x5e, 0xb9, 0xc7, + 0xd8, 0xe3, 0x55, 0x50, 0x40, 0x98, 0x7b, 0xe7, + 0x34, 0x1c, 0x3b, 0x18, 0x14, 0xd8, 0x62, 0xc1, + 0x93, 0x84, 0xf3, 0x5b, 0xdd, 0x9e, 0x1f, 0x3b, + 0x0b, 0xbc, 0x4e, 0x5b, 0x79, 0xa3, 0xca, 0x74, + 0x2a, 0x98, 0xe8, 0x04, 0x39, 0xef, 0xc6, 0x76, + 0x6d, 0xee, 0x9f, 0x67, 0x5b, 0x59, 0x3a, 0xe5, + 0xf2, 0x3b, 0xca, 0x89, 0xe8, 0x9b, 0x03, 0x3d, + 0x11, 0xd2, 0x4a, 0x70, 0xaf, 0x88, 0xb0, 0x94, + 0x96, 0x26, 0xab, 0x3c, 0xc1, 0xb8, 0xe4, 0xe7, + 0x14, 0x61, 0x64, 0x3a, 0x61, 0x08, 0x0f, 0xa9, + 0xce, 0x64, 0xb2, 0x40, 0xf8, 0x20, 0x3a, 0xa9, + 0x31, 0xbd, 0x7e, 0x16, 0xca, 0xf5, 0x62, 0x0f, + 0x91, 0x9f, 0x8e, 0x1d, 0xa4, 0x77, 0xf3, 0x87, + 0x61, 0xe8, 0x14, 0xde, 0x18, 0x68, 0x4e, 0x9d, + 0x73, 0xcd, 0x8a, 0xe4, 0x80, 0x84, 0x23, 0xaa, + 0x9d, 0x64, 0x1c, 0x80, 0x41, 0xca, 0x82, 0x40, + 0x94, 0x55, 0xe3, 0x28, 0xa1, 0x97, 0x71, 0xba, + 0xf2, 0x2c, 0x39, 0x62, 0x29, 0x56, 0xd0, 0xff, + 0xb2, 0x82, 0x20, 0x59, 0x1f, 0xc3, 0x64, 0x57 +}; +static const u8 key69[] __initconst = { + 0x19, 0x09, 0xe9, 0x7c, 0xd9, 0x02, 0x4a, 0x0c, + 0x52, 0x25, 0xad, 0x5c, 0x2e, 0x8d, 0x86, 0x10, + 0x85, 0x2b, 0xba, 0xa4, 0x44, 0x5b, 0x39, 0x3e, + 0x18, 0xaa, 0xce, 0x0e, 0xe2, 0x69, 0x3c, 0xcf +}; +enum { nonce69 = 0xdb925a1948f0f060ULL }; + +static const u8 input70[] __initconst = { + 0x10, 0xe7, 
0x83, 0xcf, 0x42, 0x9f, 0xf2, 0x41, + 0xc7, 0xe4, 0xdb, 0xf9, 0xa3, 0x02, 0x1d, 0x8d, + 0x50, 0x81, 0x2c, 0x6b, 0x92, 0xe0, 0x4e, 0xea, + 0x26, 0x83, 0x2a, 0xd0, 0x31, 0xf1, 0x23, 0xf3, + 0x0e, 0x88, 0x14, 0x31, 0xf9, 0x01, 0x63, 0x59, + 0x21, 0xd1, 0x8b, 0xdd, 0x06, 0xd0, 0xc6, 0xab, + 0x91, 0x71, 0x82, 0x4d, 0xd4, 0x62, 0x37, 0x17, + 0xf9, 0x50, 0xf9, 0xb5, 0x74, 0xce, 0x39, 0x80, + 0x80, 0x78, 0xf8, 0xdc, 0x1c, 0xdb, 0x7c, 0x3d, + 0xd4, 0x86, 0x31, 0x00, 0x75, 0x7b, 0xd1, 0x42, + 0x9f, 0x1b, 0x97, 0x88, 0x0e, 0x14, 0x0e, 0x1e, + 0x7d, 0x7b, 0xc4, 0xd2, 0xf3, 0xc1, 0x6d, 0x17, + 0x5d, 0xc4, 0x75, 0x54, 0x0f, 0x38, 0x65, 0x89, + 0xd8, 0x7d, 0xab, 0xc9, 0xa7, 0x0a, 0x21, 0x0b, + 0x37, 0x12, 0x05, 0x07, 0xb5, 0x68, 0x32, 0x32, + 0xb9, 0xf8, 0x97, 0x17, 0x03, 0xed, 0x51, 0x8f, + 0x3d, 0x5a, 0xd0, 0x12, 0x01, 0x6e, 0x2e, 0x91, + 0x1c, 0xbe, 0x6b, 0xa3, 0xcc, 0x75, 0x62, 0x06, + 0x8e, 0x65, 0xbb, 0xe2, 0x29, 0x71, 0x4b, 0x89, + 0x6a, 0x9d, 0x85, 0x8c, 0x8c, 0xdf, 0x94, 0x95, + 0x23, 0x66, 0xf8, 0x92, 0xee, 0x56, 0xeb, 0xb3, + 0xeb, 0xd2, 0x4a, 0x3b, 0x77, 0x8a, 0x6e, 0xf6, + 0xca, 0xd2, 0x34, 0x00, 0xde, 0xbe, 0x1d, 0x7a, + 0x73, 0xef, 0x2b, 0x80, 0x56, 0x16, 0x29, 0xbf, + 0x6e, 0x33, 0xed, 0x0d, 0xe2, 0x02, 0x60, 0x74, + 0xe9, 0x0a, 0xbc, 0xd1, 0xc5, 0xe8, 0x53, 0x02, + 0x79, 0x0f, 0x25, 0x0c, 0xef, 0xab, 0xd3, 0xbc, + 0xb7, 0xfc, 0xf3, 0xb0, 0x34, 0xd1, 0x07, 0xd2, + 0x5a, 0x31, 0x1f, 0xec, 0x1f, 0x87, 0xed, 0xdd, + 0x6a, 0xc1, 0xe8, 0xb3, 0x25, 0x4c, 0xc6, 0x9b, + 0x91, 0x73, 0xec, 0x06, 0x73, 0x9e, 0x57, 0x65, + 0x32, 0x75, 0x11, 0x74, 0x6e, 0xa4, 0x7d, 0x0d, + 0x74, 0x9f, 0x51, 0x10, 0x10, 0x47, 0xc9, 0x71, + 0x6e, 0x97, 0xae, 0x44, 0x41, 0xef, 0x98, 0x78, + 0xf4, 0xc5, 0xbd, 0x5e, 0x00, 0xe5, 0xfd, 0xe2, + 0xbe, 0x8c, 0xc2, 0xae, 0xc2, 0xee, 0x59, 0xf6, + 0xcb, 0x20, 0x54, 0x84, 0xc3, 0x31, 0x7e, 0x67, + 0x71, 0xb6, 0x76, 0xbe, 0x81, 0x8f, 0x82, 0xad, + 0x01, 0x8f, 0xc4, 0x00, 0x04, 0x3d, 0x8d, 0x34, + 0xaa, 0xea, 0xc0, 0xea, 0x91, 0x42, 0xb6, 0xb8, + 0x43, 0xf3, 0x17, 0xb2, 0x73, 0x64, 0x82, 0x97, + 0xd5, 0xc9, 0x07, 0x77, 0xb1, 0x26, 0xe2, 0x00, + 0x6a, 0xae, 0x70, 0x0b, 0xbe, 0xe6, 0xb8, 0x42, + 0x81, 0x55, 0xf7, 0xb8, 0x96, 0x41, 0x9d, 0xd4, + 0x2c, 0x27, 0x00, 0xcc, 0x91, 0x28, 0x22, 0xa4, + 0x7b, 0x42, 0x51, 0x9e, 0xd6, 0xec, 0xf3, 0x6b, + 0x00, 0xff, 0x5c, 0xa2, 0xac, 0x47, 0x33, 0x2d, + 0xf8, 0x11, 0x65, 0x5f, 0x4d, 0x79, 0x8b, 0x4f, + 0xad, 0xf0, 0x9d, 0xcd, 0xb9, 0x7b, 0x08, 0xf7, + 0x32, 0x51, 0xfa, 0x39, 0xaa, 0x78, 0x05, 0xb1, + 0xf3, 0x5d, 0xe8, 0x7c, 0x8e, 0x4f, 0xa2, 0xe0, + 0x98, 0x0c, 0xb2, 0xa7, 0xf0, 0x35, 0x8e, 0x70, + 0x7c, 0x82, 0xf3, 0x1b, 0x26, 0x28, 0x12, 0xe5, + 0x23, 0x57, 0xe4, 0xb4, 0x9b, 0x00, 0x39, 0x97, + 0xef, 0x7c, 0x46, 0x9b, 0x34, 0x6b, 0xe7, 0x0e, + 0xa3, 0x2a, 0x18, 0x11, 0x64, 0xc6, 0x7c, 0x8b, + 0x06, 0x02, 0xf5, 0x69, 0x76, 0xf9, 0xaa, 0x09, + 0x5f, 0x68, 0xf8, 0x4a, 0x79, 0x58, 0xec, 0x37, + 0xcf, 0x3a, 0xcc, 0x97, 0x70, 0x1d, 0x3e, 0x52, + 0x18, 0x0a, 0xad, 0x28, 0x5b, 0x3b, 0xe9, 0x03, + 0x84, 0xe9, 0x68, 0x50, 0xce, 0xc4, 0xbc, 0x3e, + 0x21, 0xad, 0x63, 0xfe, 0xc6, 0xfd, 0x6e, 0x69, + 0x84, 0xa9, 0x30, 0xb1, 0x7a, 0xc4, 0x31, 0x10, + 0xc1, 0x1f, 0x6e, 0xeb, 0xa5, 0xa6, 0x01 +}; +static const u8 output70[] __initconst = { + 0x0f, 0x93, 0x2a, 0x20, 0xb3, 0x87, 0x2d, 0xce, + 0xd1, 0x3b, 0x30, 0xfd, 0x06, 0x6d, 0x0a, 0xaa, + 0x3e, 0xc4, 0x29, 0x02, 0x8a, 0xde, 0xa6, 0x4b, + 0x45, 0x1b, 0x4f, 0x25, 0x59, 0xd5, 0x56, 0x6a, + 0x3b, 0x37, 0xbd, 0x3e, 0x47, 0x12, 0x2c, 0x4e, + 0x60, 0x5f, 0x05, 0x75, 0x61, 0x23, 0x05, 0x74, + 0xcb, 0xfc, 0x5a, 0xb3, 
0xac, 0x5c, 0x3d, 0xab, + 0x52, 0x5f, 0x05, 0xbc, 0x57, 0xc0, 0x7e, 0xcf, + 0x34, 0x5d, 0x7f, 0x41, 0xa3, 0x17, 0x78, 0xd5, + 0x9f, 0xec, 0x0f, 0x1e, 0xf9, 0xfe, 0xa3, 0xbd, + 0x28, 0xb0, 0xba, 0x4d, 0x84, 0xdb, 0xae, 0x8f, + 0x1d, 0x98, 0xb7, 0xdc, 0xf9, 0xad, 0x55, 0x9c, + 0x89, 0xfe, 0x9b, 0x9c, 0xa9, 0x89, 0xf6, 0x97, + 0x9c, 0x3f, 0x09, 0x3e, 0xc6, 0x02, 0xc2, 0x55, + 0x58, 0x09, 0x54, 0x66, 0xe4, 0x36, 0x81, 0x35, + 0xca, 0x88, 0x17, 0x89, 0x80, 0x24, 0x2b, 0x21, + 0x89, 0xee, 0x45, 0x5a, 0xe7, 0x1f, 0xd5, 0xa5, + 0x16, 0xa4, 0xda, 0x70, 0x7e, 0xe9, 0x4f, 0x24, + 0x61, 0x97, 0xab, 0xa0, 0xe0, 0xe7, 0xb8, 0x5c, + 0x0f, 0x25, 0x17, 0x37, 0x75, 0x12, 0xb5, 0x40, + 0xde, 0x1c, 0x0d, 0x8a, 0x77, 0x62, 0x3c, 0x86, + 0xd9, 0x70, 0x2e, 0x96, 0x30, 0xd2, 0x55, 0xb3, + 0x6b, 0xc3, 0xf2, 0x9c, 0x47, 0xf3, 0x3a, 0x24, + 0x52, 0xc6, 0x38, 0xd8, 0x22, 0xb3, 0x0c, 0xfd, + 0x2f, 0xa3, 0x3c, 0xb5, 0xe8, 0x26, 0xe1, 0xa3, + 0xad, 0xb0, 0x82, 0x17, 0xc1, 0x53, 0xb8, 0x34, + 0x48, 0xee, 0x39, 0xae, 0x51, 0x43, 0xec, 0x82, + 0xce, 0x87, 0xc6, 0x76, 0xb9, 0x76, 0xd3, 0x53, + 0xfe, 0x49, 0x24, 0x7d, 0x02, 0x42, 0x2b, 0x72, + 0xfb, 0xcb, 0xd8, 0x96, 0x02, 0xc6, 0x9a, 0x20, + 0xf3, 0x5a, 0x67, 0xe8, 0x13, 0xf8, 0xb2, 0xcb, + 0xa2, 0xec, 0x18, 0x20, 0x4a, 0xb0, 0x73, 0x53, + 0x21, 0xb0, 0x77, 0x53, 0xd8, 0x76, 0xa1, 0x30, + 0x17, 0x72, 0x2e, 0x33, 0x5f, 0x33, 0x6b, 0x28, + 0xfb, 0xb0, 0xf4, 0xec, 0x8e, 0xed, 0x20, 0x7d, + 0x57, 0x8c, 0x74, 0x28, 0x64, 0x8b, 0xeb, 0x59, + 0x38, 0x3f, 0xe7, 0x83, 0x2e, 0xe5, 0x64, 0x4d, + 0x5c, 0x1f, 0xe1, 0x3b, 0xd9, 0x84, 0xdb, 0xc9, + 0xec, 0xd8, 0xc1, 0x7c, 0x1f, 0x1b, 0x68, 0x35, + 0xc6, 0x34, 0x10, 0xef, 0x19, 0xc9, 0x0a, 0xd6, + 0x43, 0x7f, 0xa6, 0xcb, 0x9d, 0xf4, 0xf0, 0x16, + 0xb1, 0xb1, 0x96, 0x64, 0xec, 0x8d, 0x22, 0x4c, + 0x4b, 0xe8, 0x1a, 0xba, 0x6f, 0xb7, 0xfc, 0xa5, + 0x69, 0x3e, 0xad, 0x78, 0x79, 0x19, 0xb5, 0x04, + 0x69, 0xe5, 0x3f, 0xff, 0x60, 0x8c, 0xda, 0x0b, + 0x7b, 0xf7, 0xe7, 0xe6, 0x29, 0x3a, 0x85, 0xba, + 0xb5, 0xb0, 0x35, 0xbd, 0x38, 0xce, 0x34, 0x5e, + 0xf2, 0xdc, 0xd1, 0x8f, 0xc3, 0x03, 0x24, 0xa2, + 0x03, 0xf7, 0x4e, 0x49, 0x5b, 0xcf, 0x6d, 0xb0, + 0xeb, 0xe3, 0x30, 0x28, 0xd5, 0x5b, 0x82, 0x5f, + 0xe4, 0x7c, 0x1e, 0xec, 0xd2, 0x39, 0xf9, 0x6f, + 0x2e, 0xb3, 0xcd, 0x01, 0xb1, 0x67, 0xaa, 0xea, + 0xaa, 0xb3, 0x63, 0xaf, 0xd9, 0xb2, 0x1f, 0xba, + 0x05, 0x20, 0xeb, 0x19, 0x32, 0xf0, 0x6c, 0x3f, + 0x40, 0xcc, 0x93, 0xb3, 0xd8, 0x25, 0xa6, 0xe4, + 0xce, 0xd7, 0x7e, 0x48, 0x99, 0x65, 0x7f, 0x86, + 0xc5, 0xd4, 0x79, 0x6b, 0xab, 0x43, 0xb8, 0x6b, + 0xf1, 0x2f, 0xea, 0x4c, 0x5e, 0xf0, 0x3b, 0xb4, + 0xb8, 0xb0, 0x94, 0x0c, 0x6b, 0xe7, 0x22, 0x93, + 0xaa, 0x01, 0xcb, 0xf1, 0x11, 0x60, 0xf6, 0x69, + 0xcf, 0x14, 0xde, 0xfb, 0x90, 0x05, 0x27, 0x0c, + 0x1a, 0x9e, 0xf0, 0xb4, 0xc6, 0xa1, 0xe8, 0xdd, + 0xd0, 0x4c, 0x25, 0x4f, 0x9c, 0xb7, 0xb1, 0xb0, + 0x21, 0xdb, 0x87, 0x09, 0x03, 0xf2, 0xb3 +}; +static const u8 key70[] __initconst = { + 0x3b, 0x5b, 0x59, 0x36, 0x44, 0xd1, 0xba, 0x71, + 0x55, 0x87, 0x4d, 0x62, 0x3d, 0xc2, 0xfc, 0xaa, + 0x3f, 0x4e, 0x1a, 0xe4, 0xca, 0x09, 0xfc, 0x6a, + 0xb2, 0xd6, 0x5d, 0x79, 0xf9, 0x1a, 0x91, 0xa7 +}; +enum { nonce70 = 0x3fd6786dd147a85ULL }; + +static const u8 input71[] __initconst = { + 0x18, 0x78, 0xd6, 0x79, 0xe4, 0x9a, 0x6c, 0x73, + 0x17, 0xd4, 0x05, 0x0f, 0x1e, 0x9f, 0xd9, 0x2b, + 0x86, 0x48, 0x7d, 0xf4, 0xd9, 0x1c, 0x76, 0xfc, + 0x8e, 0x22, 0x34, 0xe1, 0x48, 0x4a, 0x8d, 0x79, + 0xb7, 0xbb, 0x88, 0xab, 0x90, 0xde, 0xc5, 0xb4, + 0xb4, 0xe7, 0x85, 0x49, 0xda, 0x57, 0xeb, 0xc9, + 0xcd, 0x21, 0xfc, 0x45, 0x6e, 0x32, 0x67, 0xf2, + 
0x4f, 0xa6, 0x54, 0xe5, 0x20, 0xed, 0xcf, 0xc6, + 0x62, 0x25, 0x8e, 0x00, 0xf8, 0x6b, 0xa2, 0x80, + 0xac, 0x88, 0xa6, 0x59, 0x27, 0x83, 0x95, 0x11, + 0x3f, 0x70, 0x5e, 0x3f, 0x11, 0xfb, 0x26, 0xbf, + 0xe1, 0x48, 0x75, 0xf9, 0x86, 0xbf, 0xa6, 0x5d, + 0x15, 0x61, 0x66, 0xbf, 0x78, 0x8f, 0x6b, 0x9b, + 0xda, 0x98, 0xb7, 0x19, 0xe2, 0xf2, 0xa3, 0x9c, + 0x7c, 0x6a, 0x9a, 0xd8, 0x3d, 0x4c, 0x2c, 0xe1, + 0x09, 0xb4, 0x28, 0x82, 0x4e, 0xab, 0x0c, 0x75, + 0x63, 0xeb, 0xbc, 0xd0, 0x71, 0xa2, 0x73, 0x85, + 0xed, 0x53, 0x7a, 0x3f, 0x68, 0x9f, 0xd0, 0xa9, + 0x00, 0x5a, 0x9e, 0x80, 0x55, 0x00, 0xe6, 0xae, + 0x0c, 0x03, 0x40, 0xed, 0xfc, 0x68, 0x4a, 0xb7, + 0x1e, 0x09, 0x65, 0x30, 0x5a, 0x3d, 0x97, 0x4d, + 0x5e, 0x51, 0x8e, 0xda, 0xc3, 0x55, 0x8c, 0xfb, + 0xcf, 0x83, 0x05, 0x35, 0x0d, 0x08, 0x1b, 0xf3, + 0x3a, 0x57, 0x96, 0xac, 0x58, 0x8b, 0xfa, 0x00, + 0x49, 0x15, 0x78, 0xd2, 0x4b, 0xed, 0xb8, 0x59, + 0x78, 0x9b, 0x7f, 0xaa, 0xfc, 0xe7, 0x46, 0xdc, + 0x7b, 0x34, 0xd0, 0x34, 0xe5, 0x10, 0xff, 0x4d, + 0x5a, 0x4d, 0x60, 0xa7, 0x16, 0x54, 0xc4, 0xfd, + 0xca, 0x5d, 0x68, 0xc7, 0x4a, 0x01, 0x8d, 0x7f, + 0x74, 0x5d, 0xff, 0xb8, 0x37, 0x15, 0x62, 0xfa, + 0x44, 0x45, 0xcf, 0x77, 0x3b, 0x1d, 0xb2, 0xd2, + 0x0d, 0x42, 0x00, 0x39, 0x68, 0x1f, 0xcc, 0x89, + 0x73, 0x5d, 0xa9, 0x2e, 0xfd, 0x58, 0x62, 0xca, + 0x35, 0x8e, 0x70, 0x70, 0xaa, 0x6e, 0x14, 0xe9, + 0xa4, 0xe2, 0x10, 0x66, 0x71, 0xdc, 0x4c, 0xfc, + 0xa9, 0xdc, 0x8f, 0x57, 0x4d, 0xc5, 0xac, 0xd7, + 0xa9, 0xf3, 0xf3, 0xa1, 0xff, 0x62, 0xa0, 0x8f, + 0xe4, 0x96, 0x3e, 0xcb, 0x9f, 0x76, 0x42, 0x39, + 0x1f, 0x24, 0xfd, 0xfd, 0x79, 0xe8, 0x27, 0xdf, + 0xa8, 0xf6, 0x33, 0x8b, 0x31, 0x59, 0x69, 0xcf, + 0x6a, 0xef, 0x89, 0x4d, 0xa7, 0xf6, 0x7e, 0x97, + 0x14, 0xbd, 0xda, 0xdd, 0xb4, 0x84, 0x04, 0x24, + 0xe0, 0x17, 0xe1, 0x0f, 0x1f, 0x8a, 0x6a, 0x71, + 0x74, 0x41, 0xdc, 0x59, 0x5c, 0x8f, 0x01, 0x25, + 0x92, 0xf0, 0x2e, 0x15, 0x62, 0x71, 0x9a, 0x9f, + 0x87, 0xdf, 0x62, 0x49, 0x7f, 0x86, 0x62, 0xfc, + 0x20, 0x84, 0xd7, 0xe3, 0x3a, 0xd9, 0x37, 0x85, + 0xb7, 0x84, 0x5a, 0xf9, 0xed, 0x21, 0x32, 0x94, + 0x3e, 0x04, 0xe7, 0x8c, 0x46, 0x76, 0x21, 0x67, + 0xf6, 0x95, 0x64, 0x92, 0xb7, 0x15, 0xf6, 0xe3, + 0x41, 0x27, 0x9d, 0xd7, 0xe3, 0x79, 0x75, 0x92, + 0xd0, 0xc1, 0xf3, 0x40, 0x92, 0x08, 0xde, 0x90, + 0x22, 0x82, 0xb2, 0x69, 0xae, 0x1a, 0x35, 0x11, + 0x89, 0xc8, 0x06, 0x82, 0x95, 0x23, 0x44, 0x08, + 0x22, 0xf2, 0x71, 0x73, 0x1b, 0x88, 0x11, 0xcf, + 0x1c, 0x7e, 0x8a, 0x2e, 0xdc, 0x79, 0x57, 0xce, + 0x1f, 0xe7, 0x6c, 0x07, 0xd8, 0x06, 0xbe, 0xec, + 0xa3, 0xcf, 0xf9, 0x68, 0xa5, 0xb8, 0xf0, 0xe3, + 0x3f, 0x01, 0x92, 0xda, 0xf1, 0xa0, 0x2d, 0x7b, + 0xab, 0x57, 0x58, 0x2a, 0xaf, 0xab, 0xbd, 0xf2, + 0xe5, 0xaf, 0x7e, 0x1f, 0x46, 0x24, 0x9e, 0x20, + 0x22, 0x0f, 0x84, 0x4c, 0xb7, 0xd8, 0x03, 0xe8, + 0x09, 0x73, 0x6c, 0xc6, 0x9b, 0x90, 0xe0, 0xdb, + 0xf2, 0x71, 0xba, 0xad, 0xb3, 0xec, 0xda, 0x7a +}; +static const u8 output71[] __initconst = { + 0x28, 0xc5, 0x9b, 0x92, 0xf9, 0x21, 0x4f, 0xbb, + 0xef, 0x3b, 0xf0, 0xf5, 0x3a, 0x6d, 0x7f, 0xd6, + 0x6a, 0x8d, 0xa1, 0x01, 0x5c, 0x62, 0x20, 0x8b, + 0x5b, 0x39, 0xd5, 0xd3, 0xc2, 0xf6, 0x9d, 0x5e, + 0xcc, 0xe1, 0xa2, 0x61, 0x16, 0xe2, 0xce, 0xe9, + 0x86, 0xd0, 0xfc, 0xce, 0x9a, 0x28, 0x27, 0xc4, + 0x0c, 0xb9, 0xaa, 0x8d, 0x48, 0xdb, 0xbf, 0x82, + 0x7d, 0xd0, 0x35, 0xc4, 0x06, 0x34, 0xb4, 0x19, + 0x51, 0x73, 0xf4, 0x7a, 0xf4, 0xfd, 0xe9, 0x1d, + 0xdc, 0x0f, 0x7e, 0xf7, 0x96, 0x03, 0xe3, 0xb1, + 0x2e, 0x22, 0x59, 0xb7, 0x6d, 0x1c, 0x97, 0x8c, + 0xd7, 0x31, 0x08, 0x26, 0x4c, 0x6d, 0xc6, 0x14, + 0xa5, 0xeb, 0x45, 0x6a, 0x88, 0xa3, 0xa2, 0x36, + 0xc4, 
0x35, 0xb1, 0x5a, 0xa0, 0xad, 0xf7, 0x06, + 0x9b, 0x5d, 0xc1, 0x15, 0xc1, 0xce, 0x0a, 0xb0, + 0x57, 0x2e, 0x3f, 0x6f, 0x0d, 0x10, 0xd9, 0x11, + 0x2c, 0x9c, 0xad, 0x2d, 0xa5, 0x81, 0xfb, 0x4e, + 0x8f, 0xd5, 0x32, 0x4e, 0xaf, 0x5c, 0xc1, 0x86, + 0xde, 0x56, 0x5a, 0x33, 0x29, 0xf7, 0x67, 0xc6, + 0x37, 0x6f, 0xb2, 0x37, 0x4e, 0xd4, 0x69, 0x79, + 0xaf, 0xd5, 0x17, 0x79, 0xe0, 0xba, 0x62, 0xa3, + 0x68, 0xa4, 0x87, 0x93, 0x8d, 0x7e, 0x8f, 0xa3, + 0x9c, 0xef, 0xda, 0xe3, 0xa5, 0x1f, 0xcd, 0x30, + 0xa6, 0x55, 0xac, 0x4c, 0x69, 0x74, 0x02, 0xc7, + 0x5d, 0x95, 0x81, 0x4a, 0x68, 0x11, 0xd3, 0xa9, + 0x98, 0xb1, 0x0b, 0x0d, 0xae, 0x40, 0x86, 0x65, + 0xbf, 0xcc, 0x2d, 0xef, 0x57, 0xca, 0x1f, 0xe4, + 0x34, 0x4e, 0xa6, 0x5e, 0x82, 0x6e, 0x61, 0xad, + 0x0b, 0x3c, 0xf8, 0xeb, 0x01, 0x43, 0x7f, 0x87, + 0xa2, 0xa7, 0x6a, 0xe9, 0x62, 0x23, 0x24, 0x61, + 0xf1, 0xf7, 0x36, 0xdb, 0x10, 0xe5, 0x57, 0x72, + 0x3a, 0xc2, 0xae, 0xcc, 0x75, 0xc7, 0x80, 0x05, + 0x0a, 0x5c, 0x4c, 0x95, 0xda, 0x02, 0x01, 0x14, + 0x06, 0x6b, 0x5c, 0x65, 0xc2, 0xb8, 0x4a, 0xd6, + 0xd3, 0xb4, 0xd8, 0x12, 0x52, 0xb5, 0x60, 0xd3, + 0x8e, 0x5f, 0x5c, 0x76, 0x33, 0x7a, 0x05, 0xe5, + 0xcb, 0xef, 0x4f, 0x89, 0xf1, 0xba, 0x32, 0x6f, + 0x33, 0xcd, 0x15, 0x8d, 0xa3, 0x0c, 0x3f, 0x63, + 0x11, 0xe7, 0x0e, 0xe0, 0x00, 0x01, 0xe9, 0xe8, + 0x8e, 0x36, 0x34, 0x8d, 0x96, 0xb5, 0x03, 0xcf, + 0x55, 0x62, 0x49, 0x7a, 0x34, 0x44, 0xa5, 0xee, + 0x8c, 0x46, 0x06, 0x22, 0xab, 0x1d, 0x53, 0x9c, + 0xa1, 0xf9, 0x67, 0x18, 0x57, 0x89, 0xf9, 0xc2, + 0xd1, 0x7e, 0xbe, 0x36, 0x40, 0xcb, 0xe9, 0x04, + 0xde, 0xb1, 0x3b, 0x29, 0x52, 0xc5, 0x9a, 0xb5, + 0xa2, 0x7c, 0x7b, 0xfe, 0xe5, 0x92, 0x73, 0xea, + 0xea, 0x7b, 0xba, 0x0a, 0x8c, 0x88, 0x15, 0xe6, + 0x53, 0xbf, 0x1c, 0x33, 0xf4, 0x9b, 0x9a, 0x5e, + 0x8d, 0xae, 0x60, 0xdc, 0xcb, 0x5d, 0xfa, 0xbe, + 0x06, 0xc3, 0x3f, 0x06, 0xe7, 0x00, 0x40, 0x7b, + 0xaa, 0x94, 0xfa, 0x6d, 0x1f, 0xe4, 0xc5, 0xa9, + 0x1b, 0x5f, 0x36, 0xea, 0x5a, 0xdd, 0xa5, 0x48, + 0x6a, 0x55, 0xd2, 0x47, 0x28, 0xbf, 0x96, 0xf1, + 0x9f, 0xb6, 0x11, 0x4b, 0xd3, 0x44, 0x7d, 0x48, + 0x41, 0x61, 0xdb, 0x12, 0xd4, 0xc2, 0x59, 0x82, + 0x4c, 0x47, 0x5c, 0x04, 0xf6, 0x7b, 0xd3, 0x92, + 0x2e, 0xe8, 0x40, 0xef, 0x15, 0x32, 0x97, 0xdc, + 0x35, 0x4c, 0x6e, 0xa4, 0x97, 0xe9, 0x24, 0xde, + 0x63, 0x8b, 0xb1, 0x6b, 0x48, 0xbb, 0x46, 0x1f, + 0x84, 0xd6, 0x17, 0xb0, 0x5a, 0x4a, 0x4e, 0xd5, + 0x31, 0xd7, 0xcf, 0xa0, 0x39, 0xc6, 0x2e, 0xfc, + 0xa6, 0xa3, 0xd3, 0x0f, 0xa4, 0x28, 0xac, 0xb2, + 0xf4, 0x48, 0x8d, 0x50, 0xa5, 0x1c, 0x44, 0x5d, + 0x6e, 0x38, 0xb7, 0x2b, 0x8a, 0x45, 0xa7, 0x3d +}; +static const u8 key71[] __initconst = { + 0x8b, 0x68, 0xc4, 0xb7, 0x0d, 0x81, 0xef, 0x52, + 0x1e, 0x05, 0x96, 0x72, 0x62, 0x89, 0x27, 0x83, + 0xd0, 0xc7, 0x33, 0x6d, 0xf2, 0xcc, 0x69, 0xf9, + 0x23, 0xae, 0x99, 0xb1, 0xd1, 0x05, 0x4e, 0x54 +}; +enum { nonce71 = 0x983f03656d64b5f6ULL }; + +static const u8 input72[] __initconst = { + 0x6b, 0x09, 0xc9, 0x57, 0x3d, 0x79, 0x04, 0x8c, + 0x65, 0xad, 0x4a, 0x0f, 0xa1, 0x31, 0x3a, 0xdd, + 0x14, 0x8e, 0xe8, 0xfe, 0xbf, 0x42, 0x87, 0x98, + 0x2e, 0x8d, 0x83, 0xa3, 0xf8, 0x55, 0x3d, 0x84, + 0x1e, 0x0e, 0x05, 0x4a, 0x38, 0x9e, 0xe7, 0xfe, + 0xd0, 0x4d, 0x79, 0x74, 0x3a, 0x0b, 0x9b, 0xe1, + 0xfd, 0x51, 0x84, 0x4e, 0xb2, 0x25, 0xe4, 0x64, + 0x4c, 0xda, 0xcf, 0x46, 0xec, 0xba, 0x12, 0xeb, + 0x5a, 0x33, 0x09, 0x6e, 0x78, 0x77, 0x8f, 0x30, + 0xb1, 0x7d, 0x3f, 0x60, 0x8c, 0xf2, 0x1d, 0x8e, + 0xb4, 0x70, 0xa2, 0x90, 0x7c, 0x79, 0x1a, 0x2c, + 0xf6, 0x28, 0x79, 0x7c, 0x53, 0xc5, 0xfa, 0xcc, + 0x65, 0x9b, 0xe1, 0x51, 0xd1, 0x7f, 0x1d, 0xc4, + 0xdb, 0xd4, 0xd9, 0x04, 
0x61, 0x7d, 0xbe, 0x12, + 0xfc, 0xcd, 0xaf, 0xe4, 0x0f, 0x9c, 0x20, 0xb5, + 0x22, 0x40, 0x18, 0xda, 0xe4, 0xda, 0x8c, 0x2d, + 0x84, 0xe3, 0x5f, 0x53, 0x17, 0xed, 0x78, 0xdc, + 0x2f, 0xe8, 0x31, 0xc7, 0xe6, 0x39, 0x71, 0x40, + 0xb4, 0x0f, 0xc9, 0xa9, 0x7e, 0x78, 0x87, 0xc1, + 0x05, 0x78, 0xbb, 0x01, 0xf2, 0x8f, 0x33, 0xb0, + 0x6e, 0x84, 0xcd, 0x36, 0x33, 0x5c, 0x5b, 0x8e, + 0xf1, 0xac, 0x30, 0xfe, 0x33, 0xec, 0x08, 0xf3, + 0x7e, 0xf2, 0xf0, 0x4c, 0xf2, 0xad, 0xd8, 0xc1, + 0xd4, 0x4e, 0x87, 0x06, 0xd4, 0x75, 0xe7, 0xe3, + 0x09, 0xd3, 0x4d, 0xe3, 0x21, 0x32, 0xba, 0xb4, + 0x68, 0x68, 0xcb, 0x4c, 0xa3, 0x1e, 0xb3, 0x87, + 0x7b, 0xd3, 0x0c, 0x63, 0x37, 0x71, 0x79, 0xfb, + 0x58, 0x36, 0x57, 0x0f, 0x34, 0x1d, 0xc1, 0x42, + 0x02, 0x17, 0xe7, 0xed, 0xe8, 0xe7, 0x76, 0xcb, + 0x42, 0xc4, 0x4b, 0xe2, 0xb2, 0x5e, 0x42, 0xd5, + 0xec, 0x9d, 0xc1, 0x32, 0x71, 0xe4, 0xeb, 0x10, + 0x68, 0x1a, 0x6e, 0x99, 0x8e, 0x73, 0x12, 0x1f, + 0x97, 0x0c, 0x9e, 0xcd, 0x02, 0x3e, 0x4c, 0xa0, + 0xf2, 0x8d, 0xe5, 0x44, 0xca, 0x6d, 0xfe, 0x07, + 0xe3, 0xe8, 0x9b, 0x76, 0xc1, 0x6d, 0xb7, 0x6e, + 0x0d, 0x14, 0x00, 0x6f, 0x8a, 0xfd, 0x43, 0xc6, + 0x43, 0xa5, 0x9c, 0x02, 0x47, 0x10, 0xd4, 0xb4, + 0x9b, 0x55, 0x67, 0xc8, 0x7f, 0xc1, 0x8a, 0x1f, + 0x1e, 0xd1, 0xbc, 0x99, 0x5d, 0x50, 0x4f, 0x89, + 0xf1, 0xe6, 0x5d, 0x91, 0x40, 0xdc, 0x20, 0x67, + 0x56, 0xc2, 0xef, 0xbd, 0x2c, 0xa2, 0x99, 0x38, + 0xe0, 0x45, 0xec, 0x44, 0x05, 0x52, 0x65, 0x11, + 0xfc, 0x3b, 0x19, 0xcb, 0x71, 0xc2, 0x8e, 0x0e, + 0x03, 0x2a, 0x03, 0x3b, 0x63, 0x06, 0x31, 0x9a, + 0xac, 0x53, 0x04, 0x14, 0xd4, 0x80, 0x9d, 0x6b, + 0x42, 0x7e, 0x7e, 0x4e, 0xdc, 0xc7, 0x01, 0x49, + 0x9f, 0xf5, 0x19, 0x86, 0x13, 0x28, 0x2b, 0xa6, + 0xa6, 0xbe, 0xa1, 0x7e, 0x71, 0x05, 0x00, 0xff, + 0x59, 0x2d, 0xb6, 0x63, 0xf0, 0x1e, 0x2e, 0x69, + 0x9b, 0x85, 0xf1, 0x1e, 0x8a, 0x64, 0x39, 0xab, + 0x00, 0x12, 0xe4, 0x33, 0x4b, 0xb5, 0xd8, 0xb3, + 0x6b, 0x5b, 0x8b, 0x5c, 0xd7, 0x6f, 0x23, 0xcf, + 0x3f, 0x2e, 0x5e, 0x47, 0xb9, 0xb8, 0x1f, 0xf0, + 0x1d, 0xda, 0xe7, 0x4f, 0x6e, 0xab, 0xc3, 0x36, + 0xb4, 0x74, 0x6b, 0xeb, 0xc7, 0x5d, 0x91, 0xe5, + 0xda, 0xf2, 0xc2, 0x11, 0x17, 0x48, 0xf8, 0x9c, + 0xc9, 0x8b, 0xc1, 0xa2, 0xf4, 0xcd, 0x16, 0xf8, + 0x27, 0xd9, 0x6c, 0x6f, 0xb5, 0x8f, 0x77, 0xca, + 0x1b, 0xd8, 0xef, 0x84, 0x68, 0x71, 0x53, 0xc1, + 0x43, 0x0f, 0x9f, 0x98, 0xae, 0x7e, 0x31, 0xd2, + 0x98, 0xfb, 0x20, 0xa2, 0xad, 0x00, 0x10, 0x83, + 0x00, 0x8b, 0xeb, 0x56, 0xd2, 0xc4, 0xcc, 0x7f, + 0x2f, 0x4e, 0xfa, 0x88, 0x13, 0xa4, 0x2c, 0xde, + 0x6b, 0x77, 0x86, 0x10, 0x6a, 0xab, 0x43, 0x0a, + 0x02 +}; +static const u8 output72[] __initconst = { + 0x42, 0x89, 0xa4, 0x80, 0xd2, 0xcb, 0x5f, 0x7f, + 0x2a, 0x1a, 0x23, 0x00, 0xa5, 0x6a, 0x95, 0xa3, + 0x9a, 0x41, 0xa1, 0xd0, 0x2d, 0x1e, 0xd6, 0x13, + 0x34, 0x40, 0x4e, 0x7f, 0x1a, 0xbe, 0xa0, 0x3d, + 0x33, 0x9c, 0x56, 0x2e, 0x89, 0x25, 0x45, 0xf9, + 0xf0, 0xba, 0x9c, 0x6d, 0xd1, 0xd1, 0xde, 0x51, + 0x47, 0x63, 0xc9, 0xbd, 0xfa, 0xa2, 0x9e, 0xad, + 0x6a, 0x7b, 0x21, 0x1a, 0x6c, 0x3e, 0xff, 0x46, + 0xbe, 0xf3, 0x35, 0x7a, 0x6e, 0xb3, 0xb9, 0xf7, + 0xda, 0x5e, 0xf0, 0x14, 0xb5, 0x70, 0xa4, 0x2b, + 0xdb, 0xbb, 0xc7, 0x31, 0x4b, 0x69, 0x5a, 0x83, + 0x70, 0xd9, 0x58, 0xd4, 0x33, 0x84, 0x23, 0xf0, + 0xae, 0xbb, 0x6d, 0x26, 0x7c, 0xc8, 0x30, 0xf7, + 0x24, 0xad, 0xbd, 0xe4, 0x2c, 0x38, 0x38, 0xac, + 0xe1, 0x4a, 0x9b, 0xac, 0x33, 0x0e, 0x4a, 0xf4, + 0x93, 0xed, 0x07, 0x82, 0x81, 0x4f, 0x8f, 0xb1, + 0xdd, 0x73, 0xd5, 0x50, 0x6d, 0x44, 0x1e, 0xbe, + 0xa7, 0xcd, 0x17, 0x57, 0xd5, 0x3b, 0x62, 0x36, + 0xcf, 0x7d, 0xc8, 0xd8, 0xd1, 0x78, 0xd7, 0x85, + 0x46, 0x76, 0x5d, 0xcc, 
0xfe, 0xe8, 0x94, 0xc5, + 0xad, 0xbc, 0x5e, 0xbc, 0x8d, 0x1d, 0xdf, 0x03, + 0xc9, 0x6b, 0x1b, 0x81, 0xd1, 0xb6, 0x5a, 0x24, + 0xe3, 0xdc, 0x3f, 0x20, 0xc9, 0x07, 0x73, 0x4c, + 0x43, 0x13, 0x87, 0x58, 0x34, 0x0d, 0x14, 0x63, + 0x0f, 0x6f, 0xad, 0x8d, 0xac, 0x7c, 0x67, 0x68, + 0xa3, 0x9d, 0x7f, 0x00, 0xdf, 0x28, 0xee, 0x67, + 0xf4, 0x5c, 0x26, 0xcb, 0xef, 0x56, 0x71, 0xc8, + 0xc6, 0x67, 0x5f, 0x38, 0xbb, 0xa0, 0xb1, 0x5c, + 0x1f, 0xb3, 0x08, 0xd9, 0x38, 0xcf, 0x74, 0x54, + 0xc6, 0xa4, 0xc4, 0xc0, 0x9f, 0xb3, 0xd0, 0xda, + 0x62, 0x67, 0x8b, 0x81, 0x33, 0xf0, 0xa9, 0x73, + 0xa4, 0xd1, 0x46, 0x88, 0x8d, 0x85, 0x12, 0x40, + 0xba, 0x1a, 0xcd, 0x82, 0xd8, 0x8d, 0xc4, 0x52, + 0xe7, 0x01, 0x94, 0x2e, 0x0e, 0xd0, 0xaf, 0xe7, + 0x2d, 0x3f, 0x3c, 0xaa, 0xf4, 0xf5, 0xa7, 0x01, + 0x4c, 0x14, 0xe2, 0xc2, 0x96, 0x76, 0xbe, 0x05, + 0xaa, 0x19, 0xb1, 0xbd, 0x95, 0xbb, 0x5a, 0xf9, + 0xa5, 0xa7, 0xe6, 0x16, 0x38, 0x34, 0xf7, 0x9d, + 0x19, 0x66, 0x16, 0x8e, 0x7f, 0x2b, 0x5a, 0xfb, + 0xb5, 0x29, 0x79, 0xbf, 0x52, 0xae, 0x30, 0x95, + 0x3f, 0x31, 0x33, 0x28, 0xde, 0xc5, 0x0d, 0x55, + 0x89, 0xec, 0x21, 0x11, 0x0f, 0x8b, 0xfe, 0x63, + 0x3a, 0xf1, 0x95, 0x5c, 0xcd, 0x50, 0xe4, 0x5d, + 0x8f, 0xa7, 0xc8, 0xca, 0x93, 0xa0, 0x67, 0x82, + 0x63, 0x5c, 0xd0, 0xed, 0xe7, 0x08, 0xc5, 0x60, + 0xf8, 0xb4, 0x47, 0xf0, 0x1a, 0x65, 0x4e, 0xa3, + 0x51, 0x68, 0xc7, 0x14, 0xa1, 0xd9, 0x39, 0x72, + 0xa8, 0x6f, 0x7c, 0x7e, 0xf6, 0x03, 0x0b, 0x25, + 0x9b, 0xf2, 0xca, 0x49, 0xae, 0x5b, 0xf8, 0x0f, + 0x71, 0x51, 0x01, 0xa6, 0x23, 0xa9, 0xdf, 0xd0, + 0x7a, 0x39, 0x19, 0xf5, 0xc5, 0x26, 0x44, 0x7b, + 0x0a, 0x4a, 0x41, 0xbf, 0xf2, 0x8e, 0x83, 0x50, + 0x91, 0x96, 0x72, 0x02, 0xf6, 0x80, 0xbf, 0x95, + 0x41, 0xac, 0xda, 0xb0, 0xba, 0xe3, 0x76, 0xb1, + 0x9d, 0xff, 0x1f, 0x33, 0x02, 0x85, 0xfc, 0x2a, + 0x29, 0xe6, 0xe3, 0x9d, 0xd0, 0xef, 0xc2, 0xd6, + 0x9c, 0x4a, 0x62, 0xac, 0xcb, 0xea, 0x8b, 0xc3, + 0x08, 0x6e, 0x49, 0x09, 0x26, 0x19, 0xc1, 0x30, + 0xcc, 0x27, 0xaa, 0xc6, 0x45, 0x88, 0xbd, 0xae, + 0xd6, 0x79, 0xff, 0x4e, 0xfc, 0x66, 0x4d, 0x02, + 0xa5, 0xee, 0x8e, 0xa5, 0xb6, 0x15, 0x72, 0x24, + 0xb1, 0xbf, 0xbf, 0x64, 0xcf, 0xcc, 0x93, 0xe9, + 0xb6, 0xfd, 0xb4, 0xb6, 0x21, 0xb5, 0x48, 0x08, + 0x0f, 0x11, 0x65, 0xe1, 0x47, 0xee, 0x93, 0x29, + 0xad +}; +static const u8 key72[] __initconst = { + 0xb9, 0xa2, 0xfc, 0x59, 0x06, 0x3f, 0x77, 0xa5, + 0x66, 0xd0, 0x2b, 0x22, 0x74, 0x22, 0x4c, 0x1e, + 0x6a, 0x39, 0xdf, 0xe1, 0x0d, 0x4c, 0x64, 0x99, + 0x54, 0x8a, 0xba, 0x1d, 0x2c, 0x21, 0x5f, 0xc3 +}; +enum { nonce72 = 0x3d069308fa3db04bULL }; + +static const u8 input73[] __initconst = { + 0xe4, 0xdd, 0x36, 0xd4, 0xf5, 0x70, 0x51, 0x73, + 0x97, 0x1d, 0x45, 0x05, 0x92, 0xe7, 0xeb, 0xb7, + 0x09, 0x82, 0x6e, 0x25, 0x6c, 0x50, 0xf5, 0x40, + 0x19, 0xba, 0xbc, 0xf4, 0x39, 0x14, 0xc5, 0x15, + 0x83, 0x40, 0xbd, 0x26, 0xe0, 0xff, 0x3b, 0x22, + 0x7c, 0x7c, 0xd7, 0x0b, 0xe9, 0x25, 0x0c, 0x3d, + 0x92, 0x38, 0xbe, 0xe4, 0x22, 0x75, 0x65, 0xf1, + 0x03, 0x85, 0x34, 0x09, 0xb8, 0x77, 0xfb, 0x48, + 0xb1, 0x2e, 0x21, 0x67, 0x9b, 0x9d, 0xad, 0x18, + 0x82, 0x0d, 0x6b, 0xc3, 0xcf, 0x00, 0x61, 0x6e, + 0xda, 0xdc, 0xa7, 0x0b, 0x5c, 0x02, 0x1d, 0xa6, + 0x4e, 0x0d, 0x7f, 0x37, 0x01, 0x5a, 0x37, 0xf3, + 0x2b, 0xbf, 0xba, 0xe2, 0x1c, 0xb3, 0xa3, 0xbc, + 0x1c, 0x93, 0x1a, 0xb1, 0x71, 0xaf, 0xe2, 0xdd, + 0x17, 0xee, 0x53, 0xfa, 0xfb, 0x02, 0x40, 0x3e, + 0x03, 0xca, 0xe7, 0xc3, 0x51, 0x81, 0xcc, 0x8c, + 0xca, 0xcf, 0x4e, 0xc5, 0x78, 0x99, 0xfd, 0xbf, + 0xea, 0xab, 0x38, 0x81, 0xfc, 0xd1, 0x9e, 0x41, + 0x0b, 0x84, 0x25, 0xf1, 0x6b, 0x3c, 0xf5, 0x40, + 0x0d, 0xc4, 0x3e, 0xb3, 0x6a, 0xec, 
0x6e, 0x75, + 0xdc, 0x9b, 0xdf, 0x08, 0x21, 0x16, 0xfb, 0x7a, + 0x8e, 0x19, 0x13, 0x02, 0xa7, 0xfc, 0x58, 0x21, + 0xc3, 0xb3, 0x59, 0x5a, 0x9c, 0xef, 0x38, 0xbd, + 0x87, 0x55, 0xd7, 0x0d, 0x1f, 0x84, 0xdc, 0x98, + 0x22, 0xca, 0x87, 0x96, 0x71, 0x6d, 0x68, 0x00, + 0xcb, 0x4f, 0x2f, 0xc4, 0x64, 0x0c, 0xc1, 0x53, + 0x0c, 0x90, 0xe7, 0x3c, 0x88, 0xca, 0xc5, 0x85, + 0xa3, 0x2a, 0x96, 0x7c, 0x82, 0x6d, 0x45, 0xf5, + 0xb7, 0x8d, 0x17, 0x69, 0xd6, 0xcd, 0x3c, 0xd3, + 0xe7, 0x1c, 0xce, 0x93, 0x50, 0xd4, 0x59, 0xa2, + 0xd8, 0x8b, 0x72, 0x60, 0x5b, 0x25, 0x14, 0xcd, + 0x5a, 0xe8, 0x8c, 0xdb, 0x23, 0x8d, 0x2b, 0x59, + 0x12, 0x13, 0x10, 0x47, 0xa4, 0xc8, 0x3c, 0xc1, + 0x81, 0x89, 0x6c, 0x98, 0xec, 0x8f, 0x7b, 0x32, + 0xf2, 0x87, 0xd9, 0xa2, 0x0d, 0xc2, 0x08, 0xf9, + 0xd5, 0xf3, 0x91, 0xe7, 0xb3, 0x87, 0xa7, 0x0b, + 0x64, 0x8f, 0xb9, 0x55, 0x1c, 0x81, 0x96, 0x6c, + 0xa1, 0xc9, 0x6e, 0x3b, 0xcd, 0x17, 0x1b, 0xfc, + 0xa6, 0x05, 0xba, 0x4a, 0x7d, 0x03, 0x3c, 0x59, + 0xc8, 0xee, 0x50, 0xb2, 0x5b, 0xe1, 0x4d, 0x6a, + 0x1f, 0x09, 0xdc, 0xa2, 0x51, 0xd1, 0x93, 0x3a, + 0x5f, 0x72, 0x1d, 0x26, 0x14, 0x62, 0xa2, 0x41, + 0x3d, 0x08, 0x70, 0x7b, 0x27, 0x3d, 0xbc, 0xdf, + 0x15, 0xfa, 0xb9, 0x5f, 0xb5, 0x38, 0x84, 0x0b, + 0x58, 0x3d, 0xee, 0x3f, 0x32, 0x65, 0x6d, 0xd7, + 0xce, 0x97, 0x3c, 0x8d, 0xfb, 0x63, 0xb9, 0xb0, + 0xa8, 0x4a, 0x72, 0x99, 0x97, 0x58, 0xc8, 0xa7, + 0xf9, 0x4c, 0xae, 0xc1, 0x63, 0xb9, 0x57, 0x18, + 0x8a, 0xfa, 0xab, 0xe9, 0xf3, 0x67, 0xe6, 0xfd, + 0xd2, 0x9d, 0x5c, 0xa9, 0x8e, 0x11, 0x0a, 0xf4, + 0x4b, 0xf1, 0xec, 0x1a, 0xaf, 0x50, 0x5d, 0x16, + 0x13, 0x69, 0x2e, 0xbd, 0x0d, 0xe6, 0xf0, 0xb2, + 0xed, 0xb4, 0x4c, 0x59, 0x77, 0x37, 0x00, 0x0b, + 0xc7, 0xa7, 0x9e, 0x37, 0xf3, 0x60, 0x70, 0xef, + 0xf3, 0xc1, 0x74, 0x52, 0x87, 0xc6, 0xa1, 0x81, + 0xbd, 0x0a, 0x2c, 0x5d, 0x2c, 0x0c, 0x6a, 0x81, + 0xa1, 0xfe, 0x26, 0x78, 0x6c, 0x03, 0x06, 0x07, + 0x34, 0xaa, 0xd1, 0x1b, 0x40, 0x03, 0x39, 0x56, + 0xcf, 0x2a, 0x92, 0xc1, 0x4e, 0xdf, 0x29, 0x24, + 0x83, 0x22, 0x7a, 0xea, 0x67, 0x1e, 0xe7, 0x54, + 0x64, 0xd3, 0xbd, 0x3a, 0x5d, 0xae, 0xca, 0xf0, + 0x9c, 0xd6, 0x5a, 0x9a, 0x62, 0xc8, 0xc7, 0x83, + 0xf9, 0x89, 0xde, 0x2d, 0x53, 0x64, 0x61, 0xf7, + 0xa3, 0xa7, 0x31, 0x38, 0xc6, 0x22, 0x9c, 0xb4, + 0x87, 0xe0 +}; +static const u8 output73[] __initconst = { + 0x34, 0xed, 0x05, 0xb0, 0x14, 0xbc, 0x8c, 0xcc, + 0x95, 0xbd, 0x99, 0x0f, 0xb1, 0x98, 0x17, 0x10, + 0xae, 0xe0, 0x08, 0x53, 0xa3, 0x69, 0xd2, 0xed, + 0x66, 0xdb, 0x2a, 0x34, 0x8d, 0x0c, 0x6e, 0xce, + 0x63, 0x69, 0xc9, 0xe4, 0x57, 0xc3, 0x0c, 0x8b, + 0xa6, 0x2c, 0xa7, 0xd2, 0x08, 0xff, 0x4f, 0xec, + 0x61, 0x8c, 0xee, 0x0d, 0xfa, 0x6b, 0xe0, 0xe8, + 0x71, 0xbc, 0x41, 0x46, 0xd7, 0x33, 0x1d, 0xc0, + 0xfd, 0xad, 0xca, 0x8b, 0x34, 0x56, 0xa4, 0x86, + 0x71, 0x62, 0xae, 0x5e, 0x3d, 0x2b, 0x66, 0x3e, + 0xae, 0xd8, 0xc0, 0xe1, 0x21, 0x3b, 0xca, 0xd2, + 0x6b, 0xa2, 0xb8, 0xc7, 0x98, 0x4a, 0xf3, 0xcf, + 0xb8, 0x62, 0xd8, 0x33, 0xe6, 0x80, 0xdb, 0x2f, + 0x0a, 0xaf, 0x90, 0x3c, 0xe1, 0xec, 0xe9, 0x21, + 0x29, 0x42, 0x9e, 0xa5, 0x50, 0xe9, 0x93, 0xd3, + 0x53, 0x1f, 0xac, 0x2a, 0x24, 0x07, 0xb8, 0xed, + 0xed, 0x38, 0x2c, 0xc4, 0xa1, 0x2b, 0x31, 0x5d, + 0x9c, 0x24, 0x7b, 0xbf, 0xd9, 0xbb, 0x4e, 0x87, + 0x8f, 0x32, 0x30, 0xf1, 0x11, 0x29, 0x54, 0x94, + 0x00, 0x95, 0x1d, 0x1d, 0x24, 0xc0, 0xd4, 0x34, + 0x49, 0x1d, 0xd5, 0xe3, 0xa6, 0xde, 0x8b, 0xbf, + 0x5a, 0x9f, 0x58, 0x5a, 0x9b, 0x70, 0xe5, 0x9b, + 0xb3, 0xdb, 0xe8, 0xb8, 0xca, 0x1b, 0x43, 0xe3, + 0xc6, 0x6f, 0x0a, 0xd6, 0x32, 0x11, 0xd4, 0x04, + 0xef, 0xa3, 0xe4, 0x3f, 0x12, 0xd8, 0xc1, 0x73, + 0x51, 0x87, 0x03, 0xbd, 0xba, 
0x60, 0x79, 0xee, + 0x08, 0xcc, 0xf7, 0xc0, 0xaa, 0x4c, 0x33, 0xc4, + 0xc7, 0x09, 0xf5, 0x91, 0xcb, 0x74, 0x57, 0x08, + 0x1b, 0x90, 0xa9, 0x1b, 0x60, 0x02, 0xd2, 0x3f, + 0x7a, 0xbb, 0xfd, 0x78, 0xf0, 0x15, 0xf9, 0x29, + 0x82, 0x8f, 0xc4, 0xb2, 0x88, 0x1f, 0xbc, 0xcc, + 0x53, 0x27, 0x8b, 0x07, 0x5f, 0xfc, 0x91, 0x29, + 0x82, 0x80, 0x59, 0x0a, 0x3c, 0xea, 0xc4, 0x7e, + 0xad, 0xd2, 0x70, 0x46, 0xbd, 0x9e, 0x3b, 0x1c, + 0x8a, 0x62, 0xea, 0x69, 0xbd, 0xf6, 0x96, 0x15, + 0xb5, 0x57, 0xe8, 0x63, 0x5f, 0x65, 0x46, 0x84, + 0x58, 0x50, 0x87, 0x4b, 0x0e, 0x5b, 0x52, 0x90, + 0xb0, 0xae, 0x37, 0x0f, 0xdd, 0x7e, 0xa2, 0xa0, + 0x8b, 0x78, 0xc8, 0x5a, 0x1f, 0x53, 0xdb, 0xc5, + 0xbf, 0x73, 0x20, 0xa9, 0x44, 0xfb, 0x1e, 0xc7, + 0x97, 0xb2, 0x3a, 0x5a, 0x17, 0xe6, 0x8b, 0x9b, + 0xe8, 0xf8, 0x2a, 0x01, 0x27, 0xa3, 0x71, 0x28, + 0xe3, 0x19, 0xc6, 0xaf, 0xf5, 0x3a, 0x26, 0xc0, + 0x5c, 0x69, 0x30, 0x78, 0x75, 0x27, 0xf2, 0x0c, + 0x22, 0x71, 0x65, 0xc6, 0x8e, 0x7b, 0x47, 0xe3, + 0x31, 0xaf, 0x7b, 0xc6, 0xc2, 0x55, 0x68, 0x81, + 0xaa, 0x1b, 0x21, 0x65, 0xfb, 0x18, 0x35, 0x45, + 0x36, 0x9a, 0x44, 0xba, 0x5c, 0xff, 0x06, 0xde, + 0x3a, 0xc8, 0x44, 0x0b, 0xaa, 0x8e, 0x34, 0xe2, + 0x84, 0xac, 0x18, 0xfe, 0x9b, 0xe1, 0x4f, 0xaa, + 0xb6, 0x90, 0x0b, 0x1c, 0x2c, 0xd9, 0x9a, 0x10, + 0x18, 0xf9, 0x49, 0x41, 0x42, 0x1b, 0xb5, 0xe1, + 0x26, 0xac, 0x2d, 0x38, 0x00, 0x00, 0xe4, 0xb4, + 0x50, 0x6f, 0x14, 0x18, 0xd6, 0x3d, 0x00, 0x59, + 0x3c, 0x45, 0xf3, 0x42, 0x13, 0x44, 0xb8, 0x57, + 0xd4, 0x43, 0x5c, 0x8a, 0x2a, 0xb4, 0xfc, 0x0a, + 0x25, 0x5a, 0xdc, 0x8f, 0x11, 0x0b, 0x11, 0x44, + 0xc7, 0x0e, 0x54, 0x8b, 0x22, 0x01, 0x7e, 0x67, + 0x2e, 0x15, 0x3a, 0xb9, 0xee, 0x84, 0x10, 0xd4, + 0x80, 0x57, 0xd7, 0x75, 0xcf, 0x8b, 0xcb, 0x03, + 0xc9, 0x92, 0x2b, 0x69, 0xd8, 0x5a, 0x9b, 0x06, + 0x85, 0x47, 0xaa, 0x4c, 0x28, 0xde, 0x49, 0x58, + 0xe6, 0x11, 0x1e, 0x5e, 0x64, 0x8e, 0x3b, 0xe0, + 0x40, 0x2e, 0xac, 0x96, 0x97, 0x15, 0x37, 0x1e, + 0x30, 0xdd +}; +static const u8 key73[] __initconst = { + 0x96, 0x06, 0x1e, 0xc1, 0x6d, 0xba, 0x49, 0x5b, + 0x65, 0x80, 0x79, 0xdd, 0xf3, 0x67, 0xa8, 0x6e, + 0x2d, 0x9c, 0x54, 0x46, 0xd8, 0x4a, 0xeb, 0x7e, + 0x23, 0x86, 0x51, 0xd8, 0x49, 0x49, 0x56, 0xe0 +}; +enum { nonce73 = 0xbefb83cb67e11ffdULL }; + +static const u8 input74[] __initconst = { + 0x47, 0x22, 0x70, 0xe5, 0x2f, 0x41, 0x18, 0x45, + 0x07, 0xd3, 0x6d, 0x32, 0x0d, 0x43, 0x92, 0x2b, + 0x9b, 0x65, 0x73, 0x13, 0x1a, 0x4f, 0x49, 0x8f, + 0xff, 0xf8, 0xcc, 0xae, 0x15, 0xab, 0x9d, 0x7d, + 0xee, 0x22, 0x5d, 0x8b, 0xde, 0x81, 0x5b, 0x81, + 0x83, 0x49, 0x35, 0x9b, 0xb4, 0xbc, 0x4e, 0x01, + 0xc2, 0x29, 0xa7, 0xf1, 0xca, 0x3a, 0xce, 0x3f, + 0xf5, 0x31, 0x93, 0xa8, 0xe2, 0xc9, 0x7d, 0x03, + 0x26, 0xa4, 0xbc, 0xa8, 0x9c, 0xb9, 0x68, 0xf3, + 0xb3, 0x91, 0xe8, 0xe6, 0xc7, 0x2b, 0x1a, 0xce, + 0xd2, 0x41, 0x53, 0xbd, 0xa3, 0x2c, 0x54, 0x94, + 0x21, 0xa1, 0x40, 0xae, 0xc9, 0x0c, 0x11, 0x92, + 0xfd, 0x91, 0xa9, 0x40, 0xca, 0xde, 0x21, 0x4e, + 0x1e, 0x3d, 0xcc, 0x2c, 0x87, 0x11, 0xef, 0x46, + 0xed, 0x52, 0x03, 0x11, 0x19, 0x43, 0x25, 0xc7, + 0x0d, 0xc3, 0x37, 0x5f, 0xd3, 0x6f, 0x0c, 0x6a, + 0x45, 0x30, 0x88, 0xec, 0xf0, 0x21, 0xef, 0x1d, + 0x7b, 0x38, 0x63, 0x4b, 0x49, 0x0c, 0x72, 0xf6, + 0x4c, 0x40, 0xc3, 0xcc, 0x03, 0xa7, 0xae, 0xa8, + 0x8c, 0x37, 0x03, 0x1c, 0x11, 0xae, 0x0d, 0x1b, + 0x62, 0x97, 0x27, 0xfc, 0x56, 0x4b, 0xb7, 0xfd, + 0xbc, 0xfb, 0x0e, 0xfc, 0x61, 0xad, 0xc6, 0xb5, + 0x9c, 0x8c, 0xc6, 0x38, 0x27, 0x91, 0x29, 0x3d, + 0x29, 0xc8, 0x37, 0xc9, 0x96, 0x69, 0xe3, 0xdc, + 0x3e, 0x61, 0x35, 0x9b, 0x99, 0x4f, 0xb9, 0x4e, + 0x5a, 0x29, 0x1c, 0x2e, 0xcf, 0x16, 
0xcb, 0x69, + 0x87, 0xe4, 0x1a, 0xc4, 0x6e, 0x78, 0x43, 0x00, + 0x03, 0xb2, 0x8b, 0x03, 0xd0, 0xb4, 0xf1, 0xd2, + 0x7d, 0x2d, 0x7e, 0xfc, 0x19, 0x66, 0x5b, 0xa3, + 0x60, 0x3f, 0x9d, 0xbd, 0xfa, 0x3e, 0xca, 0x7b, + 0x26, 0x08, 0x19, 0x16, 0x93, 0x5d, 0x83, 0xfd, + 0xf9, 0x21, 0xc6, 0x31, 0x34, 0x6f, 0x0c, 0xaa, + 0x28, 0xf9, 0x18, 0xa2, 0xc4, 0x78, 0x3b, 0x56, + 0xc0, 0x88, 0x16, 0xba, 0x22, 0x2c, 0x07, 0x2f, + 0x70, 0xd0, 0xb0, 0x46, 0x35, 0xc7, 0x14, 0xdc, + 0xbb, 0x56, 0x23, 0x1e, 0x36, 0x36, 0x2d, 0x73, + 0x78, 0xc7, 0xce, 0xf3, 0x58, 0xf7, 0x58, 0xb5, + 0x51, 0xff, 0x33, 0x86, 0x0e, 0x3b, 0x39, 0xfb, + 0x1a, 0xfd, 0xf8, 0x8b, 0x09, 0x33, 0x1b, 0x83, + 0xf2, 0xe6, 0x38, 0x37, 0xef, 0x47, 0x84, 0xd9, + 0x82, 0x77, 0x2b, 0x82, 0xcc, 0xf9, 0xee, 0x94, + 0x71, 0x78, 0x81, 0xc8, 0x4d, 0x91, 0xd7, 0x35, + 0x29, 0x31, 0x30, 0x5c, 0x4a, 0x23, 0x23, 0xb1, + 0x38, 0x6b, 0xac, 0x22, 0x3f, 0x80, 0xc7, 0xe0, + 0x7d, 0xfa, 0x76, 0x47, 0xd4, 0x6f, 0x93, 0xa0, + 0xa0, 0x93, 0x5d, 0x68, 0xf7, 0x43, 0x25, 0x8f, + 0x1b, 0xc7, 0x87, 0xea, 0x59, 0x0c, 0xa2, 0xfa, + 0xdb, 0x2f, 0x72, 0x43, 0xcf, 0x90, 0xf1, 0xd6, + 0x58, 0xf3, 0x17, 0x6a, 0xdf, 0xb3, 0x4e, 0x0e, + 0x38, 0x24, 0x48, 0x1f, 0xb7, 0x01, 0xec, 0x81, + 0xb1, 0x87, 0x5b, 0xec, 0x9c, 0x11, 0x1a, 0xff, + 0xa5, 0xca, 0x5a, 0x63, 0x31, 0xb2, 0xe4, 0xc6, + 0x3c, 0x1d, 0xaf, 0x27, 0xb2, 0xd4, 0x19, 0xa2, + 0xcc, 0x04, 0x92, 0x42, 0xd2, 0xc1, 0x8c, 0x3b, + 0xce, 0xf5, 0x74, 0xc1, 0x81, 0xf8, 0x20, 0x23, + 0x6f, 0x20, 0x6d, 0x78, 0x36, 0x72, 0x2c, 0x52, + 0xdf, 0x5e, 0xe8, 0x75, 0xce, 0x1c, 0x49, 0x9d, + 0x93, 0x6f, 0x65, 0xeb, 0xb1, 0xbd, 0x8e, 0x5e, + 0xe5, 0x89, 0xc4, 0x8a, 0x81, 0x3d, 0x9a, 0xa7, + 0x11, 0x82, 0x8e, 0x38, 0x5b, 0x5b, 0xca, 0x7d, + 0x4b, 0x72, 0xc2, 0x9c, 0x30, 0x5e, 0x7f, 0xc0, + 0x6f, 0x91, 0xd5, 0x67, 0x8c, 0x3e, 0xae, 0xda, + 0x2b, 0x3c, 0x53, 0xcc, 0x50, 0x97, 0x36, 0x0b, + 0x79, 0xd6, 0x73, 0x6e, 0x7d, 0x42, 0x56, 0xe1, + 0xaa, 0xfc, 0xb3, 0xa7, 0xc8, 0x01, 0xaa, 0xc1, + 0xfc, 0x5c, 0x72, 0x8e, 0x63, 0xa8, 0x46, 0x18, + 0xee, 0x11, 0xe7, 0x30, 0x09, 0x83, 0x6c, 0xd9, + 0xf4, 0x7a, 0x7b, 0xb5, 0x1f, 0x6d, 0xc7, 0xbc, + 0xcb, 0x55, 0xea, 0x40, 0x58, 0x7a, 0x00, 0x00, + 0x90, 0x60, 0xc5, 0x64, 0x69, 0x05, 0x99, 0xd2, + 0x49, 0x62, 0x4f, 0xcb, 0x97, 0xdf, 0xdd, 0x6b, + 0x60, 0x75, 0xe2, 0xe0, 0x6f, 0x76, 0xd0, 0x37, + 0x67, 0x0a, 0xcf, 0xff, 0xc8, 0x61, 0x84, 0x14, + 0x80, 0x7c, 0x1d, 0x31, 0x8d, 0x90, 0xde, 0x0b, + 0x1c, 0x74, 0x9f, 0x82, 0x96, 0x80, 0xda, 0xaf, + 0x8d, 0x99, 0x86, 0x9f, 0x24, 0x99, 0x28, 0x3e, + 0xe0, 0xa3, 0xc3, 0x90, 0x2d, 0x14, 0x65, 0x1e, + 0x3b, 0xb9, 0xba, 0x13, 0xa5, 0x77, 0x73, 0x63, + 0x9a, 0x06, 0x3d, 0xa9, 0x28, 0x9b, 0xba, 0x25, + 0x61, 0xc9, 0xcd, 0xcf, 0x7a, 0x4d, 0x96, 0x09, + 0xcb, 0xca, 0x03, 0x9c, 0x54, 0x34, 0x31, 0x85, + 0xa0, 0x3d, 0xe5, 0xbc, 0xa5, 0x5f, 0x1b, 0xd3, + 0x10, 0x63, 0x74, 0x9d, 0x01, 0x92, 0x88, 0xf0, + 0x27, 0x9c, 0x28, 0xd9, 0xfd, 0xe2, 0x4e, 0x01, + 0x8d, 0x61, 0x79, 0x60, 0x61, 0x5b, 0x76, 0xab, + 0x06, 0xd3, 0x44, 0x87, 0x43, 0x52, 0xcd, 0x06, + 0x68, 0x1e, 0x2d, 0xc5, 0xb0, 0x07, 0x25, 0xdf, + 0x0a, 0x50, 0xd7, 0xd9, 0x08, 0x53, 0x65, 0xf1, + 0x0c, 0x2c, 0xde, 0x3f, 0x9d, 0x03, 0x1f, 0xe1, + 0x49, 0x43, 0x3c, 0x83, 0x81, 0x37, 0xf8, 0xa2, + 0x0b, 0xf9, 0x61, 0x1c, 0xc1, 0xdb, 0x79, 0xbc, + 0x64, 0xce, 0x06, 0x4e, 0x87, 0x89, 0x62, 0x73, + 0x51, 0xbc, 0xa4, 0x32, 0xd4, 0x18, 0x62, 0xab, + 0x65, 0x7e, 0xad, 0x1e, 0x91, 0xa3, 0xfa, 0x2d, + 0x58, 0x9e, 0x2a, 0xe9, 0x74, 0x44, 0x64, 0x11, + 0xe6, 0xb6, 0xb3, 0x00, 0x7e, 0xa3, 0x16, 0xef, + 0x72 +}; +static const u8 output74[] 
__initconst = { + 0xf5, 0xca, 0x45, 0x65, 0x50, 0x35, 0x47, 0x67, + 0x6f, 0x4f, 0x67, 0xff, 0x34, 0xd9, 0xc3, 0x37, + 0x2a, 0x26, 0xb0, 0x4f, 0x08, 0x1e, 0x45, 0x13, + 0xc7, 0x2c, 0x14, 0x75, 0x33, 0xd8, 0x8e, 0x1e, + 0x1b, 0x11, 0x0d, 0x97, 0x04, 0x33, 0x8a, 0xe4, + 0xd8, 0x8d, 0x0e, 0x12, 0x8d, 0xdb, 0x6e, 0x02, + 0xfa, 0xe5, 0xbd, 0x3a, 0xb5, 0x28, 0x07, 0x7d, + 0x20, 0xf0, 0x12, 0x64, 0x83, 0x2f, 0x59, 0x79, + 0x17, 0x88, 0x3c, 0x2d, 0x08, 0x2f, 0x55, 0xda, + 0xcc, 0x02, 0x3a, 0x82, 0xcd, 0x03, 0x94, 0xdf, + 0xdf, 0xab, 0x8a, 0x13, 0xf5, 0xe6, 0x74, 0xdf, + 0x7b, 0xe2, 0xab, 0x34, 0xbc, 0x00, 0x85, 0xbf, + 0x5a, 0x48, 0xc8, 0xff, 0x8d, 0x6c, 0x27, 0x48, + 0x19, 0x2d, 0x08, 0xfa, 0x82, 0x62, 0x39, 0x55, + 0x32, 0x11, 0xa8, 0xd7, 0xb9, 0x08, 0x2c, 0xd6, + 0x7a, 0xd9, 0x83, 0x9f, 0x9b, 0xfb, 0xec, 0x3a, + 0xd1, 0x08, 0xc7, 0xad, 0xdc, 0x98, 0x4c, 0xbc, + 0x98, 0xeb, 0x36, 0xb0, 0x39, 0xf4, 0x3a, 0xd6, + 0x53, 0x02, 0xa0, 0xa9, 0x73, 0xa1, 0xca, 0xef, + 0xd8, 0xd2, 0xec, 0x0e, 0xf8, 0xf5, 0xac, 0x8d, + 0x34, 0x41, 0x06, 0xa8, 0xc6, 0xc3, 0x31, 0xbc, + 0xe5, 0xcc, 0x7e, 0x72, 0x63, 0x59, 0x3e, 0x63, + 0xc2, 0x8d, 0x2b, 0xd5, 0xb9, 0xfd, 0x1e, 0x31, + 0x69, 0x32, 0x05, 0xd6, 0xde, 0xc9, 0xe6, 0x4c, + 0xac, 0x68, 0xf7, 0x1f, 0x9d, 0xcd, 0x0e, 0xa2, + 0x15, 0x3d, 0xd6, 0x47, 0x99, 0xab, 0x08, 0x5f, + 0x28, 0xc3, 0x4c, 0xc2, 0xd5, 0xdd, 0x10, 0xb7, + 0xbd, 0xdb, 0x9b, 0xcf, 0x85, 0x27, 0x29, 0x76, + 0x98, 0xeb, 0xad, 0x31, 0x64, 0xe7, 0xfb, 0x61, + 0xe0, 0xd8, 0x1a, 0xa6, 0xe2, 0xe7, 0x43, 0x42, + 0x77, 0xc9, 0x82, 0x00, 0xac, 0x85, 0xe0, 0xa2, + 0xd4, 0x62, 0xe3, 0xb7, 0x17, 0x6e, 0xb2, 0x9e, + 0x21, 0x58, 0x73, 0xa9, 0x53, 0x2d, 0x3c, 0xe1, + 0xdd, 0xd6, 0x6e, 0x92, 0xf2, 0x1d, 0xc2, 0x22, + 0x5f, 0x9a, 0x7e, 0xd0, 0x52, 0xbf, 0x54, 0x19, + 0xd7, 0x80, 0x63, 0x3e, 0xd0, 0x08, 0x2d, 0x37, + 0x0c, 0x15, 0xf7, 0xde, 0xab, 0x2b, 0xe3, 0x16, + 0x21, 0x3a, 0xee, 0xa5, 0xdc, 0xdf, 0xde, 0xa3, + 0x69, 0xcb, 0xfd, 0x92, 0x89, 0x75, 0xcf, 0xc9, + 0x8a, 0xa4, 0xc8, 0xdd, 0xcc, 0x21, 0xe6, 0xfe, + 0x9e, 0x43, 0x76, 0xb2, 0x45, 0x22, 0xb9, 0xb5, + 0xac, 0x7e, 0x3d, 0x26, 0xb0, 0x53, 0xc8, 0xab, + 0xfd, 0xea, 0x2c, 0xd1, 0x44, 0xc5, 0x60, 0x1b, + 0x8a, 0x99, 0x0d, 0xa5, 0x0e, 0x67, 0x6e, 0x3a, + 0x96, 0x55, 0xec, 0xe8, 0xcc, 0xbe, 0x49, 0xd9, + 0xf2, 0x72, 0x9f, 0x30, 0x21, 0x97, 0x57, 0x19, + 0xbe, 0x5e, 0x33, 0x0c, 0xee, 0xc0, 0x72, 0x0d, + 0x2e, 0xd1, 0xe1, 0x52, 0xc2, 0xea, 0x41, 0xbb, + 0xe1, 0x6d, 0xd4, 0x17, 0xa9, 0x8d, 0x89, 0xa9, + 0xd6, 0x4b, 0xc6, 0x4c, 0xf2, 0x88, 0x97, 0x54, + 0x3f, 0x4f, 0x57, 0xb7, 0x37, 0xf0, 0x2c, 0x11, + 0x15, 0x56, 0xdb, 0x28, 0xb5, 0x16, 0x84, 0x66, + 0xce, 0x45, 0x3f, 0x61, 0x75, 0xb6, 0xbe, 0x00, + 0xd1, 0xe4, 0xf5, 0x27, 0x54, 0x7f, 0xc2, 0xf1, + 0xb3, 0x32, 0x9a, 0xe8, 0x07, 0x02, 0xf3, 0xdb, + 0xa9, 0xd1, 0xc2, 0xdf, 0xee, 0xad, 0xe5, 0x8a, + 0x3c, 0xfa, 0x67, 0xec, 0x6b, 0xa4, 0x08, 0xfe, + 0xba, 0x5a, 0x58, 0x0b, 0x78, 0x11, 0x91, 0x76, + 0xe3, 0x1a, 0x28, 0x54, 0x5e, 0xbd, 0x71, 0x1b, + 0x8b, 0xdc, 0x6c, 0xf4, 0x6f, 0xd7, 0xf4, 0xf3, + 0xe1, 0x03, 0xa4, 0x3c, 0x8d, 0x91, 0x2e, 0xba, + 0x5f, 0x7f, 0x8c, 0xaf, 0x69, 0x89, 0x29, 0x0a, + 0x5b, 0x25, 0x13, 0xc4, 0x2e, 0x16, 0xc2, 0x15, + 0x07, 0x5d, 0x58, 0x33, 0x7c, 0xe0, 0xf0, 0x55, + 0x5f, 0xbf, 0x5e, 0xf0, 0x71, 0x48, 0x8f, 0xf7, + 0x48, 0xb3, 0xf7, 0x0d, 0xa1, 0xd0, 0x63, 0xb1, + 0xad, 0xae, 0xb5, 0xb0, 0x5f, 0x71, 0xaf, 0x24, + 0x8b, 0xb9, 0x1c, 0x44, 0xd2, 0x1a, 0x53, 0xd1, + 0xd5, 0xb4, 0xa9, 0xff, 0x88, 0x73, 0xb5, 0xaa, + 0x15, 0x32, 0x5f, 0x59, 0x9d, 0x2e, 0xb5, 0xcb, + 0xde, 0x21, 0x2e, 0xe9, 0x35, 0xed, 
0xfd, 0x0f, + 0xb6, 0xbb, 0xe6, 0x4b, 0x16, 0xf1, 0x45, 0x1e, + 0xb4, 0x84, 0xe9, 0x58, 0x1c, 0x0c, 0x95, 0xc0, + 0xcf, 0x49, 0x8b, 0x59, 0xa1, 0x78, 0xe6, 0x80, + 0x12, 0x49, 0x7a, 0xd4, 0x66, 0x62, 0xdf, 0x9c, + 0x18, 0xc8, 0x8c, 0xda, 0xc1, 0xa6, 0xbc, 0x65, + 0x28, 0xd2, 0xa4, 0xe8, 0xf1, 0x35, 0xdb, 0x5a, + 0x75, 0x1f, 0x73, 0x60, 0xec, 0xa8, 0xda, 0x5a, + 0x43, 0x15, 0x83, 0x9b, 0xe7, 0xb1, 0xa6, 0x81, + 0xbb, 0xef, 0xf3, 0x8f, 0x0f, 0xd3, 0x79, 0xa2, + 0xe5, 0xaa, 0x42, 0xef, 0xa0, 0x13, 0x4e, 0x91, + 0x2d, 0xcb, 0x61, 0x7a, 0x9a, 0x33, 0x14, 0x50, + 0x77, 0x4a, 0xd0, 0x91, 0x48, 0xe0, 0x0c, 0xe0, + 0x11, 0xcb, 0xdf, 0xb0, 0xce, 0x06, 0xd2, 0x79, + 0x4d, 0x69, 0xb9, 0xc9, 0x36, 0x74, 0x8f, 0x81, + 0x72, 0x73, 0xf3, 0x17, 0xb7, 0x13, 0xcb, 0x5b, + 0xd2, 0x5c, 0x33, 0x61, 0xb7, 0x61, 0x79, 0xb0, + 0xc0, 0x4d, 0xa1, 0xc7, 0x5d, 0x98, 0xc9, 0xe1, + 0x98, 0xbd, 0x78, 0x5a, 0x2c, 0x64, 0x53, 0xaf, + 0xaf, 0x66, 0x51, 0x47, 0xe4, 0x48, 0x66, 0x8b, + 0x07, 0x52, 0xa3, 0x03, 0x93, 0x28, 0xad, 0xcc, + 0xa3, 0x86, 0xad, 0x63, 0x04, 0x35, 0x6c, 0x49, + 0xd5, 0x28, 0x0e, 0x00, 0x47, 0xf4, 0xd4, 0x32, + 0x27, 0x19, 0xb3, 0x29, 0xe7, 0xbc, 0xbb, 0xce, + 0x3e, 0x3e, 0xd5, 0x67, 0x20, 0xe4, 0x0b, 0x75, + 0x95, 0x24, 0xe0, 0x6c, 0xb6, 0x29, 0x0c, 0x14, + 0xfd +}; +static const u8 key74[] __initconst = { + 0xf0, 0x41, 0x5b, 0x00, 0x56, 0xc4, 0xac, 0xf6, + 0xa2, 0x4c, 0x33, 0x41, 0x16, 0x09, 0x1b, 0x8e, + 0x4d, 0xe8, 0x8c, 0xd9, 0x48, 0xab, 0x3e, 0x60, + 0xcb, 0x49, 0x3e, 0xaf, 0x2b, 0x8b, 0xc8, 0xf0 +}; +enum { nonce74 = 0xcbdb0ffd0e923384ULL }; + +static const struct chacha20_testvec chacha20_testvecs[] __initconst = { + { input01, output01, key01, nonce01, sizeof(input01) }, + { input02, output02, key02, nonce02, sizeof(input02) }, + { input03, output03, key03, nonce03, sizeof(input03) }, + { input04, output04, key04, nonce04, sizeof(input04) }, + { input05, output05, key05, nonce05, sizeof(input05) }, + { input06, output06, key06, nonce06, sizeof(input06) }, + { input07, output07, key07, nonce07, sizeof(input07) }, + { input08, output08, key08, nonce08, sizeof(input08) }, + { input09, output09, key09, nonce09, sizeof(input09) }, + { input10, output10, key10, nonce10, sizeof(input10) }, + { input11, output11, key11, nonce11, sizeof(input11) }, + { input12, output12, key12, nonce12, sizeof(input12) }, + { input13, output13, key13, nonce13, sizeof(input13) }, + { input14, output14, key14, nonce14, sizeof(input14) }, + { input15, output15, key15, nonce15, sizeof(input15) }, + { input16, output16, key16, nonce16, sizeof(input16) }, + { input17, output17, key17, nonce17, sizeof(input17) }, + { input18, output18, key18, nonce18, sizeof(input18) }, + { input19, output19, key19, nonce19, sizeof(input19) }, + { input20, output20, key20, nonce20, sizeof(input20) }, + { input21, output21, key21, nonce21, sizeof(input21) }, + { input22, output22, key22, nonce22, sizeof(input22) }, + { input23, output23, key23, nonce23, sizeof(input23) }, + { input24, output24, key24, nonce24, sizeof(input24) }, + { input25, output25, key25, nonce25, sizeof(input25) }, + { input26, output26, key26, nonce26, sizeof(input26) }, + { input27, output27, key27, nonce27, sizeof(input27) }, + { input28, output28, key28, nonce28, sizeof(input28) }, + { input29, output29, key29, nonce29, sizeof(input29) }, + { input30, output30, key30, nonce30, sizeof(input30) }, + { input31, output31, key31, nonce31, sizeof(input31) }, + { input32, output32, key32, nonce32, sizeof(input32) }, + { input33, output33, key33, nonce33, sizeof(input33) }, + { 
input34, output34, key34, nonce34, sizeof(input34) },
+	{ input35, output35, key35, nonce35, sizeof(input35) },
+	{ input36, output36, key36, nonce36, sizeof(input36) },
+	{ input37, output37, key37, nonce37, sizeof(input37) },
+	{ input38, output38, key38, nonce38, sizeof(input38) },
+	{ input39, output39, key39, nonce39, sizeof(input39) },
+	{ input40, output40, key40, nonce40, sizeof(input40) },
+	{ input41, output41, key41, nonce41, sizeof(input41) },
+	{ input42, output42, key42, nonce42, sizeof(input42) },
+	{ input43, output43, key43, nonce43, sizeof(input43) },
+	{ input44, output44, key44, nonce44, sizeof(input44) },
+	{ input45, output45, key45, nonce45, sizeof(input45) },
+	{ input46, output46, key46, nonce46, sizeof(input46) },
+	{ input47, output47, key47, nonce47, sizeof(input47) },
+	{ input48, output48, key48, nonce48, sizeof(input48) },
+	{ input49, output49, key49, nonce49, sizeof(input49) },
+	{ input50, output50, key50, nonce50, sizeof(input50) },
+	{ input51, output51, key51, nonce51, sizeof(input51) },
+	{ input52, output52, key52, nonce52, sizeof(input52) },
+	{ input53, output53, key53, nonce53, sizeof(input53) },
+	{ input54, output54, key54, nonce54, sizeof(input54) },
+	{ input55, output55, key55, nonce55, sizeof(input55) },
+	{ input56, output56, key56, nonce56, sizeof(input56) },
+	{ input57, output57, key57, nonce57, sizeof(input57) },
+	{ input58, output58, key58, nonce58, sizeof(input58) },
+	{ input59, output59, key59, nonce59, sizeof(input59) },
+	{ input60, output60, key60, nonce60, sizeof(input60) },
+	{ input61, output61, key61, nonce61, sizeof(input61) },
+	{ input62, output62, key62, nonce62, sizeof(input62) },
+	{ input63, output63, key63, nonce63, sizeof(input63) },
+	{ input64, output64, key64, nonce64, sizeof(input64) },
+	{ input65, output65, key65, nonce65, sizeof(input65) },
+	{ input66, output66, key66, nonce66, sizeof(input66) },
+	{ input67, output67, key67, nonce67, sizeof(input67) },
+	{ input68, output68, key68, nonce68, sizeof(input68) },
+	{ input69, output69, key69, nonce69, sizeof(input69) },
+	{ input70, output70, key70, nonce70, sizeof(input70) },
+	{ input71, output71, key71, nonce71, sizeof(input71) },
+	{ input72, output72, key72, nonce72, sizeof(input72) },
+	{ input73, output73, key73, nonce73, sizeof(input73) },
+	{ input74, output74, key74, nonce74, sizeof(input74) }
+};
+
+static const struct hchacha20_testvec hchacha20_testvecs[] __initconst = {{
+	.key = { 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
+		 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f,
+		 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17,
+		 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f },
+	.nonce = { 0x00, 0x00, 0x00, 0x09, 0x00, 0x00, 0x00, 0x4a,
+		   0x00, 0x00, 0x00, 0x00, 0x31, 0x41, 0x59, 0x27 },
+	.output = { 0x82, 0x41, 0x3b, 0x42, 0x27, 0xb2, 0x7b, 0xfe,
+		    0xd3, 0x0e, 0x42, 0x50, 0x8a, 0x87, 0x7d, 0x73,
+		    0xa0, 0xf9, 0xe4, 0xd5, 0x8a, 0x74, 0xa8, 0x53,
+		    0xc1, 0x2e, 0xc4, 0x13, 0x26, 0xd3, 0xec, 0xdc }
+}};
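[The single HChaCha20 vector above is the standard one from the XChaCha20 draft: key bytes 0x00 through 0x1f, and a 16-byte nonce. HChaCha20 runs the usual twenty ChaCha rounds over (constants, key, nonce), skips the final feed-forward addition, and takes state words 0-3 and 12-15 as the derived subkey. Below is a minimal userspace sketch of that derivation; hchacha20_sketch, le32, and QR are illustrative names for this note, not Zinc's API.]

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define ROTL32(v, n) (((v) << (n)) | ((v) >> (32 - (n))))

/* One standard ChaCha quarter round on four 32-bit state words. */
#define QR(a, b, c, d) do {                        \
		a += b; d ^= a; d = ROTL32(d, 16); \
		c += d; b ^= c; b = ROTL32(b, 12); \
		a += b; d ^= a; d = ROTL32(d, 8);  \
		c += d; b ^= c; b = ROTL32(b, 7);  \
	} while (0)

static uint32_t le32(const uint8_t *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * HChaCha20: run the 20 ChaCha rounds over (constants, key, nonce),
 * omit the final feed-forward addition, and return state words 0-3
 * and 12-15 as the derived subkey.
 */
static void hchacha20_sketch(uint32_t subkey[8], const uint8_t nonce[16],
			     const uint8_t key[32])
{
	uint32_t x[16] = { 0x61707865, 0x3320646e, 0x79622d32, 0x6b206574 };
	int i;

	for (i = 0; i < 8; ++i)
		x[4 + i] = le32(key + 4 * i);
	for (i = 0; i < 4; ++i)
		x[12 + i] = le32(nonce + 4 * i);

	for (i = 0; i < 20; i += 2) {
		QR(x[0], x[4], x[8],  x[12]); /* column rounds */
		QR(x[1], x[5], x[9],  x[13]);
		QR(x[2], x[6], x[10], x[14]);
		QR(x[3], x[7], x[11], x[15]);
		QR(x[0], x[5], x[10], x[15]); /* diagonal rounds */
		QR(x[1], x[6], x[11], x[12]);
		QR(x[2], x[7], x[8],  x[13]);
		QR(x[3], x[4], x[9],  x[14]);
	}

	memcpy(subkey, x, 4 * sizeof(uint32_t));
	memcpy(subkey + 4, x + 12, 4 * sizeof(uint32_t));
}

int main(void)
{
	const uint8_t nonce[16] = { 0x00, 0x00, 0x00, 0x09, 0x00, 0x00,
				    0x00, 0x4a, 0x00, 0x00, 0x00, 0x00,
				    0x31, 0x41, 0x59, 0x27 };
	uint8_t key[32];
	uint32_t subkey[8];
	int i;

	for (i = 0; i < 32; ++i)
		key[i] = (uint8_t)i;
	hchacha20_sketch(subkey, nonce, key);
	/* First word prints 423b4182: bytes 0x82 0x41 0x3b 0x42 of .output. */
	for (i = 0; i < 8; ++i)
		printf("%08x ", subkey[i]);
	printf("\n");
	return 0;
}

[Compiled and run, this prints the little-endian words of the .output vector above, which is also why the selftest below applies cpu_to_le32_array() to the derived key before comparing it byte-for-byte.]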
+
+static bool __init chacha20_selftest(void)
+{
+	enum {
+		MAXIMUM_TEST_BUFFER_LEN = 1UL << 10,
+		OUTRAGEOUSLY_HUGE_BUFFER_LEN = PAGE_SIZE * 35 + 17 /* 143k */
+	};
+	size_t i, j, k;
+	u32 derived_key[CHACHA20_KEY_WORDS];
+	u8 *offset_input = NULL, *computed_output = NULL, *massive_input = NULL;
+	u8 offset_key[CHACHA20_KEY_SIZE + 1]
+		__aligned(__alignof__(unsigned long));
+	struct chacha20_ctx state;
+	bool success = true;
+	simd_context_t simd_context;
+
+	offset_input = kmalloc(MAXIMUM_TEST_BUFFER_LEN + 1, GFP_KERNEL);
+	computed_output = kmalloc(MAXIMUM_TEST_BUFFER_LEN + 1, GFP_KERNEL);
+	massive_input = vzalloc(OUTRAGEOUSLY_HUGE_BUFFER_LEN);
+	if (!computed_output || !offset_input || !massive_input) {
+		pr_err("chacha20 self-test malloc: FAIL\n");
+		success = false;
+		goto out;
+	}
+
+	simd_get(&simd_context);
+	for (i = 0; i < ARRAY_SIZE(chacha20_testvecs); ++i) {
+		/* Boring case */
+		memset(computed_output, 0, MAXIMUM_TEST_BUFFER_LEN + 1);
+		memset(&state, 0, sizeof(state));
+		chacha20_init(&state, chacha20_testvecs[i].key,
+			      chacha20_testvecs[i].nonce);
+		chacha20(&state, computed_output, chacha20_testvecs[i].input,
+			 chacha20_testvecs[i].ilen, &simd_context);
+		if (memcmp(computed_output, chacha20_testvecs[i].output,
+			   chacha20_testvecs[i].ilen)) {
+			pr_err("chacha20 self-test %zu: FAIL\n", i + 1);
+			success = false;
+		}
+		for (k = chacha20_testvecs[i].ilen;
+		     k < MAXIMUM_TEST_BUFFER_LEN + 1; ++k) {
+			if (computed_output[k]) {
+				pr_err("chacha20 self-test %zu (zero check): FAIL\n",
+				       i + 1);
+				success = false;
+				break;
+			}
+		}
+
+		/* Unaligned case */
+		memset(computed_output, 0, MAXIMUM_TEST_BUFFER_LEN + 1);
+		memset(&state, 0, sizeof(state));
+		memcpy(offset_input + 1, chacha20_testvecs[i].input,
+		       chacha20_testvecs[i].ilen);
+		memcpy(offset_key + 1, chacha20_testvecs[i].key,
+		       CHACHA20_KEY_SIZE);
+		chacha20_init(&state, offset_key + 1,
+			      chacha20_testvecs[i].nonce);
+		chacha20(&state, computed_output + 1, offset_input + 1,
+			 chacha20_testvecs[i].ilen, &simd_context);
+		if (memcmp(computed_output + 1, chacha20_testvecs[i].output,
+			   chacha20_testvecs[i].ilen)) {
+			pr_err("chacha20 self-test %zu (unaligned): FAIL\n",
+			       i + 1);
+			success = false;
+		}
+		if (computed_output[0]) {
+			pr_err("chacha20 self-test %zu (unaligned, zero check): FAIL\n",
+			       i + 1);
+			success = false;
+		}
+		for (k = chacha20_testvecs[i].ilen + 1;
+		     k < MAXIMUM_TEST_BUFFER_LEN + 1; ++k) {
+			if (computed_output[k]) {
+				pr_err("chacha20 self-test %zu (unaligned, zero check): FAIL\n",
+				       i + 1);
+				success = false;
+				break;
+			}
+		}
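+
+		/* Chunked case: split a single encryption into two calls at
+		 * a CHACHA20_BLOCK_SIZE (64-byte) boundary, checking that
+		 * the block counter kept in struct chacha20_ctx carries
+		 * over correctly between calls, i.e. that encrypting a
+		 * buffer in two pieces yields the same bytes as encrypting
+		 * it in one shot.
+		 */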
+		if (chacha20_testvecs[i].ilen <= CHACHA20_BLOCK_SIZE)
+			goto next_test;
+		memset(computed_output, 0, MAXIMUM_TEST_BUFFER_LEN + 1);
+		memset(&state, 0, sizeof(state));
+		chacha20_init(&state, chacha20_testvecs[i].key,
+			      chacha20_testvecs[i].nonce);
+		chacha20(&state, computed_output, chacha20_testvecs[i].input,
+			 CHACHA20_BLOCK_SIZE, &simd_context);
+		chacha20(&state, computed_output + CHACHA20_BLOCK_SIZE,
+			 chacha20_testvecs[i].input + CHACHA20_BLOCK_SIZE,
+			 chacha20_testvecs[i].ilen - CHACHA20_BLOCK_SIZE,
+			 &simd_context);
+		if (memcmp(computed_output, chacha20_testvecs[i].output,
+			   chacha20_testvecs[i].ilen)) {
+			pr_err("chacha20 self-test %zu (chunked): FAIL\n",
+			       i + 1);
+			success = false;
+		}
+		for (k = chacha20_testvecs[i].ilen;
+		     k < MAXIMUM_TEST_BUFFER_LEN + 1; ++k) {
+			if (computed_output[k]) {
+				pr_err("chacha20 self-test %zu (chunked, zero check): FAIL\n",
+				       i + 1);
+				success = false;
+				break;
+			}
+		}
+
+next_test:
+		/* Sliding unaligned case */
+		if (chacha20_testvecs[i].ilen > CHACHA20_BLOCK_SIZE + 1 ||
+		    !chacha20_testvecs[i].ilen)
+			continue;
+		for (j = 1; j < CHACHA20_BLOCK_SIZE; ++j) {
+			memset(computed_output, 0, MAXIMUM_TEST_BUFFER_LEN + 1);
+			memset(&state, 0, sizeof(state));
+			memcpy(offset_input + j, chacha20_testvecs[i].input,
+			       chacha20_testvecs[i].ilen);
+			chacha20_init(&state, chacha20_testvecs[i].key,
+				      chacha20_testvecs[i].nonce);
+			chacha20(&state, computed_output + j, offset_input + j,
+				 chacha20_testvecs[i].ilen, &simd_context);
+			if (memcmp(computed_output + j,
+				   chacha20_testvecs[i].output,
+				   chacha20_testvecs[i].ilen)) {
+				pr_err("chacha20 self-test %zu (unaligned, slide %zu): FAIL\n",
+				       i + 1, j);
+				success = false;
+			}
+			for (k = 0; k < j; ++k) {
+				if (computed_output[k]) {
+					pr_err("chacha20 self-test %zu (unaligned, slide %zu, zero check): FAIL\n",
+					       i + 1, j);
+					success = false;
+					break;
+				}
+			}
+			for (k = chacha20_testvecs[i].ilen + j;
+			     k < MAXIMUM_TEST_BUFFER_LEN + 1; ++k) {
+				if (computed_output[k]) {
+					pr_err("chacha20 self-test %zu (unaligned, slide %zu, zero check): FAIL\n",
+					       i + 1, j);
+					success = false;
+					break;
+				}
+			}
+		}
+	}
+	for (i = 0; i < ARRAY_SIZE(hchacha20_testvecs); ++i) {
+		memset(&derived_key, 0, sizeof(derived_key));
+		hchacha20(derived_key, hchacha20_testvecs[i].nonce,
+			  hchacha20_testvecs[i].key, &simd_context);
+		cpu_to_le32_array(derived_key, ARRAY_SIZE(derived_key));
+		if (memcmp(derived_key, hchacha20_testvecs[i].output,
+			   CHACHA20_KEY_SIZE)) {
+			pr_err("hchacha20 self-test %zu: FAIL\n", i + 1);
+			success = false;
+		}
+	}
+	/* Massive case: encrypt in place twice with the same key and nonce,
+	 * once through simd_context and once with DONT_USE_SIMD; the two
+	 * keystreams cancel, so the buffer must end up all-zero if the SIMD
+	 * and generic paths agree.
+	 */
+	memset(&state, 0, sizeof(state));
+	chacha20_init(&state, chacha20_testvecs[0].key,
+		      chacha20_testvecs[0].nonce);
+	chacha20(&state, massive_input, massive_input,
+		 OUTRAGEOUSLY_HUGE_BUFFER_LEN, &simd_context);
+	chacha20_init(&state, chacha20_testvecs[0].key,
+		      chacha20_testvecs[0].nonce);
+	chacha20(&state, massive_input, massive_input,
+		 OUTRAGEOUSLY_HUGE_BUFFER_LEN, DONT_USE_SIMD);
+	for (k = 0; k < OUTRAGEOUSLY_HUGE_BUFFER_LEN; ++k) {
+		if (massive_input[k]) {
+			pr_err("chacha20 self-test massive: FAIL\n");
+			success = false;
+			break;
+		}
+	}
+
+	simd_put(&simd_context);
+
+out:
+	kfree(offset_input);
+	kfree(computed_output);
+	vfree(massive_input);
+	return success;
+}

diff --git a/lib/zinc/selftest/run.h b/lib/zinc/selftest/run.h
new file mode 100644
index 000000000000..4cbafe2b2565
--- /dev/null
+++ b/lib/zinc/selftest/run.h
@@ -0,0 +1,49 @@
+/* SPDX-License-Identifier: GPL-2.0 OR MIT */
+/*
+ * Copyright (C) 2015-2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ */
+
+#ifndef _ZINC_SELFTEST_RUN_H
+#define _ZINC_SELFTEST_RUN_H
+
+#include <linux/kernel.h>
+#include <linux/printk.h>
+#include <linux/bug.h>
+
+static inline bool selftest_run(const char *name, bool (*selftest)(void),
+				bool *const nobs[], unsigned int nobs_len)
+{
+	unsigned long subset = 0, set = 0;
+	unsigned int i;
+	bool ret = true;
+
+	BUILD_BUG_ON(!__builtin_constant_p(nobs_len) ||
+		     nobs_len >= BITS_PER_LONG);
+
+	if (!IS_ENABLED(CONFIG_ZINC_SELFTEST))
+		return true;
+
+	for (i = 0; i < nobs_len; ++i)
+		set |= ((unsigned long)*nobs[i]) << i;
+
+	do {
+		for (i = 0; i < nobs_len; ++i)
+			*nobs[i] = (subset >> i) & 1;
+		if (!selftest()) {
+			pr_err("%s self-test combo 0x%lx: FAIL\n", name,
+			       subset);
+			ret = false;
+		}
+		subset = (subset - set) & set;
+	} while (subset);
+
+	for (i = 0; i < nobs_len; ++i)
+		*nobs[i] = (set >> i) & 1;
+
+	if (ret)
+		pr_info("%s self-tests: pass\n", name);
+
+	return !WARN_ON(!ret);
+}
+
+#endif
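[The do/while loop in selftest_run() above runs the selftest once for every on/off combination of the boolean knobs ("nobs") it is handed, which in this series are the flags gating the individual SIMD implementations, and it restores the original values before returning. The enumeration uses the standard submask-iteration identity subset = (subset - set) & set: starting from 0, it visits each submask of set exactly once and then wraps back to 0, terminating the loop. A tiny standalone demonstration, with illustrative names:]

#include <stdio.h>

/*
 * Visit every submask of `mask` exactly once. The update
 * subset = (subset - mask) & mask is the same step used by
 * selftest_run(); when it wraps back to 0 the do/while ends.
 */
int main(void)
{
	unsigned long mask = 0x5; /* two knobs available: bits 0 and 2 */
	unsigned long subset = 0;

	do {
		printf("combo 0x%lx\n", subset); /* prints 0x0, 0x1, 0x4, 0x5 */
		subset = (subset - mask) & mask;
	} while (subset);
	return 0;
}

[Exercising every combination is what lets one boot-time run catch a broken SIMD path even on machines that would otherwise always take the fastest implementation.]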
From patchwork Sat Oct 6 02:56:46 2018
X-Patchwork-Submitter: "Jason A. Donenfeld"
X-Patchwork-Id: 148303
From: "Jason A. Donenfeld"
To: linux-kernel@vger.kernel.org, netdev@vger.kernel.org, davem@davemloft.net,
 gregkh@linuxfoundation.org
Cc: "Jason A. Donenfeld", Andy Polyakov, Thomas Gleixner, Ingo Molnar,
 x86@kernel.org, Samuel Neves, Jean-Philippe Aumasson, Andy Lutomirski,
 Andrew Morton, Linus Torvalds, kernel-hardening@lists.openwall.com,
 linux-crypto@vger.kernel.org
Subject: [PATCH net-next v7 05/28] zinc: import Andy Polyakov's ChaCha20 x86_64 implementation
Date: Sat, 6 Oct 2018 04:56:46 +0200
Message-Id: <20181006025709.4019-6-Jason@zx2c4.com>
In-Reply-To: <20181006025709.4019-1-Jason@zx2c4.com>
References: <20181006025709.4019-1-Jason@zx2c4.com>

These x86_64 vectorized implementations are Andy Polyakov's, and are
included here in raw form without modification, so that subsequent
commits that fix them up for the kernel can show exactly how they have
changed. While this is CRYPTOGAMS code, the originating code happens to
be the same as OpenSSL's commit
cded951378069a478391843f5f8653c1eb5128da.

Signed-off-by: Jason A.
Donenfeld Based-on-code-from: Andy Polyakov Cc: Andy Polyakov Cc: Thomas Gleixner Cc: Ingo Molnar Cc: x86@kernel.org Cc: Samuel Neves Cc: Jean-Philippe Aumasson Cc: Andy Lutomirski Cc: Greg KH Cc: Andrew Morton Cc: Linus Torvalds Cc: kernel-hardening@lists.openwall.com Cc: linux-crypto@vger.kernel.org --- .../chacha20/chacha20-x86_64-cryptogams.S | 3433 +++++++++++++++++ 1 file changed, 3433 insertions(+) create mode 100644 lib/zinc/chacha20/chacha20-x86_64-cryptogams.S -- 2.19.0 diff --git a/lib/zinc/chacha20/chacha20-x86_64-cryptogams.S b/lib/zinc/chacha20/chacha20-x86_64-cryptogams.S new file mode 100644 index 000000000000..2bfc76f7e01f --- /dev/null +++ b/lib/zinc/chacha20/chacha20-x86_64-cryptogams.S @@ -0,0 +1,3433 @@ +/* SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause */ +/* + * Copyright (C) 2006-2017 CRYPTOGAMS by . All Rights Reserved. + */ + +.text + + + +.align 64 +.Lzero: +.long 0,0,0,0 +.Lone: +.long 1,0,0,0 +.Linc: +.long 0,1,2,3 +.Lfour: +.long 4,4,4,4 +.Lincy: +.long 0,2,4,6,1,3,5,7 +.Leight: +.long 8,8,8,8,8,8,8,8 +.Lrot16: +.byte 0x2,0x3,0x0,0x1, 0x6,0x7,0x4,0x5, 0xa,0xb,0x8,0x9, 0xe,0xf,0xc,0xd +.Lrot24: +.byte 0x3,0x0,0x1,0x2, 0x7,0x4,0x5,0x6, 0xb,0x8,0x9,0xa, 0xf,0xc,0xd,0xe +.Ltwoy: +.long 2,0,0,0, 2,0,0,0 +.align 64 +.Lzeroz: +.long 0,0,0,0, 1,0,0,0, 2,0,0,0, 3,0,0,0 +.Lfourz: +.long 4,0,0,0, 4,0,0,0, 4,0,0,0, 4,0,0,0 +.Lincz: +.long 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 +.Lsixteen: +.long 16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16 +.Lsigma: +.byte 101,120,112,97,110,100,32,51,50,45,98,121,116,101,32,107,0 +.byte 67,104,97,67,104,97,50,48,32,102,111,114,32,120,56,54,95,54,52,44,32,67,82,89,80,84,79,71,65,77,83,32,98,121,32,60,97,112,112,114,111,64,111,112,101,110,115,115,108,46,111,114,103,62,0 +.globl ChaCha20_ctr32 +.type ChaCha20_ctr32,@function +.align 64 +ChaCha20_ctr32: +.cfi_startproc + cmpq $0,%rdx + je .Lno_data + movq OPENSSL_ia32cap_P+4(%rip),%r10 + btq $48,%r10 + jc .LChaCha20_avx512 + testq %r10,%r10 + js .LChaCha20_avx512vl + testl $512,%r10d + jnz .LChaCha20_ssse3 + + pushq %rbx +.cfi_adjust_cfa_offset 8 +.cfi_offset %rbx,-16 + pushq %rbp +.cfi_adjust_cfa_offset 8 +.cfi_offset %rbp,-24 + pushq %r12 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r12,-32 + pushq %r13 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r13,-40 + pushq %r14 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r14,-48 + pushq %r15 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r15,-56 + subq $64+24,%rsp +.cfi_adjust_cfa_offset 64+24 +.Lctr32_body: + + + movdqu (%rcx),%xmm1 + movdqu 16(%rcx),%xmm2 + movdqu (%r8),%xmm3 + movdqa .Lone(%rip),%xmm4 + + + movdqa %xmm1,16(%rsp) + movdqa %xmm2,32(%rsp) + movdqa %xmm3,48(%rsp) + movq %rdx,%rbp + jmp .Loop_outer + +.align 32 +.Loop_outer: + movl $0x61707865,%eax + movl $0x3320646e,%ebx + movl $0x79622d32,%ecx + movl $0x6b206574,%edx + movl 16(%rsp),%r8d + movl 20(%rsp),%r9d + movl 24(%rsp),%r10d + movl 28(%rsp),%r11d + movd %xmm3,%r12d + movl 52(%rsp),%r13d + movl 56(%rsp),%r14d + movl 60(%rsp),%r15d + + movq %rbp,64+0(%rsp) + movl $10,%ebp + movq %rsi,64+8(%rsp) +.byte 102,72,15,126,214 + movq %rdi,64+16(%rsp) + movq %rsi,%rdi + shrq $32,%rdi + jmp .Loop + +.align 32 +.Loop: + addl %r8d,%eax + xorl %eax,%r12d + roll $16,%r12d + addl %r9d,%ebx + xorl %ebx,%r13d + roll $16,%r13d + addl %r12d,%esi + xorl %esi,%r8d + roll $12,%r8d + addl %r13d,%edi + xorl %edi,%r9d + roll $12,%r9d + addl %r8d,%eax + xorl %eax,%r12d + roll $8,%r12d + addl %r9d,%ebx + xorl %ebx,%r13d + roll $8,%r13d + addl %r12d,%esi + xorl %esi,%r8d + roll $7,%r8d + addl %r13d,%edi + xorl %edi,%r9d + 
roll $7,%r9d + movl %esi,32(%rsp) + movl %edi,36(%rsp) + movl 40(%rsp),%esi + movl 44(%rsp),%edi + addl %r10d,%ecx + xorl %ecx,%r14d + roll $16,%r14d + addl %r11d,%edx + xorl %edx,%r15d + roll $16,%r15d + addl %r14d,%esi + xorl %esi,%r10d + roll $12,%r10d + addl %r15d,%edi + xorl %edi,%r11d + roll $12,%r11d + addl %r10d,%ecx + xorl %ecx,%r14d + roll $8,%r14d + addl %r11d,%edx + xorl %edx,%r15d + roll $8,%r15d + addl %r14d,%esi + xorl %esi,%r10d + roll $7,%r10d + addl %r15d,%edi + xorl %edi,%r11d + roll $7,%r11d + addl %r9d,%eax + xorl %eax,%r15d + roll $16,%r15d + addl %r10d,%ebx + xorl %ebx,%r12d + roll $16,%r12d + addl %r15d,%esi + xorl %esi,%r9d + roll $12,%r9d + addl %r12d,%edi + xorl %edi,%r10d + roll $12,%r10d + addl %r9d,%eax + xorl %eax,%r15d + roll $8,%r15d + addl %r10d,%ebx + xorl %ebx,%r12d + roll $8,%r12d + addl %r15d,%esi + xorl %esi,%r9d + roll $7,%r9d + addl %r12d,%edi + xorl %edi,%r10d + roll $7,%r10d + movl %esi,40(%rsp) + movl %edi,44(%rsp) + movl 32(%rsp),%esi + movl 36(%rsp),%edi + addl %r11d,%ecx + xorl %ecx,%r13d + roll $16,%r13d + addl %r8d,%edx + xorl %edx,%r14d + roll $16,%r14d + addl %r13d,%esi + xorl %esi,%r11d + roll $12,%r11d + addl %r14d,%edi + xorl %edi,%r8d + roll $12,%r8d + addl %r11d,%ecx + xorl %ecx,%r13d + roll $8,%r13d + addl %r8d,%edx + xorl %edx,%r14d + roll $8,%r14d + addl %r13d,%esi + xorl %esi,%r11d + roll $7,%r11d + addl %r14d,%edi + xorl %edi,%r8d + roll $7,%r8d + decl %ebp + jnz .Loop + movl %edi,36(%rsp) + movl %esi,32(%rsp) + movq 64(%rsp),%rbp + movdqa %xmm2,%xmm1 + movq 64+8(%rsp),%rsi + paddd %xmm4,%xmm3 + movq 64+16(%rsp),%rdi + + addl $0x61707865,%eax + addl $0x3320646e,%ebx + addl $0x79622d32,%ecx + addl $0x6b206574,%edx + addl 16(%rsp),%r8d + addl 20(%rsp),%r9d + addl 24(%rsp),%r10d + addl 28(%rsp),%r11d + addl 48(%rsp),%r12d + addl 52(%rsp),%r13d + addl 56(%rsp),%r14d + addl 60(%rsp),%r15d + paddd 32(%rsp),%xmm1 + + cmpq $64,%rbp + jb .Ltail + + xorl 0(%rsi),%eax + xorl 4(%rsi),%ebx + xorl 8(%rsi),%ecx + xorl 12(%rsi),%edx + xorl 16(%rsi),%r8d + xorl 20(%rsi),%r9d + xorl 24(%rsi),%r10d + xorl 28(%rsi),%r11d + movdqu 32(%rsi),%xmm0 + xorl 48(%rsi),%r12d + xorl 52(%rsi),%r13d + xorl 56(%rsi),%r14d + xorl 60(%rsi),%r15d + leaq 64(%rsi),%rsi + pxor %xmm1,%xmm0 + + movdqa %xmm2,32(%rsp) + movd %xmm3,48(%rsp) + + movl %eax,0(%rdi) + movl %ebx,4(%rdi) + movl %ecx,8(%rdi) + movl %edx,12(%rdi) + movl %r8d,16(%rdi) + movl %r9d,20(%rdi) + movl %r10d,24(%rdi) + movl %r11d,28(%rdi) + movdqu %xmm0,32(%rdi) + movl %r12d,48(%rdi) + movl %r13d,52(%rdi) + movl %r14d,56(%rdi) + movl %r15d,60(%rdi) + leaq 64(%rdi),%rdi + + subq $64,%rbp + jnz .Loop_outer + + jmp .Ldone + +.align 16 +.Ltail: + movl %eax,0(%rsp) + movl %ebx,4(%rsp) + xorq %rbx,%rbx + movl %ecx,8(%rsp) + movl %edx,12(%rsp) + movl %r8d,16(%rsp) + movl %r9d,20(%rsp) + movl %r10d,24(%rsp) + movl %r11d,28(%rsp) + movdqa %xmm1,32(%rsp) + movl %r12d,48(%rsp) + movl %r13d,52(%rsp) + movl %r14d,56(%rsp) + movl %r15d,60(%rsp) + +.Loop_tail: + movzbl (%rsi,%rbx,1),%eax + movzbl (%rsp,%rbx,1),%edx + leaq 1(%rbx),%rbx + xorl %edx,%eax + movb %al,-1(%rdi,%rbx,1) + decq %rbp + jnz .Loop_tail + +.Ldone: + leaq 64+24+48(%rsp),%rsi +.cfi_def_cfa %rsi,8 + movq -48(%rsi),%r15 +.cfi_restore %r15 + movq -40(%rsi),%r14 +.cfi_restore %r14 + movq -32(%rsi),%r13 +.cfi_restore %r13 + movq -24(%rsi),%r12 +.cfi_restore %r12 + movq -16(%rsi),%rbp +.cfi_restore %rbp + movq -8(%rsi),%rbx +.cfi_restore %rbx + leaq (%rsi),%rsp +.cfi_def_cfa_register %rsp +.Lno_data: + .byte 0xf3,0xc3 +.cfi_endproc +.size 
ChaCha20_ctr32,.-ChaCha20_ctr32 +.type ChaCha20_ssse3,@function +.align 32 +ChaCha20_ssse3: +.cfi_startproc +.LChaCha20_ssse3: + movq %rsp,%r9 +.cfi_def_cfa_register %r9 + testl $2048,%r10d + jnz .LChaCha20_4xop + cmpq $128,%rdx + je .LChaCha20_128 + ja .LChaCha20_4x + +.Ldo_sse3_after_all: + subq $64+8,%rsp + movdqa .Lsigma(%rip),%xmm0 + movdqu (%rcx),%xmm1 + movdqu 16(%rcx),%xmm2 + movdqu (%r8),%xmm3 + movdqa .Lrot16(%rip),%xmm6 + movdqa .Lrot24(%rip),%xmm7 + + movdqa %xmm0,0(%rsp) + movdqa %xmm1,16(%rsp) + movdqa %xmm2,32(%rsp) + movdqa %xmm3,48(%rsp) + movq $10,%r8 + jmp .Loop_ssse3 + +.align 32 +.Loop_outer_ssse3: + movdqa .Lone(%rip),%xmm3 + movdqa 0(%rsp),%xmm0 + movdqa 16(%rsp),%xmm1 + movdqa 32(%rsp),%xmm2 + paddd 48(%rsp),%xmm3 + movq $10,%r8 + movdqa %xmm3,48(%rsp) + jmp .Loop_ssse3 + +.align 32 +.Loop_ssse3: + paddd %xmm1,%xmm0 + pxor %xmm0,%xmm3 +.byte 102,15,56,0,222 + paddd %xmm3,%xmm2 + pxor %xmm2,%xmm1 + movdqa %xmm1,%xmm4 + psrld $20,%xmm1 + pslld $12,%xmm4 + por %xmm4,%xmm1 + paddd %xmm1,%xmm0 + pxor %xmm0,%xmm3 +.byte 102,15,56,0,223 + paddd %xmm3,%xmm2 + pxor %xmm2,%xmm1 + movdqa %xmm1,%xmm4 + psrld $25,%xmm1 + pslld $7,%xmm4 + por %xmm4,%xmm1 + pshufd $78,%xmm2,%xmm2 + pshufd $57,%xmm1,%xmm1 + pshufd $147,%xmm3,%xmm3 + nop + paddd %xmm1,%xmm0 + pxor %xmm0,%xmm3 +.byte 102,15,56,0,222 + paddd %xmm3,%xmm2 + pxor %xmm2,%xmm1 + movdqa %xmm1,%xmm4 + psrld $20,%xmm1 + pslld $12,%xmm4 + por %xmm4,%xmm1 + paddd %xmm1,%xmm0 + pxor %xmm0,%xmm3 +.byte 102,15,56,0,223 + paddd %xmm3,%xmm2 + pxor %xmm2,%xmm1 + movdqa %xmm1,%xmm4 + psrld $25,%xmm1 + pslld $7,%xmm4 + por %xmm4,%xmm1 + pshufd $78,%xmm2,%xmm2 + pshufd $147,%xmm1,%xmm1 + pshufd $57,%xmm3,%xmm3 + decq %r8 + jnz .Loop_ssse3 + paddd 0(%rsp),%xmm0 + paddd 16(%rsp),%xmm1 + paddd 32(%rsp),%xmm2 + paddd 48(%rsp),%xmm3 + + cmpq $64,%rdx + jb .Ltail_ssse3 + + movdqu 0(%rsi),%xmm4 + movdqu 16(%rsi),%xmm5 + pxor %xmm4,%xmm0 + movdqu 32(%rsi),%xmm4 + pxor %xmm5,%xmm1 + movdqu 48(%rsi),%xmm5 + leaq 64(%rsi),%rsi + pxor %xmm4,%xmm2 + pxor %xmm5,%xmm3 + + movdqu %xmm0,0(%rdi) + movdqu %xmm1,16(%rdi) + movdqu %xmm2,32(%rdi) + movdqu %xmm3,48(%rdi) + leaq 64(%rdi),%rdi + + subq $64,%rdx + jnz .Loop_outer_ssse3 + + jmp .Ldone_ssse3 + +.align 16 +.Ltail_ssse3: + movdqa %xmm0,0(%rsp) + movdqa %xmm1,16(%rsp) + movdqa %xmm2,32(%rsp) + movdqa %xmm3,48(%rsp) + xorq %r8,%r8 + +.Loop_tail_ssse3: + movzbl (%rsi,%r8,1),%eax + movzbl (%rsp,%r8,1),%ecx + leaq 1(%r8),%r8 + xorl %ecx,%eax + movb %al,-1(%rdi,%r8,1) + decq %rdx + jnz .Loop_tail_ssse3 + +.Ldone_ssse3: + leaq (%r9),%rsp +.cfi_def_cfa_register %rsp +.Lssse3_epilogue: + .byte 0xf3,0xc3 +.cfi_endproc +.size ChaCha20_ssse3,.-ChaCha20_ssse3 +.type ChaCha20_128,@function +.align 32 +ChaCha20_128: +.cfi_startproc +.LChaCha20_128: + movq %rsp,%r9 +.cfi_def_cfa_register %r9 + subq $64+8,%rsp + movdqa .Lsigma(%rip),%xmm8 + movdqu (%rcx),%xmm9 + movdqu 16(%rcx),%xmm2 + movdqu (%r8),%xmm3 + movdqa .Lone(%rip),%xmm1 + movdqa .Lrot16(%rip),%xmm6 + movdqa .Lrot24(%rip),%xmm7 + + movdqa %xmm8,%xmm10 + movdqa %xmm8,0(%rsp) + movdqa %xmm9,%xmm11 + movdqa %xmm9,16(%rsp) + movdqa %xmm2,%xmm0 + movdqa %xmm2,32(%rsp) + paddd %xmm3,%xmm1 + movdqa %xmm3,48(%rsp) + movq $10,%r8 + jmp .Loop_128 + +.align 32 +.Loop_128: + paddd %xmm9,%xmm8 + pxor %xmm8,%xmm3 + paddd %xmm11,%xmm10 + pxor %xmm10,%xmm1 +.byte 102,15,56,0,222 +.byte 102,15,56,0,206 + paddd %xmm3,%xmm2 + paddd %xmm1,%xmm0 + pxor %xmm2,%xmm9 + pxor %xmm0,%xmm11 + movdqa %xmm9,%xmm4 + psrld $20,%xmm9 + movdqa %xmm11,%xmm5 + pslld $12,%xmm4 + psrld 
$20,%xmm11 + por %xmm4,%xmm9 + pslld $12,%xmm5 + por %xmm5,%xmm11 + paddd %xmm9,%xmm8 + pxor %xmm8,%xmm3 + paddd %xmm11,%xmm10 + pxor %xmm10,%xmm1 +.byte 102,15,56,0,223 +.byte 102,15,56,0,207 + paddd %xmm3,%xmm2 + paddd %xmm1,%xmm0 + pxor %xmm2,%xmm9 + pxor %xmm0,%xmm11 + movdqa %xmm9,%xmm4 + psrld $25,%xmm9 + movdqa %xmm11,%xmm5 + pslld $7,%xmm4 + psrld $25,%xmm11 + por %xmm4,%xmm9 + pslld $7,%xmm5 + por %xmm5,%xmm11 + pshufd $78,%xmm2,%xmm2 + pshufd $57,%xmm9,%xmm9 + pshufd $147,%xmm3,%xmm3 + pshufd $78,%xmm0,%xmm0 + pshufd $57,%xmm11,%xmm11 + pshufd $147,%xmm1,%xmm1 + paddd %xmm9,%xmm8 + pxor %xmm8,%xmm3 + paddd %xmm11,%xmm10 + pxor %xmm10,%xmm1 +.byte 102,15,56,0,222 +.byte 102,15,56,0,206 + paddd %xmm3,%xmm2 + paddd %xmm1,%xmm0 + pxor %xmm2,%xmm9 + pxor %xmm0,%xmm11 + movdqa %xmm9,%xmm4 + psrld $20,%xmm9 + movdqa %xmm11,%xmm5 + pslld $12,%xmm4 + psrld $20,%xmm11 + por %xmm4,%xmm9 + pslld $12,%xmm5 + por %xmm5,%xmm11 + paddd %xmm9,%xmm8 + pxor %xmm8,%xmm3 + paddd %xmm11,%xmm10 + pxor %xmm10,%xmm1 +.byte 102,15,56,0,223 +.byte 102,15,56,0,207 + paddd %xmm3,%xmm2 + paddd %xmm1,%xmm0 + pxor %xmm2,%xmm9 + pxor %xmm0,%xmm11 + movdqa %xmm9,%xmm4 + psrld $25,%xmm9 + movdqa %xmm11,%xmm5 + pslld $7,%xmm4 + psrld $25,%xmm11 + por %xmm4,%xmm9 + pslld $7,%xmm5 + por %xmm5,%xmm11 + pshufd $78,%xmm2,%xmm2 + pshufd $147,%xmm9,%xmm9 + pshufd $57,%xmm3,%xmm3 + pshufd $78,%xmm0,%xmm0 + pshufd $147,%xmm11,%xmm11 + pshufd $57,%xmm1,%xmm1 + decq %r8 + jnz .Loop_128 + paddd 0(%rsp),%xmm8 + paddd 16(%rsp),%xmm9 + paddd 32(%rsp),%xmm2 + paddd 48(%rsp),%xmm3 + paddd .Lone(%rip),%xmm1 + paddd 0(%rsp),%xmm10 + paddd 16(%rsp),%xmm11 + paddd 32(%rsp),%xmm0 + paddd 48(%rsp),%xmm1 + + movdqu 0(%rsi),%xmm4 + movdqu 16(%rsi),%xmm5 + pxor %xmm4,%xmm8 + movdqu 32(%rsi),%xmm4 + pxor %xmm5,%xmm9 + movdqu 48(%rsi),%xmm5 + pxor %xmm4,%xmm2 + movdqu 64(%rsi),%xmm4 + pxor %xmm5,%xmm3 + movdqu 80(%rsi),%xmm5 + pxor %xmm4,%xmm10 + movdqu 96(%rsi),%xmm4 + pxor %xmm5,%xmm11 + movdqu 112(%rsi),%xmm5 + pxor %xmm4,%xmm0 + pxor %xmm5,%xmm1 + + movdqu %xmm8,0(%rdi) + movdqu %xmm9,16(%rdi) + movdqu %xmm2,32(%rdi) + movdqu %xmm3,48(%rdi) + movdqu %xmm10,64(%rdi) + movdqu %xmm11,80(%rdi) + movdqu %xmm0,96(%rdi) + movdqu %xmm1,112(%rdi) + leaq (%r9),%rsp +.cfi_def_cfa_register %rsp +.L128_epilogue: + .byte 0xf3,0xc3 +.cfi_endproc +.size ChaCha20_128,.-ChaCha20_128 +.type ChaCha20_4x,@function +.align 32 +ChaCha20_4x: +.cfi_startproc +.LChaCha20_4x: + movq %rsp,%r9 +.cfi_def_cfa_register %r9 + movq %r10,%r11 + shrq $32,%r10 + testq $32,%r10 + jnz .LChaCha20_8x + cmpq $192,%rdx + ja .Lproceed4x + + andq $71303168,%r11 + cmpq $4194304,%r11 + je .Ldo_sse3_after_all + +.Lproceed4x: + subq $0x140+8,%rsp + movdqa .Lsigma(%rip),%xmm11 + movdqu (%rcx),%xmm15 + movdqu 16(%rcx),%xmm7 + movdqu (%r8),%xmm3 + leaq 256(%rsp),%rcx + leaq .Lrot16(%rip),%r10 + leaq .Lrot24(%rip),%r11 + + pshufd $0x00,%xmm11,%xmm8 + pshufd $0x55,%xmm11,%xmm9 + movdqa %xmm8,64(%rsp) + pshufd $0xaa,%xmm11,%xmm10 + movdqa %xmm9,80(%rsp) + pshufd $0xff,%xmm11,%xmm11 + movdqa %xmm10,96(%rsp) + movdqa %xmm11,112(%rsp) + + pshufd $0x00,%xmm15,%xmm12 + pshufd $0x55,%xmm15,%xmm13 + movdqa %xmm12,128-256(%rcx) + pshufd $0xaa,%xmm15,%xmm14 + movdqa %xmm13,144-256(%rcx) + pshufd $0xff,%xmm15,%xmm15 + movdqa %xmm14,160-256(%rcx) + movdqa %xmm15,176-256(%rcx) + + pshufd $0x00,%xmm7,%xmm4 + pshufd $0x55,%xmm7,%xmm5 + movdqa %xmm4,192-256(%rcx) + pshufd $0xaa,%xmm7,%xmm6 + movdqa %xmm5,208-256(%rcx) + pshufd $0xff,%xmm7,%xmm7 + movdqa %xmm6,224-256(%rcx) + movdqa %xmm7,240-256(%rcx) + + 
pshufd $0x00,%xmm3,%xmm0 + pshufd $0x55,%xmm3,%xmm1 + paddd .Linc(%rip),%xmm0 + pshufd $0xaa,%xmm3,%xmm2 + movdqa %xmm1,272-256(%rcx) + pshufd $0xff,%xmm3,%xmm3 + movdqa %xmm2,288-256(%rcx) + movdqa %xmm3,304-256(%rcx) + + jmp .Loop_enter4x + +.align 32 +.Loop_outer4x: + movdqa 64(%rsp),%xmm8 + movdqa 80(%rsp),%xmm9 + movdqa 96(%rsp),%xmm10 + movdqa 112(%rsp),%xmm11 + movdqa 128-256(%rcx),%xmm12 + movdqa 144-256(%rcx),%xmm13 + movdqa 160-256(%rcx),%xmm14 + movdqa 176-256(%rcx),%xmm15 + movdqa 192-256(%rcx),%xmm4 + movdqa 208-256(%rcx),%xmm5 + movdqa 224-256(%rcx),%xmm6 + movdqa 240-256(%rcx),%xmm7 + movdqa 256-256(%rcx),%xmm0 + movdqa 272-256(%rcx),%xmm1 + movdqa 288-256(%rcx),%xmm2 + movdqa 304-256(%rcx),%xmm3 + paddd .Lfour(%rip),%xmm0 + +.Loop_enter4x: + movdqa %xmm6,32(%rsp) + movdqa %xmm7,48(%rsp) + movdqa (%r10),%xmm7 + movl $10,%eax + movdqa %xmm0,256-256(%rcx) + jmp .Loop4x + +.align 32 +.Loop4x: + paddd %xmm12,%xmm8 + paddd %xmm13,%xmm9 + pxor %xmm8,%xmm0 + pxor %xmm9,%xmm1 +.byte 102,15,56,0,199 +.byte 102,15,56,0,207 + paddd %xmm0,%xmm4 + paddd %xmm1,%xmm5 + pxor %xmm4,%xmm12 + pxor %xmm5,%xmm13 + movdqa %xmm12,%xmm6 + pslld $12,%xmm12 + psrld $20,%xmm6 + movdqa %xmm13,%xmm7 + pslld $12,%xmm13 + por %xmm6,%xmm12 + psrld $20,%xmm7 + movdqa (%r11),%xmm6 + por %xmm7,%xmm13 + paddd %xmm12,%xmm8 + paddd %xmm13,%xmm9 + pxor %xmm8,%xmm0 + pxor %xmm9,%xmm1 +.byte 102,15,56,0,198 +.byte 102,15,56,0,206 + paddd %xmm0,%xmm4 + paddd %xmm1,%xmm5 + pxor %xmm4,%xmm12 + pxor %xmm5,%xmm13 + movdqa %xmm12,%xmm7 + pslld $7,%xmm12 + psrld $25,%xmm7 + movdqa %xmm13,%xmm6 + pslld $7,%xmm13 + por %xmm7,%xmm12 + psrld $25,%xmm6 + movdqa (%r10),%xmm7 + por %xmm6,%xmm13 + movdqa %xmm4,0(%rsp) + movdqa %xmm5,16(%rsp) + movdqa 32(%rsp),%xmm4 + movdqa 48(%rsp),%xmm5 + paddd %xmm14,%xmm10 + paddd %xmm15,%xmm11 + pxor %xmm10,%xmm2 + pxor %xmm11,%xmm3 +.byte 102,15,56,0,215 +.byte 102,15,56,0,223 + paddd %xmm2,%xmm4 + paddd %xmm3,%xmm5 + pxor %xmm4,%xmm14 + pxor %xmm5,%xmm15 + movdqa %xmm14,%xmm6 + pslld $12,%xmm14 + psrld $20,%xmm6 + movdqa %xmm15,%xmm7 + pslld $12,%xmm15 + por %xmm6,%xmm14 + psrld $20,%xmm7 + movdqa (%r11),%xmm6 + por %xmm7,%xmm15 + paddd %xmm14,%xmm10 + paddd %xmm15,%xmm11 + pxor %xmm10,%xmm2 + pxor %xmm11,%xmm3 +.byte 102,15,56,0,214 +.byte 102,15,56,0,222 + paddd %xmm2,%xmm4 + paddd %xmm3,%xmm5 + pxor %xmm4,%xmm14 + pxor %xmm5,%xmm15 + movdqa %xmm14,%xmm7 + pslld $7,%xmm14 + psrld $25,%xmm7 + movdqa %xmm15,%xmm6 + pslld $7,%xmm15 + por %xmm7,%xmm14 + psrld $25,%xmm6 + movdqa (%r10),%xmm7 + por %xmm6,%xmm15 + paddd %xmm13,%xmm8 + paddd %xmm14,%xmm9 + pxor %xmm8,%xmm3 + pxor %xmm9,%xmm0 +.byte 102,15,56,0,223 +.byte 102,15,56,0,199 + paddd %xmm3,%xmm4 + paddd %xmm0,%xmm5 + pxor %xmm4,%xmm13 + pxor %xmm5,%xmm14 + movdqa %xmm13,%xmm6 + pslld $12,%xmm13 + psrld $20,%xmm6 + movdqa %xmm14,%xmm7 + pslld $12,%xmm14 + por %xmm6,%xmm13 + psrld $20,%xmm7 + movdqa (%r11),%xmm6 + por %xmm7,%xmm14 + paddd %xmm13,%xmm8 + paddd %xmm14,%xmm9 + pxor %xmm8,%xmm3 + pxor %xmm9,%xmm0 +.byte 102,15,56,0,222 +.byte 102,15,56,0,198 + paddd %xmm3,%xmm4 + paddd %xmm0,%xmm5 + pxor %xmm4,%xmm13 + pxor %xmm5,%xmm14 + movdqa %xmm13,%xmm7 + pslld $7,%xmm13 + psrld $25,%xmm7 + movdqa %xmm14,%xmm6 + pslld $7,%xmm14 + por %xmm7,%xmm13 + psrld $25,%xmm6 + movdqa (%r10),%xmm7 + por %xmm6,%xmm14 + movdqa %xmm4,32(%rsp) + movdqa %xmm5,48(%rsp) + movdqa 0(%rsp),%xmm4 + movdqa 16(%rsp),%xmm5 + paddd %xmm15,%xmm10 + paddd %xmm12,%xmm11 + pxor %xmm10,%xmm1 + pxor %xmm11,%xmm2 +.byte 102,15,56,0,207 +.byte 102,15,56,0,215 + paddd 
%xmm1,%xmm4 + paddd %xmm2,%xmm5 + pxor %xmm4,%xmm15 + pxor %xmm5,%xmm12 + movdqa %xmm15,%xmm6 + pslld $12,%xmm15 + psrld $20,%xmm6 + movdqa %xmm12,%xmm7 + pslld $12,%xmm12 + por %xmm6,%xmm15 + psrld $20,%xmm7 + movdqa (%r11),%xmm6 + por %xmm7,%xmm12 + paddd %xmm15,%xmm10 + paddd %xmm12,%xmm11 + pxor %xmm10,%xmm1 + pxor %xmm11,%xmm2 +.byte 102,15,56,0,206 +.byte 102,15,56,0,214 + paddd %xmm1,%xmm4 + paddd %xmm2,%xmm5 + pxor %xmm4,%xmm15 + pxor %xmm5,%xmm12 + movdqa %xmm15,%xmm7 + pslld $7,%xmm15 + psrld $25,%xmm7 + movdqa %xmm12,%xmm6 + pslld $7,%xmm12 + por %xmm7,%xmm15 + psrld $25,%xmm6 + movdqa (%r10),%xmm7 + por %xmm6,%xmm12 + decl %eax + jnz .Loop4x + + paddd 64(%rsp),%xmm8 + paddd 80(%rsp),%xmm9 + paddd 96(%rsp),%xmm10 + paddd 112(%rsp),%xmm11 + + movdqa %xmm8,%xmm6 + punpckldq %xmm9,%xmm8 + movdqa %xmm10,%xmm7 + punpckldq %xmm11,%xmm10 + punpckhdq %xmm9,%xmm6 + punpckhdq %xmm11,%xmm7 + movdqa %xmm8,%xmm9 + punpcklqdq %xmm10,%xmm8 + movdqa %xmm6,%xmm11 + punpcklqdq %xmm7,%xmm6 + punpckhqdq %xmm10,%xmm9 + punpckhqdq %xmm7,%xmm11 + paddd 128-256(%rcx),%xmm12 + paddd 144-256(%rcx),%xmm13 + paddd 160-256(%rcx),%xmm14 + paddd 176-256(%rcx),%xmm15 + + movdqa %xmm8,0(%rsp) + movdqa %xmm9,16(%rsp) + movdqa 32(%rsp),%xmm8 + movdqa 48(%rsp),%xmm9 + + movdqa %xmm12,%xmm10 + punpckldq %xmm13,%xmm12 + movdqa %xmm14,%xmm7 + punpckldq %xmm15,%xmm14 + punpckhdq %xmm13,%xmm10 + punpckhdq %xmm15,%xmm7 + movdqa %xmm12,%xmm13 + punpcklqdq %xmm14,%xmm12 + movdqa %xmm10,%xmm15 + punpcklqdq %xmm7,%xmm10 + punpckhqdq %xmm14,%xmm13 + punpckhqdq %xmm7,%xmm15 + paddd 192-256(%rcx),%xmm4 + paddd 208-256(%rcx),%xmm5 + paddd 224-256(%rcx),%xmm8 + paddd 240-256(%rcx),%xmm9 + + movdqa %xmm6,32(%rsp) + movdqa %xmm11,48(%rsp) + + movdqa %xmm4,%xmm14 + punpckldq %xmm5,%xmm4 + movdqa %xmm8,%xmm7 + punpckldq %xmm9,%xmm8 + punpckhdq %xmm5,%xmm14 + punpckhdq %xmm9,%xmm7 + movdqa %xmm4,%xmm5 + punpcklqdq %xmm8,%xmm4 + movdqa %xmm14,%xmm9 + punpcklqdq %xmm7,%xmm14 + punpckhqdq %xmm8,%xmm5 + punpckhqdq %xmm7,%xmm9 + paddd 256-256(%rcx),%xmm0 + paddd 272-256(%rcx),%xmm1 + paddd 288-256(%rcx),%xmm2 + paddd 304-256(%rcx),%xmm3 + + movdqa %xmm0,%xmm8 + punpckldq %xmm1,%xmm0 + movdqa %xmm2,%xmm7 + punpckldq %xmm3,%xmm2 + punpckhdq %xmm1,%xmm8 + punpckhdq %xmm3,%xmm7 + movdqa %xmm0,%xmm1 + punpcklqdq %xmm2,%xmm0 + movdqa %xmm8,%xmm3 + punpcklqdq %xmm7,%xmm8 + punpckhqdq %xmm2,%xmm1 + punpckhqdq %xmm7,%xmm3 + cmpq $256,%rdx + jb .Ltail4x + + movdqu 0(%rsi),%xmm6 + movdqu 16(%rsi),%xmm11 + movdqu 32(%rsi),%xmm2 + movdqu 48(%rsi),%xmm7 + pxor 0(%rsp),%xmm6 + pxor %xmm12,%xmm11 + pxor %xmm4,%xmm2 + pxor %xmm0,%xmm7 + + movdqu %xmm6,0(%rdi) + movdqu 64(%rsi),%xmm6 + movdqu %xmm11,16(%rdi) + movdqu 80(%rsi),%xmm11 + movdqu %xmm2,32(%rdi) + movdqu 96(%rsi),%xmm2 + movdqu %xmm7,48(%rdi) + movdqu 112(%rsi),%xmm7 + leaq 128(%rsi),%rsi + pxor 16(%rsp),%xmm6 + pxor %xmm13,%xmm11 + pxor %xmm5,%xmm2 + pxor %xmm1,%xmm7 + + movdqu %xmm6,64(%rdi) + movdqu 0(%rsi),%xmm6 + movdqu %xmm11,80(%rdi) + movdqu 16(%rsi),%xmm11 + movdqu %xmm2,96(%rdi) + movdqu 32(%rsi),%xmm2 + movdqu %xmm7,112(%rdi) + leaq 128(%rdi),%rdi + movdqu 48(%rsi),%xmm7 + pxor 32(%rsp),%xmm6 + pxor %xmm10,%xmm11 + pxor %xmm14,%xmm2 + pxor %xmm8,%xmm7 + + movdqu %xmm6,0(%rdi) + movdqu 64(%rsi),%xmm6 + movdqu %xmm11,16(%rdi) + movdqu 80(%rsi),%xmm11 + movdqu %xmm2,32(%rdi) + movdqu 96(%rsi),%xmm2 + movdqu %xmm7,48(%rdi) + movdqu 112(%rsi),%xmm7 + leaq 128(%rsi),%rsi + pxor 48(%rsp),%xmm6 + pxor %xmm15,%xmm11 + pxor %xmm9,%xmm2 + pxor %xmm3,%xmm7 + movdqu %xmm6,64(%rdi) + movdqu 
%xmm11,80(%rdi) + movdqu %xmm2,96(%rdi) + movdqu %xmm7,112(%rdi) + leaq 128(%rdi),%rdi + + subq $256,%rdx + jnz .Loop_outer4x + + jmp .Ldone4x + +.Ltail4x: + cmpq $192,%rdx + jae .L192_or_more4x + cmpq $128,%rdx + jae .L128_or_more4x + cmpq $64,%rdx + jae .L64_or_more4x + + + xorq %r10,%r10 + + movdqa %xmm12,16(%rsp) + movdqa %xmm4,32(%rsp) + movdqa %xmm0,48(%rsp) + jmp .Loop_tail4x + +.align 32 +.L64_or_more4x: + movdqu 0(%rsi),%xmm6 + movdqu 16(%rsi),%xmm11 + movdqu 32(%rsi),%xmm2 + movdqu 48(%rsi),%xmm7 + pxor 0(%rsp),%xmm6 + pxor %xmm12,%xmm11 + pxor %xmm4,%xmm2 + pxor %xmm0,%xmm7 + movdqu %xmm6,0(%rdi) + movdqu %xmm11,16(%rdi) + movdqu %xmm2,32(%rdi) + movdqu %xmm7,48(%rdi) + je .Ldone4x + + movdqa 16(%rsp),%xmm6 + leaq 64(%rsi),%rsi + xorq %r10,%r10 + movdqa %xmm6,0(%rsp) + movdqa %xmm13,16(%rsp) + leaq 64(%rdi),%rdi + movdqa %xmm5,32(%rsp) + subq $64,%rdx + movdqa %xmm1,48(%rsp) + jmp .Loop_tail4x + +.align 32 +.L128_or_more4x: + movdqu 0(%rsi),%xmm6 + movdqu 16(%rsi),%xmm11 + movdqu 32(%rsi),%xmm2 + movdqu 48(%rsi),%xmm7 + pxor 0(%rsp),%xmm6 + pxor %xmm12,%xmm11 + pxor %xmm4,%xmm2 + pxor %xmm0,%xmm7 + + movdqu %xmm6,0(%rdi) + movdqu 64(%rsi),%xmm6 + movdqu %xmm11,16(%rdi) + movdqu 80(%rsi),%xmm11 + movdqu %xmm2,32(%rdi) + movdqu 96(%rsi),%xmm2 + movdqu %xmm7,48(%rdi) + movdqu 112(%rsi),%xmm7 + pxor 16(%rsp),%xmm6 + pxor %xmm13,%xmm11 + pxor %xmm5,%xmm2 + pxor %xmm1,%xmm7 + movdqu %xmm6,64(%rdi) + movdqu %xmm11,80(%rdi) + movdqu %xmm2,96(%rdi) + movdqu %xmm7,112(%rdi) + je .Ldone4x + + movdqa 32(%rsp),%xmm6 + leaq 128(%rsi),%rsi + xorq %r10,%r10 + movdqa %xmm6,0(%rsp) + movdqa %xmm10,16(%rsp) + leaq 128(%rdi),%rdi + movdqa %xmm14,32(%rsp) + subq $128,%rdx + movdqa %xmm8,48(%rsp) + jmp .Loop_tail4x + +.align 32 +.L192_or_more4x: + movdqu 0(%rsi),%xmm6 + movdqu 16(%rsi),%xmm11 + movdqu 32(%rsi),%xmm2 + movdqu 48(%rsi),%xmm7 + pxor 0(%rsp),%xmm6 + pxor %xmm12,%xmm11 + pxor %xmm4,%xmm2 + pxor %xmm0,%xmm7 + + movdqu %xmm6,0(%rdi) + movdqu 64(%rsi),%xmm6 + movdqu %xmm11,16(%rdi) + movdqu 80(%rsi),%xmm11 + movdqu %xmm2,32(%rdi) + movdqu 96(%rsi),%xmm2 + movdqu %xmm7,48(%rdi) + movdqu 112(%rsi),%xmm7 + leaq 128(%rsi),%rsi + pxor 16(%rsp),%xmm6 + pxor %xmm13,%xmm11 + pxor %xmm5,%xmm2 + pxor %xmm1,%xmm7 + + movdqu %xmm6,64(%rdi) + movdqu 0(%rsi),%xmm6 + movdqu %xmm11,80(%rdi) + movdqu 16(%rsi),%xmm11 + movdqu %xmm2,96(%rdi) + movdqu 32(%rsi),%xmm2 + movdqu %xmm7,112(%rdi) + leaq 128(%rdi),%rdi + movdqu 48(%rsi),%xmm7 + pxor 32(%rsp),%xmm6 + pxor %xmm10,%xmm11 + pxor %xmm14,%xmm2 + pxor %xmm8,%xmm7 + movdqu %xmm6,0(%rdi) + movdqu %xmm11,16(%rdi) + movdqu %xmm2,32(%rdi) + movdqu %xmm7,48(%rdi) + je .Ldone4x + + movdqa 48(%rsp),%xmm6 + leaq 64(%rsi),%rsi + xorq %r10,%r10 + movdqa %xmm6,0(%rsp) + movdqa %xmm15,16(%rsp) + leaq 64(%rdi),%rdi + movdqa %xmm9,32(%rsp) + subq $192,%rdx + movdqa %xmm3,48(%rsp) + +.Loop_tail4x: + movzbl (%rsi,%r10,1),%eax + movzbl (%rsp,%r10,1),%ecx + leaq 1(%r10),%r10 + xorl %ecx,%eax + movb %al,-1(%rdi,%r10,1) + decq %rdx + jnz .Loop_tail4x + +.Ldone4x: + leaq (%r9),%rsp +.cfi_def_cfa_register %rsp +.L4x_epilogue: + .byte 0xf3,0xc3 +.cfi_endproc +.size ChaCha20_4x,.-ChaCha20_4x +.type ChaCha20_4xop,@function +.align 32 +ChaCha20_4xop: +.cfi_startproc +.LChaCha20_4xop: + movq %rsp,%r9 +.cfi_def_cfa_register %r9 + subq $0x140+8,%rsp + vzeroupper + + vmovdqa .Lsigma(%rip),%xmm11 + vmovdqu (%rcx),%xmm3 + vmovdqu 16(%rcx),%xmm15 + vmovdqu (%r8),%xmm7 + leaq 256(%rsp),%rcx + + vpshufd $0x00,%xmm11,%xmm8 + vpshufd $0x55,%xmm11,%xmm9 + vmovdqa %xmm8,64(%rsp) + vpshufd 
$0xaa,%xmm11,%xmm10 + vmovdqa %xmm9,80(%rsp) + vpshufd $0xff,%xmm11,%xmm11 + vmovdqa %xmm10,96(%rsp) + vmovdqa %xmm11,112(%rsp) + + vpshufd $0x00,%xmm3,%xmm0 + vpshufd $0x55,%xmm3,%xmm1 + vmovdqa %xmm0,128-256(%rcx) + vpshufd $0xaa,%xmm3,%xmm2 + vmovdqa %xmm1,144-256(%rcx) + vpshufd $0xff,%xmm3,%xmm3 + vmovdqa %xmm2,160-256(%rcx) + vmovdqa %xmm3,176-256(%rcx) + + vpshufd $0x00,%xmm15,%xmm12 + vpshufd $0x55,%xmm15,%xmm13 + vmovdqa %xmm12,192-256(%rcx) + vpshufd $0xaa,%xmm15,%xmm14 + vmovdqa %xmm13,208-256(%rcx) + vpshufd $0xff,%xmm15,%xmm15 + vmovdqa %xmm14,224-256(%rcx) + vmovdqa %xmm15,240-256(%rcx) + + vpshufd $0x00,%xmm7,%xmm4 + vpshufd $0x55,%xmm7,%xmm5 + vpaddd .Linc(%rip),%xmm4,%xmm4 + vpshufd $0xaa,%xmm7,%xmm6 + vmovdqa %xmm5,272-256(%rcx) + vpshufd $0xff,%xmm7,%xmm7 + vmovdqa %xmm6,288-256(%rcx) + vmovdqa %xmm7,304-256(%rcx) + + jmp .Loop_enter4xop + +.align 32 +.Loop_outer4xop: + vmovdqa 64(%rsp),%xmm8 + vmovdqa 80(%rsp),%xmm9 + vmovdqa 96(%rsp),%xmm10 + vmovdqa 112(%rsp),%xmm11 + vmovdqa 128-256(%rcx),%xmm0 + vmovdqa 144-256(%rcx),%xmm1 + vmovdqa 160-256(%rcx),%xmm2 + vmovdqa 176-256(%rcx),%xmm3 + vmovdqa 192-256(%rcx),%xmm12 + vmovdqa 208-256(%rcx),%xmm13 + vmovdqa 224-256(%rcx),%xmm14 + vmovdqa 240-256(%rcx),%xmm15 + vmovdqa 256-256(%rcx),%xmm4 + vmovdqa 272-256(%rcx),%xmm5 + vmovdqa 288-256(%rcx),%xmm6 + vmovdqa 304-256(%rcx),%xmm7 + vpaddd .Lfour(%rip),%xmm4,%xmm4 + +.Loop_enter4xop: + movl $10,%eax + vmovdqa %xmm4,256-256(%rcx) + jmp .Loop4xop + +.align 32 +.Loop4xop: + vpaddd %xmm0,%xmm8,%xmm8 + vpaddd %xmm1,%xmm9,%xmm9 + vpaddd %xmm2,%xmm10,%xmm10 + vpaddd %xmm3,%xmm11,%xmm11 + vpxor %xmm4,%xmm8,%xmm4 + vpxor %xmm5,%xmm9,%xmm5 + vpxor %xmm6,%xmm10,%xmm6 + vpxor %xmm7,%xmm11,%xmm7 +.byte 143,232,120,194,228,16 +.byte 143,232,120,194,237,16 +.byte 143,232,120,194,246,16 +.byte 143,232,120,194,255,16 + vpaddd %xmm4,%xmm12,%xmm12 + vpaddd %xmm5,%xmm13,%xmm13 + vpaddd %xmm6,%xmm14,%xmm14 + vpaddd %xmm7,%xmm15,%xmm15 + vpxor %xmm0,%xmm12,%xmm0 + vpxor %xmm1,%xmm13,%xmm1 + vpxor %xmm14,%xmm2,%xmm2 + vpxor %xmm15,%xmm3,%xmm3 +.byte 143,232,120,194,192,12 +.byte 143,232,120,194,201,12 +.byte 143,232,120,194,210,12 +.byte 143,232,120,194,219,12 + vpaddd %xmm8,%xmm0,%xmm8 + vpaddd %xmm9,%xmm1,%xmm9 + vpaddd %xmm2,%xmm10,%xmm10 + vpaddd %xmm3,%xmm11,%xmm11 + vpxor %xmm4,%xmm8,%xmm4 + vpxor %xmm5,%xmm9,%xmm5 + vpxor %xmm6,%xmm10,%xmm6 + vpxor %xmm7,%xmm11,%xmm7 +.byte 143,232,120,194,228,8 +.byte 143,232,120,194,237,8 +.byte 143,232,120,194,246,8 +.byte 143,232,120,194,255,8 + vpaddd %xmm4,%xmm12,%xmm12 + vpaddd %xmm5,%xmm13,%xmm13 + vpaddd %xmm6,%xmm14,%xmm14 + vpaddd %xmm7,%xmm15,%xmm15 + vpxor %xmm0,%xmm12,%xmm0 + vpxor %xmm1,%xmm13,%xmm1 + vpxor %xmm14,%xmm2,%xmm2 + vpxor %xmm15,%xmm3,%xmm3 +.byte 143,232,120,194,192,7 +.byte 143,232,120,194,201,7 +.byte 143,232,120,194,210,7 +.byte 143,232,120,194,219,7 + vpaddd %xmm1,%xmm8,%xmm8 + vpaddd %xmm2,%xmm9,%xmm9 + vpaddd %xmm3,%xmm10,%xmm10 + vpaddd %xmm0,%xmm11,%xmm11 + vpxor %xmm7,%xmm8,%xmm7 + vpxor %xmm4,%xmm9,%xmm4 + vpxor %xmm5,%xmm10,%xmm5 + vpxor %xmm6,%xmm11,%xmm6 +.byte 143,232,120,194,255,16 +.byte 143,232,120,194,228,16 +.byte 143,232,120,194,237,16 +.byte 143,232,120,194,246,16 + vpaddd %xmm7,%xmm14,%xmm14 + vpaddd %xmm4,%xmm15,%xmm15 + vpaddd %xmm5,%xmm12,%xmm12 + vpaddd %xmm6,%xmm13,%xmm13 + vpxor %xmm1,%xmm14,%xmm1 + vpxor %xmm2,%xmm15,%xmm2 + vpxor %xmm12,%xmm3,%xmm3 + vpxor %xmm13,%xmm0,%xmm0 +.byte 143,232,120,194,201,12 +.byte 143,232,120,194,210,12 +.byte 143,232,120,194,219,12 +.byte 143,232,120,194,192,12 + vpaddd 
%xmm8,%xmm1,%xmm8 + vpaddd %xmm9,%xmm2,%xmm9 + vpaddd %xmm3,%xmm10,%xmm10 + vpaddd %xmm0,%xmm11,%xmm11 + vpxor %xmm7,%xmm8,%xmm7 + vpxor %xmm4,%xmm9,%xmm4 + vpxor %xmm5,%xmm10,%xmm5 + vpxor %xmm6,%xmm11,%xmm6 +.byte 143,232,120,194,255,8 +.byte 143,232,120,194,228,8 +.byte 143,232,120,194,237,8 +.byte 143,232,120,194,246,8 + vpaddd %xmm7,%xmm14,%xmm14 + vpaddd %xmm4,%xmm15,%xmm15 + vpaddd %xmm5,%xmm12,%xmm12 + vpaddd %xmm6,%xmm13,%xmm13 + vpxor %xmm1,%xmm14,%xmm1 + vpxor %xmm2,%xmm15,%xmm2 + vpxor %xmm12,%xmm3,%xmm3 + vpxor %xmm13,%xmm0,%xmm0 +.byte 143,232,120,194,201,7 +.byte 143,232,120,194,210,7 +.byte 143,232,120,194,219,7 +.byte 143,232,120,194,192,7 + decl %eax + jnz .Loop4xop + + vpaddd 64(%rsp),%xmm8,%xmm8 + vpaddd 80(%rsp),%xmm9,%xmm9 + vpaddd 96(%rsp),%xmm10,%xmm10 + vpaddd 112(%rsp),%xmm11,%xmm11 + + vmovdqa %xmm14,32(%rsp) + vmovdqa %xmm15,48(%rsp) + + vpunpckldq %xmm9,%xmm8,%xmm14 + vpunpckldq %xmm11,%xmm10,%xmm15 + vpunpckhdq %xmm9,%xmm8,%xmm8 + vpunpckhdq %xmm11,%xmm10,%xmm10 + vpunpcklqdq %xmm15,%xmm14,%xmm9 + vpunpckhqdq %xmm15,%xmm14,%xmm14 + vpunpcklqdq %xmm10,%xmm8,%xmm11 + vpunpckhqdq %xmm10,%xmm8,%xmm8 + vpaddd 128-256(%rcx),%xmm0,%xmm0 + vpaddd 144-256(%rcx),%xmm1,%xmm1 + vpaddd 160-256(%rcx),%xmm2,%xmm2 + vpaddd 176-256(%rcx),%xmm3,%xmm3 + + vmovdqa %xmm9,0(%rsp) + vmovdqa %xmm14,16(%rsp) + vmovdqa 32(%rsp),%xmm9 + vmovdqa 48(%rsp),%xmm14 + + vpunpckldq %xmm1,%xmm0,%xmm10 + vpunpckldq %xmm3,%xmm2,%xmm15 + vpunpckhdq %xmm1,%xmm0,%xmm0 + vpunpckhdq %xmm3,%xmm2,%xmm2 + vpunpcklqdq %xmm15,%xmm10,%xmm1 + vpunpckhqdq %xmm15,%xmm10,%xmm10 + vpunpcklqdq %xmm2,%xmm0,%xmm3 + vpunpckhqdq %xmm2,%xmm0,%xmm0 + vpaddd 192-256(%rcx),%xmm12,%xmm12 + vpaddd 208-256(%rcx),%xmm13,%xmm13 + vpaddd 224-256(%rcx),%xmm9,%xmm9 + vpaddd 240-256(%rcx),%xmm14,%xmm14 + + vpunpckldq %xmm13,%xmm12,%xmm2 + vpunpckldq %xmm14,%xmm9,%xmm15 + vpunpckhdq %xmm13,%xmm12,%xmm12 + vpunpckhdq %xmm14,%xmm9,%xmm9 + vpunpcklqdq %xmm15,%xmm2,%xmm13 + vpunpckhqdq %xmm15,%xmm2,%xmm2 + vpunpcklqdq %xmm9,%xmm12,%xmm14 + vpunpckhqdq %xmm9,%xmm12,%xmm12 + vpaddd 256-256(%rcx),%xmm4,%xmm4 + vpaddd 272-256(%rcx),%xmm5,%xmm5 + vpaddd 288-256(%rcx),%xmm6,%xmm6 + vpaddd 304-256(%rcx),%xmm7,%xmm7 + + vpunpckldq %xmm5,%xmm4,%xmm9 + vpunpckldq %xmm7,%xmm6,%xmm15 + vpunpckhdq %xmm5,%xmm4,%xmm4 + vpunpckhdq %xmm7,%xmm6,%xmm6 + vpunpcklqdq %xmm15,%xmm9,%xmm5 + vpunpckhqdq %xmm15,%xmm9,%xmm9 + vpunpcklqdq %xmm6,%xmm4,%xmm7 + vpunpckhqdq %xmm6,%xmm4,%xmm4 + vmovdqa 0(%rsp),%xmm6 + vmovdqa 16(%rsp),%xmm15 + + cmpq $256,%rdx + jb .Ltail4xop + + vpxor 0(%rsi),%xmm6,%xmm6 + vpxor 16(%rsi),%xmm1,%xmm1 + vpxor 32(%rsi),%xmm13,%xmm13 + vpxor 48(%rsi),%xmm5,%xmm5 + vpxor 64(%rsi),%xmm15,%xmm15 + vpxor 80(%rsi),%xmm10,%xmm10 + vpxor 96(%rsi),%xmm2,%xmm2 + vpxor 112(%rsi),%xmm9,%xmm9 + leaq 128(%rsi),%rsi + vpxor 0(%rsi),%xmm11,%xmm11 + vpxor 16(%rsi),%xmm3,%xmm3 + vpxor 32(%rsi),%xmm14,%xmm14 + vpxor 48(%rsi),%xmm7,%xmm7 + vpxor 64(%rsi),%xmm8,%xmm8 + vpxor 80(%rsi),%xmm0,%xmm0 + vpxor 96(%rsi),%xmm12,%xmm12 + vpxor 112(%rsi),%xmm4,%xmm4 + leaq 128(%rsi),%rsi + + vmovdqu %xmm6,0(%rdi) + vmovdqu %xmm1,16(%rdi) + vmovdqu %xmm13,32(%rdi) + vmovdqu %xmm5,48(%rdi) + vmovdqu %xmm15,64(%rdi) + vmovdqu %xmm10,80(%rdi) + vmovdqu %xmm2,96(%rdi) + vmovdqu %xmm9,112(%rdi) + leaq 128(%rdi),%rdi + vmovdqu %xmm11,0(%rdi) + vmovdqu %xmm3,16(%rdi) + vmovdqu %xmm14,32(%rdi) + vmovdqu %xmm7,48(%rdi) + vmovdqu %xmm8,64(%rdi) + vmovdqu %xmm0,80(%rdi) + vmovdqu %xmm12,96(%rdi) + vmovdqu %xmm4,112(%rdi) + leaq 128(%rdi),%rdi + + subq $256,%rdx + jnz 
.Loop_outer4xop + + jmp .Ldone4xop + +.align 32 +.Ltail4xop: + cmpq $192,%rdx + jae .L192_or_more4xop + cmpq $128,%rdx + jae .L128_or_more4xop + cmpq $64,%rdx + jae .L64_or_more4xop + + xorq %r10,%r10 + vmovdqa %xmm6,0(%rsp) + vmovdqa %xmm1,16(%rsp) + vmovdqa %xmm13,32(%rsp) + vmovdqa %xmm5,48(%rsp) + jmp .Loop_tail4xop + +.align 32 +.L64_or_more4xop: + vpxor 0(%rsi),%xmm6,%xmm6 + vpxor 16(%rsi),%xmm1,%xmm1 + vpxor 32(%rsi),%xmm13,%xmm13 + vpxor 48(%rsi),%xmm5,%xmm5 + vmovdqu %xmm6,0(%rdi) + vmovdqu %xmm1,16(%rdi) + vmovdqu %xmm13,32(%rdi) + vmovdqu %xmm5,48(%rdi) + je .Ldone4xop + + leaq 64(%rsi),%rsi + vmovdqa %xmm15,0(%rsp) + xorq %r10,%r10 + vmovdqa %xmm10,16(%rsp) + leaq 64(%rdi),%rdi + vmovdqa %xmm2,32(%rsp) + subq $64,%rdx + vmovdqa %xmm9,48(%rsp) + jmp .Loop_tail4xop + +.align 32 +.L128_or_more4xop: + vpxor 0(%rsi),%xmm6,%xmm6 + vpxor 16(%rsi),%xmm1,%xmm1 + vpxor 32(%rsi),%xmm13,%xmm13 + vpxor 48(%rsi),%xmm5,%xmm5 + vpxor 64(%rsi),%xmm15,%xmm15 + vpxor 80(%rsi),%xmm10,%xmm10 + vpxor 96(%rsi),%xmm2,%xmm2 + vpxor 112(%rsi),%xmm9,%xmm9 + + vmovdqu %xmm6,0(%rdi) + vmovdqu %xmm1,16(%rdi) + vmovdqu %xmm13,32(%rdi) + vmovdqu %xmm5,48(%rdi) + vmovdqu %xmm15,64(%rdi) + vmovdqu %xmm10,80(%rdi) + vmovdqu %xmm2,96(%rdi) + vmovdqu %xmm9,112(%rdi) + je .Ldone4xop + + leaq 128(%rsi),%rsi + vmovdqa %xmm11,0(%rsp) + xorq %r10,%r10 + vmovdqa %xmm3,16(%rsp) + leaq 128(%rdi),%rdi + vmovdqa %xmm14,32(%rsp) + subq $128,%rdx + vmovdqa %xmm7,48(%rsp) + jmp .Loop_tail4xop + +.align 32 +.L192_or_more4xop: + vpxor 0(%rsi),%xmm6,%xmm6 + vpxor 16(%rsi),%xmm1,%xmm1 + vpxor 32(%rsi),%xmm13,%xmm13 + vpxor 48(%rsi),%xmm5,%xmm5 + vpxor 64(%rsi),%xmm15,%xmm15 + vpxor 80(%rsi),%xmm10,%xmm10 + vpxor 96(%rsi),%xmm2,%xmm2 + vpxor 112(%rsi),%xmm9,%xmm9 + leaq 128(%rsi),%rsi + vpxor 0(%rsi),%xmm11,%xmm11 + vpxor 16(%rsi),%xmm3,%xmm3 + vpxor 32(%rsi),%xmm14,%xmm14 + vpxor 48(%rsi),%xmm7,%xmm7 + + vmovdqu %xmm6,0(%rdi) + vmovdqu %xmm1,16(%rdi) + vmovdqu %xmm13,32(%rdi) + vmovdqu %xmm5,48(%rdi) + vmovdqu %xmm15,64(%rdi) + vmovdqu %xmm10,80(%rdi) + vmovdqu %xmm2,96(%rdi) + vmovdqu %xmm9,112(%rdi) + leaq 128(%rdi),%rdi + vmovdqu %xmm11,0(%rdi) + vmovdqu %xmm3,16(%rdi) + vmovdqu %xmm14,32(%rdi) + vmovdqu %xmm7,48(%rdi) + je .Ldone4xop + + leaq 64(%rsi),%rsi + vmovdqa %xmm8,0(%rsp) + xorq %r10,%r10 + vmovdqa %xmm0,16(%rsp) + leaq 64(%rdi),%rdi + vmovdqa %xmm12,32(%rsp) + subq $192,%rdx + vmovdqa %xmm4,48(%rsp) + +.Loop_tail4xop: + movzbl (%rsi,%r10,1),%eax + movzbl (%rsp,%r10,1),%ecx + leaq 1(%r10),%r10 + xorl %ecx,%eax + movb %al,-1(%rdi,%r10,1) + decq %rdx + jnz .Loop_tail4xop + +.Ldone4xop: + vzeroupper + leaq (%r9),%rsp +.cfi_def_cfa_register %rsp +.L4xop_epilogue: + .byte 0xf3,0xc3 +.cfi_endproc +.size ChaCha20_4xop,.-ChaCha20_4xop +.type ChaCha20_8x,@function +.align 32 +ChaCha20_8x: +.cfi_startproc +.LChaCha20_8x: + movq %rsp,%r9 +.cfi_def_cfa_register %r9 + subq $0x280+8,%rsp + andq $-32,%rsp + vzeroupper + + + + + + + + + + + vbroadcasti128 .Lsigma(%rip),%ymm11 + vbroadcasti128 (%rcx),%ymm3 + vbroadcasti128 16(%rcx),%ymm15 + vbroadcasti128 (%r8),%ymm7 + leaq 256(%rsp),%rcx + leaq 512(%rsp),%rax + leaq .Lrot16(%rip),%r10 + leaq .Lrot24(%rip),%r11 + + vpshufd $0x00,%ymm11,%ymm8 + vpshufd $0x55,%ymm11,%ymm9 + vmovdqa %ymm8,128-256(%rcx) + vpshufd $0xaa,%ymm11,%ymm10 + vmovdqa %ymm9,160-256(%rcx) + vpshufd $0xff,%ymm11,%ymm11 + vmovdqa %ymm10,192-256(%rcx) + vmovdqa %ymm11,224-256(%rcx) + + vpshufd $0x00,%ymm3,%ymm0 + vpshufd $0x55,%ymm3,%ymm1 + vmovdqa %ymm0,256-256(%rcx) + vpshufd $0xaa,%ymm3,%ymm2 + vmovdqa 
%ymm1,288-256(%rcx) + vpshufd $0xff,%ymm3,%ymm3 + vmovdqa %ymm2,320-256(%rcx) + vmovdqa %ymm3,352-256(%rcx) + + vpshufd $0x00,%ymm15,%ymm12 + vpshufd $0x55,%ymm15,%ymm13 + vmovdqa %ymm12,384-512(%rax) + vpshufd $0xaa,%ymm15,%ymm14 + vmovdqa %ymm13,416-512(%rax) + vpshufd $0xff,%ymm15,%ymm15 + vmovdqa %ymm14,448-512(%rax) + vmovdqa %ymm15,480-512(%rax) + + vpshufd $0x00,%ymm7,%ymm4 + vpshufd $0x55,%ymm7,%ymm5 + vpaddd .Lincy(%rip),%ymm4,%ymm4 + vpshufd $0xaa,%ymm7,%ymm6 + vmovdqa %ymm5,544-512(%rax) + vpshufd $0xff,%ymm7,%ymm7 + vmovdqa %ymm6,576-512(%rax) + vmovdqa %ymm7,608-512(%rax) + + jmp .Loop_enter8x + +.align 32 +.Loop_outer8x: + vmovdqa 128-256(%rcx),%ymm8 + vmovdqa 160-256(%rcx),%ymm9 + vmovdqa 192-256(%rcx),%ymm10 + vmovdqa 224-256(%rcx),%ymm11 + vmovdqa 256-256(%rcx),%ymm0 + vmovdqa 288-256(%rcx),%ymm1 + vmovdqa 320-256(%rcx),%ymm2 + vmovdqa 352-256(%rcx),%ymm3 + vmovdqa 384-512(%rax),%ymm12 + vmovdqa 416-512(%rax),%ymm13 + vmovdqa 448-512(%rax),%ymm14 + vmovdqa 480-512(%rax),%ymm15 + vmovdqa 512-512(%rax),%ymm4 + vmovdqa 544-512(%rax),%ymm5 + vmovdqa 576-512(%rax),%ymm6 + vmovdqa 608-512(%rax),%ymm7 + vpaddd .Leight(%rip),%ymm4,%ymm4 + +.Loop_enter8x: + vmovdqa %ymm14,64(%rsp) + vmovdqa %ymm15,96(%rsp) + vbroadcasti128 (%r10),%ymm15 + vmovdqa %ymm4,512-512(%rax) + movl $10,%eax + jmp .Loop8x + +.align 32 +.Loop8x: + vpaddd %ymm0,%ymm8,%ymm8 + vpxor %ymm4,%ymm8,%ymm4 + vpshufb %ymm15,%ymm4,%ymm4 + vpaddd %ymm1,%ymm9,%ymm9 + vpxor %ymm5,%ymm9,%ymm5 + vpshufb %ymm15,%ymm5,%ymm5 + vpaddd %ymm4,%ymm12,%ymm12 + vpxor %ymm0,%ymm12,%ymm0 + vpslld $12,%ymm0,%ymm14 + vpsrld $20,%ymm0,%ymm0 + vpor %ymm0,%ymm14,%ymm0 + vbroadcasti128 (%r11),%ymm14 + vpaddd %ymm5,%ymm13,%ymm13 + vpxor %ymm1,%ymm13,%ymm1 + vpslld $12,%ymm1,%ymm15 + vpsrld $20,%ymm1,%ymm1 + vpor %ymm1,%ymm15,%ymm1 + vpaddd %ymm0,%ymm8,%ymm8 + vpxor %ymm4,%ymm8,%ymm4 + vpshufb %ymm14,%ymm4,%ymm4 + vpaddd %ymm1,%ymm9,%ymm9 + vpxor %ymm5,%ymm9,%ymm5 + vpshufb %ymm14,%ymm5,%ymm5 + vpaddd %ymm4,%ymm12,%ymm12 + vpxor %ymm0,%ymm12,%ymm0 + vpslld $7,%ymm0,%ymm15 + vpsrld $25,%ymm0,%ymm0 + vpor %ymm0,%ymm15,%ymm0 + vbroadcasti128 (%r10),%ymm15 + vpaddd %ymm5,%ymm13,%ymm13 + vpxor %ymm1,%ymm13,%ymm1 + vpslld $7,%ymm1,%ymm14 + vpsrld $25,%ymm1,%ymm1 + vpor %ymm1,%ymm14,%ymm1 + vmovdqa %ymm12,0(%rsp) + vmovdqa %ymm13,32(%rsp) + vmovdqa 64(%rsp),%ymm12 + vmovdqa 96(%rsp),%ymm13 + vpaddd %ymm2,%ymm10,%ymm10 + vpxor %ymm6,%ymm10,%ymm6 + vpshufb %ymm15,%ymm6,%ymm6 + vpaddd %ymm3,%ymm11,%ymm11 + vpxor %ymm7,%ymm11,%ymm7 + vpshufb %ymm15,%ymm7,%ymm7 + vpaddd %ymm6,%ymm12,%ymm12 + vpxor %ymm2,%ymm12,%ymm2 + vpslld $12,%ymm2,%ymm14 + vpsrld $20,%ymm2,%ymm2 + vpor %ymm2,%ymm14,%ymm2 + vbroadcasti128 (%r11),%ymm14 + vpaddd %ymm7,%ymm13,%ymm13 + vpxor %ymm3,%ymm13,%ymm3 + vpslld $12,%ymm3,%ymm15 + vpsrld $20,%ymm3,%ymm3 + vpor %ymm3,%ymm15,%ymm3 + vpaddd %ymm2,%ymm10,%ymm10 + vpxor %ymm6,%ymm10,%ymm6 + vpshufb %ymm14,%ymm6,%ymm6 + vpaddd %ymm3,%ymm11,%ymm11 + vpxor %ymm7,%ymm11,%ymm7 + vpshufb %ymm14,%ymm7,%ymm7 + vpaddd %ymm6,%ymm12,%ymm12 + vpxor %ymm2,%ymm12,%ymm2 + vpslld $7,%ymm2,%ymm15 + vpsrld $25,%ymm2,%ymm2 + vpor %ymm2,%ymm15,%ymm2 + vbroadcasti128 (%r10),%ymm15 + vpaddd %ymm7,%ymm13,%ymm13 + vpxor %ymm3,%ymm13,%ymm3 + vpslld $7,%ymm3,%ymm14 + vpsrld $25,%ymm3,%ymm3 + vpor %ymm3,%ymm14,%ymm3 + vpaddd %ymm1,%ymm8,%ymm8 + vpxor %ymm7,%ymm8,%ymm7 + vpshufb %ymm15,%ymm7,%ymm7 + vpaddd %ymm2,%ymm9,%ymm9 + vpxor %ymm4,%ymm9,%ymm4 + vpshufb %ymm15,%ymm4,%ymm4 + vpaddd %ymm7,%ymm12,%ymm12 + vpxor %ymm1,%ymm12,%ymm1 + vpslld $12,%ymm1,%ymm14 + 
vpsrld $20,%ymm1,%ymm1 + vpor %ymm1,%ymm14,%ymm1 + vbroadcasti128 (%r11),%ymm14 + vpaddd %ymm4,%ymm13,%ymm13 + vpxor %ymm2,%ymm13,%ymm2 + vpslld $12,%ymm2,%ymm15 + vpsrld $20,%ymm2,%ymm2 + vpor %ymm2,%ymm15,%ymm2 + vpaddd %ymm1,%ymm8,%ymm8 + vpxor %ymm7,%ymm8,%ymm7 + vpshufb %ymm14,%ymm7,%ymm7 + vpaddd %ymm2,%ymm9,%ymm9 + vpxor %ymm4,%ymm9,%ymm4 + vpshufb %ymm14,%ymm4,%ymm4 + vpaddd %ymm7,%ymm12,%ymm12 + vpxor %ymm1,%ymm12,%ymm1 + vpslld $7,%ymm1,%ymm15 + vpsrld $25,%ymm1,%ymm1 + vpor %ymm1,%ymm15,%ymm1 + vbroadcasti128 (%r10),%ymm15 + vpaddd %ymm4,%ymm13,%ymm13 + vpxor %ymm2,%ymm13,%ymm2 + vpslld $7,%ymm2,%ymm14 + vpsrld $25,%ymm2,%ymm2 + vpor %ymm2,%ymm14,%ymm2 + vmovdqa %ymm12,64(%rsp) + vmovdqa %ymm13,96(%rsp) + vmovdqa 0(%rsp),%ymm12 + vmovdqa 32(%rsp),%ymm13 + vpaddd %ymm3,%ymm10,%ymm10 + vpxor %ymm5,%ymm10,%ymm5 + vpshufb %ymm15,%ymm5,%ymm5 + vpaddd %ymm0,%ymm11,%ymm11 + vpxor %ymm6,%ymm11,%ymm6 + vpshufb %ymm15,%ymm6,%ymm6 + vpaddd %ymm5,%ymm12,%ymm12 + vpxor %ymm3,%ymm12,%ymm3 + vpslld $12,%ymm3,%ymm14 + vpsrld $20,%ymm3,%ymm3 + vpor %ymm3,%ymm14,%ymm3 + vbroadcasti128 (%r11),%ymm14 + vpaddd %ymm6,%ymm13,%ymm13 + vpxor %ymm0,%ymm13,%ymm0 + vpslld $12,%ymm0,%ymm15 + vpsrld $20,%ymm0,%ymm0 + vpor %ymm0,%ymm15,%ymm0 + vpaddd %ymm3,%ymm10,%ymm10 + vpxor %ymm5,%ymm10,%ymm5 + vpshufb %ymm14,%ymm5,%ymm5 + vpaddd %ymm0,%ymm11,%ymm11 + vpxor %ymm6,%ymm11,%ymm6 + vpshufb %ymm14,%ymm6,%ymm6 + vpaddd %ymm5,%ymm12,%ymm12 + vpxor %ymm3,%ymm12,%ymm3 + vpslld $7,%ymm3,%ymm15 + vpsrld $25,%ymm3,%ymm3 + vpor %ymm3,%ymm15,%ymm3 + vbroadcasti128 (%r10),%ymm15 + vpaddd %ymm6,%ymm13,%ymm13 + vpxor %ymm0,%ymm13,%ymm0 + vpslld $7,%ymm0,%ymm14 + vpsrld $25,%ymm0,%ymm0 + vpor %ymm0,%ymm14,%ymm0 + decl %eax + jnz .Loop8x + + leaq 512(%rsp),%rax + vpaddd 128-256(%rcx),%ymm8,%ymm8 + vpaddd 160-256(%rcx),%ymm9,%ymm9 + vpaddd 192-256(%rcx),%ymm10,%ymm10 + vpaddd 224-256(%rcx),%ymm11,%ymm11 + + vpunpckldq %ymm9,%ymm8,%ymm14 + vpunpckldq %ymm11,%ymm10,%ymm15 + vpunpckhdq %ymm9,%ymm8,%ymm8 + vpunpckhdq %ymm11,%ymm10,%ymm10 + vpunpcklqdq %ymm15,%ymm14,%ymm9 + vpunpckhqdq %ymm15,%ymm14,%ymm14 + vpunpcklqdq %ymm10,%ymm8,%ymm11 + vpunpckhqdq %ymm10,%ymm8,%ymm8 + vpaddd 256-256(%rcx),%ymm0,%ymm0 + vpaddd 288-256(%rcx),%ymm1,%ymm1 + vpaddd 320-256(%rcx),%ymm2,%ymm2 + vpaddd 352-256(%rcx),%ymm3,%ymm3 + + vpunpckldq %ymm1,%ymm0,%ymm10 + vpunpckldq %ymm3,%ymm2,%ymm15 + vpunpckhdq %ymm1,%ymm0,%ymm0 + vpunpckhdq %ymm3,%ymm2,%ymm2 + vpunpcklqdq %ymm15,%ymm10,%ymm1 + vpunpckhqdq %ymm15,%ymm10,%ymm10 + vpunpcklqdq %ymm2,%ymm0,%ymm3 + vpunpckhqdq %ymm2,%ymm0,%ymm0 + vperm2i128 $0x20,%ymm1,%ymm9,%ymm15 + vperm2i128 $0x31,%ymm1,%ymm9,%ymm1 + vperm2i128 $0x20,%ymm10,%ymm14,%ymm9 + vperm2i128 $0x31,%ymm10,%ymm14,%ymm10 + vperm2i128 $0x20,%ymm3,%ymm11,%ymm14 + vperm2i128 $0x31,%ymm3,%ymm11,%ymm3 + vperm2i128 $0x20,%ymm0,%ymm8,%ymm11 + vperm2i128 $0x31,%ymm0,%ymm8,%ymm0 + vmovdqa %ymm15,0(%rsp) + vmovdqa %ymm9,32(%rsp) + vmovdqa 64(%rsp),%ymm15 + vmovdqa 96(%rsp),%ymm9 + + vpaddd 384-512(%rax),%ymm12,%ymm12 + vpaddd 416-512(%rax),%ymm13,%ymm13 + vpaddd 448-512(%rax),%ymm15,%ymm15 + vpaddd 480-512(%rax),%ymm9,%ymm9 + + vpunpckldq %ymm13,%ymm12,%ymm2 + vpunpckldq %ymm9,%ymm15,%ymm8 + vpunpckhdq %ymm13,%ymm12,%ymm12 + vpunpckhdq %ymm9,%ymm15,%ymm15 + vpunpcklqdq %ymm8,%ymm2,%ymm13 + vpunpckhqdq %ymm8,%ymm2,%ymm2 + vpunpcklqdq %ymm15,%ymm12,%ymm9 + vpunpckhqdq %ymm15,%ymm12,%ymm12 + vpaddd 512-512(%rax),%ymm4,%ymm4 + vpaddd 544-512(%rax),%ymm5,%ymm5 + vpaddd 576-512(%rax),%ymm6,%ymm6 + vpaddd 608-512(%rax),%ymm7,%ymm7 + + vpunpckldq 
%ymm5,%ymm4,%ymm15 + vpunpckldq %ymm7,%ymm6,%ymm8 + vpunpckhdq %ymm5,%ymm4,%ymm4 + vpunpckhdq %ymm7,%ymm6,%ymm6 + vpunpcklqdq %ymm8,%ymm15,%ymm5 + vpunpckhqdq %ymm8,%ymm15,%ymm15 + vpunpcklqdq %ymm6,%ymm4,%ymm7 + vpunpckhqdq %ymm6,%ymm4,%ymm4 + vperm2i128 $0x20,%ymm5,%ymm13,%ymm8 + vperm2i128 $0x31,%ymm5,%ymm13,%ymm5 + vperm2i128 $0x20,%ymm15,%ymm2,%ymm13 + vperm2i128 $0x31,%ymm15,%ymm2,%ymm15 + vperm2i128 $0x20,%ymm7,%ymm9,%ymm2 + vperm2i128 $0x31,%ymm7,%ymm9,%ymm7 + vperm2i128 $0x20,%ymm4,%ymm12,%ymm9 + vperm2i128 $0x31,%ymm4,%ymm12,%ymm4 + vmovdqa 0(%rsp),%ymm6 + vmovdqa 32(%rsp),%ymm12 + + cmpq $512,%rdx + jb .Ltail8x + + vpxor 0(%rsi),%ymm6,%ymm6 + vpxor 32(%rsi),%ymm8,%ymm8 + vpxor 64(%rsi),%ymm1,%ymm1 + vpxor 96(%rsi),%ymm5,%ymm5 + leaq 128(%rsi),%rsi + vmovdqu %ymm6,0(%rdi) + vmovdqu %ymm8,32(%rdi) + vmovdqu %ymm1,64(%rdi) + vmovdqu %ymm5,96(%rdi) + leaq 128(%rdi),%rdi + + vpxor 0(%rsi),%ymm12,%ymm12 + vpxor 32(%rsi),%ymm13,%ymm13 + vpxor 64(%rsi),%ymm10,%ymm10 + vpxor 96(%rsi),%ymm15,%ymm15 + leaq 128(%rsi),%rsi + vmovdqu %ymm12,0(%rdi) + vmovdqu %ymm13,32(%rdi) + vmovdqu %ymm10,64(%rdi) + vmovdqu %ymm15,96(%rdi) + leaq 128(%rdi),%rdi + + vpxor 0(%rsi),%ymm14,%ymm14 + vpxor 32(%rsi),%ymm2,%ymm2 + vpxor 64(%rsi),%ymm3,%ymm3 + vpxor 96(%rsi),%ymm7,%ymm7 + leaq 128(%rsi),%rsi + vmovdqu %ymm14,0(%rdi) + vmovdqu %ymm2,32(%rdi) + vmovdqu %ymm3,64(%rdi) + vmovdqu %ymm7,96(%rdi) + leaq 128(%rdi),%rdi + + vpxor 0(%rsi),%ymm11,%ymm11 + vpxor 32(%rsi),%ymm9,%ymm9 + vpxor 64(%rsi),%ymm0,%ymm0 + vpxor 96(%rsi),%ymm4,%ymm4 + leaq 128(%rsi),%rsi + vmovdqu %ymm11,0(%rdi) + vmovdqu %ymm9,32(%rdi) + vmovdqu %ymm0,64(%rdi) + vmovdqu %ymm4,96(%rdi) + leaq 128(%rdi),%rdi + + subq $512,%rdx + jnz .Loop_outer8x + + jmp .Ldone8x + +.Ltail8x: + cmpq $448,%rdx + jae .L448_or_more8x + cmpq $384,%rdx + jae .L384_or_more8x + cmpq $320,%rdx + jae .L320_or_more8x + cmpq $256,%rdx + jae .L256_or_more8x + cmpq $192,%rdx + jae .L192_or_more8x + cmpq $128,%rdx + jae .L128_or_more8x + cmpq $64,%rdx + jae .L64_or_more8x + + xorq %r10,%r10 + vmovdqa %ymm6,0(%rsp) + vmovdqa %ymm8,32(%rsp) + jmp .Loop_tail8x + +.align 32 +.L64_or_more8x: + vpxor 0(%rsi),%ymm6,%ymm6 + vpxor 32(%rsi),%ymm8,%ymm8 + vmovdqu %ymm6,0(%rdi) + vmovdqu %ymm8,32(%rdi) + je .Ldone8x + + leaq 64(%rsi),%rsi + xorq %r10,%r10 + vmovdqa %ymm1,0(%rsp) + leaq 64(%rdi),%rdi + subq $64,%rdx + vmovdqa %ymm5,32(%rsp) + jmp .Loop_tail8x + +.align 32 +.L128_or_more8x: + vpxor 0(%rsi),%ymm6,%ymm6 + vpxor 32(%rsi),%ymm8,%ymm8 + vpxor 64(%rsi),%ymm1,%ymm1 + vpxor 96(%rsi),%ymm5,%ymm5 + vmovdqu %ymm6,0(%rdi) + vmovdqu %ymm8,32(%rdi) + vmovdqu %ymm1,64(%rdi) + vmovdqu %ymm5,96(%rdi) + je .Ldone8x + + leaq 128(%rsi),%rsi + xorq %r10,%r10 + vmovdqa %ymm12,0(%rsp) + leaq 128(%rdi),%rdi + subq $128,%rdx + vmovdqa %ymm13,32(%rsp) + jmp .Loop_tail8x + +.align 32 +.L192_or_more8x: + vpxor 0(%rsi),%ymm6,%ymm6 + vpxor 32(%rsi),%ymm8,%ymm8 + vpxor 64(%rsi),%ymm1,%ymm1 + vpxor 96(%rsi),%ymm5,%ymm5 + vpxor 128(%rsi),%ymm12,%ymm12 + vpxor 160(%rsi),%ymm13,%ymm13 + vmovdqu %ymm6,0(%rdi) + vmovdqu %ymm8,32(%rdi) + vmovdqu %ymm1,64(%rdi) + vmovdqu %ymm5,96(%rdi) + vmovdqu %ymm12,128(%rdi) + vmovdqu %ymm13,160(%rdi) + je .Ldone8x + + leaq 192(%rsi),%rsi + xorq %r10,%r10 + vmovdqa %ymm10,0(%rsp) + leaq 192(%rdi),%rdi + subq $192,%rdx + vmovdqa %ymm15,32(%rsp) + jmp .Loop_tail8x + +.align 32 +.L256_or_more8x: + vpxor 0(%rsi),%ymm6,%ymm6 + vpxor 32(%rsi),%ymm8,%ymm8 + vpxor 64(%rsi),%ymm1,%ymm1 + vpxor 96(%rsi),%ymm5,%ymm5 + vpxor 128(%rsi),%ymm12,%ymm12 + vpxor 
160(%rsi),%ymm13,%ymm13 + vpxor 192(%rsi),%ymm10,%ymm10 + vpxor 224(%rsi),%ymm15,%ymm15 + vmovdqu %ymm6,0(%rdi) + vmovdqu %ymm8,32(%rdi) + vmovdqu %ymm1,64(%rdi) + vmovdqu %ymm5,96(%rdi) + vmovdqu %ymm12,128(%rdi) + vmovdqu %ymm13,160(%rdi) + vmovdqu %ymm10,192(%rdi) + vmovdqu %ymm15,224(%rdi) + je .Ldone8x + + leaq 256(%rsi),%rsi + xorq %r10,%r10 + vmovdqa %ymm14,0(%rsp) + leaq 256(%rdi),%rdi + subq $256,%rdx + vmovdqa %ymm2,32(%rsp) + jmp .Loop_tail8x + +.align 32 +.L320_or_more8x: + vpxor 0(%rsi),%ymm6,%ymm6 + vpxor 32(%rsi),%ymm8,%ymm8 + vpxor 64(%rsi),%ymm1,%ymm1 + vpxor 96(%rsi),%ymm5,%ymm5 + vpxor 128(%rsi),%ymm12,%ymm12 + vpxor 160(%rsi),%ymm13,%ymm13 + vpxor 192(%rsi),%ymm10,%ymm10 + vpxor 224(%rsi),%ymm15,%ymm15 + vpxor 256(%rsi),%ymm14,%ymm14 + vpxor 288(%rsi),%ymm2,%ymm2 + vmovdqu %ymm6,0(%rdi) + vmovdqu %ymm8,32(%rdi) + vmovdqu %ymm1,64(%rdi) + vmovdqu %ymm5,96(%rdi) + vmovdqu %ymm12,128(%rdi) + vmovdqu %ymm13,160(%rdi) + vmovdqu %ymm10,192(%rdi) + vmovdqu %ymm15,224(%rdi) + vmovdqu %ymm14,256(%rdi) + vmovdqu %ymm2,288(%rdi) + je .Ldone8x + + leaq 320(%rsi),%rsi + xorq %r10,%r10 + vmovdqa %ymm3,0(%rsp) + leaq 320(%rdi),%rdi + subq $320,%rdx + vmovdqa %ymm7,32(%rsp) + jmp .Loop_tail8x + +.align 32 +.L384_or_more8x: + vpxor 0(%rsi),%ymm6,%ymm6 + vpxor 32(%rsi),%ymm8,%ymm8 + vpxor 64(%rsi),%ymm1,%ymm1 + vpxor 96(%rsi),%ymm5,%ymm5 + vpxor 128(%rsi),%ymm12,%ymm12 + vpxor 160(%rsi),%ymm13,%ymm13 + vpxor 192(%rsi),%ymm10,%ymm10 + vpxor 224(%rsi),%ymm15,%ymm15 + vpxor 256(%rsi),%ymm14,%ymm14 + vpxor 288(%rsi),%ymm2,%ymm2 + vpxor 320(%rsi),%ymm3,%ymm3 + vpxor 352(%rsi),%ymm7,%ymm7 + vmovdqu %ymm6,0(%rdi) + vmovdqu %ymm8,32(%rdi) + vmovdqu %ymm1,64(%rdi) + vmovdqu %ymm5,96(%rdi) + vmovdqu %ymm12,128(%rdi) + vmovdqu %ymm13,160(%rdi) + vmovdqu %ymm10,192(%rdi) + vmovdqu %ymm15,224(%rdi) + vmovdqu %ymm14,256(%rdi) + vmovdqu %ymm2,288(%rdi) + vmovdqu %ymm3,320(%rdi) + vmovdqu %ymm7,352(%rdi) + je .Ldone8x + + leaq 384(%rsi),%rsi + xorq %r10,%r10 + vmovdqa %ymm11,0(%rsp) + leaq 384(%rdi),%rdi + subq $384,%rdx + vmovdqa %ymm9,32(%rsp) + jmp .Loop_tail8x + +.align 32 +.L448_or_more8x: + vpxor 0(%rsi),%ymm6,%ymm6 + vpxor 32(%rsi),%ymm8,%ymm8 + vpxor 64(%rsi),%ymm1,%ymm1 + vpxor 96(%rsi),%ymm5,%ymm5 + vpxor 128(%rsi),%ymm12,%ymm12 + vpxor 160(%rsi),%ymm13,%ymm13 + vpxor 192(%rsi),%ymm10,%ymm10 + vpxor 224(%rsi),%ymm15,%ymm15 + vpxor 256(%rsi),%ymm14,%ymm14 + vpxor 288(%rsi),%ymm2,%ymm2 + vpxor 320(%rsi),%ymm3,%ymm3 + vpxor 352(%rsi),%ymm7,%ymm7 + vpxor 384(%rsi),%ymm11,%ymm11 + vpxor 416(%rsi),%ymm9,%ymm9 + vmovdqu %ymm6,0(%rdi) + vmovdqu %ymm8,32(%rdi) + vmovdqu %ymm1,64(%rdi) + vmovdqu %ymm5,96(%rdi) + vmovdqu %ymm12,128(%rdi) + vmovdqu %ymm13,160(%rdi) + vmovdqu %ymm10,192(%rdi) + vmovdqu %ymm15,224(%rdi) + vmovdqu %ymm14,256(%rdi) + vmovdqu %ymm2,288(%rdi) + vmovdqu %ymm3,320(%rdi) + vmovdqu %ymm7,352(%rdi) + vmovdqu %ymm11,384(%rdi) + vmovdqu %ymm9,416(%rdi) + je .Ldone8x + + leaq 448(%rsi),%rsi + xorq %r10,%r10 + vmovdqa %ymm0,0(%rsp) + leaq 448(%rdi),%rdi + subq $448,%rdx + vmovdqa %ymm4,32(%rsp) + +.Loop_tail8x: + movzbl (%rsi,%r10,1),%eax + movzbl (%rsp,%r10,1),%ecx + leaq 1(%r10),%r10 + xorl %ecx,%eax + movb %al,-1(%rdi,%r10,1) + decq %rdx + jnz .Loop_tail8x + +.Ldone8x: + vzeroall + leaq (%r9),%rsp +.cfi_def_cfa_register %rsp +.L8x_epilogue: + .byte 0xf3,0xc3 +.cfi_endproc +.size ChaCha20_8x,.-ChaCha20_8x +.type ChaCha20_avx512,@function +.align 32 +ChaCha20_avx512: +.cfi_startproc +.LChaCha20_avx512: + movq %rsp,%r9 +.cfi_def_cfa_register %r9 + cmpq $512,%rdx + ja .LChaCha20_16x + + 
subq $64+8,%rsp + vbroadcasti32x4 .Lsigma(%rip),%zmm0 + vbroadcasti32x4 (%rcx),%zmm1 + vbroadcasti32x4 16(%rcx),%zmm2 + vbroadcasti32x4 (%r8),%zmm3 + + vmovdqa32 %zmm0,%zmm16 + vmovdqa32 %zmm1,%zmm17 + vmovdqa32 %zmm2,%zmm18 + vpaddd .Lzeroz(%rip),%zmm3,%zmm3 + vmovdqa32 .Lfourz(%rip),%zmm20 + movq $10,%r8 + vmovdqa32 %zmm3,%zmm19 + jmp .Loop_avx512 + +.align 16 +.Loop_outer_avx512: + vmovdqa32 %zmm16,%zmm0 + vmovdqa32 %zmm17,%zmm1 + vmovdqa32 %zmm18,%zmm2 + vpaddd %zmm20,%zmm19,%zmm3 + movq $10,%r8 + vmovdqa32 %zmm3,%zmm19 + jmp .Loop_avx512 + +.align 32 +.Loop_avx512: + vpaddd %zmm1,%zmm0,%zmm0 + vpxord %zmm0,%zmm3,%zmm3 + vprold $16,%zmm3,%zmm3 + vpaddd %zmm3,%zmm2,%zmm2 + vpxord %zmm2,%zmm1,%zmm1 + vprold $12,%zmm1,%zmm1 + vpaddd %zmm1,%zmm0,%zmm0 + vpxord %zmm0,%zmm3,%zmm3 + vprold $8,%zmm3,%zmm3 + vpaddd %zmm3,%zmm2,%zmm2 + vpxord %zmm2,%zmm1,%zmm1 + vprold $7,%zmm1,%zmm1 + vpshufd $78,%zmm2,%zmm2 + vpshufd $57,%zmm1,%zmm1 + vpshufd $147,%zmm3,%zmm3 + vpaddd %zmm1,%zmm0,%zmm0 + vpxord %zmm0,%zmm3,%zmm3 + vprold $16,%zmm3,%zmm3 + vpaddd %zmm3,%zmm2,%zmm2 + vpxord %zmm2,%zmm1,%zmm1 + vprold $12,%zmm1,%zmm1 + vpaddd %zmm1,%zmm0,%zmm0 + vpxord %zmm0,%zmm3,%zmm3 + vprold $8,%zmm3,%zmm3 + vpaddd %zmm3,%zmm2,%zmm2 + vpxord %zmm2,%zmm1,%zmm1 + vprold $7,%zmm1,%zmm1 + vpshufd $78,%zmm2,%zmm2 + vpshufd $147,%zmm1,%zmm1 + vpshufd $57,%zmm3,%zmm3 + decq %r8 + jnz .Loop_avx512 + vpaddd %zmm16,%zmm0,%zmm0 + vpaddd %zmm17,%zmm1,%zmm1 + vpaddd %zmm18,%zmm2,%zmm2 + vpaddd %zmm19,%zmm3,%zmm3 + + subq $64,%rdx + jb .Ltail64_avx512 + + vpxor 0(%rsi),%xmm0,%xmm4 + vpxor 16(%rsi),%xmm1,%xmm5 + vpxor 32(%rsi),%xmm2,%xmm6 + vpxor 48(%rsi),%xmm3,%xmm7 + leaq 64(%rsi),%rsi + + vmovdqu %xmm4,0(%rdi) + vmovdqu %xmm5,16(%rdi) + vmovdqu %xmm6,32(%rdi) + vmovdqu %xmm7,48(%rdi) + leaq 64(%rdi),%rdi + + jz .Ldone_avx512 + + vextracti32x4 $1,%zmm0,%xmm4 + vextracti32x4 $1,%zmm1,%xmm5 + vextracti32x4 $1,%zmm2,%xmm6 + vextracti32x4 $1,%zmm3,%xmm7 + + subq $64,%rdx + jb .Ltail_avx512 + + vpxor 0(%rsi),%xmm4,%xmm4 + vpxor 16(%rsi),%xmm5,%xmm5 + vpxor 32(%rsi),%xmm6,%xmm6 + vpxor 48(%rsi),%xmm7,%xmm7 + leaq 64(%rsi),%rsi + + vmovdqu %xmm4,0(%rdi) + vmovdqu %xmm5,16(%rdi) + vmovdqu %xmm6,32(%rdi) + vmovdqu %xmm7,48(%rdi) + leaq 64(%rdi),%rdi + + jz .Ldone_avx512 + + vextracti32x4 $2,%zmm0,%xmm4 + vextracti32x4 $2,%zmm1,%xmm5 + vextracti32x4 $2,%zmm2,%xmm6 + vextracti32x4 $2,%zmm3,%xmm7 + + subq $64,%rdx + jb .Ltail_avx512 + + vpxor 0(%rsi),%xmm4,%xmm4 + vpxor 16(%rsi),%xmm5,%xmm5 + vpxor 32(%rsi),%xmm6,%xmm6 + vpxor 48(%rsi),%xmm7,%xmm7 + leaq 64(%rsi),%rsi + + vmovdqu %xmm4,0(%rdi) + vmovdqu %xmm5,16(%rdi) + vmovdqu %xmm6,32(%rdi) + vmovdqu %xmm7,48(%rdi) + leaq 64(%rdi),%rdi + + jz .Ldone_avx512 + + vextracti32x4 $3,%zmm0,%xmm4 + vextracti32x4 $3,%zmm1,%xmm5 + vextracti32x4 $3,%zmm2,%xmm6 + vextracti32x4 $3,%zmm3,%xmm7 + + subq $64,%rdx + jb .Ltail_avx512 + + vpxor 0(%rsi),%xmm4,%xmm4 + vpxor 16(%rsi),%xmm5,%xmm5 + vpxor 32(%rsi),%xmm6,%xmm6 + vpxor 48(%rsi),%xmm7,%xmm7 + leaq 64(%rsi),%rsi + + vmovdqu %xmm4,0(%rdi) + vmovdqu %xmm5,16(%rdi) + vmovdqu %xmm6,32(%rdi) + vmovdqu %xmm7,48(%rdi) + leaq 64(%rdi),%rdi + + jnz .Loop_outer_avx512 + + jmp .Ldone_avx512 + +.align 16 +.Ltail64_avx512: + vmovdqa %xmm0,0(%rsp) + vmovdqa %xmm1,16(%rsp) + vmovdqa %xmm2,32(%rsp) + vmovdqa %xmm3,48(%rsp) + addq $64,%rdx + jmp .Loop_tail_avx512 + +.align 16 +.Ltail_avx512: + vmovdqa %xmm4,0(%rsp) + vmovdqa %xmm5,16(%rsp) + vmovdqa %xmm6,32(%rsp) + vmovdqa %xmm7,48(%rsp) + addq $64,%rdx + +.Loop_tail_avx512: + movzbl (%rsi,%r8,1),%eax + 
movzbl (%rsp,%r8,1),%ecx + leaq 1(%r8),%r8 + xorl %ecx,%eax + movb %al,-1(%rdi,%r8,1) + decq %rdx + jnz .Loop_tail_avx512 + + vmovdqu32 %zmm16,0(%rsp) + +.Ldone_avx512: + vzeroall + leaq (%r9),%rsp +.cfi_def_cfa_register %rsp +.Lavx512_epilogue: + .byte 0xf3,0xc3 +.cfi_endproc +.size ChaCha20_avx512,.-ChaCha20_avx512 +.type ChaCha20_avx512vl,@function +.align 32 +ChaCha20_avx512vl: +.cfi_startproc +.LChaCha20_avx512vl: + movq %rsp,%r9 +.cfi_def_cfa_register %r9 + cmpq $128,%rdx + ja .LChaCha20_8xvl + + subq $64+8,%rsp + vbroadcasti128 .Lsigma(%rip),%ymm0 + vbroadcasti128 (%rcx),%ymm1 + vbroadcasti128 16(%rcx),%ymm2 + vbroadcasti128 (%r8),%ymm3 + + vmovdqa32 %ymm0,%ymm16 + vmovdqa32 %ymm1,%ymm17 + vmovdqa32 %ymm2,%ymm18 + vpaddd .Lzeroz(%rip),%ymm3,%ymm3 + vmovdqa32 .Ltwoy(%rip),%ymm20 + movq $10,%r8 + vmovdqa32 %ymm3,%ymm19 + jmp .Loop_avx512vl + +.align 16 +.Loop_outer_avx512vl: + vmovdqa32 %ymm18,%ymm2 + vpaddd %ymm20,%ymm19,%ymm3 + movq $10,%r8 + vmovdqa32 %ymm3,%ymm19 + jmp .Loop_avx512vl + +.align 32 +.Loop_avx512vl: + vpaddd %ymm1,%ymm0,%ymm0 + vpxor %ymm0,%ymm3,%ymm3 + vprold $16,%ymm3,%ymm3 + vpaddd %ymm3,%ymm2,%ymm2 + vpxor %ymm2,%ymm1,%ymm1 + vprold $12,%ymm1,%ymm1 + vpaddd %ymm1,%ymm0,%ymm0 + vpxor %ymm0,%ymm3,%ymm3 + vprold $8,%ymm3,%ymm3 + vpaddd %ymm3,%ymm2,%ymm2 + vpxor %ymm2,%ymm1,%ymm1 + vprold $7,%ymm1,%ymm1 + vpshufd $78,%ymm2,%ymm2 + vpshufd $57,%ymm1,%ymm1 + vpshufd $147,%ymm3,%ymm3 + vpaddd %ymm1,%ymm0,%ymm0 + vpxor %ymm0,%ymm3,%ymm3 + vprold $16,%ymm3,%ymm3 + vpaddd %ymm3,%ymm2,%ymm2 + vpxor %ymm2,%ymm1,%ymm1 + vprold $12,%ymm1,%ymm1 + vpaddd %ymm1,%ymm0,%ymm0 + vpxor %ymm0,%ymm3,%ymm3 + vprold $8,%ymm3,%ymm3 + vpaddd %ymm3,%ymm2,%ymm2 + vpxor %ymm2,%ymm1,%ymm1 + vprold $7,%ymm1,%ymm1 + vpshufd $78,%ymm2,%ymm2 + vpshufd $147,%ymm1,%ymm1 + vpshufd $57,%ymm3,%ymm3 + decq %r8 + jnz .Loop_avx512vl + vpaddd %ymm16,%ymm0,%ymm0 + vpaddd %ymm17,%ymm1,%ymm1 + vpaddd %ymm18,%ymm2,%ymm2 + vpaddd %ymm19,%ymm3,%ymm3 + + subq $64,%rdx + jb .Ltail64_avx512vl + + vpxor 0(%rsi),%xmm0,%xmm4 + vpxor 16(%rsi),%xmm1,%xmm5 + vpxor 32(%rsi),%xmm2,%xmm6 + vpxor 48(%rsi),%xmm3,%xmm7 + leaq 64(%rsi),%rsi + + vmovdqu %xmm4,0(%rdi) + vmovdqu %xmm5,16(%rdi) + vmovdqu %xmm6,32(%rdi) + vmovdqu %xmm7,48(%rdi) + leaq 64(%rdi),%rdi + + jz .Ldone_avx512vl + + vextracti128 $1,%ymm0,%xmm4 + vextracti128 $1,%ymm1,%xmm5 + vextracti128 $1,%ymm2,%xmm6 + vextracti128 $1,%ymm3,%xmm7 + + subq $64,%rdx + jb .Ltail_avx512vl + + vpxor 0(%rsi),%xmm4,%xmm4 + vpxor 16(%rsi),%xmm5,%xmm5 + vpxor 32(%rsi),%xmm6,%xmm6 + vpxor 48(%rsi),%xmm7,%xmm7 + leaq 64(%rsi),%rsi + + vmovdqu %xmm4,0(%rdi) + vmovdqu %xmm5,16(%rdi) + vmovdqu %xmm6,32(%rdi) + vmovdqu %xmm7,48(%rdi) + leaq 64(%rdi),%rdi + + vmovdqa32 %ymm16,%ymm0 + vmovdqa32 %ymm17,%ymm1 + jnz .Loop_outer_avx512vl + + jmp .Ldone_avx512vl + +.align 16 +.Ltail64_avx512vl: + vmovdqa %xmm0,0(%rsp) + vmovdqa %xmm1,16(%rsp) + vmovdqa %xmm2,32(%rsp) + vmovdqa %xmm3,48(%rsp) + addq $64,%rdx + jmp .Loop_tail_avx512vl + +.align 16 +.Ltail_avx512vl: + vmovdqa %xmm4,0(%rsp) + vmovdqa %xmm5,16(%rsp) + vmovdqa %xmm6,32(%rsp) + vmovdqa %xmm7,48(%rsp) + addq $64,%rdx + +.Loop_tail_avx512vl: + movzbl (%rsi,%r8,1),%eax + movzbl (%rsp,%r8,1),%ecx + leaq 1(%r8),%r8 + xorl %ecx,%eax + movb %al,-1(%rdi,%r8,1) + decq %rdx + jnz .Loop_tail_avx512vl + + vmovdqu32 %ymm16,0(%rsp) + vmovdqu32 %ymm16,32(%rsp) + +.Ldone_avx512vl: + vzeroall + leaq (%r9),%rsp +.cfi_def_cfa_register %rsp +.Lavx512vl_epilogue: + .byte 0xf3,0xc3 +.cfi_endproc +.size ChaCha20_avx512vl,.-ChaCha20_avx512vl +.type 
ChaCha20_16x,@function +.align 32 +ChaCha20_16x: +.cfi_startproc +.LChaCha20_16x: + movq %rsp,%r9 +.cfi_def_cfa_register %r9 + subq $64+8,%rsp + andq $-64,%rsp + vzeroupper + + leaq .Lsigma(%rip),%r10 + vbroadcasti32x4 (%r10),%zmm3 + vbroadcasti32x4 (%rcx),%zmm7 + vbroadcasti32x4 16(%rcx),%zmm11 + vbroadcasti32x4 (%r8),%zmm15 + + vpshufd $0x00,%zmm3,%zmm0 + vpshufd $0x55,%zmm3,%zmm1 + vpshufd $0xaa,%zmm3,%zmm2 + vpshufd $0xff,%zmm3,%zmm3 + vmovdqa64 %zmm0,%zmm16 + vmovdqa64 %zmm1,%zmm17 + vmovdqa64 %zmm2,%zmm18 + vmovdqa64 %zmm3,%zmm19 + + vpshufd $0x00,%zmm7,%zmm4 + vpshufd $0x55,%zmm7,%zmm5 + vpshufd $0xaa,%zmm7,%zmm6 + vpshufd $0xff,%zmm7,%zmm7 + vmovdqa64 %zmm4,%zmm20 + vmovdqa64 %zmm5,%zmm21 + vmovdqa64 %zmm6,%zmm22 + vmovdqa64 %zmm7,%zmm23 + + vpshufd $0x00,%zmm11,%zmm8 + vpshufd $0x55,%zmm11,%zmm9 + vpshufd $0xaa,%zmm11,%zmm10 + vpshufd $0xff,%zmm11,%zmm11 + vmovdqa64 %zmm8,%zmm24 + vmovdqa64 %zmm9,%zmm25 + vmovdqa64 %zmm10,%zmm26 + vmovdqa64 %zmm11,%zmm27 + + vpshufd $0x00,%zmm15,%zmm12 + vpshufd $0x55,%zmm15,%zmm13 + vpshufd $0xaa,%zmm15,%zmm14 + vpshufd $0xff,%zmm15,%zmm15 + vpaddd .Lincz(%rip),%zmm12,%zmm12 + vmovdqa64 %zmm12,%zmm28 + vmovdqa64 %zmm13,%zmm29 + vmovdqa64 %zmm14,%zmm30 + vmovdqa64 %zmm15,%zmm31 + + movl $10,%eax + jmp .Loop16x + +.align 32 +.Loop_outer16x: + vpbroadcastd 0(%r10),%zmm0 + vpbroadcastd 4(%r10),%zmm1 + vpbroadcastd 8(%r10),%zmm2 + vpbroadcastd 12(%r10),%zmm3 + vpaddd .Lsixteen(%rip),%zmm28,%zmm28 + vmovdqa64 %zmm20,%zmm4 + vmovdqa64 %zmm21,%zmm5 + vmovdqa64 %zmm22,%zmm6 + vmovdqa64 %zmm23,%zmm7 + vmovdqa64 %zmm24,%zmm8 + vmovdqa64 %zmm25,%zmm9 + vmovdqa64 %zmm26,%zmm10 + vmovdqa64 %zmm27,%zmm11 + vmovdqa64 %zmm28,%zmm12 + vmovdqa64 %zmm29,%zmm13 + vmovdqa64 %zmm30,%zmm14 + vmovdqa64 %zmm31,%zmm15 + + vmovdqa64 %zmm0,%zmm16 + vmovdqa64 %zmm1,%zmm17 + vmovdqa64 %zmm2,%zmm18 + vmovdqa64 %zmm3,%zmm19 + + movl $10,%eax + jmp .Loop16x + +.align 32 +.Loop16x: + vpaddd %zmm4,%zmm0,%zmm0 + vpaddd %zmm5,%zmm1,%zmm1 + vpaddd %zmm6,%zmm2,%zmm2 + vpaddd %zmm7,%zmm3,%zmm3 + vpxord %zmm0,%zmm12,%zmm12 + vpxord %zmm1,%zmm13,%zmm13 + vpxord %zmm2,%zmm14,%zmm14 + vpxord %zmm3,%zmm15,%zmm15 + vprold $16,%zmm12,%zmm12 + vprold $16,%zmm13,%zmm13 + vprold $16,%zmm14,%zmm14 + vprold $16,%zmm15,%zmm15 + vpaddd %zmm12,%zmm8,%zmm8 + vpaddd %zmm13,%zmm9,%zmm9 + vpaddd %zmm14,%zmm10,%zmm10 + vpaddd %zmm15,%zmm11,%zmm11 + vpxord %zmm8,%zmm4,%zmm4 + vpxord %zmm9,%zmm5,%zmm5 + vpxord %zmm10,%zmm6,%zmm6 + vpxord %zmm11,%zmm7,%zmm7 + vprold $12,%zmm4,%zmm4 + vprold $12,%zmm5,%zmm5 + vprold $12,%zmm6,%zmm6 + vprold $12,%zmm7,%zmm7 + vpaddd %zmm4,%zmm0,%zmm0 + vpaddd %zmm5,%zmm1,%zmm1 + vpaddd %zmm6,%zmm2,%zmm2 + vpaddd %zmm7,%zmm3,%zmm3 + vpxord %zmm0,%zmm12,%zmm12 + vpxord %zmm1,%zmm13,%zmm13 + vpxord %zmm2,%zmm14,%zmm14 + vpxord %zmm3,%zmm15,%zmm15 + vprold $8,%zmm12,%zmm12 + vprold $8,%zmm13,%zmm13 + vprold $8,%zmm14,%zmm14 + vprold $8,%zmm15,%zmm15 + vpaddd %zmm12,%zmm8,%zmm8 + vpaddd %zmm13,%zmm9,%zmm9 + vpaddd %zmm14,%zmm10,%zmm10 + vpaddd %zmm15,%zmm11,%zmm11 + vpxord %zmm8,%zmm4,%zmm4 + vpxord %zmm9,%zmm5,%zmm5 + vpxord %zmm10,%zmm6,%zmm6 + vpxord %zmm11,%zmm7,%zmm7 + vprold $7,%zmm4,%zmm4 + vprold $7,%zmm5,%zmm5 + vprold $7,%zmm6,%zmm6 + vprold $7,%zmm7,%zmm7 + vpaddd %zmm5,%zmm0,%zmm0 + vpaddd %zmm6,%zmm1,%zmm1 + vpaddd %zmm7,%zmm2,%zmm2 + vpaddd %zmm4,%zmm3,%zmm3 + vpxord %zmm0,%zmm15,%zmm15 + vpxord %zmm1,%zmm12,%zmm12 + vpxord %zmm2,%zmm13,%zmm13 + vpxord %zmm3,%zmm14,%zmm14 + vprold $16,%zmm15,%zmm15 + vprold $16,%zmm12,%zmm12 + vprold $16,%zmm13,%zmm13 + vprold 
$16,%zmm14,%zmm14 + vpaddd %zmm15,%zmm10,%zmm10 + vpaddd %zmm12,%zmm11,%zmm11 + vpaddd %zmm13,%zmm8,%zmm8 + vpaddd %zmm14,%zmm9,%zmm9 + vpxord %zmm10,%zmm5,%zmm5 + vpxord %zmm11,%zmm6,%zmm6 + vpxord %zmm8,%zmm7,%zmm7 + vpxord %zmm9,%zmm4,%zmm4 + vprold $12,%zmm5,%zmm5 + vprold $12,%zmm6,%zmm6 + vprold $12,%zmm7,%zmm7 + vprold $12,%zmm4,%zmm4 + vpaddd %zmm5,%zmm0,%zmm0 + vpaddd %zmm6,%zmm1,%zmm1 + vpaddd %zmm7,%zmm2,%zmm2 + vpaddd %zmm4,%zmm3,%zmm3 + vpxord %zmm0,%zmm15,%zmm15 + vpxord %zmm1,%zmm12,%zmm12 + vpxord %zmm2,%zmm13,%zmm13 + vpxord %zmm3,%zmm14,%zmm14 + vprold $8,%zmm15,%zmm15 + vprold $8,%zmm12,%zmm12 + vprold $8,%zmm13,%zmm13 + vprold $8,%zmm14,%zmm14 + vpaddd %zmm15,%zmm10,%zmm10 + vpaddd %zmm12,%zmm11,%zmm11 + vpaddd %zmm13,%zmm8,%zmm8 + vpaddd %zmm14,%zmm9,%zmm9 + vpxord %zmm10,%zmm5,%zmm5 + vpxord %zmm11,%zmm6,%zmm6 + vpxord %zmm8,%zmm7,%zmm7 + vpxord %zmm9,%zmm4,%zmm4 + vprold $7,%zmm5,%zmm5 + vprold $7,%zmm6,%zmm6 + vprold $7,%zmm7,%zmm7 + vprold $7,%zmm4,%zmm4 + decl %eax + jnz .Loop16x + + vpaddd %zmm16,%zmm0,%zmm0 + vpaddd %zmm17,%zmm1,%zmm1 + vpaddd %zmm18,%zmm2,%zmm2 + vpaddd %zmm19,%zmm3,%zmm3 + + vpunpckldq %zmm1,%zmm0,%zmm18 + vpunpckldq %zmm3,%zmm2,%zmm19 + vpunpckhdq %zmm1,%zmm0,%zmm0 + vpunpckhdq %zmm3,%zmm2,%zmm2 + vpunpcklqdq %zmm19,%zmm18,%zmm1 + vpunpckhqdq %zmm19,%zmm18,%zmm18 + vpunpcklqdq %zmm2,%zmm0,%zmm3 + vpunpckhqdq %zmm2,%zmm0,%zmm0 + vpaddd %zmm20,%zmm4,%zmm4 + vpaddd %zmm21,%zmm5,%zmm5 + vpaddd %zmm22,%zmm6,%zmm6 + vpaddd %zmm23,%zmm7,%zmm7 + + vpunpckldq %zmm5,%zmm4,%zmm2 + vpunpckldq %zmm7,%zmm6,%zmm19 + vpunpckhdq %zmm5,%zmm4,%zmm4 + vpunpckhdq %zmm7,%zmm6,%zmm6 + vpunpcklqdq %zmm19,%zmm2,%zmm5 + vpunpckhqdq %zmm19,%zmm2,%zmm2 + vpunpcklqdq %zmm6,%zmm4,%zmm7 + vpunpckhqdq %zmm6,%zmm4,%zmm4 + vshufi32x4 $0x44,%zmm5,%zmm1,%zmm19 + vshufi32x4 $0xee,%zmm5,%zmm1,%zmm5 + vshufi32x4 $0x44,%zmm2,%zmm18,%zmm1 + vshufi32x4 $0xee,%zmm2,%zmm18,%zmm2 + vshufi32x4 $0x44,%zmm7,%zmm3,%zmm18 + vshufi32x4 $0xee,%zmm7,%zmm3,%zmm7 + vshufi32x4 $0x44,%zmm4,%zmm0,%zmm3 + vshufi32x4 $0xee,%zmm4,%zmm0,%zmm4 + vpaddd %zmm24,%zmm8,%zmm8 + vpaddd %zmm25,%zmm9,%zmm9 + vpaddd %zmm26,%zmm10,%zmm10 + vpaddd %zmm27,%zmm11,%zmm11 + + vpunpckldq %zmm9,%zmm8,%zmm6 + vpunpckldq %zmm11,%zmm10,%zmm0 + vpunpckhdq %zmm9,%zmm8,%zmm8 + vpunpckhdq %zmm11,%zmm10,%zmm10 + vpunpcklqdq %zmm0,%zmm6,%zmm9 + vpunpckhqdq %zmm0,%zmm6,%zmm6 + vpunpcklqdq %zmm10,%zmm8,%zmm11 + vpunpckhqdq %zmm10,%zmm8,%zmm8 + vpaddd %zmm28,%zmm12,%zmm12 + vpaddd %zmm29,%zmm13,%zmm13 + vpaddd %zmm30,%zmm14,%zmm14 + vpaddd %zmm31,%zmm15,%zmm15 + + vpunpckldq %zmm13,%zmm12,%zmm10 + vpunpckldq %zmm15,%zmm14,%zmm0 + vpunpckhdq %zmm13,%zmm12,%zmm12 + vpunpckhdq %zmm15,%zmm14,%zmm14 + vpunpcklqdq %zmm0,%zmm10,%zmm13 + vpunpckhqdq %zmm0,%zmm10,%zmm10 + vpunpcklqdq %zmm14,%zmm12,%zmm15 + vpunpckhqdq %zmm14,%zmm12,%zmm12 + vshufi32x4 $0x44,%zmm13,%zmm9,%zmm0 + vshufi32x4 $0xee,%zmm13,%zmm9,%zmm13 + vshufi32x4 $0x44,%zmm10,%zmm6,%zmm9 + vshufi32x4 $0xee,%zmm10,%zmm6,%zmm10 + vshufi32x4 $0x44,%zmm15,%zmm11,%zmm6 + vshufi32x4 $0xee,%zmm15,%zmm11,%zmm15 + vshufi32x4 $0x44,%zmm12,%zmm8,%zmm11 + vshufi32x4 $0xee,%zmm12,%zmm8,%zmm12 + vshufi32x4 $0x88,%zmm0,%zmm19,%zmm16 + vshufi32x4 $0xdd,%zmm0,%zmm19,%zmm19 + vshufi32x4 $0x88,%zmm13,%zmm5,%zmm0 + vshufi32x4 $0xdd,%zmm13,%zmm5,%zmm13 + vshufi32x4 $0x88,%zmm9,%zmm1,%zmm17 + vshufi32x4 $0xdd,%zmm9,%zmm1,%zmm1 + vshufi32x4 $0x88,%zmm10,%zmm2,%zmm9 + vshufi32x4 $0xdd,%zmm10,%zmm2,%zmm10 + vshufi32x4 $0x88,%zmm6,%zmm18,%zmm14 + vshufi32x4 $0xdd,%zmm6,%zmm18,%zmm18 + vshufi32x4 
$0x88,%zmm15,%zmm7,%zmm6 + vshufi32x4 $0xdd,%zmm15,%zmm7,%zmm15 + vshufi32x4 $0x88,%zmm11,%zmm3,%zmm8 + vshufi32x4 $0xdd,%zmm11,%zmm3,%zmm3 + vshufi32x4 $0x88,%zmm12,%zmm4,%zmm11 + vshufi32x4 $0xdd,%zmm12,%zmm4,%zmm12 + cmpq $1024,%rdx + jb .Ltail16x + + vpxord 0(%rsi),%zmm16,%zmm16 + vpxord 64(%rsi),%zmm17,%zmm17 + vpxord 128(%rsi),%zmm14,%zmm14 + vpxord 192(%rsi),%zmm8,%zmm8 + vmovdqu32 %zmm16,0(%rdi) + vmovdqu32 %zmm17,64(%rdi) + vmovdqu32 %zmm14,128(%rdi) + vmovdqu32 %zmm8,192(%rdi) + + vpxord 256(%rsi),%zmm19,%zmm19 + vpxord 320(%rsi),%zmm1,%zmm1 + vpxord 384(%rsi),%zmm18,%zmm18 + vpxord 448(%rsi),%zmm3,%zmm3 + vmovdqu32 %zmm19,256(%rdi) + vmovdqu32 %zmm1,320(%rdi) + vmovdqu32 %zmm18,384(%rdi) + vmovdqu32 %zmm3,448(%rdi) + + vpxord 512(%rsi),%zmm0,%zmm0 + vpxord 576(%rsi),%zmm9,%zmm9 + vpxord 640(%rsi),%zmm6,%zmm6 + vpxord 704(%rsi),%zmm11,%zmm11 + vmovdqu32 %zmm0,512(%rdi) + vmovdqu32 %zmm9,576(%rdi) + vmovdqu32 %zmm6,640(%rdi) + vmovdqu32 %zmm11,704(%rdi) + + vpxord 768(%rsi),%zmm13,%zmm13 + vpxord 832(%rsi),%zmm10,%zmm10 + vpxord 896(%rsi),%zmm15,%zmm15 + vpxord 960(%rsi),%zmm12,%zmm12 + leaq 1024(%rsi),%rsi + vmovdqu32 %zmm13,768(%rdi) + vmovdqu32 %zmm10,832(%rdi) + vmovdqu32 %zmm15,896(%rdi) + vmovdqu32 %zmm12,960(%rdi) + leaq 1024(%rdi),%rdi + + subq $1024,%rdx + jnz .Loop_outer16x + + jmp .Ldone16x + +.align 32 +.Ltail16x: + xorq %r10,%r10 + subq %rsi,%rdi + cmpq $64,%rdx + jb .Less_than_64_16x + vpxord (%rsi),%zmm16,%zmm16 + vmovdqu32 %zmm16,(%rdi,%rsi,1) + je .Ldone16x + vmovdqa32 %zmm17,%zmm16 + leaq 64(%rsi),%rsi + + cmpq $128,%rdx + jb .Less_than_64_16x + vpxord (%rsi),%zmm17,%zmm17 + vmovdqu32 %zmm17,(%rdi,%rsi,1) + je .Ldone16x + vmovdqa32 %zmm14,%zmm16 + leaq 64(%rsi),%rsi + + cmpq $192,%rdx + jb .Less_than_64_16x + vpxord (%rsi),%zmm14,%zmm14 + vmovdqu32 %zmm14,(%rdi,%rsi,1) + je .Ldone16x + vmovdqa32 %zmm8,%zmm16 + leaq 64(%rsi),%rsi + + cmpq $256,%rdx + jb .Less_than_64_16x + vpxord (%rsi),%zmm8,%zmm8 + vmovdqu32 %zmm8,(%rdi,%rsi,1) + je .Ldone16x + vmovdqa32 %zmm19,%zmm16 + leaq 64(%rsi),%rsi + + cmpq $320,%rdx + jb .Less_than_64_16x + vpxord (%rsi),%zmm19,%zmm19 + vmovdqu32 %zmm19,(%rdi,%rsi,1) + je .Ldone16x + vmovdqa32 %zmm1,%zmm16 + leaq 64(%rsi),%rsi + + cmpq $384,%rdx + jb .Less_than_64_16x + vpxord (%rsi),%zmm1,%zmm1 + vmovdqu32 %zmm1,(%rdi,%rsi,1) + je .Ldone16x + vmovdqa32 %zmm18,%zmm16 + leaq 64(%rsi),%rsi + + cmpq $448,%rdx + jb .Less_than_64_16x + vpxord (%rsi),%zmm18,%zmm18 + vmovdqu32 %zmm18,(%rdi,%rsi,1) + je .Ldone16x + vmovdqa32 %zmm3,%zmm16 + leaq 64(%rsi),%rsi + + cmpq $512,%rdx + jb .Less_than_64_16x + vpxord (%rsi),%zmm3,%zmm3 + vmovdqu32 %zmm3,(%rdi,%rsi,1) + je .Ldone16x + vmovdqa32 %zmm0,%zmm16 + leaq 64(%rsi),%rsi + + cmpq $576,%rdx + jb .Less_than_64_16x + vpxord (%rsi),%zmm0,%zmm0 + vmovdqu32 %zmm0,(%rdi,%rsi,1) + je .Ldone16x + vmovdqa32 %zmm9,%zmm16 + leaq 64(%rsi),%rsi + + cmpq $640,%rdx + jb .Less_than_64_16x + vpxord (%rsi),%zmm9,%zmm9 + vmovdqu32 %zmm9,(%rdi,%rsi,1) + je .Ldone16x + vmovdqa32 %zmm6,%zmm16 + leaq 64(%rsi),%rsi + + cmpq $704,%rdx + jb .Less_than_64_16x + vpxord (%rsi),%zmm6,%zmm6 + vmovdqu32 %zmm6,(%rdi,%rsi,1) + je .Ldone16x + vmovdqa32 %zmm11,%zmm16 + leaq 64(%rsi),%rsi + + cmpq $768,%rdx + jb .Less_than_64_16x + vpxord (%rsi),%zmm11,%zmm11 + vmovdqu32 %zmm11,(%rdi,%rsi,1) + je .Ldone16x + vmovdqa32 %zmm13,%zmm16 + leaq 64(%rsi),%rsi + + cmpq $832,%rdx + jb .Less_than_64_16x + vpxord (%rsi),%zmm13,%zmm13 + vmovdqu32 %zmm13,(%rdi,%rsi,1) + je .Ldone16x + vmovdqa32 %zmm10,%zmm16 + leaq 64(%rsi),%rsi + + cmpq $896,%rdx 
+ jb .Less_than_64_16x + vpxord (%rsi),%zmm10,%zmm10 + vmovdqu32 %zmm10,(%rdi,%rsi,1) + je .Ldone16x + vmovdqa32 %zmm15,%zmm16 + leaq 64(%rsi),%rsi + + cmpq $960,%rdx + jb .Less_than_64_16x + vpxord (%rsi),%zmm15,%zmm15 + vmovdqu32 %zmm15,(%rdi,%rsi,1) + je .Ldone16x + vmovdqa32 %zmm12,%zmm16 + leaq 64(%rsi),%rsi + +.Less_than_64_16x: + vmovdqa32 %zmm16,0(%rsp) + leaq (%rdi,%rsi,1),%rdi + andq $63,%rdx + +.Loop_tail16x: + movzbl (%rsi,%r10,1),%eax + movzbl (%rsp,%r10,1),%ecx + leaq 1(%r10),%r10 + xorl %ecx,%eax + movb %al,-1(%rdi,%r10,1) + decq %rdx + jnz .Loop_tail16x + + vpxord %zmm16,%zmm16,%zmm16 + vmovdqa32 %zmm16,0(%rsp) + +.Ldone16x: + vzeroall + leaq (%r9),%rsp +.cfi_def_cfa_register %rsp +.L16x_epilogue: + .byte 0xf3,0xc3 +.cfi_endproc +.size ChaCha20_16x,.-ChaCha20_16x +.type ChaCha20_8xvl,@function +.align 32 +ChaCha20_8xvl: +.cfi_startproc +.LChaCha20_8xvl: + movq %rsp,%r9 +.cfi_def_cfa_register %r9 + subq $64+8,%rsp + andq $-64,%rsp + vzeroupper + + leaq .Lsigma(%rip),%r10 + vbroadcasti128 (%r10),%ymm3 + vbroadcasti128 (%rcx),%ymm7 + vbroadcasti128 16(%rcx),%ymm11 + vbroadcasti128 (%r8),%ymm15 + + vpshufd $0x00,%ymm3,%ymm0 + vpshufd $0x55,%ymm3,%ymm1 + vpshufd $0xaa,%ymm3,%ymm2 + vpshufd $0xff,%ymm3,%ymm3 + vmovdqa64 %ymm0,%ymm16 + vmovdqa64 %ymm1,%ymm17 + vmovdqa64 %ymm2,%ymm18 + vmovdqa64 %ymm3,%ymm19 + + vpshufd $0x00,%ymm7,%ymm4 + vpshufd $0x55,%ymm7,%ymm5 + vpshufd $0xaa,%ymm7,%ymm6 + vpshufd $0xff,%ymm7,%ymm7 + vmovdqa64 %ymm4,%ymm20 + vmovdqa64 %ymm5,%ymm21 + vmovdqa64 %ymm6,%ymm22 + vmovdqa64 %ymm7,%ymm23 + + vpshufd $0x00,%ymm11,%ymm8 + vpshufd $0x55,%ymm11,%ymm9 + vpshufd $0xaa,%ymm11,%ymm10 + vpshufd $0xff,%ymm11,%ymm11 + vmovdqa64 %ymm8,%ymm24 + vmovdqa64 %ymm9,%ymm25 + vmovdqa64 %ymm10,%ymm26 + vmovdqa64 %ymm11,%ymm27 + + vpshufd $0x00,%ymm15,%ymm12 + vpshufd $0x55,%ymm15,%ymm13 + vpshufd $0xaa,%ymm15,%ymm14 + vpshufd $0xff,%ymm15,%ymm15 + vpaddd .Lincy(%rip),%ymm12,%ymm12 + vmovdqa64 %ymm12,%ymm28 + vmovdqa64 %ymm13,%ymm29 + vmovdqa64 %ymm14,%ymm30 + vmovdqa64 %ymm15,%ymm31 + + movl $10,%eax + jmp .Loop8xvl + +.align 32 +.Loop_outer8xvl: + + + vpbroadcastd 8(%r10),%ymm2 + vpbroadcastd 12(%r10),%ymm3 + vpaddd .Leight(%rip),%ymm28,%ymm28 + vmovdqa64 %ymm20,%ymm4 + vmovdqa64 %ymm21,%ymm5 + vmovdqa64 %ymm22,%ymm6 + vmovdqa64 %ymm23,%ymm7 + vmovdqa64 %ymm24,%ymm8 + vmovdqa64 %ymm25,%ymm9 + vmovdqa64 %ymm26,%ymm10 + vmovdqa64 %ymm27,%ymm11 + vmovdqa64 %ymm28,%ymm12 + vmovdqa64 %ymm29,%ymm13 + vmovdqa64 %ymm30,%ymm14 + vmovdqa64 %ymm31,%ymm15 + + vmovdqa64 %ymm0,%ymm16 + vmovdqa64 %ymm1,%ymm17 + vmovdqa64 %ymm2,%ymm18 + vmovdqa64 %ymm3,%ymm19 + + movl $10,%eax + jmp .Loop8xvl + +.align 32 +.Loop8xvl: + vpaddd %ymm4,%ymm0,%ymm0 + vpaddd %ymm5,%ymm1,%ymm1 + vpaddd %ymm6,%ymm2,%ymm2 + vpaddd %ymm7,%ymm3,%ymm3 + vpxor %ymm0,%ymm12,%ymm12 + vpxor %ymm1,%ymm13,%ymm13 + vpxor %ymm2,%ymm14,%ymm14 + vpxor %ymm3,%ymm15,%ymm15 + vprold $16,%ymm12,%ymm12 + vprold $16,%ymm13,%ymm13 + vprold $16,%ymm14,%ymm14 + vprold $16,%ymm15,%ymm15 + vpaddd %ymm12,%ymm8,%ymm8 + vpaddd %ymm13,%ymm9,%ymm9 + vpaddd %ymm14,%ymm10,%ymm10 + vpaddd %ymm15,%ymm11,%ymm11 + vpxor %ymm8,%ymm4,%ymm4 + vpxor %ymm9,%ymm5,%ymm5 + vpxor %ymm10,%ymm6,%ymm6 + vpxor %ymm11,%ymm7,%ymm7 + vprold $12,%ymm4,%ymm4 + vprold $12,%ymm5,%ymm5 + vprold $12,%ymm6,%ymm6 + vprold $12,%ymm7,%ymm7 + vpaddd %ymm4,%ymm0,%ymm0 + vpaddd %ymm5,%ymm1,%ymm1 + vpaddd %ymm6,%ymm2,%ymm2 + vpaddd %ymm7,%ymm3,%ymm3 + vpxor %ymm0,%ymm12,%ymm12 + vpxor %ymm1,%ymm13,%ymm13 + vpxor %ymm2,%ymm14,%ymm14 + vpxor %ymm3,%ymm15,%ymm15 + vprold 
$8,%ymm12,%ymm12 + vprold $8,%ymm13,%ymm13 + vprold $8,%ymm14,%ymm14 + vprold $8,%ymm15,%ymm15 + vpaddd %ymm12,%ymm8,%ymm8 + vpaddd %ymm13,%ymm9,%ymm9 + vpaddd %ymm14,%ymm10,%ymm10 + vpaddd %ymm15,%ymm11,%ymm11 + vpxor %ymm8,%ymm4,%ymm4 + vpxor %ymm9,%ymm5,%ymm5 + vpxor %ymm10,%ymm6,%ymm6 + vpxor %ymm11,%ymm7,%ymm7 + vprold $7,%ymm4,%ymm4 + vprold $7,%ymm5,%ymm5 + vprold $7,%ymm6,%ymm6 + vprold $7,%ymm7,%ymm7 + vpaddd %ymm5,%ymm0,%ymm0 + vpaddd %ymm6,%ymm1,%ymm1 + vpaddd %ymm7,%ymm2,%ymm2 + vpaddd %ymm4,%ymm3,%ymm3 + vpxor %ymm0,%ymm15,%ymm15 + vpxor %ymm1,%ymm12,%ymm12 + vpxor %ymm2,%ymm13,%ymm13 + vpxor %ymm3,%ymm14,%ymm14 + vprold $16,%ymm15,%ymm15 + vprold $16,%ymm12,%ymm12 + vprold $16,%ymm13,%ymm13 + vprold $16,%ymm14,%ymm14 + vpaddd %ymm15,%ymm10,%ymm10 + vpaddd %ymm12,%ymm11,%ymm11 + vpaddd %ymm13,%ymm8,%ymm8 + vpaddd %ymm14,%ymm9,%ymm9 + vpxor %ymm10,%ymm5,%ymm5 + vpxor %ymm11,%ymm6,%ymm6 + vpxor %ymm8,%ymm7,%ymm7 + vpxor %ymm9,%ymm4,%ymm4 + vprold $12,%ymm5,%ymm5 + vprold $12,%ymm6,%ymm6 + vprold $12,%ymm7,%ymm7 + vprold $12,%ymm4,%ymm4 + vpaddd %ymm5,%ymm0,%ymm0 + vpaddd %ymm6,%ymm1,%ymm1 + vpaddd %ymm7,%ymm2,%ymm2 + vpaddd %ymm4,%ymm3,%ymm3 + vpxor %ymm0,%ymm15,%ymm15 + vpxor %ymm1,%ymm12,%ymm12 + vpxor %ymm2,%ymm13,%ymm13 + vpxor %ymm3,%ymm14,%ymm14 + vprold $8,%ymm15,%ymm15 + vprold $8,%ymm12,%ymm12 + vprold $8,%ymm13,%ymm13 + vprold $8,%ymm14,%ymm14 + vpaddd %ymm15,%ymm10,%ymm10 + vpaddd %ymm12,%ymm11,%ymm11 + vpaddd %ymm13,%ymm8,%ymm8 + vpaddd %ymm14,%ymm9,%ymm9 + vpxor %ymm10,%ymm5,%ymm5 + vpxor %ymm11,%ymm6,%ymm6 + vpxor %ymm8,%ymm7,%ymm7 + vpxor %ymm9,%ymm4,%ymm4 + vprold $7,%ymm5,%ymm5 + vprold $7,%ymm6,%ymm6 + vprold $7,%ymm7,%ymm7 + vprold $7,%ymm4,%ymm4 + decl %eax + jnz .Loop8xvl + + vpaddd %ymm16,%ymm0,%ymm0 + vpaddd %ymm17,%ymm1,%ymm1 + vpaddd %ymm18,%ymm2,%ymm2 + vpaddd %ymm19,%ymm3,%ymm3 + + vpunpckldq %ymm1,%ymm0,%ymm18 + vpunpckldq %ymm3,%ymm2,%ymm19 + vpunpckhdq %ymm1,%ymm0,%ymm0 + vpunpckhdq %ymm3,%ymm2,%ymm2 + vpunpcklqdq %ymm19,%ymm18,%ymm1 + vpunpckhqdq %ymm19,%ymm18,%ymm18 + vpunpcklqdq %ymm2,%ymm0,%ymm3 + vpunpckhqdq %ymm2,%ymm0,%ymm0 + vpaddd %ymm20,%ymm4,%ymm4 + vpaddd %ymm21,%ymm5,%ymm5 + vpaddd %ymm22,%ymm6,%ymm6 + vpaddd %ymm23,%ymm7,%ymm7 + + vpunpckldq %ymm5,%ymm4,%ymm2 + vpunpckldq %ymm7,%ymm6,%ymm19 + vpunpckhdq %ymm5,%ymm4,%ymm4 + vpunpckhdq %ymm7,%ymm6,%ymm6 + vpunpcklqdq %ymm19,%ymm2,%ymm5 + vpunpckhqdq %ymm19,%ymm2,%ymm2 + vpunpcklqdq %ymm6,%ymm4,%ymm7 + vpunpckhqdq %ymm6,%ymm4,%ymm4 + vshufi32x4 $0,%ymm5,%ymm1,%ymm19 + vshufi32x4 $3,%ymm5,%ymm1,%ymm5 + vshufi32x4 $0,%ymm2,%ymm18,%ymm1 + vshufi32x4 $3,%ymm2,%ymm18,%ymm2 + vshufi32x4 $0,%ymm7,%ymm3,%ymm18 + vshufi32x4 $3,%ymm7,%ymm3,%ymm7 + vshufi32x4 $0,%ymm4,%ymm0,%ymm3 + vshufi32x4 $3,%ymm4,%ymm0,%ymm4 + vpaddd %ymm24,%ymm8,%ymm8 + vpaddd %ymm25,%ymm9,%ymm9 + vpaddd %ymm26,%ymm10,%ymm10 + vpaddd %ymm27,%ymm11,%ymm11 + + vpunpckldq %ymm9,%ymm8,%ymm6 + vpunpckldq %ymm11,%ymm10,%ymm0 + vpunpckhdq %ymm9,%ymm8,%ymm8 + vpunpckhdq %ymm11,%ymm10,%ymm10 + vpunpcklqdq %ymm0,%ymm6,%ymm9 + vpunpckhqdq %ymm0,%ymm6,%ymm6 + vpunpcklqdq %ymm10,%ymm8,%ymm11 + vpunpckhqdq %ymm10,%ymm8,%ymm8 + vpaddd %ymm28,%ymm12,%ymm12 + vpaddd %ymm29,%ymm13,%ymm13 + vpaddd %ymm30,%ymm14,%ymm14 + vpaddd %ymm31,%ymm15,%ymm15 + + vpunpckldq %ymm13,%ymm12,%ymm10 + vpunpckldq %ymm15,%ymm14,%ymm0 + vpunpckhdq %ymm13,%ymm12,%ymm12 + vpunpckhdq %ymm15,%ymm14,%ymm14 + vpunpcklqdq %ymm0,%ymm10,%ymm13 + vpunpckhqdq %ymm0,%ymm10,%ymm10 + vpunpcklqdq %ymm14,%ymm12,%ymm15 + vpunpckhqdq %ymm14,%ymm12,%ymm12 + vperm2i128 
$0x20,%ymm13,%ymm9,%ymm0 + vperm2i128 $0x31,%ymm13,%ymm9,%ymm13 + vperm2i128 $0x20,%ymm10,%ymm6,%ymm9 + vperm2i128 $0x31,%ymm10,%ymm6,%ymm10 + vperm2i128 $0x20,%ymm15,%ymm11,%ymm6 + vperm2i128 $0x31,%ymm15,%ymm11,%ymm15 + vperm2i128 $0x20,%ymm12,%ymm8,%ymm11 + vperm2i128 $0x31,%ymm12,%ymm8,%ymm12 + cmpq $512,%rdx + jb .Ltail8xvl + + movl $0x80,%eax + vpxord 0(%rsi),%ymm19,%ymm19 + vpxor 32(%rsi),%ymm0,%ymm0 + vpxor 64(%rsi),%ymm5,%ymm5 + vpxor 96(%rsi),%ymm13,%ymm13 + leaq (%rsi,%rax,1),%rsi + vmovdqu32 %ymm19,0(%rdi) + vmovdqu %ymm0,32(%rdi) + vmovdqu %ymm5,64(%rdi) + vmovdqu %ymm13,96(%rdi) + leaq (%rdi,%rax,1),%rdi + + vpxor 0(%rsi),%ymm1,%ymm1 + vpxor 32(%rsi),%ymm9,%ymm9 + vpxor 64(%rsi),%ymm2,%ymm2 + vpxor 96(%rsi),%ymm10,%ymm10 + leaq (%rsi,%rax,1),%rsi + vmovdqu %ymm1,0(%rdi) + vmovdqu %ymm9,32(%rdi) + vmovdqu %ymm2,64(%rdi) + vmovdqu %ymm10,96(%rdi) + leaq (%rdi,%rax,1),%rdi + + vpxord 0(%rsi),%ymm18,%ymm18 + vpxor 32(%rsi),%ymm6,%ymm6 + vpxor 64(%rsi),%ymm7,%ymm7 + vpxor 96(%rsi),%ymm15,%ymm15 + leaq (%rsi,%rax,1),%rsi + vmovdqu32 %ymm18,0(%rdi) + vmovdqu %ymm6,32(%rdi) + vmovdqu %ymm7,64(%rdi) + vmovdqu %ymm15,96(%rdi) + leaq (%rdi,%rax,1),%rdi + + vpxor 0(%rsi),%ymm3,%ymm3 + vpxor 32(%rsi),%ymm11,%ymm11 + vpxor 64(%rsi),%ymm4,%ymm4 + vpxor 96(%rsi),%ymm12,%ymm12 + leaq (%rsi,%rax,1),%rsi + vmovdqu %ymm3,0(%rdi) + vmovdqu %ymm11,32(%rdi) + vmovdqu %ymm4,64(%rdi) + vmovdqu %ymm12,96(%rdi) + leaq (%rdi,%rax,1),%rdi + + vpbroadcastd 0(%r10),%ymm0 + vpbroadcastd 4(%r10),%ymm1 + + subq $512,%rdx + jnz .Loop_outer8xvl + + jmp .Ldone8xvl + +.align 32 +.Ltail8xvl: + vmovdqa64 %ymm19,%ymm8 + xorq %r10,%r10 + subq %rsi,%rdi + cmpq $64,%rdx + jb .Less_than_64_8xvl + vpxor 0(%rsi),%ymm8,%ymm8 + vpxor 32(%rsi),%ymm0,%ymm0 + vmovdqu %ymm8,0(%rdi,%rsi,1) + vmovdqu %ymm0,32(%rdi,%rsi,1) + je .Ldone8xvl + vmovdqa %ymm5,%ymm8 + vmovdqa %ymm13,%ymm0 + leaq 64(%rsi),%rsi + + cmpq $128,%rdx + jb .Less_than_64_8xvl + vpxor 0(%rsi),%ymm5,%ymm5 + vpxor 32(%rsi),%ymm13,%ymm13 + vmovdqu %ymm5,0(%rdi,%rsi,1) + vmovdqu %ymm13,32(%rdi,%rsi,1) + je .Ldone8xvl + vmovdqa %ymm1,%ymm8 + vmovdqa %ymm9,%ymm0 + leaq 64(%rsi),%rsi + + cmpq $192,%rdx + jb .Less_than_64_8xvl + vpxor 0(%rsi),%ymm1,%ymm1 + vpxor 32(%rsi),%ymm9,%ymm9 + vmovdqu %ymm1,0(%rdi,%rsi,1) + vmovdqu %ymm9,32(%rdi,%rsi,1) + je .Ldone8xvl + vmovdqa %ymm2,%ymm8 + vmovdqa %ymm10,%ymm0 + leaq 64(%rsi),%rsi + + cmpq $256,%rdx + jb .Less_than_64_8xvl + vpxor 0(%rsi),%ymm2,%ymm2 + vpxor 32(%rsi),%ymm10,%ymm10 + vmovdqu %ymm2,0(%rdi,%rsi,1) + vmovdqu %ymm10,32(%rdi,%rsi,1) + je .Ldone8xvl + vmovdqa32 %ymm18,%ymm8 + vmovdqa %ymm6,%ymm0 + leaq 64(%rsi),%rsi + + cmpq $320,%rdx + jb .Less_than_64_8xvl + vpxord 0(%rsi),%ymm18,%ymm18 + vpxor 32(%rsi),%ymm6,%ymm6 + vmovdqu32 %ymm18,0(%rdi,%rsi,1) + vmovdqu %ymm6,32(%rdi,%rsi,1) + je .Ldone8xvl + vmovdqa %ymm7,%ymm8 + vmovdqa %ymm15,%ymm0 + leaq 64(%rsi),%rsi + + cmpq $384,%rdx + jb .Less_than_64_8xvl + vpxor 0(%rsi),%ymm7,%ymm7 + vpxor 32(%rsi),%ymm15,%ymm15 + vmovdqu %ymm7,0(%rdi,%rsi,1) + vmovdqu %ymm15,32(%rdi,%rsi,1) + je .Ldone8xvl + vmovdqa %ymm3,%ymm8 + vmovdqa %ymm11,%ymm0 + leaq 64(%rsi),%rsi + + cmpq $448,%rdx + jb .Less_than_64_8xvl + vpxor 0(%rsi),%ymm3,%ymm3 + vpxor 32(%rsi),%ymm11,%ymm11 + vmovdqu %ymm3,0(%rdi,%rsi,1) + vmovdqu %ymm11,32(%rdi,%rsi,1) + je .Ldone8xvl + vmovdqa %ymm4,%ymm8 + vmovdqa %ymm12,%ymm0 + leaq 64(%rsi),%rsi + +.Less_than_64_8xvl: + vmovdqa %ymm8,0(%rsp) + vmovdqa %ymm0,32(%rsp) + leaq (%rdi,%rsi,1),%rdi + andq $63,%rdx + +.Loop_tail8xvl: + movzbl (%rsi,%r10,1),%eax + movzbl 
(%rsp,%r10,1),%ecx
+	leaq	1(%r10),%r10
+	xorl	%ecx,%eax
+	movb	%al,-1(%rdi,%r10,1)
+	decq	%rdx
+	jnz	.Loop_tail8xvl
+
+	vpxor	%ymm8,%ymm8,%ymm8
+	vmovdqa	%ymm8,0(%rsp)
+	vmovdqa	%ymm8,32(%rsp)
+
+.Ldone8xvl:
+	vzeroall
+	leaq	(%r9),%rsp
+.cfi_def_cfa_register	%rsp
+.L8xvl_epilogue:
+	.byte	0xf3,0xc3
+.cfi_endproc
+.size	ChaCha20_8xvl,.-ChaCha20_8xvl

From patchwork Sat Oct 6 02:56:47 2018
X-Patchwork-Submitter: "Jason A. Donenfeld"
X-Patchwork-Id: 148304
From: "Jason A. Donenfeld"
To: linux-kernel@vger.kernel.org, netdev@vger.kernel.org, davem@davemloft.net,
    gregkh@linuxfoundation.org
Cc: "Jason A. Donenfeld", Samuel Neves, Thomas Gleixner, Ingo Molnar,
    x86@kernel.org, Jean-Philippe Aumasson, Andy Lutomirski, Andrew Morton,
    Linus Torvalds, kernel-hardening@lists.openwall.com,
    linux-crypto@vger.kernel.org
Subject: [PATCH net-next v7 06/28] zinc: ChaCha20 x86_64 implementation
Date: Sat, 6 Oct 2018 04:56:47 +0200
Message-Id: <20181006025709.4019-7-Jason@zx2c4.com>
In-Reply-To: <20181006025709.4019-1-Jason@zx2c4.com>
References: <20181006025709.4019-1-Jason@zx2c4.com>

This ports SSSE3, AVX-2, AVX-512F, and AVX-512VL implementations for
ChaCha20. The AVX-512F implementation is disabled on Skylake due to
throttling, and the AVX-512VL ymm implementation is used there instead.

These come from Andy Polyakov's implementation, with the following
modifications from Samuel Neves:

 - Some cosmetic changes, like renaming labels to .Lname, adjusting
   constants, and following other Linux conventions.

 - CPU feature checking is done in C by the glue code, so it has been
   removed from the assembly.

 - Eliminated the translation of certain instructions, such as pshufb,
   palignr, and vprotd, into .byte directives. That translation is meant
   for compatibility with ancient toolchains, but it is presumably
   unnecessary here, since the build system already checks what GNU as
   can assemble.

 - When aligning the stack, the original code saved %rsp to %r9. To keep
   objtool happy, we instead use the DRAP idiom, saving %rsp to %r10:

       leaq 8(%rsp),%r10
       ... code here ...
       leaq -8(%r10),%rsp

 - The original code assumes the stack comes aligned to 16 bytes. This
   is not necessarily the case, so to avoid crashes,
   `andq $-alignment, %rsp` was added to the prolog of a few functions.

 - The original hardcodes returns as .byte 0xf3,0xc3, aka "rep ret". We
   replace this with "ret". "rep ret" was meant to help with AMD K8
   chips, cf. http://repzret.org/p/repzret. It makes no sense to keep
   this kludge in code that won't even run on ancient AMD chips.
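Concretely, the prologue/epilogue shape that results from the last three
points, as it appears in for instance chacha20_ssse3 in the diff below
(the frame size and alignment mask vary per function), is:

	leaq	8(%rsp),%r10	# DRAP: %r10 = %rsp at entry, plus 8
	subq	$64+8,%rsp	# allocate the local frame
	andq	$-32,%rsp	# realign; old %rsp is recoverable from %r10
	# ... function body uses the aligned frame ...
	leaq	-8(%r10),%rsp	# restore the caller's %rsp
	ret			# plain ret instead of "rep ret"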
Cycle counts on a Core i7 6700HQ using the AVX-2 codepath, comparing
this implementation ("new") to the implementation in the current crypto
api ("old"):

  size    old    new
  ----   ----   ----
     0     62     52
    16    414    376
    32    410    400
    48    414    422
    64    362    356
    80    714    666
    96    714    700
   112    712    718
   128    692    646
   144   1042    674
   160   1042    694
   176   1042    726
   192   1018    650
   208   1366    686
   224   1366    696
   240   1366    722
   256    640    656
   272    988   1246
   288    988   1276
   304    992   1296
   320    972   1222
   336   1318   1256
   352   1318   1276
   368   1316   1294
   384   1294   1218
   400   1642   1258
   416   1642   1282
   432   1642   1302
   448   1628   1224
   464   1970   1258
   480   1970   1280
   496   1970   1300
   512    656    676
   528   1010   1290
   544   1010   1306
   560   1010   1332
   576    986   1254
   592   1340   1284
   608   1334   1310
   624   1340   1334
   640   1314   1254
   656   1664   1282
   672   1674   1306
   688   1662   1336
   704   1638   1250
   720   1992   1292
   736   1994   1308
   752   1988   1334
   768   1252   1254
   784   1596   1290
   800   1596   1314
   816   1596   1330
   832   1576   1256
   848   1922   1286
   864   1922   1314
   880   1926   1338
   896   1898   1258
   912   2248   1288
   928   2248   1320
   944   2248   1338
   960   2226   1268
   976   2574   1288
   992   2576   1312
  1008   2574   1340

Cycle counts on a Xeon Gold 5120 using the AVX-512 codepath:

  size    old    new
  ----   ----   ----
     0     64     54
    16    386    372
    32    388    396
    48    388    420
    64    366    350
    80    708    666
    96    708    692
   112    706    736
   128    692    648
   144   1036    682
   160   1036    708
   176   1036    730
   192   1016    658
   208   1360    684
   224   1362    708
   240   1360    732
   256    644    500
   272    990    526
   288    988    556
   304    988    576
   320    972    500
   336   1314    532
   352   1316    558
   368   1318    578
   384   1308    506
   400   1644    532
   416   1644    556
   432   1644    594
   448   1624    508
   464   1970    534
   480   1970    556
   496   1968    582
   512    660    624
   528   1016    682
   544   1016    702
   560   1018    728
   576    998    654
   592   1344    680
   608   1344    708
   624   1344    730
   640   1326    654
   656   1670    686
   672   1670    708
   688   1670    732
   704   1652    658
   720   1998    682
   736   1998    710
   752   1996    734
   768   1256    662
   784   1606    688
   800   1606    714
   816   1606    736
   832   1584    660
   848   1948    688
   864   1950    714
   880   1948    736
   896   1912    688
   912   2258    718
   928   2258    744
   944   2256    768
   960   2238    692
   976   2584    718
   992   2584    744
  1008   2584    770
Signed-off-by: Jason A. Donenfeld
Signed-off-by: Samuel Neves
Co-developed-by: Samuel Neves
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: x86@kernel.org
Cc: Jean-Philippe Aumasson
Cc: Andy Lutomirski
Cc: Greg KH
Cc: Andrew Morton
Cc: Linus Torvalds
Cc: kernel-hardening@lists.openwall.com
Cc: linux-crypto@vger.kernel.org
---
 lib/zinc/Makefile                             |    1 +
 lib/zinc/chacha20/chacha20-x86_64-glue.c      |  103 ++
 ...-x86_64-cryptogams.S => chacha20-x86_64.S} | 1557 ++++-------------
 lib/zinc/chacha20/chacha20.c                  |    4 +
 4 files changed, 486 insertions(+), 1179 deletions(-)
 create mode 100644 lib/zinc/chacha20/chacha20-x86_64-glue.c
 rename lib/zinc/chacha20/{chacha20-x86_64-cryptogams.S => chacha20-x86_64.S} (71%)

--
2.19.0

diff --git a/lib/zinc/Makefile b/lib/zinc/Makefile
index 3d80144d55a6..223a0816c918 100644
--- a/lib/zinc/Makefile
+++ b/lib/zinc/Makefile
@@ -3,4 +3,5 @@ ccflags-y += -D'pr_fmt(fmt)="zinc: " fmt'
 ccflags-$(CONFIG_ZINC_DEBUG) += -DDEBUG
 zinc_chacha20-y := chacha20/chacha20.o
+zinc_chacha20-$(CONFIG_ZINC_ARCH_X86_64) += chacha20/chacha20-x86_64.o
 obj-$(CONFIG_ZINC_CHACHA20) += zinc_chacha20.o
diff --git a/lib/zinc/chacha20/chacha20-x86_64-glue.c b/lib/zinc/chacha20/chacha20-x86_64-glue.c
new file mode 100644
index 000000000000..8629d5d420e6
--- /dev/null
+++ b/lib/zinc/chacha20/chacha20-x86_64-glue.c
@@ -0,0 +1,103 @@
+// SPDX-License-Identifier: GPL-2.0 OR MIT
+/*
+ * Copyright (C) 2015-2018 Jason A. Donenfeld. All Rights Reserved.
+ */
+
+#include
+#include
+#include
+#include
+
+asmlinkage void hchacha20_ssse3(u32 *derived_key, const u8 *nonce,
+				const u8 *key);
+asmlinkage void chacha20_ssse3(u8 *out, const u8 *in, const size_t len,
+			       const u32 key[8], const u32 counter[4]);
+asmlinkage void chacha20_avx2(u8 *out, const u8 *in, const size_t len,
+			      const u32 key[8], const u32 counter[4]);
+asmlinkage void chacha20_avx512(u8 *out, const u8 *in, const size_t len,
+				const u32 key[8], const u32 counter[4]);
+asmlinkage void chacha20_avx512vl(u8 *out, const u8 *in, const size_t len,
+				  const u32 key[8], const u32 counter[4]);
+
+static bool chacha20_use_ssse3 __ro_after_init;
+static bool chacha20_use_avx2 __ro_after_init;
+static bool chacha20_use_avx512 __ro_after_init;
+static bool chacha20_use_avx512vl __ro_after_init;
+static bool *const chacha20_nobs[] __initconst = {
+	&chacha20_use_ssse3, &chacha20_use_avx2, &chacha20_use_avx512,
+	&chacha20_use_avx512vl };
+
+static void __init chacha20_fpu_init(void)
+{
+	chacha20_use_ssse3 = boot_cpu_has(X86_FEATURE_SSSE3);
+	chacha20_use_avx2 =
+		boot_cpu_has(X86_FEATURE_AVX) &&
+		boot_cpu_has(X86_FEATURE_AVX2) &&
+		cpu_has_xfeatures(XFEATURE_MASK_SSE | XFEATURE_MASK_YMM, NULL);
+	chacha20_use_avx512 =
+		boot_cpu_has(X86_FEATURE_AVX) &&
+		boot_cpu_has(X86_FEATURE_AVX2) &&
+		boot_cpu_has(X86_FEATURE_AVX512F) &&
+		cpu_has_xfeatures(XFEATURE_MASK_SSE | XFEATURE_MASK_YMM |
+				  XFEATURE_MASK_AVX512, NULL) &&
+		/* Skylake downclocks unacceptably much when using zmm. */
+		boot_cpu_data.x86_model != INTEL_FAM6_SKYLAKE_X;
+	chacha20_use_avx512vl =
+		boot_cpu_has(X86_FEATURE_AVX) &&
+		boot_cpu_has(X86_FEATURE_AVX2) &&
+		boot_cpu_has(X86_FEATURE_AVX512F) &&
+		boot_cpu_has(X86_FEATURE_AVX512VL) &&
+		cpu_has_xfeatures(XFEATURE_MASK_SSE | XFEATURE_MASK_YMM |
+				  XFEATURE_MASK_AVX512, NULL);
+}
+
+static inline bool chacha20_arch(struct chacha20_ctx *ctx, u8 *dst,
+				 const u8 *src, size_t len,
+				 simd_context_t *simd_context)
+{
+	/* SIMD disables preemption, so relax after processing each page. */
+	BUILD_BUG_ON(PAGE_SIZE < CHACHA20_BLOCK_SIZE ||
+		     PAGE_SIZE % CHACHA20_BLOCK_SIZE);
+
+	if (!IS_ENABLED(CONFIG_AS_SSSE3) || !chacha20_use_ssse3 ||
+	    len <= CHACHA20_BLOCK_SIZE || !simd_use(simd_context))
+		return false;
+
+	for (;;) {
+		const size_t bytes = min_t(size_t, len, PAGE_SIZE);
+
+		if (IS_ENABLED(CONFIG_AS_AVX512) && chacha20_use_avx512 &&
+		    len >= CHACHA20_BLOCK_SIZE * 8)
+			chacha20_avx512(dst, src, bytes, ctx->key, ctx->counter);
+		else if (IS_ENABLED(CONFIG_AS_AVX512) && chacha20_use_avx512vl &&
+			 len >= CHACHA20_BLOCK_SIZE * 4)
+			chacha20_avx512vl(dst, src, bytes, ctx->key, ctx->counter);
+		else if (IS_ENABLED(CONFIG_AS_AVX2) && chacha20_use_avx2 &&
+			 len >= CHACHA20_BLOCK_SIZE * 4)
+			chacha20_avx2(dst, src, bytes, ctx->key, ctx->counter);
+		else
+			chacha20_ssse3(dst, src, bytes, ctx->key, ctx->counter);
+		ctx->counter[0] += (bytes + 63) / 64;
+		len -= bytes;
+		if (!len)
+			break;
+		dst += bytes;
+		src += bytes;
+		simd_relax(simd_context);
+	}
+
+	return true;
+}
+
+static inline bool hchacha20_arch(u32 derived_key[CHACHA20_KEY_WORDS],
+				  const u8 nonce[HCHACHA20_NONCE_SIZE],
+				  const u8 key[HCHACHA20_KEY_SIZE],
+				  simd_context_t *simd_context)
+{
+	if (IS_ENABLED(CONFIG_AS_SSSE3) && chacha20_use_ssse3 &&
+	    simd_use(simd_context)) {
+		hchacha20_ssse3(derived_key, nonce, key);
+		return true;
+	}
+	return false;
+}
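As an aside for reviewers: chacha20_arch() above is consumed by the
generic wrapper in chacha20.c (see the hunk at the end of this patch),
which tries the SIMD path first and falls back to the portable code. A
minimal sketch of that pattern follows; chacha20_generic() is a stand-in
name here and its prototype is illustrative, not the exact one from
chacha20.c:

	/* Sketch only, assuming zinc's chacha20 and simd headers. */
	static void chacha20_generic(struct chacha20_ctx *ctx, u8 *dst,
				     const u8 *src, u32 len);

	void chacha20(struct chacha20_ctx *ctx, u8 *dst, const u8 *src,
		      u32 len, simd_context_t *simd_context)
	{
		/* chacha20_arch() returns false when CPU features, input
		 * length, or the SIMD context make it unsuitable. */
		if (!chacha20_arch(ctx, dst, src, len, simd_context))
			chacha20_generic(ctx, dst, src, len);
	}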
diff --git a/lib/zinc/chacha20/chacha20-x86_64-cryptogams.S b/lib/zinc/chacha20/chacha20-x86_64.S
similarity index 71%
rename from lib/zinc/chacha20/chacha20-x86_64-cryptogams.S
rename to lib/zinc/chacha20/chacha20-x86_64.S
index 2bfc76f7e01f..3d10c7f21642 100644
--- a/lib/zinc/chacha20/chacha20-x86_64-cryptogams.S
+++ b/lib/zinc/chacha20/chacha20-x86_64.S
@@ -1,351 +1,148 @@
 /* SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause */
 /*
+ * Copyright (C) 2017 Samuel Neves. All Rights Reserved.
+ * Copyright (C) 2015-2018 Jason A. Donenfeld. All Rights Reserved.
  * Copyright (C) 2006-2017 CRYPTOGAMS. All Rights Reserved.
+ *
+ * This is based in part on Andy Polyakov's implementation from CRYPTOGAMS.
*/ -.text +#include - - -.align 64 +.section .rodata.cst16.Lzero, "aM", @progbits, 16 +.align 16 .Lzero: .long 0,0,0,0 +.section .rodata.cst16.Lone, "aM", @progbits, 16 +.align 16 .Lone: .long 1,0,0,0 +.section .rodata.cst16.Linc, "aM", @progbits, 16 +.align 16 .Linc: .long 0,1,2,3 +.section .rodata.cst16.Lfour, "aM", @progbits, 16 +.align 16 .Lfour: .long 4,4,4,4 +.section .rodata.cst32.Lincy, "aM", @progbits, 32 +.align 32 .Lincy: .long 0,2,4,6,1,3,5,7 +.section .rodata.cst32.Leight, "aM", @progbits, 32 +.align 32 .Leight: .long 8,8,8,8,8,8,8,8 +.section .rodata.cst16.Lrot16, "aM", @progbits, 16 +.align 16 .Lrot16: .byte 0x2,0x3,0x0,0x1, 0x6,0x7,0x4,0x5, 0xa,0xb,0x8,0x9, 0xe,0xf,0xc,0xd +.section .rodata.cst16.Lrot24, "aM", @progbits, 16 +.align 16 .Lrot24: .byte 0x3,0x0,0x1,0x2, 0x7,0x4,0x5,0x6, 0xb,0x8,0x9,0xa, 0xf,0xc,0xd,0xe -.Ltwoy: -.long 2,0,0,0, 2,0,0,0 +.section .rodata.cst16.Lsigma, "aM", @progbits, 16 +.align 16 +.Lsigma: +.byte 101,120,112,97,110,100,32,51,50,45,98,121,116,101,32,107,0 +.section .rodata.cst64.Lzeroz, "aM", @progbits, 64 .align 64 .Lzeroz: .long 0,0,0,0, 1,0,0,0, 2,0,0,0, 3,0,0,0 +.section .rodata.cst64.Lfourz, "aM", @progbits, 64 +.align 64 .Lfourz: .long 4,0,0,0, 4,0,0,0, 4,0,0,0, 4,0,0,0 +.section .rodata.cst64.Lincz, "aM", @progbits, 64 +.align 64 .Lincz: .long 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 +.section .rodata.cst64.Lsixteen, "aM", @progbits, 64 +.align 64 .Lsixteen: .long 16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16 -.Lsigma: -.byte 101,120,112,97,110,100,32,51,50,45,98,121,116,101,32,107,0 -.byte 67,104,97,67,104,97,50,48,32,102,111,114,32,120,56,54,95,54,52,44,32,67,82,89,80,84,79,71,65,77,83,32,98,121,32,60,97,112,112,114,111,64,111,112,101,110,115,115,108,46,111,114,103,62,0 -.globl ChaCha20_ctr32 -.type ChaCha20_ctr32,@function +.section .rodata.cst32.Ltwoy, "aM", @progbits, 32 .align 64 -ChaCha20_ctr32: -.cfi_startproc - cmpq $0,%rdx - je .Lno_data - movq OPENSSL_ia32cap_P+4(%rip),%r10 - btq $48,%r10 - jc .LChaCha20_avx512 - testq %r10,%r10 - js .LChaCha20_avx512vl - testl $512,%r10d - jnz .LChaCha20_ssse3 - - pushq %rbx -.cfi_adjust_cfa_offset 8 -.cfi_offset %rbx,-16 - pushq %rbp -.cfi_adjust_cfa_offset 8 -.cfi_offset %rbp,-24 - pushq %r12 -.cfi_adjust_cfa_offset 8 -.cfi_offset %r12,-32 - pushq %r13 -.cfi_adjust_cfa_offset 8 -.cfi_offset %r13,-40 - pushq %r14 -.cfi_adjust_cfa_offset 8 -.cfi_offset %r14,-48 - pushq %r15 -.cfi_adjust_cfa_offset 8 -.cfi_offset %r15,-56 - subq $64+24,%rsp -.cfi_adjust_cfa_offset 64+24 -.Lctr32_body: - - - movdqu (%rcx),%xmm1 - movdqu 16(%rcx),%xmm2 - movdqu (%r8),%xmm3 - movdqa .Lone(%rip),%xmm4 - +.Ltwoy: +.long 2,0,0,0, 2,0,0,0 - movdqa %xmm1,16(%rsp) - movdqa %xmm2,32(%rsp) - movdqa %xmm3,48(%rsp) - movq %rdx,%rbp - jmp .Loop_outer +.text +#ifdef CONFIG_AS_SSSE3 .align 32 -.Loop_outer: - movl $0x61707865,%eax - movl $0x3320646e,%ebx - movl $0x79622d32,%ecx - movl $0x6b206574,%edx - movl 16(%rsp),%r8d - movl 20(%rsp),%r9d - movl 24(%rsp),%r10d - movl 28(%rsp),%r11d - movd %xmm3,%r12d - movl 52(%rsp),%r13d - movl 56(%rsp),%r14d - movl 60(%rsp),%r15d - - movq %rbp,64+0(%rsp) - movl $10,%ebp - movq %rsi,64+8(%rsp) -.byte 102,72,15,126,214 - movq %rdi,64+16(%rsp) - movq %rsi,%rdi - shrq $32,%rdi - jmp .Loop +ENTRY(hchacha20_ssse3) + movdqa .Lsigma(%rip),%xmm0 + movdqu (%rdx),%xmm1 + movdqu 16(%rdx),%xmm2 + movdqu (%rsi),%xmm3 + movdqa .Lrot16(%rip),%xmm6 + movdqa .Lrot24(%rip),%xmm7 + movq $10,%r8 + .align 32 +.Loop_hssse3: + paddd %xmm1,%xmm0 + pxor %xmm0,%xmm3 + pshufb %xmm6,%xmm3 + paddd %xmm3,%xmm2 + pxor 
%xmm2,%xmm1 + movdqa %xmm1,%xmm4 + psrld $20,%xmm1 + pslld $12,%xmm4 + por %xmm4,%xmm1 + paddd %xmm1,%xmm0 + pxor %xmm0,%xmm3 + pshufb %xmm7,%xmm3 + paddd %xmm3,%xmm2 + pxor %xmm2,%xmm1 + movdqa %xmm1,%xmm4 + psrld $25,%xmm1 + pslld $7,%xmm4 + por %xmm4,%xmm1 + pshufd $78,%xmm2,%xmm2 + pshufd $57,%xmm1,%xmm1 + pshufd $147,%xmm3,%xmm3 + nop + paddd %xmm1,%xmm0 + pxor %xmm0,%xmm3 + pshufb %xmm6,%xmm3 + paddd %xmm3,%xmm2 + pxor %xmm2,%xmm1 + movdqa %xmm1,%xmm4 + psrld $20,%xmm1 + pslld $12,%xmm4 + por %xmm4,%xmm1 + paddd %xmm1,%xmm0 + pxor %xmm0,%xmm3 + pshufb %xmm7,%xmm3 + paddd %xmm3,%xmm2 + pxor %xmm2,%xmm1 + movdqa %xmm1,%xmm4 + psrld $25,%xmm1 + pslld $7,%xmm4 + por %xmm4,%xmm1 + pshufd $78,%xmm2,%xmm2 + pshufd $147,%xmm1,%xmm1 + pshufd $57,%xmm3,%xmm3 + decq %r8 + jnz .Loop_hssse3 + movdqu %xmm0,0(%rdi) + movdqu %xmm3,16(%rdi) + ret +ENDPROC(hchacha20_ssse3) .align 32 -.Loop: - addl %r8d,%eax - xorl %eax,%r12d - roll $16,%r12d - addl %r9d,%ebx - xorl %ebx,%r13d - roll $16,%r13d - addl %r12d,%esi - xorl %esi,%r8d - roll $12,%r8d - addl %r13d,%edi - xorl %edi,%r9d - roll $12,%r9d - addl %r8d,%eax - xorl %eax,%r12d - roll $8,%r12d - addl %r9d,%ebx - xorl %ebx,%r13d - roll $8,%r13d - addl %r12d,%esi - xorl %esi,%r8d - roll $7,%r8d - addl %r13d,%edi - xorl %edi,%r9d - roll $7,%r9d - movl %esi,32(%rsp) - movl %edi,36(%rsp) - movl 40(%rsp),%esi - movl 44(%rsp),%edi - addl %r10d,%ecx - xorl %ecx,%r14d - roll $16,%r14d - addl %r11d,%edx - xorl %edx,%r15d - roll $16,%r15d - addl %r14d,%esi - xorl %esi,%r10d - roll $12,%r10d - addl %r15d,%edi - xorl %edi,%r11d - roll $12,%r11d - addl %r10d,%ecx - xorl %ecx,%r14d - roll $8,%r14d - addl %r11d,%edx - xorl %edx,%r15d - roll $8,%r15d - addl %r14d,%esi - xorl %esi,%r10d - roll $7,%r10d - addl %r15d,%edi - xorl %edi,%r11d - roll $7,%r11d - addl %r9d,%eax - xorl %eax,%r15d - roll $16,%r15d - addl %r10d,%ebx - xorl %ebx,%r12d - roll $16,%r12d - addl %r15d,%esi - xorl %esi,%r9d - roll $12,%r9d - addl %r12d,%edi - xorl %edi,%r10d - roll $12,%r10d - addl %r9d,%eax - xorl %eax,%r15d - roll $8,%r15d - addl %r10d,%ebx - xorl %ebx,%r12d - roll $8,%r12d - addl %r15d,%esi - xorl %esi,%r9d - roll $7,%r9d - addl %r12d,%edi - xorl %edi,%r10d - roll $7,%r10d - movl %esi,40(%rsp) - movl %edi,44(%rsp) - movl 32(%rsp),%esi - movl 36(%rsp),%edi - addl %r11d,%ecx - xorl %ecx,%r13d - roll $16,%r13d - addl %r8d,%edx - xorl %edx,%r14d - roll $16,%r14d - addl %r13d,%esi - xorl %esi,%r11d - roll $12,%r11d - addl %r14d,%edi - xorl %edi,%r8d - roll $12,%r8d - addl %r11d,%ecx - xorl %ecx,%r13d - roll $8,%r13d - addl %r8d,%edx - xorl %edx,%r14d - roll $8,%r14d - addl %r13d,%esi - xorl %esi,%r11d - roll $7,%r11d - addl %r14d,%edi - xorl %edi,%r8d - roll $7,%r8d - decl %ebp - jnz .Loop - movl %edi,36(%rsp) - movl %esi,32(%rsp) - movq 64(%rsp),%rbp - movdqa %xmm2,%xmm1 - movq 64+8(%rsp),%rsi - paddd %xmm4,%xmm3 - movq 64+16(%rsp),%rdi - - addl $0x61707865,%eax - addl $0x3320646e,%ebx - addl $0x79622d32,%ecx - addl $0x6b206574,%edx - addl 16(%rsp),%r8d - addl 20(%rsp),%r9d - addl 24(%rsp),%r10d - addl 28(%rsp),%r11d - addl 48(%rsp),%r12d - addl 52(%rsp),%r13d - addl 56(%rsp),%r14d - addl 60(%rsp),%r15d - paddd 32(%rsp),%xmm1 - - cmpq $64,%rbp - jb .Ltail - - xorl 0(%rsi),%eax - xorl 4(%rsi),%ebx - xorl 8(%rsi),%ecx - xorl 12(%rsi),%edx - xorl 16(%rsi),%r8d - xorl 20(%rsi),%r9d - xorl 24(%rsi),%r10d - xorl 28(%rsi),%r11d - movdqu 32(%rsi),%xmm0 - xorl 48(%rsi),%r12d - xorl 52(%rsi),%r13d - xorl 56(%rsi),%r14d - xorl 60(%rsi),%r15d - leaq 64(%rsi),%rsi - pxor %xmm1,%xmm0 - - movdqa 
%xmm2,32(%rsp) - movd %xmm3,48(%rsp) - - movl %eax,0(%rdi) - movl %ebx,4(%rdi) - movl %ecx,8(%rdi) - movl %edx,12(%rdi) - movl %r8d,16(%rdi) - movl %r9d,20(%rdi) - movl %r10d,24(%rdi) - movl %r11d,28(%rdi) - movdqu %xmm0,32(%rdi) - movl %r12d,48(%rdi) - movl %r13d,52(%rdi) - movl %r14d,56(%rdi) - movl %r15d,60(%rdi) - leaq 64(%rdi),%rdi - - subq $64,%rbp - jnz .Loop_outer - - jmp .Ldone +ENTRY(chacha20_ssse3) +.Lchacha20_ssse3: + cmpq $0,%rdx + je .Lssse3_epilogue + leaq 8(%rsp),%r10 -.align 16 -.Ltail: - movl %eax,0(%rsp) - movl %ebx,4(%rsp) - xorq %rbx,%rbx - movl %ecx,8(%rsp) - movl %edx,12(%rsp) - movl %r8d,16(%rsp) - movl %r9d,20(%rsp) - movl %r10d,24(%rsp) - movl %r11d,28(%rsp) - movdqa %xmm1,32(%rsp) - movl %r12d,48(%rsp) - movl %r13d,52(%rsp) - movl %r14d,56(%rsp) - movl %r15d,60(%rsp) - -.Loop_tail: - movzbl (%rsi,%rbx,1),%eax - movzbl (%rsp,%rbx,1),%edx - leaq 1(%rbx),%rbx - xorl %edx,%eax - movb %al,-1(%rdi,%rbx,1) - decq %rbp - jnz .Loop_tail - -.Ldone: - leaq 64+24+48(%rsp),%rsi -.cfi_def_cfa %rsi,8 - movq -48(%rsi),%r15 -.cfi_restore %r15 - movq -40(%rsi),%r14 -.cfi_restore %r14 - movq -32(%rsi),%r13 -.cfi_restore %r13 - movq -24(%rsi),%r12 -.cfi_restore %r12 - movq -16(%rsi),%rbp -.cfi_restore %rbp - movq -8(%rsi),%rbx -.cfi_restore %rbx - leaq (%rsi),%rsp -.cfi_def_cfa_register %rsp -.Lno_data: - .byte 0xf3,0xc3 -.cfi_endproc -.size ChaCha20_ctr32,.-ChaCha20_ctr32 -.type ChaCha20_ssse3,@function -.align 32 -ChaCha20_ssse3: -.cfi_startproc -.LChaCha20_ssse3: - movq %rsp,%r9 -.cfi_def_cfa_register %r9 - testl $2048,%r10d - jnz .LChaCha20_4xop cmpq $128,%rdx - je .LChaCha20_128 - ja .LChaCha20_4x + ja .Lchacha20_4x .Ldo_sse3_after_all: subq $64+8,%rsp + andq $-32,%rsp movdqa .Lsigma(%rip),%xmm0 movdqu (%rcx),%xmm1 movdqu 16(%rcx),%xmm2 @@ -375,7 +172,7 @@ ChaCha20_ssse3: .Loop_ssse3: paddd %xmm1,%xmm0 pxor %xmm0,%xmm3 -.byte 102,15,56,0,222 + pshufb %xmm6,%xmm3 paddd %xmm3,%xmm2 pxor %xmm2,%xmm1 movdqa %xmm1,%xmm4 @@ -384,7 +181,7 @@ ChaCha20_ssse3: por %xmm4,%xmm1 paddd %xmm1,%xmm0 pxor %xmm0,%xmm3 -.byte 102,15,56,0,223 + pshufb %xmm7,%xmm3 paddd %xmm3,%xmm2 pxor %xmm2,%xmm1 movdqa %xmm1,%xmm4 @@ -397,7 +194,7 @@ ChaCha20_ssse3: nop paddd %xmm1,%xmm0 pxor %xmm0,%xmm3 -.byte 102,15,56,0,222 + pshufb %xmm6,%xmm3 paddd %xmm3,%xmm2 pxor %xmm2,%xmm1 movdqa %xmm1,%xmm4 @@ -406,7 +203,7 @@ ChaCha20_ssse3: por %xmm4,%xmm1 paddd %xmm1,%xmm0 pxor %xmm0,%xmm3 -.byte 102,15,56,0,223 + pshufb %xmm7,%xmm3 paddd %xmm3,%xmm2 pxor %xmm2,%xmm1 movdqa %xmm1,%xmm4 @@ -465,194 +262,24 @@ ChaCha20_ssse3: jnz .Loop_tail_ssse3 .Ldone_ssse3: - leaq (%r9),%rsp -.cfi_def_cfa_register %rsp -.Lssse3_epilogue: - .byte 0xf3,0xc3 -.cfi_endproc -.size ChaCha20_ssse3,.-ChaCha20_ssse3 -.type ChaCha20_128,@function -.align 32 -ChaCha20_128: -.cfi_startproc -.LChaCha20_128: - movq %rsp,%r9 -.cfi_def_cfa_register %r9 - subq $64+8,%rsp - movdqa .Lsigma(%rip),%xmm8 - movdqu (%rcx),%xmm9 - movdqu 16(%rcx),%xmm2 - movdqu (%r8),%xmm3 - movdqa .Lone(%rip),%xmm1 - movdqa .Lrot16(%rip),%xmm6 - movdqa .Lrot24(%rip),%xmm7 + leaq -8(%r10),%rsp - movdqa %xmm8,%xmm10 - movdqa %xmm8,0(%rsp) - movdqa %xmm9,%xmm11 - movdqa %xmm9,16(%rsp) - movdqa %xmm2,%xmm0 - movdqa %xmm2,32(%rsp) - paddd %xmm3,%xmm1 - movdqa %xmm3,48(%rsp) - movq $10,%r8 - jmp .Loop_128 - -.align 32 -.Loop_128: - paddd %xmm9,%xmm8 - pxor %xmm8,%xmm3 - paddd %xmm11,%xmm10 - pxor %xmm10,%xmm1 -.byte 102,15,56,0,222 -.byte 102,15,56,0,206 - paddd %xmm3,%xmm2 - paddd %xmm1,%xmm0 - pxor %xmm2,%xmm9 - pxor %xmm0,%xmm11 - movdqa %xmm9,%xmm4 - psrld $20,%xmm9 - 
movdqa %xmm11,%xmm5 - pslld $12,%xmm4 - psrld $20,%xmm11 - por %xmm4,%xmm9 - pslld $12,%xmm5 - por %xmm5,%xmm11 - paddd %xmm9,%xmm8 - pxor %xmm8,%xmm3 - paddd %xmm11,%xmm10 - pxor %xmm10,%xmm1 -.byte 102,15,56,0,223 -.byte 102,15,56,0,207 - paddd %xmm3,%xmm2 - paddd %xmm1,%xmm0 - pxor %xmm2,%xmm9 - pxor %xmm0,%xmm11 - movdqa %xmm9,%xmm4 - psrld $25,%xmm9 - movdqa %xmm11,%xmm5 - pslld $7,%xmm4 - psrld $25,%xmm11 - por %xmm4,%xmm9 - pslld $7,%xmm5 - por %xmm5,%xmm11 - pshufd $78,%xmm2,%xmm2 - pshufd $57,%xmm9,%xmm9 - pshufd $147,%xmm3,%xmm3 - pshufd $78,%xmm0,%xmm0 - pshufd $57,%xmm11,%xmm11 - pshufd $147,%xmm1,%xmm1 - paddd %xmm9,%xmm8 - pxor %xmm8,%xmm3 - paddd %xmm11,%xmm10 - pxor %xmm10,%xmm1 -.byte 102,15,56,0,222 -.byte 102,15,56,0,206 - paddd %xmm3,%xmm2 - paddd %xmm1,%xmm0 - pxor %xmm2,%xmm9 - pxor %xmm0,%xmm11 - movdqa %xmm9,%xmm4 - psrld $20,%xmm9 - movdqa %xmm11,%xmm5 - pslld $12,%xmm4 - psrld $20,%xmm11 - por %xmm4,%xmm9 - pslld $12,%xmm5 - por %xmm5,%xmm11 - paddd %xmm9,%xmm8 - pxor %xmm8,%xmm3 - paddd %xmm11,%xmm10 - pxor %xmm10,%xmm1 -.byte 102,15,56,0,223 -.byte 102,15,56,0,207 - paddd %xmm3,%xmm2 - paddd %xmm1,%xmm0 - pxor %xmm2,%xmm9 - pxor %xmm0,%xmm11 - movdqa %xmm9,%xmm4 - psrld $25,%xmm9 - movdqa %xmm11,%xmm5 - pslld $7,%xmm4 - psrld $25,%xmm11 - por %xmm4,%xmm9 - pslld $7,%xmm5 - por %xmm5,%xmm11 - pshufd $78,%xmm2,%xmm2 - pshufd $147,%xmm9,%xmm9 - pshufd $57,%xmm3,%xmm3 - pshufd $78,%xmm0,%xmm0 - pshufd $147,%xmm11,%xmm11 - pshufd $57,%xmm1,%xmm1 - decq %r8 - jnz .Loop_128 - paddd 0(%rsp),%xmm8 - paddd 16(%rsp),%xmm9 - paddd 32(%rsp),%xmm2 - paddd 48(%rsp),%xmm3 - paddd .Lone(%rip),%xmm1 - paddd 0(%rsp),%xmm10 - paddd 16(%rsp),%xmm11 - paddd 32(%rsp),%xmm0 - paddd 48(%rsp),%xmm1 - - movdqu 0(%rsi),%xmm4 - movdqu 16(%rsi),%xmm5 - pxor %xmm4,%xmm8 - movdqu 32(%rsi),%xmm4 - pxor %xmm5,%xmm9 - movdqu 48(%rsi),%xmm5 - pxor %xmm4,%xmm2 - movdqu 64(%rsi),%xmm4 - pxor %xmm5,%xmm3 - movdqu 80(%rsi),%xmm5 - pxor %xmm4,%xmm10 - movdqu 96(%rsi),%xmm4 - pxor %xmm5,%xmm11 - movdqu 112(%rsi),%xmm5 - pxor %xmm4,%xmm0 - pxor %xmm5,%xmm1 +.Lssse3_epilogue: + ret - movdqu %xmm8,0(%rdi) - movdqu %xmm9,16(%rdi) - movdqu %xmm2,32(%rdi) - movdqu %xmm3,48(%rdi) - movdqu %xmm10,64(%rdi) - movdqu %xmm11,80(%rdi) - movdqu %xmm0,96(%rdi) - movdqu %xmm1,112(%rdi) - leaq (%r9),%rsp -.cfi_def_cfa_register %rsp -.L128_epilogue: - .byte 0xf3,0xc3 -.cfi_endproc -.size ChaCha20_128,.-ChaCha20_128 -.type ChaCha20_4x,@function .align 32 -ChaCha20_4x: -.cfi_startproc -.LChaCha20_4x: - movq %rsp,%r9 -.cfi_def_cfa_register %r9 - movq %r10,%r11 - shrq $32,%r10 - testq $32,%r10 - jnz .LChaCha20_8x - cmpq $192,%rdx - ja .Lproceed4x - - andq $71303168,%r11 - cmpq $4194304,%r11 - je .Ldo_sse3_after_all +.Lchacha20_4x: + leaq 8(%rsp),%r10 .Lproceed4x: subq $0x140+8,%rsp + andq $-32,%rsp movdqa .Lsigma(%rip),%xmm11 movdqu (%rcx),%xmm15 movdqu 16(%rcx),%xmm7 movdqu (%r8),%xmm3 leaq 256(%rsp),%rcx - leaq .Lrot16(%rip),%r10 + leaq .Lrot16(%rip),%r9 leaq .Lrot24(%rip),%r11 pshufd $0x00,%xmm11,%xmm8 @@ -716,7 +343,7 @@ ChaCha20_4x: .Loop_enter4x: movdqa %xmm6,32(%rsp) movdqa %xmm7,48(%rsp) - movdqa (%r10),%xmm7 + movdqa (%r9),%xmm7 movl $10,%eax movdqa %xmm0,256-256(%rcx) jmp .Loop4x @@ -727,8 +354,8 @@ ChaCha20_4x: paddd %xmm13,%xmm9 pxor %xmm8,%xmm0 pxor %xmm9,%xmm1 -.byte 102,15,56,0,199 -.byte 102,15,56,0,207 + pshufb %xmm7,%xmm0 + pshufb %xmm7,%xmm1 paddd %xmm0,%xmm4 paddd %xmm1,%xmm5 pxor %xmm4,%xmm12 @@ -746,8 +373,8 @@ ChaCha20_4x: paddd %xmm13,%xmm9 pxor %xmm8,%xmm0 pxor %xmm9,%xmm1 -.byte 102,15,56,0,198 
-.byte 102,15,56,0,206 + pshufb %xmm6,%xmm0 + pshufb %xmm6,%xmm1 paddd %xmm0,%xmm4 paddd %xmm1,%xmm5 pxor %xmm4,%xmm12 @@ -759,7 +386,7 @@ ChaCha20_4x: pslld $7,%xmm13 por %xmm7,%xmm12 psrld $25,%xmm6 - movdqa (%r10),%xmm7 + movdqa (%r9),%xmm7 por %xmm6,%xmm13 movdqa %xmm4,0(%rsp) movdqa %xmm5,16(%rsp) @@ -769,8 +396,8 @@ ChaCha20_4x: paddd %xmm15,%xmm11 pxor %xmm10,%xmm2 pxor %xmm11,%xmm3 -.byte 102,15,56,0,215 -.byte 102,15,56,0,223 + pshufb %xmm7,%xmm2 + pshufb %xmm7,%xmm3 paddd %xmm2,%xmm4 paddd %xmm3,%xmm5 pxor %xmm4,%xmm14 @@ -788,8 +415,8 @@ ChaCha20_4x: paddd %xmm15,%xmm11 pxor %xmm10,%xmm2 pxor %xmm11,%xmm3 -.byte 102,15,56,0,214 -.byte 102,15,56,0,222 + pshufb %xmm6,%xmm2 + pshufb %xmm6,%xmm3 paddd %xmm2,%xmm4 paddd %xmm3,%xmm5 pxor %xmm4,%xmm14 @@ -801,14 +428,14 @@ ChaCha20_4x: pslld $7,%xmm15 por %xmm7,%xmm14 psrld $25,%xmm6 - movdqa (%r10),%xmm7 + movdqa (%r9),%xmm7 por %xmm6,%xmm15 paddd %xmm13,%xmm8 paddd %xmm14,%xmm9 pxor %xmm8,%xmm3 pxor %xmm9,%xmm0 -.byte 102,15,56,0,223 -.byte 102,15,56,0,199 + pshufb %xmm7,%xmm3 + pshufb %xmm7,%xmm0 paddd %xmm3,%xmm4 paddd %xmm0,%xmm5 pxor %xmm4,%xmm13 @@ -826,8 +453,8 @@ ChaCha20_4x: paddd %xmm14,%xmm9 pxor %xmm8,%xmm3 pxor %xmm9,%xmm0 -.byte 102,15,56,0,222 -.byte 102,15,56,0,198 + pshufb %xmm6,%xmm3 + pshufb %xmm6,%xmm0 paddd %xmm3,%xmm4 paddd %xmm0,%xmm5 pxor %xmm4,%xmm13 @@ -839,7 +466,7 @@ ChaCha20_4x: pslld $7,%xmm14 por %xmm7,%xmm13 psrld $25,%xmm6 - movdqa (%r10),%xmm7 + movdqa (%r9),%xmm7 por %xmm6,%xmm14 movdqa %xmm4,32(%rsp) movdqa %xmm5,48(%rsp) @@ -849,8 +476,8 @@ ChaCha20_4x: paddd %xmm12,%xmm11 pxor %xmm10,%xmm1 pxor %xmm11,%xmm2 -.byte 102,15,56,0,207 -.byte 102,15,56,0,215 + pshufb %xmm7,%xmm1 + pshufb %xmm7,%xmm2 paddd %xmm1,%xmm4 paddd %xmm2,%xmm5 pxor %xmm4,%xmm15 @@ -868,8 +495,8 @@ ChaCha20_4x: paddd %xmm12,%xmm11 pxor %xmm10,%xmm1 pxor %xmm11,%xmm2 -.byte 102,15,56,0,206 -.byte 102,15,56,0,214 + pshufb %xmm6,%xmm1 + pshufb %xmm6,%xmm2 paddd %xmm1,%xmm4 paddd %xmm2,%xmm5 pxor %xmm4,%xmm15 @@ -881,7 +508,7 @@ ChaCha20_4x: pslld $7,%xmm12 por %xmm7,%xmm15 psrld $25,%xmm6 - movdqa (%r10),%xmm7 + movdqa (%r9),%xmm7 por %xmm6,%xmm12 decl %eax jnz .Loop4x @@ -1035,7 +662,7 @@ ChaCha20_4x: jae .L64_or_more4x - xorq %r10,%r10 + xorq %r9,%r9 movdqa %xmm12,16(%rsp) movdqa %xmm4,32(%rsp) @@ -1060,7 +687,7 @@ ChaCha20_4x: movdqa 16(%rsp),%xmm6 leaq 64(%rsi),%rsi - xorq %r10,%r10 + xorq %r9,%r9 movdqa %xmm6,0(%rsp) movdqa %xmm13,16(%rsp) leaq 64(%rdi),%rdi @@ -1100,7 +727,7 @@ ChaCha20_4x: movdqa 32(%rsp),%xmm6 leaq 128(%rsi),%rsi - xorq %r10,%r10 + xorq %r9,%r9 movdqa %xmm6,0(%rsp) movdqa %xmm10,16(%rsp) leaq 128(%rdi),%rdi @@ -1155,7 +782,7 @@ ChaCha20_4x: movdqa 48(%rsp),%xmm6 leaq 64(%rsi),%rsi - xorq %r10,%r10 + xorq %r9,%r9 movdqa %xmm6,0(%rsp) movdqa %xmm15,16(%rsp) leaq 64(%rdi),%rdi @@ -1164,463 +791,41 @@ ChaCha20_4x: movdqa %xmm3,48(%rsp) .Loop_tail4x: - movzbl (%rsi,%r10,1),%eax - movzbl (%rsp,%r10,1),%ecx - leaq 1(%r10),%r10 + movzbl (%rsi,%r9,1),%eax + movzbl (%rsp,%r9,1),%ecx + leaq 1(%r9),%r9 xorl %ecx,%eax - movb %al,-1(%rdi,%r10,1) + movb %al,-1(%rdi,%r9,1) decq %rdx jnz .Loop_tail4x .Ldone4x: - leaq (%r9),%rsp -.cfi_def_cfa_register %rsp -.L4x_epilogue: - .byte 0xf3,0xc3 -.cfi_endproc -.size ChaCha20_4x,.-ChaCha20_4x -.type ChaCha20_4xop,@function -.align 32 -ChaCha20_4xop: -.cfi_startproc -.LChaCha20_4xop: - movq %rsp,%r9 -.cfi_def_cfa_register %r9 - subq $0x140+8,%rsp - vzeroupper + leaq -8(%r10),%rsp - vmovdqa .Lsigma(%rip),%xmm11 - vmovdqu (%rcx),%xmm3 - vmovdqu 16(%rcx),%xmm15 - vmovdqu (%r8),%xmm7 - 
leaq 256(%rsp),%rcx - - vpshufd $0x00,%xmm11,%xmm8 - vpshufd $0x55,%xmm11,%xmm9 - vmovdqa %xmm8,64(%rsp) - vpshufd $0xaa,%xmm11,%xmm10 - vmovdqa %xmm9,80(%rsp) - vpshufd $0xff,%xmm11,%xmm11 - vmovdqa %xmm10,96(%rsp) - vmovdqa %xmm11,112(%rsp) - - vpshufd $0x00,%xmm3,%xmm0 - vpshufd $0x55,%xmm3,%xmm1 - vmovdqa %xmm0,128-256(%rcx) - vpshufd $0xaa,%xmm3,%xmm2 - vmovdqa %xmm1,144-256(%rcx) - vpshufd $0xff,%xmm3,%xmm3 - vmovdqa %xmm2,160-256(%rcx) - vmovdqa %xmm3,176-256(%rcx) - - vpshufd $0x00,%xmm15,%xmm12 - vpshufd $0x55,%xmm15,%xmm13 - vmovdqa %xmm12,192-256(%rcx) - vpshufd $0xaa,%xmm15,%xmm14 - vmovdqa %xmm13,208-256(%rcx) - vpshufd $0xff,%xmm15,%xmm15 - vmovdqa %xmm14,224-256(%rcx) - vmovdqa %xmm15,240-256(%rcx) - - vpshufd $0x00,%xmm7,%xmm4 - vpshufd $0x55,%xmm7,%xmm5 - vpaddd .Linc(%rip),%xmm4,%xmm4 - vpshufd $0xaa,%xmm7,%xmm6 - vmovdqa %xmm5,272-256(%rcx) - vpshufd $0xff,%xmm7,%xmm7 - vmovdqa %xmm6,288-256(%rcx) - vmovdqa %xmm7,304-256(%rcx) - - jmp .Loop_enter4xop - -.align 32 -.Loop_outer4xop: - vmovdqa 64(%rsp),%xmm8 - vmovdqa 80(%rsp),%xmm9 - vmovdqa 96(%rsp),%xmm10 - vmovdqa 112(%rsp),%xmm11 - vmovdqa 128-256(%rcx),%xmm0 - vmovdqa 144-256(%rcx),%xmm1 - vmovdqa 160-256(%rcx),%xmm2 - vmovdqa 176-256(%rcx),%xmm3 - vmovdqa 192-256(%rcx),%xmm12 - vmovdqa 208-256(%rcx),%xmm13 - vmovdqa 224-256(%rcx),%xmm14 - vmovdqa 240-256(%rcx),%xmm15 - vmovdqa 256-256(%rcx),%xmm4 - vmovdqa 272-256(%rcx),%xmm5 - vmovdqa 288-256(%rcx),%xmm6 - vmovdqa 304-256(%rcx),%xmm7 - vpaddd .Lfour(%rip),%xmm4,%xmm4 - -.Loop_enter4xop: - movl $10,%eax - vmovdqa %xmm4,256-256(%rcx) - jmp .Loop4xop - -.align 32 -.Loop4xop: - vpaddd %xmm0,%xmm8,%xmm8 - vpaddd %xmm1,%xmm9,%xmm9 - vpaddd %xmm2,%xmm10,%xmm10 - vpaddd %xmm3,%xmm11,%xmm11 - vpxor %xmm4,%xmm8,%xmm4 - vpxor %xmm5,%xmm9,%xmm5 - vpxor %xmm6,%xmm10,%xmm6 - vpxor %xmm7,%xmm11,%xmm7 -.byte 143,232,120,194,228,16 -.byte 143,232,120,194,237,16 -.byte 143,232,120,194,246,16 -.byte 143,232,120,194,255,16 - vpaddd %xmm4,%xmm12,%xmm12 - vpaddd %xmm5,%xmm13,%xmm13 - vpaddd %xmm6,%xmm14,%xmm14 - vpaddd %xmm7,%xmm15,%xmm15 - vpxor %xmm0,%xmm12,%xmm0 - vpxor %xmm1,%xmm13,%xmm1 - vpxor %xmm14,%xmm2,%xmm2 - vpxor %xmm15,%xmm3,%xmm3 -.byte 143,232,120,194,192,12 -.byte 143,232,120,194,201,12 -.byte 143,232,120,194,210,12 -.byte 143,232,120,194,219,12 - vpaddd %xmm8,%xmm0,%xmm8 - vpaddd %xmm9,%xmm1,%xmm9 - vpaddd %xmm2,%xmm10,%xmm10 - vpaddd %xmm3,%xmm11,%xmm11 - vpxor %xmm4,%xmm8,%xmm4 - vpxor %xmm5,%xmm9,%xmm5 - vpxor %xmm6,%xmm10,%xmm6 - vpxor %xmm7,%xmm11,%xmm7 -.byte 143,232,120,194,228,8 -.byte 143,232,120,194,237,8 -.byte 143,232,120,194,246,8 -.byte 143,232,120,194,255,8 - vpaddd %xmm4,%xmm12,%xmm12 - vpaddd %xmm5,%xmm13,%xmm13 - vpaddd %xmm6,%xmm14,%xmm14 - vpaddd %xmm7,%xmm15,%xmm15 - vpxor %xmm0,%xmm12,%xmm0 - vpxor %xmm1,%xmm13,%xmm1 - vpxor %xmm14,%xmm2,%xmm2 - vpxor %xmm15,%xmm3,%xmm3 -.byte 143,232,120,194,192,7 -.byte 143,232,120,194,201,7 -.byte 143,232,120,194,210,7 -.byte 143,232,120,194,219,7 - vpaddd %xmm1,%xmm8,%xmm8 - vpaddd %xmm2,%xmm9,%xmm9 - vpaddd %xmm3,%xmm10,%xmm10 - vpaddd %xmm0,%xmm11,%xmm11 - vpxor %xmm7,%xmm8,%xmm7 - vpxor %xmm4,%xmm9,%xmm4 - vpxor %xmm5,%xmm10,%xmm5 - vpxor %xmm6,%xmm11,%xmm6 -.byte 143,232,120,194,255,16 -.byte 143,232,120,194,228,16 -.byte 143,232,120,194,237,16 -.byte 143,232,120,194,246,16 - vpaddd %xmm7,%xmm14,%xmm14 - vpaddd %xmm4,%xmm15,%xmm15 - vpaddd %xmm5,%xmm12,%xmm12 - vpaddd %xmm6,%xmm13,%xmm13 - vpxor %xmm1,%xmm14,%xmm1 - vpxor %xmm2,%xmm15,%xmm2 - vpxor %xmm12,%xmm3,%xmm3 - vpxor %xmm13,%xmm0,%xmm0 -.byte 
143,232,120,194,201,12 -.byte 143,232,120,194,210,12 -.byte 143,232,120,194,219,12 -.byte 143,232,120,194,192,12 - vpaddd %xmm8,%xmm1,%xmm8 - vpaddd %xmm9,%xmm2,%xmm9 - vpaddd %xmm3,%xmm10,%xmm10 - vpaddd %xmm0,%xmm11,%xmm11 - vpxor %xmm7,%xmm8,%xmm7 - vpxor %xmm4,%xmm9,%xmm4 - vpxor %xmm5,%xmm10,%xmm5 - vpxor %xmm6,%xmm11,%xmm6 -.byte 143,232,120,194,255,8 -.byte 143,232,120,194,228,8 -.byte 143,232,120,194,237,8 -.byte 143,232,120,194,246,8 - vpaddd %xmm7,%xmm14,%xmm14 - vpaddd %xmm4,%xmm15,%xmm15 - vpaddd %xmm5,%xmm12,%xmm12 - vpaddd %xmm6,%xmm13,%xmm13 - vpxor %xmm1,%xmm14,%xmm1 - vpxor %xmm2,%xmm15,%xmm2 - vpxor %xmm12,%xmm3,%xmm3 - vpxor %xmm13,%xmm0,%xmm0 -.byte 143,232,120,194,201,7 -.byte 143,232,120,194,210,7 -.byte 143,232,120,194,219,7 -.byte 143,232,120,194,192,7 - decl %eax - jnz .Loop4xop - - vpaddd 64(%rsp),%xmm8,%xmm8 - vpaddd 80(%rsp),%xmm9,%xmm9 - vpaddd 96(%rsp),%xmm10,%xmm10 - vpaddd 112(%rsp),%xmm11,%xmm11 - - vmovdqa %xmm14,32(%rsp) - vmovdqa %xmm15,48(%rsp) - - vpunpckldq %xmm9,%xmm8,%xmm14 - vpunpckldq %xmm11,%xmm10,%xmm15 - vpunpckhdq %xmm9,%xmm8,%xmm8 - vpunpckhdq %xmm11,%xmm10,%xmm10 - vpunpcklqdq %xmm15,%xmm14,%xmm9 - vpunpckhqdq %xmm15,%xmm14,%xmm14 - vpunpcklqdq %xmm10,%xmm8,%xmm11 - vpunpckhqdq %xmm10,%xmm8,%xmm8 - vpaddd 128-256(%rcx),%xmm0,%xmm0 - vpaddd 144-256(%rcx),%xmm1,%xmm1 - vpaddd 160-256(%rcx),%xmm2,%xmm2 - vpaddd 176-256(%rcx),%xmm3,%xmm3 - - vmovdqa %xmm9,0(%rsp) - vmovdqa %xmm14,16(%rsp) - vmovdqa 32(%rsp),%xmm9 - vmovdqa 48(%rsp),%xmm14 - - vpunpckldq %xmm1,%xmm0,%xmm10 - vpunpckldq %xmm3,%xmm2,%xmm15 - vpunpckhdq %xmm1,%xmm0,%xmm0 - vpunpckhdq %xmm3,%xmm2,%xmm2 - vpunpcklqdq %xmm15,%xmm10,%xmm1 - vpunpckhqdq %xmm15,%xmm10,%xmm10 - vpunpcklqdq %xmm2,%xmm0,%xmm3 - vpunpckhqdq %xmm2,%xmm0,%xmm0 - vpaddd 192-256(%rcx),%xmm12,%xmm12 - vpaddd 208-256(%rcx),%xmm13,%xmm13 - vpaddd 224-256(%rcx),%xmm9,%xmm9 - vpaddd 240-256(%rcx),%xmm14,%xmm14 - - vpunpckldq %xmm13,%xmm12,%xmm2 - vpunpckldq %xmm14,%xmm9,%xmm15 - vpunpckhdq %xmm13,%xmm12,%xmm12 - vpunpckhdq %xmm14,%xmm9,%xmm9 - vpunpcklqdq %xmm15,%xmm2,%xmm13 - vpunpckhqdq %xmm15,%xmm2,%xmm2 - vpunpcklqdq %xmm9,%xmm12,%xmm14 - vpunpckhqdq %xmm9,%xmm12,%xmm12 - vpaddd 256-256(%rcx),%xmm4,%xmm4 - vpaddd 272-256(%rcx),%xmm5,%xmm5 - vpaddd 288-256(%rcx),%xmm6,%xmm6 - vpaddd 304-256(%rcx),%xmm7,%xmm7 - - vpunpckldq %xmm5,%xmm4,%xmm9 - vpunpckldq %xmm7,%xmm6,%xmm15 - vpunpckhdq %xmm5,%xmm4,%xmm4 - vpunpckhdq %xmm7,%xmm6,%xmm6 - vpunpcklqdq %xmm15,%xmm9,%xmm5 - vpunpckhqdq %xmm15,%xmm9,%xmm9 - vpunpcklqdq %xmm6,%xmm4,%xmm7 - vpunpckhqdq %xmm6,%xmm4,%xmm4 - vmovdqa 0(%rsp),%xmm6 - vmovdqa 16(%rsp),%xmm15 - - cmpq $256,%rdx - jb .Ltail4xop - - vpxor 0(%rsi),%xmm6,%xmm6 - vpxor 16(%rsi),%xmm1,%xmm1 - vpxor 32(%rsi),%xmm13,%xmm13 - vpxor 48(%rsi),%xmm5,%xmm5 - vpxor 64(%rsi),%xmm15,%xmm15 - vpxor 80(%rsi),%xmm10,%xmm10 - vpxor 96(%rsi),%xmm2,%xmm2 - vpxor 112(%rsi),%xmm9,%xmm9 - leaq 128(%rsi),%rsi - vpxor 0(%rsi),%xmm11,%xmm11 - vpxor 16(%rsi),%xmm3,%xmm3 - vpxor 32(%rsi),%xmm14,%xmm14 - vpxor 48(%rsi),%xmm7,%xmm7 - vpxor 64(%rsi),%xmm8,%xmm8 - vpxor 80(%rsi),%xmm0,%xmm0 - vpxor 96(%rsi),%xmm12,%xmm12 - vpxor 112(%rsi),%xmm4,%xmm4 - leaq 128(%rsi),%rsi - - vmovdqu %xmm6,0(%rdi) - vmovdqu %xmm1,16(%rdi) - vmovdqu %xmm13,32(%rdi) - vmovdqu %xmm5,48(%rdi) - vmovdqu %xmm15,64(%rdi) - vmovdqu %xmm10,80(%rdi) - vmovdqu %xmm2,96(%rdi) - vmovdqu %xmm9,112(%rdi) - leaq 128(%rdi),%rdi - vmovdqu %xmm11,0(%rdi) - vmovdqu %xmm3,16(%rdi) - vmovdqu %xmm14,32(%rdi) - vmovdqu %xmm7,48(%rdi) - vmovdqu %xmm8,64(%rdi) - vmovdqu 
%xmm0,80(%rdi) - vmovdqu %xmm12,96(%rdi) - vmovdqu %xmm4,112(%rdi) - leaq 128(%rdi),%rdi - - subq $256,%rdx - jnz .Loop_outer4xop - - jmp .Ldone4xop - -.align 32 -.Ltail4xop: - cmpq $192,%rdx - jae .L192_or_more4xop - cmpq $128,%rdx - jae .L128_or_more4xop - cmpq $64,%rdx - jae .L64_or_more4xop - - xorq %r10,%r10 - vmovdqa %xmm6,0(%rsp) - vmovdqa %xmm1,16(%rsp) - vmovdqa %xmm13,32(%rsp) - vmovdqa %xmm5,48(%rsp) - jmp .Loop_tail4xop - -.align 32 -.L64_or_more4xop: - vpxor 0(%rsi),%xmm6,%xmm6 - vpxor 16(%rsi),%xmm1,%xmm1 - vpxor 32(%rsi),%xmm13,%xmm13 - vpxor 48(%rsi),%xmm5,%xmm5 - vmovdqu %xmm6,0(%rdi) - vmovdqu %xmm1,16(%rdi) - vmovdqu %xmm13,32(%rdi) - vmovdqu %xmm5,48(%rdi) - je .Ldone4xop - - leaq 64(%rsi),%rsi - vmovdqa %xmm15,0(%rsp) - xorq %r10,%r10 - vmovdqa %xmm10,16(%rsp) - leaq 64(%rdi),%rdi - vmovdqa %xmm2,32(%rsp) - subq $64,%rdx - vmovdqa %xmm9,48(%rsp) - jmp .Loop_tail4xop - -.align 32 -.L128_or_more4xop: - vpxor 0(%rsi),%xmm6,%xmm6 - vpxor 16(%rsi),%xmm1,%xmm1 - vpxor 32(%rsi),%xmm13,%xmm13 - vpxor 48(%rsi),%xmm5,%xmm5 - vpxor 64(%rsi),%xmm15,%xmm15 - vpxor 80(%rsi),%xmm10,%xmm10 - vpxor 96(%rsi),%xmm2,%xmm2 - vpxor 112(%rsi),%xmm9,%xmm9 - - vmovdqu %xmm6,0(%rdi) - vmovdqu %xmm1,16(%rdi) - vmovdqu %xmm13,32(%rdi) - vmovdqu %xmm5,48(%rdi) - vmovdqu %xmm15,64(%rdi) - vmovdqu %xmm10,80(%rdi) - vmovdqu %xmm2,96(%rdi) - vmovdqu %xmm9,112(%rdi) - je .Ldone4xop - - leaq 128(%rsi),%rsi - vmovdqa %xmm11,0(%rsp) - xorq %r10,%r10 - vmovdqa %xmm3,16(%rsp) - leaq 128(%rdi),%rdi - vmovdqa %xmm14,32(%rsp) - subq $128,%rdx - vmovdqa %xmm7,48(%rsp) - jmp .Loop_tail4xop +.L4x_epilogue: + ret +ENDPROC(chacha20_ssse3) +#endif /* CONFIG_AS_SSSE3 */ +#ifdef CONFIG_AS_AVX2 .align 32 -.L192_or_more4xop: - vpxor 0(%rsi),%xmm6,%xmm6 - vpxor 16(%rsi),%xmm1,%xmm1 - vpxor 32(%rsi),%xmm13,%xmm13 - vpxor 48(%rsi),%xmm5,%xmm5 - vpxor 64(%rsi),%xmm15,%xmm15 - vpxor 80(%rsi),%xmm10,%xmm10 - vpxor 96(%rsi),%xmm2,%xmm2 - vpxor 112(%rsi),%xmm9,%xmm9 - leaq 128(%rsi),%rsi - vpxor 0(%rsi),%xmm11,%xmm11 - vpxor 16(%rsi),%xmm3,%xmm3 - vpxor 32(%rsi),%xmm14,%xmm14 - vpxor 48(%rsi),%xmm7,%xmm7 - - vmovdqu %xmm6,0(%rdi) - vmovdqu %xmm1,16(%rdi) - vmovdqu %xmm13,32(%rdi) - vmovdqu %xmm5,48(%rdi) - vmovdqu %xmm15,64(%rdi) - vmovdqu %xmm10,80(%rdi) - vmovdqu %xmm2,96(%rdi) - vmovdqu %xmm9,112(%rdi) - leaq 128(%rdi),%rdi - vmovdqu %xmm11,0(%rdi) - vmovdqu %xmm3,16(%rdi) - vmovdqu %xmm14,32(%rdi) - vmovdqu %xmm7,48(%rdi) - je .Ldone4xop - - leaq 64(%rsi),%rsi - vmovdqa %xmm8,0(%rsp) - xorq %r10,%r10 - vmovdqa %xmm0,16(%rsp) - leaq 64(%rdi),%rdi - vmovdqa %xmm12,32(%rsp) - subq $192,%rdx - vmovdqa %xmm4,48(%rsp) - -.Loop_tail4xop: - movzbl (%rsi,%r10,1),%eax - movzbl (%rsp,%r10,1),%ecx - leaq 1(%r10),%r10 - xorl %ecx,%eax - movb %al,-1(%rdi,%r10,1) - decq %rdx - jnz .Loop_tail4xop +ENTRY(chacha20_avx2) +.Lchacha20_avx2: + cmpq $0,%rdx + je .L8x_epilogue + leaq 8(%rsp),%r10 -.Ldone4xop: - vzeroupper - leaq (%r9),%rsp -.cfi_def_cfa_register %rsp -.L4xop_epilogue: - .byte 0xf3,0xc3 -.cfi_endproc -.size ChaCha20_4xop,.-ChaCha20_4xop -.type ChaCha20_8x,@function -.align 32 -ChaCha20_8x: -.cfi_startproc -.LChaCha20_8x: - movq %rsp,%r9 -.cfi_def_cfa_register %r9 subq $0x280+8,%rsp andq $-32,%rsp vzeroupper - - - - - - - - - vbroadcasti128 .Lsigma(%rip),%ymm11 vbroadcasti128 (%rcx),%ymm3 vbroadcasti128 16(%rcx),%ymm15 vbroadcasti128 (%r8),%ymm7 leaq 256(%rsp),%rcx leaq 512(%rsp),%rax - leaq .Lrot16(%rip),%r10 + leaq .Lrot16(%rip),%r9 leaq .Lrot24(%rip),%r11 vpshufd $0x00,%ymm11,%ymm8 @@ -1684,7 +889,7 @@ ChaCha20_8x: .Loop_enter8x: 
vmovdqa %ymm14,64(%rsp) vmovdqa %ymm15,96(%rsp) - vbroadcasti128 (%r10),%ymm15 + vbroadcasti128 (%r9),%ymm15 vmovdqa %ymm4,512-512(%rax) movl $10,%eax jmp .Loop8x @@ -1719,7 +924,7 @@ ChaCha20_8x: vpslld $7,%ymm0,%ymm15 vpsrld $25,%ymm0,%ymm0 vpor %ymm0,%ymm15,%ymm0 - vbroadcasti128 (%r10),%ymm15 + vbroadcasti128 (%r9),%ymm15 vpaddd %ymm5,%ymm13,%ymm13 vpxor %ymm1,%ymm13,%ymm1 vpslld $7,%ymm1,%ymm14 @@ -1757,7 +962,7 @@ ChaCha20_8x: vpslld $7,%ymm2,%ymm15 vpsrld $25,%ymm2,%ymm2 vpor %ymm2,%ymm15,%ymm2 - vbroadcasti128 (%r10),%ymm15 + vbroadcasti128 (%r9),%ymm15 vpaddd %ymm7,%ymm13,%ymm13 vpxor %ymm3,%ymm13,%ymm3 vpslld $7,%ymm3,%ymm14 @@ -1791,7 +996,7 @@ ChaCha20_8x: vpslld $7,%ymm1,%ymm15 vpsrld $25,%ymm1,%ymm1 vpor %ymm1,%ymm15,%ymm1 - vbroadcasti128 (%r10),%ymm15 + vbroadcasti128 (%r9),%ymm15 vpaddd %ymm4,%ymm13,%ymm13 vpxor %ymm2,%ymm13,%ymm2 vpslld $7,%ymm2,%ymm14 @@ -1829,7 +1034,7 @@ ChaCha20_8x: vpslld $7,%ymm3,%ymm15 vpsrld $25,%ymm3,%ymm3 vpor %ymm3,%ymm15,%ymm3 - vbroadcasti128 (%r10),%ymm15 + vbroadcasti128 (%r9),%ymm15 vpaddd %ymm6,%ymm13,%ymm13 vpxor %ymm0,%ymm13,%ymm0 vpslld $7,%ymm0,%ymm14 @@ -1983,7 +1188,7 @@ ChaCha20_8x: cmpq $64,%rdx jae .L64_or_more8x - xorq %r10,%r10 + xorq %r9,%r9 vmovdqa %ymm6,0(%rsp) vmovdqa %ymm8,32(%rsp) jmp .Loop_tail8x @@ -1997,7 +1202,7 @@ ChaCha20_8x: je .Ldone8x leaq 64(%rsi),%rsi - xorq %r10,%r10 + xorq %r9,%r9 vmovdqa %ymm1,0(%rsp) leaq 64(%rdi),%rdi subq $64,%rdx @@ -2017,7 +1222,7 @@ ChaCha20_8x: je .Ldone8x leaq 128(%rsi),%rsi - xorq %r10,%r10 + xorq %r9,%r9 vmovdqa %ymm12,0(%rsp) leaq 128(%rdi),%rdi subq $128,%rdx @@ -2041,7 +1246,7 @@ ChaCha20_8x: je .Ldone8x leaq 192(%rsi),%rsi - xorq %r10,%r10 + xorq %r9,%r9 vmovdqa %ymm10,0(%rsp) leaq 192(%rdi),%rdi subq $192,%rdx @@ -2069,7 +1274,7 @@ ChaCha20_8x: je .Ldone8x leaq 256(%rsi),%rsi - xorq %r10,%r10 + xorq %r9,%r9 vmovdqa %ymm14,0(%rsp) leaq 256(%rdi),%rdi subq $256,%rdx @@ -2101,7 +1306,7 @@ ChaCha20_8x: je .Ldone8x leaq 320(%rsi),%rsi - xorq %r10,%r10 + xorq %r9,%r9 vmovdqa %ymm3,0(%rsp) leaq 320(%rdi),%rdi subq $320,%rdx @@ -2137,7 +1342,7 @@ ChaCha20_8x: je .Ldone8x leaq 384(%rsi),%rsi - xorq %r10,%r10 + xorq %r9,%r9 vmovdqa %ymm11,0(%rsp) leaq 384(%rdi),%rdi subq $384,%rdx @@ -2177,40 +1382,43 @@ ChaCha20_8x: je .Ldone8x leaq 448(%rsi),%rsi - xorq %r10,%r10 + xorq %r9,%r9 vmovdqa %ymm0,0(%rsp) leaq 448(%rdi),%rdi subq $448,%rdx vmovdqa %ymm4,32(%rsp) .Loop_tail8x: - movzbl (%rsi,%r10,1),%eax - movzbl (%rsp,%r10,1),%ecx - leaq 1(%r10),%r10 + movzbl (%rsi,%r9,1),%eax + movzbl (%rsp,%r9,1),%ecx + leaq 1(%r9),%r9 xorl %ecx,%eax - movb %al,-1(%rdi,%r10,1) + movb %al,-1(%rdi,%r9,1) decq %rdx jnz .Loop_tail8x .Ldone8x: vzeroall - leaq (%r9),%rsp -.cfi_def_cfa_register %rsp + leaq -8(%r10),%rsp + .L8x_epilogue: - .byte 0xf3,0xc3 -.cfi_endproc -.size ChaCha20_8x,.-ChaCha20_8x -.type ChaCha20_avx512,@function + ret +ENDPROC(chacha20_avx2) +#endif /* CONFIG_AS_AVX2 */ + +#ifdef CONFIG_AS_AVX512 .align 32 -ChaCha20_avx512: -.cfi_startproc -.LChaCha20_avx512: - movq %rsp,%r9 -.cfi_def_cfa_register %r9 +ENTRY(chacha20_avx512) +.Lchacha20_avx512: + cmpq $0,%rdx + je .Lavx512_epilogue + leaq 8(%rsp),%r10 + cmpq $512,%rdx - ja .LChaCha20_16x + ja .Lchacha20_16x subq $64+8,%rsp + andq $-64,%rsp vbroadcasti32x4 .Lsigma(%rip),%zmm0 vbroadcasti32x4 (%rcx),%zmm1 vbroadcasti32x4 16(%rcx),%zmm2 @@ -2385,181 +1593,25 @@ ChaCha20_avx512: decq %rdx jnz .Loop_tail_avx512 - vmovdqu32 %zmm16,0(%rsp) + vmovdqa32 %zmm16,0(%rsp) .Ldone_avx512: vzeroall - leaq (%r9),%rsp -.cfi_def_cfa_register %rsp 
-.Lavx512_epilogue: - .byte 0xf3,0xc3 -.cfi_endproc -.size ChaCha20_avx512,.-ChaCha20_avx512 -.type ChaCha20_avx512vl,@function -.align 32 -ChaCha20_avx512vl: -.cfi_startproc -.LChaCha20_avx512vl: - movq %rsp,%r9 -.cfi_def_cfa_register %r9 - cmpq $128,%rdx - ja .LChaCha20_8xvl - - subq $64+8,%rsp - vbroadcasti128 .Lsigma(%rip),%ymm0 - vbroadcasti128 (%rcx),%ymm1 - vbroadcasti128 16(%rcx),%ymm2 - vbroadcasti128 (%r8),%ymm3 + leaq -8(%r10),%rsp - vmovdqa32 %ymm0,%ymm16 - vmovdqa32 %ymm1,%ymm17 - vmovdqa32 %ymm2,%ymm18 - vpaddd .Lzeroz(%rip),%ymm3,%ymm3 - vmovdqa32 .Ltwoy(%rip),%ymm20 - movq $10,%r8 - vmovdqa32 %ymm3,%ymm19 - jmp .Loop_avx512vl - -.align 16 -.Loop_outer_avx512vl: - vmovdqa32 %ymm18,%ymm2 - vpaddd %ymm20,%ymm19,%ymm3 - movq $10,%r8 - vmovdqa32 %ymm3,%ymm19 - jmp .Loop_avx512vl +.Lavx512_epilogue: + ret .align 32 -.Loop_avx512vl: - vpaddd %ymm1,%ymm0,%ymm0 - vpxor %ymm0,%ymm3,%ymm3 - vprold $16,%ymm3,%ymm3 - vpaddd %ymm3,%ymm2,%ymm2 - vpxor %ymm2,%ymm1,%ymm1 - vprold $12,%ymm1,%ymm1 - vpaddd %ymm1,%ymm0,%ymm0 - vpxor %ymm0,%ymm3,%ymm3 - vprold $8,%ymm3,%ymm3 - vpaddd %ymm3,%ymm2,%ymm2 - vpxor %ymm2,%ymm1,%ymm1 - vprold $7,%ymm1,%ymm1 - vpshufd $78,%ymm2,%ymm2 - vpshufd $57,%ymm1,%ymm1 - vpshufd $147,%ymm3,%ymm3 - vpaddd %ymm1,%ymm0,%ymm0 - vpxor %ymm0,%ymm3,%ymm3 - vprold $16,%ymm3,%ymm3 - vpaddd %ymm3,%ymm2,%ymm2 - vpxor %ymm2,%ymm1,%ymm1 - vprold $12,%ymm1,%ymm1 - vpaddd %ymm1,%ymm0,%ymm0 - vpxor %ymm0,%ymm3,%ymm3 - vprold $8,%ymm3,%ymm3 - vpaddd %ymm3,%ymm2,%ymm2 - vpxor %ymm2,%ymm1,%ymm1 - vprold $7,%ymm1,%ymm1 - vpshufd $78,%ymm2,%ymm2 - vpshufd $147,%ymm1,%ymm1 - vpshufd $57,%ymm3,%ymm3 - decq %r8 - jnz .Loop_avx512vl - vpaddd %ymm16,%ymm0,%ymm0 - vpaddd %ymm17,%ymm1,%ymm1 - vpaddd %ymm18,%ymm2,%ymm2 - vpaddd %ymm19,%ymm3,%ymm3 - - subq $64,%rdx - jb .Ltail64_avx512vl - - vpxor 0(%rsi),%xmm0,%xmm4 - vpxor 16(%rsi),%xmm1,%xmm5 - vpxor 32(%rsi),%xmm2,%xmm6 - vpxor 48(%rsi),%xmm3,%xmm7 - leaq 64(%rsi),%rsi - - vmovdqu %xmm4,0(%rdi) - vmovdqu %xmm5,16(%rdi) - vmovdqu %xmm6,32(%rdi) - vmovdqu %xmm7,48(%rdi) - leaq 64(%rdi),%rdi - - jz .Ldone_avx512vl - - vextracti128 $1,%ymm0,%xmm4 - vextracti128 $1,%ymm1,%xmm5 - vextracti128 $1,%ymm2,%xmm6 - vextracti128 $1,%ymm3,%xmm7 - - subq $64,%rdx - jb .Ltail_avx512vl - - vpxor 0(%rsi),%xmm4,%xmm4 - vpxor 16(%rsi),%xmm5,%xmm5 - vpxor 32(%rsi),%xmm6,%xmm6 - vpxor 48(%rsi),%xmm7,%xmm7 - leaq 64(%rsi),%rsi +.Lchacha20_16x: + leaq 8(%rsp),%r10 - vmovdqu %xmm4,0(%rdi) - vmovdqu %xmm5,16(%rdi) - vmovdqu %xmm6,32(%rdi) - vmovdqu %xmm7,48(%rdi) - leaq 64(%rdi),%rdi - - vmovdqa32 %ymm16,%ymm0 - vmovdqa32 %ymm17,%ymm1 - jnz .Loop_outer_avx512vl - - jmp .Ldone_avx512vl - -.align 16 -.Ltail64_avx512vl: - vmovdqa %xmm0,0(%rsp) - vmovdqa %xmm1,16(%rsp) - vmovdqa %xmm2,32(%rsp) - vmovdqa %xmm3,48(%rsp) - addq $64,%rdx - jmp .Loop_tail_avx512vl - -.align 16 -.Ltail_avx512vl: - vmovdqa %xmm4,0(%rsp) - vmovdqa %xmm5,16(%rsp) - vmovdqa %xmm6,32(%rsp) - vmovdqa %xmm7,48(%rsp) - addq $64,%rdx - -.Loop_tail_avx512vl: - movzbl (%rsi,%r8,1),%eax - movzbl (%rsp,%r8,1),%ecx - leaq 1(%r8),%r8 - xorl %ecx,%eax - movb %al,-1(%rdi,%r8,1) - decq %rdx - jnz .Loop_tail_avx512vl - - vmovdqu32 %ymm16,0(%rsp) - vmovdqu32 %ymm16,32(%rsp) - -.Ldone_avx512vl: - vzeroall - leaq (%r9),%rsp -.cfi_def_cfa_register %rsp -.Lavx512vl_epilogue: - .byte 0xf3,0xc3 -.cfi_endproc -.size ChaCha20_avx512vl,.-ChaCha20_avx512vl -.type ChaCha20_16x,@function -.align 32 -ChaCha20_16x: -.cfi_startproc -.LChaCha20_16x: - movq %rsp,%r9 -.cfi_def_cfa_register %r9 subq $64+8,%rsp andq $-64,%rsp 
vzeroupper - leaq .Lsigma(%rip),%r10 - vbroadcasti32x4 (%r10),%zmm3 + leaq .Lsigma(%rip),%r9 + vbroadcasti32x4 (%r9),%zmm3 vbroadcasti32x4 (%rcx),%zmm7 vbroadcasti32x4 16(%rcx),%zmm11 vbroadcasti32x4 (%r8),%zmm15 @@ -2606,10 +1658,10 @@ ChaCha20_16x: .align 32 .Loop_outer16x: - vpbroadcastd 0(%r10),%zmm0 - vpbroadcastd 4(%r10),%zmm1 - vpbroadcastd 8(%r10),%zmm2 - vpbroadcastd 12(%r10),%zmm3 + vpbroadcastd 0(%r9),%zmm0 + vpbroadcastd 4(%r9),%zmm1 + vpbroadcastd 8(%r9),%zmm2 + vpbroadcastd 12(%r9),%zmm3 vpaddd .Lsixteen(%rip),%zmm28,%zmm28 vmovdqa64 %zmm20,%zmm4 vmovdqa64 %zmm21,%zmm5 @@ -2865,7 +1917,7 @@ ChaCha20_16x: .align 32 .Ltail16x: - xorq %r10,%r10 + xorq %r9,%r9 subq %rsi,%rdi cmpq $64,%rdx jb .Less_than_64_16x @@ -2993,11 +2045,11 @@ ChaCha20_16x: andq $63,%rdx .Loop_tail16x: - movzbl (%rsi,%r10,1),%eax - movzbl (%rsp,%r10,1),%ecx - leaq 1(%r10),%r10 + movzbl (%rsi,%r9,1),%eax + movzbl (%rsp,%r9,1),%ecx + leaq 1(%r9),%r9 xorl %ecx,%eax - movb %al,-1(%rdi,%r10,1) + movb %al,-1(%rdi,%r9,1) decq %rdx jnz .Loop_tail16x @@ -3006,25 +2058,172 @@ ChaCha20_16x: .Ldone16x: vzeroall - leaq (%r9),%rsp -.cfi_def_cfa_register %rsp + leaq -8(%r10),%rsp + .L16x_epilogue: - .byte 0xf3,0xc3 -.cfi_endproc -.size ChaCha20_16x,.-ChaCha20_16x -.type ChaCha20_8xvl,@function + ret +ENDPROC(chacha20_avx512) + .align 32 -ChaCha20_8xvl: -.cfi_startproc -.LChaCha20_8xvl: - movq %rsp,%r9 -.cfi_def_cfa_register %r9 +ENTRY(chacha20_avx512vl) + cmpq $0,%rdx + je .Lavx512vl_epilogue + + leaq 8(%rsp),%r10 + + cmpq $128,%rdx + ja .Lchacha20_8xvl + + subq $64+8,%rsp + andq $-64,%rsp + vbroadcasti128 .Lsigma(%rip),%ymm0 + vbroadcasti128 (%rcx),%ymm1 + vbroadcasti128 16(%rcx),%ymm2 + vbroadcasti128 (%r8),%ymm3 + + vmovdqa32 %ymm0,%ymm16 + vmovdqa32 %ymm1,%ymm17 + vmovdqa32 %ymm2,%ymm18 + vpaddd .Lzeroz(%rip),%ymm3,%ymm3 + vmovdqa32 .Ltwoy(%rip),%ymm20 + movq $10,%r8 + vmovdqa32 %ymm3,%ymm19 + jmp .Loop_avx512vl + +.align 16 +.Loop_outer_avx512vl: + vmovdqa32 %ymm18,%ymm2 + vpaddd %ymm20,%ymm19,%ymm3 + movq $10,%r8 + vmovdqa32 %ymm3,%ymm19 + jmp .Loop_avx512vl + +.align 32 +.Loop_avx512vl: + vpaddd %ymm1,%ymm0,%ymm0 + vpxor %ymm0,%ymm3,%ymm3 + vprold $16,%ymm3,%ymm3 + vpaddd %ymm3,%ymm2,%ymm2 + vpxor %ymm2,%ymm1,%ymm1 + vprold $12,%ymm1,%ymm1 + vpaddd %ymm1,%ymm0,%ymm0 + vpxor %ymm0,%ymm3,%ymm3 + vprold $8,%ymm3,%ymm3 + vpaddd %ymm3,%ymm2,%ymm2 + vpxor %ymm2,%ymm1,%ymm1 + vprold $7,%ymm1,%ymm1 + vpshufd $78,%ymm2,%ymm2 + vpshufd $57,%ymm1,%ymm1 + vpshufd $147,%ymm3,%ymm3 + vpaddd %ymm1,%ymm0,%ymm0 + vpxor %ymm0,%ymm3,%ymm3 + vprold $16,%ymm3,%ymm3 + vpaddd %ymm3,%ymm2,%ymm2 + vpxor %ymm2,%ymm1,%ymm1 + vprold $12,%ymm1,%ymm1 + vpaddd %ymm1,%ymm0,%ymm0 + vpxor %ymm0,%ymm3,%ymm3 + vprold $8,%ymm3,%ymm3 + vpaddd %ymm3,%ymm2,%ymm2 + vpxor %ymm2,%ymm1,%ymm1 + vprold $7,%ymm1,%ymm1 + vpshufd $78,%ymm2,%ymm2 + vpshufd $147,%ymm1,%ymm1 + vpshufd $57,%ymm3,%ymm3 + decq %r8 + jnz .Loop_avx512vl + vpaddd %ymm16,%ymm0,%ymm0 + vpaddd %ymm17,%ymm1,%ymm1 + vpaddd %ymm18,%ymm2,%ymm2 + vpaddd %ymm19,%ymm3,%ymm3 + + subq $64,%rdx + jb .Ltail64_avx512vl + + vpxor 0(%rsi),%xmm0,%xmm4 + vpxor 16(%rsi),%xmm1,%xmm5 + vpxor 32(%rsi),%xmm2,%xmm6 + vpxor 48(%rsi),%xmm3,%xmm7 + leaq 64(%rsi),%rsi + + vmovdqu %xmm4,0(%rdi) + vmovdqu %xmm5,16(%rdi) + vmovdqu %xmm6,32(%rdi) + vmovdqu %xmm7,48(%rdi) + leaq 64(%rdi),%rdi + + jz .Ldone_avx512vl + + vextracti128 $1,%ymm0,%xmm4 + vextracti128 $1,%ymm1,%xmm5 + vextracti128 $1,%ymm2,%xmm6 + vextracti128 $1,%ymm3,%xmm7 + + subq $64,%rdx + jb .Ltail_avx512vl + + vpxor 0(%rsi),%xmm4,%xmm4 + vpxor 
16(%rsi),%xmm5,%xmm5 + vpxor 32(%rsi),%xmm6,%xmm6 + vpxor 48(%rsi),%xmm7,%xmm7 + leaq 64(%rsi),%rsi + + vmovdqu %xmm4,0(%rdi) + vmovdqu %xmm5,16(%rdi) + vmovdqu %xmm6,32(%rdi) + vmovdqu %xmm7,48(%rdi) + leaq 64(%rdi),%rdi + + vmovdqa32 %ymm16,%ymm0 + vmovdqa32 %ymm17,%ymm1 + jnz .Loop_outer_avx512vl + + jmp .Ldone_avx512vl + +.align 16 +.Ltail64_avx512vl: + vmovdqa %xmm0,0(%rsp) + vmovdqa %xmm1,16(%rsp) + vmovdqa %xmm2,32(%rsp) + vmovdqa %xmm3,48(%rsp) + addq $64,%rdx + jmp .Loop_tail_avx512vl + +.align 16 +.Ltail_avx512vl: + vmovdqa %xmm4,0(%rsp) + vmovdqa %xmm5,16(%rsp) + vmovdqa %xmm6,32(%rsp) + vmovdqa %xmm7,48(%rsp) + addq $64,%rdx + +.Loop_tail_avx512vl: + movzbl (%rsi,%r8,1),%eax + movzbl (%rsp,%r8,1),%ecx + leaq 1(%r8),%r8 + xorl %ecx,%eax + movb %al,-1(%rdi,%r8,1) + decq %rdx + jnz .Loop_tail_avx512vl + + vmovdqa32 %ymm16,0(%rsp) + vmovdqa32 %ymm16,32(%rsp) + +.Ldone_avx512vl: + vzeroall + leaq -8(%r10),%rsp +.Lavx512vl_epilogue: + ret + +.align 32 +.Lchacha20_8xvl: + leaq 8(%rsp),%r10 subq $64+8,%rsp andq $-64,%rsp vzeroupper - leaq .Lsigma(%rip),%r10 - vbroadcasti128 (%r10),%ymm3 + leaq .Lsigma(%rip),%r9 + vbroadcasti128 (%r9),%ymm3 vbroadcasti128 (%rcx),%ymm7 vbroadcasti128 16(%rcx),%ymm11 vbroadcasti128 (%r8),%ymm15 @@ -3073,8 +2272,8 @@ ChaCha20_8xvl: .Loop_outer8xvl: - vpbroadcastd 8(%r10),%ymm2 - vpbroadcastd 12(%r10),%ymm3 + vpbroadcastd 8(%r9),%ymm2 + vpbroadcastd 12(%r9),%ymm3 vpaddd .Leight(%rip),%ymm28,%ymm28 vmovdqa64 %ymm20,%ymm4 vmovdqa64 %ymm21,%ymm5 @@ -3314,8 +2513,8 @@ ChaCha20_8xvl: vmovdqu %ymm12,96(%rdi) leaq (%rdi,%rax,1),%rdi - vpbroadcastd 0(%r10),%ymm0 - vpbroadcastd 4(%r10),%ymm1 + vpbroadcastd 0(%r9),%ymm0 + vpbroadcastd 4(%r9),%ymm1 subq $512,%rdx jnz .Loop_outer8xvl @@ -3325,7 +2524,7 @@ ChaCha20_8xvl: .align 32 .Ltail8xvl: vmovdqa64 %ymm19,%ymm8 - xorq %r10,%r10 + xorq %r9,%r9 subq %rsi,%rdi cmpq $64,%rdx jb .Less_than_64_8xvl @@ -3411,11 +2610,11 @@ ChaCha20_8xvl: andq $63,%rdx .Loop_tail8xvl: - movzbl (%rsi,%r10,1),%eax - movzbl (%rsp,%r10,1),%ecx - leaq 1(%r10),%r10 + movzbl (%rsi,%r9,1),%eax + movzbl (%rsp,%r9,1),%ecx + leaq 1(%r9),%r9 xorl %ecx,%eax - movb %al,-1(%rdi,%r10,1) + movb %al,-1(%rdi,%r9,1) decq %rdx jnz .Loop_tail8xvl @@ -3425,9 +2624,9 @@ ChaCha20_8xvl: .Ldone8xvl: vzeroall - leaq (%r9),%rsp -.cfi_def_cfa_register %rsp + leaq -8(%r10),%rsp .L8xvl_epilogue: - .byte 0xf3,0xc3 -.cfi_endproc -.size ChaCha20_8xvl,.-ChaCha20_8xvl + ret +ENDPROC(chacha20_avx512vl) + +#endif /* CONFIG_AS_AVX512 */ diff --git a/lib/zinc/chacha20/chacha20.c b/lib/zinc/chacha20/chacha20.c index 03209c15d1ca..22a21431c221 100644 --- a/lib/zinc/chacha20/chacha20.c +++ b/lib/zinc/chacha20/chacha20.c @@ -16,6 +16,9 @@ #include #include // For crypto_xor_cpy. +#if defined(CONFIG_ZINC_ARCH_X86_64) +#include "chacha20-x86_64-glue.c" +#else static bool *const chacha20_nobs[] __initconst = { }; static void __init chacha20_fpu_init(void) { @@ -33,6 +36,7 @@ static inline bool hchacha20_arch(u32 derived_key[CHACHA20_KEY_WORDS], { return false; } +#endif #define QUARTER_ROUND(x, a, b, c, d) ( \ x[a] += x[b], \ From patchwork Sat Oct 6 02:56:48 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Jason A. 
Donenfeld"
X-Patchwork-Id: 148307
From: "Jason A. Donenfeld"
To: linux-kernel@vger.kernel.org, netdev@vger.kernel.org, davem@davemloft.net, gregkh@linuxfoundation.org
Cc: "Jason A. Donenfeld", Andy Polyakov, Russell King, linux-arm-kernel@lists.infradead.org, Samuel Neves, Jean-Philippe Aumasson, Andy Lutomirski, Andrew Morton, Linus Torvalds, kernel-hardening@lists.openwall.com, linux-crypto@vger.kernel.org
Subject: [PATCH net-next v7 07/28] zinc: import Andy Polyakov's ChaCha20 ARM and ARM64 implementations
Date: Sat, 6 Oct 2018 04:56:48 +0200
Message-Id: <20181006025709.4019-8-Jason@zx2c4.com>
In-Reply-To: <20181006025709.4019-1-Jason@zx2c4.com>
References: <20181006025709.4019-1-Jason@zx2c4.com>

These NEON and non-NEON implementations come from Andy Polyakov's CRYPTOGAMS work, and are included here in raw form, without modification, so that the subsequent commits that fix them up for the kernel make it easy to see how the code has changed. While this is CRYPTOGAMS code, the originating code happens to be the same as OpenSSL's commit 87cc649f30aaf69b351701875b9dac07c29ce8a2.

Signed-off-by: Jason A.
Donenfeld Based-on-code-from: Andy Polyakov Cc: Andy Polyakov Cc: Russell King Cc: linux-arm-kernel@lists.infradead.org Cc: Samuel Neves Cc: Jean-Philippe Aumasson Cc: Andy Lutomirski Cc: Greg KH Cc: Andrew Morton Cc: Linus Torvalds Cc: kernel-hardening@lists.openwall.com Cc: linux-crypto@vger.kernel.org --- lib/zinc/chacha20/chacha20-arm-cryptogams.S | 1440 ++++++++++++ lib/zinc/chacha20/chacha20-arm64-cryptogams.S | 1973 +++++++++++++++++ 2 files changed, 3413 insertions(+) create mode 100644 lib/zinc/chacha20/chacha20-arm-cryptogams.S create mode 100644 lib/zinc/chacha20/chacha20-arm64-cryptogams.S -- 2.19.0 diff --git a/lib/zinc/chacha20/chacha20-arm-cryptogams.S b/lib/zinc/chacha20/chacha20-arm-cryptogams.S new file mode 100644 index 000000000000..05a3a9e6e93f --- /dev/null +++ b/lib/zinc/chacha20/chacha20-arm-cryptogams.S @@ -0,0 +1,1440 @@ +/* SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause */ +/* + * Copyright (C) 2006-2017 CRYPTOGAMS by . All Rights Reserved. + */ + +#include "arm_arch.h" + +.text +#if defined(__thumb2__) || defined(__clang__) +.syntax unified +#endif +#if defined(__thumb2__) +.thumb +#else +.code 32 +#endif + +#if defined(__thumb2__) || defined(__clang__) +#define ldrhsb ldrbhs +#endif + +.align 5 +.Lsigma: +.long 0x61707865,0x3320646e,0x79622d32,0x6b206574 @ endian-neutral +.Lone: +.long 1,0,0,0 +.Lrot8: +.long 0x02010003,0x06050407 +#if __ARM_MAX_ARCH__>=7 +.LOPENSSL_armcap: +.word OPENSSL_armcap_P-.LChaCha20_ctr32 +#else +.word -1 +#endif + +.globl ChaCha20_ctr32 +.type ChaCha20_ctr32,%function +.align 5 +ChaCha20_ctr32: +.LChaCha20_ctr32: + ldr r12,[sp,#0] @ pull pointer to counter and nonce + stmdb sp!,{r0-r2,r4-r11,lr} +#if __ARM_ARCH__<7 && !defined(__thumb2__) + sub r14,pc,#16 @ ChaCha20_ctr32 +#else + adr r14,.LChaCha20_ctr32 +#endif + cmp r2,#0 @ len==0? 
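For readers skimming the diff: every add/eor/ror cluster in the rounds below is a scheduling of the standard ChaCha20 quarter round, the same primitive the generic chacha20.c spells out as QUARTER_ROUND, with left-rotates by n expressed as right-rotates by 32-n (hence the #16/#20/#24/#25 constants in the scalar loops; the integer-only ARM path additionally keeps some rows "twisted" between rounds). A minimal C sketch, with hypothetical helper names:

#include <stdint.h>

static inline uint32_t rol32(uint32_t v, int n)
{
        return (v << n) | (v >> (32 - n));
}

/* State layout the prologue above assembles: x[0..3] = sigma, which is
 * the ASCII string "expand 32-byte k", x[4..11] = key, x[12] = block
 * counter, x[13..15] = nonce. One quarter round over words a,b,c,d: */
static void chacha20_quarter_round(uint32_t x[16], int a, int b, int c, int d)
{
        x[a] += x[b]; x[d] = rol32(x[d] ^ x[a], 16);
        x[c] += x[d]; x[b] = rol32(x[b] ^ x[c], 12);
        x[a] += x[b]; x[d] = rol32(x[d] ^ x[a], 8);
        x[c] += x[d]; x[b] = rol32(x[b] ^ x[c], 7);
}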
+#ifdef __thumb2__ + itt eq +#endif + addeq sp,sp,#4*3 + beq .Lno_data +#if __ARM_MAX_ARCH__>=7 + cmp r2,#192 @ test len + bls .Lshort + ldr r4,[r14,#-24] + ldr r4,[r14,r4] +# ifdef __APPLE__ + ldr r4,[r4] +# endif + tst r4,#ARMV7_NEON + bne .LChaCha20_neon +.Lshort: +#endif + ldmia r12,{r4-r7} @ load counter and nonce + sub sp,sp,#4*(16) @ off-load area + sub r14,r14,#64 @ .Lsigma + stmdb sp!,{r4-r7} @ copy counter and nonce + ldmia r3,{r4-r11} @ load key + ldmia r14,{r0-r3} @ load sigma + stmdb sp!,{r4-r11} @ copy key + stmdb sp!,{r0-r3} @ copy sigma + str r10,[sp,#4*(16+10)] @ off-load "rx" + str r11,[sp,#4*(16+11)] @ off-load "rx" + b .Loop_outer_enter + +.align 4 +.Loop_outer: + ldmia sp,{r0-r9} @ load key material + str r11,[sp,#4*(32+2)] @ save len + str r12, [sp,#4*(32+1)] @ save inp + str r14, [sp,#4*(32+0)] @ save out +.Loop_outer_enter: + ldr r11, [sp,#4*(15)] + mov r4,r4,ror#19 @ twist b[0..3] + ldr r12,[sp,#4*(12)] @ modulo-scheduled load + mov r5,r5,ror#19 + ldr r10, [sp,#4*(13)] + mov r6,r6,ror#19 + ldr r14,[sp,#4*(14)] + mov r7,r7,ror#19 + mov r11,r11,ror#8 @ twist d[0..3] + mov r12,r12,ror#8 + mov r10,r10,ror#8 + mov r14,r14,ror#8 + str r11, [sp,#4*(16+15)] + mov r11,#10 + b .Loop + +.align 4 +.Loop: + subs r11,r11,#1 + add r0,r0,r4,ror#13 + add r1,r1,r5,ror#13 + eor r12,r0,r12,ror#24 + eor r10,r1,r10,ror#24 + add r8,r8,r12,ror#16 + add r9,r9,r10,ror#16 + eor r4,r8,r4,ror#13 + eor r5,r9,r5,ror#13 + add r0,r0,r4,ror#20 + add r1,r1,r5,ror#20 + eor r12,r0,r12,ror#16 + eor r10,r1,r10,ror#16 + add r8,r8,r12,ror#24 + str r10,[sp,#4*(16+13)] + add r9,r9,r10,ror#24 + ldr r10,[sp,#4*(16+15)] + str r8,[sp,#4*(16+8)] + eor r4,r4,r8,ror#12 + str r9,[sp,#4*(16+9)] + eor r5,r5,r9,ror#12 + ldr r8,[sp,#4*(16+10)] + add r2,r2,r6,ror#13 + ldr r9,[sp,#4*(16+11)] + add r3,r3,r7,ror#13 + eor r14,r2,r14,ror#24 + eor r10,r3,r10,ror#24 + add r8,r8,r14,ror#16 + add r9,r9,r10,ror#16 + eor r6,r8,r6,ror#13 + eor r7,r9,r7,ror#13 + add r2,r2,r6,ror#20 + add r3,r3,r7,ror#20 + eor r14,r2,r14,ror#16 + eor r10,r3,r10,ror#16 + add r8,r8,r14,ror#24 + add r9,r9,r10,ror#24 + eor r6,r6,r8,ror#12 + eor r7,r7,r9,ror#12 + add r0,r0,r5,ror#13 + add r1,r1,r6,ror#13 + eor r10,r0,r10,ror#24 + eor r12,r1,r12,ror#24 + add r8,r8,r10,ror#16 + add r9,r9,r12,ror#16 + eor r5,r8,r5,ror#13 + eor r6,r9,r6,ror#13 + add r0,r0,r5,ror#20 + add r1,r1,r6,ror#20 + eor r10,r0,r10,ror#16 + eor r12,r1,r12,ror#16 + str r10,[sp,#4*(16+15)] + add r8,r8,r10,ror#24 + ldr r10,[sp,#4*(16+13)] + add r9,r9,r12,ror#24 + str r8,[sp,#4*(16+10)] + eor r5,r5,r8,ror#12 + str r9,[sp,#4*(16+11)] + eor r6,r6,r9,ror#12 + ldr r8,[sp,#4*(16+8)] + add r2,r2,r7,ror#13 + ldr r9,[sp,#4*(16+9)] + add r3,r3,r4,ror#13 + eor r10,r2,r10,ror#24 + eor r14,r3,r14,ror#24 + add r8,r8,r10,ror#16 + add r9,r9,r14,ror#16 + eor r7,r8,r7,ror#13 + eor r4,r9,r4,ror#13 + add r2,r2,r7,ror#20 + add r3,r3,r4,ror#20 + eor r10,r2,r10,ror#16 + eor r14,r3,r14,ror#16 + add r8,r8,r10,ror#24 + add r9,r9,r14,ror#24 + eor r7,r7,r8,ror#12 + eor r4,r4,r9,ror#12 + bne .Loop + + ldr r11,[sp,#4*(32+2)] @ load len + + str r8, [sp,#4*(16+8)] @ modulo-scheduled store + str r9, [sp,#4*(16+9)] + str r12,[sp,#4*(16+12)] + str r10, [sp,#4*(16+13)] + str r14,[sp,#4*(16+14)] + + @ at this point we have first half of 512-bit result in + @ rx and second half at sp+4*(16+8) + + cmp r11,#64 @ done yet? +#ifdef __thumb2__ + itete lo +#endif + addlo r12,sp,#4*(0) @ shortcut or ... + ldrhs r12,[sp,#4*(32+1)] @ ... load inp + addlo r14,sp,#4*(0) @ shortcut or ... + ldrhs r14,[sp,#4*(32+0)] @ ... 
load out + + ldr r8,[sp,#4*(0)] @ load key material + ldr r9,[sp,#4*(1)] + +#if __ARM_ARCH__>=6 || !defined(__ARMEB__) +# if __ARM_ARCH__<7 + orr r10,r12,r14 + tst r10,#3 @ are input and output aligned? + ldr r10,[sp,#4*(2)] + bne .Lunaligned + cmp r11,#64 @ restore flags +# else + ldr r10,[sp,#4*(2)] +# endif + ldr r11,[sp,#4*(3)] + + add r0,r0,r8 @ accumulate key material + add r1,r1,r9 +# ifdef __thumb2__ + itt hs +# endif + ldrhs r8,[r12],#16 @ load input + ldrhs r9,[r12,#-12] + + add r2,r2,r10 + add r3,r3,r11 +# ifdef __thumb2__ + itt hs +# endif + ldrhs r10,[r12,#-8] + ldrhs r11,[r12,#-4] +# if __ARM_ARCH__>=6 && defined(__ARMEB__) + rev r0,r0 + rev r1,r1 + rev r2,r2 + rev r3,r3 +# endif +# ifdef __thumb2__ + itt hs +# endif + eorhs r0,r0,r8 @ xor with input + eorhs r1,r1,r9 + add r8,sp,#4*(4) + str r0,[r14],#16 @ store output +# ifdef __thumb2__ + itt hs +# endif + eorhs r2,r2,r10 + eorhs r3,r3,r11 + ldmia r8,{r8-r11} @ load key material + str r1,[r14,#-12] + str r2,[r14,#-8] + str r3,[r14,#-4] + + add r4,r8,r4,ror#13 @ accumulate key material + add r5,r9,r5,ror#13 +# ifdef __thumb2__ + itt hs +# endif + ldrhs r8,[r12],#16 @ load input + ldrhs r9,[r12,#-12] + add r6,r10,r6,ror#13 + add r7,r11,r7,ror#13 +# ifdef __thumb2__ + itt hs +# endif + ldrhs r10,[r12,#-8] + ldrhs r11,[r12,#-4] +# if __ARM_ARCH__>=6 && defined(__ARMEB__) + rev r4,r4 + rev r5,r5 + rev r6,r6 + rev r7,r7 +# endif +# ifdef __thumb2__ + itt hs +# endif + eorhs r4,r4,r8 + eorhs r5,r5,r9 + add r8,sp,#4*(8) + str r4,[r14],#16 @ store output +# ifdef __thumb2__ + itt hs +# endif + eorhs r6,r6,r10 + eorhs r7,r7,r11 + str r5,[r14,#-12] + ldmia r8,{r8-r11} @ load key material + str r6,[r14,#-8] + add r0,sp,#4*(16+8) + str r7,[r14,#-4] + + ldmia r0,{r0-r7} @ load second half + + add r0,r0,r8 @ accumulate key material + add r1,r1,r9 +# ifdef __thumb2__ + itt hs +# endif + ldrhs r8,[r12],#16 @ load input + ldrhs r9,[r12,#-12] +# ifdef __thumb2__ + itt hi +# endif + strhi r10,[sp,#4*(16+10)] @ copy "rx" while at it + strhi r11,[sp,#4*(16+11)] @ copy "rx" while at it + add r2,r2,r10 + add r3,r3,r11 +# ifdef __thumb2__ + itt hs +# endif + ldrhs r10,[r12,#-8] + ldrhs r11,[r12,#-4] +# if __ARM_ARCH__>=6 && defined(__ARMEB__) + rev r0,r0 + rev r1,r1 + rev r2,r2 + rev r3,r3 +# endif +# ifdef __thumb2__ + itt hs +# endif + eorhs r0,r0,r8 + eorhs r1,r1,r9 + add r8,sp,#4*(12) + str r0,[r14],#16 @ store output +# ifdef __thumb2__ + itt hs +# endif + eorhs r2,r2,r10 + eorhs r3,r3,r11 + str r1,[r14,#-12] + ldmia r8,{r8-r11} @ load key material + str r2,[r14,#-8] + str r3,[r14,#-4] + + add r4,r8,r4,ror#24 @ accumulate key material + add r5,r9,r5,ror#24 +# ifdef __thumb2__ + itt hi +# endif + addhi r8,r8,#1 @ next counter value + strhi r8,[sp,#4*(12)] @ save next counter value +# ifdef __thumb2__ + itt hs +# endif + ldrhs r8,[r12],#16 @ load input + ldrhs r9,[r12,#-12] + add r6,r10,r6,ror#24 + add r7,r11,r7,ror#24 +# ifdef __thumb2__ + itt hs +# endif + ldrhs r10,[r12,#-8] + ldrhs r11,[r12,#-4] +# if __ARM_ARCH__>=6 && defined(__ARMEB__) + rev r4,r4 + rev r5,r5 + rev r6,r6 + rev r7,r7 +# endif +# ifdef __thumb2__ + itt hs +# endif + eorhs r4,r4,r8 + eorhs r5,r5,r9 +# ifdef __thumb2__ + it ne +# endif + ldrne r8,[sp,#4*(32+2)] @ re-load len +# ifdef __thumb2__ + itt hs +# endif + eorhs r6,r6,r10 + eorhs r7,r7,r11 + str r4,[r14],#16 @ store output + str r5,[r14,#-12] +# ifdef __thumb2__ + it hs +# endif + subhs r11,r8,#64 @ len-=64 + str r6,[r14,#-8] + str r7,[r14,#-4] + bhi .Loop_outer + + beq .Ldone +# if __ARM_ARCH__<7 + b .Ltail + +.align 
4 +.Lunaligned: @ unaligned endian-neutral path + cmp r11,#64 @ restore flags +# endif +#endif +#if __ARM_ARCH__<7 + ldr r11,[sp,#4*(3)] + add r0,r8,r0 @ accumulate key material + add r1,r9,r1 + add r2,r10,r2 +# ifdef __thumb2__ + itete lo +# endif + eorlo r8,r8,r8 @ zero or ... + ldrhsb r8,[r12],#16 @ ... load input + eorlo r9,r9,r9 + ldrhsb r9,[r12,#-12] + + add r3,r11,r3 +# ifdef __thumb2__ + itete lo +# endif + eorlo r10,r10,r10 + ldrhsb r10,[r12,#-8] + eorlo r11,r11,r11 + ldrhsb r11,[r12,#-4] + + eor r0,r8,r0 @ xor with input (or zero) + eor r1,r9,r1 +# ifdef __thumb2__ + itt hs +# endif + ldrhsb r8,[r12,#-15] @ load more input + ldrhsb r9,[r12,#-11] + eor r2,r10,r2 + strb r0,[r14],#16 @ store output + eor r3,r11,r3 +# ifdef __thumb2__ + itt hs +# endif + ldrhsb r10,[r12,#-7] + ldrhsb r11,[r12,#-3] + strb r1,[r14,#-12] + eor r0,r8,r0,lsr#8 + strb r2,[r14,#-8] + eor r1,r9,r1,lsr#8 +# ifdef __thumb2__ + itt hs +# endif + ldrhsb r8,[r12,#-14] @ load more input + ldrhsb r9,[r12,#-10] + strb r3,[r14,#-4] + eor r2,r10,r2,lsr#8 + strb r0,[r14,#-15] + eor r3,r11,r3,lsr#8 +# ifdef __thumb2__ + itt hs +# endif + ldrhsb r10,[r12,#-6] + ldrhsb r11,[r12,#-2] + strb r1,[r14,#-11] + eor r0,r8,r0,lsr#8 + strb r2,[r14,#-7] + eor r1,r9,r1,lsr#8 +# ifdef __thumb2__ + itt hs +# endif + ldrhsb r8,[r12,#-13] @ load more input + ldrhsb r9,[r12,#-9] + strb r3,[r14,#-3] + eor r2,r10,r2,lsr#8 + strb r0,[r14,#-14] + eor r3,r11,r3,lsr#8 +# ifdef __thumb2__ + itt hs +# endif + ldrhsb r10,[r12,#-5] + ldrhsb r11,[r12,#-1] + strb r1,[r14,#-10] + strb r2,[r14,#-6] + eor r0,r8,r0,lsr#8 + strb r3,[r14,#-2] + eor r1,r9,r1,lsr#8 + strb r0,[r14,#-13] + eor r2,r10,r2,lsr#8 + strb r1,[r14,#-9] + eor r3,r11,r3,lsr#8 + strb r2,[r14,#-5] + strb r3,[r14,#-1] + add r8,sp,#4*(4+0) + ldmia r8,{r8-r11} @ load key material + add r0,sp,#4*(16+8) + add r4,r8,r4,ror#13 @ accumulate key material + add r5,r9,r5,ror#13 + add r6,r10,r6,ror#13 +# ifdef __thumb2__ + itete lo +# endif + eorlo r8,r8,r8 @ zero or ... + ldrhsb r8,[r12],#16 @ ... 
load input + eorlo r9,r9,r9 + ldrhsb r9,[r12,#-12] + + add r7,r11,r7,ror#13 +# ifdef __thumb2__ + itete lo +# endif + eorlo r10,r10,r10 + ldrhsb r10,[r12,#-8] + eorlo r11,r11,r11 + ldrhsb r11,[r12,#-4] + + eor r4,r8,r4 @ xor with input (or zero) + eor r5,r9,r5 +# ifdef __thumb2__ + itt hs +# endif + ldrhsb r8,[r12,#-15] @ load more input + ldrhsb r9,[r12,#-11] + eor r6,r10,r6 + strb r4,[r14],#16 @ store output + eor r7,r11,r7 +# ifdef __thumb2__ + itt hs +# endif + ldrhsb r10,[r12,#-7] + ldrhsb r11,[r12,#-3] + strb r5,[r14,#-12] + eor r4,r8,r4,lsr#8 + strb r6,[r14,#-8] + eor r5,r9,r5,lsr#8 +# ifdef __thumb2__ + itt hs +# endif + ldrhsb r8,[r12,#-14] @ load more input + ldrhsb r9,[r12,#-10] + strb r7,[r14,#-4] + eor r6,r10,r6,lsr#8 + strb r4,[r14,#-15] + eor r7,r11,r7,lsr#8 +# ifdef __thumb2__ + itt hs +# endif + ldrhsb r10,[r12,#-6] + ldrhsb r11,[r12,#-2] + strb r5,[r14,#-11] + eor r4,r8,r4,lsr#8 + strb r6,[r14,#-7] + eor r5,r9,r5,lsr#8 +# ifdef __thumb2__ + itt hs +# endif + ldrhsb r8,[r12,#-13] @ load more input + ldrhsb r9,[r12,#-9] + strb r7,[r14,#-3] + eor r6,r10,r6,lsr#8 + strb r4,[r14,#-14] + eor r7,r11,r7,lsr#8 +# ifdef __thumb2__ + itt hs +# endif + ldrhsb r10,[r12,#-5] + ldrhsb r11,[r12,#-1] + strb r5,[r14,#-10] + strb r6,[r14,#-6] + eor r4,r8,r4,lsr#8 + strb r7,[r14,#-2] + eor r5,r9,r5,lsr#8 + strb r4,[r14,#-13] + eor r6,r10,r6,lsr#8 + strb r5,[r14,#-9] + eor r7,r11,r7,lsr#8 + strb r6,[r14,#-5] + strb r7,[r14,#-1] + add r8,sp,#4*(4+4) + ldmia r8,{r8-r11} @ load key material + ldmia r0,{r0-r7} @ load second half +# ifdef __thumb2__ + itt hi +# endif + strhi r10,[sp,#4*(16+10)] @ copy "rx" + strhi r11,[sp,#4*(16+11)] @ copy "rx" + add r0,r8,r0 @ accumulate key material + add r1,r9,r1 + add r2,r10,r2 +# ifdef __thumb2__ + itete lo +# endif + eorlo r8,r8,r8 @ zero or ... + ldrhsb r8,[r12],#16 @ ... 
load input + eorlo r9,r9,r9 + ldrhsb r9,[r12,#-12] + + add r3,r11,r3 +# ifdef __thumb2__ + itete lo +# endif + eorlo r10,r10,r10 + ldrhsb r10,[r12,#-8] + eorlo r11,r11,r11 + ldrhsb r11,[r12,#-4] + + eor r0,r8,r0 @ xor with input (or zero) + eor r1,r9,r1 +# ifdef __thumb2__ + itt hs +# endif + ldrhsb r8,[r12,#-15] @ load more input + ldrhsb r9,[r12,#-11] + eor r2,r10,r2 + strb r0,[r14],#16 @ store output + eor r3,r11,r3 +# ifdef __thumb2__ + itt hs +# endif + ldrhsb r10,[r12,#-7] + ldrhsb r11,[r12,#-3] + strb r1,[r14,#-12] + eor r0,r8,r0,lsr#8 + strb r2,[r14,#-8] + eor r1,r9,r1,lsr#8 +# ifdef __thumb2__ + itt hs +# endif + ldrhsb r8,[r12,#-14] @ load more input + ldrhsb r9,[r12,#-10] + strb r3,[r14,#-4] + eor r2,r10,r2,lsr#8 + strb r0,[r14,#-15] + eor r3,r11,r3,lsr#8 +# ifdef __thumb2__ + itt hs +# endif + ldrhsb r10,[r12,#-6] + ldrhsb r11,[r12,#-2] + strb r1,[r14,#-11] + eor r0,r8,r0,lsr#8 + strb r2,[r14,#-7] + eor r1,r9,r1,lsr#8 +# ifdef __thumb2__ + itt hs +# endif + ldrhsb r8,[r12,#-13] @ load more input + ldrhsb r9,[r12,#-9] + strb r3,[r14,#-3] + eor r2,r10,r2,lsr#8 + strb r0,[r14,#-14] + eor r3,r11,r3,lsr#8 +# ifdef __thumb2__ + itt hs +# endif + ldrhsb r10,[r12,#-5] + ldrhsb r11,[r12,#-1] + strb r1,[r14,#-10] + strb r2,[r14,#-6] + eor r0,r8,r0,lsr#8 + strb r3,[r14,#-2] + eor r1,r9,r1,lsr#8 + strb r0,[r14,#-13] + eor r2,r10,r2,lsr#8 + strb r1,[r14,#-9] + eor r3,r11,r3,lsr#8 + strb r2,[r14,#-5] + strb r3,[r14,#-1] + add r8,sp,#4*(4+8) + ldmia r8,{r8-r11} @ load key material + add r4,r8,r4,ror#24 @ accumulate key material +# ifdef __thumb2__ + itt hi +# endif + addhi r8,r8,#1 @ next counter value + strhi r8,[sp,#4*(12)] @ save next counter value + add r5,r9,r5,ror#24 + add r6,r10,r6,ror#24 +# ifdef __thumb2__ + itete lo +# endif + eorlo r8,r8,r8 @ zero or ... + ldrhsb r8,[r12],#16 @ ... 
load input + eorlo r9,r9,r9 + ldrhsb r9,[r12,#-12] + + add r7,r11,r7,ror#24 +# ifdef __thumb2__ + itete lo +# endif + eorlo r10,r10,r10 + ldrhsb r10,[r12,#-8] + eorlo r11,r11,r11 + ldrhsb r11,[r12,#-4] + + eor r4,r8,r4 @ xor with input (or zero) + eor r5,r9,r5 +# ifdef __thumb2__ + itt hs +# endif + ldrhsb r8,[r12,#-15] @ load more input + ldrhsb r9,[r12,#-11] + eor r6,r10,r6 + strb r4,[r14],#16 @ store output + eor r7,r11,r7 +# ifdef __thumb2__ + itt hs +# endif + ldrhsb r10,[r12,#-7] + ldrhsb r11,[r12,#-3] + strb r5,[r14,#-12] + eor r4,r8,r4,lsr#8 + strb r6,[r14,#-8] + eor r5,r9,r5,lsr#8 +# ifdef __thumb2__ + itt hs +# endif + ldrhsb r8,[r12,#-14] @ load more input + ldrhsb r9,[r12,#-10] + strb r7,[r14,#-4] + eor r6,r10,r6,lsr#8 + strb r4,[r14,#-15] + eor r7,r11,r7,lsr#8 +# ifdef __thumb2__ + itt hs +# endif + ldrhsb r10,[r12,#-6] + ldrhsb r11,[r12,#-2] + strb r5,[r14,#-11] + eor r4,r8,r4,lsr#8 + strb r6,[r14,#-7] + eor r5,r9,r5,lsr#8 +# ifdef __thumb2__ + itt hs +# endif + ldrhsb r8,[r12,#-13] @ load more input + ldrhsb r9,[r12,#-9] + strb r7,[r14,#-3] + eor r6,r10,r6,lsr#8 + strb r4,[r14,#-14] + eor r7,r11,r7,lsr#8 +# ifdef __thumb2__ + itt hs +# endif + ldrhsb r10,[r12,#-5] + ldrhsb r11,[r12,#-1] + strb r5,[r14,#-10] + strb r6,[r14,#-6] + eor r4,r8,r4,lsr#8 + strb r7,[r14,#-2] + eor r5,r9,r5,lsr#8 + strb r4,[r14,#-13] + eor r6,r10,r6,lsr#8 + strb r5,[r14,#-9] + eor r7,r11,r7,lsr#8 + strb r6,[r14,#-5] + strb r7,[r14,#-1] +# ifdef __thumb2__ + it ne +# endif + ldrne r8,[sp,#4*(32+2)] @ re-load len +# ifdef __thumb2__ + it hs +# endif + subhs r11,r8,#64 @ len-=64 + bhi .Loop_outer + + beq .Ldone +#endif + +.Ltail: + ldr r12,[sp,#4*(32+1)] @ load inp + add r9,sp,#4*(0) + ldr r14,[sp,#4*(32+0)] @ load out + +.Loop_tail: + ldrb r10,[r9],#1 @ read buffer on stack + ldrb r11,[r12],#1 @ read input + subs r8,r8,#1 + eor r11,r11,r10 + strb r11,[r14],#1 @ store output + bne .Loop_tail + +.Ldone: + add sp,sp,#4*(32+3) +.Lno_data: + ldmia sp!,{r4-r11,pc} +.size ChaCha20_ctr32,.-ChaCha20_ctr32 +#if __ARM_MAX_ARCH__>=7 +.arch armv7-a +.fpu neon + +.type ChaCha20_neon,%function +.align 5 +ChaCha20_neon: + ldr r12,[sp,#0] @ pull pointer to counter and nonce + stmdb sp!,{r0-r2,r4-r11,lr} +.LChaCha20_neon: + adr r14,.Lsigma + vstmdb sp!,{d8-d15} @ ABI spec says so + stmdb sp!,{r0-r3} + + vld1.32 {q1-q2},[r3] @ load key + ldmia r3,{r4-r11} @ load key + + sub sp,sp,#4*(16+16) + vld1.32 {q3},[r12] @ load counter and nonce + add r12,sp,#4*8 + ldmia r14,{r0-r3} @ load sigma + vld1.32 {q0},[r14]! @ load sigma + vld1.32 {q12},[r14]! 
@ one + @ vld1.32 {d30},[r14] @ rot8 + vst1.32 {q2-q3},[r12] @ copy 1/2key|counter|nonce + vst1.32 {q0-q1},[sp] @ copy sigma|1/2key + + str r10,[sp,#4*(16+10)] @ off-load "rx" + str r11,[sp,#4*(16+11)] @ off-load "rx" + vshl.i32 d26,d24,#1 @ two + vstr d24,[sp,#4*(16+0)] + vshl.i32 d28,d24,#2 @ four + vstr d26,[sp,#4*(16+2)] + vmov q4,q0 + vstr d28,[sp,#4*(16+4)] + vmov q8,q0 + @ vstr d30,[sp,#4*(16+6)] + vmov q5,q1 + vmov q9,q1 + b .Loop_neon_enter + +.align 4 +.Loop_neon_outer: + ldmia sp,{r0-r9} @ load key material + cmp r11,#64*2 @ if len<=64*2 + bls .Lbreak_neon @ switch to integer-only + @ vldr d30,[sp,#4*(16+6)] @ rot8 + vmov q4,q0 + str r11,[sp,#4*(32+2)] @ save len + vmov q8,q0 + str r12, [sp,#4*(32+1)] @ save inp + vmov q5,q1 + str r14, [sp,#4*(32+0)] @ save out + vmov q9,q1 +.Loop_neon_enter: + ldr r11, [sp,#4*(15)] + mov r4,r4,ror#19 @ twist b[0..3] + vadd.i32 q7,q3,q12 @ counter+1 + ldr r12,[sp,#4*(12)] @ modulo-scheduled load + mov r5,r5,ror#19 + vmov q6,q2 + ldr r10, [sp,#4*(13)] + mov r6,r6,ror#19 + vmov q10,q2 + ldr r14,[sp,#4*(14)] + mov r7,r7,ror#19 + vadd.i32 q11,q7,q12 @ counter+2 + add r12,r12,#3 @ counter+3 + mov r11,r11,ror#8 @ twist d[0..3] + mov r12,r12,ror#8 + mov r10,r10,ror#8 + mov r14,r14,ror#8 + str r11, [sp,#4*(16+15)] + mov r11,#10 + b .Loop_neon + +.align 4 +.Loop_neon: + subs r11,r11,#1 + vadd.i32 q0,q0,q1 + add r0,r0,r4,ror#13 + vadd.i32 q4,q4,q5 + add r1,r1,r5,ror#13 + vadd.i32 q8,q8,q9 + eor r12,r0,r12,ror#24 + veor q3,q3,q0 + eor r10,r1,r10,ror#24 + veor q7,q7,q4 + add r8,r8,r12,ror#16 + veor q11,q11,q8 + add r9,r9,r10,ror#16 + vrev32.16 q3,q3 + eor r4,r8,r4,ror#13 + vrev32.16 q7,q7 + eor r5,r9,r5,ror#13 + vrev32.16 q11,q11 + add r0,r0,r4,ror#20 + vadd.i32 q2,q2,q3 + add r1,r1,r5,ror#20 + vadd.i32 q6,q6,q7 + eor r12,r0,r12,ror#16 + vadd.i32 q10,q10,q11 + eor r10,r1,r10,ror#16 + veor q12,q1,q2 + add r8,r8,r12,ror#24 + veor q13,q5,q6 + str r10,[sp,#4*(16+13)] + veor q14,q9,q10 + add r9,r9,r10,ror#24 + vshr.u32 q1,q12,#20 + ldr r10,[sp,#4*(16+15)] + vshr.u32 q5,q13,#20 + str r8,[sp,#4*(16+8)] + vshr.u32 q9,q14,#20 + eor r4,r4,r8,ror#12 + vsli.32 q1,q12,#12 + str r9,[sp,#4*(16+9)] + vsli.32 q5,q13,#12 + eor r5,r5,r9,ror#12 + vsli.32 q9,q14,#12 + ldr r8,[sp,#4*(16+10)] + vadd.i32 q0,q0,q1 + add r2,r2,r6,ror#13 + vadd.i32 q4,q4,q5 + ldr r9,[sp,#4*(16+11)] + vadd.i32 q8,q8,q9 + add r3,r3,r7,ror#13 + veor q12,q3,q0 + eor r14,r2,r14,ror#24 + veor q13,q7,q4 + eor r10,r3,r10,ror#24 + veor q14,q11,q8 + add r8,r8,r14,ror#16 + vshr.u32 q3,q12,#24 + add r9,r9,r10,ror#16 + vshr.u32 q7,q13,#24 + eor r6,r8,r6,ror#13 + vshr.u32 q11,q14,#24 + eor r7,r9,r7,ror#13 + vsli.32 q3,q12,#8 + add r2,r2,r6,ror#20 + vsli.32 q7,q13,#8 + add r3,r3,r7,ror#20 + vsli.32 q11,q14,#8 + eor r14,r2,r14,ror#16 + vadd.i32 q2,q2,q3 + eor r10,r3,r10,ror#16 + vadd.i32 q6,q6,q7 + add r8,r8,r14,ror#24 + vadd.i32 q10,q10,q11 + add r9,r9,r10,ror#24 + veor q12,q1,q2 + eor r6,r6,r8,ror#12 + veor q13,q5,q6 + eor r7,r7,r9,ror#12 + veor q14,q9,q10 + vshr.u32 q1,q12,#25 + vshr.u32 q5,q13,#25 + vshr.u32 q9,q14,#25 + vsli.32 q1,q12,#7 + vsli.32 q5,q13,#7 + vsli.32 q9,q14,#7 + vext.8 q2,q2,q2,#8 + vext.8 q6,q6,q6,#8 + vext.8 q10,q10,q10,#8 + vext.8 q1,q1,q1,#4 + vext.8 q5,q5,q5,#4 + vext.8 q9,q9,q9,#4 + vext.8 q3,q3,q3,#12 + vext.8 q7,q7,q7,#12 + vext.8 q11,q11,q11,#12 + vadd.i32 q0,q0,q1 + add r0,r0,r5,ror#13 + vadd.i32 q4,q4,q5 + add r1,r1,r6,ror#13 + vadd.i32 q8,q8,q9 + eor r10,r0,r10,ror#24 + veor q3,q3,q0 + eor r12,r1,r12,ror#24 + veor q7,q7,q4 + add r8,r8,r10,ror#16 + veor q11,q11,q8 + add 
r9,r9,r12,ror#16 + vrev32.16 q3,q3 + eor r5,r8,r5,ror#13 + vrev32.16 q7,q7 + eor r6,r9,r6,ror#13 + vrev32.16 q11,q11 + add r0,r0,r5,ror#20 + vadd.i32 q2,q2,q3 + add r1,r1,r6,ror#20 + vadd.i32 q6,q6,q7 + eor r10,r0,r10,ror#16 + vadd.i32 q10,q10,q11 + eor r12,r1,r12,ror#16 + veor q12,q1,q2 + str r10,[sp,#4*(16+15)] + veor q13,q5,q6 + add r8,r8,r10,ror#24 + veor q14,q9,q10 + ldr r10,[sp,#4*(16+13)] + vshr.u32 q1,q12,#20 + add r9,r9,r12,ror#24 + vshr.u32 q5,q13,#20 + str r8,[sp,#4*(16+10)] + vshr.u32 q9,q14,#20 + eor r5,r5,r8,ror#12 + vsli.32 q1,q12,#12 + str r9,[sp,#4*(16+11)] + vsli.32 q5,q13,#12 + eor r6,r6,r9,ror#12 + vsli.32 q9,q14,#12 + ldr r8,[sp,#4*(16+8)] + vadd.i32 q0,q0,q1 + add r2,r2,r7,ror#13 + vadd.i32 q4,q4,q5 + ldr r9,[sp,#4*(16+9)] + vadd.i32 q8,q8,q9 + add r3,r3,r4,ror#13 + veor q12,q3,q0 + eor r10,r2,r10,ror#24 + veor q13,q7,q4 + eor r14,r3,r14,ror#24 + veor q14,q11,q8 + add r8,r8,r10,ror#16 + vshr.u32 q3,q12,#24 + add r9,r9,r14,ror#16 + vshr.u32 q7,q13,#24 + eor r7,r8,r7,ror#13 + vshr.u32 q11,q14,#24 + eor r4,r9,r4,ror#13 + vsli.32 q3,q12,#8 + add r2,r2,r7,ror#20 + vsli.32 q7,q13,#8 + add r3,r3,r4,ror#20 + vsli.32 q11,q14,#8 + eor r10,r2,r10,ror#16 + vadd.i32 q2,q2,q3 + eor r14,r3,r14,ror#16 + vadd.i32 q6,q6,q7 + add r8,r8,r10,ror#24 + vadd.i32 q10,q10,q11 + add r9,r9,r14,ror#24 + veor q12,q1,q2 + eor r7,r7,r8,ror#12 + veor q13,q5,q6 + eor r4,r4,r9,ror#12 + veor q14,q9,q10 + vshr.u32 q1,q12,#25 + vshr.u32 q5,q13,#25 + vshr.u32 q9,q14,#25 + vsli.32 q1,q12,#7 + vsli.32 q5,q13,#7 + vsli.32 q9,q14,#7 + vext.8 q2,q2,q2,#8 + vext.8 q6,q6,q6,#8 + vext.8 q10,q10,q10,#8 + vext.8 q1,q1,q1,#12 + vext.8 q5,q5,q5,#12 + vext.8 q9,q9,q9,#12 + vext.8 q3,q3,q3,#4 + vext.8 q7,q7,q7,#4 + vext.8 q11,q11,q11,#4 + bne .Loop_neon + + add r11,sp,#32 + vld1.32 {q12-q13},[sp] @ load key material + vld1.32 {q14-q15},[r11] + + ldr r11,[sp,#4*(32+2)] @ load len + + str r8, [sp,#4*(16+8)] @ modulo-scheduled store + str r9, [sp,#4*(16+9)] + str r12,[sp,#4*(16+12)] + str r10, [sp,#4*(16+13)] + str r14,[sp,#4*(16+14)] + + @ at this point we have first half of 512-bit result in + @ rx and second half at sp+4*(16+8) + + ldr r12,[sp,#4*(32+1)] @ load inp + ldr r14,[sp,#4*(32+0)] @ load out + + vadd.i32 q0,q0,q12 @ accumulate key material + vadd.i32 q4,q4,q12 + vadd.i32 q8,q8,q12 + vldr d24,[sp,#4*(16+0)] @ one + + vadd.i32 q1,q1,q13 + vadd.i32 q5,q5,q13 + vadd.i32 q9,q9,q13 + vldr d26,[sp,#4*(16+2)] @ two + + vadd.i32 q2,q2,q14 + vadd.i32 q6,q6,q14 + vadd.i32 q10,q10,q14 + vadd.i32 d14,d14,d24 @ counter+1 + vadd.i32 d22,d22,d26 @ counter+2 + + vadd.i32 q3,q3,q15 + vadd.i32 q7,q7,q15 + vadd.i32 q11,q11,q15 + + cmp r11,#64*4 + blo .Ltail_neon + + vld1.8 {q12-q13},[r12]! @ load input + mov r11,sp + vld1.8 {q14-q15},[r12]! + veor q0,q0,q12 @ xor with input + veor q1,q1,q13 + vld1.8 {q12-q13},[r12]! + veor q2,q2,q14 + veor q3,q3,q15 + vld1.8 {q14-q15},[r12]! + + veor q4,q4,q12 + vst1.8 {q0-q1},[r14]! @ store output + veor q5,q5,q13 + vld1.8 {q12-q13},[r12]! + veor q6,q6,q14 + vst1.8 {q2-q3},[r14]! + veor q7,q7,q15 + vld1.8 {q14-q15},[r12]! + + veor q8,q8,q12 + vld1.32 {q0-q1},[r11]! @ load for next iteration + veor d25,d25,d25 + vldr d24,[sp,#4*(16+4)] @ four + veor q9,q9,q13 + vld1.32 {q2-q3},[r11] + veor q10,q10,q14 + vst1.8 {q4-q5},[r14]! + veor q11,q11,q15 + vst1.8 {q6-q7},[r14]! + + vadd.i32 d6,d6,d24 @ next counter value + vldr d24,[sp,#4*(16+0)] @ one + + ldmia sp,{r8-r11} @ load key material + add r0,r0,r8 @ accumulate key material + ldr r8,[r12],#16 @ load input + vst1.8 {q8-q9},[r14]! 
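The interleaved scalar/NEON stores around this point all perform the same per-block finalization. A hedged C sketch with hypothetical names, assuming little-endian byte order (the __ARMEB__ rev paths above are what handle big-endian):

/* Feed-forward add of the input state, then XOR into the plaintext. */
static void chacha20_finish_block(uint8_t *out, const uint8_t *in,
                                  uint32_t x[16], const uint32_t state[16])
{
        int i;

        for (i = 0; i < 16; ++i)
                x[i] += state[i];       /* "accumulate key material" */
        for (i = 0; i < 64; ++i)
                out[i] = in[i] ^ ((const uint8_t *)x)[i]; /* "xor with input" */
}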
+ add r1,r1,r9 + ldr r9,[r12,#-12] + vst1.8 {q10-q11},[r14]! + add r2,r2,r10 + ldr r10,[r12,#-8] + add r3,r3,r11 + ldr r11,[r12,#-4] +# ifdef __ARMEB__ + rev r0,r0 + rev r1,r1 + rev r2,r2 + rev r3,r3 +# endif + eor r0,r0,r8 @ xor with input + add r8,sp,#4*(4) + eor r1,r1,r9 + str r0,[r14],#16 @ store output + eor r2,r2,r10 + str r1,[r14,#-12] + eor r3,r3,r11 + ldmia r8,{r8-r11} @ load key material + str r2,[r14,#-8] + str r3,[r14,#-4] + + add r4,r8,r4,ror#13 @ accumulate key material + ldr r8,[r12],#16 @ load input + add r5,r9,r5,ror#13 + ldr r9,[r12,#-12] + add r6,r10,r6,ror#13 + ldr r10,[r12,#-8] + add r7,r11,r7,ror#13 + ldr r11,[r12,#-4] +# ifdef __ARMEB__ + rev r4,r4 + rev r5,r5 + rev r6,r6 + rev r7,r7 +# endif + eor r4,r4,r8 + add r8,sp,#4*(8) + eor r5,r5,r9 + str r4,[r14],#16 @ store output + eor r6,r6,r10 + str r5,[r14,#-12] + eor r7,r7,r11 + ldmia r8,{r8-r11} @ load key material + str r6,[r14,#-8] + add r0,sp,#4*(16+8) + str r7,[r14,#-4] + + ldmia r0,{r0-r7} @ load second half + + add r0,r0,r8 @ accumulate key material + ldr r8,[r12],#16 @ load input + add r1,r1,r9 + ldr r9,[r12,#-12] +# ifdef __thumb2__ + it hi +# endif + strhi r10,[sp,#4*(16+10)] @ copy "rx" while at it + add r2,r2,r10 + ldr r10,[r12,#-8] +# ifdef __thumb2__ + it hi +# endif + strhi r11,[sp,#4*(16+11)] @ copy "rx" while at it + add r3,r3,r11 + ldr r11,[r12,#-4] +# ifdef __ARMEB__ + rev r0,r0 + rev r1,r1 + rev r2,r2 + rev r3,r3 +# endif + eor r0,r0,r8 + add r8,sp,#4*(12) + eor r1,r1,r9 + str r0,[r14],#16 @ store output + eor r2,r2,r10 + str r1,[r14,#-12] + eor r3,r3,r11 + ldmia r8,{r8-r11} @ load key material + str r2,[r14,#-8] + str r3,[r14,#-4] + + add r4,r8,r4,ror#24 @ accumulate key material + add r8,r8,#4 @ next counter value + add r5,r9,r5,ror#24 + str r8,[sp,#4*(12)] @ save next counter value + ldr r8,[r12],#16 @ load input + add r6,r10,r6,ror#24 + add r4,r4,#3 @ counter+3 + ldr r9,[r12,#-12] + add r7,r11,r7,ror#24 + ldr r10,[r12,#-8] + ldr r11,[r12,#-4] +# ifdef __ARMEB__ + rev r4,r4 + rev r5,r5 + rev r6,r6 + rev r7,r7 +# endif + eor r4,r4,r8 +# ifdef __thumb2__ + it hi +# endif + ldrhi r8,[sp,#4*(32+2)] @ re-load len + eor r5,r5,r9 + eor r6,r6,r10 + str r4,[r14],#16 @ store output + eor r7,r7,r11 + str r5,[r14,#-12] + sub r11,r8,#64*4 @ len-=64*4 + str r6,[r14,#-8] + str r7,[r14,#-4] + bhi .Loop_neon_outer + + b .Ldone_neon + +.align 4 +.Lbreak_neon: + @ harmonize NEON and integer-only stack frames: load data + @ from NEON frame, but save to integer-only one; distance + @ between the two is 4*(32+4+16-32)=4*(20). + + str r11, [sp,#4*(20+32+2)] @ save len + add r11,sp,#4*(32+4) + str r12, [sp,#4*(20+32+1)] @ save inp + str r14, [sp,#4*(20+32+0)] @ save out + + ldr r12,[sp,#4*(16+10)] + ldr r14,[sp,#4*(16+11)] + vldmia r11,{d8-d15} @ fulfill ABI requirement + str r12,[sp,#4*(20+16+10)] @ copy "rx" + str r14,[sp,#4*(20+16+11)] @ copy "rx" + + ldr r11, [sp,#4*(15)] + mov r4,r4,ror#19 @ twist b[0..3] + ldr r12,[sp,#4*(12)] @ modulo-scheduled load + mov r5,r5,ror#19 + ldr r10, [sp,#4*(13)] + mov r6,r6,ror#19 + ldr r14,[sp,#4*(14)] + mov r7,r7,ror#19 + mov r11,r11,ror#8 @ twist d[0..3] + mov r12,r12,ror#8 + mov r10,r10,ror#8 + mov r14,r14,ror#8 + str r11, [sp,#4*(20+16+15)] + add r11,sp,#4*(20) + vst1.32 {q0-q1},[r11]! 
@ copy key + add sp,sp,#4*(20) @ switch frame + vst1.32 {q2-q3},[r11] + mov r11,#10 + b .Loop @ go integer-only + +.align 4 +.Ltail_neon: + cmp r11,#64*3 + bhs .L192_or_more_neon + cmp r11,#64*2 + bhs .L128_or_more_neon + cmp r11,#64*1 + bhs .L64_or_more_neon + + add r8,sp,#4*(8) + vst1.8 {q0-q1},[sp] + add r10,sp,#4*(0) + vst1.8 {q2-q3},[r8] + b .Loop_tail_neon + +.align 4 +.L64_or_more_neon: + vld1.8 {q12-q13},[r12]! + vld1.8 {q14-q15},[r12]! + veor q0,q0,q12 + veor q1,q1,q13 + veor q2,q2,q14 + veor q3,q3,q15 + vst1.8 {q0-q1},[r14]! + vst1.8 {q2-q3},[r14]! + + beq .Ldone_neon + + add r8,sp,#4*(8) + vst1.8 {q4-q5},[sp] + add r10,sp,#4*(0) + vst1.8 {q6-q7},[r8] + sub r11,r11,#64*1 @ len-=64*1 + b .Loop_tail_neon + +.align 4 +.L128_or_more_neon: + vld1.8 {q12-q13},[r12]! + vld1.8 {q14-q15},[r12]! + veor q0,q0,q12 + veor q1,q1,q13 + vld1.8 {q12-q13},[r12]! + veor q2,q2,q14 + veor q3,q3,q15 + vld1.8 {q14-q15},[r12]! + + veor q4,q4,q12 + veor q5,q5,q13 + vst1.8 {q0-q1},[r14]! + veor q6,q6,q14 + vst1.8 {q2-q3},[r14]! + veor q7,q7,q15 + vst1.8 {q4-q5},[r14]! + vst1.8 {q6-q7},[r14]! + + beq .Ldone_neon + + add r8,sp,#4*(8) + vst1.8 {q8-q9},[sp] + add r10,sp,#4*(0) + vst1.8 {q10-q11},[r8] + sub r11,r11,#64*2 @ len-=64*2 + b .Loop_tail_neon + +.align 4 +.L192_or_more_neon: + vld1.8 {q12-q13},[r12]! + vld1.8 {q14-q15},[r12]! + veor q0,q0,q12 + veor q1,q1,q13 + vld1.8 {q12-q13},[r12]! + veor q2,q2,q14 + veor q3,q3,q15 + vld1.8 {q14-q15},[r12]! + + veor q4,q4,q12 + veor q5,q5,q13 + vld1.8 {q12-q13},[r12]! + veor q6,q6,q14 + vst1.8 {q0-q1},[r14]! + veor q7,q7,q15 + vld1.8 {q14-q15},[r12]! + + veor q8,q8,q12 + vst1.8 {q2-q3},[r14]! + veor q9,q9,q13 + vst1.8 {q4-q5},[r14]! + veor q10,q10,q14 + vst1.8 {q6-q7},[r14]! + veor q11,q11,q15 + vst1.8 {q8-q9},[r14]! + vst1.8 {q10-q11},[r14]! 
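The .L64_or_more_neon, .L128_or_more_neon, and .L192_or_more_neon branches above implement a simple tail policy: XOR whole 64-byte blocks for as long as they last, then spill the remaining keystream block to the stack and finish byte by byte in .Loop_tail_neon. A hedged C sketch with hypothetical names:

static void chacha20_tail(uint8_t *out, const uint8_t *in, size_t len,
                          const uint8_t *ks)      /* generated keystream */
{
        size_t i;

        while (len >= 64) {                       /* whole blocks */
                for (i = 0; i < 64; ++i)
                        out[i] = in[i] ^ ks[i];
                out += 64; in += 64; ks += 64; len -= 64;
        }
        for (i = 0; i < len; ++i)                 /* .Loop_tail_neon */
                out[i] = in[i] ^ ks[i];
}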
+ + beq .Ldone_neon + + ldmia sp,{r8-r11} @ load key material + add r0,r0,r8 @ accumulate key material + add r8,sp,#4*(4) + add r1,r1,r9 + add r2,r2,r10 + add r3,r3,r11 + ldmia r8,{r8-r11} @ load key material + + add r4,r8,r4,ror#13 @ accumulate key material + add r8,sp,#4*(8) + add r5,r9,r5,ror#13 + add r6,r10,r6,ror#13 + add r7,r11,r7,ror#13 + ldmia r8,{r8-r11} @ load key material +# ifdef __ARMEB__ + rev r0,r0 + rev r1,r1 + rev r2,r2 + rev r3,r3 + rev r4,r4 + rev r5,r5 + rev r6,r6 + rev r7,r7 +# endif + stmia sp,{r0-r7} + add r0,sp,#4*(16+8) + + ldmia r0,{r0-r7} @ load second half + + add r0,r0,r8 @ accumulate key material + add r8,sp,#4*(12) + add r1,r1,r9 + add r2,r2,r10 + add r3,r3,r11 + ldmia r8,{r8-r11} @ load key material + + add r4,r8,r4,ror#24 @ accumulate key material + add r8,sp,#4*(8) + add r5,r9,r5,ror#24 + add r4,r4,#3 @ counter+3 + add r6,r10,r6,ror#24 + add r7,r11,r7,ror#24 + ldr r11,[sp,#4*(32+2)] @ re-load len +# ifdef __ARMEB__ + rev r0,r0 + rev r1,r1 + rev r2,r2 + rev r3,r3 + rev r4,r4 + rev r5,r5 + rev r6,r6 + rev r7,r7 +# endif + stmia r8,{r0-r7} + add r10,sp,#4*(0) + sub r11,r11,#64*3 @ len-=64*3 + +.Loop_tail_neon: + ldrb r8,[r10],#1 @ read buffer on stack + ldrb r9,[r12],#1 @ read input + subs r11,r11,#1 + eor r8,r8,r9 + strb r8,[r14],#1 @ store output + bne .Loop_tail_neon + +.Ldone_neon: + add sp,sp,#4*(32+4) + vldmia sp,{d8-d15} + add sp,sp,#4*(16+3) + ldmia sp!,{r4-r11,pc} +.size ChaCha20_neon,.-ChaCha20_neon +.comm OPENSSL_armcap_P,4,4 +#endif diff --git a/lib/zinc/chacha20/chacha20-arm64-cryptogams.S b/lib/zinc/chacha20/chacha20-arm64-cryptogams.S new file mode 100644 index 000000000000..4d029bfdad3a --- /dev/null +++ b/lib/zinc/chacha20/chacha20-arm64-cryptogams.S @@ -0,0 +1,1973 @@ +/* SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause */ +/* + * Copyright (C) 2006-2017 CRYPTOGAMS by . All Rights Reserved. + */ + +#include "arm_arch.h" + +.text + + + +.align 5 +.Lsigma: +.quad 0x3320646e61707865,0x6b20657479622d32 // endian-neutral +.Lone: +.long 1,0,0,0 +.LOPENSSL_armcap_P: +#ifdef __ILP32__ +.long OPENSSL_armcap_P-. +#else +.quad OPENSSL_armcap_P-. +#endif +.byte 67,104,97,67,104,97,50,48,32,102,111,114,32,65,82,77,118,56,44,32,67,82,89,80,84,79,71,65,77,83,32,98,121,32,60,97,112,112,114,111,64,111,112,101,110,115,115,108,46,111,114,103,62,0 +.align 2 + +.globl ChaCha20_ctr32 +.type ChaCha20_ctr32,%function +.align 5 +ChaCha20_ctr32: + cbz x2,.Labort + adr x5,.LOPENSSL_armcap_P + cmp x2,#192 + b.lo .Lshort +#ifdef __ILP32__ + ldrsw x6,[x5] +#else + ldr x6,[x5] +#endif + ldr w17,[x6,x5] + tst w17,#ARMV7_NEON + b.ne ChaCha20_neon + +.Lshort: + stp x29,x30,[sp,#-96]! 
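In the AArch64 integer path that follows, each 64-bit x register carries two 32-bit state words; the "unpack key block" and "pack" sections convert between the two shapes so the rounds can run on w registers while loads, stores, and the feed-forward work on x registers. A hedged sketch of that packing, with hypothetical names:

#include <stdint.h>

static inline void unpack2(uint64_t r, uint32_t *lo, uint32_t *hi)
{
        *lo = (uint32_t)r;                 /* mov w5,w22     */
        *hi = (uint32_t)(r >> 32);         /* lsr x6,x22,#32 */
}

static inline uint64_t pack2(uint32_t lo, uint32_t hi)
{
        return (uint64_t)lo | ((uint64_t)hi << 32); /* add x5,x5,x6,lsl#32 */
}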
+ add x29,sp,#0 + + adr x5,.Lsigma + stp x19,x20,[sp,#16] + stp x21,x22,[sp,#32] + stp x23,x24,[sp,#48] + stp x25,x26,[sp,#64] + stp x27,x28,[sp,#80] + sub sp,sp,#64 + + ldp x22,x23,[x5] // load sigma + ldp x24,x25,[x3] // load key + ldp x26,x27,[x3,#16] + ldp x28,x30,[x4] // load counter +#ifdef __ARMEB__ + ror x24,x24,#32 + ror x25,x25,#32 + ror x26,x26,#32 + ror x27,x27,#32 + ror x28,x28,#32 + ror x30,x30,#32 +#endif + +.Loop_outer: + mov w5,w22 // unpack key block + lsr x6,x22,#32 + mov w7,w23 + lsr x8,x23,#32 + mov w9,w24 + lsr x10,x24,#32 + mov w11,w25 + lsr x12,x25,#32 + mov w13,w26 + lsr x14,x26,#32 + mov w15,w27 + lsr x16,x27,#32 + mov w17,w28 + lsr x19,x28,#32 + mov w20,w30 + lsr x21,x30,#32 + + mov x4,#10 + subs x2,x2,#64 +.Loop: + sub x4,x4,#1 + add w5,w5,w9 + add w6,w6,w10 + add w7,w7,w11 + add w8,w8,w12 + eor w17,w17,w5 + eor w19,w19,w6 + eor w20,w20,w7 + eor w21,w21,w8 + ror w17,w17,#16 + ror w19,w19,#16 + ror w20,w20,#16 + ror w21,w21,#16 + add w13,w13,w17 + add w14,w14,w19 + add w15,w15,w20 + add w16,w16,w21 + eor w9,w9,w13 + eor w10,w10,w14 + eor w11,w11,w15 + eor w12,w12,w16 + ror w9,w9,#20 + ror w10,w10,#20 + ror w11,w11,#20 + ror w12,w12,#20 + add w5,w5,w9 + add w6,w6,w10 + add w7,w7,w11 + add w8,w8,w12 + eor w17,w17,w5 + eor w19,w19,w6 + eor w20,w20,w7 + eor w21,w21,w8 + ror w17,w17,#24 + ror w19,w19,#24 + ror w20,w20,#24 + ror w21,w21,#24 + add w13,w13,w17 + add w14,w14,w19 + add w15,w15,w20 + add w16,w16,w21 + eor w9,w9,w13 + eor w10,w10,w14 + eor w11,w11,w15 + eor w12,w12,w16 + ror w9,w9,#25 + ror w10,w10,#25 + ror w11,w11,#25 + ror w12,w12,#25 + add w5,w5,w10 + add w6,w6,w11 + add w7,w7,w12 + add w8,w8,w9 + eor w21,w21,w5 + eor w17,w17,w6 + eor w19,w19,w7 + eor w20,w20,w8 + ror w21,w21,#16 + ror w17,w17,#16 + ror w19,w19,#16 + ror w20,w20,#16 + add w15,w15,w21 + add w16,w16,w17 + add w13,w13,w19 + add w14,w14,w20 + eor w10,w10,w15 + eor w11,w11,w16 + eor w12,w12,w13 + eor w9,w9,w14 + ror w10,w10,#20 + ror w11,w11,#20 + ror w12,w12,#20 + ror w9,w9,#20 + add w5,w5,w10 + add w6,w6,w11 + add w7,w7,w12 + add w8,w8,w9 + eor w21,w21,w5 + eor w17,w17,w6 + eor w19,w19,w7 + eor w20,w20,w8 + ror w21,w21,#24 + ror w17,w17,#24 + ror w19,w19,#24 + ror w20,w20,#24 + add w15,w15,w21 + add w16,w16,w17 + add w13,w13,w19 + add w14,w14,w20 + eor w10,w10,w15 + eor w11,w11,w16 + eor w12,w12,w13 + eor w9,w9,w14 + ror w10,w10,#25 + ror w11,w11,#25 + ror w12,w12,#25 + ror w9,w9,#25 + cbnz x4,.Loop + + add w5,w5,w22 // accumulate key block + add x6,x6,x22,lsr#32 + add w7,w7,w23 + add x8,x8,x23,lsr#32 + add w9,w9,w24 + add x10,x10,x24,lsr#32 + add w11,w11,w25 + add x12,x12,x25,lsr#32 + add w13,w13,w26 + add x14,x14,x26,lsr#32 + add w15,w15,w27 + add x16,x16,x27,lsr#32 + add w17,w17,w28 + add x19,x19,x28,lsr#32 + add w20,w20,w30 + add x21,x21,x30,lsr#32 + + b.lo .Ltail + + add x5,x5,x6,lsl#32 // pack + add x7,x7,x8,lsl#32 + ldp x6,x8,[x1,#0] // load input + add x9,x9,x10,lsl#32 + add x11,x11,x12,lsl#32 + ldp x10,x12,[x1,#16] + add x13,x13,x14,lsl#32 + add x15,x15,x16,lsl#32 + ldp x14,x16,[x1,#32] + add x17,x17,x19,lsl#32 + add x20,x20,x21,lsl#32 + ldp x19,x21,[x1,#48] + add x1,x1,#64 +#ifdef __ARMEB__ + rev x5,x5 + rev x7,x7 + rev x9,x9 + rev x11,x11 + rev x13,x13 + rev x15,x15 + rev x17,x17 + rev x20,x20 +#endif + eor x5,x5,x6 + eor x7,x7,x8 + eor x9,x9,x10 + eor x11,x11,x12 + eor x13,x13,x14 + eor x15,x15,x16 + eor x17,x17,x19 + eor x20,x20,x21 + + stp x5,x7,[x0,#0] // store output + add x28,x28,#1 // increment counter + stp x9,x11,[x0,#16] + stp x13,x15,[x0,#32] + stp x17,x20,[x0,#48] + 
add x0,x0,#64 + + b.hi .Loop_outer + + ldp x19,x20,[x29,#16] + add sp,sp,#64 + ldp x21,x22,[x29,#32] + ldp x23,x24,[x29,#48] + ldp x25,x26,[x29,#64] + ldp x27,x28,[x29,#80] + ldp x29,x30,[sp],#96 +.Labort: + ret + +.align 4 +.Ltail: + add x2,x2,#64 +.Less_than_64: + sub x0,x0,#1 + add x1,x1,x2 + add x0,x0,x2 + add x4,sp,x2 + neg x2,x2 + + add x5,x5,x6,lsl#32 // pack + add x7,x7,x8,lsl#32 + add x9,x9,x10,lsl#32 + add x11,x11,x12,lsl#32 + add x13,x13,x14,lsl#32 + add x15,x15,x16,lsl#32 + add x17,x17,x19,lsl#32 + add x20,x20,x21,lsl#32 +#ifdef __ARMEB__ + rev x5,x5 + rev x7,x7 + rev x9,x9 + rev x11,x11 + rev x13,x13 + rev x15,x15 + rev x17,x17 + rev x20,x20 +#endif + stp x5,x7,[sp,#0] + stp x9,x11,[sp,#16] + stp x13,x15,[sp,#32] + stp x17,x20,[sp,#48] + +.Loop_tail: + ldrb w10,[x1,x2] + ldrb w11,[x4,x2] + add x2,x2,#1 + eor w10,w10,w11 + strb w10,[x0,x2] + cbnz x2,.Loop_tail + + stp xzr,xzr,[sp,#0] + stp xzr,xzr,[sp,#16] + stp xzr,xzr,[sp,#32] + stp xzr,xzr,[sp,#48] + + ldp x19,x20,[x29,#16] + add sp,sp,#64 + ldp x21,x22,[x29,#32] + ldp x23,x24,[x29,#48] + ldp x25,x26,[x29,#64] + ldp x27,x28,[x29,#80] + ldp x29,x30,[sp],#96 + ret +.size ChaCha20_ctr32,.-ChaCha20_ctr32 + +.type ChaCha20_neon,%function +.align 5 +ChaCha20_neon: + stp x29,x30,[sp,#-96]! + add x29,sp,#0 + + adr x5,.Lsigma + stp x19,x20,[sp,#16] + stp x21,x22,[sp,#32] + stp x23,x24,[sp,#48] + stp x25,x26,[sp,#64] + stp x27,x28,[sp,#80] + cmp x2,#512 + b.hs .L512_or_more_neon + + sub sp,sp,#64 + + ldp x22,x23,[x5] // load sigma + ld1 {v24.4s},[x5],#16 + ldp x24,x25,[x3] // load key + ldp x26,x27,[x3,#16] + ld1 {v25.4s,v26.4s},[x3] + ldp x28,x30,[x4] // load counter + ld1 {v27.4s},[x4] + ld1 {v31.4s},[x5] +#ifdef __ARMEB__ + rev64 v24.4s,v24.4s + ror x24,x24,#32 + ror x25,x25,#32 + ror x26,x26,#32 + ror x27,x27,#32 + ror x28,x28,#32 + ror x30,x30,#32 +#endif + add v27.4s,v27.4s,v31.4s // += 1 + add v28.4s,v27.4s,v31.4s + add v29.4s,v28.4s,v31.4s + shl v31.4s,v31.4s,#2 // 1 -> 4 + +.Loop_outer_neon: + mov w5,w22 // unpack key block + lsr x6,x22,#32 + mov v0.16b,v24.16b + mov w7,w23 + lsr x8,x23,#32 + mov v4.16b,v24.16b + mov w9,w24 + lsr x10,x24,#32 + mov v16.16b,v24.16b + mov w11,w25 + mov v1.16b,v25.16b + lsr x12,x25,#32 + mov v5.16b,v25.16b + mov w13,w26 + mov v17.16b,v25.16b + lsr x14,x26,#32 + mov v3.16b,v27.16b + mov w15,w27 + mov v7.16b,v28.16b + lsr x16,x27,#32 + mov v19.16b,v29.16b + mov w17,w28 + mov v2.16b,v26.16b + lsr x19,x28,#32 + mov v6.16b,v26.16b + mov w20,w30 + mov v18.16b,v26.16b + lsr x21,x30,#32 + + mov x4,#10 + subs x2,x2,#256 +.Loop_neon: + sub x4,x4,#1 + add v0.4s,v0.4s,v1.4s + add w5,w5,w9 + add v4.4s,v4.4s,v5.4s + add w6,w6,w10 + add v16.4s,v16.4s,v17.4s + add w7,w7,w11 + eor v3.16b,v3.16b,v0.16b + add w8,w8,w12 + eor v7.16b,v7.16b,v4.16b + eor w17,w17,w5 + eor v19.16b,v19.16b,v16.16b + eor w19,w19,w6 + rev32 v3.8h,v3.8h + eor w20,w20,w7 + rev32 v7.8h,v7.8h + eor w21,w21,w8 + rev32 v19.8h,v19.8h + ror w17,w17,#16 + add v2.4s,v2.4s,v3.4s + ror w19,w19,#16 + add v6.4s,v6.4s,v7.4s + ror w20,w20,#16 + add v18.4s,v18.4s,v19.4s + ror w21,w21,#16 + eor v20.16b,v1.16b,v2.16b + add w13,w13,w17 + eor v21.16b,v5.16b,v6.16b + add w14,w14,w19 + eor v22.16b,v17.16b,v18.16b + add w15,w15,w20 + ushr v1.4s,v20.4s,#20 + add w16,w16,w21 + ushr v5.4s,v21.4s,#20 + eor w9,w9,w13 + ushr v17.4s,v22.4s,#20 + eor w10,w10,w14 + sli v1.4s,v20.4s,#12 + eor w11,w11,w15 + sli v5.4s,v21.4s,#12 + eor w12,w12,w16 + sli v17.4s,v22.4s,#12 + ror w9,w9,#20 + add v0.4s,v0.4s,v1.4s + ror w10,w10,#20 + add v4.4s,v4.4s,v5.4s + ror w11,w11,#20 + add 
v16.4s,v16.4s,v17.4s + ror w12,w12,#20 + eor v20.16b,v3.16b,v0.16b + add w5,w5,w9 + eor v21.16b,v7.16b,v4.16b + add w6,w6,w10 + eor v22.16b,v19.16b,v16.16b + add w7,w7,w11 + ushr v3.4s,v20.4s,#24 + add w8,w8,w12 + ushr v7.4s,v21.4s,#24 + eor w17,w17,w5 + ushr v19.4s,v22.4s,#24 + eor w19,w19,w6 + sli v3.4s,v20.4s,#8 + eor w20,w20,w7 + sli v7.4s,v21.4s,#8 + eor w21,w21,w8 + sli v19.4s,v22.4s,#8 + ror w17,w17,#24 + add v2.4s,v2.4s,v3.4s + ror w19,w19,#24 + add v6.4s,v6.4s,v7.4s + ror w20,w20,#24 + add v18.4s,v18.4s,v19.4s + ror w21,w21,#24 + eor v20.16b,v1.16b,v2.16b + add w13,w13,w17 + eor v21.16b,v5.16b,v6.16b + add w14,w14,w19 + eor v22.16b,v17.16b,v18.16b + add w15,w15,w20 + ushr v1.4s,v20.4s,#25 + add w16,w16,w21 + ushr v5.4s,v21.4s,#25 + eor w9,w9,w13 + ushr v17.4s,v22.4s,#25 + eor w10,w10,w14 + sli v1.4s,v20.4s,#7 + eor w11,w11,w15 + sli v5.4s,v21.4s,#7 + eor w12,w12,w16 + sli v17.4s,v22.4s,#7 + ror w9,w9,#25 + ext v2.16b,v2.16b,v2.16b,#8 + ror w10,w10,#25 + ext v6.16b,v6.16b,v6.16b,#8 + ror w11,w11,#25 + ext v18.16b,v18.16b,v18.16b,#8 + ror w12,w12,#25 + ext v3.16b,v3.16b,v3.16b,#12 + ext v7.16b,v7.16b,v7.16b,#12 + ext v19.16b,v19.16b,v19.16b,#12 + ext v1.16b,v1.16b,v1.16b,#4 + ext v5.16b,v5.16b,v5.16b,#4 + ext v17.16b,v17.16b,v17.16b,#4 + add v0.4s,v0.4s,v1.4s + add w5,w5,w10 + add v4.4s,v4.4s,v5.4s + add w6,w6,w11 + add v16.4s,v16.4s,v17.4s + add w7,w7,w12 + eor v3.16b,v3.16b,v0.16b + add w8,w8,w9 + eor v7.16b,v7.16b,v4.16b + eor w21,w21,w5 + eor v19.16b,v19.16b,v16.16b + eor w17,w17,w6 + rev32 v3.8h,v3.8h + eor w19,w19,w7 + rev32 v7.8h,v7.8h + eor w20,w20,w8 + rev32 v19.8h,v19.8h + ror w21,w21,#16 + add v2.4s,v2.4s,v3.4s + ror w17,w17,#16 + add v6.4s,v6.4s,v7.4s + ror w19,w19,#16 + add v18.4s,v18.4s,v19.4s + ror w20,w20,#16 + eor v20.16b,v1.16b,v2.16b + add w15,w15,w21 + eor v21.16b,v5.16b,v6.16b + add w16,w16,w17 + eor v22.16b,v17.16b,v18.16b + add w13,w13,w19 + ushr v1.4s,v20.4s,#20 + add w14,w14,w20 + ushr v5.4s,v21.4s,#20 + eor w10,w10,w15 + ushr v17.4s,v22.4s,#20 + eor w11,w11,w16 + sli v1.4s,v20.4s,#12 + eor w12,w12,w13 + sli v5.4s,v21.4s,#12 + eor w9,w9,w14 + sli v17.4s,v22.4s,#12 + ror w10,w10,#20 + add v0.4s,v0.4s,v1.4s + ror w11,w11,#20 + add v4.4s,v4.4s,v5.4s + ror w12,w12,#20 + add v16.4s,v16.4s,v17.4s + ror w9,w9,#20 + eor v20.16b,v3.16b,v0.16b + add w5,w5,w10 + eor v21.16b,v7.16b,v4.16b + add w6,w6,w11 + eor v22.16b,v19.16b,v16.16b + add w7,w7,w12 + ushr v3.4s,v20.4s,#24 + add w8,w8,w9 + ushr v7.4s,v21.4s,#24 + eor w21,w21,w5 + ushr v19.4s,v22.4s,#24 + eor w17,w17,w6 + sli v3.4s,v20.4s,#8 + eor w19,w19,w7 + sli v7.4s,v21.4s,#8 + eor w20,w20,w8 + sli v19.4s,v22.4s,#8 + ror w21,w21,#24 + add v2.4s,v2.4s,v3.4s + ror w17,w17,#24 + add v6.4s,v6.4s,v7.4s + ror w19,w19,#24 + add v18.4s,v18.4s,v19.4s + ror w20,w20,#24 + eor v20.16b,v1.16b,v2.16b + add w15,w15,w21 + eor v21.16b,v5.16b,v6.16b + add w16,w16,w17 + eor v22.16b,v17.16b,v18.16b + add w13,w13,w19 + ushr v1.4s,v20.4s,#25 + add w14,w14,w20 + ushr v5.4s,v21.4s,#25 + eor w10,w10,w15 + ushr v17.4s,v22.4s,#25 + eor w11,w11,w16 + sli v1.4s,v20.4s,#7 + eor w12,w12,w13 + sli v5.4s,v21.4s,#7 + eor w9,w9,w14 + sli v17.4s,v22.4s,#7 + ror w10,w10,#25 + ext v2.16b,v2.16b,v2.16b,#8 + ror w11,w11,#25 + ext v6.16b,v6.16b,v6.16b,#8 + ror w12,w12,#25 + ext v18.16b,v18.16b,v18.16b,#8 + ror w9,w9,#25 + ext v3.16b,v3.16b,v3.16b,#4 + ext v7.16b,v7.16b,v7.16b,#4 + ext v19.16b,v19.16b,v19.16b,#4 + ext v1.16b,v1.16b,v1.16b,#12 + ext v5.16b,v5.16b,v5.16b,#12 + ext v17.16b,v17.16b,v17.16b,#12 + cbnz x4,.Loop_neon + + add w5,w5,w22 // accumulate 
key block + add v0.4s,v0.4s,v24.4s + add x6,x6,x22,lsr#32 + add v4.4s,v4.4s,v24.4s + add w7,w7,w23 + add v16.4s,v16.4s,v24.4s + add x8,x8,x23,lsr#32 + add v2.4s,v2.4s,v26.4s + add w9,w9,w24 + add v6.4s,v6.4s,v26.4s + add x10,x10,x24,lsr#32 + add v18.4s,v18.4s,v26.4s + add w11,w11,w25 + add v3.4s,v3.4s,v27.4s + add x12,x12,x25,lsr#32 + add w13,w13,w26 + add v7.4s,v7.4s,v28.4s + add x14,x14,x26,lsr#32 + add w15,w15,w27 + add v19.4s,v19.4s,v29.4s + add x16,x16,x27,lsr#32 + add w17,w17,w28 + add v1.4s,v1.4s,v25.4s + add x19,x19,x28,lsr#32 + add w20,w20,w30 + add v5.4s,v5.4s,v25.4s + add x21,x21,x30,lsr#32 + add v17.4s,v17.4s,v25.4s + + b.lo .Ltail_neon + + add x5,x5,x6,lsl#32 // pack + add x7,x7,x8,lsl#32 + ldp x6,x8,[x1,#0] // load input + add x9,x9,x10,lsl#32 + add x11,x11,x12,lsl#32 + ldp x10,x12,[x1,#16] + add x13,x13,x14,lsl#32 + add x15,x15,x16,lsl#32 + ldp x14,x16,[x1,#32] + add x17,x17,x19,lsl#32 + add x20,x20,x21,lsl#32 + ldp x19,x21,[x1,#48] + add x1,x1,#64 +#ifdef __ARMEB__ + rev x5,x5 + rev x7,x7 + rev x9,x9 + rev x11,x11 + rev x13,x13 + rev x15,x15 + rev x17,x17 + rev x20,x20 +#endif + ld1 {v20.16b,v21.16b,v22.16b,v23.16b},[x1],#64 + eor x5,x5,x6 + eor x7,x7,x8 + eor x9,x9,x10 + eor x11,x11,x12 + eor x13,x13,x14 + eor v0.16b,v0.16b,v20.16b + eor x15,x15,x16 + eor v1.16b,v1.16b,v21.16b + eor x17,x17,x19 + eor v2.16b,v2.16b,v22.16b + eor x20,x20,x21 + eor v3.16b,v3.16b,v23.16b + ld1 {v20.16b,v21.16b,v22.16b,v23.16b},[x1],#64 + + stp x5,x7,[x0,#0] // store output + add x28,x28,#4 // increment counter + stp x9,x11,[x0,#16] + add v27.4s,v27.4s,v31.4s // += 4 + stp x13,x15,[x0,#32] + add v28.4s,v28.4s,v31.4s + stp x17,x20,[x0,#48] + add v29.4s,v29.4s,v31.4s + add x0,x0,#64 + + st1 {v0.16b,v1.16b,v2.16b,v3.16b},[x0],#64 + ld1 {v0.16b,v1.16b,v2.16b,v3.16b},[x1],#64 + + eor v4.16b,v4.16b,v20.16b + eor v5.16b,v5.16b,v21.16b + eor v6.16b,v6.16b,v22.16b + eor v7.16b,v7.16b,v23.16b + st1 {v4.16b,v5.16b,v6.16b,v7.16b},[x0],#64 + + eor v16.16b,v16.16b,v0.16b + eor v17.16b,v17.16b,v1.16b + eor v18.16b,v18.16b,v2.16b + eor v19.16b,v19.16b,v3.16b + st1 {v16.16b,v17.16b,v18.16b,v19.16b},[x0],#64 + + b.hi .Loop_outer_neon + + ldp x19,x20,[x29,#16] + add sp,sp,#64 + ldp x21,x22,[x29,#32] + ldp x23,x24,[x29,#48] + ldp x25,x26,[x29,#64] + ldp x27,x28,[x29,#80] + ldp x29,x30,[sp],#96 + ret + +.Ltail_neon: + add x2,x2,#256 + cmp x2,#64 + b.lo .Less_than_64 + + add x5,x5,x6,lsl#32 // pack + add x7,x7,x8,lsl#32 + ldp x6,x8,[x1,#0] // load input + add x9,x9,x10,lsl#32 + add x11,x11,x12,lsl#32 + ldp x10,x12,[x1,#16] + add x13,x13,x14,lsl#32 + add x15,x15,x16,lsl#32 + ldp x14,x16,[x1,#32] + add x17,x17,x19,lsl#32 + add x20,x20,x21,lsl#32 + ldp x19,x21,[x1,#48] + add x1,x1,#64 +#ifdef __ARMEB__ + rev x5,x5 + rev x7,x7 + rev x9,x9 + rev x11,x11 + rev x13,x13 + rev x15,x15 + rev x17,x17 + rev x20,x20 +#endif + eor x5,x5,x6 + eor x7,x7,x8 + eor x9,x9,x10 + eor x11,x11,x12 + eor x13,x13,x14 + eor x15,x15,x16 + eor x17,x17,x19 + eor x20,x20,x21 + + stp x5,x7,[x0,#0] // store output + add x28,x28,#4 // increment counter + stp x9,x11,[x0,#16] + stp x13,x15,[x0,#32] + stp x17,x20,[x0,#48] + add x0,x0,#64 + b.eq .Ldone_neon + sub x2,x2,#64 + cmp x2,#64 + b.lo .Less_than_128 + + ld1 {v20.16b,v21.16b,v22.16b,v23.16b},[x1],#64 + eor v0.16b,v0.16b,v20.16b + eor v1.16b,v1.16b,v21.16b + eor v2.16b,v2.16b,v22.16b + eor v3.16b,v3.16b,v23.16b + st1 {v0.16b,v1.16b,v2.16b,v3.16b},[x0],#64 + b.eq .Ldone_neon + sub x2,x2,#64 + cmp x2,#64 + b.lo .Less_than_192 + + ld1 {v20.16b,v21.16b,v22.16b,v23.16b},[x1],#64 + eor 
v4.16b,v4.16b,v20.16b + eor v5.16b,v5.16b,v21.16b + eor v6.16b,v6.16b,v22.16b + eor v7.16b,v7.16b,v23.16b + st1 {v4.16b,v5.16b,v6.16b,v7.16b},[x0],#64 + b.eq .Ldone_neon + sub x2,x2,#64 + + st1 {v16.16b,v17.16b,v18.16b,v19.16b},[sp] + b .Last_neon + +.Less_than_128: + st1 {v0.16b,v1.16b,v2.16b,v3.16b},[sp] + b .Last_neon +.Less_than_192: + st1 {v4.16b,v5.16b,v6.16b,v7.16b},[sp] + b .Last_neon + +.align 4 +.Last_neon: + sub x0,x0,#1 + add x1,x1,x2 + add x0,x0,x2 + add x4,sp,x2 + neg x2,x2 + +.Loop_tail_neon: + ldrb w10,[x1,x2] + ldrb w11,[x4,x2] + add x2,x2,#1 + eor w10,w10,w11 + strb w10,[x0,x2] + cbnz x2,.Loop_tail_neon + + stp xzr,xzr,[sp,#0] + stp xzr,xzr,[sp,#16] + stp xzr,xzr,[sp,#32] + stp xzr,xzr,[sp,#48] + +.Ldone_neon: + ldp x19,x20,[x29,#16] + add sp,sp,#64 + ldp x21,x22,[x29,#32] + ldp x23,x24,[x29,#48] + ldp x25,x26,[x29,#64] + ldp x27,x28,[x29,#80] + ldp x29,x30,[sp],#96 + ret +.size ChaCha20_neon,.-ChaCha20_neon +.type ChaCha20_512_neon,%function +.align 5 +ChaCha20_512_neon: + stp x29,x30,[sp,#-96]! + add x29,sp,#0 + + adr x5,.Lsigma + stp x19,x20,[sp,#16] + stp x21,x22,[sp,#32] + stp x23,x24,[sp,#48] + stp x25,x26,[sp,#64] + stp x27,x28,[sp,#80] + +.L512_or_more_neon: + sub sp,sp,#128+64 + + ldp x22,x23,[x5] // load sigma + ld1 {v24.4s},[x5],#16 + ldp x24,x25,[x3] // load key + ldp x26,x27,[x3,#16] + ld1 {v25.4s,v26.4s},[x3] + ldp x28,x30,[x4] // load counter + ld1 {v27.4s},[x4] + ld1 {v31.4s},[x5] +#ifdef __ARMEB__ + rev64 v24.4s,v24.4s + ror x24,x24,#32 + ror x25,x25,#32 + ror x26,x26,#32 + ror x27,x27,#32 + ror x28,x28,#32 + ror x30,x30,#32 +#endif + add v27.4s,v27.4s,v31.4s // += 1 + stp q24,q25,[sp,#0] // off-load key block, invariant part + add v27.4s,v27.4s,v31.4s // not typo + str q26,[sp,#32] + add v28.4s,v27.4s,v31.4s + add v29.4s,v28.4s,v31.4s + add v30.4s,v29.4s,v31.4s + shl v31.4s,v31.4s,#2 // 1 -> 4 + + stp d8,d9,[sp,#128+0] // meet ABI requirements + stp d10,d11,[sp,#128+16] + stp d12,d13,[sp,#128+32] + stp d14,d15,[sp,#128+48] + + sub x2,x2,#512 // not typo + +.Loop_outer_512_neon: + mov v0.16b,v24.16b + mov v4.16b,v24.16b + mov v8.16b,v24.16b + mov v12.16b,v24.16b + mov v16.16b,v24.16b + mov v20.16b,v24.16b + mov v1.16b,v25.16b + mov w5,w22 // unpack key block + mov v5.16b,v25.16b + lsr x6,x22,#32 + mov v9.16b,v25.16b + mov w7,w23 + mov v13.16b,v25.16b + lsr x8,x23,#32 + mov v17.16b,v25.16b + mov w9,w24 + mov v21.16b,v25.16b + lsr x10,x24,#32 + mov v3.16b,v27.16b + mov w11,w25 + mov v7.16b,v28.16b + lsr x12,x25,#32 + mov v11.16b,v29.16b + mov w13,w26 + mov v15.16b,v30.16b + lsr x14,x26,#32 + mov v2.16b,v26.16b + mov w15,w27 + mov v6.16b,v26.16b + lsr x16,x27,#32 + add v19.4s,v3.4s,v31.4s // +4 + mov w17,w28 + add v23.4s,v7.4s,v31.4s // +4 + lsr x19,x28,#32 + mov v10.16b,v26.16b + mov w20,w30 + mov v14.16b,v26.16b + lsr x21,x30,#32 + mov v18.16b,v26.16b + stp q27,q28,[sp,#48] // off-load key block, variable part + mov v22.16b,v26.16b + str q29,[sp,#80] + + mov x4,#5 + subs x2,x2,#512 +.Loop_upper_neon: + sub x4,x4,#1 + add v0.4s,v0.4s,v1.4s + add w5,w5,w9 + add v4.4s,v4.4s,v5.4s + add w6,w6,w10 + add v8.4s,v8.4s,v9.4s + add w7,w7,w11 + add v12.4s,v12.4s,v13.4s + add w8,w8,w12 + add v16.4s,v16.4s,v17.4s + eor w17,w17,w5 + add v20.4s,v20.4s,v21.4s + eor w19,w19,w6 + eor v3.16b,v3.16b,v0.16b + eor w20,w20,w7 + eor v7.16b,v7.16b,v4.16b + eor w21,w21,w8 + eor v11.16b,v11.16b,v8.16b + ror w17,w17,#16 + eor v15.16b,v15.16b,v12.16b + ror w19,w19,#16 + eor v19.16b,v19.16b,v16.16b + ror w20,w20,#16 + eor v23.16b,v23.16b,v20.16b + ror w21,w21,#16 + rev32 v3.8h,v3.8h + 
add w13,w13,w17 + rev32 v7.8h,v7.8h + add w14,w14,w19 + rev32 v11.8h,v11.8h + add w15,w15,w20 + rev32 v15.8h,v15.8h + add w16,w16,w21 + rev32 v19.8h,v19.8h + eor w9,w9,w13 + rev32 v23.8h,v23.8h + eor w10,w10,w14 + add v2.4s,v2.4s,v3.4s + eor w11,w11,w15 + add v6.4s,v6.4s,v7.4s + eor w12,w12,w16 + add v10.4s,v10.4s,v11.4s + ror w9,w9,#20 + add v14.4s,v14.4s,v15.4s + ror w10,w10,#20 + add v18.4s,v18.4s,v19.4s + ror w11,w11,#20 + add v22.4s,v22.4s,v23.4s + ror w12,w12,#20 + eor v24.16b,v1.16b,v2.16b + add w5,w5,w9 + eor v25.16b,v5.16b,v6.16b + add w6,w6,w10 + eor v26.16b,v9.16b,v10.16b + add w7,w7,w11 + eor v27.16b,v13.16b,v14.16b + add w8,w8,w12 + eor v28.16b,v17.16b,v18.16b + eor w17,w17,w5 + eor v29.16b,v21.16b,v22.16b + eor w19,w19,w6 + ushr v1.4s,v24.4s,#20 + eor w20,w20,w7 + ushr v5.4s,v25.4s,#20 + eor w21,w21,w8 + ushr v9.4s,v26.4s,#20 + ror w17,w17,#24 + ushr v13.4s,v27.4s,#20 + ror w19,w19,#24 + ushr v17.4s,v28.4s,#20 + ror w20,w20,#24 + ushr v21.4s,v29.4s,#20 + ror w21,w21,#24 + sli v1.4s,v24.4s,#12 + add w13,w13,w17 + sli v5.4s,v25.4s,#12 + add w14,w14,w19 + sli v9.4s,v26.4s,#12 + add w15,w15,w20 + sli v13.4s,v27.4s,#12 + add w16,w16,w21 + sli v17.4s,v28.4s,#12 + eor w9,w9,w13 + sli v21.4s,v29.4s,#12 + eor w10,w10,w14 + add v0.4s,v0.4s,v1.4s + eor w11,w11,w15 + add v4.4s,v4.4s,v5.4s + eor w12,w12,w16 + add v8.4s,v8.4s,v9.4s + ror w9,w9,#25 + add v12.4s,v12.4s,v13.4s + ror w10,w10,#25 + add v16.4s,v16.4s,v17.4s + ror w11,w11,#25 + add v20.4s,v20.4s,v21.4s + ror w12,w12,#25 + eor v24.16b,v3.16b,v0.16b + add w5,w5,w10 + eor v25.16b,v7.16b,v4.16b + add w6,w6,w11 + eor v26.16b,v11.16b,v8.16b + add w7,w7,w12 + eor v27.16b,v15.16b,v12.16b + add w8,w8,w9 + eor v28.16b,v19.16b,v16.16b + eor w21,w21,w5 + eor v29.16b,v23.16b,v20.16b + eor w17,w17,w6 + ushr v3.4s,v24.4s,#24 + eor w19,w19,w7 + ushr v7.4s,v25.4s,#24 + eor w20,w20,w8 + ushr v11.4s,v26.4s,#24 + ror w21,w21,#16 + ushr v15.4s,v27.4s,#24 + ror w17,w17,#16 + ushr v19.4s,v28.4s,#24 + ror w19,w19,#16 + ushr v23.4s,v29.4s,#24 + ror w20,w20,#16 + sli v3.4s,v24.4s,#8 + add w15,w15,w21 + sli v7.4s,v25.4s,#8 + add w16,w16,w17 + sli v11.4s,v26.4s,#8 + add w13,w13,w19 + sli v15.4s,v27.4s,#8 + add w14,w14,w20 + sli v19.4s,v28.4s,#8 + eor w10,w10,w15 + sli v23.4s,v29.4s,#8 + eor w11,w11,w16 + add v2.4s,v2.4s,v3.4s + eor w12,w12,w13 + add v6.4s,v6.4s,v7.4s + eor w9,w9,w14 + add v10.4s,v10.4s,v11.4s + ror w10,w10,#20 + add v14.4s,v14.4s,v15.4s + ror w11,w11,#20 + add v18.4s,v18.4s,v19.4s + ror w12,w12,#20 + add v22.4s,v22.4s,v23.4s + ror w9,w9,#20 + eor v24.16b,v1.16b,v2.16b + add w5,w5,w10 + eor v25.16b,v5.16b,v6.16b + add w6,w6,w11 + eor v26.16b,v9.16b,v10.16b + add w7,w7,w12 + eor v27.16b,v13.16b,v14.16b + add w8,w8,w9 + eor v28.16b,v17.16b,v18.16b + eor w21,w21,w5 + eor v29.16b,v21.16b,v22.16b + eor w17,w17,w6 + ushr v1.4s,v24.4s,#25 + eor w19,w19,w7 + ushr v5.4s,v25.4s,#25 + eor w20,w20,w8 + ushr v9.4s,v26.4s,#25 + ror w21,w21,#24 + ushr v13.4s,v27.4s,#25 + ror w17,w17,#24 + ushr v17.4s,v28.4s,#25 + ror w19,w19,#24 + ushr v21.4s,v29.4s,#25 + ror w20,w20,#24 + sli v1.4s,v24.4s,#7 + add w15,w15,w21 + sli v5.4s,v25.4s,#7 + add w16,w16,w17 + sli v9.4s,v26.4s,#7 + add w13,w13,w19 + sli v13.4s,v27.4s,#7 + add w14,w14,w20 + sli v17.4s,v28.4s,#7 + eor w10,w10,w15 + sli v21.4s,v29.4s,#7 + eor w11,w11,w16 + ext v2.16b,v2.16b,v2.16b,#8 + eor w12,w12,w13 + ext v6.16b,v6.16b,v6.16b,#8 + eor w9,w9,w14 + ext v10.16b,v10.16b,v10.16b,#8 + ror w10,w10,#25 + ext v14.16b,v14.16b,v14.16b,#8 + ror w11,w11,#25 + ext v18.16b,v18.16b,v18.16b,#8 + ror w12,w12,#25 + 
ext v22.16b,v22.16b,v22.16b,#8 + ror w9,w9,#25 + ext v3.16b,v3.16b,v3.16b,#12 + ext v7.16b,v7.16b,v7.16b,#12 + ext v11.16b,v11.16b,v11.16b,#12 + ext v15.16b,v15.16b,v15.16b,#12 + ext v19.16b,v19.16b,v19.16b,#12 + ext v23.16b,v23.16b,v23.16b,#12 + ext v1.16b,v1.16b,v1.16b,#4 + ext v5.16b,v5.16b,v5.16b,#4 + ext v9.16b,v9.16b,v9.16b,#4 + ext v13.16b,v13.16b,v13.16b,#4 + ext v17.16b,v17.16b,v17.16b,#4 + ext v21.16b,v21.16b,v21.16b,#4 + add v0.4s,v0.4s,v1.4s + add w5,w5,w9 + add v4.4s,v4.4s,v5.4s + add w6,w6,w10 + add v8.4s,v8.4s,v9.4s + add w7,w7,w11 + add v12.4s,v12.4s,v13.4s + add w8,w8,w12 + add v16.4s,v16.4s,v17.4s + eor w17,w17,w5 + add v20.4s,v20.4s,v21.4s + eor w19,w19,w6 + eor v3.16b,v3.16b,v0.16b + eor w20,w20,w7 + eor v7.16b,v7.16b,v4.16b + eor w21,w21,w8 + eor v11.16b,v11.16b,v8.16b + ror w17,w17,#16 + eor v15.16b,v15.16b,v12.16b + ror w19,w19,#16 + eor v19.16b,v19.16b,v16.16b + ror w20,w20,#16 + eor v23.16b,v23.16b,v20.16b + ror w21,w21,#16 + rev32 v3.8h,v3.8h + add w13,w13,w17 + rev32 v7.8h,v7.8h + add w14,w14,w19 + rev32 v11.8h,v11.8h + add w15,w15,w20 + rev32 v15.8h,v15.8h + add w16,w16,w21 + rev32 v19.8h,v19.8h + eor w9,w9,w13 + rev32 v23.8h,v23.8h + eor w10,w10,w14 + add v2.4s,v2.4s,v3.4s + eor w11,w11,w15 + add v6.4s,v6.4s,v7.4s + eor w12,w12,w16 + add v10.4s,v10.4s,v11.4s + ror w9,w9,#20 + add v14.4s,v14.4s,v15.4s + ror w10,w10,#20 + add v18.4s,v18.4s,v19.4s + ror w11,w11,#20 + add v22.4s,v22.4s,v23.4s + ror w12,w12,#20 + eor v24.16b,v1.16b,v2.16b + add w5,w5,w9 + eor v25.16b,v5.16b,v6.16b + add w6,w6,w10 + eor v26.16b,v9.16b,v10.16b + add w7,w7,w11 + eor v27.16b,v13.16b,v14.16b + add w8,w8,w12 + eor v28.16b,v17.16b,v18.16b + eor w17,w17,w5 + eor v29.16b,v21.16b,v22.16b + eor w19,w19,w6 + ushr v1.4s,v24.4s,#20 + eor w20,w20,w7 + ushr v5.4s,v25.4s,#20 + eor w21,w21,w8 + ushr v9.4s,v26.4s,#20 + ror w17,w17,#24 + ushr v13.4s,v27.4s,#20 + ror w19,w19,#24 + ushr v17.4s,v28.4s,#20 + ror w20,w20,#24 + ushr v21.4s,v29.4s,#20 + ror w21,w21,#24 + sli v1.4s,v24.4s,#12 + add w13,w13,w17 + sli v5.4s,v25.4s,#12 + add w14,w14,w19 + sli v9.4s,v26.4s,#12 + add w15,w15,w20 + sli v13.4s,v27.4s,#12 + add w16,w16,w21 + sli v17.4s,v28.4s,#12 + eor w9,w9,w13 + sli v21.4s,v29.4s,#12 + eor w10,w10,w14 + add v0.4s,v0.4s,v1.4s + eor w11,w11,w15 + add v4.4s,v4.4s,v5.4s + eor w12,w12,w16 + add v8.4s,v8.4s,v9.4s + ror w9,w9,#25 + add v12.4s,v12.4s,v13.4s + ror w10,w10,#25 + add v16.4s,v16.4s,v17.4s + ror w11,w11,#25 + add v20.4s,v20.4s,v21.4s + ror w12,w12,#25 + eor v24.16b,v3.16b,v0.16b + add w5,w5,w10 + eor v25.16b,v7.16b,v4.16b + add w6,w6,w11 + eor v26.16b,v11.16b,v8.16b + add w7,w7,w12 + eor v27.16b,v15.16b,v12.16b + add w8,w8,w9 + eor v28.16b,v19.16b,v16.16b + eor w21,w21,w5 + eor v29.16b,v23.16b,v20.16b + eor w17,w17,w6 + ushr v3.4s,v24.4s,#24 + eor w19,w19,w7 + ushr v7.4s,v25.4s,#24 + eor w20,w20,w8 + ushr v11.4s,v26.4s,#24 + ror w21,w21,#16 + ushr v15.4s,v27.4s,#24 + ror w17,w17,#16 + ushr v19.4s,v28.4s,#24 + ror w19,w19,#16 + ushr v23.4s,v29.4s,#24 + ror w20,w20,#16 + sli v3.4s,v24.4s,#8 + add w15,w15,w21 + sli v7.4s,v25.4s,#8 + add w16,w16,w17 + sli v11.4s,v26.4s,#8 + add w13,w13,w19 + sli v15.4s,v27.4s,#8 + add w14,w14,w20 + sli v19.4s,v28.4s,#8 + eor w10,w10,w15 + sli v23.4s,v29.4s,#8 + eor w11,w11,w16 + add v2.4s,v2.4s,v3.4s + eor w12,w12,w13 + add v6.4s,v6.4s,v7.4s + eor w9,w9,w14 + add v10.4s,v10.4s,v11.4s + ror w10,w10,#20 + add v14.4s,v14.4s,v15.4s + ror w11,w11,#20 + add v18.4s,v18.4s,v19.4s + ror w12,w12,#20 + add v22.4s,v22.4s,v23.4s + ror w9,w9,#20 + eor v24.16b,v1.16b,v2.16b + add 
w5,w5,w10 + eor v25.16b,v5.16b,v6.16b + add w6,w6,w11 + eor v26.16b,v9.16b,v10.16b + add w7,w7,w12 + eor v27.16b,v13.16b,v14.16b + add w8,w8,w9 + eor v28.16b,v17.16b,v18.16b + eor w21,w21,w5 + eor v29.16b,v21.16b,v22.16b + eor w17,w17,w6 + ushr v1.4s,v24.4s,#25 + eor w19,w19,w7 + ushr v5.4s,v25.4s,#25 + eor w20,w20,w8 + ushr v9.4s,v26.4s,#25 + ror w21,w21,#24 + ushr v13.4s,v27.4s,#25 + ror w17,w17,#24 + ushr v17.4s,v28.4s,#25 + ror w19,w19,#24 + ushr v21.4s,v29.4s,#25 + ror w20,w20,#24 + sli v1.4s,v24.4s,#7 + add w15,w15,w21 + sli v5.4s,v25.4s,#7 + add w16,w16,w17 + sli v9.4s,v26.4s,#7 + add w13,w13,w19 + sli v13.4s,v27.4s,#7 + add w14,w14,w20 + sli v17.4s,v28.4s,#7 + eor w10,w10,w15 + sli v21.4s,v29.4s,#7 + eor w11,w11,w16 + ext v2.16b,v2.16b,v2.16b,#8 + eor w12,w12,w13 + ext v6.16b,v6.16b,v6.16b,#8 + eor w9,w9,w14 + ext v10.16b,v10.16b,v10.16b,#8 + ror w10,w10,#25 + ext v14.16b,v14.16b,v14.16b,#8 + ror w11,w11,#25 + ext v18.16b,v18.16b,v18.16b,#8 + ror w12,w12,#25 + ext v22.16b,v22.16b,v22.16b,#8 + ror w9,w9,#25 + ext v3.16b,v3.16b,v3.16b,#4 + ext v7.16b,v7.16b,v7.16b,#4 + ext v11.16b,v11.16b,v11.16b,#4 + ext v15.16b,v15.16b,v15.16b,#4 + ext v19.16b,v19.16b,v19.16b,#4 + ext v23.16b,v23.16b,v23.16b,#4 + ext v1.16b,v1.16b,v1.16b,#12 + ext v5.16b,v5.16b,v5.16b,#12 + ext v9.16b,v9.16b,v9.16b,#12 + ext v13.16b,v13.16b,v13.16b,#12 + ext v17.16b,v17.16b,v17.16b,#12 + ext v21.16b,v21.16b,v21.16b,#12 + cbnz x4,.Loop_upper_neon + + add w5,w5,w22 // accumulate key block + add x6,x6,x22,lsr#32 + add w7,w7,w23 + add x8,x8,x23,lsr#32 + add w9,w9,w24 + add x10,x10,x24,lsr#32 + add w11,w11,w25 + add x12,x12,x25,lsr#32 + add w13,w13,w26 + add x14,x14,x26,lsr#32 + add w15,w15,w27 + add x16,x16,x27,lsr#32 + add w17,w17,w28 + add x19,x19,x28,lsr#32 + add w20,w20,w30 + add x21,x21,x30,lsr#32 + + add x5,x5,x6,lsl#32 // pack + add x7,x7,x8,lsl#32 + ldp x6,x8,[x1,#0] // load input + add x9,x9,x10,lsl#32 + add x11,x11,x12,lsl#32 + ldp x10,x12,[x1,#16] + add x13,x13,x14,lsl#32 + add x15,x15,x16,lsl#32 + ldp x14,x16,[x1,#32] + add x17,x17,x19,lsl#32 + add x20,x20,x21,lsl#32 + ldp x19,x21,[x1,#48] + add x1,x1,#64 +#ifdef __ARMEB__ + rev x5,x5 + rev x7,x7 + rev x9,x9 + rev x11,x11 + rev x13,x13 + rev x15,x15 + rev x17,x17 + rev x20,x20 +#endif + eor x5,x5,x6 + eor x7,x7,x8 + eor x9,x9,x10 + eor x11,x11,x12 + eor x13,x13,x14 + eor x15,x15,x16 + eor x17,x17,x19 + eor x20,x20,x21 + + stp x5,x7,[x0,#0] // store output + add x28,x28,#1 // increment counter + mov w5,w22 // unpack key block + lsr x6,x22,#32 + stp x9,x11,[x0,#16] + mov w7,w23 + lsr x8,x23,#32 + stp x13,x15,[x0,#32] + mov w9,w24 + lsr x10,x24,#32 + stp x17,x20,[x0,#48] + add x0,x0,#64 + mov w11,w25 + lsr x12,x25,#32 + mov w13,w26 + lsr x14,x26,#32 + mov w15,w27 + lsr x16,x27,#32 + mov w17,w28 + lsr x19,x28,#32 + mov w20,w30 + lsr x21,x30,#32 + + mov x4,#5 +.Loop_lower_neon: + sub x4,x4,#1 + add v0.4s,v0.4s,v1.4s + add w5,w5,w9 + add v4.4s,v4.4s,v5.4s + add w6,w6,w10 + add v8.4s,v8.4s,v9.4s + add w7,w7,w11 + add v12.4s,v12.4s,v13.4s + add w8,w8,w12 + add v16.4s,v16.4s,v17.4s + eor w17,w17,w5 + add v20.4s,v20.4s,v21.4s + eor w19,w19,w6 + eor v3.16b,v3.16b,v0.16b + eor w20,w20,w7 + eor v7.16b,v7.16b,v4.16b + eor w21,w21,w8 + eor v11.16b,v11.16b,v8.16b + ror w17,w17,#16 + eor v15.16b,v15.16b,v12.16b + ror w19,w19,#16 + eor v19.16b,v19.16b,v16.16b + ror w20,w20,#16 + eor v23.16b,v23.16b,v20.16b + ror w21,w21,#16 + rev32 v3.8h,v3.8h + add w13,w13,w17 + rev32 v7.8h,v7.8h + add w14,w14,w19 + rev32 v11.8h,v11.8h + add w15,w15,w20 + rev32 v15.8h,v15.8h + add 
w16,w16,w21 + rev32 v19.8h,v19.8h + eor w9,w9,w13 + rev32 v23.8h,v23.8h + eor w10,w10,w14 + add v2.4s,v2.4s,v3.4s + eor w11,w11,w15 + add v6.4s,v6.4s,v7.4s + eor w12,w12,w16 + add v10.4s,v10.4s,v11.4s + ror w9,w9,#20 + add v14.4s,v14.4s,v15.4s + ror w10,w10,#20 + add v18.4s,v18.4s,v19.4s + ror w11,w11,#20 + add v22.4s,v22.4s,v23.4s + ror w12,w12,#20 + eor v24.16b,v1.16b,v2.16b + add w5,w5,w9 + eor v25.16b,v5.16b,v6.16b + add w6,w6,w10 + eor v26.16b,v9.16b,v10.16b + add w7,w7,w11 + eor v27.16b,v13.16b,v14.16b + add w8,w8,w12 + eor v28.16b,v17.16b,v18.16b + eor w17,w17,w5 + eor v29.16b,v21.16b,v22.16b + eor w19,w19,w6 + ushr v1.4s,v24.4s,#20 + eor w20,w20,w7 + ushr v5.4s,v25.4s,#20 + eor w21,w21,w8 + ushr v9.4s,v26.4s,#20 + ror w17,w17,#24 + ushr v13.4s,v27.4s,#20 + ror w19,w19,#24 + ushr v17.4s,v28.4s,#20 + ror w20,w20,#24 + ushr v21.4s,v29.4s,#20 + ror w21,w21,#24 + sli v1.4s,v24.4s,#12 + add w13,w13,w17 + sli v5.4s,v25.4s,#12 + add w14,w14,w19 + sli v9.4s,v26.4s,#12 + add w15,w15,w20 + sli v13.4s,v27.4s,#12 + add w16,w16,w21 + sli v17.4s,v28.4s,#12 + eor w9,w9,w13 + sli v21.4s,v29.4s,#12 + eor w10,w10,w14 + add v0.4s,v0.4s,v1.4s + eor w11,w11,w15 + add v4.4s,v4.4s,v5.4s + eor w12,w12,w16 + add v8.4s,v8.4s,v9.4s + ror w9,w9,#25 + add v12.4s,v12.4s,v13.4s + ror w10,w10,#25 + add v16.4s,v16.4s,v17.4s + ror w11,w11,#25 + add v20.4s,v20.4s,v21.4s + ror w12,w12,#25 + eor v24.16b,v3.16b,v0.16b + add w5,w5,w10 + eor v25.16b,v7.16b,v4.16b + add w6,w6,w11 + eor v26.16b,v11.16b,v8.16b + add w7,w7,w12 + eor v27.16b,v15.16b,v12.16b + add w8,w8,w9 + eor v28.16b,v19.16b,v16.16b + eor w21,w21,w5 + eor v29.16b,v23.16b,v20.16b + eor w17,w17,w6 + ushr v3.4s,v24.4s,#24 + eor w19,w19,w7 + ushr v7.4s,v25.4s,#24 + eor w20,w20,w8 + ushr v11.4s,v26.4s,#24 + ror w21,w21,#16 + ushr v15.4s,v27.4s,#24 + ror w17,w17,#16 + ushr v19.4s,v28.4s,#24 + ror w19,w19,#16 + ushr v23.4s,v29.4s,#24 + ror w20,w20,#16 + sli v3.4s,v24.4s,#8 + add w15,w15,w21 + sli v7.4s,v25.4s,#8 + add w16,w16,w17 + sli v11.4s,v26.4s,#8 + add w13,w13,w19 + sli v15.4s,v27.4s,#8 + add w14,w14,w20 + sli v19.4s,v28.4s,#8 + eor w10,w10,w15 + sli v23.4s,v29.4s,#8 + eor w11,w11,w16 + add v2.4s,v2.4s,v3.4s + eor w12,w12,w13 + add v6.4s,v6.4s,v7.4s + eor w9,w9,w14 + add v10.4s,v10.4s,v11.4s + ror w10,w10,#20 + add v14.4s,v14.4s,v15.4s + ror w11,w11,#20 + add v18.4s,v18.4s,v19.4s + ror w12,w12,#20 + add v22.4s,v22.4s,v23.4s + ror w9,w9,#20 + eor v24.16b,v1.16b,v2.16b + add w5,w5,w10 + eor v25.16b,v5.16b,v6.16b + add w6,w6,w11 + eor v26.16b,v9.16b,v10.16b + add w7,w7,w12 + eor v27.16b,v13.16b,v14.16b + add w8,w8,w9 + eor v28.16b,v17.16b,v18.16b + eor w21,w21,w5 + eor v29.16b,v21.16b,v22.16b + eor w17,w17,w6 + ushr v1.4s,v24.4s,#25 + eor w19,w19,w7 + ushr v5.4s,v25.4s,#25 + eor w20,w20,w8 + ushr v9.4s,v26.4s,#25 + ror w21,w21,#24 + ushr v13.4s,v27.4s,#25 + ror w17,w17,#24 + ushr v17.4s,v28.4s,#25 + ror w19,w19,#24 + ushr v21.4s,v29.4s,#25 + ror w20,w20,#24 + sli v1.4s,v24.4s,#7 + add w15,w15,w21 + sli v5.4s,v25.4s,#7 + add w16,w16,w17 + sli v9.4s,v26.4s,#7 + add w13,w13,w19 + sli v13.4s,v27.4s,#7 + add w14,w14,w20 + sli v17.4s,v28.4s,#7 + eor w10,w10,w15 + sli v21.4s,v29.4s,#7 + eor w11,w11,w16 + ext v2.16b,v2.16b,v2.16b,#8 + eor w12,w12,w13 + ext v6.16b,v6.16b,v6.16b,#8 + eor w9,w9,w14 + ext v10.16b,v10.16b,v10.16b,#8 + ror w10,w10,#25 + ext v14.16b,v14.16b,v14.16b,#8 + ror w11,w11,#25 + ext v18.16b,v18.16b,v18.16b,#8 + ror w12,w12,#25 + ext v22.16b,v22.16b,v22.16b,#8 + ror w9,w9,#25 + ext v3.16b,v3.16b,v3.16b,#12 + ext v7.16b,v7.16b,v7.16b,#12 + ext 
v11.16b,v11.16b,v11.16b,#12 + ext v15.16b,v15.16b,v15.16b,#12 + ext v19.16b,v19.16b,v19.16b,#12 + ext v23.16b,v23.16b,v23.16b,#12 + ext v1.16b,v1.16b,v1.16b,#4 + ext v5.16b,v5.16b,v5.16b,#4 + ext v9.16b,v9.16b,v9.16b,#4 + ext v13.16b,v13.16b,v13.16b,#4 + ext v17.16b,v17.16b,v17.16b,#4 + ext v21.16b,v21.16b,v21.16b,#4 + add v0.4s,v0.4s,v1.4s + add w5,w5,w9 + add v4.4s,v4.4s,v5.4s + add w6,w6,w10 + add v8.4s,v8.4s,v9.4s + add w7,w7,w11 + add v12.4s,v12.4s,v13.4s + add w8,w8,w12 + add v16.4s,v16.4s,v17.4s + eor w17,w17,w5 + add v20.4s,v20.4s,v21.4s + eor w19,w19,w6 + eor v3.16b,v3.16b,v0.16b + eor w20,w20,w7 + eor v7.16b,v7.16b,v4.16b + eor w21,w21,w8 + eor v11.16b,v11.16b,v8.16b + ror w17,w17,#16 + eor v15.16b,v15.16b,v12.16b + ror w19,w19,#16 + eor v19.16b,v19.16b,v16.16b + ror w20,w20,#16 + eor v23.16b,v23.16b,v20.16b + ror w21,w21,#16 + rev32 v3.8h,v3.8h + add w13,w13,w17 + rev32 v7.8h,v7.8h + add w14,w14,w19 + rev32 v11.8h,v11.8h + add w15,w15,w20 + rev32 v15.8h,v15.8h + add w16,w16,w21 + rev32 v19.8h,v19.8h + eor w9,w9,w13 + rev32 v23.8h,v23.8h + eor w10,w10,w14 + add v2.4s,v2.4s,v3.4s + eor w11,w11,w15 + add v6.4s,v6.4s,v7.4s + eor w12,w12,w16 + add v10.4s,v10.4s,v11.4s + ror w9,w9,#20 + add v14.4s,v14.4s,v15.4s + ror w10,w10,#20 + add v18.4s,v18.4s,v19.4s + ror w11,w11,#20 + add v22.4s,v22.4s,v23.4s + ror w12,w12,#20 + eor v24.16b,v1.16b,v2.16b + add w5,w5,w9 + eor v25.16b,v5.16b,v6.16b + add w6,w6,w10 + eor v26.16b,v9.16b,v10.16b + add w7,w7,w11 + eor v27.16b,v13.16b,v14.16b + add w8,w8,w12 + eor v28.16b,v17.16b,v18.16b + eor w17,w17,w5 + eor v29.16b,v21.16b,v22.16b + eor w19,w19,w6 + ushr v1.4s,v24.4s,#20 + eor w20,w20,w7 + ushr v5.4s,v25.4s,#20 + eor w21,w21,w8 + ushr v9.4s,v26.4s,#20 + ror w17,w17,#24 + ushr v13.4s,v27.4s,#20 + ror w19,w19,#24 + ushr v17.4s,v28.4s,#20 + ror w20,w20,#24 + ushr v21.4s,v29.4s,#20 + ror w21,w21,#24 + sli v1.4s,v24.4s,#12 + add w13,w13,w17 + sli v5.4s,v25.4s,#12 + add w14,w14,w19 + sli v9.4s,v26.4s,#12 + add w15,w15,w20 + sli v13.4s,v27.4s,#12 + add w16,w16,w21 + sli v17.4s,v28.4s,#12 + eor w9,w9,w13 + sli v21.4s,v29.4s,#12 + eor w10,w10,w14 + add v0.4s,v0.4s,v1.4s + eor w11,w11,w15 + add v4.4s,v4.4s,v5.4s + eor w12,w12,w16 + add v8.4s,v8.4s,v9.4s + ror w9,w9,#25 + add v12.4s,v12.4s,v13.4s + ror w10,w10,#25 + add v16.4s,v16.4s,v17.4s + ror w11,w11,#25 + add v20.4s,v20.4s,v21.4s + ror w12,w12,#25 + eor v24.16b,v3.16b,v0.16b + add w5,w5,w10 + eor v25.16b,v7.16b,v4.16b + add w6,w6,w11 + eor v26.16b,v11.16b,v8.16b + add w7,w7,w12 + eor v27.16b,v15.16b,v12.16b + add w8,w8,w9 + eor v28.16b,v19.16b,v16.16b + eor w21,w21,w5 + eor v29.16b,v23.16b,v20.16b + eor w17,w17,w6 + ushr v3.4s,v24.4s,#24 + eor w19,w19,w7 + ushr v7.4s,v25.4s,#24 + eor w20,w20,w8 + ushr v11.4s,v26.4s,#24 + ror w21,w21,#16 + ushr v15.4s,v27.4s,#24 + ror w17,w17,#16 + ushr v19.4s,v28.4s,#24 + ror w19,w19,#16 + ushr v23.4s,v29.4s,#24 + ror w20,w20,#16 + sli v3.4s,v24.4s,#8 + add w15,w15,w21 + sli v7.4s,v25.4s,#8 + add w16,w16,w17 + sli v11.4s,v26.4s,#8 + add w13,w13,w19 + sli v15.4s,v27.4s,#8 + add w14,w14,w20 + sli v19.4s,v28.4s,#8 + eor w10,w10,w15 + sli v23.4s,v29.4s,#8 + eor w11,w11,w16 + add v2.4s,v2.4s,v3.4s + eor w12,w12,w13 + add v6.4s,v6.4s,v7.4s + eor w9,w9,w14 + add v10.4s,v10.4s,v11.4s + ror w10,w10,#20 + add v14.4s,v14.4s,v15.4s + ror w11,w11,#20 + add v18.4s,v18.4s,v19.4s + ror w12,w12,#20 + add v22.4s,v22.4s,v23.4s + ror w9,w9,#20 + eor v24.16b,v1.16b,v2.16b + add w5,w5,w10 + eor v25.16b,v5.16b,v6.16b + add w6,w6,w11 + eor v26.16b,v9.16b,v10.16b + add w7,w7,w12 + eor 
v27.16b,v13.16b,v14.16b + add w8,w8,w9 + eor v28.16b,v17.16b,v18.16b + eor w21,w21,w5 + eor v29.16b,v21.16b,v22.16b + eor w17,w17,w6 + ushr v1.4s,v24.4s,#25 + eor w19,w19,w7 + ushr v5.4s,v25.4s,#25 + eor w20,w20,w8 + ushr v9.4s,v26.4s,#25 + ror w21,w21,#24 + ushr v13.4s,v27.4s,#25 + ror w17,w17,#24 + ushr v17.4s,v28.4s,#25 + ror w19,w19,#24 + ushr v21.4s,v29.4s,#25 + ror w20,w20,#24 + sli v1.4s,v24.4s,#7 + add w15,w15,w21 + sli v5.4s,v25.4s,#7 + add w16,w16,w17 + sli v9.4s,v26.4s,#7 + add w13,w13,w19 + sli v13.4s,v27.4s,#7 + add w14,w14,w20 + sli v17.4s,v28.4s,#7 + eor w10,w10,w15 + sli v21.4s,v29.4s,#7 + eor w11,w11,w16 + ext v2.16b,v2.16b,v2.16b,#8 + eor w12,w12,w13 + ext v6.16b,v6.16b,v6.16b,#8 + eor w9,w9,w14 + ext v10.16b,v10.16b,v10.16b,#8 + ror w10,w10,#25 + ext v14.16b,v14.16b,v14.16b,#8 + ror w11,w11,#25 + ext v18.16b,v18.16b,v18.16b,#8 + ror w12,w12,#25 + ext v22.16b,v22.16b,v22.16b,#8 + ror w9,w9,#25 + ext v3.16b,v3.16b,v3.16b,#4 + ext v7.16b,v7.16b,v7.16b,#4 + ext v11.16b,v11.16b,v11.16b,#4 + ext v15.16b,v15.16b,v15.16b,#4 + ext v19.16b,v19.16b,v19.16b,#4 + ext v23.16b,v23.16b,v23.16b,#4 + ext v1.16b,v1.16b,v1.16b,#12 + ext v5.16b,v5.16b,v5.16b,#12 + ext v9.16b,v9.16b,v9.16b,#12 + ext v13.16b,v13.16b,v13.16b,#12 + ext v17.16b,v17.16b,v17.16b,#12 + ext v21.16b,v21.16b,v21.16b,#12 + cbnz x4,.Loop_lower_neon + + add w5,w5,w22 // accumulate key block + ldp q24,q25,[sp,#0] + add x6,x6,x22,lsr#32 + ldp q26,q27,[sp,#32] + add w7,w7,w23 + ldp q28,q29,[sp,#64] + add x8,x8,x23,lsr#32 + add v0.4s,v0.4s,v24.4s + add w9,w9,w24 + add v4.4s,v4.4s,v24.4s + add x10,x10,x24,lsr#32 + add v8.4s,v8.4s,v24.4s + add w11,w11,w25 + add v12.4s,v12.4s,v24.4s + add x12,x12,x25,lsr#32 + add v16.4s,v16.4s,v24.4s + add w13,w13,w26 + add v20.4s,v20.4s,v24.4s + add x14,x14,x26,lsr#32 + add v2.4s,v2.4s,v26.4s + add w15,w15,w27 + add v6.4s,v6.4s,v26.4s + add x16,x16,x27,lsr#32 + add v10.4s,v10.4s,v26.4s + add w17,w17,w28 + add v14.4s,v14.4s,v26.4s + add x19,x19,x28,lsr#32 + add v18.4s,v18.4s,v26.4s + add w20,w20,w30 + add v22.4s,v22.4s,v26.4s + add x21,x21,x30,lsr#32 + add v19.4s,v19.4s,v31.4s // +4 + add x5,x5,x6,lsl#32 // pack + add v23.4s,v23.4s,v31.4s // +4 + add x7,x7,x8,lsl#32 + add v3.4s,v3.4s,v27.4s + ldp x6,x8,[x1,#0] // load input + add v7.4s,v7.4s,v28.4s + add x9,x9,x10,lsl#32 + add v11.4s,v11.4s,v29.4s + add x11,x11,x12,lsl#32 + add v15.4s,v15.4s,v30.4s + ldp x10,x12,[x1,#16] + add v19.4s,v19.4s,v27.4s + add x13,x13,x14,lsl#32 + add v23.4s,v23.4s,v28.4s + add x15,x15,x16,lsl#32 + add v1.4s,v1.4s,v25.4s + ldp x14,x16,[x1,#32] + add v5.4s,v5.4s,v25.4s + add x17,x17,x19,lsl#32 + add v9.4s,v9.4s,v25.4s + add x20,x20,x21,lsl#32 + add v13.4s,v13.4s,v25.4s + ldp x19,x21,[x1,#48] + add v17.4s,v17.4s,v25.4s + add x1,x1,#64 + add v21.4s,v21.4s,v25.4s + +#ifdef __ARMEB__ + rev x5,x5 + rev x7,x7 + rev x9,x9 + rev x11,x11 + rev x13,x13 + rev x15,x15 + rev x17,x17 + rev x20,x20 +#endif + ld1 {v24.16b,v25.16b,v26.16b,v27.16b},[x1],#64 + eor x5,x5,x6 + eor x7,x7,x8 + eor x9,x9,x10 + eor x11,x11,x12 + eor x13,x13,x14 + eor v0.16b,v0.16b,v24.16b + eor x15,x15,x16 + eor v1.16b,v1.16b,v25.16b + eor x17,x17,x19 + eor v2.16b,v2.16b,v26.16b + eor x20,x20,x21 + eor v3.16b,v3.16b,v27.16b + ld1 {v24.16b,v25.16b,v26.16b,v27.16b},[x1],#64 + + stp x5,x7,[x0,#0] // store output + add x28,x28,#7 // increment counter + stp x9,x11,[x0,#16] + stp x13,x15,[x0,#32] + stp x17,x20,[x0,#48] + add x0,x0,#64 + st1 {v0.16b,v1.16b,v2.16b,v3.16b},[x0],#64 + + ld1 {v0.16b,v1.16b,v2.16b,v3.16b},[x1],#64 + eor v4.16b,v4.16b,v24.16b + eor 
v5.16b,v5.16b,v25.16b + eor v6.16b,v6.16b,v26.16b + eor v7.16b,v7.16b,v27.16b + st1 {v4.16b,v5.16b,v6.16b,v7.16b},[x0],#64 + + ld1 {v4.16b,v5.16b,v6.16b,v7.16b},[x1],#64 + eor v8.16b,v8.16b,v0.16b + ldp q24,q25,[sp,#0] + eor v9.16b,v9.16b,v1.16b + ldp q26,q27,[sp,#32] + eor v10.16b,v10.16b,v2.16b + eor v11.16b,v11.16b,v3.16b + st1 {v8.16b,v9.16b,v10.16b,v11.16b},[x0],#64 + + ld1 {v8.16b,v9.16b,v10.16b,v11.16b},[x1],#64 + eor v12.16b,v12.16b,v4.16b + eor v13.16b,v13.16b,v5.16b + eor v14.16b,v14.16b,v6.16b + eor v15.16b,v15.16b,v7.16b + st1 {v12.16b,v13.16b,v14.16b,v15.16b},[x0],#64 + + ld1 {v12.16b,v13.16b,v14.16b,v15.16b},[x1],#64 + eor v16.16b,v16.16b,v8.16b + eor v17.16b,v17.16b,v9.16b + eor v18.16b,v18.16b,v10.16b + eor v19.16b,v19.16b,v11.16b + st1 {v16.16b,v17.16b,v18.16b,v19.16b},[x0],#64 + + shl v0.4s,v31.4s,#1 // 4 -> 8 + eor v20.16b,v20.16b,v12.16b + eor v21.16b,v21.16b,v13.16b + eor v22.16b,v22.16b,v14.16b + eor v23.16b,v23.16b,v15.16b + st1 {v20.16b,v21.16b,v22.16b,v23.16b},[x0],#64 + + add v27.4s,v27.4s,v0.4s // += 8 + add v28.4s,v28.4s,v0.4s + add v29.4s,v29.4s,v0.4s + add v30.4s,v30.4s,v0.4s + + b.hs .Loop_outer_512_neon + + adds x2,x2,#512 + ushr v0.4s,v31.4s,#2 // 4 -> 1 + + ldp d8,d9,[sp,#128+0] // meet ABI requirements + ldp d10,d11,[sp,#128+16] + ldp d12,d13,[sp,#128+32] + ldp d14,d15,[sp,#128+48] + + stp q24,q31,[sp,#0] // wipe off-load area + stp q24,q31,[sp,#32] + stp q24,q31,[sp,#64] + + b.eq .Ldone_512_neon + + cmp x2,#192 + sub v27.4s,v27.4s,v0.4s // -= 1 + sub v28.4s,v28.4s,v0.4s + sub v29.4s,v29.4s,v0.4s + add sp,sp,#128 + b.hs .Loop_outer_neon + + eor v25.16b,v25.16b,v25.16b + eor v26.16b,v26.16b,v26.16b + eor v27.16b,v27.16b,v27.16b + eor v28.16b,v28.16b,v28.16b + eor v29.16b,v29.16b,v29.16b + eor v30.16b,v30.16b,v30.16b + b .Loop_outer + +.Ldone_512_neon: + ldp x19,x20,[x29,#16] + add sp,sp,#128+64 + ldp x21,x22,[x29,#32] + ldp x23,x24,[x29,#48] + ldp x25,x26,[x29,#64] + ldp x27,x28,[x29,#80] + ldp x29,x30,[sp],#96 + ret +.size ChaCha20_512_neon,.-ChaCha20_512_neon From patchwork Sat Oct 6 02:56:49 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Jason A. 
Donenfeld" X-Patchwork-Id: 148305 Delivered-To: patch@linaro.org Received: by 2002:a2e:8595:0:0:0:0:0 with SMTP id b21-v6csp1157469lji; Fri, 5 Oct 2018 19:58:05 -0700 (PDT) X-Google-Smtp-Source: ACcGV627JKc1o1qUFmiNfDnoM839q+DjnBfh/ywxsRlYokpt/Jwrqk2nuf5mZqiTeUoEeo70lKCS X-Received: by 2002:a17:902:7798:: with SMTP id o24-v6mr13908886pll.299.1538794685251; Fri, 05 Oct 2018 19:58:05 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1538794685; cv=none; d=google.com; s=arc-20160816; b=VmpQTHDBbuQrCDlyHw7QKDHuH3i21aycCjXGIC7ieHpMTo+cBtgPkVjvmFbyqEAjv3 2ebX5ZQCmeUWZSkpmFk18Lbc94VEJBbLFqOABmHqKTUxT7UZZ6Go0Tg0jQ4F/Y/eLR8Q maEGotSLLpwM29qAaTUMMDby0o0rmfEa3mWAb4+kFIFnkUBqI9BgM8pKU661MEUFn8dh bPvFwfMwjHKHkRQW5gm58tZRY8OcETqbbPip/AJpZOnqwCpsuIi1cpCDPSZUDHRQrOda s+UpfelJZA4lFVhdv0DidrWJhH2nMQfLKIE0w+Ja5bKmhgyybcuCfe27voSopQR6VGt/ q7Bw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from :dkim-signature; bh=ncqMEo22h4OL+DxS+QRdKMGmH2LgZ1CUJGCeuyZBB5o=; b=biWKI/3JwDBCZCZBjj5P1x9UiQv6tO9KbXLyvq8Sq75CtXqkPTthCNZnjV2AlktpCJ G9RQUOCLWbFFJkdiV3n/JCRYGqobMvK/Le0c8CC1aHnBUy4Cs9LlOPq5P/9KK111PxB/ DHw0D/3wKGDeGPDTIhmYp7/5YMdKqttGY7Z4+SeMC6WrTaTXrqXe1y8lMxgX2ciZjSMn 6pwEghENVrk+7JoZkCd9M18S7FlH1YDDGLsRekARjfT8xOGNGiTWg+ry5LQwUSEluG4y e2oy+3d4+IRfH1njuHLVL8F6yBTgGxITpjqgflEFLvva4E/6ofA/eoVdSGF5IZzooP3V epHg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@zx2c4.com header.s=mail header.b=r7zTwEGB; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=zx2c4.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. 
From: "Jason A. Donenfeld"
To: linux-kernel@vger.kernel.org, netdev@vger.kernel.org, davem@davemloft.net, gregkh@linuxfoundation.org
Cc: "Jason A. Donenfeld", Russell King, linux-arm-kernel@lists.infradead.org, Samuel Neves, Jean-Philippe Aumasson, Andy Lutomirski, Andrew Morton, Linus Torvalds, kernel-hardening@lists.openwall.com, linux-crypto@vger.kernel.org
Subject: [PATCH net-next v7 08/28] zinc: port Andy Polyakov's ChaCha20 ARM and ARM64 implementations
Date: Sat, 6 Oct 2018 04:56:49 +0200
Message-Id: <20181006025709.4019-9-Jason@zx2c4.com>
In-Reply-To: <20181006025709.4019-1-Jason@zx2c4.com>
References: <20181006025709.4019-1-Jason@zx2c4.com>

These port and prepare Andy Polyakov's implementations for the kernel,
but don't actually wire up any of the code yet. The wiring will be done
in a subsequent commit, since we'll need to merge these implementations
with another one.

We make a few small changes to the assembly:

 - Entries and exits use the proper kernel convention macro.
 - CPU feature checking is done in C by the glue code, so that has been
   removed from the assembly.
 - The function names have been renamed to fit kernel conventions.
 - Labels have been renamed (prefixed with .L) to fit kernel
   conventions.
 - Constants have been rearranged so that they are closer to the code
   that is using them. [ARM only]
 - The neon code can jump to the scalar code when it makes sense to do
   so.
 - The separate neon_512 function has been removed, leaving the
   decision up to the main neon entry point. [ARM64 only]

Signed-off-by: Jason A.
Donenfeld Cc: Russell King Cc: linux-arm-kernel@lists.infradead.org Cc: Samuel Neves Cc: Jean-Philippe Aumasson Cc: Andy Lutomirski Cc: Greg KH Cc: Andrew Morton Cc: Linus Torvalds Cc: kernel-hardening@lists.openwall.com Cc: linux-crypto@vger.kernel.org --- lib/zinc/chacha20/chacha20-arm-cryptogams.S | 367 +++++++++--------- lib/zinc/chacha20/chacha20-arm64-cryptogams.S | 75 ++-- 2 files changed, 202 insertions(+), 240 deletions(-) -- 2.19.0 diff --git a/lib/zinc/chacha20/chacha20-arm-cryptogams.S b/lib/zinc/chacha20/chacha20-arm-cryptogams.S index 05a3a9e6e93f..770bab469171 100644 --- a/lib/zinc/chacha20/chacha20-arm-cryptogams.S +++ b/lib/zinc/chacha20/chacha20-arm-cryptogams.S @@ -1,9 +1,12 @@ /* SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause */ /* + * Copyright (C) 2015-2018 Jason A. Donenfeld . All Rights Reserved. * Copyright (C) 2006-2017 CRYPTOGAMS by . All Rights Reserved. + * + * This is based in part on Andy Polyakov's implementation from CRYPTOGAMS. */ -#include "arm_arch.h" +#include .text #if defined(__thumb2__) || defined(__clang__) @@ -24,48 +27,25 @@ .long 0x61707865,0x3320646e,0x79622d32,0x6b206574 @ endian-neutral .Lone: .long 1,0,0,0 -.Lrot8: -.long 0x02010003,0x06050407 -#if __ARM_MAX_ARCH__>=7 -.LOPENSSL_armcap: -.word OPENSSL_armcap_P-.LChaCha20_ctr32 -#else .word -1 -#endif -.globl ChaCha20_ctr32 -.type ChaCha20_ctr32,%function .align 5 -ChaCha20_ctr32: -.LChaCha20_ctr32: +ENTRY(chacha20_arm) ldr r12,[sp,#0] @ pull pointer to counter and nonce stmdb sp!,{r0-r2,r4-r11,lr} -#if __ARM_ARCH__<7 && !defined(__thumb2__) - sub r14,pc,#16 @ ChaCha20_ctr32 -#else - adr r14,.LChaCha20_ctr32 -#endif cmp r2,#0 @ len==0? -#ifdef __thumb2__ +#ifdef __thumb2__ itt eq #endif addeq sp,sp,#4*3 - beq .Lno_data -#if __ARM_MAX_ARCH__>=7 - cmp r2,#192 @ test len - bls .Lshort - ldr r4,[r14,#-24] - ldr r4,[r14,r4] -# ifdef __APPLE__ - ldr r4,[r4] -# endif - tst r4,#ARMV7_NEON - bne .LChaCha20_neon -.Lshort: -#endif + beq .Lno_data_arm ldmia r12,{r4-r7} @ load counter and nonce sub sp,sp,#4*(16) @ off-load area - sub r14,r14,#64 @ .Lsigma +#if __LINUX_ARM_ARCH__ < 7 && !defined(__thumb2__) + sub r14,pc,#100 @ .Lsigma +#else + adr r14,.Lsigma @ .Lsigma +#endif stmdb sp!,{r4-r7} @ copy counter and nonce ldmia r3,{r4-r11} @ load key ldmia r14,{r0-r3} @ load sigma @@ -191,7 +171,7 @@ ChaCha20_ctr32: @ rx and second half at sp+4*(16+8) cmp r11,#64 @ done yet? -#ifdef __thumb2__ +#ifdef __thumb2__ itete lo #endif addlo r12,sp,#4*(0) @ shortcut or ... @@ -202,49 +182,49 @@ ChaCha20_ctr32: ldr r8,[sp,#4*(0)] @ load key material ldr r9,[sp,#4*(1)] -#if __ARM_ARCH__>=6 || !defined(__ARMEB__) -# if __ARM_ARCH__<7 +#if __LINUX_ARM_ARCH__ >= 6 || !defined(__ARMEB__) +#if __LINUX_ARM_ARCH__ < 7 orr r10,r12,r14 tst r10,#3 @ are input and output aligned? 
ldr r10,[sp,#4*(2)] bne .Lunaligned cmp r11,#64 @ restore flags -# else +#else ldr r10,[sp,#4*(2)] -# endif +#endif ldr r11,[sp,#4*(3)] add r0,r0,r8 @ accumulate key material add r1,r1,r9 -# ifdef __thumb2__ +#ifdef __thumb2__ itt hs -# endif +#endif ldrhs r8,[r12],#16 @ load input ldrhs r9,[r12,#-12] add r2,r2,r10 add r3,r3,r11 -# ifdef __thumb2__ +#ifdef __thumb2__ itt hs -# endif +#endif ldrhs r10,[r12,#-8] ldrhs r11,[r12,#-4] -# if __ARM_ARCH__>=6 && defined(__ARMEB__) +#if __LINUX_ARM_ARCH__ >= 6 && defined(__ARMEB__) rev r0,r0 rev r1,r1 rev r2,r2 rev r3,r3 -# endif -# ifdef __thumb2__ +#endif +#ifdef __thumb2__ itt hs -# endif +#endif eorhs r0,r0,r8 @ xor with input eorhs r1,r1,r9 add r8,sp,#4*(4) str r0,[r14],#16 @ store output -# ifdef __thumb2__ +#ifdef __thumb2__ itt hs -# endif +#endif eorhs r2,r2,r10 eorhs r3,r3,r11 ldmia r8,{r8-r11} @ load key material @@ -254,34 +234,34 @@ ChaCha20_ctr32: add r4,r8,r4,ror#13 @ accumulate key material add r5,r9,r5,ror#13 -# ifdef __thumb2__ +#ifdef __thumb2__ itt hs -# endif +#endif ldrhs r8,[r12],#16 @ load input ldrhs r9,[r12,#-12] add r6,r10,r6,ror#13 add r7,r11,r7,ror#13 -# ifdef __thumb2__ +#ifdef __thumb2__ itt hs -# endif +#endif ldrhs r10,[r12,#-8] ldrhs r11,[r12,#-4] -# if __ARM_ARCH__>=6 && defined(__ARMEB__) +#if __LINUX_ARM_ARCH__ >= 6 && defined(__ARMEB__) rev r4,r4 rev r5,r5 rev r6,r6 rev r7,r7 -# endif -# ifdef __thumb2__ +#endif +#ifdef __thumb2__ itt hs -# endif +#endif eorhs r4,r4,r8 eorhs r5,r5,r9 add r8,sp,#4*(8) str r4,[r14],#16 @ store output -# ifdef __thumb2__ +#ifdef __thumb2__ itt hs -# endif +#endif eorhs r6,r6,r10 eorhs r7,r7,r11 str r5,[r14,#-12] @@ -294,39 +274,39 @@ ChaCha20_ctr32: add r0,r0,r8 @ accumulate key material add r1,r1,r9 -# ifdef __thumb2__ +#ifdef __thumb2__ itt hs -# endif +#endif ldrhs r8,[r12],#16 @ load input ldrhs r9,[r12,#-12] -# ifdef __thumb2__ +#ifdef __thumb2__ itt hi -# endif +#endif strhi r10,[sp,#4*(16+10)] @ copy "rx" while at it strhi r11,[sp,#4*(16+11)] @ copy "rx" while at it add r2,r2,r10 add r3,r3,r11 -# ifdef __thumb2__ +#ifdef __thumb2__ itt hs -# endif +#endif ldrhs r10,[r12,#-8] ldrhs r11,[r12,#-4] -# if __ARM_ARCH__>=6 && defined(__ARMEB__) +#if __LINUX_ARM_ARCH__ >= 6 && defined(__ARMEB__) rev r0,r0 rev r1,r1 rev r2,r2 rev r3,r3 -# endif -# ifdef __thumb2__ +#endif +#ifdef __thumb2__ itt hs -# endif +#endif eorhs r0,r0,r8 eorhs r1,r1,r9 add r8,sp,#4*(12) str r0,[r14],#16 @ store output -# ifdef __thumb2__ +#ifdef __thumb2__ itt hs -# endif +#endif eorhs r2,r2,r10 eorhs r3,r3,r11 str r1,[r14,#-12] @@ -336,79 +316,79 @@ ChaCha20_ctr32: add r4,r8,r4,ror#24 @ accumulate key material add r5,r9,r5,ror#24 -# ifdef __thumb2__ +#ifdef __thumb2__ itt hi -# endif +#endif addhi r8,r8,#1 @ next counter value strhi r8,[sp,#4*(12)] @ save next counter value -# ifdef __thumb2__ +#ifdef __thumb2__ itt hs -# endif +#endif ldrhs r8,[r12],#16 @ load input ldrhs r9,[r12,#-12] add r6,r10,r6,ror#24 add r7,r11,r7,ror#24 -# ifdef __thumb2__ +#ifdef __thumb2__ itt hs -# endif +#endif ldrhs r10,[r12,#-8] ldrhs r11,[r12,#-4] -# if __ARM_ARCH__>=6 && defined(__ARMEB__) +#if __LINUX_ARM_ARCH__ >= 6 && defined(__ARMEB__) rev r4,r4 rev r5,r5 rev r6,r6 rev r7,r7 -# endif -# ifdef __thumb2__ +#endif +#ifdef __thumb2__ itt hs -# endif +#endif eorhs r4,r4,r8 eorhs r5,r5,r9 -# ifdef __thumb2__ +#ifdef __thumb2__ it ne -# endif +#endif ldrne r8,[sp,#4*(32+2)] @ re-load len -# ifdef __thumb2__ +#ifdef __thumb2__ itt hs -# endif +#endif eorhs r6,r6,r10 eorhs r7,r7,r11 str r4,[r14],#16 @ store output str 
r5,[r14,#-12] -# ifdef __thumb2__ +#ifdef __thumb2__ it hs -# endif +#endif subhs r11,r8,#64 @ len-=64 str r6,[r14,#-8] str r7,[r14,#-4] bhi .Loop_outer beq .Ldone -# if __ARM_ARCH__<7 +#if __LINUX_ARM_ARCH__ < 7 b .Ltail .align 4 .Lunaligned: @ unaligned endian-neutral path cmp r11,#64 @ restore flags -# endif #endif -#if __ARM_ARCH__<7 +#endif +#if __LINUX_ARM_ARCH__ < 7 ldr r11,[sp,#4*(3)] add r0,r8,r0 @ accumulate key material add r1,r9,r1 add r2,r10,r2 -# ifdef __thumb2__ +#ifdef __thumb2__ itete lo -# endif +#endif eorlo r8,r8,r8 @ zero or ... ldrhsb r8,[r12],#16 @ ... load input eorlo r9,r9,r9 ldrhsb r9,[r12,#-12] add r3,r11,r3 -# ifdef __thumb2__ +#ifdef __thumb2__ itete lo -# endif +#endif eorlo r10,r10,r10 ldrhsb r10,[r12,#-8] eorlo r11,r11,r11 @@ -416,53 +396,53 @@ ChaCha20_ctr32: eor r0,r8,r0 @ xor with input (or zero) eor r1,r9,r1 -# ifdef __thumb2__ +#ifdef __thumb2__ itt hs -# endif +#endif ldrhsb r8,[r12,#-15] @ load more input ldrhsb r9,[r12,#-11] eor r2,r10,r2 strb r0,[r14],#16 @ store output eor r3,r11,r3 -# ifdef __thumb2__ +#ifdef __thumb2__ itt hs -# endif +#endif ldrhsb r10,[r12,#-7] ldrhsb r11,[r12,#-3] strb r1,[r14,#-12] eor r0,r8,r0,lsr#8 strb r2,[r14,#-8] eor r1,r9,r1,lsr#8 -# ifdef __thumb2__ +#ifdef __thumb2__ itt hs -# endif +#endif ldrhsb r8,[r12,#-14] @ load more input ldrhsb r9,[r12,#-10] strb r3,[r14,#-4] eor r2,r10,r2,lsr#8 strb r0,[r14,#-15] eor r3,r11,r3,lsr#8 -# ifdef __thumb2__ +#ifdef __thumb2__ itt hs -# endif +#endif ldrhsb r10,[r12,#-6] ldrhsb r11,[r12,#-2] strb r1,[r14,#-11] eor r0,r8,r0,lsr#8 strb r2,[r14,#-7] eor r1,r9,r1,lsr#8 -# ifdef __thumb2__ +#ifdef __thumb2__ itt hs -# endif +#endif ldrhsb r8,[r12,#-13] @ load more input ldrhsb r9,[r12,#-9] strb r3,[r14,#-3] eor r2,r10,r2,lsr#8 strb r0,[r14,#-14] eor r3,r11,r3,lsr#8 -# ifdef __thumb2__ +#ifdef __thumb2__ itt hs -# endif +#endif ldrhsb r10,[r12,#-5] ldrhsb r11,[r12,#-1] strb r1,[r14,#-10] @@ -482,18 +462,18 @@ ChaCha20_ctr32: add r4,r8,r4,ror#13 @ accumulate key material add r5,r9,r5,ror#13 add r6,r10,r6,ror#13 -# ifdef __thumb2__ +#ifdef __thumb2__ itete lo -# endif +#endif eorlo r8,r8,r8 @ zero or ... ldrhsb r8,[r12],#16 @ ... 
load input eorlo r9,r9,r9 ldrhsb r9,[r12,#-12] add r7,r11,r7,ror#13 -# ifdef __thumb2__ +#ifdef __thumb2__ itete lo -# endif +#endif eorlo r10,r10,r10 ldrhsb r10,[r12,#-8] eorlo r11,r11,r11 @@ -501,53 +481,53 @@ ChaCha20_ctr32: eor r4,r8,r4 @ xor with input (or zero) eor r5,r9,r5 -# ifdef __thumb2__ +#ifdef __thumb2__ itt hs -# endif +#endif ldrhsb r8,[r12,#-15] @ load more input ldrhsb r9,[r12,#-11] eor r6,r10,r6 strb r4,[r14],#16 @ store output eor r7,r11,r7 -# ifdef __thumb2__ +#ifdef __thumb2__ itt hs -# endif +#endif ldrhsb r10,[r12,#-7] ldrhsb r11,[r12,#-3] strb r5,[r14,#-12] eor r4,r8,r4,lsr#8 strb r6,[r14,#-8] eor r5,r9,r5,lsr#8 -# ifdef __thumb2__ +#ifdef __thumb2__ itt hs -# endif +#endif ldrhsb r8,[r12,#-14] @ load more input ldrhsb r9,[r12,#-10] strb r7,[r14,#-4] eor r6,r10,r6,lsr#8 strb r4,[r14,#-15] eor r7,r11,r7,lsr#8 -# ifdef __thumb2__ +#ifdef __thumb2__ itt hs -# endif +#endif ldrhsb r10,[r12,#-6] ldrhsb r11,[r12,#-2] strb r5,[r14,#-11] eor r4,r8,r4,lsr#8 strb r6,[r14,#-7] eor r5,r9,r5,lsr#8 -# ifdef __thumb2__ +#ifdef __thumb2__ itt hs -# endif +#endif ldrhsb r8,[r12,#-13] @ load more input ldrhsb r9,[r12,#-9] strb r7,[r14,#-3] eor r6,r10,r6,lsr#8 strb r4,[r14,#-14] eor r7,r11,r7,lsr#8 -# ifdef __thumb2__ +#ifdef __thumb2__ itt hs -# endif +#endif ldrhsb r10,[r12,#-5] ldrhsb r11,[r12,#-1] strb r5,[r14,#-10] @@ -564,26 +544,26 @@ ChaCha20_ctr32: add r8,sp,#4*(4+4) ldmia r8,{r8-r11} @ load key material ldmia r0,{r0-r7} @ load second half -# ifdef __thumb2__ +#ifdef __thumb2__ itt hi -# endif +#endif strhi r10,[sp,#4*(16+10)] @ copy "rx" strhi r11,[sp,#4*(16+11)] @ copy "rx" add r0,r8,r0 @ accumulate key material add r1,r9,r1 add r2,r10,r2 -# ifdef __thumb2__ +#ifdef __thumb2__ itete lo -# endif +#endif eorlo r8,r8,r8 @ zero or ... ldrhsb r8,[r12],#16 @ ... 
load input eorlo r9,r9,r9 ldrhsb r9,[r12,#-12] add r3,r11,r3 -# ifdef __thumb2__ +#ifdef __thumb2__ itete lo -# endif +#endif eorlo r10,r10,r10 ldrhsb r10,[r12,#-8] eorlo r11,r11,r11 @@ -591,53 +571,53 @@ ChaCha20_ctr32: eor r0,r8,r0 @ xor with input (or zero) eor r1,r9,r1 -# ifdef __thumb2__ +#ifdef __thumb2__ itt hs -# endif +#endif ldrhsb r8,[r12,#-15] @ load more input ldrhsb r9,[r12,#-11] eor r2,r10,r2 strb r0,[r14],#16 @ store output eor r3,r11,r3 -# ifdef __thumb2__ +#ifdef __thumb2__ itt hs -# endif +#endif ldrhsb r10,[r12,#-7] ldrhsb r11,[r12,#-3] strb r1,[r14,#-12] eor r0,r8,r0,lsr#8 strb r2,[r14,#-8] eor r1,r9,r1,lsr#8 -# ifdef __thumb2__ +#ifdef __thumb2__ itt hs -# endif +#endif ldrhsb r8,[r12,#-14] @ load more input ldrhsb r9,[r12,#-10] strb r3,[r14,#-4] eor r2,r10,r2,lsr#8 strb r0,[r14,#-15] eor r3,r11,r3,lsr#8 -# ifdef __thumb2__ +#ifdef __thumb2__ itt hs -# endif +#endif ldrhsb r10,[r12,#-6] ldrhsb r11,[r12,#-2] strb r1,[r14,#-11] eor r0,r8,r0,lsr#8 strb r2,[r14,#-7] eor r1,r9,r1,lsr#8 -# ifdef __thumb2__ +#ifdef __thumb2__ itt hs -# endif +#endif ldrhsb r8,[r12,#-13] @ load more input ldrhsb r9,[r12,#-9] strb r3,[r14,#-3] eor r2,r10,r2,lsr#8 strb r0,[r14,#-14] eor r3,r11,r3,lsr#8 -# ifdef __thumb2__ +#ifdef __thumb2__ itt hs -# endif +#endif ldrhsb r10,[r12,#-5] ldrhsb r11,[r12,#-1] strb r1,[r14,#-10] @@ -654,25 +634,25 @@ ChaCha20_ctr32: add r8,sp,#4*(4+8) ldmia r8,{r8-r11} @ load key material add r4,r8,r4,ror#24 @ accumulate key material -# ifdef __thumb2__ +#ifdef __thumb2__ itt hi -# endif +#endif addhi r8,r8,#1 @ next counter value strhi r8,[sp,#4*(12)] @ save next counter value add r5,r9,r5,ror#24 add r6,r10,r6,ror#24 -# ifdef __thumb2__ +#ifdef __thumb2__ itete lo -# endif +#endif eorlo r8,r8,r8 @ zero or ... ldrhsb r8,[r12],#16 @ ... 
load input eorlo r9,r9,r9 ldrhsb r9,[r12,#-12] add r7,r11,r7,ror#24 -# ifdef __thumb2__ +#ifdef __thumb2__ itete lo -# endif +#endif eorlo r10,r10,r10 ldrhsb r10,[r12,#-8] eorlo r11,r11,r11 @@ -680,53 +660,53 @@ ChaCha20_ctr32: eor r4,r8,r4 @ xor with input (or zero) eor r5,r9,r5 -# ifdef __thumb2__ +#ifdef __thumb2__ itt hs -# endif +#endif ldrhsb r8,[r12,#-15] @ load more input ldrhsb r9,[r12,#-11] eor r6,r10,r6 strb r4,[r14],#16 @ store output eor r7,r11,r7 -# ifdef __thumb2__ +#ifdef __thumb2__ itt hs -# endif +#endif ldrhsb r10,[r12,#-7] ldrhsb r11,[r12,#-3] strb r5,[r14,#-12] eor r4,r8,r4,lsr#8 strb r6,[r14,#-8] eor r5,r9,r5,lsr#8 -# ifdef __thumb2__ +#ifdef __thumb2__ itt hs -# endif +#endif ldrhsb r8,[r12,#-14] @ load more input ldrhsb r9,[r12,#-10] strb r7,[r14,#-4] eor r6,r10,r6,lsr#8 strb r4,[r14,#-15] eor r7,r11,r7,lsr#8 -# ifdef __thumb2__ +#ifdef __thumb2__ itt hs -# endif +#endif ldrhsb r10,[r12,#-6] ldrhsb r11,[r12,#-2] strb r5,[r14,#-11] eor r4,r8,r4,lsr#8 strb r6,[r14,#-7] eor r5,r9,r5,lsr#8 -# ifdef __thumb2__ +#ifdef __thumb2__ itt hs -# endif +#endif ldrhsb r8,[r12,#-13] @ load more input ldrhsb r9,[r12,#-9] strb r7,[r14,#-3] eor r6,r10,r6,lsr#8 strb r4,[r14,#-14] eor r7,r11,r7,lsr#8 -# ifdef __thumb2__ +#ifdef __thumb2__ itt hs -# endif +#endif ldrhsb r10,[r12,#-5] ldrhsb r11,[r12,#-1] strb r5,[r14,#-10] @@ -740,13 +720,13 @@ ChaCha20_ctr32: eor r7,r11,r7,lsr#8 strb r6,[r14,#-5] strb r7,[r14,#-1] -# ifdef __thumb2__ +#ifdef __thumb2__ it ne -# endif +#endif ldrne r8,[sp,#4*(32+2)] @ re-load len -# ifdef __thumb2__ +#ifdef __thumb2__ it hs -# endif +#endif subhs r11,r8,#64 @ len-=64 bhi .Loop_outer @@ -768,20 +748,33 @@ ChaCha20_ctr32: .Ldone: add sp,sp,#4*(32+3) -.Lno_data: +.Lno_data_arm: ldmia sp!,{r4-r11,pc} -.size ChaCha20_ctr32,.-ChaCha20_ctr32 -#if __ARM_MAX_ARCH__>=7 +ENDPROC(chacha20_arm) + +#ifdef CONFIG_KERNEL_MODE_NEON +.align 5 +.Lsigma2: +.long 0x61707865,0x3320646e,0x79622d32,0x6b206574 @ endian-neutral +.Lone2: +.long 1,0,0,0 +.word -1 + .arch armv7-a .fpu neon -.type ChaCha20_neon,%function .align 5 -ChaCha20_neon: +ENTRY(chacha20_neon) ldr r12,[sp,#0] @ pull pointer to counter and nonce stmdb sp!,{r0-r2,r4-r11,lr} -.LChaCha20_neon: - adr r14,.Lsigma + cmp r2,#0 @ len==0? 
+#ifdef __thumb2__ + itt eq +#endif + addeq sp,sp,#4*3 + beq .Lno_data_neon +.Lchacha20_neon_begin: + adr r14,.Lsigma2 vstmdb sp!,{d8-d15} @ ABI spec says so stmdb sp!,{r0-r3} @@ -1121,12 +1114,12 @@ ChaCha20_neon: ldr r10,[r12,#-8] add r3,r3,r11 ldr r11,[r12,#-4] -# ifdef __ARMEB__ +#ifdef __ARMEB__ rev r0,r0 rev r1,r1 rev r2,r2 rev r3,r3 -# endif +#endif eor r0,r0,r8 @ xor with input add r8,sp,#4*(4) eor r1,r1,r9 @@ -1146,12 +1139,12 @@ ChaCha20_neon: ldr r10,[r12,#-8] add r7,r11,r7,ror#13 ldr r11,[r12,#-4] -# ifdef __ARMEB__ +#ifdef __ARMEB__ rev r4,r4 rev r5,r5 rev r6,r6 rev r7,r7 -# endif +#endif eor r4,r4,r8 add r8,sp,#4*(8) eor r5,r5,r9 @@ -1170,24 +1163,24 @@ ChaCha20_neon: ldr r8,[r12],#16 @ load input add r1,r1,r9 ldr r9,[r12,#-12] -# ifdef __thumb2__ +#ifdef __thumb2__ it hi -# endif +#endif strhi r10,[sp,#4*(16+10)] @ copy "rx" while at it add r2,r2,r10 ldr r10,[r12,#-8] -# ifdef __thumb2__ +#ifdef __thumb2__ it hi -# endif +#endif strhi r11,[sp,#4*(16+11)] @ copy "rx" while at it add r3,r3,r11 ldr r11,[r12,#-4] -# ifdef __ARMEB__ +#ifdef __ARMEB__ rev r0,r0 rev r1,r1 rev r2,r2 rev r3,r3 -# endif +#endif eor r0,r0,r8 add r8,sp,#4*(12) eor r1,r1,r9 @@ -1210,16 +1203,16 @@ ChaCha20_neon: add r7,r11,r7,ror#24 ldr r10,[r12,#-8] ldr r11,[r12,#-4] -# ifdef __ARMEB__ +#ifdef __ARMEB__ rev r4,r4 rev r5,r5 rev r6,r6 rev r7,r7 -# endif +#endif eor r4,r4,r8 -# ifdef __thumb2__ +#ifdef __thumb2__ it hi -# endif +#endif ldrhi r8,[sp,#4*(32+2)] @ re-load len eor r5,r5,r9 eor r6,r6,r10 @@ -1379,7 +1372,7 @@ ChaCha20_neon: add r6,r10,r6,ror#13 add r7,r11,r7,ror#13 ldmia r8,{r8-r11} @ load key material -# ifdef __ARMEB__ +#ifdef __ARMEB__ rev r0,r0 rev r1,r1 rev r2,r2 @@ -1388,7 +1381,7 @@ ChaCha20_neon: rev r5,r5 rev r6,r6 rev r7,r7 -# endif +#endif stmia sp,{r0-r7} add r0,sp,#4*(16+8) @@ -1408,7 +1401,7 @@ ChaCha20_neon: add r6,r10,r6,ror#24 add r7,r11,r7,ror#24 ldr r11,[sp,#4*(32+2)] @ re-load len -# ifdef __ARMEB__ +#ifdef __ARMEB__ rev r0,r0 rev r1,r1 rev r2,r2 @@ -1417,7 +1410,7 @@ ChaCha20_neon: rev r5,r5 rev r6,r6 rev r7,r7 -# endif +#endif stmia r8,{r0-r7} add r10,sp,#4*(0) sub r11,r11,#64*3 @ len-=64*3 @@ -1434,7 +1427,7 @@ ChaCha20_neon: add sp,sp,#4*(32+4) vldmia sp,{d8-d15} add sp,sp,#4*(16+3) +.Lno_data_neon: ldmia sp!,{r4-r11,pc} -.size ChaCha20_neon,.-ChaCha20_neon -.comm OPENSSL_armcap_P,4,4 +ENDPROC(chacha20_neon) #endif diff --git a/lib/zinc/chacha20/chacha20-arm64-cryptogams.S b/lib/zinc/chacha20/chacha20-arm64-cryptogams.S index 4d029bfdad3a..1ae11a5c5a14 100644 --- a/lib/zinc/chacha20/chacha20-arm64-cryptogams.S +++ b/lib/zinc/chacha20/chacha20-arm64-cryptogams.S @@ -1,46 +1,24 @@ /* SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause */ /* + * Copyright (C) 2015-2018 Jason A. Donenfeld . All Rights Reserved. * Copyright (C) 2006-2017 CRYPTOGAMS by . All Rights Reserved. + * + * This is based in part on Andy Polyakov's implementation from CRYPTOGAMS. */ -#include "arm_arch.h" +#include .text - - - .align 5 .Lsigma: .quad 0x3320646e61707865,0x6b20657479622d32 // endian-neutral .Lone: .long 1,0,0,0 -.LOPENSSL_armcap_P: -#ifdef __ILP32__ -.long OPENSSL_armcap_P-. -#else -.quad OPENSSL_armcap_P-. 
-#endif -.byte 67,104,97,67,104,97,50,48,32,102,111,114,32,65,82,77,118,56,44,32,67,82,89,80,84,79,71,65,77,83,32,98,121,32,60,97,112,112,114,111,64,111,112,101,110,115,115,108,46,111,114,103,62,0 -.align 2 -.globl ChaCha20_ctr32 -.type ChaCha20_ctr32,%function .align 5 -ChaCha20_ctr32: +ENTRY(chacha20_arm) cbz x2,.Labort - adr x5,.LOPENSSL_armcap_P - cmp x2,#192 - b.lo .Lshort -#ifdef __ILP32__ - ldrsw x6,[x5] -#else - ldr x6,[x5] -#endif - ldr w17,[x6,x5] - tst w17,#ARMV7_NEON - b.ne ChaCha20_neon -.Lshort: stp x29,x30,[sp,#-96]! add x29,sp,#0 @@ -56,7 +34,7 @@ ChaCha20_ctr32: ldp x24,x25,[x3] // load key ldp x26,x27,[x3,#16] ldp x28,x30,[x4] // load counter -#ifdef __ARMEB__ +#ifdef __AARCH64EB__ ror x24,x24,#32 ror x25,x25,#32 ror x26,x26,#32 @@ -217,7 +195,7 @@ ChaCha20_ctr32: add x20,x20,x21,lsl#32 ldp x19,x21,[x1,#48] add x1,x1,#64 -#ifdef __ARMEB__ +#ifdef __AARCH64EB__ rev x5,x5 rev x7,x7 rev x9,x9 @@ -273,7 +251,7 @@ ChaCha20_ctr32: add x15,x15,x16,lsl#32 add x17,x17,x19,lsl#32 add x20,x20,x21,lsl#32 -#ifdef __ARMEB__ +#ifdef __AARCH64EB__ rev x5,x5 rev x7,x7 rev x9,x9 @@ -309,11 +287,13 @@ ChaCha20_ctr32: ldp x27,x28,[x29,#80] ldp x29,x30,[sp],#96 ret -.size ChaCha20_ctr32,.-ChaCha20_ctr32 +ENDPROC(chacha20_arm) -.type ChaCha20_neon,%function +#ifdef CONFIG_KERNEL_MODE_NEON .align 5 -ChaCha20_neon: +ENTRY(chacha20_neon) + cbz x2,.Labort_neon + stp x29,x30,[sp,#-96]! add x29,sp,#0 @@ -336,7 +316,7 @@ ChaCha20_neon: ldp x28,x30,[x4] // load counter ld1 {v27.4s},[x4] ld1 {v31.4s},[x5] -#ifdef __ARMEB__ +#ifdef __AARCH64EB__ rev64 v24.4s,v24.4s ror x24,x24,#32 ror x25,x25,#32 @@ -634,7 +614,7 @@ ChaCha20_neon: add x20,x20,x21,lsl#32 ldp x19,x21,[x1,#48] add x1,x1,#64 -#ifdef __ARMEB__ +#ifdef __AARCH64EB__ rev x5,x5 rev x7,x7 rev x9,x9 @@ -713,7 +693,7 @@ ChaCha20_neon: add x20,x20,x21,lsl#32 ldp x19,x21,[x1,#48] add x1,x1,#64 -#ifdef __ARMEB__ +#ifdef __AARCH64EB__ rev x5,x5 rev x7,x7 rev x9,x9 @@ -803,19 +783,6 @@ ChaCha20_neon: ldp x27,x28,[x29,#80] ldp x29,x30,[sp],#96 ret -.size ChaCha20_neon,.-ChaCha20_neon -.type ChaCha20_512_neon,%function -.align 5 -ChaCha20_512_neon: - stp x29,x30,[sp,#-96]! - add x29,sp,#0 - - adr x5,.Lsigma - stp x19,x20,[sp,#16] - stp x21,x22,[sp,#32] - stp x23,x24,[sp,#48] - stp x25,x26,[sp,#64] - stp x27,x28,[sp,#80] .L512_or_more_neon: sub sp,sp,#128+64 @@ -828,7 +795,7 @@ ChaCha20_512_neon: ldp x28,x30,[x4] // load counter ld1 {v27.4s},[x4] ld1 {v31.4s},[x5] -#ifdef __ARMEB__ +#ifdef __AARCH64EB__ rev64 v24.4s,v24.4s ror x24,x24,#32 ror x25,x25,#32 @@ -1341,7 +1308,7 @@ ChaCha20_512_neon: add x20,x20,x21,lsl#32 ldp x19,x21,[x1,#48] add x1,x1,#64 -#ifdef __ARMEB__ +#ifdef __AARCH64EB__ rev x5,x5 rev x7,x7 rev x9,x9 @@ -1855,7 +1822,7 @@ ChaCha20_512_neon: add x1,x1,#64 add v21.4s,v21.4s,v25.4s -#ifdef __ARMEB__ +#ifdef __AARCH64EB__ rev x5,x5 rev x7,x7 rev x9,x9 @@ -1969,5 +1936,7 @@ ChaCha20_512_neon: ldp x25,x26,[x29,#64] ldp x27,x28,[x29,#80] ldp x29,x30,[sp],#96 +.Labort_neon: ret -.size ChaCha20_512_neon,.-ChaCha20_512_neon +ENDPROC(chacha20_neon) +#endif From patchwork Sat Oct 6 02:56:50 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Jason A. 
Donenfeld" X-Patchwork-Id: 148326 Delivered-To: patch@linaro.org Received: by 2002:a2e:8595:0:0:0:0:0 with SMTP id b21-v6csp1159715lji; Fri, 5 Oct 2018 20:01:05 -0700 (PDT) X-Google-Smtp-Source: ACcGV618KDmLsmmRtem/c0VDla08cbFKZWPH+G/VHNPrLVH7nQagJhf+KRSz5dkmIYwnO1ToTFvG X-Received: by 2002:a62:4803:: with SMTP id v3-v6mr15123198pfa.89.1538794865297; Fri, 05 Oct 2018 20:01:05 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1538794865; cv=none; d=google.com; s=arc-20160816; b=BdkjMC7v5fS52wlvQ/kSRTOe4n6gOd8J0nSbr6bJWdINyEjFCNql8e/+a6kvuC1L+U TUu5AXj0QFJCH8+yfngoD3oLC847HfQtH87/KC07JuzBOYnC+UtxD12TaA6Ts2NS/jnW DKI4FgWO7+Jzbrli8y/v4fLxmAJVFaol+pfvTt3sb8mv7/0L/ec8mKHsttZw98STlD3x zd+KW14wgt4Bl+VRESA40qKuFA3Qn2cQi31TldBKuZema977HlPuaWvx629PDkrCgP2y 40jPHt6kOs2xuf/JtTfsR1LkrThkwHfS3+uXMh0Q/liGTSiBXxEETn+7ErGJXFfE9jc5 HnIQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from :dkim-signature; bh=JccfZA1n2Z3DKGHMuqPZQ4B1yz61dUIeOcqr8p33wCU=; b=YzGHJMy8DmeJIsZloCR6c2NzOt4sTu8xs3cydlJVIZ5KsWJfTXpVbU8A0HedIW7qjg Yfg5LoxHzV+oD0b2d6Xp5WrWWfRDf8xeaCrk99pQnsWttEahwhOANctDltp322kyvpmz r0qdzgG1EJESop4Vf+SH7dFfPa1L82+c6VbJSQCmdzgoeUKrN4ThWTc6E6PzU72x5xZb HbMLFa7nW69+L5hjBSUg31HtkTrzg/zjnK/PS+k+dgTk26/WySPZ39x5QIbpiDIBIMf6 El/69MYyEG/TdLfKm//JEvdApSROsGBqYoKNM4RijzlddQxfq6+3i7XuijjWyYw3S8gm bhCQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@zx2c4.com header.s=mail header.b=iMd6KXVV; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=zx2c4.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. 
From: "Jason A. Donenfeld"
To: linux-kernel@vger.kernel.org, netdev@vger.kernel.org, davem@davemloft.net, gregkh@linuxfoundation.org
Cc: "Jason A. Donenfeld", Russell King, linux-arm-kernel@lists.infradead.org, Samuel Neves, Jean-Philippe Aumasson, Andy Lutomirski, Andrew Morton, Linus Torvalds, kernel-hardening@lists.openwall.com, linux-crypto@vger.kernel.org
Subject: [PATCH net-next v7 09/28] zinc: ChaCha20 ARM and ARM64 implementations
Date: Sat, 6 Oct 2018 04:56:50 +0200
Message-Id: <20181006025709.4019-10-Jason@zx2c4.com>
In-Reply-To: <20181006025709.4019-1-Jason@zx2c4.com>
References: <20181006025709.4019-1-Jason@zx2c4.com>

These wire Andy Polyakov's implementations up to the kernel for ARMv7/8
NEON, and introduce Eric Biggers' ultra-fast scalar implementation for
CPUs without NEON or for CPUs with slow NEON (Cortex-A5, Cortex-A7).

This commit does the following:

 - Adds the glue code for the assembly implementations.
 - Renames the ARMv8 code into place, since it can at this point be
   used wholesale.
 - Merges Andy Polyakov's ARMv7 NEON code with Eric Biggers' <=ARMv7
   scalar code.

This commit delivers approximately the same or much better performance
than the existing crypto API's code and has been measured to do as such
on:

 - ARM1176JZF-S [ARMv6]
 - Cortex-A7    [ARMv7]
 - Cortex-A8    [ARMv7]
 - Cortex-A9    [ARMv7]
 - Cortex-A17   [ARMv7]
 - Cortex-A53   [ARMv8]
 - Cortex-A55   [ARMv8]
 - Cortex-A73   [ARMv8]
 - Cortex-A75   [ARMv8]

Interestingly, Andy Polyakov's scalar code is slower than Eric
Biggers', but is also significantly shorter.
This has the advantage that it does not evict other code from L1 cache
-- particularly on ARM11 chips -- and so in certain circumstances it
can actually be faster. However, it wasn't found that this had an
effect on any code existing in the kernel today.

Signed-off-by: Jason A. Donenfeld
Co-authored-by: Eric Biggers
Cc: Russell King
Cc: linux-arm-kernel@lists.infradead.org
Cc: Samuel Neves
Cc: Jean-Philippe Aumasson
Cc: Andy Lutomirski
Cc: Greg KH
Cc: Andrew Morton
Cc: Linus Torvalds
Cc: kernel-hardening@lists.openwall.com
Cc: linux-crypto@vger.kernel.org
---

Notes:
    Eric Biggers' scalar code is brand new, and quite possibly
    prematurely added to this commit, and so it may require a bit of
    revision. In initial evaluation and fuzzing so far, it seems fine.
    But we'll be looking at this a bit more as well.

 lib/zinc/Makefile                             |   2 +
 lib/zinc/chacha20/chacha20-arm-glue.c         |  98 ++++
 ...acha20-arm-cryptogams.S => chacha20-arm.S} | 503 ++++++++++++++++--
 ...20-arm64-cryptogams.S => chacha20-arm64.S} |   0
 lib/zinc/chacha20/chacha20.c                  |   2 +
 5 files changed, 567 insertions(+), 38 deletions(-)
 create mode 100644 lib/zinc/chacha20/chacha20-arm-glue.c
 rename lib/zinc/chacha20/{chacha20-arm-cryptogams.S => chacha20-arm.S} (71%)
 rename lib/zinc/chacha20/{chacha20-arm64-cryptogams.S => chacha20-arm64.S} (100%)

-- 
2.19.0

diff --git a/lib/zinc/Makefile b/lib/zinc/Makefile
index 223a0816c918..e47f64e12bbd 100644
--- a/lib/zinc/Makefile
+++ b/lib/zinc/Makefile
@@ -4,4 +4,6 @@ ccflags-$(CONFIG_ZINC_DEBUG) += -DDEBUG
 zinc_chacha20-y := chacha20/chacha20.o
 zinc_chacha20-$(CONFIG_ZINC_ARCH_X86_64) += chacha20/chacha20-x86_64.o
+zinc_chacha20-$(CONFIG_ZINC_ARCH_ARM) += chacha20/chacha20-arm.o
+zinc_chacha20-$(CONFIG_ZINC_ARCH_ARM64) += chacha20/chacha20-arm64.o
 obj-$(CONFIG_ZINC_CHACHA20) += zinc_chacha20.o
diff --git a/lib/zinc/chacha20/chacha20-arm-glue.c b/lib/zinc/chacha20/chacha20-arm-glue.c
new file mode 100644
index 000000000000..a0da95d3b9c4
--- /dev/null
+++ b/lib/zinc/chacha20/chacha20-arm-glue.c
@@ -0,0 +1,98 @@
+// SPDX-License-Identifier: GPL-2.0 OR MIT
+/*
+ * Copyright (C) 2015-2018 Jason A. Donenfeld . All Rights Reserved.
+ */
+
+#include
+#include
+#if defined(CONFIG_ZINC_ARCH_ARM)
+#include
+#include
+#endif
+
+asmlinkage void chacha20_arm(u8 *out, const u8 *in, const size_t len,
+			     const u32 key[8], const u32 counter[4]);
+asmlinkage void hchacha20_arm(const u32 state[16], u32 out[8]);
+asmlinkage void chacha20_neon(u8 *out, const u8 *in, const size_t len,
+			      const u32 key[8], const u32 counter[4]);
+
+static bool chacha20_use_neon __ro_after_init;
+static bool *const chacha20_nobs[] __initconst = { &chacha20_use_neon };
+static void __init chacha20_fpu_init(void)
+{
+#if defined(CONFIG_ZINC_ARCH_ARM64)
+	chacha20_use_neon = elf_hwcap & HWCAP_ASIMD;
+#elif defined(CONFIG_ZINC_ARCH_ARM)
+	switch (read_cpuid_part()) {
+	case ARM_CPU_PART_CORTEX_A7:
+	case ARM_CPU_PART_CORTEX_A5:
+		/* The Cortex-A7 and Cortex-A5 do not perform well with the
+		 * NEON implementation but do incredibly well with the scalar
+		 * one and use less power.
+		 */
+		break;
+	default:
+		chacha20_use_neon = elf_hwcap & HWCAP_NEON;
+	}
+#endif
+}
+
+static inline bool chacha20_arch(struct chacha20_ctx *ctx, u8 *dst,
+				 const u8 *src, size_t len,
+				 simd_context_t *simd_context)
+{
+	/* SIMD disables preemption, so relax after processing each page.
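+	 *
+	 * Each ChaCha20 block is 64 bytes, so a call that processes len
+	 * bytes must advance the 32-bit block counter by ceil(len / 64);
+	 * the (bytes + 63) / 64 and (len + 63) / 64 updates below are
+	 * exactly that rounding.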
+	 */
+	BUILD_BUG_ON(PAGE_SIZE < CHACHA20_BLOCK_SIZE ||
+		     PAGE_SIZE % CHACHA20_BLOCK_SIZE);
+
+	for (;;) {
+		if (IS_ENABLED(CONFIG_KERNEL_MODE_NEON) && chacha20_use_neon &&
+		    len >= CHACHA20_BLOCK_SIZE * 3 && simd_use(simd_context)) {
+			const size_t bytes = min_t(size_t, len, PAGE_SIZE);
+
+			chacha20_neon(dst, src, bytes, ctx->key, ctx->counter);
+			ctx->counter[0] += (bytes + 63) / 64;
+			len -= bytes;
+			if (!len)
+				break;
+			dst += bytes;
+			src += bytes;
+			simd_relax(simd_context);
+		} else {
+			chacha20_arm(dst, src, len, ctx->key, ctx->counter);
+			ctx->counter[0] += (len + 63) / 64;
+			break;
+		}
+	}
+
+	return true;
+}
+
+static inline bool hchacha20_arch(u32 derived_key[CHACHA20_KEY_WORDS],
+				  const u8 nonce[HCHACHA20_NONCE_SIZE],
+				  const u8 key[HCHACHA20_KEY_SIZE],
+				  simd_context_t *simd_context)
+{
+	if (IS_ENABLED(CONFIG_ZINC_ARCH_ARM)) {
+		u32 x[] = { CHACHA20_CONSTANT_EXPA,
+			    CHACHA20_CONSTANT_ND_3,
+			    CHACHA20_CONSTANT_2_BY,
+			    CHACHA20_CONSTANT_TE_K,
+			    get_unaligned_le32(key + 0),
+			    get_unaligned_le32(key + 4),
+			    get_unaligned_le32(key + 8),
+			    get_unaligned_le32(key + 12),
+			    get_unaligned_le32(key + 16),
+			    get_unaligned_le32(key + 20),
+			    get_unaligned_le32(key + 24),
+			    get_unaligned_le32(key + 28),
+			    get_unaligned_le32(nonce + 0),
+			    get_unaligned_le32(nonce + 4),
+			    get_unaligned_le32(nonce + 8),
+			    get_unaligned_le32(nonce + 12)
+		};
+		hchacha20_arm(x, derived_key);
+		return true;
+	}
+	return false;
+}
diff --git a/lib/zinc/chacha20/chacha20-arm-cryptogams.S b/lib/zinc/chacha20/chacha20-arm.S
similarity index 71%
rename from lib/zinc/chacha20/chacha20-arm-cryptogams.S
rename to lib/zinc/chacha20/chacha20-arm.S
index 770bab469171..79ed18fbcce3 100644
--- a/lib/zinc/chacha20/chacha20-arm-cryptogams.S
+++ b/lib/zinc/chacha20/chacha20-arm.S
@@ -1,12 +1,475 @@
 /* SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause */
 /*
+ * Copyright (C) 2018 Google, Inc.
  * Copyright (C) 2015-2018 Jason A. Donenfeld . All Rights Reserved.
  * Copyright (C) 2006-2017 CRYPTOGAMS by . All Rights Reserved.
- *
- * This is based in part on Andy Polyakov's implementation from CRYPTOGAMS.
  */
 #include
+#include
+
+/*
+ * The following scalar routine was written by Eric Biggers.
+ *
+ * Design notes:
+ *
+ * 16 registers would be needed to hold the state matrix, but only 14 are
+ * available because 'sp' and 'pc' cannot be used.  So we spill the elements
+ * (x8, x9) to the stack and swap them out with (x10, x11).  This adds one
+ * 'ldrd' and one 'strd' instruction per round.
+ *
+ * All rotates are performed using the implicit rotate operand accepted by the
+ * 'add' and 'eor' instructions.  This is faster than using explicit rotate
+ * instructions.  To make this work, we allow the values in the second and last
+ * rows of the ChaCha state matrix (rows 'b' and 'd') to temporarily have the
+ * wrong rotation amount.  The rotation amount is then fixed up just in time
+ * when the values are used.  'brot' is the number of bits the values in row 'b'
+ * need to be rotated right to arrive at the correct values, and 'drot'
+ * similarly for row 'd'.  (brot, drot) start out as (0, 0) but we make it such
+ * that they end up as (25, 24) after every round.
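+ *
+ * To make the rotation bookkeeping concrete, here is a minimal C sketch of a
+ * single "b ^= c; b = rol32(b, 12);" quarterround step under this scheme (a
+ * sketch added for illustration only; the real code is the _halfround macro
+ * below, and ror32() is the kernel's rotate helper from <linux/bitops.h>):
+ *
+ *	// reg_b holds the true value of 'b' rotated left by brot bits,
+ *	// i.e. the true value is ror32(reg_b, brot).  Consume the pending
+ *	// rotation while XOR-ing, as the single instruction
+ *	// "eor b, c, b, ror #brot" does:
+ *	reg_b = c ^ ror32(reg_b, brot);
+ *	// Defer the rol32(..., 12): the true value is now
+ *	// ror32(reg_b, 32 - 12), so record brot = 20 for the next use of
+ *	// row 'b' instead of rotating anything here.
+ *	brot = 32 - 12;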
+ */ + + // ChaCha state registers + X0 .req r0 + X1 .req r1 + X2 .req r2 + X3 .req r3 + X4 .req r4 + X5 .req r5 + X6 .req r6 + X7 .req r7 + X8_X10 .req r8 // shared by x8 and x10 + X9_X11 .req r9 // shared by x9 and x11 + X12 .req r10 + X13 .req r11 + X14 .req r12 + X15 .req r14 + +.Lexpand_32byte_k: + // "expand 32-byte k" + .word 0x61707865, 0x3320646e, 0x79622d32, 0x6b206574 + +#ifdef __thumb2__ +# define adrl adr +#endif + +.macro __rev out, in, t0, t1, t2 +.if __LINUX_ARM_ARCH__ >= 6 + rev \out, \in +.else + lsl \t0, \in, #24 + and \t1, \in, #0xff00 + and \t2, \in, #0xff0000 + orr \out, \t0, \in, lsr #24 + orr \out, \out, \t1, lsl #8 + orr \out, \out, \t2, lsr #8 +.endif +.endm + +.macro _le32_bswap x, t0, t1, t2 +#ifdef __ARMEB__ + __rev \x, \x, \t0, \t1, \t2 +#endif +.endm + +.macro _le32_bswap_4x a, b, c, d, t0, t1, t2 + _le32_bswap \a, \t0, \t1, \t2 + _le32_bswap \b, \t0, \t1, \t2 + _le32_bswap \c, \t0, \t1, \t2 + _le32_bswap \d, \t0, \t1, \t2 +.endm + +.macro __ldrd a, b, src, offset +#if __LINUX_ARM_ARCH__ >= 6 + ldrd \a, \b, [\src, #\offset] +#else + ldr \a, [\src, #\offset] + ldr \b, [\src, #\offset + 4] +#endif +.endm + +.macro __strd a, b, dst, offset +#if __LINUX_ARM_ARCH__ >= 6 + strd \a, \b, [\dst, #\offset] +#else + str \a, [\dst, #\offset] + str \b, [\dst, #\offset + 4] +#endif +.endm + +.macro _halfround a1, b1, c1, d1, a2, b2, c2, d2 + + // a += b; d ^= a; d = rol(d, 16); + add \a1, \a1, \b1, ror #brot + add \a2, \a2, \b2, ror #brot + eor \d1, \a1, \d1, ror #drot + eor \d2, \a2, \d2, ror #drot + // drot == 32 - 16 == 16 + + // c += d; b ^= c; b = rol(b, 12); + add \c1, \c1, \d1, ror #16 + add \c2, \c2, \d2, ror #16 + eor \b1, \c1, \b1, ror #brot + eor \b2, \c2, \b2, ror #brot + // brot == 32 - 12 == 20 + + // a += b; d ^= a; d = rol(d, 8); + add \a1, \a1, \b1, ror #20 + add \a2, \a2, \b2, ror #20 + eor \d1, \a1, \d1, ror #16 + eor \d2, \a2, \d2, ror #16 + // drot == 32 - 8 == 24 + + // c += d; b ^= c; b = rol(b, 7); + add \c1, \c1, \d1, ror #24 + add \c2, \c2, \d2, ror #24 + eor \b1, \c1, \b1, ror #20 + eor \b2, \c2, \b2, ror #20 + // brot == 32 - 7 == 25 +.endm + +.macro _doubleround + + // column round + + // quarterrounds: (x0, x4, x8, x12) and (x1, x5, x9, x13) + _halfround X0, X4, X8_X10, X12, X1, X5, X9_X11, X13 + + // save (x8, x9); restore (x10, x11) + __strd X8_X10, X9_X11, sp, 0 + __ldrd X8_X10, X9_X11, sp, 8 + + // quarterrounds: (x2, x6, x10, x14) and (x3, x7, x11, x15) + _halfround X2, X6, X8_X10, X14, X3, X7, X9_X11, X15 + + .set brot, 25 + .set drot, 24 + + // diagonal round + + // quarterrounds: (x0, x5, x10, x15) and (x1, x6, x11, x12) + _halfround X0, X5, X8_X10, X15, X1, X6, X9_X11, X12 + + // save (x10, x11); restore (x8, x9) + __strd X8_X10, X9_X11, sp, 8 + __ldrd X8_X10, X9_X11, sp, 0 + + // quarterrounds: (x2, x7, x8, x13) and (x3, x4, x9, x14) + _halfround X2, X7, X8_X10, X13, X3, X4, X9_X11, X14 +.endm + +.macro _chacha_permute nrounds + .set brot, 0 + .set drot, 0 + .rept \nrounds / 2 + _doubleround + .endr +.endm + +.macro _chacha nrounds + +.Lnext_block\@: + // Stack: unused0-unused1 x10-x11 x0-x15 OUT IN LEN + // Registers contain x0-x9,x12-x15. + + // Do the core ChaCha permutation to update x0-x15. + _chacha_permute \nrounds + + add sp, #8 + // Stack: x10-x11 orig_x0-orig_x15 OUT IN LEN + // Registers contain x0-x9,x12-x15. + // x4-x7 are rotated by 'brot'; x12-x15 are rotated by 'drot'. + + // Free up some registers (r8-r12,r14) by pushing (x8-x9,x12-x15). + push {X8_X10, X9_X11, X12, X13, X14, X15} + + // Load (OUT, IN, LEN). 
+ ldr r14, [sp, #96] + ldr r12, [sp, #100] + ldr r11, [sp, #104] + + orr r10, r14, r12 + + // Use slow path if fewer than 64 bytes remain. + cmp r11, #64 + blt .Lxor_slowpath\@ + + // Use slow path if IN and/or OUT isn't 4-byte aligned. Needed even on + // ARMv6+, since ldmia and stmia (used below) still require alignment. + tst r10, #3 + bne .Lxor_slowpath\@ + + // Fast path: XOR 64 bytes of aligned data. + + // Stack: x8-x9 x12-x15 x10-x11 orig_x0-orig_x15 OUT IN LEN + // Registers: r0-r7 are x0-x7; r8-r11 are free; r12 is IN; r14 is OUT. + // x4-x7 are rotated by 'brot'; x12-x15 are rotated by 'drot'. + + // x0-x3 + __ldrd r8, r9, sp, 32 + __ldrd r10, r11, sp, 40 + add X0, X0, r8 + add X1, X1, r9 + add X2, X2, r10 + add X3, X3, r11 + _le32_bswap_4x X0, X1, X2, X3, r8, r9, r10 + ldmia r12!, {r8-r11} + eor X0, X0, r8 + eor X1, X1, r9 + eor X2, X2, r10 + eor X3, X3, r11 + stmia r14!, {X0-X3} + + // x4-x7 + __ldrd r8, r9, sp, 48 + __ldrd r10, r11, sp, 56 + add X4, r8, X4, ror #brot + add X5, r9, X5, ror #brot + ldmia r12!, {X0-X3} + add X6, r10, X6, ror #brot + add X7, r11, X7, ror #brot + _le32_bswap_4x X4, X5, X6, X7, r8, r9, r10 + eor X4, X4, X0 + eor X5, X5, X1 + eor X6, X6, X2 + eor X7, X7, X3 + stmia r14!, {X4-X7} + + // x8-x15 + pop {r0-r7} // (x8-x9,x12-x15,x10-x11) + __ldrd r8, r9, sp, 32 + __ldrd r10, r11, sp, 40 + add r0, r0, r8 // x8 + add r1, r1, r9 // x9 + add r6, r6, r10 // x10 + add r7, r7, r11 // x11 + _le32_bswap_4x r0, r1, r6, r7, r8, r9, r10 + ldmia r12!, {r8-r11} + eor r0, r0, r8 // x8 + eor r1, r1, r9 // x9 + eor r6, r6, r10 // x10 + eor r7, r7, r11 // x11 + stmia r14!, {r0,r1,r6,r7} + ldmia r12!, {r0,r1,r6,r7} + __ldrd r8, r9, sp, 48 + __ldrd r10, r11, sp, 56 + add r2, r8, r2, ror #drot // x12 + add r3, r9, r3, ror #drot // x13 + add r4, r10, r4, ror #drot // x14 + add r5, r11, r5, ror #drot // x15 + _le32_bswap_4x r2, r3, r4, r5, r9, r10, r11 + ldr r9, [sp, #72] // load LEN + eor r2, r2, r0 // x12 + eor r3, r3, r1 // x13 + eor r4, r4, r6 // x14 + eor r5, r5, r7 // x15 + subs r9, #64 // decrement and check LEN + stmia r14!, {r2-r5} + + beq .Ldone\@ + +.Lprepare_for_next_block\@: + + // Stack: x0-x15 OUT IN LEN + + // Increment block counter (x12) + add r8, #1 + + // Store updated (OUT, IN, LEN) + str r14, [sp, #64] + str r12, [sp, #68] + str r9, [sp, #72] + + mov r14, sp + + // Store updated block counter (x12) + str r8, [sp, #48] + + sub sp, #16 + + // Reload state and do next block + ldmia r14!, {r0-r11} // load x0-x11 + __strd r10, r11, sp, 8 // store x10-x11 before state + ldmia r14, {r10-r12,r14} // load x12-x15 + b .Lnext_block\@ + +.Lxor_slowpath\@: + // Slow path: < 64 bytes remaining, or unaligned input or output buffer. + // We handle it by storing the 64 bytes of keystream to the stack, then + // XOR-ing the needed portion with the data. + + // Allocate keystream buffer + sub sp, #64 + mov r14, sp + + // Stack: ks0-ks15 x8-x9 x12-x15 x10-x11 orig_x0-orig_x15 OUT IN LEN + // Registers: r0-r7 are x0-x7; r8-r11 are free; r12 is IN; r14 is &ks0. + // x4-x7 are rotated by 'brot'; x12-x15 are rotated by 'drot'. 
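+ // (Added note: the slow path below first materializes the full 64-byte
+ // keystream block on the stack at r14 = &ks0, row by row, and only then
+ // XORs the needed number of bytes, so short or misaligned tails can be
+ // handled with plain word/byte loops.)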
+ + // Save keystream for x0-x3 + __ldrd r8, r9, sp, 96 + __ldrd r10, r11, sp, 104 + add X0, X0, r8 + add X1, X1, r9 + add X2, X2, r10 + add X3, X3, r11 + _le32_bswap_4x X0, X1, X2, X3, r8, r9, r10 + stmia r14!, {X0-X3} + + // Save keystream for x4-x7 + __ldrd r8, r9, sp, 112 + __ldrd r10, r11, sp, 120 + add X4, r8, X4, ror #brot + add X5, r9, X5, ror #brot + add X6, r10, X6, ror #brot + add X7, r11, X7, ror #brot + _le32_bswap_4x X4, X5, X6, X7, r8, r9, r10 + add r8, sp, #64 + stmia r14!, {X4-X7} + + // Save keystream for x8-x15 + ldm r8, {r0-r7} // (x8-x9,x12-x15,x10-x11) + __ldrd r8, r9, sp, 128 + __ldrd r10, r11, sp, 136 + add r0, r0, r8 // x8 + add r1, r1, r9 // x9 + add r6, r6, r10 // x10 + add r7, r7, r11 // x11 + _le32_bswap_4x r0, r1, r6, r7, r8, r9, r10 + stmia r14!, {r0,r1,r6,r7} + __ldrd r8, r9, sp, 144 + __ldrd r10, r11, sp, 152 + add r2, r8, r2, ror #drot // x12 + add r3, r9, r3, ror #drot // x13 + add r4, r10, r4, ror #drot // x14 + add r5, r11, r5, ror #drot // x15 + _le32_bswap_4x r2, r3, r4, r5, r9, r10, r11 + stmia r14, {r2-r5} + + // Stack: ks0-ks15 unused0-unused7 x0-x15 OUT IN LEN + // Registers: r8 is block counter, r12 is IN. + + ldr r9, [sp, #168] // LEN + ldr r14, [sp, #160] // OUT + cmp r9, #64 + mov r0, sp + movle r1, r9 + movgt r1, #64 + // r1 is number of bytes to XOR, in range [1, 64] + +.if __LINUX_ARM_ARCH__ < 6 + orr r2, r12, r14 + tst r2, #3 // IN or OUT misaligned? + bne .Lxor_next_byte\@ +.endif + + // XOR a word at a time +.rept 16 + subs r1, #4 + blt .Lxor_words_done\@ + ldr r2, [r12], #4 + ldr r3, [r0], #4 + eor r2, r2, r3 + str r2, [r14], #4 +.endr + b .Lxor_slowpath_done\@ +.Lxor_words_done\@: + ands r1, r1, #3 + beq .Lxor_slowpath_done\@ + + // XOR a byte at a time +.Lxor_next_byte\@: + ldrb r2, [r12], #1 + ldrb r3, [r0], #1 + eor r2, r2, r3 + strb r2, [r14], #1 + subs r1, #1 + bne .Lxor_next_byte\@ + +.Lxor_slowpath_done\@: + subs r9, #64 + add sp, #96 + bgt .Lprepare_for_next_block\@ + +.Ldone\@: +.endm // _chacha + +/* + * void chacha20_arm(u8 *out, const u8 *in, size_t len, const u32 key[8], + * const u32 iv[4]); + */ +ENTRY(chacha20_arm) + cmp r2, #0 // len == 0? + reteq lr + + push {r0-r2,r4-r11,lr} + + // Push state x0-x15 onto stack. + // Also store an extra copy of x10-x11 just before the state. + + ldr r4, [sp, #48] // iv + mov r0, sp + sub sp, #80 + + // iv: x12-x15 + ldm r4, {X12,X13,X14,X15} + stmdb r0!, {X12,X13,X14,X15} + + // key: x4-x11 + __ldrd X8_X10, X9_X11, r3, 24 + __strd X8_X10, X9_X11, sp, 8 + stmdb r0!, {X8_X10, X9_X11} + ldm r3, {X4-X9_X11} + stmdb r0!, {X4-X9_X11} + + // constants: x0-x3 + adrl X3, .Lexpand_32byte_k + ldm X3, {X0-X3} + __strd X0, X1, sp, 16 + __strd X2, X3, sp, 24 + + _chacha 20 + + add sp, #76 + pop {r4-r11, pc} +ENDPROC(chacha20_arm) + +/* + * void hchacha20_arm(const u32 state[16], u32 out[8]); + */ +ENTRY(hchacha20_arm) + push {r1,r4-r11,lr} + + mov r14, r0 + ldmia r14!, {r0-r11} // load x0-x11 + push {r10-r11} // store x10-x11 to stack + ldm r14, {r10-r12,r14} // load x12-x15 + sub sp, #8 + + _chacha_permute 20 + + // Skip over (unused0-unused1, x10-x11) + add sp, #16 + + // Fix up rotations of x12-x15 + ror X12, X12, #drot + ror X13, X13, #drot + pop {r4} // load 'out' + ror X14, X14, #drot + ror X15, X15, #drot + + // Store (x0-x3,x12-x15) to 'out' + stm r4, {X0,X1,X2,X3,X12,X13,X14,X15} + + pop {r4-r11,pc} +ENDPROC(hchacha20_arm) + +#ifdef CONFIG_KERNEL_MODE_NEON +/* + * This following NEON routine was ported from Andy Polyakov's implementation + * from CRYPTOGAMS. 
It begins with parts of the CRYPTOGAMS scalar routine, + * since certain NEON code paths actually branch to it. + */ .text #if defined(__thumb2__) || defined(__clang__) @@ -22,39 +485,6 @@ #define ldrhsb ldrbhs #endif -.align 5 -.Lsigma: -.long 0x61707865,0x3320646e,0x79622d32,0x6b206574 @ endian-neutral -.Lone: -.long 1,0,0,0 -.word -1 - -.align 5 -ENTRY(chacha20_arm) - ldr r12,[sp,#0] @ pull pointer to counter and nonce - stmdb sp!,{r0-r2,r4-r11,lr} - cmp r2,#0 @ len==0? -#ifdef __thumb2__ - itt eq -#endif - addeq sp,sp,#4*3 - beq .Lno_data_arm - ldmia r12,{r4-r7} @ load counter and nonce - sub sp,sp,#4*(16) @ off-load area -#if __LINUX_ARM_ARCH__ < 7 && !defined(__thumb2__) - sub r14,pc,#100 @ .Lsigma -#else - adr r14,.Lsigma @ .Lsigma -#endif - stmdb sp!,{r4-r7} @ copy counter and nonce - ldmia r3,{r4-r11} @ load key - ldmia r14,{r0-r3} @ load sigma - stmdb sp!,{r4-r11} @ copy key - stmdb sp!,{r0-r3} @ copy sigma - str r10,[sp,#4*(16+10)] @ off-load "rx" - str r11,[sp,#4*(16+11)] @ off-load "rx" - b .Loop_outer_enter - .align 4 .Loop_outer: ldmia sp,{r0-r9} @ load key material @@ -748,11 +1178,8 @@ ENTRY(chacha20_arm) .Ldone: add sp,sp,#4*(32+3) -.Lno_data_arm: ldmia sp!,{r4-r11,pc} -ENDPROC(chacha20_arm) -#ifdef CONFIG_KERNEL_MODE_NEON .align 5 .Lsigma2: .long 0x61707865,0x3320646e,0x79622d32,0x6b206574 @ endian-neutral diff --git a/lib/zinc/chacha20/chacha20-arm64-cryptogams.S b/lib/zinc/chacha20/chacha20-arm64.S similarity index 100% rename from lib/zinc/chacha20/chacha20-arm64-cryptogams.S rename to lib/zinc/chacha20/chacha20-arm64.S diff --git a/lib/zinc/chacha20/chacha20.c b/lib/zinc/chacha20/chacha20.c index 22a21431c221..3698fcd8ae7f 100644 --- a/lib/zinc/chacha20/chacha20.c +++ b/lib/zinc/chacha20/chacha20.c @@ -18,6 +18,8 @@ #if defined(CONFIG_ZINC_ARCH_X86_64) #include "chacha20-x86_64-glue.c" +#elif defined(CONFIG_ZINC_ARCH_ARM) || defined(CONFIG_ZINC_ARCH_ARM64) +#include "chacha20-arm-glue.c" #else static bool *const chacha20_nobs[] __initconst = { }; static void __init chacha20_fpu_init(void) From patchwork Sat Oct 6 02:56:51 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: "Jason A. 
Donenfeld" X-Patchwork-Id: 148306 Delivered-To: patch@linaro.org Received: by 2002:a2e:8595:0:0:0:0:0 with SMTP id b21-v6csp1157478lji; Fri, 5 Oct 2018 19:58:05 -0700 (PDT) X-Google-Smtp-Source: ACcGV60EG9XY8B4lOrmCOxhWjEOHotUZ0p4AIElAyZheHEhQ+3RWuxKb6m8cLsBjtk9k8g6TUsPe X-Received: by 2002:a62:68c3:: with SMTP id d186-v6mr14665451pfc.70.1538794685722; Fri, 05 Oct 2018 19:58:05 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1538794685; cv=none; d=google.com; s=arc-20160816; b=gP2KxEK7ZltOKvP2yX6TzPjMeurtTg49w3osku05V74XYVwtHAR7ZXOwch/DL3VVMj W8pnDZnHOI+DjD8j4P+axKBc39o0bSlMEW0DOiyt7HSRhBCzP/PPSy9zPBSKRhmND3cM NAWkWN8jQeRHHdwugy9egBcJf5m2lnlTlLubVRx8It7XLG9i1xAC01TW0AX8DGqDJs5h kE4CmOwoEuKVEmVwTgZoe57i1syg6mTdHHyJmBMlKJdp4wnLSM7VSr4vl6oXCM6tli4p VcB1qxnmV+bgrBp1PD8ApdlbZMzENzUiuAI18RztCT5WlP2Swo9gwoGQsSuCAp8EhBPg qVxw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from :dkim-signature; bh=cqZjMxLdnjP58SSi1yJp0VpOTqeAn3TimjzTAWw8rCI=; b=TACCEr2RB5U0ol7P9DPMPbeGwFO1QzMfnHMl7vgcRsEOcxwaodW7e7wkG0noYoBtt+ HaL8bjU0RAaj+ukvqU3P30r0bJc5L2bbAkyjoJxn3JmW+l+JNkC4j/jlX8VU28U/BTJP 6PGwSUMV2WXBTRnf/VqGPLRlnAohG4nZY0RbeaJZ9k95rFEDtExQVsnMFyTH661hw3WZ k8iNYBj/qZBsI2FLjbfOvu3wXfXKxY6k8GsZSPFyeq44pdpqugIF8tRYpwAO0qwBDrk+ wJsBX+/hbdQLcuvo57PoamcFhzP4VhDhBxaj2HRRe1pYSjMAfB3yPQ0ngenOX12vrn2B mcTg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@zx2c4.com header.s=mail header.b=gixXm+U5; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=zx2c4.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. 
[209.132.180.67]) by mx.google.com with ESMTP id q5-v6si8788661pgg.105.2018.10.05.19.58.05; Fri, 05 Oct 2018 19:58:05 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@zx2c4.com header.s=mail header.b=gixXm+U5; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=zx2c4.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729665AbeJFJ7d (ORCPT + 32 others); Sat, 6 Oct 2018 05:59:33 -0400 Received: from frisell.zx2c4.com ([192.95.5.64]:50073 "EHLO frisell.zx2c4.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729562AbeJFJ7d (ORCPT ); Sat, 6 Oct 2018 05:59:33 -0400 Received: by frisell.zx2c4.com (ZX2C4 Mail Server) with ESMTP id ff105865; Sat, 6 Oct 2018 02:57:29 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=zx2c4.com; h=from:to:cc :subject:date:message-id:in-reply-to:references:mime-version :content-type:content-transfer-encoding; s=mail; bh=KpwH8NF0fwsV PvpV3wJNynv5LzU=; b=gixXm+U5Sm3i47NPAQbrJaQrihPHjuoIyS1IrSf7gqlG VsXOP6tyLcqazkQO6y0jjgc6lQprAOieB8ZBr31/h8p0LEXzhMWd6PZWqIkKXelW /VTI8js2Fq0Zen6uakUU4rTevvj4F4GXb6dRSRMUPwJUS27esyA0IBbWxG0STlub xex8ghgAnI0VY91Numlw/zc/NI2eOZ86SYpnimN85PAke8G48FeK8gJgxOypKCW+ hh3aSWmRhJZGd33XBlgnwjFplispw/vl++SRGKMWrJV2wkd3LgOAK7avK0SFXHXA pWGp154EFvNMNIdbhbaO2qOu8wZThwRlEJEH7tnSEg== Received: by frisell.zx2c4.com (ZX2C4 Mail Server) with ESMTPSA id aa6459ae (TLSv1.2:ECDHE-RSA-AES256-GCM-SHA384:256:NO); Sat, 6 Oct 2018 02:57:27 +0000 (UTC) From: "Jason A. Donenfeld" To: linux-kernel@vger.kernel.org, netdev@vger.kernel.org, davem@davemloft.net, gregkh@linuxfoundation.org Cc: "Jason A. Donenfeld" , =?utf-8?q?Ren=C3=A9_van_Dorst?= , Ralf Baechle , Paul Burton , James Hogan , linux-mips@linux-mips.org, Samuel Neves , Jean-Philippe Aumasson , Andy Lutomirski , Andrew Morton , Linus Torvalds , kernel-hardening@lists.openwall.com, linux-crypto@vger.kernel.org Subject: [PATCH net-next v7 10/28] zinc: ChaCha20 MIPS32r2 implementation Date: Sat, 6 Oct 2018 04:56:51 +0200 Message-Id: <20181006025709.4019-11-Jason@zx2c4.com> In-Reply-To: <20181006025709.4019-1-Jason@zx2c4.com> References: <20181006025709.4019-1-Jason@zx2c4.com> MIME-Version: 1.0 Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org This MIPS32r2 implementation comes from René van Dorst and me and results in a nice speedup on the usual OpenWRT targets. Signed-off-by: Jason A. 
Donenfeld Signed-off-by: René van Dorst Co-developed-by: René van Dorst Cc: Ralf Baechle Cc: Paul Burton Cc: James Hogan Cc: linux-mips@linux-mips.org Cc: Samuel Neves Cc: Jean-Philippe Aumasson Cc: Andy Lutomirski Cc: Greg KH Cc: Andrew Morton Cc: Linus Torvalds Cc: kernel-hardening@lists.openwall.com Cc: linux-crypto@vger.kernel.org --- lib/zinc/Makefile | 2 + lib/zinc/chacha20/chacha20-mips-glue.c | 28 ++ lib/zinc/chacha20/chacha20-mips.S | 424 +++++++++++++++++++++++++ lib/zinc/chacha20/chacha20.c | 2 + 4 files changed, 456 insertions(+) create mode 100644 lib/zinc/chacha20/chacha20-mips-glue.c create mode 100644 lib/zinc/chacha20/chacha20-mips.S -- 2.19.0
diff --git a/lib/zinc/Makefile b/lib/zinc/Makefile index e47f64e12bbd..60d568cf5206 100644 --- a/lib/zinc/Makefile +++ b/lib/zinc/Makefile @@ -6,4 +6,6 @@ zinc_chacha20-y := chacha20/chacha20.o zinc_chacha20-$(CONFIG_ZINC_ARCH_X86_64) += chacha20/chacha20-x86_64.o zinc_chacha20-$(CONFIG_ZINC_ARCH_ARM) += chacha20/chacha20-arm.o zinc_chacha20-$(CONFIG_ZINC_ARCH_ARM64) += chacha20/chacha20-arm64.o +zinc_chacha20-$(CONFIG_ZINC_ARCH_MIPS) += chacha20/chacha20-mips.o +AFLAGS_chacha20-mips.o += -O2 # This is required to fill the branch delay slots obj-$(CONFIG_ZINC_CHACHA20) += zinc_chacha20.o
diff --git a/lib/zinc/chacha20/chacha20-mips-glue.c b/lib/zinc/chacha20/chacha20-mips-glue.c new file mode 100644 index 000000000000..917d8fa8e3f4 --- /dev/null +++ b/lib/zinc/chacha20/chacha20-mips-glue.c @@ -0,0 +1,28 @@ +// SPDX-License-Identifier: GPL-2.0 OR MIT +/* + * Copyright (C) 2015-2018 Jason A. Donenfeld . All Rights Reserved. + */ + +asmlinkage void chacha20_mips(u32 state[16], u8 *out, const u8 *in, + const size_t len); +static bool *const chacha20_nobs[] __initconst = { }; +static void __init chacha20_fpu_init(void) +{ +} + +static inline bool chacha20_arch(struct chacha20_ctx *ctx, u8 *dst, + const u8 *src, size_t len, + simd_context_t *simd_context) +{ + chacha20_mips(ctx->state, dst, src, len); + return true; +} + +static inline bool hchacha20_arch(u32 derived_key[CHACHA20_KEY_WORDS], + const u8 nonce[HCHACHA20_NONCE_SIZE], + const u8 key[HCHACHA20_KEY_SIZE], + simd_context_t *simd_context) +{ + return false; +}
diff --git a/lib/zinc/chacha20/chacha20-mips.S b/lib/zinc/chacha20/chacha20-mips.S new file mode 100644 index 000000000000..031ee5e794df --- /dev/null +++ b/lib/zinc/chacha20/chacha20-mips.S @@ -0,0 +1,424 @@ +/* SPDX-License-Identifier: GPL-2.0 OR MIT */ +/* + * Copyright (C) 2016-2018 René van Dorst . All Rights Reserved. + * Copyright (C) 2015-2018 Jason A. Donenfeld . All Rights Reserved. + */ + +#define MASK_U32 0x3c +#define CHACHA20_BLOCK_SIZE 64 +#define STACK_SIZE 32 + +#define X0 $t0 +#define X1 $t1 +#define X2 $t2 +#define X3 $t3 +#define X4 $t4 +#define X5 $t5 +#define X6 $t6 +#define X7 $t7 +#define X8 $t8 +#define X9 $t9 +#define X10 $v1 +#define X11 $s6 +#define X12 $s5 +#define X13 $s4 +#define X14 $s3 +#define X15 $s2 +/* Use regs which are overwritten on exit for Tx so we don't leak clear data. */ +#define T0 $s1 +#define T1 $s0 +#define T(n) T ## n +#define X(n) X ## n + +/* Input arguments */ +#define STATE $a0 +#define OUT $a1 +#define IN $a2 +#define BYTES $a3 + +/* Output argument */ +/* NONCE[0] is kept in a register and not in memory. + * We don't want to touch the original value in memory. + * Must be incremented every loop iteration. + */ +#define NONCE_0 $v0 + +/* SAVED_X and SAVED_CA are set in the jump table. + * Use regs which are overwritten on exit so we don't leak clear data.
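+ * (Added note: each JMPTBL_ALIGNED/JMPTBL_UNALIGNED entry below is a
+ * branch plus an addu placed in its branch delay slot via .set noreorder,
+ * and it is that addu which computes SAVED_X as X(n) + SAVED_CA (or
+ * + NONCE_0 for word 12), so SAVED_X is ready before the byte handling
+ * code runs.)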
+ * They are used to handle the last bytes, which are not a multiple of 4. + */ +#define SAVED_X X15 +#define SAVED_CA $s7 + +#define IS_UNALIGNED $s7 +
+#if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__ +#define MSB 0 +#define LSB 3 +#define ROTx rotl +#define ROTR(n) rotr n, 24 +#define CPU_TO_LE32(n) \ + wsbh n; \ + rotr n, 16; +#else +#define MSB 3 +#define LSB 0 +#define ROTx rotr +#define CPU_TO_LE32(n) +#define ROTR(n) +#endif +
+#define FOR_EACH_WORD(x) \ + x( 0); \ + x( 1); \ + x( 2); \ + x( 3); \ + x( 4); \ + x( 5); \ + x( 6); \ + x( 7); \ + x( 8); \ + x( 9); \ + x(10); \ + x(11); \ + x(12); \ + x(13); \ + x(14); \ + x(15); +
+#define FOR_EACH_WORD_REV(x) \ + x(15); \ + x(14); \ + x(13); \ + x(12); \ + x(11); \ + x(10); \ + x( 9); \ + x( 8); \ + x( 7); \ + x( 6); \ + x( 5); \ + x( 4); \ + x( 3); \ + x( 2); \ + x( 1); \ + x( 0); +
+#define PLUS_ONE_0 1 +#define PLUS_ONE_1 2 +#define PLUS_ONE_2 3 +#define PLUS_ONE_3 4 +#define PLUS_ONE_4 5 +#define PLUS_ONE_5 6 +#define PLUS_ONE_6 7 +#define PLUS_ONE_7 8 +#define PLUS_ONE_8 9 +#define PLUS_ONE_9 10 +#define PLUS_ONE_10 11 +#define PLUS_ONE_11 12 +#define PLUS_ONE_12 13 +#define PLUS_ONE_13 14 +#define PLUS_ONE_14 15 +#define PLUS_ONE_15 16 +#define PLUS_ONE(x) PLUS_ONE_ ## x +#define _CONCAT3(a,b,c) a ## b ## c +#define CONCAT3(a,b,c) _CONCAT3(a,b,c) +
+#define STORE_UNALIGNED(x) \ +CONCAT3(.Lchacha20_mips_xor_unaligned_, PLUS_ONE(x), _b: ;) \ + .if (x != 12); \ + lw T0, (x*4)(STATE); \ + .endif; \ + lwl T1, (x*4)+MSB ## (IN); \ + lwr T1, (x*4)+LSB ## (IN); \ + .if (x == 12); \ + addu X ## x, NONCE_0; \ + .else; \ + addu X ## x, T0; \ + .endif; \ + CPU_TO_LE32(X ## x); \ + xor X ## x, T1; \ + swl X ## x, (x*4)+MSB ## (OUT); \ + swr X ## x, (x*4)+LSB ## (OUT); +
+#define STORE_ALIGNED(x) \ +CONCAT3(.Lchacha20_mips_xor_aligned_, PLUS_ONE(x), _b: ;) \ + .if (x != 12); \ + lw T0, (x*4)(STATE); \ + .endif; \ + lw T1, (x*4) ## (IN); \ + .if (x == 12); \ + addu X ## x, NONCE_0; \ + .else; \ + addu X ## x, T0; \ + .endif; \ + CPU_TO_LE32(X ## x); \ + xor X ## x, T1; \ + sw X ## x, (x*4) ## (OUT); +
+/* Jump table macro. + * Used for setup and handling the last bytes, which are not a multiple of 4. + * X15 is free to store Xn + * Every jumptable entry must be equal in size. + */ +#define JMPTBL_ALIGNED(x) \ +.Lchacha20_mips_jmptbl_aligned_ ## x: ; \ + .set noreorder; \ + b .Lchacha20_mips_xor_aligned_ ## x ## _b; \ + .if (x == 12); \ + addu SAVED_X, X ## x, NONCE_0; \ + .else; \ + addu SAVED_X, X ## x, SAVED_CA; \ + .endif; \ + .set reorder +
+#define JMPTBL_UNALIGNED(x) \ +.Lchacha20_mips_jmptbl_unaligned_ ## x: ; \ + .set noreorder; \ + b .Lchacha20_mips_xor_unaligned_ ## x ## _b; \ + .if (x == 12); \ + addu SAVED_X, X ## x, NONCE_0; \ + .else; \ + addu SAVED_X, X ## x, SAVED_CA; \ + .endif; \ + .set reorder +
+#define AXR(A, B, C, D, K, L, M, N, V, W, Y, Z, S) \ + addu X(A), X(K); \ + addu X(B), X(L); \ + addu X(C), X(M); \ + addu X(D), X(N); \ + xor X(V), X(A); \ + xor X(W), X(B); \ + xor X(Y), X(C); \ + xor X(Z), X(D); \ + rotl X(V), S; \ + rotl X(W), S; \ + rotl X(Y), S; \ + rotl X(Z), S; +
+.text +.set reorder +.set noat +.globl chacha20_mips +.ent chacha20_mips +chacha20_mips: + .frame $sp, STACK_SIZE, $ra + + addiu $sp, -STACK_SIZE + + /* Return if bytes == 0. */ + beqz BYTES, .Lchacha20_mips_end + + lw NONCE_0, 48(STATE) + + /* Save s0-s7 */ + sw $s0, 0($sp) + sw $s1, 4($sp) + sw $s2, 8($sp) + sw $s3, 12($sp) + sw $s4, 16($sp) + sw $s5, 20($sp) + sw $s6, 24($sp) + sw $s7, 28($sp) + + /* Test whether IN or OUT is unaligned.
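+ * (Added note: OR-ing the two addresses lets a single andi test both at
+ * once; a result of zero means IN and OUT are both word-aligned.)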
+ * IS_UNALIGNED = ( IN | OUT ) & 0x00000003 + */ + or IS_UNALIGNED, IN, OUT + andi IS_UNALIGNED, 0x3 + + /* Set number of rounds */ + li $at, 20 + + b .Lchacha20_rounds_start +
+.align 4 +.Loop_chacha20_rounds: + addiu IN, CHACHA20_BLOCK_SIZE + addiu OUT, CHACHA20_BLOCK_SIZE + addiu NONCE_0, 1 +
+.Lchacha20_rounds_start: + lw X0, 0(STATE) + lw X1, 4(STATE) + lw X2, 8(STATE) + lw X3, 12(STATE) + + lw X4, 16(STATE) + lw X5, 20(STATE) + lw X6, 24(STATE) + lw X7, 28(STATE) + lw X8, 32(STATE) + lw X9, 36(STATE) + lw X10, 40(STATE) + lw X11, 44(STATE) + + move X12, NONCE_0 + lw X13, 52(STATE) + lw X14, 56(STATE) + lw X15, 60(STATE) +
+.Loop_chacha20_xor_rounds: + addiu $at, -2 + AXR( 0, 1, 2, 3, 4, 5, 6, 7, 12,13,14,15, 16); + AXR( 8, 9,10,11, 12,13,14,15, 4, 5, 6, 7, 12); + AXR( 0, 1, 2, 3, 4, 5, 6, 7, 12,13,14,15, 8); + AXR( 8, 9,10,11, 12,13,14,15, 4, 5, 6, 7, 7); + AXR( 0, 1, 2, 3, 5, 6, 7, 4, 15,12,13,14, 16); + AXR(10,11, 8, 9, 15,12,13,14, 5, 6, 7, 4, 12); + AXR( 0, 1, 2, 3, 5, 6, 7, 4, 15,12,13,14, 8); + AXR(10,11, 8, 9, 15,12,13,14, 5, 6, 7, 4, 7); + bnez $at, .Loop_chacha20_xor_rounds + + addiu BYTES, -(CHACHA20_BLOCK_SIZE) + + /* Is data src/dst unaligned? Jump */ + bnez IS_UNALIGNED, .Loop_chacha20_unaligned + + /* Set number of rounds here to fill the delay slot. */ + li $at, 20 + + /* If BYTES < 0, there is no full block. */ + bltz BYTES, .Lchacha20_mips_no_full_block_aligned + + FOR_EACH_WORD_REV(STORE_ALIGNED) + + /* BYTES > 0? Loop again. */ + bgtz BYTES, .Loop_chacha20_rounds + + /* Place this here to fill delay slot */ + addiu NONCE_0, 1 + + /* BYTES < 0? Handle last bytes */ + bltz BYTES, .Lchacha20_mips_xor_bytes +
+.Lchacha20_mips_xor_done: + /* Restore used registers */ + lw $s0, 0($sp) + lw $s1, 4($sp) + lw $s2, 8($sp) + lw $s3, 12($sp) + lw $s4, 16($sp) + lw $s5, 20($sp) + lw $s6, 24($sp) + lw $s7, 28($sp) + + /* Write NONCE_0 back to right location in state */ + sw NONCE_0, 48(STATE) +
+.Lchacha20_mips_end: + addiu $sp, STACK_SIZE + jr $ra +
+.Lchacha20_mips_no_full_block_aligned: + /* Restore the offset on BYTES */ + addiu BYTES, CHACHA20_BLOCK_SIZE + + /* Get number of full WORDS */ + andi $at, BYTES, MASK_U32 + + /* Load upper half of jump table addr */ + lui T0, %hi(.Lchacha20_mips_jmptbl_aligned_0) + + /* Calculate lower half jump table offset */ + ins T0, $at, 1, 6 + + /* Add offset to STATE */ + addu T1, STATE, $at + + /* Add lower half jump table addr */ + addiu T0, %lo(.Lchacha20_mips_jmptbl_aligned_0) + + /* Read value from STATE */ + lw SAVED_CA, 0(T1) + + /* Store remaining bytecounter as negative value */ + subu BYTES, $at, BYTES + + jr T0 + + /* Jump table */ + FOR_EACH_WORD(JMPTBL_ALIGNED) + +
+.Loop_chacha20_unaligned: + /* Set number of rounds here to fill the delay slot. */ + li $at, 20 + + /* If BYTES < 0, there is no full block. */ + bltz BYTES, .Lchacha20_mips_no_full_block_unaligned + + FOR_EACH_WORD_REV(STORE_UNALIGNED) + + /* BYTES > 0? Loop again.
*/ + bgtz BYTES, .Loop_chacha20_rounds + + /* Write NONCE_0 back to right location in state */ + sw NONCE_0, 48(STATE) + + .set noreorder + /* Fall through to byte handling */ + bgez BYTES, .Lchacha20_mips_xor_done +.Lchacha20_mips_xor_unaligned_0_b: +.Lchacha20_mips_xor_aligned_0_b: + /* Place this here to fill delay slot */ + addiu NONCE_0, 1 + .set reorder + +.Lchacha20_mips_xor_bytes: + addu IN, $at + addu OUT, $at + /* First byte */ + lbu T1, 0(IN) + addiu $at, BYTES, 1 + CPU_TO_LE32(SAVED_X) + ROTR(SAVED_X) + xor T1, SAVED_X + sb T1, 0(OUT) + beqz $at, .Lchacha20_mips_xor_done + /* Second byte */ + lbu T1, 1(IN) + addiu $at, BYTES, 2 + ROTx SAVED_X, 8 + xor T1, SAVED_X + sb T1, 1(OUT) + beqz $at, .Lchacha20_mips_xor_done + /* Third byte */ + lbu T1, 2(IN) + ROTx SAVED_X, 8 + xor T1, SAVED_X + sb T1, 2(OUT) + b .Lchacha20_mips_xor_done + +.Lchacha20_mips_no_full_block_unaligned: + /* Restore the offset on BYTES */ + addiu BYTES, CHACHA20_BLOCK_SIZE + + /* Get number of full WORDS */ + andi $at, BYTES, MASK_U32 + + /* Load upper half of jump table addr */ + lui T0, %hi(.Lchacha20_mips_jmptbl_unaligned_0) + + /* Calculate lower half jump table offset */ + ins T0, $at, 1, 6 + + /* Add offset to STATE */ + addu T1, STATE, $at + + /* Add lower half jump table addr */ + addiu T0, %lo(.Lchacha20_mips_jmptbl_unaligned_0) + + /* Read value from STATE */ + lw SAVED_CA, 0(T1) + + /* Store remaining bytecounter as negative value */ + subu BYTES, $at, BYTES + + jr T0 + + /* Jump table */ + FOR_EACH_WORD(JMPTBL_UNALIGNED) +.end chacha20_mips +.set at diff --git a/lib/zinc/chacha20/chacha20.c b/lib/zinc/chacha20/chacha20.c index 3698fcd8ae7f..0b833310a7d8 100644 --- a/lib/zinc/chacha20/chacha20.c +++ b/lib/zinc/chacha20/chacha20.c @@ -20,6 +20,8 @@ #include "chacha20-x86_64-glue.c" #elif defined(CONFIG_ZINC_ARCH_ARM) || defined(CONFIG_ZINC_ARCH_ARM64) #include "chacha20-arm-glue.c" +#elif defined(CONFIG_ZINC_ARCH_MIPS) +#include "chacha20-mips-glue.c" #else static bool *const chacha20_nobs[] __initconst = { }; static void __init chacha20_fpu_init(void) From patchwork Sat Oct 6 02:56:53 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Jason A. 
Donenfeld" X-Patchwork-Id: 148309 Delivered-To: patch@linaro.org Received: by 2002:a2e:8595:0:0:0:0:0 with SMTP id b21-v6csp1157597lji; Fri, 5 Oct 2018 19:58:15 -0700 (PDT) X-Google-Smtp-Source: ACcGV60xJTEOMPn8SP2HuEkShzO3UGBPvTv//LK3nPDRA2UF8afo9Yqff8cSGlRm6wJl9LPHphKR X-Received: by 2002:a62:9c4a:: with SMTP id f71-v6mr14938787pfe.135.1538794695209; Fri, 05 Oct 2018 19:58:15 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1538794695; cv=none; d=google.com; s=arc-20160816; b=i8/a3yFyZdMQ6TXHMTH0BmndebmsPbu20NW6sNztFcz/XRCpO3a3SqxulwF+DCdJXB lR8qFG5a/JsEisM7KNc/lCR0H4kQ/6wuDV5YU81I71Mfu1ZbJG2S+f2IZt4wrJjvipYR 0Quk4kkp3QJyAXWnsg61xdAzMblgsYV+VYA3c9qlyZaAM7D+vo910a4oSuv5z75Yg0Xz 05z9ddQbovR7+LwH+ZViEnCAi8pn81koisj/GBGkghNMzFMAJ6daVjke7tfVtG0elD2H DNm3GOtBOxLBJ0IImGcdQoMl3ZYAiHNnnxB0njIkg71BvpDgk7ED46JS/N/SvWLoLAVW t0bg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from :dkim-signature; bh=GXEgdmPnzOjkJxvM4t3nqD6K8Web3IRDTr58rwRqvMY=; b=aG8/cme1sJ7F7LU38XTLGvl5uj/AGQp6xdl3c3D7bNAl3zaKU+oMfhgpTWw1qTSPmx JXXpPzjUPKDt6nMhIOc0rUjHfrvCG46I7jtnn2pkfu6V8s+JWZo+Pt5rvJpXiLxGaBZm YbrQ1etCSqpyx90kVXp28ZTlvqYRH4PgNwFFhtQyOW7+q0MbWjcL3wi6nSjjOdR2bLRX Qchdqi2I/SVcMBpq9yzTnlhDEMhoglPSUVnGI/y8u0Yd1wanSb073QPH4B8Qu0UUCQg0 FYu6QV4VOG4b91GyCyIhbkPbuwPQPvi3nNyfSLE3JI3txqWvI+dT/tk7jZ0SJxWX8JmF XXVg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@zx2c4.com header.s=mail header.b=GG+g5g10; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=zx2c4.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. 
[209.132.180.67]) by mx.google.com with ESMTP id l191-v6si9434041pgd.543.2018.10.05.19.58.14; Fri, 05 Oct 2018 19:58:15 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@zx2c4.com header.s=mail header.b=GG+g5g10; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=zx2c4.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729723AbeJFJ7n (ORCPT + 32 others); Sat, 6 Oct 2018 05:59:43 -0400 Received: from frisell.zx2c4.com ([192.95.5.64]:50073 "EHLO frisell.zx2c4.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729673AbeJFJ7m (ORCPT ); Sat, 6 Oct 2018 05:59:42 -0400 Received: by frisell.zx2c4.com (ZX2C4 Mail Server) with ESMTP id feb22854; Sat, 6 Oct 2018 02:57:35 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=zx2c4.com; h=from:to:cc :subject:date:message-id:in-reply-to:references:mime-version :content-transfer-encoding; s=mail; bh=2mgBFc+IXgc1eJuL5ePgG8F8B k4=; b=GG+g5g10OFk23LHGe4HYEQxLXpGiljqHUwJbf3bvWPIz75HWGzW/7JCGh JtNGyVcoW1eBVSguwrqc0QdE2+K+PhZi/uNVNrKl7JfiHh3qPr2QSsS+tWXoLd5Y A5iUiD+GLoAwa1jrZodLd9PVT/GcBj10SHrNf0+a+J3OmnOLYRke0IBMatfG3R4p S6YFUdd94razc2f/r4G1D4eMb6rKXvPT75Yo/ESrtiphRfBbPqhBFeqYeH6Vtd0f mqt1/VBMrHZG22LfSVdKBzWXO9dxk3TeEBlUCsgFxmFq4zJMEwAHykQm3vkxNbzN JEaSRI4QErVGgPHOGekKSWnYC0wdg== Received: by frisell.zx2c4.com (ZX2C4 Mail Server) with ESMTPSA id 33941515 (TLSv1.2:ECDHE-RSA-AES256-GCM-SHA384:256:NO); Sat, 6 Oct 2018 02:57:34 +0000 (UTC) From: "Jason A. Donenfeld" To: linux-kernel@vger.kernel.org, netdev@vger.kernel.org, davem@davemloft.net, gregkh@linuxfoundation.org Cc: "Jason A. Donenfeld" , Andy Polyakov , Thomas Gleixner , Ingo Molnar , x86@kernel.org, Samuel Neves , Jean-Philippe Aumasson , Andy Lutomirski , Andrew Morton , Linus Torvalds , kernel-hardening@lists.openwall.com, linux-crypto@vger.kernel.org Subject: [PATCH net-next v7 12/28] zinc: import Andy Polyakov's Poly1305 x86_64 implementation Date: Sat, 6 Oct 2018 04:56:53 +0200 Message-Id: <20181006025709.4019-13-Jason@zx2c4.com> In-Reply-To: <20181006025709.4019-1-Jason@zx2c4.com> References: <20181006025709.4019-1-Jason@zx2c4.com> MIME-Version: 1.0 Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org These x86_64 vectorized implementations come from Andy Polyakov's implementation, and are included here in raw form without modification, so that subsequent commits that fix these up for the kernel can see how it has changed. While this is CRYPTOGAMS code, the originating code for this happens to be the same as OpenSSL's commit 4dfe4310c31c4483705991d9a798ce9be1ed1c68 Signed-off-by: Jason A. 
Donenfeld Based-on-code-from: Andy Polyakov Cc: Andy Polyakov Cc: Thomas Gleixner Cc: Ingo Molnar Cc: x86@kernel.org Cc: Samuel Neves Cc: Jean-Philippe Aumasson Cc: Andy Lutomirski Cc: Greg KH Cc: Andrew Morton Cc: Linus Torvalds Cc: kernel-hardening@lists.openwall.com Cc: linux-crypto@vger.kernel.org --- .../poly1305/poly1305-x86_64-cryptogams.S | 3565 +++++++++++++++++ 1 file changed, 3565 insertions(+) create mode 100644 lib/zinc/poly1305/poly1305-x86_64-cryptogams.S -- 2.19.0 diff --git a/lib/zinc/poly1305/poly1305-x86_64-cryptogams.S b/lib/zinc/poly1305/poly1305-x86_64-cryptogams.S new file mode 100644 index 000000000000..ed634757354b --- /dev/null +++ b/lib/zinc/poly1305/poly1305-x86_64-cryptogams.S @@ -0,0 +1,3565 @@ +/* SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause */ +/* + * Copyright (C) 2006-2017 CRYPTOGAMS by . All Rights Reserved. + */ + +.text + + + +.globl poly1305_init +.hidden poly1305_init +.globl poly1305_blocks +.hidden poly1305_blocks +.globl poly1305_emit +.hidden poly1305_emit + +.type poly1305_init,@function +.align 32 +poly1305_init: + xorq %rax,%rax + movq %rax,0(%rdi) + movq %rax,8(%rdi) + movq %rax,16(%rdi) + + cmpq $0,%rsi + je .Lno_key + + leaq poly1305_blocks(%rip),%r10 + leaq poly1305_emit(%rip),%r11 + movq OPENSSL_ia32cap_P+4(%rip),%r9 + leaq poly1305_blocks_avx(%rip),%rax + leaq poly1305_emit_avx(%rip),%rcx + btq $28,%r9 + cmovcq %rax,%r10 + cmovcq %rcx,%r11 + leaq poly1305_blocks_avx2(%rip),%rax + btq $37,%r9 + cmovcq %rax,%r10 + movq $2149646336,%rax + shrq $32,%r9 + andq %rax,%r9 + cmpq %rax,%r9 + je .Linit_base2_44 + movq $0x0ffffffc0fffffff,%rax + movq $0x0ffffffc0ffffffc,%rcx + andq 0(%rsi),%rax + andq 8(%rsi),%rcx + movq %rax,24(%rdi) + movq %rcx,32(%rdi) + movq %r10,0(%rdx) + movq %r11,8(%rdx) + movl $1,%eax +.Lno_key: + .byte 0xf3,0xc3 +.size poly1305_init,.-poly1305_init + +.type poly1305_blocks,@function +.align 32 +poly1305_blocks: +.cfi_startproc +.Lblocks: + shrq $4,%rdx + jz .Lno_data + + pushq %rbx +.cfi_adjust_cfa_offset 8 +.cfi_offset %rbx,-16 + pushq %rbp +.cfi_adjust_cfa_offset 8 +.cfi_offset %rbp,-24 + pushq %r12 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r12,-32 + pushq %r13 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r13,-40 + pushq %r14 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r14,-48 + pushq %r15 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r15,-56 +.Lblocks_body: + + movq %rdx,%r15 + + movq 24(%rdi),%r11 + movq 32(%rdi),%r13 + + movq 0(%rdi),%r14 + movq 8(%rdi),%rbx + movq 16(%rdi),%rbp + + movq %r13,%r12 + shrq $2,%r13 + movq %r12,%rax + addq %r12,%r13 + jmp .Loop + +.align 32 +.Loop: + addq 0(%rsi),%r14 + adcq 8(%rsi),%rbx + leaq 16(%rsi),%rsi + adcq %rcx,%rbp + mulq %r14 + movq %rax,%r9 + movq %r11,%rax + movq %rdx,%r10 + + mulq %r14 + movq %rax,%r14 + movq %r11,%rax + movq %rdx,%r8 + + mulq %rbx + addq %rax,%r9 + movq %r13,%rax + adcq %rdx,%r10 + + mulq %rbx + movq %rbp,%rbx + addq %rax,%r14 + adcq %rdx,%r8 + + imulq %r13,%rbx + addq %rbx,%r9 + movq %r8,%rbx + adcq $0,%r10 + + imulq %r11,%rbp + addq %r9,%rbx + movq $-4,%rax + adcq %rbp,%r10 + + andq %r10,%rax + movq %r10,%rbp + shrq $2,%r10 + andq $3,%rbp + addq %r10,%rax + addq %rax,%r14 + adcq $0,%rbx + adcq $0,%rbp + movq %r12,%rax + decq %r15 + jnz .Loop + + movq %r14,0(%rdi) + movq %rbx,8(%rdi) + movq %rbp,16(%rdi) + + movq 0(%rsp),%r15 +.cfi_restore %r15 + movq 8(%rsp),%r14 +.cfi_restore %r14 + movq 16(%rsp),%r13 +.cfi_restore %r13 + movq 24(%rsp),%r12 +.cfi_restore %r12 + movq 32(%rsp),%rbp +.cfi_restore %rbp + movq 40(%rsp),%rbx +.cfi_restore %rbx + leaq 48(%rsp),%rsp 
+.cfi_adjust_cfa_offset -48 +.Lno_data: +.Lblocks_epilogue: + .byte 0xf3,0xc3 +.cfi_endproc +.size poly1305_blocks,.-poly1305_blocks + +.type poly1305_emit,@function +.align 32 +poly1305_emit: +.Lemit: + movq 0(%rdi),%r8 + movq 8(%rdi),%r9 + movq 16(%rdi),%r10 + + movq %r8,%rax + addq $5,%r8 + movq %r9,%rcx + adcq $0,%r9 + adcq $0,%r10 + shrq $2,%r10 + cmovnzq %r8,%rax + cmovnzq %r9,%rcx + + addq 0(%rdx),%rax + adcq 8(%rdx),%rcx + movq %rax,0(%rsi) + movq %rcx,8(%rsi) + + .byte 0xf3,0xc3 +.size poly1305_emit,.-poly1305_emit +.type __poly1305_block,@function +.align 32 +__poly1305_block: + mulq %r14 + movq %rax,%r9 + movq %r11,%rax + movq %rdx,%r10 + + mulq %r14 + movq %rax,%r14 + movq %r11,%rax + movq %rdx,%r8 + + mulq %rbx + addq %rax,%r9 + movq %r13,%rax + adcq %rdx,%r10 + + mulq %rbx + movq %rbp,%rbx + addq %rax,%r14 + adcq %rdx,%r8 + + imulq %r13,%rbx + addq %rbx,%r9 + movq %r8,%rbx + adcq $0,%r10 + + imulq %r11,%rbp + addq %r9,%rbx + movq $-4,%rax + adcq %rbp,%r10 + + andq %r10,%rax + movq %r10,%rbp + shrq $2,%r10 + andq $3,%rbp + addq %r10,%rax + addq %rax,%r14 + adcq $0,%rbx + adcq $0,%rbp + .byte 0xf3,0xc3 +.size __poly1305_block,.-__poly1305_block + +.type __poly1305_init_avx,@function +.align 32 +__poly1305_init_avx: + movq %r11,%r14 + movq %r12,%rbx + xorq %rbp,%rbp + + leaq 48+64(%rdi),%rdi + + movq %r12,%rax + call __poly1305_block + + movl $0x3ffffff,%eax + movl $0x3ffffff,%edx + movq %r14,%r8 + andl %r14d,%eax + movq %r11,%r9 + andl %r11d,%edx + movl %eax,-64(%rdi) + shrq $26,%r8 + movl %edx,-60(%rdi) + shrq $26,%r9 + + movl $0x3ffffff,%eax + movl $0x3ffffff,%edx + andl %r8d,%eax + andl %r9d,%edx + movl %eax,-48(%rdi) + leal (%rax,%rax,4),%eax + movl %edx,-44(%rdi) + leal (%rdx,%rdx,4),%edx + movl %eax,-32(%rdi) + shrq $26,%r8 + movl %edx,-28(%rdi) + shrq $26,%r9 + + movq %rbx,%rax + movq %r12,%rdx + shlq $12,%rax + shlq $12,%rdx + orq %r8,%rax + orq %r9,%rdx + andl $0x3ffffff,%eax + andl $0x3ffffff,%edx + movl %eax,-16(%rdi) + leal (%rax,%rax,4),%eax + movl %edx,-12(%rdi) + leal (%rdx,%rdx,4),%edx + movl %eax,0(%rdi) + movq %rbx,%r8 + movl %edx,4(%rdi) + movq %r12,%r9 + + movl $0x3ffffff,%eax + movl $0x3ffffff,%edx + shrq $14,%r8 + shrq $14,%r9 + andl %r8d,%eax + andl %r9d,%edx + movl %eax,16(%rdi) + leal (%rax,%rax,4),%eax + movl %edx,20(%rdi) + leal (%rdx,%rdx,4),%edx + movl %eax,32(%rdi) + shrq $26,%r8 + movl %edx,36(%rdi) + shrq $26,%r9 + + movq %rbp,%rax + shlq $24,%rax + orq %rax,%r8 + movl %r8d,48(%rdi) + leaq (%r8,%r8,4),%r8 + movl %r9d,52(%rdi) + leaq (%r9,%r9,4),%r9 + movl %r8d,64(%rdi) + movl %r9d,68(%rdi) + + movq %r12,%rax + call __poly1305_block + + movl $0x3ffffff,%eax + movq %r14,%r8 + andl %r14d,%eax + shrq $26,%r8 + movl %eax,-52(%rdi) + + movl $0x3ffffff,%edx + andl %r8d,%edx + movl %edx,-36(%rdi) + leal (%rdx,%rdx,4),%edx + shrq $26,%r8 + movl %edx,-20(%rdi) + + movq %rbx,%rax + shlq $12,%rax + orq %r8,%rax + andl $0x3ffffff,%eax + movl %eax,-4(%rdi) + leal (%rax,%rax,4),%eax + movq %rbx,%r8 + movl %eax,12(%rdi) + + movl $0x3ffffff,%edx + shrq $14,%r8 + andl %r8d,%edx + movl %edx,28(%rdi) + leal (%rdx,%rdx,4),%edx + shrq $26,%r8 + movl %edx,44(%rdi) + + movq %rbp,%rax + shlq $24,%rax + orq %rax,%r8 + movl %r8d,60(%rdi) + leaq (%r8,%r8,4),%r8 + movl %r8d,76(%rdi) + + movq %r12,%rax + call __poly1305_block + + movl $0x3ffffff,%eax + movq %r14,%r8 + andl %r14d,%eax + shrq $26,%r8 + movl %eax,-56(%rdi) + + movl $0x3ffffff,%edx + andl %r8d,%edx + movl %edx,-40(%rdi) + leal (%rdx,%rdx,4),%edx + shrq $26,%r8 + movl %edx,-24(%rdi) + + movq %rbx,%rax + shlq 
$12,%rax + orq %r8,%rax + andl $0x3ffffff,%eax + movl %eax,-8(%rdi) + leal (%rax,%rax,4),%eax + movq %rbx,%r8 + movl %eax,8(%rdi) + + movl $0x3ffffff,%edx + shrq $14,%r8 + andl %r8d,%edx + movl %edx,24(%rdi) + leal (%rdx,%rdx,4),%edx + shrq $26,%r8 + movl %edx,40(%rdi) + + movq %rbp,%rax + shlq $24,%rax + orq %rax,%r8 + movl %r8d,56(%rdi) + leaq (%r8,%r8,4),%r8 + movl %r8d,72(%rdi) + + leaq -48-64(%rdi),%rdi + .byte 0xf3,0xc3 +.size __poly1305_init_avx,.-__poly1305_init_avx + +.type poly1305_blocks_avx,@function +.align 32 +poly1305_blocks_avx: +.cfi_startproc + movl 20(%rdi),%r8d + cmpq $128,%rdx + jae .Lblocks_avx + testl %r8d,%r8d + jz .Lblocks + +.Lblocks_avx: + andq $-16,%rdx + jz .Lno_data_avx + + vzeroupper + + testl %r8d,%r8d + jz .Lbase2_64_avx + + testq $31,%rdx + jz .Leven_avx + + pushq %rbx +.cfi_adjust_cfa_offset 8 +.cfi_offset %rbx,-16 + pushq %rbp +.cfi_adjust_cfa_offset 8 +.cfi_offset %rbp,-24 + pushq %r12 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r12,-32 + pushq %r13 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r13,-40 + pushq %r14 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r14,-48 + pushq %r15 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r15,-56 +.Lblocks_avx_body: + + movq %rdx,%r15 + + movq 0(%rdi),%r8 + movq 8(%rdi),%r9 + movl 16(%rdi),%ebp + + movq 24(%rdi),%r11 + movq 32(%rdi),%r13 + + + movl %r8d,%r14d + andq $-2147483648,%r8 + movq %r9,%r12 + movl %r9d,%ebx + andq $-2147483648,%r9 + + shrq $6,%r8 + shlq $52,%r12 + addq %r8,%r14 + shrq $12,%rbx + shrq $18,%r9 + addq %r12,%r14 + adcq %r9,%rbx + + movq %rbp,%r8 + shlq $40,%r8 + shrq $24,%rbp + addq %r8,%rbx + adcq $0,%rbp + + movq $-4,%r9 + movq %rbp,%r8 + andq %rbp,%r9 + shrq $2,%r8 + andq $3,%rbp + addq %r9,%r8 + addq %r8,%r14 + adcq $0,%rbx + adcq $0,%rbp + + movq %r13,%r12 + movq %r13,%rax + shrq $2,%r13 + addq %r12,%r13 + + addq 0(%rsi),%r14 + adcq 8(%rsi),%rbx + leaq 16(%rsi),%rsi + adcq %rcx,%rbp + + call __poly1305_block + + testq %rcx,%rcx + jz .Lstore_base2_64_avx + + + movq %r14,%rax + movq %r14,%rdx + shrq $52,%r14 + movq %rbx,%r11 + movq %rbx,%r12 + shrq $26,%rdx + andq $0x3ffffff,%rax + shlq $12,%r11 + andq $0x3ffffff,%rdx + shrq $14,%rbx + orq %r11,%r14 + shlq $24,%rbp + andq $0x3ffffff,%r14 + shrq $40,%r12 + andq $0x3ffffff,%rbx + orq %r12,%rbp + + subq $16,%r15 + jz .Lstore_base2_26_avx + + vmovd %eax,%xmm0 + vmovd %edx,%xmm1 + vmovd %r14d,%xmm2 + vmovd %ebx,%xmm3 + vmovd %ebp,%xmm4 + jmp .Lproceed_avx + +.align 32 +.Lstore_base2_64_avx: + movq %r14,0(%rdi) + movq %rbx,8(%rdi) + movq %rbp,16(%rdi) + jmp .Ldone_avx + +.align 16 +.Lstore_base2_26_avx: + movl %eax,0(%rdi) + movl %edx,4(%rdi) + movl %r14d,8(%rdi) + movl %ebx,12(%rdi) + movl %ebp,16(%rdi) +.align 16 +.Ldone_avx: + movq 0(%rsp),%r15 +.cfi_restore %r15 + movq 8(%rsp),%r14 +.cfi_restore %r14 + movq 16(%rsp),%r13 +.cfi_restore %r13 + movq 24(%rsp),%r12 +.cfi_restore %r12 + movq 32(%rsp),%rbp +.cfi_restore %rbp + movq 40(%rsp),%rbx +.cfi_restore %rbx + leaq 48(%rsp),%rsp +.cfi_adjust_cfa_offset -48 +.Lno_data_avx: +.Lblocks_avx_epilogue: + .byte 0xf3,0xc3 +.cfi_endproc + +.align 32 +.Lbase2_64_avx: +.cfi_startproc + pushq %rbx +.cfi_adjust_cfa_offset 8 +.cfi_offset %rbx,-16 + pushq %rbp +.cfi_adjust_cfa_offset 8 +.cfi_offset %rbp,-24 + pushq %r12 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r12,-32 + pushq %r13 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r13,-40 + pushq %r14 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r14,-48 + pushq %r15 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r15,-56 +.Lbase2_64_avx_body: + + movq %rdx,%r15 + + movq 24(%rdi),%r11 + movq 32(%rdi),%r13 + 
+ movq 0(%rdi),%r14 + movq 8(%rdi),%rbx + movl 16(%rdi),%ebp + + movq %r13,%r12 + movq %r13,%rax + shrq $2,%r13 + addq %r12,%r13 + + testq $31,%rdx + jz .Linit_avx + + addq 0(%rsi),%r14 + adcq 8(%rsi),%rbx + leaq 16(%rsi),%rsi + adcq %rcx,%rbp + subq $16,%r15 + + call __poly1305_block + +.Linit_avx: + + movq %r14,%rax + movq %r14,%rdx + shrq $52,%r14 + movq %rbx,%r8 + movq %rbx,%r9 + shrq $26,%rdx + andq $0x3ffffff,%rax + shlq $12,%r8 + andq $0x3ffffff,%rdx + shrq $14,%rbx + orq %r8,%r14 + shlq $24,%rbp + andq $0x3ffffff,%r14 + shrq $40,%r9 + andq $0x3ffffff,%rbx + orq %r9,%rbp + + vmovd %eax,%xmm0 + vmovd %edx,%xmm1 + vmovd %r14d,%xmm2 + vmovd %ebx,%xmm3 + vmovd %ebp,%xmm4 + movl $1,20(%rdi) + + call __poly1305_init_avx + +.Lproceed_avx: + movq %r15,%rdx + + movq 0(%rsp),%r15 +.cfi_restore %r15 + movq 8(%rsp),%r14 +.cfi_restore %r14 + movq 16(%rsp),%r13 +.cfi_restore %r13 + movq 24(%rsp),%r12 +.cfi_restore %r12 + movq 32(%rsp),%rbp +.cfi_restore %rbp + movq 40(%rsp),%rbx +.cfi_restore %rbx + leaq 48(%rsp),%rax + leaq 48(%rsp),%rsp +.cfi_adjust_cfa_offset -48 +.Lbase2_64_avx_epilogue: + jmp .Ldo_avx +.cfi_endproc + +.align 32 +.Leven_avx: +.cfi_startproc + vmovd 0(%rdi),%xmm0 + vmovd 4(%rdi),%xmm1 + vmovd 8(%rdi),%xmm2 + vmovd 12(%rdi),%xmm3 + vmovd 16(%rdi),%xmm4 + +.Ldo_avx: + leaq -88(%rsp),%r11 +.cfi_def_cfa %r11,0x60 + subq $0x178,%rsp + subq $64,%rdx + leaq -32(%rsi),%rax + cmovcq %rax,%rsi + + vmovdqu 48(%rdi),%xmm14 + leaq 112(%rdi),%rdi + leaq .Lconst(%rip),%rcx + + + + vmovdqu 32(%rsi),%xmm5 + vmovdqu 48(%rsi),%xmm6 + vmovdqa 64(%rcx),%xmm15 + + vpsrldq $6,%xmm5,%xmm7 + vpsrldq $6,%xmm6,%xmm8 + vpunpckhqdq %xmm6,%xmm5,%xmm9 + vpunpcklqdq %xmm6,%xmm5,%xmm5 + vpunpcklqdq %xmm8,%xmm7,%xmm8 + + vpsrlq $40,%xmm9,%xmm9 + vpsrlq $26,%xmm5,%xmm6 + vpand %xmm15,%xmm5,%xmm5 + vpsrlq $4,%xmm8,%xmm7 + vpand %xmm15,%xmm6,%xmm6 + vpsrlq $30,%xmm8,%xmm8 + vpand %xmm15,%xmm7,%xmm7 + vpand %xmm15,%xmm8,%xmm8 + vpor 32(%rcx),%xmm9,%xmm9 + + jbe .Lskip_loop_avx + + + vmovdqu -48(%rdi),%xmm11 + vmovdqu -32(%rdi),%xmm12 + vpshufd $0xEE,%xmm14,%xmm13 + vpshufd $0x44,%xmm14,%xmm10 + vmovdqa %xmm13,-144(%r11) + vmovdqa %xmm10,0(%rsp) + vpshufd $0xEE,%xmm11,%xmm14 + vmovdqu -16(%rdi),%xmm10 + vpshufd $0x44,%xmm11,%xmm11 + vmovdqa %xmm14,-128(%r11) + vmovdqa %xmm11,16(%rsp) + vpshufd $0xEE,%xmm12,%xmm13 + vmovdqu 0(%rdi),%xmm11 + vpshufd $0x44,%xmm12,%xmm12 + vmovdqa %xmm13,-112(%r11) + vmovdqa %xmm12,32(%rsp) + vpshufd $0xEE,%xmm10,%xmm14 + vmovdqu 16(%rdi),%xmm12 + vpshufd $0x44,%xmm10,%xmm10 + vmovdqa %xmm14,-96(%r11) + vmovdqa %xmm10,48(%rsp) + vpshufd $0xEE,%xmm11,%xmm13 + vmovdqu 32(%rdi),%xmm10 + vpshufd $0x44,%xmm11,%xmm11 + vmovdqa %xmm13,-80(%r11) + vmovdqa %xmm11,64(%rsp) + vpshufd $0xEE,%xmm12,%xmm14 + vmovdqu 48(%rdi),%xmm11 + vpshufd $0x44,%xmm12,%xmm12 + vmovdqa %xmm14,-64(%r11) + vmovdqa %xmm12,80(%rsp) + vpshufd $0xEE,%xmm10,%xmm13 + vmovdqu 64(%rdi),%xmm12 + vpshufd $0x44,%xmm10,%xmm10 + vmovdqa %xmm13,-48(%r11) + vmovdqa %xmm10,96(%rsp) + vpshufd $0xEE,%xmm11,%xmm14 + vpshufd $0x44,%xmm11,%xmm11 + vmovdqa %xmm14,-32(%r11) + vmovdqa %xmm11,112(%rsp) + vpshufd $0xEE,%xmm12,%xmm13 + vmovdqa 0(%rsp),%xmm14 + vpshufd $0x44,%xmm12,%xmm12 + vmovdqa %xmm13,-16(%r11) + vmovdqa %xmm12,128(%rsp) + + jmp .Loop_avx + +.align 32 +.Loop_avx: + + + + + + + + + + + + + + + + + + + + + vpmuludq %xmm5,%xmm14,%xmm10 + vpmuludq %xmm6,%xmm14,%xmm11 + vmovdqa %xmm2,32(%r11) + vpmuludq %xmm7,%xmm14,%xmm12 + vmovdqa 16(%rsp),%xmm2 + vpmuludq %xmm8,%xmm14,%xmm13 + vpmuludq %xmm9,%xmm14,%xmm14 + + vmovdqa 
%xmm0,0(%r11) + vpmuludq 32(%rsp),%xmm9,%xmm0 + vmovdqa %xmm1,16(%r11) + vpmuludq %xmm8,%xmm2,%xmm1 + vpaddq %xmm0,%xmm10,%xmm10 + vpaddq %xmm1,%xmm14,%xmm14 + vmovdqa %xmm3,48(%r11) + vpmuludq %xmm7,%xmm2,%xmm0 + vpmuludq %xmm6,%xmm2,%xmm1 + vpaddq %xmm0,%xmm13,%xmm13 + vmovdqa 48(%rsp),%xmm3 + vpaddq %xmm1,%xmm12,%xmm12 + vmovdqa %xmm4,64(%r11) + vpmuludq %xmm5,%xmm2,%xmm2 + vpmuludq %xmm7,%xmm3,%xmm0 + vpaddq %xmm2,%xmm11,%xmm11 + + vmovdqa 64(%rsp),%xmm4 + vpaddq %xmm0,%xmm14,%xmm14 + vpmuludq %xmm6,%xmm3,%xmm1 + vpmuludq %xmm5,%xmm3,%xmm3 + vpaddq %xmm1,%xmm13,%xmm13 + vmovdqa 80(%rsp),%xmm2 + vpaddq %xmm3,%xmm12,%xmm12 + vpmuludq %xmm9,%xmm4,%xmm0 + vpmuludq %xmm8,%xmm4,%xmm4 + vpaddq %xmm0,%xmm11,%xmm11 + vmovdqa 96(%rsp),%xmm3 + vpaddq %xmm4,%xmm10,%xmm10 + + vmovdqa 128(%rsp),%xmm4 + vpmuludq %xmm6,%xmm2,%xmm1 + vpmuludq %xmm5,%xmm2,%xmm2 + vpaddq %xmm1,%xmm14,%xmm14 + vpaddq %xmm2,%xmm13,%xmm13 + vpmuludq %xmm9,%xmm3,%xmm0 + vpmuludq %xmm8,%xmm3,%xmm1 + vpaddq %xmm0,%xmm12,%xmm12 + vmovdqu 0(%rsi),%xmm0 + vpaddq %xmm1,%xmm11,%xmm11 + vpmuludq %xmm7,%xmm3,%xmm3 + vpmuludq %xmm7,%xmm4,%xmm7 + vpaddq %xmm3,%xmm10,%xmm10 + + vmovdqu 16(%rsi),%xmm1 + vpaddq %xmm7,%xmm11,%xmm11 + vpmuludq %xmm8,%xmm4,%xmm8 + vpmuludq %xmm9,%xmm4,%xmm9 + vpsrldq $6,%xmm0,%xmm2 + vpaddq %xmm8,%xmm12,%xmm12 + vpaddq %xmm9,%xmm13,%xmm13 + vpsrldq $6,%xmm1,%xmm3 + vpmuludq 112(%rsp),%xmm5,%xmm9 + vpmuludq %xmm6,%xmm4,%xmm5 + vpunpckhqdq %xmm1,%xmm0,%xmm4 + vpaddq %xmm9,%xmm14,%xmm14 + vmovdqa -144(%r11),%xmm9 + vpaddq %xmm5,%xmm10,%xmm10 + + vpunpcklqdq %xmm1,%xmm0,%xmm0 + vpunpcklqdq %xmm3,%xmm2,%xmm3 + + + vpsrldq $5,%xmm4,%xmm4 + vpsrlq $26,%xmm0,%xmm1 + vpand %xmm15,%xmm0,%xmm0 + vpsrlq $4,%xmm3,%xmm2 + vpand %xmm15,%xmm1,%xmm1 + vpand 0(%rcx),%xmm4,%xmm4 + vpsrlq $30,%xmm3,%xmm3 + vpand %xmm15,%xmm2,%xmm2 + vpand %xmm15,%xmm3,%xmm3 + vpor 32(%rcx),%xmm4,%xmm4 + + vpaddq 0(%r11),%xmm0,%xmm0 + vpaddq 16(%r11),%xmm1,%xmm1 + vpaddq 32(%r11),%xmm2,%xmm2 + vpaddq 48(%r11),%xmm3,%xmm3 + vpaddq 64(%r11),%xmm4,%xmm4 + + leaq 32(%rsi),%rax + leaq 64(%rsi),%rsi + subq $64,%rdx + cmovcq %rax,%rsi + + + + + + + + + + + vpmuludq %xmm0,%xmm9,%xmm5 + vpmuludq %xmm1,%xmm9,%xmm6 + vpaddq %xmm5,%xmm10,%xmm10 + vpaddq %xmm6,%xmm11,%xmm11 + vmovdqa -128(%r11),%xmm7 + vpmuludq %xmm2,%xmm9,%xmm5 + vpmuludq %xmm3,%xmm9,%xmm6 + vpaddq %xmm5,%xmm12,%xmm12 + vpaddq %xmm6,%xmm13,%xmm13 + vpmuludq %xmm4,%xmm9,%xmm9 + vpmuludq -112(%r11),%xmm4,%xmm5 + vpaddq %xmm9,%xmm14,%xmm14 + + vpaddq %xmm5,%xmm10,%xmm10 + vpmuludq %xmm2,%xmm7,%xmm6 + vpmuludq %xmm3,%xmm7,%xmm5 + vpaddq %xmm6,%xmm13,%xmm13 + vmovdqa -96(%r11),%xmm8 + vpaddq %xmm5,%xmm14,%xmm14 + vpmuludq %xmm1,%xmm7,%xmm6 + vpmuludq %xmm0,%xmm7,%xmm7 + vpaddq %xmm6,%xmm12,%xmm12 + vpaddq %xmm7,%xmm11,%xmm11 + + vmovdqa -80(%r11),%xmm9 + vpmuludq %xmm2,%xmm8,%xmm5 + vpmuludq %xmm1,%xmm8,%xmm6 + vpaddq %xmm5,%xmm14,%xmm14 + vpaddq %xmm6,%xmm13,%xmm13 + vmovdqa -64(%r11),%xmm7 + vpmuludq %xmm0,%xmm8,%xmm8 + vpmuludq %xmm4,%xmm9,%xmm5 + vpaddq %xmm8,%xmm12,%xmm12 + vpaddq %xmm5,%xmm11,%xmm11 + vmovdqa -48(%r11),%xmm8 + vpmuludq %xmm3,%xmm9,%xmm9 + vpmuludq %xmm1,%xmm7,%xmm6 + vpaddq %xmm9,%xmm10,%xmm10 + + vmovdqa -16(%r11),%xmm9 + vpaddq %xmm6,%xmm14,%xmm14 + vpmuludq %xmm0,%xmm7,%xmm7 + vpmuludq %xmm4,%xmm8,%xmm5 + vpaddq %xmm7,%xmm13,%xmm13 + vpaddq %xmm5,%xmm12,%xmm12 + vmovdqu 32(%rsi),%xmm5 + vpmuludq %xmm3,%xmm8,%xmm7 + vpmuludq %xmm2,%xmm8,%xmm8 + vpaddq %xmm7,%xmm11,%xmm11 + vmovdqu 48(%rsi),%xmm6 + vpaddq %xmm8,%xmm10,%xmm10 + + vpmuludq %xmm2,%xmm9,%xmm2 + vpmuludq 
%xmm3,%xmm9,%xmm3 + vpsrldq $6,%xmm5,%xmm7 + vpaddq %xmm2,%xmm11,%xmm11 + vpmuludq %xmm4,%xmm9,%xmm4 + vpsrldq $6,%xmm6,%xmm8 + vpaddq %xmm3,%xmm12,%xmm2 + vpaddq %xmm4,%xmm13,%xmm3 + vpmuludq -32(%r11),%xmm0,%xmm4 + vpmuludq %xmm1,%xmm9,%xmm0 + vpunpckhqdq %xmm6,%xmm5,%xmm9 + vpaddq %xmm4,%xmm14,%xmm4 + vpaddq %xmm0,%xmm10,%xmm0 + + vpunpcklqdq %xmm6,%xmm5,%xmm5 + vpunpcklqdq %xmm8,%xmm7,%xmm8 + + + vpsrldq $5,%xmm9,%xmm9 + vpsrlq $26,%xmm5,%xmm6 + vmovdqa 0(%rsp),%xmm14 + vpand %xmm15,%xmm5,%xmm5 + vpsrlq $4,%xmm8,%xmm7 + vpand %xmm15,%xmm6,%xmm6 + vpand 0(%rcx),%xmm9,%xmm9 + vpsrlq $30,%xmm8,%xmm8 + vpand %xmm15,%xmm7,%xmm7 + vpand %xmm15,%xmm8,%xmm8 + vpor 32(%rcx),%xmm9,%xmm9 + + + + + + vpsrlq $26,%xmm3,%xmm13 + vpand %xmm15,%xmm3,%xmm3 + vpaddq %xmm13,%xmm4,%xmm4 + + vpsrlq $26,%xmm0,%xmm10 + vpand %xmm15,%xmm0,%xmm0 + vpaddq %xmm10,%xmm11,%xmm1 + + vpsrlq $26,%xmm4,%xmm10 + vpand %xmm15,%xmm4,%xmm4 + + vpsrlq $26,%xmm1,%xmm11 + vpand %xmm15,%xmm1,%xmm1 + vpaddq %xmm11,%xmm2,%xmm2 + + vpaddq %xmm10,%xmm0,%xmm0 + vpsllq $2,%xmm10,%xmm10 + vpaddq %xmm10,%xmm0,%xmm0 + + vpsrlq $26,%xmm2,%xmm12 + vpand %xmm15,%xmm2,%xmm2 + vpaddq %xmm12,%xmm3,%xmm3 + + vpsrlq $26,%xmm0,%xmm10 + vpand %xmm15,%xmm0,%xmm0 + vpaddq %xmm10,%xmm1,%xmm1 + + vpsrlq $26,%xmm3,%xmm13 + vpand %xmm15,%xmm3,%xmm3 + vpaddq %xmm13,%xmm4,%xmm4 + + ja .Loop_avx + +.Lskip_loop_avx: + + + + vpshufd $0x10,%xmm14,%xmm14 + addq $32,%rdx + jnz .Long_tail_avx + + vpaddq %xmm2,%xmm7,%xmm7 + vpaddq %xmm0,%xmm5,%xmm5 + vpaddq %xmm1,%xmm6,%xmm6 + vpaddq %xmm3,%xmm8,%xmm8 + vpaddq %xmm4,%xmm9,%xmm9 + +.Long_tail_avx: + vmovdqa %xmm2,32(%r11) + vmovdqa %xmm0,0(%r11) + vmovdqa %xmm1,16(%r11) + vmovdqa %xmm3,48(%r11) + vmovdqa %xmm4,64(%r11) + + + + + + + + vpmuludq %xmm7,%xmm14,%xmm12 + vpmuludq %xmm5,%xmm14,%xmm10 + vpshufd $0x10,-48(%rdi),%xmm2 + vpmuludq %xmm6,%xmm14,%xmm11 + vpmuludq %xmm8,%xmm14,%xmm13 + vpmuludq %xmm9,%xmm14,%xmm14 + + vpmuludq %xmm8,%xmm2,%xmm0 + vpaddq %xmm0,%xmm14,%xmm14 + vpshufd $0x10,-32(%rdi),%xmm3 + vpmuludq %xmm7,%xmm2,%xmm1 + vpaddq %xmm1,%xmm13,%xmm13 + vpshufd $0x10,-16(%rdi),%xmm4 + vpmuludq %xmm6,%xmm2,%xmm0 + vpaddq %xmm0,%xmm12,%xmm12 + vpmuludq %xmm5,%xmm2,%xmm2 + vpaddq %xmm2,%xmm11,%xmm11 + vpmuludq %xmm9,%xmm3,%xmm3 + vpaddq %xmm3,%xmm10,%xmm10 + + vpshufd $0x10,0(%rdi),%xmm2 + vpmuludq %xmm7,%xmm4,%xmm1 + vpaddq %xmm1,%xmm14,%xmm14 + vpmuludq %xmm6,%xmm4,%xmm0 + vpaddq %xmm0,%xmm13,%xmm13 + vpshufd $0x10,16(%rdi),%xmm3 + vpmuludq %xmm5,%xmm4,%xmm4 + vpaddq %xmm4,%xmm12,%xmm12 + vpmuludq %xmm9,%xmm2,%xmm1 + vpaddq %xmm1,%xmm11,%xmm11 + vpshufd $0x10,32(%rdi),%xmm4 + vpmuludq %xmm8,%xmm2,%xmm2 + vpaddq %xmm2,%xmm10,%xmm10 + + vpmuludq %xmm6,%xmm3,%xmm0 + vpaddq %xmm0,%xmm14,%xmm14 + vpmuludq %xmm5,%xmm3,%xmm3 + vpaddq %xmm3,%xmm13,%xmm13 + vpshufd $0x10,48(%rdi),%xmm2 + vpmuludq %xmm9,%xmm4,%xmm1 + vpaddq %xmm1,%xmm12,%xmm12 + vpshufd $0x10,64(%rdi),%xmm3 + vpmuludq %xmm8,%xmm4,%xmm0 + vpaddq %xmm0,%xmm11,%xmm11 + vpmuludq %xmm7,%xmm4,%xmm4 + vpaddq %xmm4,%xmm10,%xmm10 + + vpmuludq %xmm5,%xmm2,%xmm2 + vpaddq %xmm2,%xmm14,%xmm14 + vpmuludq %xmm9,%xmm3,%xmm1 + vpaddq %xmm1,%xmm13,%xmm13 + vpmuludq %xmm8,%xmm3,%xmm0 + vpaddq %xmm0,%xmm12,%xmm12 + vpmuludq %xmm7,%xmm3,%xmm1 + vpaddq %xmm1,%xmm11,%xmm11 + vpmuludq %xmm6,%xmm3,%xmm3 + vpaddq %xmm3,%xmm10,%xmm10 + + jz .Lshort_tail_avx + + vmovdqu 0(%rsi),%xmm0 + vmovdqu 16(%rsi),%xmm1 + + vpsrldq $6,%xmm0,%xmm2 + vpsrldq $6,%xmm1,%xmm3 + vpunpckhqdq %xmm1,%xmm0,%xmm4 + vpunpcklqdq %xmm1,%xmm0,%xmm0 + vpunpcklqdq %xmm3,%xmm2,%xmm3 + + vpsrlq 
$40,%xmm4,%xmm4 + vpsrlq $26,%xmm0,%xmm1 + vpand %xmm15,%xmm0,%xmm0 + vpsrlq $4,%xmm3,%xmm2 + vpand %xmm15,%xmm1,%xmm1 + vpsrlq $30,%xmm3,%xmm3 + vpand %xmm15,%xmm2,%xmm2 + vpand %xmm15,%xmm3,%xmm3 + vpor 32(%rcx),%xmm4,%xmm4 + + vpshufd $0x32,-64(%rdi),%xmm9 + vpaddq 0(%r11),%xmm0,%xmm0 + vpaddq 16(%r11),%xmm1,%xmm1 + vpaddq 32(%r11),%xmm2,%xmm2 + vpaddq 48(%r11),%xmm3,%xmm3 + vpaddq 64(%r11),%xmm4,%xmm4 + + + + + vpmuludq %xmm0,%xmm9,%xmm5 + vpaddq %xmm5,%xmm10,%xmm10 + vpmuludq %xmm1,%xmm9,%xmm6 + vpaddq %xmm6,%xmm11,%xmm11 + vpmuludq %xmm2,%xmm9,%xmm5 + vpaddq %xmm5,%xmm12,%xmm12 + vpshufd $0x32,-48(%rdi),%xmm7 + vpmuludq %xmm3,%xmm9,%xmm6 + vpaddq %xmm6,%xmm13,%xmm13 + vpmuludq %xmm4,%xmm9,%xmm9 + vpaddq %xmm9,%xmm14,%xmm14 + + vpmuludq %xmm3,%xmm7,%xmm5 + vpaddq %xmm5,%xmm14,%xmm14 + vpshufd $0x32,-32(%rdi),%xmm8 + vpmuludq %xmm2,%xmm7,%xmm6 + vpaddq %xmm6,%xmm13,%xmm13 + vpshufd $0x32,-16(%rdi),%xmm9 + vpmuludq %xmm1,%xmm7,%xmm5 + vpaddq %xmm5,%xmm12,%xmm12 + vpmuludq %xmm0,%xmm7,%xmm7 + vpaddq %xmm7,%xmm11,%xmm11 + vpmuludq %xmm4,%xmm8,%xmm8 + vpaddq %xmm8,%xmm10,%xmm10 + + vpshufd $0x32,0(%rdi),%xmm7 + vpmuludq %xmm2,%xmm9,%xmm6 + vpaddq %xmm6,%xmm14,%xmm14 + vpmuludq %xmm1,%xmm9,%xmm5 + vpaddq %xmm5,%xmm13,%xmm13 + vpshufd $0x32,16(%rdi),%xmm8 + vpmuludq %xmm0,%xmm9,%xmm9 + vpaddq %xmm9,%xmm12,%xmm12 + vpmuludq %xmm4,%xmm7,%xmm6 + vpaddq %xmm6,%xmm11,%xmm11 + vpshufd $0x32,32(%rdi),%xmm9 + vpmuludq %xmm3,%xmm7,%xmm7 + vpaddq %xmm7,%xmm10,%xmm10 + + vpmuludq %xmm1,%xmm8,%xmm5 + vpaddq %xmm5,%xmm14,%xmm14 + vpmuludq %xmm0,%xmm8,%xmm8 + vpaddq %xmm8,%xmm13,%xmm13 + vpshufd $0x32,48(%rdi),%xmm7 + vpmuludq %xmm4,%xmm9,%xmm6 + vpaddq %xmm6,%xmm12,%xmm12 + vpshufd $0x32,64(%rdi),%xmm8 + vpmuludq %xmm3,%xmm9,%xmm5 + vpaddq %xmm5,%xmm11,%xmm11 + vpmuludq %xmm2,%xmm9,%xmm9 + vpaddq %xmm9,%xmm10,%xmm10 + + vpmuludq %xmm0,%xmm7,%xmm7 + vpaddq %xmm7,%xmm14,%xmm14 + vpmuludq %xmm4,%xmm8,%xmm6 + vpaddq %xmm6,%xmm13,%xmm13 + vpmuludq %xmm3,%xmm8,%xmm5 + vpaddq %xmm5,%xmm12,%xmm12 + vpmuludq %xmm2,%xmm8,%xmm6 + vpaddq %xmm6,%xmm11,%xmm11 + vpmuludq %xmm1,%xmm8,%xmm8 + vpaddq %xmm8,%xmm10,%xmm10 + +.Lshort_tail_avx: + + + + vpsrldq $8,%xmm14,%xmm9 + vpsrldq $8,%xmm13,%xmm8 + vpsrldq $8,%xmm11,%xmm6 + vpsrldq $8,%xmm10,%xmm5 + vpsrldq $8,%xmm12,%xmm7 + vpaddq %xmm8,%xmm13,%xmm13 + vpaddq %xmm9,%xmm14,%xmm14 + vpaddq %xmm5,%xmm10,%xmm10 + vpaddq %xmm6,%xmm11,%xmm11 + vpaddq %xmm7,%xmm12,%xmm12 + + + + + vpsrlq $26,%xmm13,%xmm3 + vpand %xmm15,%xmm13,%xmm13 + vpaddq %xmm3,%xmm14,%xmm14 + + vpsrlq $26,%xmm10,%xmm0 + vpand %xmm15,%xmm10,%xmm10 + vpaddq %xmm0,%xmm11,%xmm11 + + vpsrlq $26,%xmm14,%xmm4 + vpand %xmm15,%xmm14,%xmm14 + + vpsrlq $26,%xmm11,%xmm1 + vpand %xmm15,%xmm11,%xmm11 + vpaddq %xmm1,%xmm12,%xmm12 + + vpaddq %xmm4,%xmm10,%xmm10 + vpsllq $2,%xmm4,%xmm4 + vpaddq %xmm4,%xmm10,%xmm10 + + vpsrlq $26,%xmm12,%xmm2 + vpand %xmm15,%xmm12,%xmm12 + vpaddq %xmm2,%xmm13,%xmm13 + + vpsrlq $26,%xmm10,%xmm0 + vpand %xmm15,%xmm10,%xmm10 + vpaddq %xmm0,%xmm11,%xmm11 + + vpsrlq $26,%xmm13,%xmm3 + vpand %xmm15,%xmm13,%xmm13 + vpaddq %xmm3,%xmm14,%xmm14 + + vmovd %xmm10,-112(%rdi) + vmovd %xmm11,-108(%rdi) + vmovd %xmm12,-104(%rdi) + vmovd %xmm13,-100(%rdi) + vmovd %xmm14,-96(%rdi) + leaq 88(%r11),%rsp +.cfi_def_cfa %rsp,8 + vzeroupper + .byte 0xf3,0xc3 +.cfi_endproc +.size poly1305_blocks_avx,.-poly1305_blocks_avx + +.type poly1305_emit_avx,@function +.align 32 +poly1305_emit_avx: + cmpl $0,20(%rdi) + je .Lemit + + movl 0(%rdi),%eax + movl 4(%rdi),%ecx + movl 8(%rdi),%r8d + movl 12(%rdi),%r11d + movl 
16(%rdi),%r10d + + shlq $26,%rcx + movq %r8,%r9 + shlq $52,%r8 + addq %rcx,%rax + shrq $12,%r9 + addq %rax,%r8 + adcq $0,%r9 + + shlq $14,%r11 + movq %r10,%rax + shrq $24,%r10 + addq %r11,%r9 + shlq $40,%rax + addq %rax,%r9 + adcq $0,%r10 + + movq %r10,%rax + movq %r10,%rcx + andq $3,%r10 + shrq $2,%rax + andq $-4,%rcx + addq %rcx,%rax + addq %rax,%r8 + adcq $0,%r9 + adcq $0,%r10 + + movq %r8,%rax + addq $5,%r8 + movq %r9,%rcx + adcq $0,%r9 + adcq $0,%r10 + shrq $2,%r10 + cmovnzq %r8,%rax + cmovnzq %r9,%rcx + + addq 0(%rdx),%rax + adcq 8(%rdx),%rcx + movq %rax,0(%rsi) + movq %rcx,8(%rsi) + + .byte 0xf3,0xc3 +.size poly1305_emit_avx,.-poly1305_emit_avx +.type poly1305_blocks_avx2,@function +.align 32 +poly1305_blocks_avx2: +.cfi_startproc + movl 20(%rdi),%r8d + cmpq $128,%rdx + jae .Lblocks_avx2 + testl %r8d,%r8d + jz .Lblocks + +.Lblocks_avx2: + andq $-16,%rdx + jz .Lno_data_avx2 + + vzeroupper + + testl %r8d,%r8d + jz .Lbase2_64_avx2 + + testq $63,%rdx + jz .Leven_avx2 + + pushq %rbx +.cfi_adjust_cfa_offset 8 +.cfi_offset %rbx,-16 + pushq %rbp +.cfi_adjust_cfa_offset 8 +.cfi_offset %rbp,-24 + pushq %r12 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r12,-32 + pushq %r13 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r13,-40 + pushq %r14 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r14,-48 + pushq %r15 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r15,-56 +.Lblocks_avx2_body: + + movq %rdx,%r15 + + movq 0(%rdi),%r8 + movq 8(%rdi),%r9 + movl 16(%rdi),%ebp + + movq 24(%rdi),%r11 + movq 32(%rdi),%r13 + + + movl %r8d,%r14d + andq $-2147483648,%r8 + movq %r9,%r12 + movl %r9d,%ebx + andq $-2147483648,%r9 + + shrq $6,%r8 + shlq $52,%r12 + addq %r8,%r14 + shrq $12,%rbx + shrq $18,%r9 + addq %r12,%r14 + adcq %r9,%rbx + + movq %rbp,%r8 + shlq $40,%r8 + shrq $24,%rbp + addq %r8,%rbx + adcq $0,%rbp + + movq $-4,%r9 + movq %rbp,%r8 + andq %rbp,%r9 + shrq $2,%r8 + andq $3,%rbp + addq %r9,%r8 + addq %r8,%r14 + adcq $0,%rbx + adcq $0,%rbp + + movq %r13,%r12 + movq %r13,%rax + shrq $2,%r13 + addq %r12,%r13 + +.Lbase2_26_pre_avx2: + addq 0(%rsi),%r14 + adcq 8(%rsi),%rbx + leaq 16(%rsi),%rsi + adcq %rcx,%rbp + subq $16,%r15 + + call __poly1305_block + movq %r12,%rax + + testq $63,%r15 + jnz .Lbase2_26_pre_avx2 + + testq %rcx,%rcx + jz .Lstore_base2_64_avx2 + + + movq %r14,%rax + movq %r14,%rdx + shrq $52,%r14 + movq %rbx,%r11 + movq %rbx,%r12 + shrq $26,%rdx + andq $0x3ffffff,%rax + shlq $12,%r11 + andq $0x3ffffff,%rdx + shrq $14,%rbx + orq %r11,%r14 + shlq $24,%rbp + andq $0x3ffffff,%r14 + shrq $40,%r12 + andq $0x3ffffff,%rbx + orq %r12,%rbp + + testq %r15,%r15 + jz .Lstore_base2_26_avx2 + + vmovd %eax,%xmm0 + vmovd %edx,%xmm1 + vmovd %r14d,%xmm2 + vmovd %ebx,%xmm3 + vmovd %ebp,%xmm4 + jmp .Lproceed_avx2 + +.align 32 +.Lstore_base2_64_avx2: + movq %r14,0(%rdi) + movq %rbx,8(%rdi) + movq %rbp,16(%rdi) + jmp .Ldone_avx2 + +.align 16 +.Lstore_base2_26_avx2: + movl %eax,0(%rdi) + movl %edx,4(%rdi) + movl %r14d,8(%rdi) + movl %ebx,12(%rdi) + movl %ebp,16(%rdi) +.align 16 +.Ldone_avx2: + movq 0(%rsp),%r15 +.cfi_restore %r15 + movq 8(%rsp),%r14 +.cfi_restore %r14 + movq 16(%rsp),%r13 +.cfi_restore %r13 + movq 24(%rsp),%r12 +.cfi_restore %r12 + movq 32(%rsp),%rbp +.cfi_restore %rbp + movq 40(%rsp),%rbx +.cfi_restore %rbx + leaq 48(%rsp),%rsp +.cfi_adjust_cfa_offset -48 +.Lno_data_avx2: +.Lblocks_avx2_epilogue: + .byte 0xf3,0xc3 +.cfi_endproc + +.align 32 +.Lbase2_64_avx2: +.cfi_startproc + pushq %rbx +.cfi_adjust_cfa_offset 8 +.cfi_offset %rbx,-16 + pushq %rbp +.cfi_adjust_cfa_offset 8 +.cfi_offset %rbp,-24 + pushq %r12 
+.cfi_adjust_cfa_offset 8 +.cfi_offset %r12,-32 + pushq %r13 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r13,-40 + pushq %r14 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r14,-48 + pushq %r15 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r15,-56 +.Lbase2_64_avx2_body: + + movq %rdx,%r15 + + movq 24(%rdi),%r11 + movq 32(%rdi),%r13 + + movq 0(%rdi),%r14 + movq 8(%rdi),%rbx + movl 16(%rdi),%ebp + + movq %r13,%r12 + movq %r13,%rax + shrq $2,%r13 + addq %r12,%r13 + + testq $63,%rdx + jz .Linit_avx2 + +.Lbase2_64_pre_avx2: + addq 0(%rsi),%r14 + adcq 8(%rsi),%rbx + leaq 16(%rsi),%rsi + adcq %rcx,%rbp + subq $16,%r15 + + call __poly1305_block + movq %r12,%rax + + testq $63,%r15 + jnz .Lbase2_64_pre_avx2 + +.Linit_avx2: + + movq %r14,%rax + movq %r14,%rdx + shrq $52,%r14 + movq %rbx,%r8 + movq %rbx,%r9 + shrq $26,%rdx + andq $0x3ffffff,%rax + shlq $12,%r8 + andq $0x3ffffff,%rdx + shrq $14,%rbx + orq %r8,%r14 + shlq $24,%rbp + andq $0x3ffffff,%r14 + shrq $40,%r9 + andq $0x3ffffff,%rbx + orq %r9,%rbp + + vmovd %eax,%xmm0 + vmovd %edx,%xmm1 + vmovd %r14d,%xmm2 + vmovd %ebx,%xmm3 + vmovd %ebp,%xmm4 + movl $1,20(%rdi) + + call __poly1305_init_avx + +.Lproceed_avx2: + movq %r15,%rdx + movl OPENSSL_ia32cap_P+8(%rip),%r10d + movl $3221291008,%r11d + + movq 0(%rsp),%r15 +.cfi_restore %r15 + movq 8(%rsp),%r14 +.cfi_restore %r14 + movq 16(%rsp),%r13 +.cfi_restore %r13 + movq 24(%rsp),%r12 +.cfi_restore %r12 + movq 32(%rsp),%rbp +.cfi_restore %rbp + movq 40(%rsp),%rbx +.cfi_restore %rbx + leaq 48(%rsp),%rax + leaq 48(%rsp),%rsp +.cfi_adjust_cfa_offset -48 +.Lbase2_64_avx2_epilogue: + jmp .Ldo_avx2 +.cfi_endproc + +.align 32 +.Leven_avx2: +.cfi_startproc + movl OPENSSL_ia32cap_P+8(%rip),%r10d + vmovd 0(%rdi),%xmm0 + vmovd 4(%rdi),%xmm1 + vmovd 8(%rdi),%xmm2 + vmovd 12(%rdi),%xmm3 + vmovd 16(%rdi),%xmm4 + +.Ldo_avx2: + cmpq $512,%rdx + jb .Lskip_avx512 + andl %r11d,%r10d + testl $65536,%r10d + jnz .Lblocks_avx512 +.Lskip_avx512: + leaq -8(%rsp),%r11 +.cfi_def_cfa %r11,16 + subq $0x128,%rsp + leaq .Lconst(%rip),%rcx + leaq 48+64(%rdi),%rdi + vmovdqa 96(%rcx),%ymm7 + + + vmovdqu -64(%rdi),%xmm9 + andq $-512,%rsp + vmovdqu -48(%rdi),%xmm10 + vmovdqu -32(%rdi),%xmm6 + vmovdqu -16(%rdi),%xmm11 + vmovdqu 0(%rdi),%xmm12 + vmovdqu 16(%rdi),%xmm13 + leaq 144(%rsp),%rax + vmovdqu 32(%rdi),%xmm14 + vpermd %ymm9,%ymm7,%ymm9 + vmovdqu 48(%rdi),%xmm15 + vpermd %ymm10,%ymm7,%ymm10 + vmovdqu 64(%rdi),%xmm5 + vpermd %ymm6,%ymm7,%ymm6 + vmovdqa %ymm9,0(%rsp) + vpermd %ymm11,%ymm7,%ymm11 + vmovdqa %ymm10,32-144(%rax) + vpermd %ymm12,%ymm7,%ymm12 + vmovdqa %ymm6,64-144(%rax) + vpermd %ymm13,%ymm7,%ymm13 + vmovdqa %ymm11,96-144(%rax) + vpermd %ymm14,%ymm7,%ymm14 + vmovdqa %ymm12,128-144(%rax) + vpermd %ymm15,%ymm7,%ymm15 + vmovdqa %ymm13,160-144(%rax) + vpermd %ymm5,%ymm7,%ymm5 + vmovdqa %ymm14,192-144(%rax) + vmovdqa %ymm15,224-144(%rax) + vmovdqa %ymm5,256-144(%rax) + vmovdqa 64(%rcx),%ymm5 + + + + vmovdqu 0(%rsi),%xmm7 + vmovdqu 16(%rsi),%xmm8 + vinserti128 $1,32(%rsi),%ymm7,%ymm7 + vinserti128 $1,48(%rsi),%ymm8,%ymm8 + leaq 64(%rsi),%rsi + + vpsrldq $6,%ymm7,%ymm9 + vpsrldq $6,%ymm8,%ymm10 + vpunpckhqdq %ymm8,%ymm7,%ymm6 + vpunpcklqdq %ymm10,%ymm9,%ymm9 + vpunpcklqdq %ymm8,%ymm7,%ymm7 + + vpsrlq $30,%ymm9,%ymm10 + vpsrlq $4,%ymm9,%ymm9 + vpsrlq $26,%ymm7,%ymm8 + vpsrlq $40,%ymm6,%ymm6 + vpand %ymm5,%ymm9,%ymm9 + vpand %ymm5,%ymm7,%ymm7 + vpand %ymm5,%ymm8,%ymm8 + vpand %ymm5,%ymm10,%ymm10 + vpor 32(%rcx),%ymm6,%ymm6 + + vpaddq %ymm2,%ymm9,%ymm2 + subq $64,%rdx + jz .Ltail_avx2 + jmp .Loop_avx2 + +.align 32 +.Loop_avx2: + + + + + + + + + vpaddq 
%ymm0,%ymm7,%ymm0 + vmovdqa 0(%rsp),%ymm7 + vpaddq %ymm1,%ymm8,%ymm1 + vmovdqa 32(%rsp),%ymm8 + vpaddq %ymm3,%ymm10,%ymm3 + vmovdqa 96(%rsp),%ymm9 + vpaddq %ymm4,%ymm6,%ymm4 + vmovdqa 48(%rax),%ymm10 + vmovdqa 112(%rax),%ymm5 + + + + + + + + + + + + + + + + + vpmuludq %ymm2,%ymm7,%ymm13 + vpmuludq %ymm2,%ymm8,%ymm14 + vpmuludq %ymm2,%ymm9,%ymm15 + vpmuludq %ymm2,%ymm10,%ymm11 + vpmuludq %ymm2,%ymm5,%ymm12 + + vpmuludq %ymm0,%ymm8,%ymm6 + vpmuludq %ymm1,%ymm8,%ymm2 + vpaddq %ymm6,%ymm12,%ymm12 + vpaddq %ymm2,%ymm13,%ymm13 + vpmuludq %ymm3,%ymm8,%ymm6 + vpmuludq 64(%rsp),%ymm4,%ymm2 + vpaddq %ymm6,%ymm15,%ymm15 + vpaddq %ymm2,%ymm11,%ymm11 + vmovdqa -16(%rax),%ymm8 + + vpmuludq %ymm0,%ymm7,%ymm6 + vpmuludq %ymm1,%ymm7,%ymm2 + vpaddq %ymm6,%ymm11,%ymm11 + vpaddq %ymm2,%ymm12,%ymm12 + vpmuludq %ymm3,%ymm7,%ymm6 + vpmuludq %ymm4,%ymm7,%ymm2 + vmovdqu 0(%rsi),%xmm7 + vpaddq %ymm6,%ymm14,%ymm14 + vpaddq %ymm2,%ymm15,%ymm15 + vinserti128 $1,32(%rsi),%ymm7,%ymm7 + + vpmuludq %ymm3,%ymm8,%ymm6 + vpmuludq %ymm4,%ymm8,%ymm2 + vmovdqu 16(%rsi),%xmm8 + vpaddq %ymm6,%ymm11,%ymm11 + vpaddq %ymm2,%ymm12,%ymm12 + vmovdqa 16(%rax),%ymm2 + vpmuludq %ymm1,%ymm9,%ymm6 + vpmuludq %ymm0,%ymm9,%ymm9 + vpaddq %ymm6,%ymm14,%ymm14 + vpaddq %ymm9,%ymm13,%ymm13 + vinserti128 $1,48(%rsi),%ymm8,%ymm8 + leaq 64(%rsi),%rsi + + vpmuludq %ymm1,%ymm2,%ymm6 + vpmuludq %ymm0,%ymm2,%ymm2 + vpsrldq $6,%ymm7,%ymm9 + vpaddq %ymm6,%ymm15,%ymm15 + vpaddq %ymm2,%ymm14,%ymm14 + vpmuludq %ymm3,%ymm10,%ymm6 + vpmuludq %ymm4,%ymm10,%ymm2 + vpsrldq $6,%ymm8,%ymm10 + vpaddq %ymm6,%ymm12,%ymm12 + vpaddq %ymm2,%ymm13,%ymm13 + vpunpckhqdq %ymm8,%ymm7,%ymm6 + + vpmuludq %ymm3,%ymm5,%ymm3 + vpmuludq %ymm4,%ymm5,%ymm4 + vpunpcklqdq %ymm8,%ymm7,%ymm7 + vpaddq %ymm3,%ymm13,%ymm2 + vpaddq %ymm4,%ymm14,%ymm3 + vpunpcklqdq %ymm10,%ymm9,%ymm10 + vpmuludq 80(%rax),%ymm0,%ymm4 + vpmuludq %ymm1,%ymm5,%ymm0 + vmovdqa 64(%rcx),%ymm5 + vpaddq %ymm4,%ymm15,%ymm4 + vpaddq %ymm0,%ymm11,%ymm0 + + + + + vpsrlq $26,%ymm3,%ymm14 + vpand %ymm5,%ymm3,%ymm3 + vpaddq %ymm14,%ymm4,%ymm4 + + vpsrlq $26,%ymm0,%ymm11 + vpand %ymm5,%ymm0,%ymm0 + vpaddq %ymm11,%ymm12,%ymm1 + + vpsrlq $26,%ymm4,%ymm15 + vpand %ymm5,%ymm4,%ymm4 + + vpsrlq $4,%ymm10,%ymm9 + + vpsrlq $26,%ymm1,%ymm12 + vpand %ymm5,%ymm1,%ymm1 + vpaddq %ymm12,%ymm2,%ymm2 + + vpaddq %ymm15,%ymm0,%ymm0 + vpsllq $2,%ymm15,%ymm15 + vpaddq %ymm15,%ymm0,%ymm0 + + vpand %ymm5,%ymm9,%ymm9 + vpsrlq $26,%ymm7,%ymm8 + + vpsrlq $26,%ymm2,%ymm13 + vpand %ymm5,%ymm2,%ymm2 + vpaddq %ymm13,%ymm3,%ymm3 + + vpaddq %ymm9,%ymm2,%ymm2 + vpsrlq $30,%ymm10,%ymm10 + + vpsrlq $26,%ymm0,%ymm11 + vpand %ymm5,%ymm0,%ymm0 + vpaddq %ymm11,%ymm1,%ymm1 + + vpsrlq $40,%ymm6,%ymm6 + + vpsrlq $26,%ymm3,%ymm14 + vpand %ymm5,%ymm3,%ymm3 + vpaddq %ymm14,%ymm4,%ymm4 + + vpand %ymm5,%ymm7,%ymm7 + vpand %ymm5,%ymm8,%ymm8 + vpand %ymm5,%ymm10,%ymm10 + vpor 32(%rcx),%ymm6,%ymm6 + + subq $64,%rdx + jnz .Loop_avx2 + +.byte 0x66,0x90 +.Ltail_avx2: + + + + + + + + vpaddq %ymm0,%ymm7,%ymm0 + vmovdqu 4(%rsp),%ymm7 + vpaddq %ymm1,%ymm8,%ymm1 + vmovdqu 36(%rsp),%ymm8 + vpaddq %ymm3,%ymm10,%ymm3 + vmovdqu 100(%rsp),%ymm9 + vpaddq %ymm4,%ymm6,%ymm4 + vmovdqu 52(%rax),%ymm10 + vmovdqu 116(%rax),%ymm5 + + vpmuludq %ymm2,%ymm7,%ymm13 + vpmuludq %ymm2,%ymm8,%ymm14 + vpmuludq %ymm2,%ymm9,%ymm15 + vpmuludq %ymm2,%ymm10,%ymm11 + vpmuludq %ymm2,%ymm5,%ymm12 + + vpmuludq %ymm0,%ymm8,%ymm6 + vpmuludq %ymm1,%ymm8,%ymm2 + vpaddq %ymm6,%ymm12,%ymm12 + vpaddq %ymm2,%ymm13,%ymm13 + vpmuludq %ymm3,%ymm8,%ymm6 + vpmuludq 68(%rsp),%ymm4,%ymm2 + vpaddq %ymm6,%ymm15,%ymm15 + vpaddq 
%ymm2,%ymm11,%ymm11 + + vpmuludq %ymm0,%ymm7,%ymm6 + vpmuludq %ymm1,%ymm7,%ymm2 + vpaddq %ymm6,%ymm11,%ymm11 + vmovdqu -12(%rax),%ymm8 + vpaddq %ymm2,%ymm12,%ymm12 + vpmuludq %ymm3,%ymm7,%ymm6 + vpmuludq %ymm4,%ymm7,%ymm2 + vpaddq %ymm6,%ymm14,%ymm14 + vpaddq %ymm2,%ymm15,%ymm15 + + vpmuludq %ymm3,%ymm8,%ymm6 + vpmuludq %ymm4,%ymm8,%ymm2 + vpaddq %ymm6,%ymm11,%ymm11 + vpaddq %ymm2,%ymm12,%ymm12 + vmovdqu 20(%rax),%ymm2 + vpmuludq %ymm1,%ymm9,%ymm6 + vpmuludq %ymm0,%ymm9,%ymm9 + vpaddq %ymm6,%ymm14,%ymm14 + vpaddq %ymm9,%ymm13,%ymm13 + + vpmuludq %ymm1,%ymm2,%ymm6 + vpmuludq %ymm0,%ymm2,%ymm2 + vpaddq %ymm6,%ymm15,%ymm15 + vpaddq %ymm2,%ymm14,%ymm14 + vpmuludq %ymm3,%ymm10,%ymm6 + vpmuludq %ymm4,%ymm10,%ymm2 + vpaddq %ymm6,%ymm12,%ymm12 + vpaddq %ymm2,%ymm13,%ymm13 + + vpmuludq %ymm3,%ymm5,%ymm3 + vpmuludq %ymm4,%ymm5,%ymm4 + vpaddq %ymm3,%ymm13,%ymm2 + vpaddq %ymm4,%ymm14,%ymm3 + vpmuludq 84(%rax),%ymm0,%ymm4 + vpmuludq %ymm1,%ymm5,%ymm0 + vmovdqa 64(%rcx),%ymm5 + vpaddq %ymm4,%ymm15,%ymm4 + vpaddq %ymm0,%ymm11,%ymm0 + + + + + vpsrldq $8,%ymm12,%ymm8 + vpsrldq $8,%ymm2,%ymm9 + vpsrldq $8,%ymm3,%ymm10 + vpsrldq $8,%ymm4,%ymm6 + vpsrldq $8,%ymm0,%ymm7 + vpaddq %ymm8,%ymm12,%ymm12 + vpaddq %ymm9,%ymm2,%ymm2 + vpaddq %ymm10,%ymm3,%ymm3 + vpaddq %ymm6,%ymm4,%ymm4 + vpaddq %ymm7,%ymm0,%ymm0 + + vpermq $0x2,%ymm3,%ymm10 + vpermq $0x2,%ymm4,%ymm6 + vpermq $0x2,%ymm0,%ymm7 + vpermq $0x2,%ymm12,%ymm8 + vpermq $0x2,%ymm2,%ymm9 + vpaddq %ymm10,%ymm3,%ymm3 + vpaddq %ymm6,%ymm4,%ymm4 + vpaddq %ymm7,%ymm0,%ymm0 + vpaddq %ymm8,%ymm12,%ymm12 + vpaddq %ymm9,%ymm2,%ymm2 + + + + + vpsrlq $26,%ymm3,%ymm14 + vpand %ymm5,%ymm3,%ymm3 + vpaddq %ymm14,%ymm4,%ymm4 + + vpsrlq $26,%ymm0,%ymm11 + vpand %ymm5,%ymm0,%ymm0 + vpaddq %ymm11,%ymm12,%ymm1 + + vpsrlq $26,%ymm4,%ymm15 + vpand %ymm5,%ymm4,%ymm4 + + vpsrlq $26,%ymm1,%ymm12 + vpand %ymm5,%ymm1,%ymm1 + vpaddq %ymm12,%ymm2,%ymm2 + + vpaddq %ymm15,%ymm0,%ymm0 + vpsllq $2,%ymm15,%ymm15 + vpaddq %ymm15,%ymm0,%ymm0 + + vpsrlq $26,%ymm2,%ymm13 + vpand %ymm5,%ymm2,%ymm2 + vpaddq %ymm13,%ymm3,%ymm3 + + vpsrlq $26,%ymm0,%ymm11 + vpand %ymm5,%ymm0,%ymm0 + vpaddq %ymm11,%ymm1,%ymm1 + + vpsrlq $26,%ymm3,%ymm14 + vpand %ymm5,%ymm3,%ymm3 + vpaddq %ymm14,%ymm4,%ymm4 + + vmovd %xmm0,-112(%rdi) + vmovd %xmm1,-108(%rdi) + vmovd %xmm2,-104(%rdi) + vmovd %xmm3,-100(%rdi) + vmovd %xmm4,-96(%rdi) + leaq 8(%r11),%rsp +.cfi_def_cfa %rsp,8 + vzeroupper + .byte 0xf3,0xc3 +.cfi_endproc +.size poly1305_blocks_avx2,.-poly1305_blocks_avx2 +.type poly1305_blocks_avx512,@function +.align 32 +poly1305_blocks_avx512: +.cfi_startproc +.Lblocks_avx512: + movl $15,%eax + kmovw %eax,%k2 + leaq -8(%rsp),%r11 +.cfi_def_cfa %r11,16 + subq $0x128,%rsp + leaq .Lconst(%rip),%rcx + leaq 48+64(%rdi),%rdi + vmovdqa 96(%rcx),%ymm9 + + + vmovdqu -64(%rdi),%xmm11 + andq $-512,%rsp + vmovdqu -48(%rdi),%xmm12 + movq $0x20,%rax + vmovdqu -32(%rdi),%xmm7 + vmovdqu -16(%rdi),%xmm13 + vmovdqu 0(%rdi),%xmm8 + vmovdqu 16(%rdi),%xmm14 + vmovdqu 32(%rdi),%xmm10 + vmovdqu 48(%rdi),%xmm15 + vmovdqu 64(%rdi),%xmm6 + vpermd %zmm11,%zmm9,%zmm16 + vpbroadcastq 64(%rcx),%zmm5 + vpermd %zmm12,%zmm9,%zmm17 + vpermd %zmm7,%zmm9,%zmm21 + vpermd %zmm13,%zmm9,%zmm18 + vmovdqa64 %zmm16,0(%rsp){%k2} + vpsrlq $32,%zmm16,%zmm7 + vpermd %zmm8,%zmm9,%zmm22 + vmovdqu64 %zmm17,0(%rsp,%rax,1){%k2} + vpsrlq $32,%zmm17,%zmm8 + vpermd %zmm14,%zmm9,%zmm19 + vmovdqa64 %zmm21,64(%rsp){%k2} + vpermd %zmm10,%zmm9,%zmm23 + vpermd %zmm15,%zmm9,%zmm20 + vmovdqu64 %zmm18,64(%rsp,%rax,1){%k2} + vpermd %zmm6,%zmm9,%zmm24 + vmovdqa64 %zmm22,128(%rsp){%k2} + 
vmovdqu64 %zmm19,128(%rsp,%rax,1){%k2} + vmovdqa64 %zmm23,192(%rsp){%k2} + vmovdqu64 %zmm20,192(%rsp,%rax,1){%k2} + vmovdqa64 %zmm24,256(%rsp){%k2} + + + + + + + + + + + vpmuludq %zmm7,%zmm16,%zmm11 + vpmuludq %zmm7,%zmm17,%zmm12 + vpmuludq %zmm7,%zmm18,%zmm13 + vpmuludq %zmm7,%zmm19,%zmm14 + vpmuludq %zmm7,%zmm20,%zmm15 + vpsrlq $32,%zmm18,%zmm9 + + vpmuludq %zmm8,%zmm24,%zmm25 + vpmuludq %zmm8,%zmm16,%zmm26 + vpmuludq %zmm8,%zmm17,%zmm27 + vpmuludq %zmm8,%zmm18,%zmm28 + vpmuludq %zmm8,%zmm19,%zmm29 + vpsrlq $32,%zmm19,%zmm10 + vpaddq %zmm25,%zmm11,%zmm11 + vpaddq %zmm26,%zmm12,%zmm12 + vpaddq %zmm27,%zmm13,%zmm13 + vpaddq %zmm28,%zmm14,%zmm14 + vpaddq %zmm29,%zmm15,%zmm15 + + vpmuludq %zmm9,%zmm23,%zmm25 + vpmuludq %zmm9,%zmm24,%zmm26 + vpmuludq %zmm9,%zmm17,%zmm28 + vpmuludq %zmm9,%zmm18,%zmm29 + vpmuludq %zmm9,%zmm16,%zmm27 + vpsrlq $32,%zmm20,%zmm6 + vpaddq %zmm25,%zmm11,%zmm11 + vpaddq %zmm26,%zmm12,%zmm12 + vpaddq %zmm28,%zmm14,%zmm14 + vpaddq %zmm29,%zmm15,%zmm15 + vpaddq %zmm27,%zmm13,%zmm13 + + vpmuludq %zmm10,%zmm22,%zmm25 + vpmuludq %zmm10,%zmm16,%zmm28 + vpmuludq %zmm10,%zmm17,%zmm29 + vpmuludq %zmm10,%zmm23,%zmm26 + vpmuludq %zmm10,%zmm24,%zmm27 + vpaddq %zmm25,%zmm11,%zmm11 + vpaddq %zmm28,%zmm14,%zmm14 + vpaddq %zmm29,%zmm15,%zmm15 + vpaddq %zmm26,%zmm12,%zmm12 + vpaddq %zmm27,%zmm13,%zmm13 + + vpmuludq %zmm6,%zmm24,%zmm28 + vpmuludq %zmm6,%zmm16,%zmm29 + vpmuludq %zmm6,%zmm21,%zmm25 + vpmuludq %zmm6,%zmm22,%zmm26 + vpmuludq %zmm6,%zmm23,%zmm27 + vpaddq %zmm28,%zmm14,%zmm14 + vpaddq %zmm29,%zmm15,%zmm15 + vpaddq %zmm25,%zmm11,%zmm11 + vpaddq %zmm26,%zmm12,%zmm12 + vpaddq %zmm27,%zmm13,%zmm13 + + + + vmovdqu64 0(%rsi),%zmm10 + vmovdqu64 64(%rsi),%zmm6 + leaq 128(%rsi),%rsi + + + + + vpsrlq $26,%zmm14,%zmm28 + vpandq %zmm5,%zmm14,%zmm14 + vpaddq %zmm28,%zmm15,%zmm15 + + vpsrlq $26,%zmm11,%zmm25 + vpandq %zmm5,%zmm11,%zmm11 + vpaddq %zmm25,%zmm12,%zmm12 + + vpsrlq $26,%zmm15,%zmm29 + vpandq %zmm5,%zmm15,%zmm15 + + vpsrlq $26,%zmm12,%zmm26 + vpandq %zmm5,%zmm12,%zmm12 + vpaddq %zmm26,%zmm13,%zmm13 + + vpaddq %zmm29,%zmm11,%zmm11 + vpsllq $2,%zmm29,%zmm29 + vpaddq %zmm29,%zmm11,%zmm11 + + vpsrlq $26,%zmm13,%zmm27 + vpandq %zmm5,%zmm13,%zmm13 + vpaddq %zmm27,%zmm14,%zmm14 + + vpsrlq $26,%zmm11,%zmm25 + vpandq %zmm5,%zmm11,%zmm11 + vpaddq %zmm25,%zmm12,%zmm12 + + vpsrlq $26,%zmm14,%zmm28 + vpandq %zmm5,%zmm14,%zmm14 + vpaddq %zmm28,%zmm15,%zmm15 + + + + + + vpunpcklqdq %zmm6,%zmm10,%zmm7 + vpunpckhqdq %zmm6,%zmm10,%zmm6 + + + + + + + vmovdqa32 128(%rcx),%zmm25 + movl $0x7777,%eax + kmovw %eax,%k1 + + vpermd %zmm16,%zmm25,%zmm16 + vpermd %zmm17,%zmm25,%zmm17 + vpermd %zmm18,%zmm25,%zmm18 + vpermd %zmm19,%zmm25,%zmm19 + vpermd %zmm20,%zmm25,%zmm20 + + vpermd %zmm11,%zmm25,%zmm16{%k1} + vpermd %zmm12,%zmm25,%zmm17{%k1} + vpermd %zmm13,%zmm25,%zmm18{%k1} + vpermd %zmm14,%zmm25,%zmm19{%k1} + vpermd %zmm15,%zmm25,%zmm20{%k1} + + vpslld $2,%zmm17,%zmm21 + vpslld $2,%zmm18,%zmm22 + vpslld $2,%zmm19,%zmm23 + vpslld $2,%zmm20,%zmm24 + vpaddd %zmm17,%zmm21,%zmm21 + vpaddd %zmm18,%zmm22,%zmm22 + vpaddd %zmm19,%zmm23,%zmm23 + vpaddd %zmm20,%zmm24,%zmm24 + + vpbroadcastq 32(%rcx),%zmm30 + + vpsrlq $52,%zmm7,%zmm9 + vpsllq $12,%zmm6,%zmm10 + vporq %zmm10,%zmm9,%zmm9 + vpsrlq $26,%zmm7,%zmm8 + vpsrlq $14,%zmm6,%zmm10 + vpsrlq $40,%zmm6,%zmm6 + vpandq %zmm5,%zmm9,%zmm9 + vpandq %zmm5,%zmm7,%zmm7 + + + + + vpaddq %zmm2,%zmm9,%zmm2 + subq $192,%rdx + jbe .Ltail_avx512 + jmp .Loop_avx512 + +.align 32 +.Loop_avx512: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + vpmuludq %zmm2,%zmm17,%zmm14 
+ vpaddq %zmm0,%zmm7,%zmm0 + vpmuludq %zmm2,%zmm18,%zmm15 + vpandq %zmm5,%zmm8,%zmm8 + vpmuludq %zmm2,%zmm23,%zmm11 + vpandq %zmm5,%zmm10,%zmm10 + vpmuludq %zmm2,%zmm24,%zmm12 + vporq %zmm30,%zmm6,%zmm6 + vpmuludq %zmm2,%zmm16,%zmm13 + vpaddq %zmm1,%zmm8,%zmm1 + vpaddq %zmm3,%zmm10,%zmm3 + vpaddq %zmm4,%zmm6,%zmm4 + + vmovdqu64 0(%rsi),%zmm10 + vmovdqu64 64(%rsi),%zmm6 + leaq 128(%rsi),%rsi + vpmuludq %zmm0,%zmm19,%zmm28 + vpmuludq %zmm0,%zmm20,%zmm29 + vpmuludq %zmm0,%zmm16,%zmm25 + vpmuludq %zmm0,%zmm17,%zmm26 + vpaddq %zmm28,%zmm14,%zmm14 + vpaddq %zmm29,%zmm15,%zmm15 + vpaddq %zmm25,%zmm11,%zmm11 + vpaddq %zmm26,%zmm12,%zmm12 + + vpmuludq %zmm1,%zmm18,%zmm28 + vpmuludq %zmm1,%zmm19,%zmm29 + vpmuludq %zmm1,%zmm24,%zmm25 + vpmuludq %zmm0,%zmm18,%zmm27 + vpaddq %zmm28,%zmm14,%zmm14 + vpaddq %zmm29,%zmm15,%zmm15 + vpaddq %zmm25,%zmm11,%zmm11 + vpaddq %zmm27,%zmm13,%zmm13 + + vpunpcklqdq %zmm6,%zmm10,%zmm7 + vpunpckhqdq %zmm6,%zmm10,%zmm6 + + vpmuludq %zmm3,%zmm16,%zmm28 + vpmuludq %zmm3,%zmm17,%zmm29 + vpmuludq %zmm1,%zmm16,%zmm26 + vpmuludq %zmm1,%zmm17,%zmm27 + vpaddq %zmm28,%zmm14,%zmm14 + vpaddq %zmm29,%zmm15,%zmm15 + vpaddq %zmm26,%zmm12,%zmm12 + vpaddq %zmm27,%zmm13,%zmm13 + + vpmuludq %zmm4,%zmm24,%zmm28 + vpmuludq %zmm4,%zmm16,%zmm29 + vpmuludq %zmm3,%zmm22,%zmm25 + vpmuludq %zmm3,%zmm23,%zmm26 + vpaddq %zmm28,%zmm14,%zmm14 + vpmuludq %zmm3,%zmm24,%zmm27 + vpaddq %zmm29,%zmm15,%zmm15 + vpaddq %zmm25,%zmm11,%zmm11 + vpaddq %zmm26,%zmm12,%zmm12 + vpaddq %zmm27,%zmm13,%zmm13 + + vpmuludq %zmm4,%zmm21,%zmm25 + vpmuludq %zmm4,%zmm22,%zmm26 + vpmuludq %zmm4,%zmm23,%zmm27 + vpaddq %zmm25,%zmm11,%zmm0 + vpaddq %zmm26,%zmm12,%zmm1 + vpaddq %zmm27,%zmm13,%zmm2 + + + + + vpsrlq $52,%zmm7,%zmm9 + vpsllq $12,%zmm6,%zmm10 + + vpsrlq $26,%zmm14,%zmm3 + vpandq %zmm5,%zmm14,%zmm14 + vpaddq %zmm3,%zmm15,%zmm4 + + vporq %zmm10,%zmm9,%zmm9 + + vpsrlq $26,%zmm0,%zmm11 + vpandq %zmm5,%zmm0,%zmm0 + vpaddq %zmm11,%zmm1,%zmm1 + + vpandq %zmm5,%zmm9,%zmm9 + + vpsrlq $26,%zmm4,%zmm15 + vpandq %zmm5,%zmm4,%zmm4 + + vpsrlq $26,%zmm1,%zmm12 + vpandq %zmm5,%zmm1,%zmm1 + vpaddq %zmm12,%zmm2,%zmm2 + + vpaddq %zmm15,%zmm0,%zmm0 + vpsllq $2,%zmm15,%zmm15 + vpaddq %zmm15,%zmm0,%zmm0 + + vpaddq %zmm9,%zmm2,%zmm2 + vpsrlq $26,%zmm7,%zmm8 + + vpsrlq $26,%zmm2,%zmm13 + vpandq %zmm5,%zmm2,%zmm2 + vpaddq %zmm13,%zmm14,%zmm3 + + vpsrlq $14,%zmm6,%zmm10 + + vpsrlq $26,%zmm0,%zmm11 + vpandq %zmm5,%zmm0,%zmm0 + vpaddq %zmm11,%zmm1,%zmm1 + + vpsrlq $40,%zmm6,%zmm6 + + vpsrlq $26,%zmm3,%zmm14 + vpandq %zmm5,%zmm3,%zmm3 + vpaddq %zmm14,%zmm4,%zmm4 + + vpandq %zmm5,%zmm7,%zmm7 + + + + + subq $128,%rdx + ja .Loop_avx512 + +.Ltail_avx512: + + + + + + vpsrlq $32,%zmm16,%zmm16 + vpsrlq $32,%zmm17,%zmm17 + vpsrlq $32,%zmm18,%zmm18 + vpsrlq $32,%zmm23,%zmm23 + vpsrlq $32,%zmm24,%zmm24 + vpsrlq $32,%zmm19,%zmm19 + vpsrlq $32,%zmm20,%zmm20 + vpsrlq $32,%zmm21,%zmm21 + vpsrlq $32,%zmm22,%zmm22 + + + + leaq (%rsi,%rdx,1),%rsi + + + vpaddq %zmm0,%zmm7,%zmm0 + + vpmuludq %zmm2,%zmm17,%zmm14 + vpmuludq %zmm2,%zmm18,%zmm15 + vpmuludq %zmm2,%zmm23,%zmm11 + vpandq %zmm5,%zmm8,%zmm8 + vpmuludq %zmm2,%zmm24,%zmm12 + vpandq %zmm5,%zmm10,%zmm10 + vpmuludq %zmm2,%zmm16,%zmm13 + vporq %zmm30,%zmm6,%zmm6 + vpaddq %zmm1,%zmm8,%zmm1 + vpaddq %zmm3,%zmm10,%zmm3 + vpaddq %zmm4,%zmm6,%zmm4 + + vmovdqu 0(%rsi),%xmm7 + vpmuludq %zmm0,%zmm19,%zmm28 + vpmuludq %zmm0,%zmm20,%zmm29 + vpmuludq %zmm0,%zmm16,%zmm25 + vpmuludq %zmm0,%zmm17,%zmm26 + vpaddq %zmm28,%zmm14,%zmm14 + vpaddq %zmm29,%zmm15,%zmm15 + vpaddq %zmm25,%zmm11,%zmm11 + vpaddq %zmm26,%zmm12,%zmm12 + + 
vmovdqu 16(%rsi),%xmm8 + vpmuludq %zmm1,%zmm18,%zmm28 + vpmuludq %zmm1,%zmm19,%zmm29 + vpmuludq %zmm1,%zmm24,%zmm25 + vpmuludq %zmm0,%zmm18,%zmm27 + vpaddq %zmm28,%zmm14,%zmm14 + vpaddq %zmm29,%zmm15,%zmm15 + vpaddq %zmm25,%zmm11,%zmm11 + vpaddq %zmm27,%zmm13,%zmm13 + + vinserti128 $1,32(%rsi),%ymm7,%ymm7 + vpmuludq %zmm3,%zmm16,%zmm28 + vpmuludq %zmm3,%zmm17,%zmm29 + vpmuludq %zmm1,%zmm16,%zmm26 + vpmuludq %zmm1,%zmm17,%zmm27 + vpaddq %zmm28,%zmm14,%zmm14 + vpaddq %zmm29,%zmm15,%zmm15 + vpaddq %zmm26,%zmm12,%zmm12 + vpaddq %zmm27,%zmm13,%zmm13 + + vinserti128 $1,48(%rsi),%ymm8,%ymm8 + vpmuludq %zmm4,%zmm24,%zmm28 + vpmuludq %zmm4,%zmm16,%zmm29 + vpmuludq %zmm3,%zmm22,%zmm25 + vpmuludq %zmm3,%zmm23,%zmm26 + vpmuludq %zmm3,%zmm24,%zmm27 + vpaddq %zmm28,%zmm14,%zmm3 + vpaddq %zmm29,%zmm15,%zmm15 + vpaddq %zmm25,%zmm11,%zmm11 + vpaddq %zmm26,%zmm12,%zmm12 + vpaddq %zmm27,%zmm13,%zmm13 + + vpmuludq %zmm4,%zmm21,%zmm25 + vpmuludq %zmm4,%zmm22,%zmm26 + vpmuludq %zmm4,%zmm23,%zmm27 + vpaddq %zmm25,%zmm11,%zmm0 + vpaddq %zmm26,%zmm12,%zmm1 + vpaddq %zmm27,%zmm13,%zmm2 + + + + + movl $1,%eax + vpermq $0xb1,%zmm3,%zmm14 + vpermq $0xb1,%zmm15,%zmm4 + vpermq $0xb1,%zmm0,%zmm11 + vpermq $0xb1,%zmm1,%zmm12 + vpermq $0xb1,%zmm2,%zmm13 + vpaddq %zmm14,%zmm3,%zmm3 + vpaddq %zmm15,%zmm4,%zmm4 + vpaddq %zmm11,%zmm0,%zmm0 + vpaddq %zmm12,%zmm1,%zmm1 + vpaddq %zmm13,%zmm2,%zmm2 + + kmovw %eax,%k3 + vpermq $0x2,%zmm3,%zmm14 + vpermq $0x2,%zmm4,%zmm15 + vpermq $0x2,%zmm0,%zmm11 + vpermq $0x2,%zmm1,%zmm12 + vpermq $0x2,%zmm2,%zmm13 + vpaddq %zmm14,%zmm3,%zmm3 + vpaddq %zmm15,%zmm4,%zmm4 + vpaddq %zmm11,%zmm0,%zmm0 + vpaddq %zmm12,%zmm1,%zmm1 + vpaddq %zmm13,%zmm2,%zmm2 + + vextracti64x4 $0x1,%zmm3,%ymm14 + vextracti64x4 $0x1,%zmm4,%ymm15 + vextracti64x4 $0x1,%zmm0,%ymm11 + vextracti64x4 $0x1,%zmm1,%ymm12 + vextracti64x4 $0x1,%zmm2,%ymm13 + vpaddq %zmm14,%zmm3,%zmm3{%k3}{z} + vpaddq %zmm15,%zmm4,%zmm4{%k3}{z} + vpaddq %zmm11,%zmm0,%zmm0{%k3}{z} + vpaddq %zmm12,%zmm1,%zmm1{%k3}{z} + vpaddq %zmm13,%zmm2,%zmm2{%k3}{z} + + + + vpsrlq $26,%ymm3,%ymm14 + vpand %ymm5,%ymm3,%ymm3 + vpsrldq $6,%ymm7,%ymm9 + vpsrldq $6,%ymm8,%ymm10 + vpunpckhqdq %ymm8,%ymm7,%ymm6 + vpaddq %ymm14,%ymm4,%ymm4 + + vpsrlq $26,%ymm0,%ymm11 + vpand %ymm5,%ymm0,%ymm0 + vpunpcklqdq %ymm10,%ymm9,%ymm9 + vpunpcklqdq %ymm8,%ymm7,%ymm7 + vpaddq %ymm11,%ymm1,%ymm1 + + vpsrlq $26,%ymm4,%ymm15 + vpand %ymm5,%ymm4,%ymm4 + + vpsrlq $26,%ymm1,%ymm12 + vpand %ymm5,%ymm1,%ymm1 + vpsrlq $30,%ymm9,%ymm10 + vpsrlq $4,%ymm9,%ymm9 + vpaddq %ymm12,%ymm2,%ymm2 + + vpaddq %ymm15,%ymm0,%ymm0 + vpsllq $2,%ymm15,%ymm15 + vpsrlq $26,%ymm7,%ymm8 + vpsrlq $40,%ymm6,%ymm6 + vpaddq %ymm15,%ymm0,%ymm0 + + vpsrlq $26,%ymm2,%ymm13 + vpand %ymm5,%ymm2,%ymm2 + vpand %ymm5,%ymm9,%ymm9 + vpand %ymm5,%ymm7,%ymm7 + vpaddq %ymm13,%ymm3,%ymm3 + + vpsrlq $26,%ymm0,%ymm11 + vpand %ymm5,%ymm0,%ymm0 + vpaddq %ymm2,%ymm9,%ymm2 + vpand %ymm5,%ymm8,%ymm8 + vpaddq %ymm11,%ymm1,%ymm1 + + vpsrlq $26,%ymm3,%ymm14 + vpand %ymm5,%ymm3,%ymm3 + vpand %ymm5,%ymm10,%ymm10 + vpor 32(%rcx),%ymm6,%ymm6 + vpaddq %ymm14,%ymm4,%ymm4 + + leaq 144(%rsp),%rax + addq $64,%rdx + jnz .Ltail_avx2 + + vpsubq %ymm9,%ymm2,%ymm2 + vmovd %xmm0,-112(%rdi) + vmovd %xmm1,-108(%rdi) + vmovd %xmm2,-104(%rdi) + vmovd %xmm3,-100(%rdi) + vmovd %xmm4,-96(%rdi) + vzeroall + leaq 8(%r11),%rsp +.cfi_def_cfa %rsp,8 + .byte 0xf3,0xc3 +.cfi_endproc +.size poly1305_blocks_avx512,.-poly1305_blocks_avx512 +.type poly1305_init_base2_44,@function +.align 32 +poly1305_init_base2_44: + xorq %rax,%rax + movq %rax,0(%rdi) + movq %rax,8(%rdi) + 
movq %rax,16(%rdi) + +.Linit_base2_44: + leaq poly1305_blocks_vpmadd52(%rip),%r10 + leaq poly1305_emit_base2_44(%rip),%r11 + + movq $0x0ffffffc0fffffff,%rax + movq $0x0ffffffc0ffffffc,%rcx + andq 0(%rsi),%rax + movq $0x00000fffffffffff,%r8 + andq 8(%rsi),%rcx + movq $0x00000fffffffffff,%r9 + andq %rax,%r8 + shrdq $44,%rcx,%rax + movq %r8,40(%rdi) + andq %r9,%rax + shrq $24,%rcx + movq %rax,48(%rdi) + leaq (%rax,%rax,4),%rax + movq %rcx,56(%rdi) + shlq $2,%rax + leaq (%rcx,%rcx,4),%rcx + shlq $2,%rcx + movq %rax,24(%rdi) + movq %rcx,32(%rdi) + movq $-1,64(%rdi) + movq %r10,0(%rdx) + movq %r11,8(%rdx) + movl $1,%eax + .byte 0xf3,0xc3 +.size poly1305_init_base2_44,.-poly1305_init_base2_44 +.type poly1305_blocks_vpmadd52,@function +.align 32 +poly1305_blocks_vpmadd52: + shrq $4,%rdx + jz .Lno_data_vpmadd52 + + shlq $40,%rcx + movq 64(%rdi),%r8 + + + + + + + movq $3,%rax + movq $1,%r10 + cmpq $4,%rdx + cmovaeq %r10,%rax + testq %r8,%r8 + cmovnsq %r10,%rax + + andq %rdx,%rax + jz .Lblocks_vpmadd52_4x + + subq %rax,%rdx + movl $7,%r10d + movl $1,%r11d + kmovw %r10d,%k7 + leaq .L2_44_inp_permd(%rip),%r10 + kmovw %r11d,%k1 + + vmovq %rcx,%xmm21 + vmovdqa64 0(%r10),%ymm19 + vmovdqa64 32(%r10),%ymm20 + vpermq $0xcf,%ymm21,%ymm21 + vmovdqa64 64(%r10),%ymm22 + + vmovdqu64 0(%rdi),%ymm16{%k7}{z} + vmovdqu64 40(%rdi),%ymm3{%k7}{z} + vmovdqu64 32(%rdi),%ymm4{%k7}{z} + vmovdqu64 24(%rdi),%ymm5{%k7}{z} + + vmovdqa64 96(%r10),%ymm23 + vmovdqa64 128(%r10),%ymm24 + + jmp .Loop_vpmadd52 + +.align 32 +.Loop_vpmadd52: + vmovdqu32 0(%rsi),%xmm18 + leaq 16(%rsi),%rsi + + vpermd %ymm18,%ymm19,%ymm18 + vpsrlvq %ymm20,%ymm18,%ymm18 + vpandq %ymm22,%ymm18,%ymm18 + vporq %ymm21,%ymm18,%ymm18 + + vpaddq %ymm18,%ymm16,%ymm16 + + vpermq $0,%ymm16,%ymm0{%k7}{z} + vpermq $85,%ymm16,%ymm1{%k7}{z} + vpermq $170,%ymm16,%ymm2{%k7}{z} + + vpxord %ymm16,%ymm16,%ymm16 + vpxord %ymm17,%ymm17,%ymm17 + + vpmadd52luq %ymm3,%ymm0,%ymm16 + vpmadd52huq %ymm3,%ymm0,%ymm17 + + vpmadd52luq %ymm4,%ymm1,%ymm16 + vpmadd52huq %ymm4,%ymm1,%ymm17 + + vpmadd52luq %ymm5,%ymm2,%ymm16 + vpmadd52huq %ymm5,%ymm2,%ymm17 + + vpsrlvq %ymm23,%ymm16,%ymm18 + vpsllvq %ymm24,%ymm17,%ymm17 + vpandq %ymm22,%ymm16,%ymm16 + + vpaddq %ymm18,%ymm17,%ymm17 + + vpermq $147,%ymm17,%ymm17 + + vpaddq %ymm17,%ymm16,%ymm16 + + vpsrlvq %ymm23,%ymm16,%ymm18 + vpandq %ymm22,%ymm16,%ymm16 + + vpermq $147,%ymm18,%ymm18 + + vpaddq %ymm18,%ymm16,%ymm16 + + vpermq $147,%ymm16,%ymm18{%k1}{z} + + vpaddq %ymm18,%ymm16,%ymm16 + vpsllq $2,%ymm18,%ymm18 + + vpaddq %ymm18,%ymm16,%ymm16 + + decq %rax + jnz .Loop_vpmadd52 + + vmovdqu64 %ymm16,0(%rdi){%k7} + + testq %rdx,%rdx + jnz .Lblocks_vpmadd52_4x + +.Lno_data_vpmadd52: + .byte 0xf3,0xc3 +.size poly1305_blocks_vpmadd52,.-poly1305_blocks_vpmadd52 +.type poly1305_blocks_vpmadd52_4x,@function +.align 32 +poly1305_blocks_vpmadd52_4x: + shrq $4,%rdx + jz .Lno_data_vpmadd52_4x + + shlq $40,%rcx + movq 64(%rdi),%r8 + +.Lblocks_vpmadd52_4x: + vpbroadcastq %rcx,%ymm31 + + vmovdqa64 .Lx_mask44(%rip),%ymm28 + movl $5,%eax + vmovdqa64 .Lx_mask42(%rip),%ymm29 + kmovw %eax,%k1 + + testq %r8,%r8 + js .Linit_vpmadd52 + + vmovq 0(%rdi),%xmm0 + vmovq 8(%rdi),%xmm1 + vmovq 16(%rdi),%xmm2 + + testq $3,%rdx + jnz .Lblocks_vpmadd52_2x_do + +.Lblocks_vpmadd52_4x_do: + vpbroadcastq 64(%rdi),%ymm3 + vpbroadcastq 96(%rdi),%ymm4 + vpbroadcastq 128(%rdi),%ymm5 + vpbroadcastq 160(%rdi),%ymm16 + +.Lblocks_vpmadd52_4x_key_loaded: + vpsllq $2,%ymm5,%ymm17 + vpaddq %ymm5,%ymm17,%ymm17 + vpsllq $2,%ymm17,%ymm17 + + testq $7,%rdx + jz .Lblocks_vpmadd52_8x + + vmovdqu64 
0(%rsi),%ymm26 + vmovdqu64 32(%rsi),%ymm27 + leaq 64(%rsi),%rsi + + vpunpcklqdq %ymm27,%ymm26,%ymm25 + vpunpckhqdq %ymm27,%ymm26,%ymm27 + + + + vpsrlq $24,%ymm27,%ymm26 + vporq %ymm31,%ymm26,%ymm26 + vpaddq %ymm26,%ymm2,%ymm2 + vpandq %ymm28,%ymm25,%ymm24 + vpsrlq $44,%ymm25,%ymm25 + vpsllq $20,%ymm27,%ymm27 + vporq %ymm27,%ymm25,%ymm25 + vpandq %ymm28,%ymm25,%ymm25 + + subq $4,%rdx + jz .Ltail_vpmadd52_4x + jmp .Loop_vpmadd52_4x + ud2 + +.align 32 +.Linit_vpmadd52: + vmovq 24(%rdi),%xmm16 + vmovq 56(%rdi),%xmm2 + vmovq 32(%rdi),%xmm17 + vmovq 40(%rdi),%xmm3 + vmovq 48(%rdi),%xmm4 + + vmovdqa %ymm3,%ymm0 + vmovdqa %ymm4,%ymm1 + vmovdqa %ymm2,%ymm5 + + movl $2,%eax + +.Lmul_init_vpmadd52: + vpxorq %ymm18,%ymm18,%ymm18 + vpmadd52luq %ymm2,%ymm16,%ymm18 + vpxorq %ymm19,%ymm19,%ymm19 + vpmadd52huq %ymm2,%ymm16,%ymm19 + vpxorq %ymm20,%ymm20,%ymm20 + vpmadd52luq %ymm2,%ymm17,%ymm20 + vpxorq %ymm21,%ymm21,%ymm21 + vpmadd52huq %ymm2,%ymm17,%ymm21 + vpxorq %ymm22,%ymm22,%ymm22 + vpmadd52luq %ymm2,%ymm3,%ymm22 + vpxorq %ymm23,%ymm23,%ymm23 + vpmadd52huq %ymm2,%ymm3,%ymm23 + + vpmadd52luq %ymm0,%ymm3,%ymm18 + vpmadd52huq %ymm0,%ymm3,%ymm19 + vpmadd52luq %ymm0,%ymm4,%ymm20 + vpmadd52huq %ymm0,%ymm4,%ymm21 + vpmadd52luq %ymm0,%ymm5,%ymm22 + vpmadd52huq %ymm0,%ymm5,%ymm23 + + vpmadd52luq %ymm1,%ymm17,%ymm18 + vpmadd52huq %ymm1,%ymm17,%ymm19 + vpmadd52luq %ymm1,%ymm3,%ymm20 + vpmadd52huq %ymm1,%ymm3,%ymm21 + vpmadd52luq %ymm1,%ymm4,%ymm22 + vpmadd52huq %ymm1,%ymm4,%ymm23 + + + + vpsrlq $44,%ymm18,%ymm30 + vpsllq $8,%ymm19,%ymm19 + vpandq %ymm28,%ymm18,%ymm0 + vpaddq %ymm30,%ymm19,%ymm19 + + vpaddq %ymm19,%ymm20,%ymm20 + + vpsrlq $44,%ymm20,%ymm30 + vpsllq $8,%ymm21,%ymm21 + vpandq %ymm28,%ymm20,%ymm1 + vpaddq %ymm30,%ymm21,%ymm21 + + vpaddq %ymm21,%ymm22,%ymm22 + + vpsrlq $42,%ymm22,%ymm30 + vpsllq $10,%ymm23,%ymm23 + vpandq %ymm29,%ymm22,%ymm2 + vpaddq %ymm30,%ymm23,%ymm23 + + vpaddq %ymm23,%ymm0,%ymm0 + vpsllq $2,%ymm23,%ymm23 + + vpaddq %ymm23,%ymm0,%ymm0 + + vpsrlq $44,%ymm0,%ymm30 + vpandq %ymm28,%ymm0,%ymm0 + + vpaddq %ymm30,%ymm1,%ymm1 + + decl %eax + jz .Ldone_init_vpmadd52 + + vpunpcklqdq %ymm4,%ymm1,%ymm4 + vpbroadcastq %xmm1,%xmm1 + vpunpcklqdq %ymm5,%ymm2,%ymm5 + vpbroadcastq %xmm2,%xmm2 + vpunpcklqdq %ymm3,%ymm0,%ymm3 + vpbroadcastq %xmm0,%xmm0 + + vpsllq $2,%ymm4,%ymm16 + vpsllq $2,%ymm5,%ymm17 + vpaddq %ymm4,%ymm16,%ymm16 + vpaddq %ymm5,%ymm17,%ymm17 + vpsllq $2,%ymm16,%ymm16 + vpsllq $2,%ymm17,%ymm17 + + jmp .Lmul_init_vpmadd52 + ud2 + +.align 32 +.Ldone_init_vpmadd52: + vinserti128 $1,%xmm4,%ymm1,%ymm4 + vinserti128 $1,%xmm5,%ymm2,%ymm5 + vinserti128 $1,%xmm3,%ymm0,%ymm3 + + vpermq $216,%ymm4,%ymm4 + vpermq $216,%ymm5,%ymm5 + vpermq $216,%ymm3,%ymm3 + + vpsllq $2,%ymm4,%ymm16 + vpaddq %ymm4,%ymm16,%ymm16 + vpsllq $2,%ymm16,%ymm16 + + vmovq 0(%rdi),%xmm0 + vmovq 8(%rdi),%xmm1 + vmovq 16(%rdi),%xmm2 + + testq $3,%rdx + jnz .Ldone_init_vpmadd52_2x + + vmovdqu64 %ymm3,64(%rdi) + vpbroadcastq %xmm3,%ymm3 + vmovdqu64 %ymm4,96(%rdi) + vpbroadcastq %xmm4,%ymm4 + vmovdqu64 %ymm5,128(%rdi) + vpbroadcastq %xmm5,%ymm5 + vmovdqu64 %ymm16,160(%rdi) + vpbroadcastq %xmm16,%ymm16 + + jmp .Lblocks_vpmadd52_4x_key_loaded + ud2 + +.align 32 +.Ldone_init_vpmadd52_2x: + vmovdqu64 %ymm3,64(%rdi) + vpsrldq $8,%ymm3,%ymm3 + vmovdqu64 %ymm4,96(%rdi) + vpsrldq $8,%ymm4,%ymm4 + vmovdqu64 %ymm5,128(%rdi) + vpsrldq $8,%ymm5,%ymm5 + vmovdqu64 %ymm16,160(%rdi) + vpsrldq $8,%ymm16,%ymm16 + jmp .Lblocks_vpmadd52_2x_key_loaded + ud2 + +.align 32 +.Lblocks_vpmadd52_2x_do: + vmovdqu64 128+8(%rdi),%ymm5{%k1}{z} + vmovdqu64 
160+8(%rdi),%ymm16{%k1}{z} + vmovdqu64 64+8(%rdi),%ymm3{%k1}{z} + vmovdqu64 96+8(%rdi),%ymm4{%k1}{z} + +.Lblocks_vpmadd52_2x_key_loaded: + vmovdqu64 0(%rsi),%ymm26 + vpxorq %ymm27,%ymm27,%ymm27 + leaq 32(%rsi),%rsi + + vpunpcklqdq %ymm27,%ymm26,%ymm25 + vpunpckhqdq %ymm27,%ymm26,%ymm27 + + + + vpsrlq $24,%ymm27,%ymm26 + vporq %ymm31,%ymm26,%ymm26 + vpaddq %ymm26,%ymm2,%ymm2 + vpandq %ymm28,%ymm25,%ymm24 + vpsrlq $44,%ymm25,%ymm25 + vpsllq $20,%ymm27,%ymm27 + vporq %ymm27,%ymm25,%ymm25 + vpandq %ymm28,%ymm25,%ymm25 + + jmp .Ltail_vpmadd52_2x + ud2 + +.align 32 +.Loop_vpmadd52_4x: + + vpaddq %ymm24,%ymm0,%ymm0 + vpaddq %ymm25,%ymm1,%ymm1 + + vpxorq %ymm18,%ymm18,%ymm18 + vpmadd52luq %ymm2,%ymm16,%ymm18 + vpxorq %ymm19,%ymm19,%ymm19 + vpmadd52huq %ymm2,%ymm16,%ymm19 + vpxorq %ymm20,%ymm20,%ymm20 + vpmadd52luq %ymm2,%ymm17,%ymm20 + vpxorq %ymm21,%ymm21,%ymm21 + vpmadd52huq %ymm2,%ymm17,%ymm21 + vpxorq %ymm22,%ymm22,%ymm22 + vpmadd52luq %ymm2,%ymm3,%ymm22 + vpxorq %ymm23,%ymm23,%ymm23 + vpmadd52huq %ymm2,%ymm3,%ymm23 + + vmovdqu64 0(%rsi),%ymm26 + vmovdqu64 32(%rsi),%ymm27 + leaq 64(%rsi),%rsi + vpmadd52luq %ymm0,%ymm3,%ymm18 + vpmadd52huq %ymm0,%ymm3,%ymm19 + vpmadd52luq %ymm0,%ymm4,%ymm20 + vpmadd52huq %ymm0,%ymm4,%ymm21 + vpmadd52luq %ymm0,%ymm5,%ymm22 + vpmadd52huq %ymm0,%ymm5,%ymm23 + + vpunpcklqdq %ymm27,%ymm26,%ymm25 + vpunpckhqdq %ymm27,%ymm26,%ymm27 + vpmadd52luq %ymm1,%ymm17,%ymm18 + vpmadd52huq %ymm1,%ymm17,%ymm19 + vpmadd52luq %ymm1,%ymm3,%ymm20 + vpmadd52huq %ymm1,%ymm3,%ymm21 + vpmadd52luq %ymm1,%ymm4,%ymm22 + vpmadd52huq %ymm1,%ymm4,%ymm23 + + + + vpsrlq $44,%ymm18,%ymm30 + vpsllq $8,%ymm19,%ymm19 + vpandq %ymm28,%ymm18,%ymm0 + vpaddq %ymm30,%ymm19,%ymm19 + + vpsrlq $24,%ymm27,%ymm26 + vporq %ymm31,%ymm26,%ymm26 + vpaddq %ymm19,%ymm20,%ymm20 + + vpsrlq $44,%ymm20,%ymm30 + vpsllq $8,%ymm21,%ymm21 + vpandq %ymm28,%ymm20,%ymm1 + vpaddq %ymm30,%ymm21,%ymm21 + + vpandq %ymm28,%ymm25,%ymm24 + vpsrlq $44,%ymm25,%ymm25 + vpsllq $20,%ymm27,%ymm27 + vpaddq %ymm21,%ymm22,%ymm22 + + vpsrlq $42,%ymm22,%ymm30 + vpsllq $10,%ymm23,%ymm23 + vpandq %ymm29,%ymm22,%ymm2 + vpaddq %ymm30,%ymm23,%ymm23 + + vpaddq %ymm26,%ymm2,%ymm2 + vpaddq %ymm23,%ymm0,%ymm0 + vpsllq $2,%ymm23,%ymm23 + + vpaddq %ymm23,%ymm0,%ymm0 + vporq %ymm27,%ymm25,%ymm25 + vpandq %ymm28,%ymm25,%ymm25 + + vpsrlq $44,%ymm0,%ymm30 + vpandq %ymm28,%ymm0,%ymm0 + + vpaddq %ymm30,%ymm1,%ymm1 + + subq $4,%rdx + jnz .Loop_vpmadd52_4x + +.Ltail_vpmadd52_4x: + vmovdqu64 128(%rdi),%ymm5 + vmovdqu64 160(%rdi),%ymm16 + vmovdqu64 64(%rdi),%ymm3 + vmovdqu64 96(%rdi),%ymm4 + +.Ltail_vpmadd52_2x: + vpsllq $2,%ymm5,%ymm17 + vpaddq %ymm5,%ymm17,%ymm17 + vpsllq $2,%ymm17,%ymm17 + + + vpaddq %ymm24,%ymm0,%ymm0 + vpaddq %ymm25,%ymm1,%ymm1 + + vpxorq %ymm18,%ymm18,%ymm18 + vpmadd52luq %ymm2,%ymm16,%ymm18 + vpxorq %ymm19,%ymm19,%ymm19 + vpmadd52huq %ymm2,%ymm16,%ymm19 + vpxorq %ymm20,%ymm20,%ymm20 + vpmadd52luq %ymm2,%ymm17,%ymm20 + vpxorq %ymm21,%ymm21,%ymm21 + vpmadd52huq %ymm2,%ymm17,%ymm21 + vpxorq %ymm22,%ymm22,%ymm22 + vpmadd52luq %ymm2,%ymm3,%ymm22 + vpxorq %ymm23,%ymm23,%ymm23 + vpmadd52huq %ymm2,%ymm3,%ymm23 + + vpmadd52luq %ymm0,%ymm3,%ymm18 + vpmadd52huq %ymm0,%ymm3,%ymm19 + vpmadd52luq %ymm0,%ymm4,%ymm20 + vpmadd52huq %ymm0,%ymm4,%ymm21 + vpmadd52luq %ymm0,%ymm5,%ymm22 + vpmadd52huq %ymm0,%ymm5,%ymm23 + + vpmadd52luq %ymm1,%ymm17,%ymm18 + vpmadd52huq %ymm1,%ymm17,%ymm19 + vpmadd52luq %ymm1,%ymm3,%ymm20 + vpmadd52huq %ymm1,%ymm3,%ymm21 + vpmadd52luq %ymm1,%ymm4,%ymm22 + vpmadd52huq %ymm1,%ymm4,%ymm23 + + + + + movl $1,%eax + kmovw %eax,%k1 + 
vpsrldq $8,%ymm18,%ymm24 + vpsrldq $8,%ymm19,%ymm0 + vpsrldq $8,%ymm20,%ymm25 + vpsrldq $8,%ymm21,%ymm1 + vpaddq %ymm24,%ymm18,%ymm18 + vpaddq %ymm0,%ymm19,%ymm19 + vpsrldq $8,%ymm22,%ymm26 + vpsrldq $8,%ymm23,%ymm2 + vpaddq %ymm25,%ymm20,%ymm20 + vpaddq %ymm1,%ymm21,%ymm21 + vpermq $0x2,%ymm18,%ymm24 + vpermq $0x2,%ymm19,%ymm0 + vpaddq %ymm26,%ymm22,%ymm22 + vpaddq %ymm2,%ymm23,%ymm23 + + vpermq $0x2,%ymm20,%ymm25 + vpermq $0x2,%ymm21,%ymm1 + vpaddq %ymm24,%ymm18,%ymm18{%k1}{z} + vpaddq %ymm0,%ymm19,%ymm19{%k1}{z} + vpermq $0x2,%ymm22,%ymm26 + vpermq $0x2,%ymm23,%ymm2 + vpaddq %ymm25,%ymm20,%ymm20{%k1}{z} + vpaddq %ymm1,%ymm21,%ymm21{%k1}{z} + vpaddq %ymm26,%ymm22,%ymm22{%k1}{z} + vpaddq %ymm2,%ymm23,%ymm23{%k1}{z} + + + + vpsrlq $44,%ymm18,%ymm30 + vpsllq $8,%ymm19,%ymm19 + vpandq %ymm28,%ymm18,%ymm0 + vpaddq %ymm30,%ymm19,%ymm19 + + vpaddq %ymm19,%ymm20,%ymm20 + + vpsrlq $44,%ymm20,%ymm30 + vpsllq $8,%ymm21,%ymm21 + vpandq %ymm28,%ymm20,%ymm1 + vpaddq %ymm30,%ymm21,%ymm21 + + vpaddq %ymm21,%ymm22,%ymm22 + + vpsrlq $42,%ymm22,%ymm30 + vpsllq $10,%ymm23,%ymm23 + vpandq %ymm29,%ymm22,%ymm2 + vpaddq %ymm30,%ymm23,%ymm23 + + vpaddq %ymm23,%ymm0,%ymm0 + vpsllq $2,%ymm23,%ymm23 + + vpaddq %ymm23,%ymm0,%ymm0 + + vpsrlq $44,%ymm0,%ymm30 + vpandq %ymm28,%ymm0,%ymm0 + + vpaddq %ymm30,%ymm1,%ymm1 + + + subq $2,%rdx + ja .Lblocks_vpmadd52_4x_do + + vmovq %xmm0,0(%rdi) + vmovq %xmm1,8(%rdi) + vmovq %xmm2,16(%rdi) + vzeroall + +.Lno_data_vpmadd52_4x: + .byte 0xf3,0xc3 +.size poly1305_blocks_vpmadd52_4x,.-poly1305_blocks_vpmadd52_4x +.type poly1305_blocks_vpmadd52_8x,@function +.align 32 +poly1305_blocks_vpmadd52_8x: + shrq $4,%rdx + jz .Lno_data_vpmadd52_8x + + shlq $40,%rcx + movq 64(%rdi),%r8 + + vmovdqa64 .Lx_mask44(%rip),%ymm28 + vmovdqa64 .Lx_mask42(%rip),%ymm29 + + testq %r8,%r8 + js .Linit_vpmadd52 + + vmovq 0(%rdi),%xmm0 + vmovq 8(%rdi),%xmm1 + vmovq 16(%rdi),%xmm2 + +.Lblocks_vpmadd52_8x: + + + + vmovdqu64 128(%rdi),%ymm5 + vmovdqu64 160(%rdi),%ymm16 + vmovdqu64 64(%rdi),%ymm3 + vmovdqu64 96(%rdi),%ymm4 + + vpsllq $2,%ymm5,%ymm17 + vpaddq %ymm5,%ymm17,%ymm17 + vpsllq $2,%ymm17,%ymm17 + + vpbroadcastq %xmm5,%ymm8 + vpbroadcastq %xmm3,%ymm6 + vpbroadcastq %xmm4,%ymm7 + + vpxorq %ymm18,%ymm18,%ymm18 + vpmadd52luq %ymm8,%ymm16,%ymm18 + vpxorq %ymm19,%ymm19,%ymm19 + vpmadd52huq %ymm8,%ymm16,%ymm19 + vpxorq %ymm20,%ymm20,%ymm20 + vpmadd52luq %ymm8,%ymm17,%ymm20 + vpxorq %ymm21,%ymm21,%ymm21 + vpmadd52huq %ymm8,%ymm17,%ymm21 + vpxorq %ymm22,%ymm22,%ymm22 + vpmadd52luq %ymm8,%ymm3,%ymm22 + vpxorq %ymm23,%ymm23,%ymm23 + vpmadd52huq %ymm8,%ymm3,%ymm23 + + vpmadd52luq %ymm6,%ymm3,%ymm18 + vpmadd52huq %ymm6,%ymm3,%ymm19 + vpmadd52luq %ymm6,%ymm4,%ymm20 + vpmadd52huq %ymm6,%ymm4,%ymm21 + vpmadd52luq %ymm6,%ymm5,%ymm22 + vpmadd52huq %ymm6,%ymm5,%ymm23 + + vpmadd52luq %ymm7,%ymm17,%ymm18 + vpmadd52huq %ymm7,%ymm17,%ymm19 + vpmadd52luq %ymm7,%ymm3,%ymm20 + vpmadd52huq %ymm7,%ymm3,%ymm21 + vpmadd52luq %ymm7,%ymm4,%ymm22 + vpmadd52huq %ymm7,%ymm4,%ymm23 + + + + vpsrlq $44,%ymm18,%ymm30 + vpsllq $8,%ymm19,%ymm19 + vpandq %ymm28,%ymm18,%ymm6 + vpaddq %ymm30,%ymm19,%ymm19 + + vpaddq %ymm19,%ymm20,%ymm20 + + vpsrlq $44,%ymm20,%ymm30 + vpsllq $8,%ymm21,%ymm21 + vpandq %ymm28,%ymm20,%ymm7 + vpaddq %ymm30,%ymm21,%ymm21 + + vpaddq %ymm21,%ymm22,%ymm22 + + vpsrlq $42,%ymm22,%ymm30 + vpsllq $10,%ymm23,%ymm23 + vpandq %ymm29,%ymm22,%ymm8 + vpaddq %ymm30,%ymm23,%ymm23 + + vpaddq %ymm23,%ymm6,%ymm6 + vpsllq $2,%ymm23,%ymm23 + + vpaddq %ymm23,%ymm6,%ymm6 + + vpsrlq $44,%ymm6,%ymm30 + vpandq %ymm28,%ymm6,%ymm6 + + vpaddq 
%ymm30,%ymm7,%ymm7 + + + + + + vpunpcklqdq %ymm5,%ymm8,%ymm26 + vpunpckhqdq %ymm5,%ymm8,%ymm5 + vpunpcklqdq %ymm3,%ymm6,%ymm24 + vpunpckhqdq %ymm3,%ymm6,%ymm3 + vpunpcklqdq %ymm4,%ymm7,%ymm25 + vpunpckhqdq %ymm4,%ymm7,%ymm4 + vshufi64x2 $0x44,%zmm5,%zmm26,%zmm8 + vshufi64x2 $0x44,%zmm3,%zmm24,%zmm6 + vshufi64x2 $0x44,%zmm4,%zmm25,%zmm7 + + vmovdqu64 0(%rsi),%zmm26 + vmovdqu64 64(%rsi),%zmm27 + leaq 128(%rsi),%rsi + + vpsllq $2,%zmm8,%zmm10 + vpsllq $2,%zmm7,%zmm9 + vpaddq %zmm8,%zmm10,%zmm10 + vpaddq %zmm7,%zmm9,%zmm9 + vpsllq $2,%zmm10,%zmm10 + vpsllq $2,%zmm9,%zmm9 + + vpbroadcastq %rcx,%zmm31 + vpbroadcastq %xmm28,%zmm28 + vpbroadcastq %xmm29,%zmm29 + + vpbroadcastq %xmm9,%zmm16 + vpbroadcastq %xmm10,%zmm17 + vpbroadcastq %xmm6,%zmm3 + vpbroadcastq %xmm7,%zmm4 + vpbroadcastq %xmm8,%zmm5 + + vpunpcklqdq %zmm27,%zmm26,%zmm25 + vpunpckhqdq %zmm27,%zmm26,%zmm27 + + + + vpsrlq $24,%zmm27,%zmm26 + vporq %zmm31,%zmm26,%zmm26 + vpaddq %zmm26,%zmm2,%zmm2 + vpandq %zmm28,%zmm25,%zmm24 + vpsrlq $44,%zmm25,%zmm25 + vpsllq $20,%zmm27,%zmm27 + vporq %zmm27,%zmm25,%zmm25 + vpandq %zmm28,%zmm25,%zmm25 + + subq $8,%rdx + jz .Ltail_vpmadd52_8x + jmp .Loop_vpmadd52_8x + +.align 32 +.Loop_vpmadd52_8x: + + vpaddq %zmm24,%zmm0,%zmm0 + vpaddq %zmm25,%zmm1,%zmm1 + + vpxorq %zmm18,%zmm18,%zmm18 + vpmadd52luq %zmm2,%zmm16,%zmm18 + vpxorq %zmm19,%zmm19,%zmm19 + vpmadd52huq %zmm2,%zmm16,%zmm19 + vpxorq %zmm20,%zmm20,%zmm20 + vpmadd52luq %zmm2,%zmm17,%zmm20 + vpxorq %zmm21,%zmm21,%zmm21 + vpmadd52huq %zmm2,%zmm17,%zmm21 + vpxorq %zmm22,%zmm22,%zmm22 + vpmadd52luq %zmm2,%zmm3,%zmm22 + vpxorq %zmm23,%zmm23,%zmm23 + vpmadd52huq %zmm2,%zmm3,%zmm23 + + vmovdqu64 0(%rsi),%zmm26 + vmovdqu64 64(%rsi),%zmm27 + leaq 128(%rsi),%rsi + vpmadd52luq %zmm0,%zmm3,%zmm18 + vpmadd52huq %zmm0,%zmm3,%zmm19 + vpmadd52luq %zmm0,%zmm4,%zmm20 + vpmadd52huq %zmm0,%zmm4,%zmm21 + vpmadd52luq %zmm0,%zmm5,%zmm22 + vpmadd52huq %zmm0,%zmm5,%zmm23 + + vpunpcklqdq %zmm27,%zmm26,%zmm25 + vpunpckhqdq %zmm27,%zmm26,%zmm27 + vpmadd52luq %zmm1,%zmm17,%zmm18 + vpmadd52huq %zmm1,%zmm17,%zmm19 + vpmadd52luq %zmm1,%zmm3,%zmm20 + vpmadd52huq %zmm1,%zmm3,%zmm21 + vpmadd52luq %zmm1,%zmm4,%zmm22 + vpmadd52huq %zmm1,%zmm4,%zmm23 + + + + vpsrlq $44,%zmm18,%zmm30 + vpsllq $8,%zmm19,%zmm19 + vpandq %zmm28,%zmm18,%zmm0 + vpaddq %zmm30,%zmm19,%zmm19 + + vpsrlq $24,%zmm27,%zmm26 + vporq %zmm31,%zmm26,%zmm26 + vpaddq %zmm19,%zmm20,%zmm20 + + vpsrlq $44,%zmm20,%zmm30 + vpsllq $8,%zmm21,%zmm21 + vpandq %zmm28,%zmm20,%zmm1 + vpaddq %zmm30,%zmm21,%zmm21 + + vpandq %zmm28,%zmm25,%zmm24 + vpsrlq $44,%zmm25,%zmm25 + vpsllq $20,%zmm27,%zmm27 + vpaddq %zmm21,%zmm22,%zmm22 + + vpsrlq $42,%zmm22,%zmm30 + vpsllq $10,%zmm23,%zmm23 + vpandq %zmm29,%zmm22,%zmm2 + vpaddq %zmm30,%zmm23,%zmm23 + + vpaddq %zmm26,%zmm2,%zmm2 + vpaddq %zmm23,%zmm0,%zmm0 + vpsllq $2,%zmm23,%zmm23 + + vpaddq %zmm23,%zmm0,%zmm0 + vporq %zmm27,%zmm25,%zmm25 + vpandq %zmm28,%zmm25,%zmm25 + + vpsrlq $44,%zmm0,%zmm30 + vpandq %zmm28,%zmm0,%zmm0 + + vpaddq %zmm30,%zmm1,%zmm1 + + subq $8,%rdx + jnz .Loop_vpmadd52_8x + +.Ltail_vpmadd52_8x: + + vpaddq %zmm24,%zmm0,%zmm0 + vpaddq %zmm25,%zmm1,%zmm1 + + vpxorq %zmm18,%zmm18,%zmm18 + vpmadd52luq %zmm2,%zmm9,%zmm18 + vpxorq %zmm19,%zmm19,%zmm19 + vpmadd52huq %zmm2,%zmm9,%zmm19 + vpxorq %zmm20,%zmm20,%zmm20 + vpmadd52luq %zmm2,%zmm10,%zmm20 + vpxorq %zmm21,%zmm21,%zmm21 + vpmadd52huq %zmm2,%zmm10,%zmm21 + vpxorq %zmm22,%zmm22,%zmm22 + vpmadd52luq %zmm2,%zmm6,%zmm22 + vpxorq %zmm23,%zmm23,%zmm23 + vpmadd52huq %zmm2,%zmm6,%zmm23 + + vpmadd52luq %zmm0,%zmm6,%zmm18 + 
vpmadd52huq %zmm0,%zmm6,%zmm19 + vpmadd52luq %zmm0,%zmm7,%zmm20 + vpmadd52huq %zmm0,%zmm7,%zmm21 + vpmadd52luq %zmm0,%zmm8,%zmm22 + vpmadd52huq %zmm0,%zmm8,%zmm23 + + vpmadd52luq %zmm1,%zmm10,%zmm18 + vpmadd52huq %zmm1,%zmm10,%zmm19 + vpmadd52luq %zmm1,%zmm6,%zmm20 + vpmadd52huq %zmm1,%zmm6,%zmm21 + vpmadd52luq %zmm1,%zmm7,%zmm22 + vpmadd52huq %zmm1,%zmm7,%zmm23 + + + + + movl $1,%eax + kmovw %eax,%k1 + vpsrldq $8,%zmm18,%zmm24 + vpsrldq $8,%zmm19,%zmm0 + vpsrldq $8,%zmm20,%zmm25 + vpsrldq $8,%zmm21,%zmm1 + vpaddq %zmm24,%zmm18,%zmm18 + vpaddq %zmm0,%zmm19,%zmm19 + vpsrldq $8,%zmm22,%zmm26 + vpsrldq $8,%zmm23,%zmm2 + vpaddq %zmm25,%zmm20,%zmm20 + vpaddq %zmm1,%zmm21,%zmm21 + vpermq $0x2,%zmm18,%zmm24 + vpermq $0x2,%zmm19,%zmm0 + vpaddq %zmm26,%zmm22,%zmm22 + vpaddq %zmm2,%zmm23,%zmm23 + + vpermq $0x2,%zmm20,%zmm25 + vpermq $0x2,%zmm21,%zmm1 + vpaddq %zmm24,%zmm18,%zmm18 + vpaddq %zmm0,%zmm19,%zmm19 + vpermq $0x2,%zmm22,%zmm26 + vpermq $0x2,%zmm23,%zmm2 + vpaddq %zmm25,%zmm20,%zmm20 + vpaddq %zmm1,%zmm21,%zmm21 + vextracti64x4 $1,%zmm18,%ymm24 + vextracti64x4 $1,%zmm19,%ymm0 + vpaddq %zmm26,%zmm22,%zmm22 + vpaddq %zmm2,%zmm23,%zmm23 + + vextracti64x4 $1,%zmm20,%ymm25 + vextracti64x4 $1,%zmm21,%ymm1 + vextracti64x4 $1,%zmm22,%ymm26 + vextracti64x4 $1,%zmm23,%ymm2 + vpaddq %ymm24,%ymm18,%ymm18{%k1}{z} + vpaddq %ymm0,%ymm19,%ymm19{%k1}{z} + vpaddq %ymm25,%ymm20,%ymm20{%k1}{z} + vpaddq %ymm1,%ymm21,%ymm21{%k1}{z} + vpaddq %ymm26,%ymm22,%ymm22{%k1}{z} + vpaddq %ymm2,%ymm23,%ymm23{%k1}{z} + + + + vpsrlq $44,%ymm18,%ymm30 + vpsllq $8,%ymm19,%ymm19 + vpandq %ymm28,%ymm18,%ymm0 + vpaddq %ymm30,%ymm19,%ymm19 + + vpaddq %ymm19,%ymm20,%ymm20 + + vpsrlq $44,%ymm20,%ymm30 + vpsllq $8,%ymm21,%ymm21 + vpandq %ymm28,%ymm20,%ymm1 + vpaddq %ymm30,%ymm21,%ymm21 + + vpaddq %ymm21,%ymm22,%ymm22 + + vpsrlq $42,%ymm22,%ymm30 + vpsllq $10,%ymm23,%ymm23 + vpandq %ymm29,%ymm22,%ymm2 + vpaddq %ymm30,%ymm23,%ymm23 + + vpaddq %ymm23,%ymm0,%ymm0 + vpsllq $2,%ymm23,%ymm23 + + vpaddq %ymm23,%ymm0,%ymm0 + + vpsrlq $44,%ymm0,%ymm30 + vpandq %ymm28,%ymm0,%ymm0 + + vpaddq %ymm30,%ymm1,%ymm1 + + + + vmovq %xmm0,0(%rdi) + vmovq %xmm1,8(%rdi) + vmovq %xmm2,16(%rdi) + vzeroall + +.Lno_data_vpmadd52_8x: + .byte 0xf3,0xc3 +.size poly1305_blocks_vpmadd52_8x,.-poly1305_blocks_vpmadd52_8x +.type poly1305_emit_base2_44,@function +.align 32 +poly1305_emit_base2_44: + movq 0(%rdi),%r8 + movq 8(%rdi),%r9 + movq 16(%rdi),%r10 + + movq %r9,%rax + shrq $20,%r9 + shlq $44,%rax + movq %r10,%rcx + shrq $40,%r10 + shlq $24,%rcx + + addq %rax,%r8 + adcq %rcx,%r9 + adcq $0,%r10 + + movq %r8,%rax + addq $5,%r8 + movq %r9,%rcx + adcq $0,%r9 + adcq $0,%r10 + shrq $2,%r10 + cmovnzq %r8,%rax + cmovnzq %r9,%rcx + + addq 0(%rdx),%rax + adcq 8(%rdx),%rcx + movq %rax,0(%rsi) + movq %rcx,8(%rsi) + + .byte 0xf3,0xc3 +.size poly1305_emit_base2_44,.-poly1305_emit_base2_44 +.align 64 +.Lconst: +.Lmask24: +.long 0x0ffffff,0,0x0ffffff,0,0x0ffffff,0,0x0ffffff,0 +.L129: +.long 16777216,0,16777216,0,16777216,0,16777216,0 +.Lmask26: +.long 0x3ffffff,0,0x3ffffff,0,0x3ffffff,0,0x3ffffff,0 +.Lpermd_avx2: +.long 2,2,2,3,2,0,2,1 +.Lpermd_avx512: +.long 0,0,0,1, 0,2,0,3, 0,4,0,5, 0,6,0,7 + +.L2_44_inp_permd: +.long 0,1,1,2,2,3,7,7 +.L2_44_inp_shift: +.quad 0,12,24,64 +.L2_44_mask: +.quad 0xfffffffffff,0xfffffffffff,0x3ffffffffff,0xffffffffffffffff +.L2_44_shift_rgt: +.quad 44,44,42,64 +.L2_44_shift_lft: +.quad 8,8,10,64 + +.align 64 +.Lx_mask44: +.quad 0xfffffffffff,0xfffffffffff,0xfffffffffff,0xfffffffffff +.quad 0xfffffffffff,0xfffffffffff,0xfffffffffff,0xfffffffffff 
+.Lx_mask42: +.quad 0x3ffffffffff,0x3ffffffffff,0x3ffffffffff,0x3ffffffffff +.quad 0x3ffffffffff,0x3ffffffffff,0x3ffffffffff,0x3ffffffffff +.byte 80,111,108,121,49,51,48,53,32,102,111,114,32,120,56,54,95,54,52,44,32,67,82,89,80,84,79,71,65,77,83,32,98,121,32,60,97,112,112,114,111,64,111,112,101,110,115,115,108,46,111,114,103,62,0 +.align 16 +.globl xor128_encrypt_n_pad +.type xor128_encrypt_n_pad,@function +.align 16 +xor128_encrypt_n_pad: + subq %rdx,%rsi + subq %rdx,%rdi + movq %rcx,%r10 + shrq $4,%rcx + jz .Ltail_enc + nop +.Loop_enc_xmm: + movdqu (%rsi,%rdx,1),%xmm0 + pxor (%rdx),%xmm0 + movdqu %xmm0,(%rdi,%rdx,1) + movdqa %xmm0,(%rdx) + leaq 16(%rdx),%rdx + decq %rcx + jnz .Loop_enc_xmm + + andq $15,%r10 + jz .Ldone_enc + +.Ltail_enc: + movq $16,%rcx + subq %r10,%rcx + xorl %eax,%eax +.Loop_enc_byte: + movb (%rsi,%rdx,1),%al + xorb (%rdx),%al + movb %al,(%rdi,%rdx,1) + movb %al,(%rdx) + leaq 1(%rdx),%rdx + decq %r10 + jnz .Loop_enc_byte + + xorl %eax,%eax +.Loop_enc_pad: + movb %al,(%rdx) + leaq 1(%rdx),%rdx + decq %rcx + jnz .Loop_enc_pad + +.Ldone_enc: + movq %rdx,%rax + .byte 0xf3,0xc3 +.size xor128_encrypt_n_pad,.-xor128_encrypt_n_pad + +.globl xor128_decrypt_n_pad +.type xor128_decrypt_n_pad,@function +.align 16 +xor128_decrypt_n_pad: + subq %rdx,%rsi + subq %rdx,%rdi + movq %rcx,%r10 + shrq $4,%rcx + jz .Ltail_dec + nop +.Loop_dec_xmm: + movdqu (%rsi,%rdx,1),%xmm0 + movdqa (%rdx),%xmm1 + pxor %xmm0,%xmm1 + movdqu %xmm1,(%rdi,%rdx,1) + movdqa %xmm0,(%rdx) + leaq 16(%rdx),%rdx + decq %rcx + jnz .Loop_dec_xmm + + pxor %xmm1,%xmm1 + andq $15,%r10 + jz .Ldone_dec + +.Ltail_dec: + movq $16,%rcx + subq %r10,%rcx + xorl %eax,%eax + xorq %r11,%r11 +.Loop_dec_byte: + movb (%rsi,%rdx,1),%r11b + movb (%rdx),%al + xorb %r11b,%al + movb %al,(%rdi,%rdx,1) + movb %r11b,(%rdx) + leaq 1(%rdx),%rdx + decq %r10 + jnz .Loop_dec_byte + + xorl %eax,%eax +.Loop_dec_pad: + movb %al,(%rdx) + leaq 1(%rdx),%rdx + decq %rcx + jnz .Loop_dec_pad + +.Ldone_dec: + movq %rdx,%rax + .byte 0xf3,0xc3 +.size xor128_decrypt_n_pad,.-xor128_decrypt_n_pad From patchwork Sat Oct 6 02:56:54 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Jason A. 
Donenfeld" X-Patchwork-Id: 148310 Delivered-To: patch@linaro.org Received: by 2002:a2e:8595:0:0:0:0:0 with SMTP id b21-v6csp1157625lji; Fri, 5 Oct 2018 19:58:17 -0700 (PDT) X-Google-Smtp-Source: ACcGV60D/gs3KsNeZspCf1Md/RSHURgg/b8xRGAmgOI35kSSbvaJDuBSKpBBvYyKRsivy5bkM9HC X-Received: by 2002:a63:3152:: with SMTP id x79-v6mr12528294pgx.323.1538794697485; Fri, 05 Oct 2018 19:58:17 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1538794697; cv=none; d=google.com; s=arc-20160816; b=DKFn+0nfUL62BCUIQoNiNcTrjBozEoEwDiFXwDiXYUs5aCMLcfBIVrPJ8EdjFlc+DZ U1OezX7xVOG1wO8Vdp0h2nsZZg+P21+aQCPh2PBEd/7CVPhccYXNaUmkgHTI+Nbzf6SS +02sipIoGdtClqp9mhXdk/mAIG6j6/hg/YD4UeNHXIXz18K/CW+w4E8BBYrbihewbvwT IQAp8HER9eQ/dgD4BGMbO3jgUkG2NrN2slMzlR+F9SSiYSFQwBse017uQBpv355IeJo2 zDWG8VgTCDjzHqCSWq8ur2J4yQA6UmbevMB84/mJSWiKVN5fxNjEsRDfGAKoytotvVBj vkow== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from :dkim-signature; bh=lrp/xaw7BJpyDNMVQIAu60qgCZBqIoJ8TT/m2SnKrlc=; b=P9ueh4Ir4Oy1mz2f5jTf2zIc4cbKiWwuEFxG9EsOeA+OEC9ZidyC7e2b3wgdkgU1UI Rj7z+0YYggdodg7bGW+BUTYCRVQYTLugClzw7U3gW9jfd7fs81EKkN5kcvKg7Q41wJ+8 d12BwUU9nfyuriAxTf+vTt8MkS12/xMD9gAWe66+pIXqYrIqF1HckBxVHI2tj+vjDA9Q V9xsPHiigz89gxBKvrYrcsIxHdRD/idl3/64MNTCMbVU1/aKCZkV+lHL4vAIRnWF8LZ4 K3UIyemKYAF23yD+ei/2i/ustV57vVFcbKz5/2uuOean8cjNIroBtDssQIY/h1RQDEa7 ljVw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@zx2c4.com header.s=mail header.b=ptRZ0ght; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=zx2c4.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. 
From: "Jason A. Donenfeld"
To: linux-kernel@vger.kernel.org, netdev@vger.kernel.org, davem@davemloft.net, gregkh@linuxfoundation.org
Cc: "Jason A. Donenfeld", Samuel Neves, Thomas Gleixner, Ingo Molnar, x86@kernel.org, Jean-Philippe Aumasson, Andy Lutomirski, Andrew Morton, Linus Torvalds, kernel-hardening@lists.openwall.com, linux-crypto@vger.kernel.org
Subject: [PATCH net-next v7 13/28] zinc: Poly1305 x86_64 implementation
Date: Sat, 6 Oct 2018 04:56:54 +0200
Message-Id: <20181006025709.4019-14-Jason@zx2c4.com>
In-Reply-To: <20181006025709.4019-1-Jason@zx2c4.com>
References: <20181006025709.4019-1-Jason@zx2c4.com>

This ports AVX, AVX-2, and AVX-512F implementations for Poly1305. The
AVX-512F implementation is disabled on Skylake, due to throttling.

These come from Andy Polyakov's implementation, with the following
modifications from Samuel Neves:

 - Some cosmetic changes, like renaming labels to .Lname, constants,
   and other Linux conventions.

 - CPU feature checking is done in C by the glue code, so that has
   been removed from the assembly.

 - poly1305_blocks_avx512 jumped into the middle of poly1305_blocks_avx2
   for the final blocks. To appease objtool, the relevant avx2 tail
   code was duplicated for the avx512 function.

 - The original uses %rbp as a scratch register. However, the kernel
   expects %rbp to be a valid frame pointer at any given time in order
   to do proper unwinding. Thus we need to alter the code in order to
   preserve it. The most straightforward manner in which this was
   accomplished was by replacing $d3, formerly %r10, by %rdi, and
   replacing %rbp by %r10.
   Because %rdi, a pointer to the context structure, does not change
   and is not used by poly1305_iteration, it is safe to use it here,
   and the overhead of saving and restoring it should be minimal.

 - The original hardcodes returns as .byte 0xf3,0xc3, aka "rep ret".
   We replace this by "ret". "rep ret" was meant to help with AMD K8
   chips, cf. http://repzret.org/p/repzret. It makes no sense to
   continue to use this kludge for code that won't even run on ancient
   AMD chips.

The AVX code uses base 2^26, while the scalar code uses base 2^64. If
we hit the unfortunate situation of using AVX and then having to go
back to scalar -- because the user is silly and has called the update
function from two separate contexts -- then we need to convert back
to the original base before proceeding. It is possible to reason that
the initial reduction below is sufficient given the implementation
invariants. However, for an avoidance of doubt and because this is not
performance critical, we do the full reduction anyway. This conversion
is found in the glue code, and a proof of correctness may be easily
obtained from Z3.

Cycle counts on a Core i7 6700HQ using the AVX-2 codepath, comparing
this implementation ("new") to the implementation in the current
crypto api ("old"):

size  old   new
----  ----  ----
0     70    68
16    92    90
32    134   104
48    172   120
64    218   136
80    254   158
96    298   174
112   342   192
128   388   212
144   428   228
160   466   246
176   510   264
192   550   282
208   594   302
224   628   316
240   676   334
256   716   354
272   764   374
288   802   352
304   420   366
320   428   360
336   484   378
352   426   384
368   478   400
384   488   394
400   542   408
416   486   416
432   534   430
448   544   422
464   600   438
480   540   448
496   594   464
512   602   456
528   656   476
544   600   480
560   650   494
576   664   490
592   714   508
608   656   514
624   708   532
640   716   524
656   770   536
672   716   548
688   770   562
704   774   552
720   826   568
736   768   574
752   822   592
768   830   584
784   884   602
800   828   610
816   884   628
832   888   618
848   942   632
864   884   644
880   936   660
896   948   652
912   1000  664
928   942   676
944   994   690
960   1002  680
976   1054  694
992   1002  706
1008  1052  720

Cycle counts on a Xeon Gold 5120 using the AVX-512 codepath:

size  old   new
----  ----  ----
0     74    70
16    96    92
32    136   106
48    184   124
64    218   138
80    260   160
96    300   176
112   342   194
128   384   212
144   420   226
160   464   248
176   504   264
192   544   282
208   582   300
224   624   318
240   662   338
256   708   358
272   748   372
288   788   358
304   422   370
320   432   364
336   486   380
352   434   390
368   480   408
384   490   398
400   542   412
416   492   426
432   538   436
448   546   432
464   600   448
480   548   456
496   594   476
512   606   470
528   656   480
544   606   498
560   652   512
576   662   508
592   716   522
608   664   538
624   710   552
640   720   516
656   772   526
672   722   544
688   768   556
704   778   556
720   832   568
736   780   584
752   826   600
768   836   560
784   888   572
800   838   588
816   884   604
832   894   598
848   946   612
864   896   628
880   942   644
896   952   608
912   1004  616
928   954   634
944   1000  646
960   1008  646
976   1062  658
992   1012  674
1008  1058  690
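For illustration only, and not part of the patch itself: the repacking
identity behind that conversion, five 26-bit limbs recombining into two
64-bit words plus two leftover bits, can be spot-checked in userspace C.
This is a minimal sketch assuming a GCC- or Clang-style unsigned
__int128; the limb values are arbitrary, chosen to exercise the carry
pass:

#include <assert.h>
#include <stdint.h>

int main(void)
{
	/* Arbitrary base 2^26 limbs; h[0] deliberately overflows 26 bits. */
	uint32_t h[5] = { 0x4000001, 0x3ffffff, 0x2abcdef, 0x1234567, 0x3fffffe };
	uint64_t hs[3];
	uint32_t cy;

	/* The initial carry pass, as in the glue code's convert_to_base2_64(). */
	cy = h[0] >> 26; h[0] &= 0x3ffffff; h[1] += cy;
	cy = h[1] >> 26; h[1] &= 0x3ffffff; h[2] += cy;
	cy = h[2] >> 26; h[2] &= 0x3ffffff; h[3] += cy;
	cy = h[3] >> 26; h[3] &= 0x3ffffff; h[4] += cy;

	/* Repack into base 2^64; hs[2] holds bits 128 and up. */
	hs[0] = ((uint64_t)h[2] << 52) | ((uint64_t)h[1] << 26) | h[0];
	hs[1] = ((uint64_t)h[4] << 40) | ((uint64_t)h[3] << 14) | (h[2] >> 12);
	hs[2] = h[4] >> 24;

	/* Both encodings must agree on the low 128 bits of the value. */
	assert((((unsigned __int128)h[4] << 104) |
		((unsigned __int128)h[3] << 78) |
		((unsigned __int128)h[2] << 52) |
		((unsigned __int128)h[1] << 26) | h[0]) ==
	       (((unsigned __int128)hs[1] << 64) | hs[0]));
	return 0;
}

Quantifying the same identity over all limb values, rather than one
sample, is what the Z3 proof mentioned above establishes.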
Signed-off-by: Jason A. Donenfeld
Signed-off-by: Samuel Neves
Co-developed-by: Samuel Neves
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: x86@kernel.org
Cc: Jean-Philippe Aumasson
Cc: Andy Lutomirski
Cc: Greg KH
Cc: Andrew Morton
Cc: Linus Torvalds
Cc: kernel-hardening@lists.openwall.com
Cc: linux-crypto@vger.kernel.org
---
 lib/zinc/Makefile                             |    1 +
 lib/zinc/poly1305/poly1305-x86_64-glue.c      |  154 ++
 ...-x86_64-cryptogams.S => poly1305-x86_64.S} | 2459 ++++++-----------
 lib/zinc/poly1305/poly1305.c                  |    4 +
 4 files changed, 1002 insertions(+), 1616 deletions(-)
 create mode 100644 lib/zinc/poly1305/poly1305-x86_64-glue.c
 rename lib/zinc/poly1305/{poly1305-x86_64-cryptogams.S => poly1305-x86_64.S} (58%)

-- 
2.19.0

diff --git a/lib/zinc/Makefile b/lib/zinc/Makefile
index 6fc9626c55fa..a8943d960b6a 100644
--- a/lib/zinc/Makefile
+++ b/lib/zinc/Makefile
@@ -11,4 +11,5 @@ AFLAGS_chacha20-mips.o += -O2 # This is required to fill the branch delay slots
 obj-$(CONFIG_ZINC_CHACHA20) += zinc_chacha20.o
 
 zinc_poly1305-y := poly1305/poly1305.o
+zinc_poly1305-$(CONFIG_ZINC_ARCH_X86_64) += poly1305/poly1305-x86_64.o
 obj-$(CONFIG_ZINC_POLY1305) += zinc_poly1305.o
diff --git a/lib/zinc/poly1305/poly1305-x86_64-glue.c b/lib/zinc/poly1305/poly1305-x86_64-glue.c
new file mode 100644
index 000000000000..ccf5f1952503
--- /dev/null
+++ b/lib/zinc/poly1305/poly1305-x86_64-glue.c
@@ -0,0 +1,154 @@
+// SPDX-License-Identifier: GPL-2.0 OR MIT
+/*
+ * Copyright (C) 2015-2018 Jason A. Donenfeld . All Rights Reserved.
+ */
+
+#include
+#include
+#include
+
+asmlinkage void poly1305_init_x86_64(void *ctx,
+				     const u8 key[POLY1305_KEY_SIZE]);
+asmlinkage void poly1305_blocks_x86_64(void *ctx, const u8 *inp,
+				       const size_t len, const u32 padbit);
+asmlinkage void poly1305_emit_x86_64(void *ctx, u8 mac[POLY1305_MAC_SIZE],
+				     const u32 nonce[4]);
+asmlinkage void poly1305_emit_avx(void *ctx, u8 mac[POLY1305_MAC_SIZE],
+				  const u32 nonce[4]);
+asmlinkage void poly1305_blocks_avx(void *ctx, const u8 *inp, const size_t len,
+				    const u32 padbit);
+asmlinkage void poly1305_blocks_avx2(void *ctx, const u8 *inp, const size_t len,
+				     const u32 padbit);
+asmlinkage void poly1305_blocks_avx512(void *ctx, const u8 *inp,
+				       const size_t len, const u32 padbit);
+
+static bool poly1305_use_avx __ro_after_init;
+static bool poly1305_use_avx2 __ro_after_init;
+static bool poly1305_use_avx512 __ro_after_init;
+static bool *const poly1305_nobs[] __initconst = {
+	&poly1305_use_avx, &poly1305_use_avx2, &poly1305_use_avx512 };
+
+static void __init poly1305_fpu_init(void)
+{
+	poly1305_use_avx =
+		boot_cpu_has(X86_FEATURE_AVX) &&
+		cpu_has_xfeatures(XFEATURE_MASK_SSE | XFEATURE_MASK_YMM, NULL);
+	poly1305_use_avx2 =
+		boot_cpu_has(X86_FEATURE_AVX) &&
+		boot_cpu_has(X86_FEATURE_AVX2) &&
+		cpu_has_xfeatures(XFEATURE_MASK_SSE | XFEATURE_MASK_YMM, NULL);
+	poly1305_use_avx512 =
+		boot_cpu_has(X86_FEATURE_AVX) &&
+		boot_cpu_has(X86_FEATURE_AVX2) &&
+		boot_cpu_has(X86_FEATURE_AVX512F) &&
+		cpu_has_xfeatures(XFEATURE_MASK_SSE | XFEATURE_MASK_YMM |
+				  XFEATURE_MASK_AVX512, NULL) &&
+		/* Skylake downclocks unacceptably much when using zmm. */
+		boot_cpu_data.x86_model != INTEL_FAM6_SKYLAKE_X;
+}
+
+static inline bool poly1305_init_arch(void *ctx,
+				      const u8 key[POLY1305_KEY_SIZE])
+{
+	poly1305_init_x86_64(ctx, key);
+	return true;
+}
+
+struct poly1305_arch_internal {
+	union {
+		struct {
+			u32 h[5];
+			u32 is_base2_26;
+		};
+		u64 hs[3];
+	};
+	u64 r[2];
+	u64 pad;
+	struct { u32 r2, r1, r4, r3; } rn[9];
+};
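An editorial aside, not part of the patch: the convert_to_base2_64()
routine that follows propagates its final carries with a branchless
unsigned less-than, the ULT macro. Below is a minimal standalone sketch
of the same expression, specialized to 64-bit operands, with a
brute-force spot check; the ult64 name and the test values are
illustrative assumptions:

#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* MSB of a ^ ((a ^ b) | ((a - b) ^ b)) is the borrow bit of a - b. */
static uint64_t ult64(uint64_t a, uint64_t b)
{
	return (a ^ ((a ^ b) | ((a - b) ^ b))) >> 63;
}

int main(void)
{
	const uint64_t v[] = { 0, 1, 2, 3, 5, 1ULL << 63, ~0ULL - 1, ~0ULL };

	/* Compare against the compiler's < over boundary-heavy values. */
	for (size_t i = 0; i < sizeof(v) / sizeof(v[0]); i++)
		for (size_t j = 0; j < sizeof(v) / sizeof(v[0]); j++)
			assert(ult64(v[i], v[j]) == (v[i] < v[j]));
	return 0;
}

The macro's shift by sizeof(a) * 8 - 1 extracts that borrow bit, so
ULT(a, b) evaluates to 1 exactly when a < b as unsigned values, letting
the carry chain below run without data-dependent branches.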
+
+/* The AVX code uses base 2^26, while the scalar code uses base 2^64. If we hit
+ * the unfortunate situation of using AVX and then having to go back to scalar
+ * -- because the user is silly and has called the update function from two
+ * separate contexts -- then we need to convert back to the original base before
+ * proceeding. It is possible to reason that the initial reduction below is
+ * sufficient given the implementation invariants. However, for an avoidance of
+ * doubt and because this is not performance critical, we do the full reduction
+ * anyway.
+ */
+static void convert_to_base2_64(void *ctx)
+{
+	struct poly1305_arch_internal *state = ctx;
+	u32 cy;
+
+	if (!state->is_base2_26)
+		return;
+
+	cy = state->h[0] >> 26; state->h[0] &= 0x3ffffff; state->h[1] += cy;
+	cy = state->h[1] >> 26; state->h[1] &= 0x3ffffff; state->h[2] += cy;
+	cy = state->h[2] >> 26; state->h[2] &= 0x3ffffff; state->h[3] += cy;
+	cy = state->h[3] >> 26; state->h[3] &= 0x3ffffff; state->h[4] += cy;
+	state->hs[0] = ((u64)state->h[2] << 52) | ((u64)state->h[1] << 26) | state->h[0];
+	state->hs[1] = ((u64)state->h[4] << 40) | ((u64)state->h[3] << 14) | (state->h[2] >> 12);
+	state->hs[2] = state->h[4] >> 24;
+#define ULT(a, b) ((a ^ ((a ^ b) | ((a - b) ^ b))) >> (sizeof(a) * 8 - 1))
+	cy = (state->hs[2] >> 2) + (state->hs[2] & ~3ULL);
+	state->hs[2] &= 3;
+	state->hs[0] += cy;
+	state->hs[1] += (cy = ULT(state->hs[0], cy));
+	state->hs[2] += ULT(state->hs[1], cy);
+#undef ULT
+	state->is_base2_26 = 0;
+}
+
+static inline bool poly1305_blocks_arch(void *ctx, const u8 *inp,
+					size_t len, const u32 padbit,
+					simd_context_t *simd_context)
+{
+	struct poly1305_arch_internal *state = ctx;
+
+	/* SIMD disables preemption, so relax after processing each page. */
+	BUILD_BUG_ON(PAGE_SIZE < POLY1305_BLOCK_SIZE ||
+		     PAGE_SIZE % POLY1305_BLOCK_SIZE);
+
+	if (!IS_ENABLED(CONFIG_AS_AVX) || !poly1305_use_avx ||
+	    (len < (POLY1305_BLOCK_SIZE * 18) && !state->is_base2_26) ||
+	    !simd_use(simd_context)) {
+		convert_to_base2_64(ctx);
+		poly1305_blocks_x86_64(ctx, inp, len, padbit);
+		return true;
+	}
+
+	for (;;) {
+		const size_t bytes = min_t(size_t, len, PAGE_SIZE);
+
+		if (IS_ENABLED(CONFIG_AS_AVX512) && poly1305_use_avx512)
+			poly1305_blocks_avx512(ctx, inp, bytes, padbit);
+		else if (IS_ENABLED(CONFIG_AS_AVX2) && poly1305_use_avx2)
+			poly1305_blocks_avx2(ctx, inp, bytes, padbit);
+		else
+			poly1305_blocks_avx(ctx, inp, bytes, padbit);
+		len -= bytes;
+		if (!len)
+			break;
+		inp += bytes;
+		simd_relax(simd_context);
+	}
+
+	return true;
+}
+
+static inline bool poly1305_emit_arch(void *ctx, u8 mac[POLY1305_MAC_SIZE],
+				      const u32 nonce[4],
+				      simd_context_t *simd_context)
+{
+	struct poly1305_arch_internal *state = ctx;
+
+	if (!IS_ENABLED(CONFIG_AS_AVX) || !poly1305_use_avx ||
+	    !state->is_base2_26 || !simd_use(simd_context)) {
+		convert_to_base2_64(ctx);
+		poly1305_emit_x86_64(ctx, mac, nonce);
+	} else
+		poly1305_emit_avx(ctx, mac, nonce);
+	return true;
+}
diff --git a/lib/zinc/poly1305/poly1305-x86_64-cryptogams.S b/lib/zinc/poly1305/poly1305-x86_64.S
similarity index 58%
rename from lib/zinc/poly1305/poly1305-x86_64-cryptogams.S
rename to lib/zinc/poly1305/poly1305-x86_64.S
index ed634757354b..3c3f2b4d880b 100644
--- a/lib/zinc/poly1305/poly1305-x86_64-cryptogams.S
+++ b/lib/zinc/poly1305/poly1305-x86_64.S
@@ -1,22 +1,27 @@
 /* SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause */
 /*
+ * Copyright (C) 2017 Samuel Neves . All Rights Reserved.
+ * Copyright (C) 2015-2018 Jason A. Donenfeld . All Rights Reserved.
  * Copyright (C) 2006-2017 CRYPTOGAMS by . All Rights Reserved.
+ * + * This is based in part on Andy Polyakov's implementation from CRYPTOGAMS. */ -.text - +#include +.section .rodata.cst192.Lconst, "aM", @progbits, 192 +.align 64 +.Lconst: +.long 0x0ffffff,0,0x0ffffff,0,0x0ffffff,0,0x0ffffff,0 +.long 16777216,0,16777216,0,16777216,0,16777216,0 +.long 0x3ffffff,0,0x3ffffff,0,0x3ffffff,0,0x3ffffff,0 +.long 2,2,2,3,2,0,2,1 +.long 0,0,0,1, 0,2,0,3, 0,4,0,5, 0,6,0,7 -.globl poly1305_init -.hidden poly1305_init -.globl poly1305_blocks -.hidden poly1305_blocks -.globl poly1305_emit -.hidden poly1305_emit +.text -.type poly1305_init,@function .align 32 -poly1305_init: +ENTRY(poly1305_init_x86_64) xorq %rax,%rax movq %rax,0(%rdi) movq %rax,8(%rdi) @@ -25,61 +30,30 @@ poly1305_init: cmpq $0,%rsi je .Lno_key - leaq poly1305_blocks(%rip),%r10 - leaq poly1305_emit(%rip),%r11 - movq OPENSSL_ia32cap_P+4(%rip),%r9 - leaq poly1305_blocks_avx(%rip),%rax - leaq poly1305_emit_avx(%rip),%rcx - btq $28,%r9 - cmovcq %rax,%r10 - cmovcq %rcx,%r11 - leaq poly1305_blocks_avx2(%rip),%rax - btq $37,%r9 - cmovcq %rax,%r10 - movq $2149646336,%rax - shrq $32,%r9 - andq %rax,%r9 - cmpq %rax,%r9 - je .Linit_base2_44 movq $0x0ffffffc0fffffff,%rax movq $0x0ffffffc0ffffffc,%rcx andq 0(%rsi),%rax andq 8(%rsi),%rcx movq %rax,24(%rdi) movq %rcx,32(%rdi) - movq %r10,0(%rdx) - movq %r11,8(%rdx) movl $1,%eax .Lno_key: - .byte 0xf3,0xc3 -.size poly1305_init,.-poly1305_init + ret +ENDPROC(poly1305_init_x86_64) -.type poly1305_blocks,@function .align 32 -poly1305_blocks: -.cfi_startproc +ENTRY(poly1305_blocks_x86_64) .Lblocks: shrq $4,%rdx jz .Lno_data pushq %rbx -.cfi_adjust_cfa_offset 8 -.cfi_offset %rbx,-16 - pushq %rbp -.cfi_adjust_cfa_offset 8 -.cfi_offset %rbp,-24 pushq %r12 -.cfi_adjust_cfa_offset 8 -.cfi_offset %r12,-32 pushq %r13 -.cfi_adjust_cfa_offset 8 -.cfi_offset %r13,-40 pushq %r14 -.cfi_adjust_cfa_offset 8 -.cfi_offset %r14,-48 pushq %r15 -.cfi_adjust_cfa_offset 8 -.cfi_offset %r15,-56 + pushq %rdi + .Lblocks_body: movq %rdx,%r15 @@ -89,7 +63,7 @@ poly1305_blocks: movq 0(%rdi),%r14 movq 8(%rdi),%rbx - movq 16(%rdi),%rbp + movq 16(%rdi),%r10 movq %r13,%r12 shrq $2,%r13 @@ -99,14 +73,15 @@ poly1305_blocks: .align 32 .Loop: + addq 0(%rsi),%r14 adcq 8(%rsi),%rbx leaq 16(%rsi),%rsi - adcq %rcx,%rbp + adcq %rcx,%r10 mulq %r14 movq %rax,%r9 movq %r11,%rax - movq %rdx,%r10 + movq %rdx,%rdi mulq %r14 movq %rax,%r14 @@ -116,62 +91,55 @@ poly1305_blocks: mulq %rbx addq %rax,%r9 movq %r13,%rax - adcq %rdx,%r10 + adcq %rdx,%rdi mulq %rbx - movq %rbp,%rbx + movq %r10,%rbx addq %rax,%r14 adcq %rdx,%r8 imulq %r13,%rbx addq %rbx,%r9 movq %r8,%rbx - adcq $0,%r10 + adcq $0,%rdi - imulq %r11,%rbp + imulq %r11,%r10 addq %r9,%rbx movq $-4,%rax - adcq %rbp,%r10 + adcq %r10,%rdi - andq %r10,%rax - movq %r10,%rbp - shrq $2,%r10 - andq $3,%rbp - addq %r10,%rax + andq %rdi,%rax + movq %rdi,%r10 + shrq $2,%rdi + andq $3,%r10 + addq %rdi,%rax addq %rax,%r14 adcq $0,%rbx - adcq $0,%rbp + adcq $0,%r10 + movq %r12,%rax decq %r15 jnz .Loop + movq 0(%rsp),%rdi + movq %r14,0(%rdi) movq %rbx,8(%rdi) - movq %rbp,16(%rdi) - - movq 0(%rsp),%r15 -.cfi_restore %r15 - movq 8(%rsp),%r14 -.cfi_restore %r14 - movq 16(%rsp),%r13 -.cfi_restore %r13 - movq 24(%rsp),%r12 -.cfi_restore %r12 - movq 32(%rsp),%rbp -.cfi_restore %rbp + movq %r10,16(%rdi) + + movq 8(%rsp),%r15 + movq 16(%rsp),%r14 + movq 24(%rsp),%r13 + movq 32(%rsp),%r12 movq 40(%rsp),%rbx -.cfi_restore %rbx leaq 48(%rsp),%rsp -.cfi_adjust_cfa_offset -48 .Lno_data: .Lblocks_epilogue: - .byte 0xf3,0xc3 -.cfi_endproc -.size poly1305_blocks,.-poly1305_blocks + ret 
+ENDPROC(poly1305_blocks_x86_64) -.type poly1305_emit,@function .align 32 -poly1305_emit: +ENTRY(poly1305_emit_x86_64) .Lemit: movq 0(%rdi),%r8 movq 8(%rdi),%r9 @@ -191,15 +159,14 @@ poly1305_emit: movq %rax,0(%rsi) movq %rcx,8(%rsi) - .byte 0xf3,0xc3 -.size poly1305_emit,.-poly1305_emit -.type __poly1305_block,@function -.align 32 -__poly1305_block: + ret +ENDPROC(poly1305_emit_x86_64) + +.macro __poly1305_block mulq %r14 movq %rax,%r9 movq %r11,%rax - movq %rdx,%r10 + movq %rdx,%rdi mulq %r14 movq %rax,%r14 @@ -209,45 +176,44 @@ __poly1305_block: mulq %rbx addq %rax,%r9 movq %r13,%rax - adcq %rdx,%r10 + adcq %rdx,%rdi mulq %rbx - movq %rbp,%rbx + movq %r10,%rbx addq %rax,%r14 adcq %rdx,%r8 imulq %r13,%rbx addq %rbx,%r9 movq %r8,%rbx - adcq $0,%r10 + adcq $0,%rdi - imulq %r11,%rbp + imulq %r11,%r10 addq %r9,%rbx movq $-4,%rax - adcq %rbp,%r10 + adcq %r10,%rdi - andq %r10,%rax - movq %r10,%rbp - shrq $2,%r10 - andq $3,%rbp - addq %r10,%rax + andq %rdi,%rax + movq %rdi,%r10 + shrq $2,%rdi + andq $3,%r10 + addq %rdi,%rax addq %rax,%r14 adcq $0,%rbx - adcq $0,%rbp - .byte 0xf3,0xc3 -.size __poly1305_block,.-__poly1305_block + adcq $0,%r10 +.endm -.type __poly1305_init_avx,@function -.align 32 -__poly1305_init_avx: +.macro __poly1305_init_avx movq %r11,%r14 movq %r12,%rbx - xorq %rbp,%rbp + xorq %r10,%r10 leaq 48+64(%rdi),%rdi movq %r12,%rax - call __poly1305_block + movq %rdi,0(%rsp) + __poly1305_block + movq 0(%rsp),%rdi movl $0x3ffffff,%eax movl $0x3ffffff,%edx @@ -305,7 +271,7 @@ __poly1305_init_avx: movl %edx,36(%rdi) shrq $26,%r9 - movq %rbp,%rax + movq %r10,%rax shlq $24,%rax orq %rax,%r8 movl %r8d,48(%rdi) @@ -316,7 +282,9 @@ __poly1305_init_avx: movl %r9d,68(%rdi) movq %r12,%rax - call __poly1305_block + movq %rdi,0(%rsp) + __poly1305_block + movq 0(%rsp),%rdi movl $0x3ffffff,%eax movq %r14,%r8 @@ -348,7 +316,7 @@ __poly1305_init_avx: shrq $26,%r8 movl %edx,44(%rdi) - movq %rbp,%rax + movq %r10,%rax shlq $24,%rax orq %rax,%r8 movl %r8d,60(%rdi) @@ -356,7 +324,9 @@ __poly1305_init_avx: movl %r8d,76(%rdi) movq %r12,%rax - call __poly1305_block + movq %rdi,0(%rsp) + __poly1305_block + movq 0(%rsp),%rdi movl $0x3ffffff,%eax movq %r14,%r8 @@ -388,7 +358,7 @@ __poly1305_init_avx: shrq $26,%r8 movl %edx,40(%rdi) - movq %rbp,%rax + movq %r10,%rax shlq $24,%rax orq %rax,%r8 movl %r8d,56(%rdi) @@ -396,13 +366,12 @@ __poly1305_init_avx: movl %r8d,72(%rdi) leaq -48-64(%rdi),%rdi - .byte 0xf3,0xc3 -.size __poly1305_init_avx,.-__poly1305_init_avx +.endm -.type poly1305_blocks_avx,@function +#ifdef CONFIG_AS_AVX .align 32 -poly1305_blocks_avx: -.cfi_startproc +ENTRY(poly1305_blocks_avx) + movl 20(%rdi),%r8d cmpq $128,%rdx jae .Lblocks_avx @@ -422,30 +391,19 @@ poly1305_blocks_avx: jz .Leven_avx pushq %rbx -.cfi_adjust_cfa_offset 8 -.cfi_offset %rbx,-16 - pushq %rbp -.cfi_adjust_cfa_offset 8 -.cfi_offset %rbp,-24 pushq %r12 -.cfi_adjust_cfa_offset 8 -.cfi_offset %r12,-32 pushq %r13 -.cfi_adjust_cfa_offset 8 -.cfi_offset %r13,-40 pushq %r14 -.cfi_adjust_cfa_offset 8 -.cfi_offset %r14,-48 pushq %r15 -.cfi_adjust_cfa_offset 8 -.cfi_offset %r15,-56 + pushq %rdi + .Lblocks_avx_body: movq %rdx,%r15 movq 0(%rdi),%r8 movq 8(%rdi),%r9 - movl 16(%rdi),%ebp + movl 16(%rdi),%r10d movq 24(%rdi),%r11 movq 32(%rdi),%r13 @@ -465,21 +423,21 @@ poly1305_blocks_avx: addq %r12,%r14 adcq %r9,%rbx - movq %rbp,%r8 + movq %r10,%r8 shlq $40,%r8 - shrq $24,%rbp + shrq $24,%r10 addq %r8,%rbx - adcq $0,%rbp + adcq $0,%r10 movq $-4,%r9 - movq %rbp,%r8 - andq %rbp,%r9 + movq %r10,%r8 + andq %r10,%r9 shrq $2,%r8 - andq $3,%rbp + 
andq $3,%r10 addq %r9,%r8 addq %r8,%r14 adcq $0,%rbx - adcq $0,%rbp + adcq $0,%r10 movq %r13,%r12 movq %r13,%rax @@ -489,9 +447,11 @@ poly1305_blocks_avx: addq 0(%rsi),%r14 adcq 8(%rsi),%rbx leaq 16(%rsi),%rsi - adcq %rcx,%rbp + adcq %rcx,%r10 - call __poly1305_block + movq %rdi,0(%rsp) + __poly1305_block + movq 0(%rsp),%rdi testq %rcx,%rcx jz .Lstore_base2_64_avx @@ -508,11 +468,11 @@ poly1305_blocks_avx: andq $0x3ffffff,%rdx shrq $14,%rbx orq %r11,%r14 - shlq $24,%rbp + shlq $24,%r10 andq $0x3ffffff,%r14 shrq $40,%r12 andq $0x3ffffff,%rbx - orq %r12,%rbp + orq %r12,%r10 subq $16,%r15 jz .Lstore_base2_26_avx @@ -521,14 +481,14 @@ poly1305_blocks_avx: vmovd %edx,%xmm1 vmovd %r14d,%xmm2 vmovd %ebx,%xmm3 - vmovd %ebp,%xmm4 + vmovd %r10d,%xmm4 jmp .Lproceed_avx .align 32 .Lstore_base2_64_avx: movq %r14,0(%rdi) movq %rbx,8(%rdi) - movq %rbp,16(%rdi) + movq %r10,16(%rdi) jmp .Ldone_avx .align 16 @@ -537,49 +497,30 @@ poly1305_blocks_avx: movl %edx,4(%rdi) movl %r14d,8(%rdi) movl %ebx,12(%rdi) - movl %ebp,16(%rdi) + movl %r10d,16(%rdi) .align 16 .Ldone_avx: - movq 0(%rsp),%r15 -.cfi_restore %r15 - movq 8(%rsp),%r14 -.cfi_restore %r14 - movq 16(%rsp),%r13 -.cfi_restore %r13 - movq 24(%rsp),%r12 -.cfi_restore %r12 - movq 32(%rsp),%rbp -.cfi_restore %rbp + movq 8(%rsp),%r15 + movq 16(%rsp),%r14 + movq 24(%rsp),%r13 + movq 32(%rsp),%r12 movq 40(%rsp),%rbx -.cfi_restore %rbx leaq 48(%rsp),%rsp -.cfi_adjust_cfa_offset -48 + .Lno_data_avx: .Lblocks_avx_epilogue: - .byte 0xf3,0xc3 -.cfi_endproc + ret .align 32 .Lbase2_64_avx: -.cfi_startproc + pushq %rbx -.cfi_adjust_cfa_offset 8 -.cfi_offset %rbx,-16 - pushq %rbp -.cfi_adjust_cfa_offset 8 -.cfi_offset %rbp,-24 pushq %r12 -.cfi_adjust_cfa_offset 8 -.cfi_offset %r12,-32 pushq %r13 -.cfi_adjust_cfa_offset 8 -.cfi_offset %r13,-40 pushq %r14 -.cfi_adjust_cfa_offset 8 -.cfi_offset %r14,-48 pushq %r15 -.cfi_adjust_cfa_offset 8 -.cfi_offset %r15,-56 + pushq %rdi + .Lbase2_64_avx_body: movq %rdx,%r15 @@ -589,7 +530,7 @@ poly1305_blocks_avx: movq 0(%rdi),%r14 movq 8(%rdi),%rbx - movl 16(%rdi),%ebp + movl 16(%rdi),%r10d movq %r13,%r12 movq %r13,%rax @@ -602,10 +543,12 @@ poly1305_blocks_avx: addq 0(%rsi),%r14 adcq 8(%rsi),%rbx leaq 16(%rsi),%rsi - adcq %rcx,%rbp + adcq %rcx,%r10 subq $16,%r15 - call __poly1305_block + movq %rdi,0(%rsp) + __poly1305_block + movq 0(%rsp),%rdi .Linit_avx: @@ -620,46 +563,38 @@ poly1305_blocks_avx: andq $0x3ffffff,%rdx shrq $14,%rbx orq %r8,%r14 - shlq $24,%rbp + shlq $24,%r10 andq $0x3ffffff,%r14 shrq $40,%r9 andq $0x3ffffff,%rbx - orq %r9,%rbp + orq %r9,%r10 vmovd %eax,%xmm0 vmovd %edx,%xmm1 vmovd %r14d,%xmm2 vmovd %ebx,%xmm3 - vmovd %ebp,%xmm4 + vmovd %r10d,%xmm4 movl $1,20(%rdi) - call __poly1305_init_avx + __poly1305_init_avx .Lproceed_avx: movq %r15,%rdx - movq 0(%rsp),%r15 -.cfi_restore %r15 - movq 8(%rsp),%r14 -.cfi_restore %r14 - movq 16(%rsp),%r13 -.cfi_restore %r13 - movq 24(%rsp),%r12 -.cfi_restore %r12 - movq 32(%rsp),%rbp -.cfi_restore %rbp + movq 8(%rsp),%r15 + movq 16(%rsp),%r14 + movq 24(%rsp),%r13 + movq 32(%rsp),%r12 movq 40(%rsp),%rbx -.cfi_restore %rbx leaq 48(%rsp),%rax leaq 48(%rsp),%rsp -.cfi_adjust_cfa_offset -48 + .Lbase2_64_avx_epilogue: jmp .Ldo_avx -.cfi_endproc + .align 32 .Leven_avx: -.cfi_startproc vmovd 0(%rdi),%xmm0 vmovd 4(%rdi),%xmm1 vmovd 8(%rdi),%xmm2 @@ -667,8 +602,10 @@ poly1305_blocks_avx: vmovd 16(%rdi),%xmm4 .Ldo_avx: + leaq 8(%rsp),%r10 + andq $-32,%rsp + subq $8,%rsp leaq -88(%rsp),%r11 -.cfi_def_cfa %r11,0x60 subq $0x178,%rsp subq $64,%rdx leaq -32(%rsi),%rax @@ -678,8 +615,6 @@ 
poly1305_blocks_avx: leaq 112(%rdi),%rdi leaq .Lconst(%rip),%rcx - - vmovdqu 32(%rsi),%xmm5 vmovdqu 48(%rsi),%xmm6 vmovdqa 64(%rcx),%xmm15 @@ -754,25 +689,6 @@ poly1305_blocks_avx: .align 32 .Loop_avx: - - - - - - - - - - - - - - - - - - - vpmuludq %xmm5,%xmm14,%xmm10 vpmuludq %xmm6,%xmm14,%xmm11 vmovdqa %xmm2,32(%r11) @@ -866,15 +782,6 @@ poly1305_blocks_avx: subq $64,%rdx cmovcq %rax,%rsi - - - - - - - - - vpmuludq %xmm0,%xmm9,%xmm5 vpmuludq %xmm1,%xmm9,%xmm6 vpaddq %xmm5,%xmm10,%xmm10 @@ -957,10 +864,6 @@ poly1305_blocks_avx: vpand %xmm15,%xmm8,%xmm8 vpor 32(%rcx),%xmm9,%xmm9 - - - - vpsrlq $26,%xmm3,%xmm13 vpand %xmm15,%xmm3,%xmm3 vpaddq %xmm13,%xmm4,%xmm4 @@ -995,9 +898,6 @@ poly1305_blocks_avx: ja .Loop_avx .Lskip_loop_avx: - - - vpshufd $0x10,%xmm14,%xmm14 addq $32,%rdx jnz .Long_tail_avx @@ -1015,12 +915,6 @@ poly1305_blocks_avx: vmovdqa %xmm3,48(%r11) vmovdqa %xmm4,64(%r11) - - - - - - vpmuludq %xmm7,%xmm14,%xmm12 vpmuludq %xmm5,%xmm14,%xmm10 vpshufd $0x10,-48(%rdi),%xmm2 @@ -1107,9 +1001,6 @@ poly1305_blocks_avx: vpaddq 48(%r11),%xmm3,%xmm3 vpaddq 64(%r11),%xmm4,%xmm4 - - - vpmuludq %xmm0,%xmm9,%xmm5 vpaddq %xmm5,%xmm10,%xmm10 vpmuludq %xmm1,%xmm9,%xmm6 @@ -1175,8 +1066,6 @@ poly1305_blocks_avx: .Lshort_tail_avx: - - vpsrldq $8,%xmm14,%xmm9 vpsrldq $8,%xmm13,%xmm8 vpsrldq $8,%xmm11,%xmm6 @@ -1188,9 +1077,6 @@ poly1305_blocks_avx: vpaddq %xmm6,%xmm11,%xmm11 vpaddq %xmm7,%xmm12,%xmm12 - - - vpsrlq $26,%xmm13,%xmm3 vpand %xmm15,%xmm13,%xmm13 vpaddq %xmm3,%xmm14,%xmm14 @@ -1227,16 +1113,14 @@ poly1305_blocks_avx: vmovd %xmm12,-104(%rdi) vmovd %xmm13,-100(%rdi) vmovd %xmm14,-96(%rdi) - leaq 88(%r11),%rsp -.cfi_def_cfa %rsp,8 + leaq -8(%r10),%rsp + vzeroupper - .byte 0xf3,0xc3 -.cfi_endproc -.size poly1305_blocks_avx,.-poly1305_blocks_avx + ret +ENDPROC(poly1305_blocks_avx) -.type poly1305_emit_avx,@function .align 32 -poly1305_emit_avx: +ENTRY(poly1305_emit_avx) cmpl $0,20(%rdi) je .Lemit @@ -1286,12 +1170,14 @@ poly1305_emit_avx: movq %rax,0(%rsi) movq %rcx,8(%rsi) - .byte 0xf3,0xc3 -.size poly1305_emit_avx,.-poly1305_emit_avx -.type poly1305_blocks_avx2,@function + ret +ENDPROC(poly1305_emit_avx) +#endif /* CONFIG_AS_AVX */ + +#ifdef CONFIG_AS_AVX2 .align 32 -poly1305_blocks_avx2: -.cfi_startproc +ENTRY(poly1305_blocks_avx2) + movl 20(%rdi),%r8d cmpq $128,%rdx jae .Lblocks_avx2 @@ -1311,30 +1197,19 @@ poly1305_blocks_avx2: jz .Leven_avx2 pushq %rbx -.cfi_adjust_cfa_offset 8 -.cfi_offset %rbx,-16 - pushq %rbp -.cfi_adjust_cfa_offset 8 -.cfi_offset %rbp,-24 pushq %r12 -.cfi_adjust_cfa_offset 8 -.cfi_offset %r12,-32 pushq %r13 -.cfi_adjust_cfa_offset 8 -.cfi_offset %r13,-40 pushq %r14 -.cfi_adjust_cfa_offset 8 -.cfi_offset %r14,-48 pushq %r15 -.cfi_adjust_cfa_offset 8 -.cfi_offset %r15,-56 + pushq %rdi + .Lblocks_avx2_body: movq %rdx,%r15 movq 0(%rdi),%r8 movq 8(%rdi),%r9 - movl 16(%rdi),%ebp + movl 16(%rdi),%r10d movq 24(%rdi),%r11 movq 32(%rdi),%r13 @@ -1354,21 +1229,21 @@ poly1305_blocks_avx2: addq %r12,%r14 adcq %r9,%rbx - movq %rbp,%r8 + movq %r10,%r8 shlq $40,%r8 - shrq $24,%rbp + shrq $24,%r10 addq %r8,%rbx - adcq $0,%rbp + adcq $0,%r10 movq $-4,%r9 - movq %rbp,%r8 - andq %rbp,%r9 + movq %r10,%r8 + andq %r10,%r9 shrq $2,%r8 - andq $3,%rbp + andq $3,%r10 addq %r9,%r8 addq %r8,%r14 adcq $0,%rbx - adcq $0,%rbp + adcq $0,%r10 movq %r13,%r12 movq %r13,%rax @@ -1379,10 +1254,12 @@ poly1305_blocks_avx2: addq 0(%rsi),%r14 adcq 8(%rsi),%rbx leaq 16(%rsi),%rsi - adcq %rcx,%rbp + adcq %rcx,%r10 subq $16,%r15 - call __poly1305_block + movq %rdi,0(%rsp) + __poly1305_block + movq 0(%rsp),%rdi 
movq %r12,%rax testq $63,%r15 @@ -1403,11 +1280,11 @@ poly1305_blocks_avx2: andq $0x3ffffff,%rdx shrq $14,%rbx orq %r11,%r14 - shlq $24,%rbp + shlq $24,%r10 andq $0x3ffffff,%r14 shrq $40,%r12 andq $0x3ffffff,%rbx - orq %r12,%rbp + orq %r12,%r10 testq %r15,%r15 jz .Lstore_base2_26_avx2 @@ -1416,14 +1293,14 @@ poly1305_blocks_avx2: vmovd %edx,%xmm1 vmovd %r14d,%xmm2 vmovd %ebx,%xmm3 - vmovd %ebp,%xmm4 + vmovd %r10d,%xmm4 jmp .Lproceed_avx2 .align 32 .Lstore_base2_64_avx2: movq %r14,0(%rdi) movq %rbx,8(%rdi) - movq %rbp,16(%rdi) + movq %r10,16(%rdi) jmp .Ldone_avx2 .align 16 @@ -1432,49 +1309,32 @@ poly1305_blocks_avx2: movl %edx,4(%rdi) movl %r14d,8(%rdi) movl %ebx,12(%rdi) - movl %ebp,16(%rdi) + movl %r10d,16(%rdi) .align 16 .Ldone_avx2: - movq 0(%rsp),%r15 -.cfi_restore %r15 - movq 8(%rsp),%r14 -.cfi_restore %r14 - movq 16(%rsp),%r13 -.cfi_restore %r13 - movq 24(%rsp),%r12 -.cfi_restore %r12 - movq 32(%rsp),%rbp -.cfi_restore %rbp + movq 8(%rsp),%r15 + movq 16(%rsp),%r14 + movq 24(%rsp),%r13 + movq 32(%rsp),%r12 movq 40(%rsp),%rbx -.cfi_restore %rbx leaq 48(%rsp),%rsp -.cfi_adjust_cfa_offset -48 + .Lno_data_avx2: .Lblocks_avx2_epilogue: - .byte 0xf3,0xc3 -.cfi_endproc + ret + .align 32 .Lbase2_64_avx2: -.cfi_startproc + + pushq %rbx -.cfi_adjust_cfa_offset 8 -.cfi_offset %rbx,-16 - pushq %rbp -.cfi_adjust_cfa_offset 8 -.cfi_offset %rbp,-24 pushq %r12 -.cfi_adjust_cfa_offset 8 -.cfi_offset %r12,-32 pushq %r13 -.cfi_adjust_cfa_offset 8 -.cfi_offset %r13,-40 pushq %r14 -.cfi_adjust_cfa_offset 8 -.cfi_offset %r14,-48 pushq %r15 -.cfi_adjust_cfa_offset 8 -.cfi_offset %r15,-56 + pushq %rdi + .Lbase2_64_avx2_body: movq %rdx,%r15 @@ -1484,7 +1344,7 @@ poly1305_blocks_avx2: movq 0(%rdi),%r14 movq 8(%rdi),%rbx - movl 16(%rdi),%ebp + movl 16(%rdi),%r10d movq %r13,%r12 movq %r13,%rax @@ -1498,10 +1358,12 @@ poly1305_blocks_avx2: addq 0(%rsi),%r14 adcq 8(%rsi),%rbx leaq 16(%rsi),%rsi - adcq %rcx,%rbp + adcq %rcx,%r10 subq $16,%r15 - call __poly1305_block + movq %rdi,0(%rsp) + __poly1305_block + movq 0(%rsp),%rdi movq %r12,%rax testq $63,%r15 @@ -1520,49 +1382,39 @@ poly1305_blocks_avx2: andq $0x3ffffff,%rdx shrq $14,%rbx orq %r8,%r14 - shlq $24,%rbp + shlq $24,%r10 andq $0x3ffffff,%r14 shrq $40,%r9 andq $0x3ffffff,%rbx - orq %r9,%rbp + orq %r9,%r10 vmovd %eax,%xmm0 vmovd %edx,%xmm1 vmovd %r14d,%xmm2 vmovd %ebx,%xmm3 - vmovd %ebp,%xmm4 + vmovd %r10d,%xmm4 movl $1,20(%rdi) - call __poly1305_init_avx + __poly1305_init_avx .Lproceed_avx2: movq %r15,%rdx - movl OPENSSL_ia32cap_P+8(%rip),%r10d - movl $3221291008,%r11d - - movq 0(%rsp),%r15 -.cfi_restore %r15 - movq 8(%rsp),%r14 -.cfi_restore %r14 - movq 16(%rsp),%r13 -.cfi_restore %r13 - movq 24(%rsp),%r12 -.cfi_restore %r12 - movq 32(%rsp),%rbp -.cfi_restore %rbp + + movq 8(%rsp),%r15 + movq 16(%rsp),%r14 + movq 24(%rsp),%r13 + movq 32(%rsp),%r12 movq 40(%rsp),%rbx -.cfi_restore %rbx leaq 48(%rsp),%rax leaq 48(%rsp),%rsp -.cfi_adjust_cfa_offset -48 + .Lbase2_64_avx2_epilogue: jmp .Ldo_avx2 -.cfi_endproc + .align 32 .Leven_avx2: -.cfi_startproc - movl OPENSSL_ia32cap_P+8(%rip),%r10d + vmovd 0(%rdi),%xmm0 vmovd 4(%rdi),%xmm1 vmovd 8(%rdi),%xmm2 @@ -1570,14 +1422,7 @@ poly1305_blocks_avx2: vmovd 16(%rdi),%xmm4 .Ldo_avx2: - cmpq $512,%rdx - jb .Lskip_avx512 - andl %r11d,%r10d - testl $65536,%r10d - jnz .Lblocks_avx512 -.Lskip_avx512: - leaq -8(%rsp),%r11 -.cfi_def_cfa %r11,16 + leaq 8(%rsp),%r10 subq $0x128,%rsp leaq .Lconst(%rip),%rcx leaq 48+64(%rdi),%rdi @@ -1647,13 +1492,6 @@ poly1305_blocks_avx2: .align 32 .Loop_avx2: - - - - - - - vpaddq %ymm0,%ymm7,%ymm0 
vmovdqa 0(%rsp),%ymm7 vpaddq %ymm1,%ymm8,%ymm1 @@ -1664,21 +1502,6 @@ poly1305_blocks_avx2: vmovdqa 48(%rax),%ymm10 vmovdqa 112(%rax),%ymm5 - - - - - - - - - - - - - - - vpmuludq %ymm2,%ymm7,%ymm13 vpmuludq %ymm2,%ymm8,%ymm14 vpmuludq %ymm2,%ymm9,%ymm15 @@ -1743,9 +1566,6 @@ poly1305_blocks_avx2: vpaddq %ymm4,%ymm15,%ymm4 vpaddq %ymm0,%ymm11,%ymm0 - - - vpsrlq $26,%ymm3,%ymm14 vpand %ymm5,%ymm3,%ymm3 vpaddq %ymm14,%ymm4,%ymm4 @@ -1798,12 +1618,6 @@ poly1305_blocks_avx2: .byte 0x66,0x90 .Ltail_avx2: - - - - - - vpaddq %ymm0,%ymm7,%ymm0 vmovdqu 4(%rsp),%ymm7 vpaddq %ymm1,%ymm8,%ymm1 @@ -1868,9 +1682,6 @@ poly1305_blocks_avx2: vpaddq %ymm4,%ymm15,%ymm4 vpaddq %ymm0,%ymm11,%ymm0 - - - vpsrldq $8,%ymm12,%ymm8 vpsrldq $8,%ymm2,%ymm9 vpsrldq $8,%ymm3,%ymm10 @@ -1893,9 +1704,6 @@ poly1305_blocks_avx2: vpaddq %ymm8,%ymm12,%ymm12 vpaddq %ymm9,%ymm2,%ymm2 - - - vpsrlq $26,%ymm3,%ymm14 vpand %ymm5,%ymm3,%ymm3 vpaddq %ymm14,%ymm4,%ymm4 @@ -1932,110 +1740,673 @@ poly1305_blocks_avx2: vmovd %xmm2,-104(%rdi) vmovd %xmm3,-100(%rdi) vmovd %xmm4,-96(%rdi) - leaq 8(%r11),%rsp -.cfi_def_cfa %rsp,8 + leaq -8(%r10),%rsp + vzeroupper - .byte 0xf3,0xc3 -.cfi_endproc -.size poly1305_blocks_avx2,.-poly1305_blocks_avx2 -.type poly1305_blocks_avx512,@function + ret + +ENDPROC(poly1305_blocks_avx2) +#endif /* CONFIG_AS_AVX2 */ + +#ifdef CONFIG_AS_AVX512 .align 32 -poly1305_blocks_avx512: -.cfi_startproc -.Lblocks_avx512: - movl $15,%eax - kmovw %eax,%k2 - leaq -8(%rsp),%r11 -.cfi_def_cfa %r11,16 - subq $0x128,%rsp - leaq .Lconst(%rip),%rcx - leaq 48+64(%rdi),%rdi - vmovdqa 96(%rcx),%ymm9 +ENTRY(poly1305_blocks_avx512) + movl 20(%rdi),%r8d + cmpq $128,%rdx + jae .Lblocks_avx2_512 + testl %r8d,%r8d + jz .Lblocks - vmovdqu -64(%rdi),%xmm11 - andq $-512,%rsp - vmovdqu -48(%rdi),%xmm12 - movq $0x20,%rax - vmovdqu -32(%rdi),%xmm7 - vmovdqu -16(%rdi),%xmm13 - vmovdqu 0(%rdi),%xmm8 - vmovdqu 16(%rdi),%xmm14 - vmovdqu 32(%rdi),%xmm10 - vmovdqu 48(%rdi),%xmm15 - vmovdqu 64(%rdi),%xmm6 - vpermd %zmm11,%zmm9,%zmm16 - vpbroadcastq 64(%rcx),%zmm5 - vpermd %zmm12,%zmm9,%zmm17 - vpermd %zmm7,%zmm9,%zmm21 - vpermd %zmm13,%zmm9,%zmm18 - vmovdqa64 %zmm16,0(%rsp){%k2} - vpsrlq $32,%zmm16,%zmm7 - vpermd %zmm8,%zmm9,%zmm22 - vmovdqu64 %zmm17,0(%rsp,%rax,1){%k2} - vpsrlq $32,%zmm17,%zmm8 - vpermd %zmm14,%zmm9,%zmm19 - vmovdqa64 %zmm21,64(%rsp){%k2} - vpermd %zmm10,%zmm9,%zmm23 - vpermd %zmm15,%zmm9,%zmm20 - vmovdqu64 %zmm18,64(%rsp,%rax,1){%k2} - vpermd %zmm6,%zmm9,%zmm24 - vmovdqa64 %zmm22,128(%rsp){%k2} - vmovdqu64 %zmm19,128(%rsp,%rax,1){%k2} - vmovdqa64 %zmm23,192(%rsp){%k2} - vmovdqu64 %zmm20,192(%rsp,%rax,1){%k2} - vmovdqa64 %zmm24,256(%rsp){%k2} +.Lblocks_avx2_512: + andq $-16,%rdx + jz .Lno_data_avx2_512 + vzeroupper + testl %r8d,%r8d + jz .Lbase2_64_avx2_512 + testq $63,%rdx + jz .Leven_avx2_512 + pushq %rbx + pushq %r12 + pushq %r13 + pushq %r14 + pushq %r15 + pushq %rdi +.Lblocks_avx2_body_512: + movq %rdx,%r15 + movq 0(%rdi),%r8 + movq 8(%rdi),%r9 + movl 16(%rdi),%r10d + movq 24(%rdi),%r11 + movq 32(%rdi),%r13 - vpmuludq %zmm7,%zmm16,%zmm11 - vpmuludq %zmm7,%zmm17,%zmm12 - vpmuludq %zmm7,%zmm18,%zmm13 - vpmuludq %zmm7,%zmm19,%zmm14 - vpmuludq %zmm7,%zmm20,%zmm15 - vpsrlq $32,%zmm18,%zmm9 + movl %r8d,%r14d + andq $-2147483648,%r8 + movq %r9,%r12 + movl %r9d,%ebx + andq $-2147483648,%r9 - vpmuludq %zmm8,%zmm24,%zmm25 - vpmuludq %zmm8,%zmm16,%zmm26 - vpmuludq %zmm8,%zmm17,%zmm27 - vpmuludq %zmm8,%zmm18,%zmm28 - vpmuludq %zmm8,%zmm19,%zmm29 - vpsrlq $32,%zmm19,%zmm10 - vpaddq %zmm25,%zmm11,%zmm11 - vpaddq %zmm26,%zmm12,%zmm12 - 
vpaddq %zmm27,%zmm13,%zmm13 - vpaddq %zmm28,%zmm14,%zmm14 - vpaddq %zmm29,%zmm15,%zmm15 + shrq $6,%r8 + shlq $52,%r12 + addq %r8,%r14 + shrq $12,%rbx + shrq $18,%r9 + addq %r12,%r14 + adcq %r9,%rbx - vpmuludq %zmm9,%zmm23,%zmm25 - vpmuludq %zmm9,%zmm24,%zmm26 - vpmuludq %zmm9,%zmm17,%zmm28 - vpmuludq %zmm9,%zmm18,%zmm29 - vpmuludq %zmm9,%zmm16,%zmm27 - vpsrlq $32,%zmm20,%zmm6 - vpaddq %zmm25,%zmm11,%zmm11 - vpaddq %zmm26,%zmm12,%zmm12 - vpaddq %zmm28,%zmm14,%zmm14 - vpaddq %zmm29,%zmm15,%zmm15 - vpaddq %zmm27,%zmm13,%zmm13 + movq %r10,%r8 + shlq $40,%r8 + shrq $24,%r10 + addq %r8,%rbx + adcq $0,%r10 - vpmuludq %zmm10,%zmm22,%zmm25 - vpmuludq %zmm10,%zmm16,%zmm28 - vpmuludq %zmm10,%zmm17,%zmm29 - vpmuludq %zmm10,%zmm23,%zmm26 - vpmuludq %zmm10,%zmm24,%zmm27 - vpaddq %zmm25,%zmm11,%zmm11 - vpaddq %zmm28,%zmm14,%zmm14 - vpaddq %zmm29,%zmm15,%zmm15 - vpaddq %zmm26,%zmm12,%zmm12 - vpaddq %zmm27,%zmm13,%zmm13 + movq $-4,%r9 + movq %r10,%r8 + andq %r10,%r9 + shrq $2,%r8 + andq $3,%r10 + addq %r9,%r8 + addq %r8,%r14 + adcq $0,%rbx + adcq $0,%r10 + + movq %r13,%r12 + movq %r13,%rax + shrq $2,%r13 + addq %r12,%r13 + +.Lbase2_26_pre_avx2_512: + addq 0(%rsi),%r14 + adcq 8(%rsi),%rbx + leaq 16(%rsi),%rsi + adcq %rcx,%r10 + subq $16,%r15 + + movq %rdi,0(%rsp) + __poly1305_block + movq 0(%rsp),%rdi + movq %r12,%rax + + testq $63,%r15 + jnz .Lbase2_26_pre_avx2_512 + + testq %rcx,%rcx + jz .Lstore_base2_64_avx2_512 + + + movq %r14,%rax + movq %r14,%rdx + shrq $52,%r14 + movq %rbx,%r11 + movq %rbx,%r12 + shrq $26,%rdx + andq $0x3ffffff,%rax + shlq $12,%r11 + andq $0x3ffffff,%rdx + shrq $14,%rbx + orq %r11,%r14 + shlq $24,%r10 + andq $0x3ffffff,%r14 + shrq $40,%r12 + andq $0x3ffffff,%rbx + orq %r12,%r10 + + testq %r15,%r15 + jz .Lstore_base2_26_avx2_512 + + vmovd %eax,%xmm0 + vmovd %edx,%xmm1 + vmovd %r14d,%xmm2 + vmovd %ebx,%xmm3 + vmovd %r10d,%xmm4 + jmp .Lproceed_avx2_512 + +.align 32 +.Lstore_base2_64_avx2_512: + movq %r14,0(%rdi) + movq %rbx,8(%rdi) + movq %r10,16(%rdi) + jmp .Ldone_avx2_512 + +.align 16 +.Lstore_base2_26_avx2_512: + movl %eax,0(%rdi) + movl %edx,4(%rdi) + movl %r14d,8(%rdi) + movl %ebx,12(%rdi) + movl %r10d,16(%rdi) +.align 16 +.Ldone_avx2_512: + movq 8(%rsp),%r15 + movq 16(%rsp),%r14 + movq 24(%rsp),%r13 + movq 32(%rsp),%r12 + movq 40(%rsp),%rbx + leaq 48(%rsp),%rsp + +.Lno_data_avx2_512: +.Lblocks_avx2_epilogue_512: + ret + + +.align 32 +.Lbase2_64_avx2_512: + + pushq %rbx + pushq %r12 + pushq %r13 + pushq %r14 + pushq %r15 + pushq %rdi + +.Lbase2_64_avx2_body_512: + + movq %rdx,%r15 + + movq 24(%rdi),%r11 + movq 32(%rdi),%r13 + + movq 0(%rdi),%r14 + movq 8(%rdi),%rbx + movl 16(%rdi),%r10d + + movq %r13,%r12 + movq %r13,%rax + shrq $2,%r13 + addq %r12,%r13 + + testq $63,%rdx + jz .Linit_avx2_512 + +.Lbase2_64_pre_avx2_512: + addq 0(%rsi),%r14 + adcq 8(%rsi),%rbx + leaq 16(%rsi),%rsi + adcq %rcx,%r10 + subq $16,%r15 + + movq %rdi,0(%rsp) + __poly1305_block + movq 0(%rsp),%rdi + movq %r12,%rax + + testq $63,%r15 + jnz .Lbase2_64_pre_avx2_512 + +.Linit_avx2_512: + + movq %r14,%rax + movq %r14,%rdx + shrq $52,%r14 + movq %rbx,%r8 + movq %rbx,%r9 + shrq $26,%rdx + andq $0x3ffffff,%rax + shlq $12,%r8 + andq $0x3ffffff,%rdx + shrq $14,%rbx + orq %r8,%r14 + shlq $24,%r10 + andq $0x3ffffff,%r14 + shrq $40,%r9 + andq $0x3ffffff,%rbx + orq %r9,%r10 + + vmovd %eax,%xmm0 + vmovd %edx,%xmm1 + vmovd %r14d,%xmm2 + vmovd %ebx,%xmm3 + vmovd %r10d,%xmm4 + movl $1,20(%rdi) + + __poly1305_init_avx + +.Lproceed_avx2_512: + movq %r15,%rdx + + movq 8(%rsp),%r15 + movq 16(%rsp),%r14 + movq 24(%rsp),%r13 + 
movq 32(%rsp),%r12 + movq 40(%rsp),%rbx + leaq 48(%rsp),%rax + leaq 48(%rsp),%rsp + +.Lbase2_64_avx2_epilogue_512: + jmp .Ldo_avx2_512 + + +.align 32 +.Leven_avx2_512: + + vmovd 0(%rdi),%xmm0 + vmovd 4(%rdi),%xmm1 + vmovd 8(%rdi),%xmm2 + vmovd 12(%rdi),%xmm3 + vmovd 16(%rdi),%xmm4 + +.Ldo_avx2_512: + cmpq $512,%rdx + jae .Lblocks_avx512 +.Lskip_avx512: + leaq 8(%rsp),%r10 + + subq $0x128,%rsp + leaq .Lconst(%rip),%rcx + leaq 48+64(%rdi),%rdi + vmovdqa 96(%rcx),%ymm7 + + + vmovdqu -64(%rdi),%xmm9 + andq $-512,%rsp + vmovdqu -48(%rdi),%xmm10 + vmovdqu -32(%rdi),%xmm6 + vmovdqu -16(%rdi),%xmm11 + vmovdqu 0(%rdi),%xmm12 + vmovdqu 16(%rdi),%xmm13 + leaq 144(%rsp),%rax + vmovdqu 32(%rdi),%xmm14 + vpermd %ymm9,%ymm7,%ymm9 + vmovdqu 48(%rdi),%xmm15 + vpermd %ymm10,%ymm7,%ymm10 + vmovdqu 64(%rdi),%xmm5 + vpermd %ymm6,%ymm7,%ymm6 + vmovdqa %ymm9,0(%rsp) + vpermd %ymm11,%ymm7,%ymm11 + vmovdqa %ymm10,32-144(%rax) + vpermd %ymm12,%ymm7,%ymm12 + vmovdqa %ymm6,64-144(%rax) + vpermd %ymm13,%ymm7,%ymm13 + vmovdqa %ymm11,96-144(%rax) + vpermd %ymm14,%ymm7,%ymm14 + vmovdqa %ymm12,128-144(%rax) + vpermd %ymm15,%ymm7,%ymm15 + vmovdqa %ymm13,160-144(%rax) + vpermd %ymm5,%ymm7,%ymm5 + vmovdqa %ymm14,192-144(%rax) + vmovdqa %ymm15,224-144(%rax) + vmovdqa %ymm5,256-144(%rax) + vmovdqa 64(%rcx),%ymm5 + + + + vmovdqu 0(%rsi),%xmm7 + vmovdqu 16(%rsi),%xmm8 + vinserti128 $1,32(%rsi),%ymm7,%ymm7 + vinserti128 $1,48(%rsi),%ymm8,%ymm8 + leaq 64(%rsi),%rsi + + vpsrldq $6,%ymm7,%ymm9 + vpsrldq $6,%ymm8,%ymm10 + vpunpckhqdq %ymm8,%ymm7,%ymm6 + vpunpcklqdq %ymm10,%ymm9,%ymm9 + vpunpcklqdq %ymm8,%ymm7,%ymm7 + + vpsrlq $30,%ymm9,%ymm10 + vpsrlq $4,%ymm9,%ymm9 + vpsrlq $26,%ymm7,%ymm8 + vpsrlq $40,%ymm6,%ymm6 + vpand %ymm5,%ymm9,%ymm9 + vpand %ymm5,%ymm7,%ymm7 + vpand %ymm5,%ymm8,%ymm8 + vpand %ymm5,%ymm10,%ymm10 + vpor 32(%rcx),%ymm6,%ymm6 + + vpaddq %ymm2,%ymm9,%ymm2 + subq $64,%rdx + jz .Ltail_avx2_512 + jmp .Loop_avx2_512 + +.align 32 +.Loop_avx2_512: + + vpaddq %ymm0,%ymm7,%ymm0 + vmovdqa 0(%rsp),%ymm7 + vpaddq %ymm1,%ymm8,%ymm1 + vmovdqa 32(%rsp),%ymm8 + vpaddq %ymm3,%ymm10,%ymm3 + vmovdqa 96(%rsp),%ymm9 + vpaddq %ymm4,%ymm6,%ymm4 + vmovdqa 48(%rax),%ymm10 + vmovdqa 112(%rax),%ymm5 + + vpmuludq %ymm2,%ymm7,%ymm13 + vpmuludq %ymm2,%ymm8,%ymm14 + vpmuludq %ymm2,%ymm9,%ymm15 + vpmuludq %ymm2,%ymm10,%ymm11 + vpmuludq %ymm2,%ymm5,%ymm12 + + vpmuludq %ymm0,%ymm8,%ymm6 + vpmuludq %ymm1,%ymm8,%ymm2 + vpaddq %ymm6,%ymm12,%ymm12 + vpaddq %ymm2,%ymm13,%ymm13 + vpmuludq %ymm3,%ymm8,%ymm6 + vpmuludq 64(%rsp),%ymm4,%ymm2 + vpaddq %ymm6,%ymm15,%ymm15 + vpaddq %ymm2,%ymm11,%ymm11 + vmovdqa -16(%rax),%ymm8 + + vpmuludq %ymm0,%ymm7,%ymm6 + vpmuludq %ymm1,%ymm7,%ymm2 + vpaddq %ymm6,%ymm11,%ymm11 + vpaddq %ymm2,%ymm12,%ymm12 + vpmuludq %ymm3,%ymm7,%ymm6 + vpmuludq %ymm4,%ymm7,%ymm2 + vmovdqu 0(%rsi),%xmm7 + vpaddq %ymm6,%ymm14,%ymm14 + vpaddq %ymm2,%ymm15,%ymm15 + vinserti128 $1,32(%rsi),%ymm7,%ymm7 + + vpmuludq %ymm3,%ymm8,%ymm6 + vpmuludq %ymm4,%ymm8,%ymm2 + vmovdqu 16(%rsi),%xmm8 + vpaddq %ymm6,%ymm11,%ymm11 + vpaddq %ymm2,%ymm12,%ymm12 + vmovdqa 16(%rax),%ymm2 + vpmuludq %ymm1,%ymm9,%ymm6 + vpmuludq %ymm0,%ymm9,%ymm9 + vpaddq %ymm6,%ymm14,%ymm14 + vpaddq %ymm9,%ymm13,%ymm13 + vinserti128 $1,48(%rsi),%ymm8,%ymm8 + leaq 64(%rsi),%rsi + + vpmuludq %ymm1,%ymm2,%ymm6 + vpmuludq %ymm0,%ymm2,%ymm2 + vpsrldq $6,%ymm7,%ymm9 + vpaddq %ymm6,%ymm15,%ymm15 + vpaddq %ymm2,%ymm14,%ymm14 + vpmuludq %ymm3,%ymm10,%ymm6 + vpmuludq %ymm4,%ymm10,%ymm2 + vpsrldq $6,%ymm8,%ymm10 + vpaddq %ymm6,%ymm12,%ymm12 + vpaddq %ymm2,%ymm13,%ymm13 + vpunpckhqdq 
%ymm8,%ymm7,%ymm6 + + vpmuludq %ymm3,%ymm5,%ymm3 + vpmuludq %ymm4,%ymm5,%ymm4 + vpunpcklqdq %ymm8,%ymm7,%ymm7 + vpaddq %ymm3,%ymm13,%ymm2 + vpaddq %ymm4,%ymm14,%ymm3 + vpunpcklqdq %ymm10,%ymm9,%ymm10 + vpmuludq 80(%rax),%ymm0,%ymm4 + vpmuludq %ymm1,%ymm5,%ymm0 + vmovdqa 64(%rcx),%ymm5 + vpaddq %ymm4,%ymm15,%ymm4 + vpaddq %ymm0,%ymm11,%ymm0 + + vpsrlq $26,%ymm3,%ymm14 + vpand %ymm5,%ymm3,%ymm3 + vpaddq %ymm14,%ymm4,%ymm4 + + vpsrlq $26,%ymm0,%ymm11 + vpand %ymm5,%ymm0,%ymm0 + vpaddq %ymm11,%ymm12,%ymm1 + + vpsrlq $26,%ymm4,%ymm15 + vpand %ymm5,%ymm4,%ymm4 + + vpsrlq $4,%ymm10,%ymm9 + + vpsrlq $26,%ymm1,%ymm12 + vpand %ymm5,%ymm1,%ymm1 + vpaddq %ymm12,%ymm2,%ymm2 + + vpaddq %ymm15,%ymm0,%ymm0 + vpsllq $2,%ymm15,%ymm15 + vpaddq %ymm15,%ymm0,%ymm0 + + vpand %ymm5,%ymm9,%ymm9 + vpsrlq $26,%ymm7,%ymm8 + + vpsrlq $26,%ymm2,%ymm13 + vpand %ymm5,%ymm2,%ymm2 + vpaddq %ymm13,%ymm3,%ymm3 + + vpaddq %ymm9,%ymm2,%ymm2 + vpsrlq $30,%ymm10,%ymm10 + + vpsrlq $26,%ymm0,%ymm11 + vpand %ymm5,%ymm0,%ymm0 + vpaddq %ymm11,%ymm1,%ymm1 + + vpsrlq $40,%ymm6,%ymm6 + + vpsrlq $26,%ymm3,%ymm14 + vpand %ymm5,%ymm3,%ymm3 + vpaddq %ymm14,%ymm4,%ymm4 + + vpand %ymm5,%ymm7,%ymm7 + vpand %ymm5,%ymm8,%ymm8 + vpand %ymm5,%ymm10,%ymm10 + vpor 32(%rcx),%ymm6,%ymm6 + + subq $64,%rdx + jnz .Loop_avx2_512 + +.byte 0x66,0x90 +.Ltail_avx2_512: + + vpaddq %ymm0,%ymm7,%ymm0 + vmovdqu 4(%rsp),%ymm7 + vpaddq %ymm1,%ymm8,%ymm1 + vmovdqu 36(%rsp),%ymm8 + vpaddq %ymm3,%ymm10,%ymm3 + vmovdqu 100(%rsp),%ymm9 + vpaddq %ymm4,%ymm6,%ymm4 + vmovdqu 52(%rax),%ymm10 + vmovdqu 116(%rax),%ymm5 + + vpmuludq %ymm2,%ymm7,%ymm13 + vpmuludq %ymm2,%ymm8,%ymm14 + vpmuludq %ymm2,%ymm9,%ymm15 + vpmuludq %ymm2,%ymm10,%ymm11 + vpmuludq %ymm2,%ymm5,%ymm12 + + vpmuludq %ymm0,%ymm8,%ymm6 + vpmuludq %ymm1,%ymm8,%ymm2 + vpaddq %ymm6,%ymm12,%ymm12 + vpaddq %ymm2,%ymm13,%ymm13 + vpmuludq %ymm3,%ymm8,%ymm6 + vpmuludq 68(%rsp),%ymm4,%ymm2 + vpaddq %ymm6,%ymm15,%ymm15 + vpaddq %ymm2,%ymm11,%ymm11 + + vpmuludq %ymm0,%ymm7,%ymm6 + vpmuludq %ymm1,%ymm7,%ymm2 + vpaddq %ymm6,%ymm11,%ymm11 + vmovdqu -12(%rax),%ymm8 + vpaddq %ymm2,%ymm12,%ymm12 + vpmuludq %ymm3,%ymm7,%ymm6 + vpmuludq %ymm4,%ymm7,%ymm2 + vpaddq %ymm6,%ymm14,%ymm14 + vpaddq %ymm2,%ymm15,%ymm15 + + vpmuludq %ymm3,%ymm8,%ymm6 + vpmuludq %ymm4,%ymm8,%ymm2 + vpaddq %ymm6,%ymm11,%ymm11 + vpaddq %ymm2,%ymm12,%ymm12 + vmovdqu 20(%rax),%ymm2 + vpmuludq %ymm1,%ymm9,%ymm6 + vpmuludq %ymm0,%ymm9,%ymm9 + vpaddq %ymm6,%ymm14,%ymm14 + vpaddq %ymm9,%ymm13,%ymm13 + + vpmuludq %ymm1,%ymm2,%ymm6 + vpmuludq %ymm0,%ymm2,%ymm2 + vpaddq %ymm6,%ymm15,%ymm15 + vpaddq %ymm2,%ymm14,%ymm14 + vpmuludq %ymm3,%ymm10,%ymm6 + vpmuludq %ymm4,%ymm10,%ymm2 + vpaddq %ymm6,%ymm12,%ymm12 + vpaddq %ymm2,%ymm13,%ymm13 + + vpmuludq %ymm3,%ymm5,%ymm3 + vpmuludq %ymm4,%ymm5,%ymm4 + vpaddq %ymm3,%ymm13,%ymm2 + vpaddq %ymm4,%ymm14,%ymm3 + vpmuludq 84(%rax),%ymm0,%ymm4 + vpmuludq %ymm1,%ymm5,%ymm0 + vmovdqa 64(%rcx),%ymm5 + vpaddq %ymm4,%ymm15,%ymm4 + vpaddq %ymm0,%ymm11,%ymm0 + + vpsrldq $8,%ymm12,%ymm8 + vpsrldq $8,%ymm2,%ymm9 + vpsrldq $8,%ymm3,%ymm10 + vpsrldq $8,%ymm4,%ymm6 + vpsrldq $8,%ymm0,%ymm7 + vpaddq %ymm8,%ymm12,%ymm12 + vpaddq %ymm9,%ymm2,%ymm2 + vpaddq %ymm10,%ymm3,%ymm3 + vpaddq %ymm6,%ymm4,%ymm4 + vpaddq %ymm7,%ymm0,%ymm0 + + vpermq $0x2,%ymm3,%ymm10 + vpermq $0x2,%ymm4,%ymm6 + vpermq $0x2,%ymm0,%ymm7 + vpermq $0x2,%ymm12,%ymm8 + vpermq $0x2,%ymm2,%ymm9 + vpaddq %ymm10,%ymm3,%ymm3 + vpaddq %ymm6,%ymm4,%ymm4 + vpaddq %ymm7,%ymm0,%ymm0 + vpaddq %ymm8,%ymm12,%ymm12 + vpaddq %ymm9,%ymm2,%ymm2 + + vpsrlq $26,%ymm3,%ymm14 + vpand 
%ymm5,%ymm3,%ymm3 + vpaddq %ymm14,%ymm4,%ymm4 + + vpsrlq $26,%ymm0,%ymm11 + vpand %ymm5,%ymm0,%ymm0 + vpaddq %ymm11,%ymm12,%ymm1 + + vpsrlq $26,%ymm4,%ymm15 + vpand %ymm5,%ymm4,%ymm4 + + vpsrlq $26,%ymm1,%ymm12 + vpand %ymm5,%ymm1,%ymm1 + vpaddq %ymm12,%ymm2,%ymm2 + + vpaddq %ymm15,%ymm0,%ymm0 + vpsllq $2,%ymm15,%ymm15 + vpaddq %ymm15,%ymm0,%ymm0 + + vpsrlq $26,%ymm2,%ymm13 + vpand %ymm5,%ymm2,%ymm2 + vpaddq %ymm13,%ymm3,%ymm3 + + vpsrlq $26,%ymm0,%ymm11 + vpand %ymm5,%ymm0,%ymm0 + vpaddq %ymm11,%ymm1,%ymm1 + + vpsrlq $26,%ymm3,%ymm14 + vpand %ymm5,%ymm3,%ymm3 + vpaddq %ymm14,%ymm4,%ymm4 + + vmovd %xmm0,-112(%rdi) + vmovd %xmm1,-108(%rdi) + vmovd %xmm2,-104(%rdi) + vmovd %xmm3,-100(%rdi) + vmovd %xmm4,-96(%rdi) + leaq -8(%r10),%rsp + + vzeroupper + ret + +.Lblocks_avx512: + + movl $15,%eax + kmovw %eax,%k2 + leaq 8(%rsp),%r10 + + subq $0x128,%rsp + leaq .Lconst(%rip),%rcx + leaq 48+64(%rdi),%rdi + vmovdqa 96(%rcx),%ymm9 + + vmovdqu32 -64(%rdi),%zmm16{%k2}{z} + andq $-512,%rsp + vmovdqu32 -48(%rdi),%zmm17{%k2}{z} + movq $0x20,%rax + vmovdqu32 -32(%rdi),%zmm21{%k2}{z} + vmovdqu32 -16(%rdi),%zmm18{%k2}{z} + vmovdqu32 0(%rdi),%zmm22{%k2}{z} + vmovdqu32 16(%rdi),%zmm19{%k2}{z} + vmovdqu32 32(%rdi),%zmm23{%k2}{z} + vmovdqu32 48(%rdi),%zmm20{%k2}{z} + vmovdqu32 64(%rdi),%zmm24{%k2}{z} + vpermd %zmm16,%zmm9,%zmm16 + vpbroadcastq 64(%rcx),%zmm5 + vpermd %zmm17,%zmm9,%zmm17 + vpermd %zmm21,%zmm9,%zmm21 + vpermd %zmm18,%zmm9,%zmm18 + vmovdqa64 %zmm16,0(%rsp){%k2} + vpsrlq $32,%zmm16,%zmm7 + vpermd %zmm22,%zmm9,%zmm22 + vmovdqu64 %zmm17,0(%rsp,%rax,1){%k2} + vpsrlq $32,%zmm17,%zmm8 + vpermd %zmm19,%zmm9,%zmm19 + vmovdqa64 %zmm21,64(%rsp){%k2} + vpermd %zmm23,%zmm9,%zmm23 + vpermd %zmm20,%zmm9,%zmm20 + vmovdqu64 %zmm18,64(%rsp,%rax,1){%k2} + vpermd %zmm24,%zmm9,%zmm24 + vmovdqa64 %zmm22,128(%rsp){%k2} + vmovdqu64 %zmm19,128(%rsp,%rax,1){%k2} + vmovdqa64 %zmm23,192(%rsp){%k2} + vmovdqu64 %zmm20,192(%rsp,%rax,1){%k2} + vmovdqa64 %zmm24,256(%rsp){%k2} + + vpmuludq %zmm7,%zmm16,%zmm11 + vpmuludq %zmm7,%zmm17,%zmm12 + vpmuludq %zmm7,%zmm18,%zmm13 + vpmuludq %zmm7,%zmm19,%zmm14 + vpmuludq %zmm7,%zmm20,%zmm15 + vpsrlq $32,%zmm18,%zmm9 + + vpmuludq %zmm8,%zmm24,%zmm25 + vpmuludq %zmm8,%zmm16,%zmm26 + vpmuludq %zmm8,%zmm17,%zmm27 + vpmuludq %zmm8,%zmm18,%zmm28 + vpmuludq %zmm8,%zmm19,%zmm29 + vpsrlq $32,%zmm19,%zmm10 + vpaddq %zmm25,%zmm11,%zmm11 + vpaddq %zmm26,%zmm12,%zmm12 + vpaddq %zmm27,%zmm13,%zmm13 + vpaddq %zmm28,%zmm14,%zmm14 + vpaddq %zmm29,%zmm15,%zmm15 + + vpmuludq %zmm9,%zmm23,%zmm25 + vpmuludq %zmm9,%zmm24,%zmm26 + vpmuludq %zmm9,%zmm17,%zmm28 + vpmuludq %zmm9,%zmm18,%zmm29 + vpmuludq %zmm9,%zmm16,%zmm27 + vpsrlq $32,%zmm20,%zmm6 + vpaddq %zmm25,%zmm11,%zmm11 + vpaddq %zmm26,%zmm12,%zmm12 + vpaddq %zmm28,%zmm14,%zmm14 + vpaddq %zmm29,%zmm15,%zmm15 + vpaddq %zmm27,%zmm13,%zmm13 + + vpmuludq %zmm10,%zmm22,%zmm25 + vpmuludq %zmm10,%zmm16,%zmm28 + vpmuludq %zmm10,%zmm17,%zmm29 + vpmuludq %zmm10,%zmm23,%zmm26 + vpmuludq %zmm10,%zmm24,%zmm27 + vpaddq %zmm25,%zmm11,%zmm11 + vpaddq %zmm28,%zmm14,%zmm14 + vpaddq %zmm29,%zmm15,%zmm15 + vpaddq %zmm26,%zmm12,%zmm12 + vpaddq %zmm27,%zmm13,%zmm13 vpmuludq %zmm6,%zmm24,%zmm28 vpmuludq %zmm6,%zmm16,%zmm29 @@ -2048,15 +2419,10 @@ poly1305_blocks_avx512: vpaddq %zmm26,%zmm12,%zmm12 vpaddq %zmm27,%zmm13,%zmm13 - - vmovdqu64 0(%rsi),%zmm10 vmovdqu64 64(%rsi),%zmm6 leaq 128(%rsi),%rsi - - - vpsrlq $26,%zmm14,%zmm28 vpandq %zmm5,%zmm14,%zmm14 vpaddq %zmm28,%zmm15,%zmm15 @@ -2088,18 +2454,9 @@ poly1305_blocks_avx512: vpandq %zmm5,%zmm14,%zmm14 vpaddq 
%zmm28,%zmm15,%zmm15 - - - - vpunpcklqdq %zmm6,%zmm10,%zmm7 vpunpckhqdq %zmm6,%zmm10,%zmm6 - - - - - vmovdqa32 128(%rcx),%zmm25 movl $0x7777,%eax kmovw %eax,%k1 @@ -2136,9 +2493,6 @@ poly1305_blocks_avx512: vpandq %zmm5,%zmm9,%zmm9 vpandq %zmm5,%zmm7,%zmm7 - - - vpaddq %zmm2,%zmm9,%zmm2 subq $192,%rdx jbe .Ltail_avx512 @@ -2147,33 +2501,6 @@ poly1305_blocks_avx512: .align 32 .Loop_avx512: - - - - - - - - - - - - - - - - - - - - - - - - - - - vpmuludq %zmm2,%zmm17,%zmm14 vpaddq %zmm0,%zmm7,%zmm0 vpmuludq %zmm2,%zmm18,%zmm15 @@ -2238,9 +2565,6 @@ poly1305_blocks_avx512: vpaddq %zmm26,%zmm12,%zmm1 vpaddq %zmm27,%zmm13,%zmm2 - - - vpsrlq $52,%zmm7,%zmm9 vpsllq $12,%zmm6,%zmm10 @@ -2288,18 +2612,11 @@ poly1305_blocks_avx512: vpandq %zmm5,%zmm7,%zmm7 - - - subq $128,%rdx ja .Loop_avx512 .Ltail_avx512: - - - - vpsrlq $32,%zmm16,%zmm16 vpsrlq $32,%zmm17,%zmm17 vpsrlq $32,%zmm18,%zmm18 @@ -2310,11 +2627,8 @@ poly1305_blocks_avx512: vpsrlq $32,%zmm21,%zmm21 vpsrlq $32,%zmm22,%zmm22 - - leaq (%rsi,%rdx,1),%rsi - vpaddq %zmm0,%zmm7,%zmm0 vpmuludq %zmm2,%zmm17,%zmm14 @@ -2378,9 +2692,6 @@ poly1305_blocks_avx512: vpaddq %zmm26,%zmm12,%zmm1 vpaddq %zmm27,%zmm13,%zmm2 - - - movl $1,%eax vpermq $0xb1,%zmm3,%zmm14 vpermq $0xb1,%zmm15,%zmm4 @@ -2416,8 +2727,6 @@ poly1305_blocks_avx512: vpaddq %zmm12,%zmm1,%zmm1{%k3}{z} vpaddq %zmm13,%zmm2,%zmm2{%k3}{z} - - vpsrlq $26,%ymm3,%ymm14 vpand %ymm5,%ymm3,%ymm3 vpsrldq $6,%ymm7,%ymm9 @@ -2466,7 +2775,7 @@ poly1305_blocks_avx512: leaq 144(%rsp),%rax addq $64,%rdx - jnz .Ltail_avx2 + jnz .Ltail_avx2_512 vpsubq %ymm9,%ymm2,%ymm2 vmovd %xmm0,-112(%rdi) @@ -2475,1091 +2784,9 @@ poly1305_blocks_avx512: vmovd %xmm3,-100(%rdi) vmovd %xmm4,-96(%rdi) vzeroall - leaq 8(%r11),%rsp -.cfi_def_cfa %rsp,8 - .byte 0xf3,0xc3 -.cfi_endproc -.size poly1305_blocks_avx512,.-poly1305_blocks_avx512 -.type poly1305_init_base2_44,@function -.align 32 -poly1305_init_base2_44: - xorq %rax,%rax - movq %rax,0(%rdi) - movq %rax,8(%rdi) - movq %rax,16(%rdi) - -.Linit_base2_44: - leaq poly1305_blocks_vpmadd52(%rip),%r10 - leaq poly1305_emit_base2_44(%rip),%r11 - - movq $0x0ffffffc0fffffff,%rax - movq $0x0ffffffc0ffffffc,%rcx - andq 0(%rsi),%rax - movq $0x00000fffffffffff,%r8 - andq 8(%rsi),%rcx - movq $0x00000fffffffffff,%r9 - andq %rax,%r8 - shrdq $44,%rcx,%rax - movq %r8,40(%rdi) - andq %r9,%rax - shrq $24,%rcx - movq %rax,48(%rdi) - leaq (%rax,%rax,4),%rax - movq %rcx,56(%rdi) - shlq $2,%rax - leaq (%rcx,%rcx,4),%rcx - shlq $2,%rcx - movq %rax,24(%rdi) - movq %rcx,32(%rdi) - movq $-1,64(%rdi) - movq %r10,0(%rdx) - movq %r11,8(%rdx) - movl $1,%eax - .byte 0xf3,0xc3 -.size poly1305_init_base2_44,.-poly1305_init_base2_44 -.type poly1305_blocks_vpmadd52,@function -.align 32 -poly1305_blocks_vpmadd52: - shrq $4,%rdx - jz .Lno_data_vpmadd52 - - shlq $40,%rcx - movq 64(%rdi),%r8 - - - - - - - movq $3,%rax - movq $1,%r10 - cmpq $4,%rdx - cmovaeq %r10,%rax - testq %r8,%r8 - cmovnsq %r10,%rax - - andq %rdx,%rax - jz .Lblocks_vpmadd52_4x - - subq %rax,%rdx - movl $7,%r10d - movl $1,%r11d - kmovw %r10d,%k7 - leaq .L2_44_inp_permd(%rip),%r10 - kmovw %r11d,%k1 - - vmovq %rcx,%xmm21 - vmovdqa64 0(%r10),%ymm19 - vmovdqa64 32(%r10),%ymm20 - vpermq $0xcf,%ymm21,%ymm21 - vmovdqa64 64(%r10),%ymm22 - - vmovdqu64 0(%rdi),%ymm16{%k7}{z} - vmovdqu64 40(%rdi),%ymm3{%k7}{z} - vmovdqu64 32(%rdi),%ymm4{%k7}{z} - vmovdqu64 24(%rdi),%ymm5{%k7}{z} - - vmovdqa64 96(%r10),%ymm23 - vmovdqa64 128(%r10),%ymm24 - - jmp .Loop_vpmadd52 - -.align 32 -.Loop_vpmadd52: - vmovdqu32 0(%rsi),%xmm18 - leaq 16(%rsi),%rsi - - vpermd 
%ymm18,%ymm19,%ymm18 - vpsrlvq %ymm20,%ymm18,%ymm18 - vpandq %ymm22,%ymm18,%ymm18 - vporq %ymm21,%ymm18,%ymm18 - - vpaddq %ymm18,%ymm16,%ymm16 - - vpermq $0,%ymm16,%ymm0{%k7}{z} - vpermq $85,%ymm16,%ymm1{%k7}{z} - vpermq $170,%ymm16,%ymm2{%k7}{z} - - vpxord %ymm16,%ymm16,%ymm16 - vpxord %ymm17,%ymm17,%ymm17 - - vpmadd52luq %ymm3,%ymm0,%ymm16 - vpmadd52huq %ymm3,%ymm0,%ymm17 - - vpmadd52luq %ymm4,%ymm1,%ymm16 - vpmadd52huq %ymm4,%ymm1,%ymm17 - - vpmadd52luq %ymm5,%ymm2,%ymm16 - vpmadd52huq %ymm5,%ymm2,%ymm17 - - vpsrlvq %ymm23,%ymm16,%ymm18 - vpsllvq %ymm24,%ymm17,%ymm17 - vpandq %ymm22,%ymm16,%ymm16 - - vpaddq %ymm18,%ymm17,%ymm17 - - vpermq $147,%ymm17,%ymm17 - - vpaddq %ymm17,%ymm16,%ymm16 - - vpsrlvq %ymm23,%ymm16,%ymm18 - vpandq %ymm22,%ymm16,%ymm16 - - vpermq $147,%ymm18,%ymm18 - - vpaddq %ymm18,%ymm16,%ymm16 - - vpermq $147,%ymm16,%ymm18{%k1}{z} - - vpaddq %ymm18,%ymm16,%ymm16 - vpsllq $2,%ymm18,%ymm18 - - vpaddq %ymm18,%ymm16,%ymm16 - - decq %rax - jnz .Loop_vpmadd52 - - vmovdqu64 %ymm16,0(%rdi){%k7} - - testq %rdx,%rdx - jnz .Lblocks_vpmadd52_4x - -.Lno_data_vpmadd52: - .byte 0xf3,0xc3 -.size poly1305_blocks_vpmadd52,.-poly1305_blocks_vpmadd52 -.type poly1305_blocks_vpmadd52_4x,@function -.align 32 -poly1305_blocks_vpmadd52_4x: - shrq $4,%rdx - jz .Lno_data_vpmadd52_4x - - shlq $40,%rcx - movq 64(%rdi),%r8 - -.Lblocks_vpmadd52_4x: - vpbroadcastq %rcx,%ymm31 - - vmovdqa64 .Lx_mask44(%rip),%ymm28 - movl $5,%eax - vmovdqa64 .Lx_mask42(%rip),%ymm29 - kmovw %eax,%k1 - - testq %r8,%r8 - js .Linit_vpmadd52 - - vmovq 0(%rdi),%xmm0 - vmovq 8(%rdi),%xmm1 - vmovq 16(%rdi),%xmm2 - - testq $3,%rdx - jnz .Lblocks_vpmadd52_2x_do - -.Lblocks_vpmadd52_4x_do: - vpbroadcastq 64(%rdi),%ymm3 - vpbroadcastq 96(%rdi),%ymm4 - vpbroadcastq 128(%rdi),%ymm5 - vpbroadcastq 160(%rdi),%ymm16 - -.Lblocks_vpmadd52_4x_key_loaded: - vpsllq $2,%ymm5,%ymm17 - vpaddq %ymm5,%ymm17,%ymm17 - vpsllq $2,%ymm17,%ymm17 - - testq $7,%rdx - jz .Lblocks_vpmadd52_8x - - vmovdqu64 0(%rsi),%ymm26 - vmovdqu64 32(%rsi),%ymm27 - leaq 64(%rsi),%rsi - - vpunpcklqdq %ymm27,%ymm26,%ymm25 - vpunpckhqdq %ymm27,%ymm26,%ymm27 - - - - vpsrlq $24,%ymm27,%ymm26 - vporq %ymm31,%ymm26,%ymm26 - vpaddq %ymm26,%ymm2,%ymm2 - vpandq %ymm28,%ymm25,%ymm24 - vpsrlq $44,%ymm25,%ymm25 - vpsllq $20,%ymm27,%ymm27 - vporq %ymm27,%ymm25,%ymm25 - vpandq %ymm28,%ymm25,%ymm25 - - subq $4,%rdx - jz .Ltail_vpmadd52_4x - jmp .Loop_vpmadd52_4x - ud2 - -.align 32 -.Linit_vpmadd52: - vmovq 24(%rdi),%xmm16 - vmovq 56(%rdi),%xmm2 - vmovq 32(%rdi),%xmm17 - vmovq 40(%rdi),%xmm3 - vmovq 48(%rdi),%xmm4 - - vmovdqa %ymm3,%ymm0 - vmovdqa %ymm4,%ymm1 - vmovdqa %ymm2,%ymm5 - - movl $2,%eax - -.Lmul_init_vpmadd52: - vpxorq %ymm18,%ymm18,%ymm18 - vpmadd52luq %ymm2,%ymm16,%ymm18 - vpxorq %ymm19,%ymm19,%ymm19 - vpmadd52huq %ymm2,%ymm16,%ymm19 - vpxorq %ymm20,%ymm20,%ymm20 - vpmadd52luq %ymm2,%ymm17,%ymm20 - vpxorq %ymm21,%ymm21,%ymm21 - vpmadd52huq %ymm2,%ymm17,%ymm21 - vpxorq %ymm22,%ymm22,%ymm22 - vpmadd52luq %ymm2,%ymm3,%ymm22 - vpxorq %ymm23,%ymm23,%ymm23 - vpmadd52huq %ymm2,%ymm3,%ymm23 - - vpmadd52luq %ymm0,%ymm3,%ymm18 - vpmadd52huq %ymm0,%ymm3,%ymm19 - vpmadd52luq %ymm0,%ymm4,%ymm20 - vpmadd52huq %ymm0,%ymm4,%ymm21 - vpmadd52luq %ymm0,%ymm5,%ymm22 - vpmadd52huq %ymm0,%ymm5,%ymm23 - - vpmadd52luq %ymm1,%ymm17,%ymm18 - vpmadd52huq %ymm1,%ymm17,%ymm19 - vpmadd52luq %ymm1,%ymm3,%ymm20 - vpmadd52huq %ymm1,%ymm3,%ymm21 - vpmadd52luq %ymm1,%ymm4,%ymm22 - vpmadd52huq %ymm1,%ymm4,%ymm23 - - - - vpsrlq $44,%ymm18,%ymm30 - vpsllq $8,%ymm19,%ymm19 - vpandq %ymm28,%ymm18,%ymm0 - vpaddq 
%ymm30,%ymm19,%ymm19 - - vpaddq %ymm19,%ymm20,%ymm20 - - vpsrlq $44,%ymm20,%ymm30 - vpsllq $8,%ymm21,%ymm21 - vpandq %ymm28,%ymm20,%ymm1 - vpaddq %ymm30,%ymm21,%ymm21 - - vpaddq %ymm21,%ymm22,%ymm22 - - vpsrlq $42,%ymm22,%ymm30 - vpsllq $10,%ymm23,%ymm23 - vpandq %ymm29,%ymm22,%ymm2 - vpaddq %ymm30,%ymm23,%ymm23 - - vpaddq %ymm23,%ymm0,%ymm0 - vpsllq $2,%ymm23,%ymm23 - - vpaddq %ymm23,%ymm0,%ymm0 - - vpsrlq $44,%ymm0,%ymm30 - vpandq %ymm28,%ymm0,%ymm0 - - vpaddq %ymm30,%ymm1,%ymm1 - - decl %eax - jz .Ldone_init_vpmadd52 - - vpunpcklqdq %ymm4,%ymm1,%ymm4 - vpbroadcastq %xmm1,%xmm1 - vpunpcklqdq %ymm5,%ymm2,%ymm5 - vpbroadcastq %xmm2,%xmm2 - vpunpcklqdq %ymm3,%ymm0,%ymm3 - vpbroadcastq %xmm0,%xmm0 - - vpsllq $2,%ymm4,%ymm16 - vpsllq $2,%ymm5,%ymm17 - vpaddq %ymm4,%ymm16,%ymm16 - vpaddq %ymm5,%ymm17,%ymm17 - vpsllq $2,%ymm16,%ymm16 - vpsllq $2,%ymm17,%ymm17 - - jmp .Lmul_init_vpmadd52 - ud2 - -.align 32 -.Ldone_init_vpmadd52: - vinserti128 $1,%xmm4,%ymm1,%ymm4 - vinserti128 $1,%xmm5,%ymm2,%ymm5 - vinserti128 $1,%xmm3,%ymm0,%ymm3 - - vpermq $216,%ymm4,%ymm4 - vpermq $216,%ymm5,%ymm5 - vpermq $216,%ymm3,%ymm3 - - vpsllq $2,%ymm4,%ymm16 - vpaddq %ymm4,%ymm16,%ymm16 - vpsllq $2,%ymm16,%ymm16 - - vmovq 0(%rdi),%xmm0 - vmovq 8(%rdi),%xmm1 - vmovq 16(%rdi),%xmm2 - - testq $3,%rdx - jnz .Ldone_init_vpmadd52_2x - - vmovdqu64 %ymm3,64(%rdi) - vpbroadcastq %xmm3,%ymm3 - vmovdqu64 %ymm4,96(%rdi) - vpbroadcastq %xmm4,%ymm4 - vmovdqu64 %ymm5,128(%rdi) - vpbroadcastq %xmm5,%ymm5 - vmovdqu64 %ymm16,160(%rdi) - vpbroadcastq %xmm16,%ymm16 - - jmp .Lblocks_vpmadd52_4x_key_loaded - ud2 - -.align 32 -.Ldone_init_vpmadd52_2x: - vmovdqu64 %ymm3,64(%rdi) - vpsrldq $8,%ymm3,%ymm3 - vmovdqu64 %ymm4,96(%rdi) - vpsrldq $8,%ymm4,%ymm4 - vmovdqu64 %ymm5,128(%rdi) - vpsrldq $8,%ymm5,%ymm5 - vmovdqu64 %ymm16,160(%rdi) - vpsrldq $8,%ymm16,%ymm16 - jmp .Lblocks_vpmadd52_2x_key_loaded - ud2 - -.align 32 -.Lblocks_vpmadd52_2x_do: - vmovdqu64 128+8(%rdi),%ymm5{%k1}{z} - vmovdqu64 160+8(%rdi),%ymm16{%k1}{z} - vmovdqu64 64+8(%rdi),%ymm3{%k1}{z} - vmovdqu64 96+8(%rdi),%ymm4{%k1}{z} - -.Lblocks_vpmadd52_2x_key_loaded: - vmovdqu64 0(%rsi),%ymm26 - vpxorq %ymm27,%ymm27,%ymm27 - leaq 32(%rsi),%rsi - - vpunpcklqdq %ymm27,%ymm26,%ymm25 - vpunpckhqdq %ymm27,%ymm26,%ymm27 - - - - vpsrlq $24,%ymm27,%ymm26 - vporq %ymm31,%ymm26,%ymm26 - vpaddq %ymm26,%ymm2,%ymm2 - vpandq %ymm28,%ymm25,%ymm24 - vpsrlq $44,%ymm25,%ymm25 - vpsllq $20,%ymm27,%ymm27 - vporq %ymm27,%ymm25,%ymm25 - vpandq %ymm28,%ymm25,%ymm25 - - jmp .Ltail_vpmadd52_2x - ud2 - -.align 32 -.Loop_vpmadd52_4x: - - vpaddq %ymm24,%ymm0,%ymm0 - vpaddq %ymm25,%ymm1,%ymm1 - - vpxorq %ymm18,%ymm18,%ymm18 - vpmadd52luq %ymm2,%ymm16,%ymm18 - vpxorq %ymm19,%ymm19,%ymm19 - vpmadd52huq %ymm2,%ymm16,%ymm19 - vpxorq %ymm20,%ymm20,%ymm20 - vpmadd52luq %ymm2,%ymm17,%ymm20 - vpxorq %ymm21,%ymm21,%ymm21 - vpmadd52huq %ymm2,%ymm17,%ymm21 - vpxorq %ymm22,%ymm22,%ymm22 - vpmadd52luq %ymm2,%ymm3,%ymm22 - vpxorq %ymm23,%ymm23,%ymm23 - vpmadd52huq %ymm2,%ymm3,%ymm23 - - vmovdqu64 0(%rsi),%ymm26 - vmovdqu64 32(%rsi),%ymm27 - leaq 64(%rsi),%rsi - vpmadd52luq %ymm0,%ymm3,%ymm18 - vpmadd52huq %ymm0,%ymm3,%ymm19 - vpmadd52luq %ymm0,%ymm4,%ymm20 - vpmadd52huq %ymm0,%ymm4,%ymm21 - vpmadd52luq %ymm0,%ymm5,%ymm22 - vpmadd52huq %ymm0,%ymm5,%ymm23 - - vpunpcklqdq %ymm27,%ymm26,%ymm25 - vpunpckhqdq %ymm27,%ymm26,%ymm27 - vpmadd52luq %ymm1,%ymm17,%ymm18 - vpmadd52huq %ymm1,%ymm17,%ymm19 - vpmadd52luq %ymm1,%ymm3,%ymm20 - vpmadd52huq %ymm1,%ymm3,%ymm21 - vpmadd52luq %ymm1,%ymm4,%ymm22 - vpmadd52huq %ymm1,%ymm4,%ymm23 - - 
- - vpsrlq $44,%ymm18,%ymm30 - vpsllq $8,%ymm19,%ymm19 - vpandq %ymm28,%ymm18,%ymm0 - vpaddq %ymm30,%ymm19,%ymm19 - - vpsrlq $24,%ymm27,%ymm26 - vporq %ymm31,%ymm26,%ymm26 - vpaddq %ymm19,%ymm20,%ymm20 - - vpsrlq $44,%ymm20,%ymm30 - vpsllq $8,%ymm21,%ymm21 - vpandq %ymm28,%ymm20,%ymm1 - vpaddq %ymm30,%ymm21,%ymm21 - - vpandq %ymm28,%ymm25,%ymm24 - vpsrlq $44,%ymm25,%ymm25 - vpsllq $20,%ymm27,%ymm27 - vpaddq %ymm21,%ymm22,%ymm22 - - vpsrlq $42,%ymm22,%ymm30 - vpsllq $10,%ymm23,%ymm23 - vpandq %ymm29,%ymm22,%ymm2 - vpaddq %ymm30,%ymm23,%ymm23 - - vpaddq %ymm26,%ymm2,%ymm2 - vpaddq %ymm23,%ymm0,%ymm0 - vpsllq $2,%ymm23,%ymm23 - - vpaddq %ymm23,%ymm0,%ymm0 - vporq %ymm27,%ymm25,%ymm25 - vpandq %ymm28,%ymm25,%ymm25 - - vpsrlq $44,%ymm0,%ymm30 - vpandq %ymm28,%ymm0,%ymm0 - - vpaddq %ymm30,%ymm1,%ymm1 - - subq $4,%rdx - jnz .Loop_vpmadd52_4x - -.Ltail_vpmadd52_4x: - vmovdqu64 128(%rdi),%ymm5 - vmovdqu64 160(%rdi),%ymm16 - vmovdqu64 64(%rdi),%ymm3 - vmovdqu64 96(%rdi),%ymm4 - -.Ltail_vpmadd52_2x: - vpsllq $2,%ymm5,%ymm17 - vpaddq %ymm5,%ymm17,%ymm17 - vpsllq $2,%ymm17,%ymm17 - - - vpaddq %ymm24,%ymm0,%ymm0 - vpaddq %ymm25,%ymm1,%ymm1 - - vpxorq %ymm18,%ymm18,%ymm18 - vpmadd52luq %ymm2,%ymm16,%ymm18 - vpxorq %ymm19,%ymm19,%ymm19 - vpmadd52huq %ymm2,%ymm16,%ymm19 - vpxorq %ymm20,%ymm20,%ymm20 - vpmadd52luq %ymm2,%ymm17,%ymm20 - vpxorq %ymm21,%ymm21,%ymm21 - vpmadd52huq %ymm2,%ymm17,%ymm21 - vpxorq %ymm22,%ymm22,%ymm22 - vpmadd52luq %ymm2,%ymm3,%ymm22 - vpxorq %ymm23,%ymm23,%ymm23 - vpmadd52huq %ymm2,%ymm3,%ymm23 - - vpmadd52luq %ymm0,%ymm3,%ymm18 - vpmadd52huq %ymm0,%ymm3,%ymm19 - vpmadd52luq %ymm0,%ymm4,%ymm20 - vpmadd52huq %ymm0,%ymm4,%ymm21 - vpmadd52luq %ymm0,%ymm5,%ymm22 - vpmadd52huq %ymm0,%ymm5,%ymm23 - - vpmadd52luq %ymm1,%ymm17,%ymm18 - vpmadd52huq %ymm1,%ymm17,%ymm19 - vpmadd52luq %ymm1,%ymm3,%ymm20 - vpmadd52huq %ymm1,%ymm3,%ymm21 - vpmadd52luq %ymm1,%ymm4,%ymm22 - vpmadd52huq %ymm1,%ymm4,%ymm23 - - - - - movl $1,%eax - kmovw %eax,%k1 - vpsrldq $8,%ymm18,%ymm24 - vpsrldq $8,%ymm19,%ymm0 - vpsrldq $8,%ymm20,%ymm25 - vpsrldq $8,%ymm21,%ymm1 - vpaddq %ymm24,%ymm18,%ymm18 - vpaddq %ymm0,%ymm19,%ymm19 - vpsrldq $8,%ymm22,%ymm26 - vpsrldq $8,%ymm23,%ymm2 - vpaddq %ymm25,%ymm20,%ymm20 - vpaddq %ymm1,%ymm21,%ymm21 - vpermq $0x2,%ymm18,%ymm24 - vpermq $0x2,%ymm19,%ymm0 - vpaddq %ymm26,%ymm22,%ymm22 - vpaddq %ymm2,%ymm23,%ymm23 - - vpermq $0x2,%ymm20,%ymm25 - vpermq $0x2,%ymm21,%ymm1 - vpaddq %ymm24,%ymm18,%ymm18{%k1}{z} - vpaddq %ymm0,%ymm19,%ymm19{%k1}{z} - vpermq $0x2,%ymm22,%ymm26 - vpermq $0x2,%ymm23,%ymm2 - vpaddq %ymm25,%ymm20,%ymm20{%k1}{z} - vpaddq %ymm1,%ymm21,%ymm21{%k1}{z} - vpaddq %ymm26,%ymm22,%ymm22{%k1}{z} - vpaddq %ymm2,%ymm23,%ymm23{%k1}{z} - - - - vpsrlq $44,%ymm18,%ymm30 - vpsllq $8,%ymm19,%ymm19 - vpandq %ymm28,%ymm18,%ymm0 - vpaddq %ymm30,%ymm19,%ymm19 - - vpaddq %ymm19,%ymm20,%ymm20 - - vpsrlq $44,%ymm20,%ymm30 - vpsllq $8,%ymm21,%ymm21 - vpandq %ymm28,%ymm20,%ymm1 - vpaddq %ymm30,%ymm21,%ymm21 - - vpaddq %ymm21,%ymm22,%ymm22 - - vpsrlq $42,%ymm22,%ymm30 - vpsllq $10,%ymm23,%ymm23 - vpandq %ymm29,%ymm22,%ymm2 - vpaddq %ymm30,%ymm23,%ymm23 - - vpaddq %ymm23,%ymm0,%ymm0 - vpsllq $2,%ymm23,%ymm23 - - vpaddq %ymm23,%ymm0,%ymm0 - - vpsrlq $44,%ymm0,%ymm30 - vpandq %ymm28,%ymm0,%ymm0 - - vpaddq %ymm30,%ymm1,%ymm1 - - - subq $2,%rdx - ja .Lblocks_vpmadd52_4x_do - - vmovq %xmm0,0(%rdi) - vmovq %xmm1,8(%rdi) - vmovq %xmm2,16(%rdi) - vzeroall - -.Lno_data_vpmadd52_4x: - .byte 0xf3,0xc3 -.size poly1305_blocks_vpmadd52_4x,.-poly1305_blocks_vpmadd52_4x -.type 
poly1305_blocks_vpmadd52_8x,@function -.align 32 -poly1305_blocks_vpmadd52_8x: - shrq $4,%rdx - jz .Lno_data_vpmadd52_8x - - shlq $40,%rcx - movq 64(%rdi),%r8 - - vmovdqa64 .Lx_mask44(%rip),%ymm28 - vmovdqa64 .Lx_mask42(%rip),%ymm29 - - testq %r8,%r8 - js .Linit_vpmadd52 - - vmovq 0(%rdi),%xmm0 - vmovq 8(%rdi),%xmm1 - vmovq 16(%rdi),%xmm2 - -.Lblocks_vpmadd52_8x: - - - - vmovdqu64 128(%rdi),%ymm5 - vmovdqu64 160(%rdi),%ymm16 - vmovdqu64 64(%rdi),%ymm3 - vmovdqu64 96(%rdi),%ymm4 - - vpsllq $2,%ymm5,%ymm17 - vpaddq %ymm5,%ymm17,%ymm17 - vpsllq $2,%ymm17,%ymm17 - - vpbroadcastq %xmm5,%ymm8 - vpbroadcastq %xmm3,%ymm6 - vpbroadcastq %xmm4,%ymm7 - - vpxorq %ymm18,%ymm18,%ymm18 - vpmadd52luq %ymm8,%ymm16,%ymm18 - vpxorq %ymm19,%ymm19,%ymm19 - vpmadd52huq %ymm8,%ymm16,%ymm19 - vpxorq %ymm20,%ymm20,%ymm20 - vpmadd52luq %ymm8,%ymm17,%ymm20 - vpxorq %ymm21,%ymm21,%ymm21 - vpmadd52huq %ymm8,%ymm17,%ymm21 - vpxorq %ymm22,%ymm22,%ymm22 - vpmadd52luq %ymm8,%ymm3,%ymm22 - vpxorq %ymm23,%ymm23,%ymm23 - vpmadd52huq %ymm8,%ymm3,%ymm23 - - vpmadd52luq %ymm6,%ymm3,%ymm18 - vpmadd52huq %ymm6,%ymm3,%ymm19 - vpmadd52luq %ymm6,%ymm4,%ymm20 - vpmadd52huq %ymm6,%ymm4,%ymm21 - vpmadd52luq %ymm6,%ymm5,%ymm22 - vpmadd52huq %ymm6,%ymm5,%ymm23 - - vpmadd52luq %ymm7,%ymm17,%ymm18 - vpmadd52huq %ymm7,%ymm17,%ymm19 - vpmadd52luq %ymm7,%ymm3,%ymm20 - vpmadd52huq %ymm7,%ymm3,%ymm21 - vpmadd52luq %ymm7,%ymm4,%ymm22 - vpmadd52huq %ymm7,%ymm4,%ymm23 - - - - vpsrlq $44,%ymm18,%ymm30 - vpsllq $8,%ymm19,%ymm19 - vpandq %ymm28,%ymm18,%ymm6 - vpaddq %ymm30,%ymm19,%ymm19 - - vpaddq %ymm19,%ymm20,%ymm20 - - vpsrlq $44,%ymm20,%ymm30 - vpsllq $8,%ymm21,%ymm21 - vpandq %ymm28,%ymm20,%ymm7 - vpaddq %ymm30,%ymm21,%ymm21 - - vpaddq %ymm21,%ymm22,%ymm22 - - vpsrlq $42,%ymm22,%ymm30 - vpsllq $10,%ymm23,%ymm23 - vpandq %ymm29,%ymm22,%ymm8 - vpaddq %ymm30,%ymm23,%ymm23 - - vpaddq %ymm23,%ymm6,%ymm6 - vpsllq $2,%ymm23,%ymm23 - - vpaddq %ymm23,%ymm6,%ymm6 - - vpsrlq $44,%ymm6,%ymm30 - vpandq %ymm28,%ymm6,%ymm6 - - vpaddq %ymm30,%ymm7,%ymm7 - - - - - - vpunpcklqdq %ymm5,%ymm8,%ymm26 - vpunpckhqdq %ymm5,%ymm8,%ymm5 - vpunpcklqdq %ymm3,%ymm6,%ymm24 - vpunpckhqdq %ymm3,%ymm6,%ymm3 - vpunpcklqdq %ymm4,%ymm7,%ymm25 - vpunpckhqdq %ymm4,%ymm7,%ymm4 - vshufi64x2 $0x44,%zmm5,%zmm26,%zmm8 - vshufi64x2 $0x44,%zmm3,%zmm24,%zmm6 - vshufi64x2 $0x44,%zmm4,%zmm25,%zmm7 - - vmovdqu64 0(%rsi),%zmm26 - vmovdqu64 64(%rsi),%zmm27 - leaq 128(%rsi),%rsi - - vpsllq $2,%zmm8,%zmm10 - vpsllq $2,%zmm7,%zmm9 - vpaddq %zmm8,%zmm10,%zmm10 - vpaddq %zmm7,%zmm9,%zmm9 - vpsllq $2,%zmm10,%zmm10 - vpsllq $2,%zmm9,%zmm9 - - vpbroadcastq %rcx,%zmm31 - vpbroadcastq %xmm28,%zmm28 - vpbroadcastq %xmm29,%zmm29 - - vpbroadcastq %xmm9,%zmm16 - vpbroadcastq %xmm10,%zmm17 - vpbroadcastq %xmm6,%zmm3 - vpbroadcastq %xmm7,%zmm4 - vpbroadcastq %xmm8,%zmm5 - - vpunpcklqdq %zmm27,%zmm26,%zmm25 - vpunpckhqdq %zmm27,%zmm26,%zmm27 - - - - vpsrlq $24,%zmm27,%zmm26 - vporq %zmm31,%zmm26,%zmm26 - vpaddq %zmm26,%zmm2,%zmm2 - vpandq %zmm28,%zmm25,%zmm24 - vpsrlq $44,%zmm25,%zmm25 - vpsllq $20,%zmm27,%zmm27 - vporq %zmm27,%zmm25,%zmm25 - vpandq %zmm28,%zmm25,%zmm25 - - subq $8,%rdx - jz .Ltail_vpmadd52_8x - jmp .Loop_vpmadd52_8x - -.align 32 -.Loop_vpmadd52_8x: - - vpaddq %zmm24,%zmm0,%zmm0 - vpaddq %zmm25,%zmm1,%zmm1 - - vpxorq %zmm18,%zmm18,%zmm18 - vpmadd52luq %zmm2,%zmm16,%zmm18 - vpxorq %zmm19,%zmm19,%zmm19 - vpmadd52huq %zmm2,%zmm16,%zmm19 - vpxorq %zmm20,%zmm20,%zmm20 - vpmadd52luq %zmm2,%zmm17,%zmm20 - vpxorq %zmm21,%zmm21,%zmm21 - vpmadd52huq %zmm2,%zmm17,%zmm21 - vpxorq %zmm22,%zmm22,%zmm22 - 
vpmadd52luq %zmm2,%zmm3,%zmm22 - vpxorq %zmm23,%zmm23,%zmm23 - vpmadd52huq %zmm2,%zmm3,%zmm23 - - vmovdqu64 0(%rsi),%zmm26 - vmovdqu64 64(%rsi),%zmm27 - leaq 128(%rsi),%rsi - vpmadd52luq %zmm0,%zmm3,%zmm18 - vpmadd52huq %zmm0,%zmm3,%zmm19 - vpmadd52luq %zmm0,%zmm4,%zmm20 - vpmadd52huq %zmm0,%zmm4,%zmm21 - vpmadd52luq %zmm0,%zmm5,%zmm22 - vpmadd52huq %zmm0,%zmm5,%zmm23 - - vpunpcklqdq %zmm27,%zmm26,%zmm25 - vpunpckhqdq %zmm27,%zmm26,%zmm27 - vpmadd52luq %zmm1,%zmm17,%zmm18 - vpmadd52huq %zmm1,%zmm17,%zmm19 - vpmadd52luq %zmm1,%zmm3,%zmm20 - vpmadd52huq %zmm1,%zmm3,%zmm21 - vpmadd52luq %zmm1,%zmm4,%zmm22 - vpmadd52huq %zmm1,%zmm4,%zmm23 - - - - vpsrlq $44,%zmm18,%zmm30 - vpsllq $8,%zmm19,%zmm19 - vpandq %zmm28,%zmm18,%zmm0 - vpaddq %zmm30,%zmm19,%zmm19 - - vpsrlq $24,%zmm27,%zmm26 - vporq %zmm31,%zmm26,%zmm26 - vpaddq %zmm19,%zmm20,%zmm20 - - vpsrlq $44,%zmm20,%zmm30 - vpsllq $8,%zmm21,%zmm21 - vpandq %zmm28,%zmm20,%zmm1 - vpaddq %zmm30,%zmm21,%zmm21 - - vpandq %zmm28,%zmm25,%zmm24 - vpsrlq $44,%zmm25,%zmm25 - vpsllq $20,%zmm27,%zmm27 - vpaddq %zmm21,%zmm22,%zmm22 - - vpsrlq $42,%zmm22,%zmm30 - vpsllq $10,%zmm23,%zmm23 - vpandq %zmm29,%zmm22,%zmm2 - vpaddq %zmm30,%zmm23,%zmm23 - - vpaddq %zmm26,%zmm2,%zmm2 - vpaddq %zmm23,%zmm0,%zmm0 - vpsllq $2,%zmm23,%zmm23 - - vpaddq %zmm23,%zmm0,%zmm0 - vporq %zmm27,%zmm25,%zmm25 - vpandq %zmm28,%zmm25,%zmm25 - - vpsrlq $44,%zmm0,%zmm30 - vpandq %zmm28,%zmm0,%zmm0 - - vpaddq %zmm30,%zmm1,%zmm1 - - subq $8,%rdx - jnz .Loop_vpmadd52_8x - -.Ltail_vpmadd52_8x: - - vpaddq %zmm24,%zmm0,%zmm0 - vpaddq %zmm25,%zmm1,%zmm1 - - vpxorq %zmm18,%zmm18,%zmm18 - vpmadd52luq %zmm2,%zmm9,%zmm18 - vpxorq %zmm19,%zmm19,%zmm19 - vpmadd52huq %zmm2,%zmm9,%zmm19 - vpxorq %zmm20,%zmm20,%zmm20 - vpmadd52luq %zmm2,%zmm10,%zmm20 - vpxorq %zmm21,%zmm21,%zmm21 - vpmadd52huq %zmm2,%zmm10,%zmm21 - vpxorq %zmm22,%zmm22,%zmm22 - vpmadd52luq %zmm2,%zmm6,%zmm22 - vpxorq %zmm23,%zmm23,%zmm23 - vpmadd52huq %zmm2,%zmm6,%zmm23 - - vpmadd52luq %zmm0,%zmm6,%zmm18 - vpmadd52huq %zmm0,%zmm6,%zmm19 - vpmadd52luq %zmm0,%zmm7,%zmm20 - vpmadd52huq %zmm0,%zmm7,%zmm21 - vpmadd52luq %zmm0,%zmm8,%zmm22 - vpmadd52huq %zmm0,%zmm8,%zmm23 - - vpmadd52luq %zmm1,%zmm10,%zmm18 - vpmadd52huq %zmm1,%zmm10,%zmm19 - vpmadd52luq %zmm1,%zmm6,%zmm20 - vpmadd52huq %zmm1,%zmm6,%zmm21 - vpmadd52luq %zmm1,%zmm7,%zmm22 - vpmadd52huq %zmm1,%zmm7,%zmm23 - - - - - movl $1,%eax - kmovw %eax,%k1 - vpsrldq $8,%zmm18,%zmm24 - vpsrldq $8,%zmm19,%zmm0 - vpsrldq $8,%zmm20,%zmm25 - vpsrldq $8,%zmm21,%zmm1 - vpaddq %zmm24,%zmm18,%zmm18 - vpaddq %zmm0,%zmm19,%zmm19 - vpsrldq $8,%zmm22,%zmm26 - vpsrldq $8,%zmm23,%zmm2 - vpaddq %zmm25,%zmm20,%zmm20 - vpaddq %zmm1,%zmm21,%zmm21 - vpermq $0x2,%zmm18,%zmm24 - vpermq $0x2,%zmm19,%zmm0 - vpaddq %zmm26,%zmm22,%zmm22 - vpaddq %zmm2,%zmm23,%zmm23 - - vpermq $0x2,%zmm20,%zmm25 - vpermq $0x2,%zmm21,%zmm1 - vpaddq %zmm24,%zmm18,%zmm18 - vpaddq %zmm0,%zmm19,%zmm19 - vpermq $0x2,%zmm22,%zmm26 - vpermq $0x2,%zmm23,%zmm2 - vpaddq %zmm25,%zmm20,%zmm20 - vpaddq %zmm1,%zmm21,%zmm21 - vextracti64x4 $1,%zmm18,%ymm24 - vextracti64x4 $1,%zmm19,%ymm0 - vpaddq %zmm26,%zmm22,%zmm22 - vpaddq %zmm2,%zmm23,%zmm23 - - vextracti64x4 $1,%zmm20,%ymm25 - vextracti64x4 $1,%zmm21,%ymm1 - vextracti64x4 $1,%zmm22,%ymm26 - vextracti64x4 $1,%zmm23,%ymm2 - vpaddq %ymm24,%ymm18,%ymm18{%k1}{z} - vpaddq %ymm0,%ymm19,%ymm19{%k1}{z} - vpaddq %ymm25,%ymm20,%ymm20{%k1}{z} - vpaddq %ymm1,%ymm21,%ymm21{%k1}{z} - vpaddq %ymm26,%ymm22,%ymm22{%k1}{z} - vpaddq %ymm2,%ymm23,%ymm23{%k1}{z} - - - - vpsrlq $44,%ymm18,%ymm30 - vpsllq 
$8,%ymm19,%ymm19 - vpandq %ymm28,%ymm18,%ymm0 - vpaddq %ymm30,%ymm19,%ymm19 - - vpaddq %ymm19,%ymm20,%ymm20 - - vpsrlq $44,%ymm20,%ymm30 - vpsllq $8,%ymm21,%ymm21 - vpandq %ymm28,%ymm20,%ymm1 - vpaddq %ymm30,%ymm21,%ymm21 - - vpaddq %ymm21,%ymm22,%ymm22 - - vpsrlq $42,%ymm22,%ymm30 - vpsllq $10,%ymm23,%ymm23 - vpandq %ymm29,%ymm22,%ymm2 - vpaddq %ymm30,%ymm23,%ymm23 - - vpaddq %ymm23,%ymm0,%ymm0 - vpsllq $2,%ymm23,%ymm23 - - vpaddq %ymm23,%ymm0,%ymm0 - - vpsrlq $44,%ymm0,%ymm30 - vpandq %ymm28,%ymm0,%ymm0 - - vpaddq %ymm30,%ymm1,%ymm1 - - - - vmovq %xmm0,0(%rdi) - vmovq %xmm1,8(%rdi) - vmovq %xmm2,16(%rdi) - vzeroall - -.Lno_data_vpmadd52_8x: - .byte 0xf3,0xc3 -.size poly1305_blocks_vpmadd52_8x,.-poly1305_blocks_vpmadd52_8x -.type poly1305_emit_base2_44,@function -.align 32 -poly1305_emit_base2_44: - movq 0(%rdi),%r8 - movq 8(%rdi),%r9 - movq 16(%rdi),%r10 - - movq %r9,%rax - shrq $20,%r9 - shlq $44,%rax - movq %r10,%rcx - shrq $40,%r10 - shlq $24,%rcx - - addq %rax,%r8 - adcq %rcx,%r9 - adcq $0,%r10 - - movq %r8,%rax - addq $5,%r8 - movq %r9,%rcx - adcq $0,%r9 - adcq $0,%r10 - shrq $2,%r10 - cmovnzq %r8,%rax - cmovnzq %r9,%rcx - - addq 0(%rdx),%rax - adcq 8(%rdx),%rcx - movq %rax,0(%rsi) - movq %rcx,8(%rsi) - - .byte 0xf3,0xc3 -.size poly1305_emit_base2_44,.-poly1305_emit_base2_44 -.align 64 -.Lconst: -.Lmask24: -.long 0x0ffffff,0,0x0ffffff,0,0x0ffffff,0,0x0ffffff,0 -.L129: -.long 16777216,0,16777216,0,16777216,0,16777216,0 -.Lmask26: -.long 0x3ffffff,0,0x3ffffff,0,0x3ffffff,0,0x3ffffff,0 -.Lpermd_avx2: -.long 2,2,2,3,2,0,2,1 -.Lpermd_avx512: -.long 0,0,0,1, 0,2,0,3, 0,4,0,5, 0,6,0,7 + leaq -8(%r10),%rsp -.L2_44_inp_permd: -.long 0,1,1,2,2,3,7,7 -.L2_44_inp_shift: -.quad 0,12,24,64 -.L2_44_mask: -.quad 0xfffffffffff,0xfffffffffff,0x3ffffffffff,0xffffffffffffffff -.L2_44_shift_rgt: -.quad 44,44,42,64 -.L2_44_shift_lft: -.quad 8,8,10,64 + ret -.align 64 -.Lx_mask44: -.quad 0xfffffffffff,0xfffffffffff,0xfffffffffff,0xfffffffffff -.quad 0xfffffffffff,0xfffffffffff,0xfffffffffff,0xfffffffffff -.Lx_mask42: -.quad 0x3ffffffffff,0x3ffffffffff,0x3ffffffffff,0x3ffffffffff -.quad 0x3ffffffffff,0x3ffffffffff,0x3ffffffffff,0x3ffffffffff -.byte 80,111,108,121,49,51,48,53,32,102,111,114,32,120,56,54,95,54,52,44,32,67,82,89,80,84,79,71,65,77,83,32,98,121,32,60,97,112,112,114,111,64,111,112,101,110,115,115,108,46,111,114,103,62,0 -.align 16 -.globl xor128_encrypt_n_pad -.type xor128_encrypt_n_pad,@function -.align 16 -xor128_encrypt_n_pad: - subq %rdx,%rsi - subq %rdx,%rdi - movq %rcx,%r10 - shrq $4,%rcx - jz .Ltail_enc - nop -.Loop_enc_xmm: - movdqu (%rsi,%rdx,1),%xmm0 - pxor (%rdx),%xmm0 - movdqu %xmm0,(%rdi,%rdx,1) - movdqa %xmm0,(%rdx) - leaq 16(%rdx),%rdx - decq %rcx - jnz .Loop_enc_xmm - - andq $15,%r10 - jz .Ldone_enc - -.Ltail_enc: - movq $16,%rcx - subq %r10,%rcx - xorl %eax,%eax -.Loop_enc_byte: - movb (%rsi,%rdx,1),%al - xorb (%rdx),%al - movb %al,(%rdi,%rdx,1) - movb %al,(%rdx) - leaq 1(%rdx),%rdx - decq %r10 - jnz .Loop_enc_byte - - xorl %eax,%eax -.Loop_enc_pad: - movb %al,(%rdx) - leaq 1(%rdx),%rdx - decq %rcx - jnz .Loop_enc_pad - -.Ldone_enc: - movq %rdx,%rax - .byte 0xf3,0xc3 -.size xor128_encrypt_n_pad,.-xor128_encrypt_n_pad - -.globl xor128_decrypt_n_pad -.type xor128_decrypt_n_pad,@function -.align 16 -xor128_decrypt_n_pad: - subq %rdx,%rsi - subq %rdx,%rdi - movq %rcx,%r10 - shrq $4,%rcx - jz .Ltail_dec - nop -.Loop_dec_xmm: - movdqu (%rsi,%rdx,1),%xmm0 - movdqa (%rdx),%xmm1 - pxor %xmm0,%xmm1 - movdqu %xmm1,(%rdi,%rdx,1) - movdqa %xmm0,(%rdx) - leaq 16(%rdx),%rdx - decq %rcx - jnz 
.Loop_dec_xmm - - pxor %xmm1,%xmm1 - andq $15,%r10 - jz .Ldone_dec - -.Ltail_dec: - movq $16,%rcx - subq %r10,%rcx - xorl %eax,%eax - xorq %r11,%r11 -.Loop_dec_byte: - movb (%rsi,%rdx,1),%r11b - movb (%rdx),%al - xorb %r11b,%al - movb %al,(%rdi,%rdx,1) - movb %r11b,(%rdx) - leaq 1(%rdx),%rdx - decq %r10 - jnz .Loop_dec_byte - - xorl %eax,%eax -.Loop_dec_pad: - movb %al,(%rdx) - leaq 1(%rdx),%rdx - decq %rcx - jnz .Loop_dec_pad - -.Ldone_dec: - movq %rdx,%rax - .byte 0xf3,0xc3 -.size xor128_decrypt_n_pad,.-xor128_decrypt_n_pad +ENDPROC(poly1305_blocks_avx512) +#endif /* CONFIG_AS_AVX512 */ diff --git a/lib/zinc/poly1305/poly1305.c b/lib/zinc/poly1305/poly1305.c index 6c6c64035efb..51af7045cac8 100644 --- a/lib/zinc/poly1305/poly1305.c +++ b/lib/zinc/poly1305/poly1305.c @@ -16,6 +16,9 @@ #include #include +#if defined(CONFIG_ZINC_ARCH_X86_64) +#include "poly1305-x86_64-glue.c" +#else static inline bool poly1305_init_arch(void *ctx, const u8 key[POLY1305_KEY_SIZE]) { @@ -37,6 +40,7 @@ static bool *const poly1305_nobs[] __initconst = { }; static void __init poly1305_fpu_init(void) { } +#endif #if defined(CONFIG_ARCH_SUPPORTS_INT128) && defined(__SIZEOF_INT128__) #include "poly1305-donna64.h"

From patchwork Sat Oct 6 02:56:55 2018
X-Patchwork-Submitter: "Jason A. Donenfeld"
X-Patchwork-Id: 148311
From: "Jason A. Donenfeld"
To: linux-kernel@vger.kernel.org, netdev@vger.kernel.org, davem@davemloft.net, gregkh@linuxfoundation.org
Cc: "Jason A. Donenfeld" , Andy Polyakov , Russell King , linux-arm-kernel@lists.infradead.org, Samuel Neves , Jean-Philippe Aumasson , Andy Lutomirski , Andrew Morton , Linus Torvalds , kernel-hardening@lists.openwall.com, linux-crypto@vger.kernel.org
Subject: [PATCH net-next v7 14/28] zinc: import Andy Polyakov's Poly1305 ARM and ARM64 implementations
Date: Sat, 6 Oct 2018 04:56:55 +0200
Message-Id: <20181006025709.4019-15-Jason@zx2c4.com>
In-Reply-To: <20181006025709.4019-1-Jason@zx2c4.com>
References: <20181006025709.4019-1-Jason@zx2c4.com>

These NEON and non-NEON implementations come from Andy Polyakov's CRYPTOGAMS implementation, and are included here in raw form, without modification, so that the subsequent commits that fix these up for the kernel show clearly how the code has been changed. While this is CRYPTOGAMS code, the originating code happens to be the same as that of OpenSSL's commit 5bb1cd2292b388263a0cc05392bb99141212aa53.

Signed-off-by: Jason A.
Donenfeld Based-on-code-from: Andy Polyakov Cc: Andy Polyakov Cc: Russell King Cc: linux-arm-kernel@lists.infradead.org Cc: Samuel Neves Cc: Jean-Philippe Aumasson Cc: Andy Lutomirski Cc: Greg KH Cc: Andrew Morton Cc: Linus Torvalds Cc: kernel-hardening@lists.openwall.com Cc: linux-crypto@vger.kernel.org --- lib/zinc/poly1305/poly1305-arm-cryptogams.S | 1172 +++++++++++++++++ lib/zinc/poly1305/poly1305-arm64-cryptogams.S | 869 ++++++++++++ 2 files changed, 2041 insertions(+) create mode 100644 lib/zinc/poly1305/poly1305-arm-cryptogams.S create mode 100644 lib/zinc/poly1305/poly1305-arm64-cryptogams.S -- 2.19.0 diff --git a/lib/zinc/poly1305/poly1305-arm-cryptogams.S b/lib/zinc/poly1305/poly1305-arm-cryptogams.S new file mode 100644 index 000000000000..884b465030e4 --- /dev/null +++ b/lib/zinc/poly1305/poly1305-arm-cryptogams.S @@ -0,0 +1,1172 @@ +/* SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause */ +/* + * Copyright (C) 2006-2017 CRYPTOGAMS by . All Rights Reserved. + */ + +#include "arm_arch.h" + +.text +#if defined(__thumb2__) +.syntax unified +.thumb +#else +.code 32 +#endif + +.globl poly1305_emit +.globl poly1305_blocks +.globl poly1305_init +.type poly1305_init,%function +.align 5 +poly1305_init: +.Lpoly1305_init: + stmdb sp!,{r4-r11} + + eor r3,r3,r3 + cmp r1,#0 + str r3,[r0,#0] @ zero hash value + str r3,[r0,#4] + str r3,[r0,#8] + str r3,[r0,#12] + str r3,[r0,#16] + str r3,[r0,#36] @ is_base2_26 + add r0,r0,#20 + +#ifdef __thumb2__ + it eq +#endif + moveq r0,#0 + beq .Lno_key + +#if __ARM_MAX_ARCH__>=7 + adr r11,.Lpoly1305_init + ldr r12,.LOPENSSL_armcap +#endif + ldrb r4,[r1,#0] + mov r10,#0x0fffffff + ldrb r5,[r1,#1] + and r3,r10,#-4 @ 0x0ffffffc + ldrb r6,[r1,#2] + ldrb r7,[r1,#3] + orr r4,r4,r5,lsl#8 + ldrb r5,[r1,#4] + orr r4,r4,r6,lsl#16 + ldrb r6,[r1,#5] + orr r4,r4,r7,lsl#24 + ldrb r7,[r1,#6] + and r4,r4,r10 + +#if __ARM_MAX_ARCH__>=7 + ldr r12,[r11,r12] @ OPENSSL_armcap_P +# ifdef __APPLE__ + ldr r12,[r12] +# endif +#endif + ldrb r8,[r1,#7] + orr r5,r5,r6,lsl#8 + ldrb r6,[r1,#8] + orr r5,r5,r7,lsl#16 + ldrb r7,[r1,#9] + orr r5,r5,r8,lsl#24 + ldrb r8,[r1,#10] + and r5,r5,r3 + +#if __ARM_MAX_ARCH__>=7 + tst r12,#ARMV7_NEON @ check for NEON +# ifdef __APPLE__ + adr r9,poly1305_blocks_neon + adr r11,poly1305_blocks +# ifdef __thumb2__ + it ne +# endif + movne r11,r9 + adr r12,poly1305_emit + adr r10,poly1305_emit_neon +# ifdef __thumb2__ + it ne +# endif + movne r12,r10 +# else +# ifdef __thumb2__ + itete eq +# endif + addeq r12,r11,#(poly1305_emit-.Lpoly1305_init) + addne r12,r11,#(poly1305_emit_neon-.Lpoly1305_init) + addeq r11,r11,#(poly1305_blocks-.Lpoly1305_init) + addne r11,r11,#(poly1305_blocks_neon-.Lpoly1305_init) +# endif +# ifdef __thumb2__ + orr r12,r12,#1 @ thumb-ify address + orr r11,r11,#1 +# endif +#endif + ldrb r9,[r1,#11] + orr r6,r6,r7,lsl#8 + ldrb r7,[r1,#12] + orr r6,r6,r8,lsl#16 + ldrb r8,[r1,#13] + orr r6,r6,r9,lsl#24 + ldrb r9,[r1,#14] + and r6,r6,r3 + + ldrb r10,[r1,#15] + orr r7,r7,r8,lsl#8 + str r4,[r0,#0] + orr r7,r7,r9,lsl#16 + str r5,[r0,#4] + orr r7,r7,r10,lsl#24 + str r6,[r0,#8] + and r7,r7,r3 + str r7,[r0,#12] +#if __ARM_MAX_ARCH__>=7 + stmia r2,{r11,r12} @ fill functions table + mov r0,#1 +#else + mov r0,#0 +#endif +.Lno_key: + ldmia sp!,{r4-r11} +#if __ARM_ARCH__>=5 + bx lr @ bx lr +#else + tst lr,#1 + moveq pc,lr @ be binary compatible with V4, yet + .word 0xe12fff1e @ interoperable with Thumb ISA:-) +#endif +.size poly1305_init,.-poly1305_init +.type poly1305_blocks,%function +.align 5 +poly1305_blocks: +.Lpoly1305_blocks: + stmdb 
sp!,{r3-r11,lr} + + ands r2,r2,#-16 + beq .Lno_data + + cmp r3,#0 + add r2,r2,r1 @ end pointer + sub sp,sp,#32 + + ldmia r0,{r4-r12} @ load context + + str r0,[sp,#12] @ offload stuff + mov lr,r1 + str r2,[sp,#16] + str r10,[sp,#20] + str r11,[sp,#24] + str r12,[sp,#28] + b .Loop + +.Loop: +#if __ARM_ARCH__<7 + ldrb r0,[lr],#16 @ load input +# ifdef __thumb2__ + it hi +# endif + addhi r8,r8,#1 @ 1<<128 + ldrb r1,[lr,#-15] + ldrb r2,[lr,#-14] + ldrb r3,[lr,#-13] + orr r1,r0,r1,lsl#8 + ldrb r0,[lr,#-12] + orr r2,r1,r2,lsl#16 + ldrb r1,[lr,#-11] + orr r3,r2,r3,lsl#24 + ldrb r2,[lr,#-10] + adds r4,r4,r3 @ accumulate input + + ldrb r3,[lr,#-9] + orr r1,r0,r1,lsl#8 + ldrb r0,[lr,#-8] + orr r2,r1,r2,lsl#16 + ldrb r1,[lr,#-7] + orr r3,r2,r3,lsl#24 + ldrb r2,[lr,#-6] + adcs r5,r5,r3 + + ldrb r3,[lr,#-5] + orr r1,r0,r1,lsl#8 + ldrb r0,[lr,#-4] + orr r2,r1,r2,lsl#16 + ldrb r1,[lr,#-3] + orr r3,r2,r3,lsl#24 + ldrb r2,[lr,#-2] + adcs r6,r6,r3 + + ldrb r3,[lr,#-1] + orr r1,r0,r1,lsl#8 + str lr,[sp,#8] @ offload input pointer + orr r2,r1,r2,lsl#16 + add r10,r10,r10,lsr#2 + orr r3,r2,r3,lsl#24 +#else + ldr r0,[lr],#16 @ load input +# ifdef __thumb2__ + it hi +# endif + addhi r8,r8,#1 @ padbit + ldr r1,[lr,#-12] + ldr r2,[lr,#-8] + ldr r3,[lr,#-4] +# ifdef __ARMEB__ + rev r0,r0 + rev r1,r1 + rev r2,r2 + rev r3,r3 +# endif + adds r4,r4,r0 @ accumulate input + str lr,[sp,#8] @ offload input pointer + adcs r5,r5,r1 + add r10,r10,r10,lsr#2 + adcs r6,r6,r2 +#endif + add r11,r11,r11,lsr#2 + adcs r7,r7,r3 + add r12,r12,r12,lsr#2 + + umull r2,r3,r5,r9 + adc r8,r8,#0 + umull r0,r1,r4,r9 + umlal r2,r3,r8,r10 + umlal r0,r1,r7,r10 + ldr r10,[sp,#20] @ reload r10 + umlal r2,r3,r6,r12 + umlal r0,r1,r5,r12 + umlal r2,r3,r7,r11 + umlal r0,r1,r6,r11 + umlal r2,r3,r4,r10 + str r0,[sp,#0] @ future r4 + mul r0,r11,r8 + ldr r11,[sp,#24] @ reload r11 + adds r2,r2,r1 @ d1+=d0>>32 + eor r1,r1,r1 + adc lr,r3,#0 @ future r6 + str r2,[sp,#4] @ future r5 + + mul r2,r12,r8 + eor r3,r3,r3 + umlal r0,r1,r7,r12 + ldr r12,[sp,#28] @ reload r12 + umlal r2,r3,r7,r9 + umlal r0,r1,r6,r9 + umlal r2,r3,r6,r10 + umlal r0,r1,r5,r10 + umlal r2,r3,r5,r11 + umlal r0,r1,r4,r11 + umlal r2,r3,r4,r12 + ldr r4,[sp,#0] + mul r8,r9,r8 + ldr r5,[sp,#4] + + adds r6,lr,r0 @ d2+=d1>>32 + ldr lr,[sp,#8] @ reload input pointer + adc r1,r1,#0 + adds r7,r2,r1 @ d3+=d2>>32 + ldr r0,[sp,#16] @ reload end pointer + adc r3,r3,#0 + add r8,r8,r3 @ h4+=d3>>32 + + and r1,r8,#-4 + and r8,r8,#3 + add r1,r1,r1,lsr#2 @ *=5 + adds r4,r4,r1 + adcs r5,r5,#0 + adcs r6,r6,#0 + adcs r7,r7,#0 + adc r8,r8,#0 + + cmp r0,lr @ done yet? + bhi .Loop + + ldr r0,[sp,#12] + add sp,sp,#32 + stmia r0,{r4-r8} @ store the result + +.Lno_data: +#if __ARM_ARCH__>=5 + ldmia sp!,{r3-r11,pc} +#else + ldmia sp!,{r3-r11,lr} + tst lr,#1 + moveq pc,lr @ be binary compatible with V4, yet + .word 0xe12fff1e @ interoperable with Thumb ISA:-) +#endif +.size poly1305_blocks,.-poly1305_blocks +.type poly1305_emit,%function +.align 5 +poly1305_emit: + stmdb sp!,{r4-r11} +.Lpoly1305_emit_enter: + + ldmia r0,{r3-r7} + adds r8,r3,#5 @ compare to modulus + adcs r9,r4,#0 + adcs r10,r5,#0 + adcs r11,r6,#0 + adc r7,r7,#0 + tst r7,#4 @ did it carry/borrow? 
+ +#ifdef __thumb2__ + it ne +#endif + movne r3,r8 + ldr r8,[r2,#0] +#ifdef __thumb2__ + it ne +#endif + movne r4,r9 + ldr r9,[r2,#4] +#ifdef __thumb2__ + it ne +#endif + movne r5,r10 + ldr r10,[r2,#8] +#ifdef __thumb2__ + it ne +#endif + movne r6,r11 + ldr r11,[r2,#12] + + adds r3,r3,r8 + adcs r4,r4,r9 + adcs r5,r5,r10 + adc r6,r6,r11 + +#if __ARM_ARCH__>=7 +# ifdef __ARMEB__ + rev r3,r3 + rev r4,r4 + rev r5,r5 + rev r6,r6 +# endif + str r3,[r1,#0] + str r4,[r1,#4] + str r5,[r1,#8] + str r6,[r1,#12] +#else + strb r3,[r1,#0] + mov r3,r3,lsr#8 + strb r4,[r1,#4] + mov r4,r4,lsr#8 + strb r5,[r1,#8] + mov r5,r5,lsr#8 + strb r6,[r1,#12] + mov r6,r6,lsr#8 + + strb r3,[r1,#1] + mov r3,r3,lsr#8 + strb r4,[r1,#5] + mov r4,r4,lsr#8 + strb r5,[r1,#9] + mov r5,r5,lsr#8 + strb r6,[r1,#13] + mov r6,r6,lsr#8 + + strb r3,[r1,#2] + mov r3,r3,lsr#8 + strb r4,[r1,#6] + mov r4,r4,lsr#8 + strb r5,[r1,#10] + mov r5,r5,lsr#8 + strb r6,[r1,#14] + mov r6,r6,lsr#8 + + strb r3,[r1,#3] + strb r4,[r1,#7] + strb r5,[r1,#11] + strb r6,[r1,#15] +#endif + ldmia sp!,{r4-r11} +#if __ARM_ARCH__>=5 + bx lr @ bx lr +#else + tst lr,#1 + moveq pc,lr @ be binary compatible with V4, yet + .word 0xe12fff1e @ interoperable with Thumb ISA:-) +#endif +.size poly1305_emit,.-poly1305_emit +#if __ARM_MAX_ARCH__>=7 +.fpu neon + +.type poly1305_init_neon,%function +.align 5 +poly1305_init_neon: + ldr r4,[r0,#20] @ load key base 2^32 + ldr r5,[r0,#24] + ldr r6,[r0,#28] + ldr r7,[r0,#32] + + and r2,r4,#0x03ffffff @ base 2^32 -> base 2^26 + mov r3,r4,lsr#26 + mov r4,r5,lsr#20 + orr r3,r3,r5,lsl#6 + mov r5,r6,lsr#14 + orr r4,r4,r6,lsl#12 + mov r6,r7,lsr#8 + orr r5,r5,r7,lsl#18 + and r3,r3,#0x03ffffff + and r4,r4,#0x03ffffff + and r5,r5,#0x03ffffff + + vdup.32 d0,r2 @ r^1 in both lanes + add r2,r3,r3,lsl#2 @ *5 + vdup.32 d1,r3 + add r3,r4,r4,lsl#2 + vdup.32 d2,r2 + vdup.32 d3,r4 + add r4,r5,r5,lsl#2 + vdup.32 d4,r3 + vdup.32 d5,r5 + add r5,r6,r6,lsl#2 + vdup.32 d6,r4 + vdup.32 d7,r6 + vdup.32 d8,r5 + + mov r5,#2 @ counter + +.Lsquare_neon: + @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ + @ d0 = h0*r0 + h4*5*r1 + h3*5*r2 + h2*5*r3 + h1*5*r4 + @ d1 = h1*r0 + h0*r1 + h4*5*r2 + h3*5*r3 + h2*5*r4 + @ d2 = h2*r0 + h1*r1 + h0*r2 + h4*5*r3 + h3*5*r4 + @ d3 = h3*r0 + h2*r1 + h1*r2 + h0*r3 + h4*5*r4 + @ d4 = h4*r0 + h3*r1 + h2*r2 + h1*r3 + h0*r4 + + vmull.u32 q5,d0,d0[1] + vmull.u32 q6,d1,d0[1] + vmull.u32 q7,d3,d0[1] + vmull.u32 q8,d5,d0[1] + vmull.u32 q9,d7,d0[1] + + vmlal.u32 q5,d7,d2[1] + vmlal.u32 q6,d0,d1[1] + vmlal.u32 q7,d1,d1[1] + vmlal.u32 q8,d3,d1[1] + vmlal.u32 q9,d5,d1[1] + + vmlal.u32 q5,d5,d4[1] + vmlal.u32 q6,d7,d4[1] + vmlal.u32 q8,d1,d3[1] + vmlal.u32 q7,d0,d3[1] + vmlal.u32 q9,d3,d3[1] + + vmlal.u32 q5,d3,d6[1] + vmlal.u32 q8,d0,d5[1] + vmlal.u32 q6,d5,d6[1] + vmlal.u32 q7,d7,d6[1] + vmlal.u32 q9,d1,d5[1] + + vmlal.u32 q8,d7,d8[1] + vmlal.u32 q5,d1,d8[1] + vmlal.u32 q6,d3,d8[1] + vmlal.u32 q7,d5,d8[1] + vmlal.u32 q9,d0,d7[1] + + @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ + @ lazy reduction as discussed in "NEON crypto" by D.J. Bernstein + @ and P. Schwabe + @ + @ H0>>+H1>>+H2>>+H3>>+H4 + @ H3>>+H4>>*5+H0>>+H1 + @ + @ Trivia. + @ + @ Result of multiplication of n-bit number by m-bit number is + @ n+m bits wide. However! Even though 2^n is a n+1-bit number, + @ m-bit number multiplied by 2^n is still n+m bits wide. + @ + @ Sum of two n-bit numbers is n+1 bits wide, sum of three - n+2, + @ and so is sum of four. Sum of 2^m n-m-bit numbers and n-bit + @ one is n+1 bits wide. 
+ @ + @ >>+ denotes Hnext += Hn>>26, Hn &= 0x3ffffff. This means that + @ H0, H2, H3 are guaranteed to be 26 bits wide, while H1 and H4 + @ can be 27. However! In cases when their width exceeds 26 bits + @ they are limited by 2^26+2^6. This in turn means that *sum* + @ of the products with these values can still be viewed as sum + @ of 52-bit numbers as long as the amount of addends is not a + @ power of 2. For example, + @ + @ H4 = H4*R0 + H3*R1 + H2*R2 + H1*R3 + H0 * R4, + @ + @ which can't be larger than 5 * (2^26 + 2^6) * (2^26 + 2^6), or + @ 5 * (2^52 + 2*2^32 + 2^12), which in turn is smaller than + @ 8 * (2^52) or 2^55. However, the value is then multiplied by + @ by 5, so we should be looking at 5 * 5 * (2^52 + 2^33 + 2^12), + @ which is less than 32 * (2^52) or 2^57. And when processing + @ data we are looking at triple as many addends... + @ + @ In key setup procedure pre-reduced H0 is limited by 5*4+1 and + @ 5*H4 - by 5*5 52-bit addends, or 57 bits. But when hashing the + @ input H0 is limited by (5*4+1)*3 addends, or 58 bits, while + @ 5*H4 by 5*5*3, or 59[!] bits. How is this relevant? vmlal.u32 + @ instruction accepts 2x32-bit input and writes 2x64-bit result. + @ This means that result of reduction have to be compressed upon + @ loop wrap-around. This can be done in the process of reduction + @ to minimize amount of instructions [as well as amount of + @ 128-bit instructions, which benefits low-end processors], but + @ one has to watch for H2 (which is narrower than H0) and 5*H4 + @ not being wider than 58 bits, so that result of right shift + @ by 26 bits fits in 32 bits. This is also useful on x86, + @ because it allows to use paddd in place for paddq, which + @ benefits Atom, where paddq is ridiculously slow. + + vshr.u64 q15,q8,#26 + vmovn.i64 d16,q8 + vshr.u64 q4,q5,#26 + vmovn.i64 d10,q5 + vadd.i64 q9,q9,q15 @ h3 -> h4 + vbic.i32 d16,#0xfc000000 @ &=0x03ffffff + vadd.i64 q6,q6,q4 @ h0 -> h1 + vbic.i32 d10,#0xfc000000 + + vshrn.u64 d30,q9,#26 + vmovn.i64 d18,q9 + vshr.u64 q4,q6,#26 + vmovn.i64 d12,q6 + vadd.i64 q7,q7,q4 @ h1 -> h2 + vbic.i32 d18,#0xfc000000 + vbic.i32 d12,#0xfc000000 + + vadd.i32 d10,d10,d30 + vshl.u32 d30,d30,#2 + vshrn.u64 d8,q7,#26 + vmovn.i64 d14,q7 + vadd.i32 d10,d10,d30 @ h4 -> h0 + vadd.i32 d16,d16,d8 @ h2 -> h3 + vbic.i32 d14,#0xfc000000 + + vshr.u32 d30,d10,#26 + vbic.i32 d10,#0xfc000000 + vshr.u32 d8,d16,#26 + vbic.i32 d16,#0xfc000000 + vadd.i32 d12,d12,d30 @ h0 -> h1 + vadd.i32 d18,d18,d8 @ h3 -> h4 + + subs r5,r5,#1 + beq .Lsquare_break_neon + + add r6,r0,#(48+0*9*4) + add r7,r0,#(48+1*9*4) + + vtrn.32 d0,d10 @ r^2:r^1 + vtrn.32 d3,d14 + vtrn.32 d5,d16 + vtrn.32 d1,d12 + vtrn.32 d7,d18 + + vshl.u32 d4,d3,#2 @ *5 + vshl.u32 d6,d5,#2 + vshl.u32 d2,d1,#2 + vshl.u32 d8,d7,#2 + vadd.i32 d4,d4,d3 + vadd.i32 d2,d2,d1 + vadd.i32 d6,d6,d5 + vadd.i32 d8,d8,d7 + + vst4.32 {d0[0],d1[0],d2[0],d3[0]},[r6]! + vst4.32 {d0[1],d1[1],d2[1],d3[1]},[r7]! + vst4.32 {d4[0],d5[0],d6[0],d7[0]},[r6]! + vst4.32 {d4[1],d5[1],d6[1],d7[1]},[r7]! + vst1.32 {d8[0]},[r6,:32] + vst1.32 {d8[1]},[r7,:32] + + b .Lsquare_neon + +.align 4 +.Lsquare_break_neon: + add r6,r0,#(48+2*4*9) + add r7,r0,#(48+3*4*9) + + vmov d0,d10 @ r^4:r^3 + vshl.u32 d2,d12,#2 @ *5 + vmov d1,d12 + vshl.u32 d4,d14,#2 + vmov d3,d14 + vshl.u32 d6,d16,#2 + vmov d5,d16 + vshl.u32 d8,d18,#2 + vmov d7,d18 + vadd.i32 d2,d2,d12 + vadd.i32 d4,d4,d14 + vadd.i32 d6,d6,d16 + vadd.i32 d8,d8,d18 + + vst4.32 {d0[0],d1[0],d2[0],d3[0]},[r6]! + vst4.32 {d0[1],d1[1],d2[1],d3[1]},[r7]! 
+ vst4.32 {d4[0],d5[0],d6[0],d7[0]},[r6]! + vst4.32 {d4[1],d5[1],d6[1],d7[1]},[r7]! + vst1.32 {d8[0]},[r6] + vst1.32 {d8[1]},[r7] + + bx lr @ bx lr +.size poly1305_init_neon,.-poly1305_init_neon + +.type poly1305_blocks_neon,%function +.align 5 +poly1305_blocks_neon: + ldr ip,[r0,#36] @ is_base2_26 + ands r2,r2,#-16 + beq .Lno_data_neon + + cmp r2,#64 + bhs .Lenter_neon + tst ip,ip @ is_base2_26? + beq .Lpoly1305_blocks + +.Lenter_neon: + stmdb sp!,{r4-r7} + vstmdb sp!,{d8-d15} @ ABI specification says so + + tst ip,ip @ is_base2_26? + bne .Lbase2_26_neon + + stmdb sp!,{r1-r3,lr} + bl poly1305_init_neon + + ldr r4,[r0,#0] @ load hash value base 2^32 + ldr r5,[r0,#4] + ldr r6,[r0,#8] + ldr r7,[r0,#12] + ldr ip,[r0,#16] + + and r2,r4,#0x03ffffff @ base 2^32 -> base 2^26 + mov r3,r4,lsr#26 + veor d10,d10,d10 + mov r4,r5,lsr#20 + orr r3,r3,r5,lsl#6 + veor d12,d12,d12 + mov r5,r6,lsr#14 + orr r4,r4,r6,lsl#12 + veor d14,d14,d14 + mov r6,r7,lsr#8 + orr r5,r5,r7,lsl#18 + veor d16,d16,d16 + and r3,r3,#0x03ffffff + orr r6,r6,ip,lsl#24 + veor d18,d18,d18 + and r4,r4,#0x03ffffff + mov r1,#1 + and r5,r5,#0x03ffffff + str r1,[r0,#36] @ is_base2_26 + + vmov.32 d10[0],r2 + vmov.32 d12[0],r3 + vmov.32 d14[0],r4 + vmov.32 d16[0],r5 + vmov.32 d18[0],r6 + adr r5,.Lzeros + + ldmia sp!,{r1-r3,lr} + b .Lbase2_32_neon + +.align 4 +.Lbase2_26_neon: + @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ + @ load hash value + + veor d10,d10,d10 + veor d12,d12,d12 + veor d14,d14,d14 + veor d16,d16,d16 + veor d18,d18,d18 + vld4.32 {d10[0],d12[0],d14[0],d16[0]},[r0]! + adr r5,.Lzeros + vld1.32 {d18[0]},[r0] + sub r0,r0,#16 @ rewind + +.Lbase2_32_neon: + add r4,r1,#32 + mov r3,r3,lsl#24 + tst r2,#31 + beq .Leven + + vld4.32 {d20[0],d22[0],d24[0],d26[0]},[r1]! + vmov.32 d28[0],r3 + sub r2,r2,#16 + add r4,r1,#32 + +# ifdef __ARMEB__ + vrev32.8 q10,q10 + vrev32.8 q13,q13 + vrev32.8 q11,q11 + vrev32.8 q12,q12 +# endif + vsri.u32 d28,d26,#8 @ base 2^32 -> base 2^26 + vshl.u32 d26,d26,#18 + + vsri.u32 d26,d24,#14 + vshl.u32 d24,d24,#12 + vadd.i32 d29,d28,d18 @ add hash value and move to #hi + + vbic.i32 d26,#0xfc000000 + vsri.u32 d24,d22,#20 + vshl.u32 d22,d22,#6 + + vbic.i32 d24,#0xfc000000 + vsri.u32 d22,d20,#26 + vadd.i32 d27,d26,d16 + + vbic.i32 d20,#0xfc000000 + vbic.i32 d22,#0xfc000000 + vadd.i32 d25,d24,d14 + + vadd.i32 d21,d20,d10 + vadd.i32 d23,d22,d12 + + mov r7,r5 + add r6,r0,#48 + + cmp r2,r2 + b .Long_tail + +.align 4 +.Leven: + subs r2,r2,#64 + it lo + movlo r4,r5 + + vmov.i32 q14,#1<<24 @ padbit, yes, always + vld4.32 {d20,d22,d24,d26},[r1] @ inp[0:1] + add r1,r1,#64 + vld4.32 {d21,d23,d25,d27},[r4] @ inp[2:3] (or 0) + add r4,r4,#64 + itt hi + addhi r7,r0,#(48+1*9*4) + addhi r6,r0,#(48+3*9*4) + +# ifdef __ARMEB__ + vrev32.8 q10,q10 + vrev32.8 q13,q13 + vrev32.8 q11,q11 + vrev32.8 q12,q12 +# endif + vsri.u32 q14,q13,#8 @ base 2^32 -> base 2^26 + vshl.u32 q13,q13,#18 + + vsri.u32 q13,q12,#14 + vshl.u32 q12,q12,#12 + + vbic.i32 q13,#0xfc000000 + vsri.u32 q12,q11,#20 + vshl.u32 q11,q11,#6 + + vbic.i32 q12,#0xfc000000 + vsri.u32 q11,q10,#26 + + vbic.i32 q10,#0xfc000000 + vbic.i32 q11,#0xfc000000 + + bls .Lskip_loop + + vld4.32 {d0[1],d1[1],d2[1],d3[1]},[r7]! @ load r^2 + vld4.32 {d0[0],d1[0],d2[0],d3[0]},[r6]! @ load r^4 + vld4.32 {d4[1],d5[1],d6[1],d7[1]},[r7]! + vld4.32 {d4[0],d5[0],d6[0],d7[0]},[r6]! 
+ b .Loop_neon + +.align 5 +.Loop_neon: + @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ + @ ((inp[0]*r^4+inp[2]*r^2+inp[4])*r^4+inp[6]*r^2 + @ ((inp[1]*r^4+inp[3]*r^2+inp[5])*r^3+inp[7]*r + @ ___________________/ + @ ((inp[0]*r^4+inp[2]*r^2+inp[4])*r^4+inp[6]*r^2+inp[8])*r^2 + @ ((inp[1]*r^4+inp[3]*r^2+inp[5])*r^4+inp[7]*r^2+inp[9])*r + @ ___________________/ ____________________/ + @ + @ Note that we start with inp[2:3]*r^2. This is because it + @ doesn't depend on reduction in previous iteration. + @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ + @ d4 = h4*r0 + h3*r1 + h2*r2 + h1*r3 + h0*r4 + @ d3 = h3*r0 + h2*r1 + h1*r2 + h0*r3 + h4*5*r4 + @ d2 = h2*r0 + h1*r1 + h0*r2 + h4*5*r3 + h3*5*r4 + @ d1 = h1*r0 + h0*r1 + h4*5*r2 + h3*5*r3 + h2*5*r4 + @ d0 = h0*r0 + h4*5*r1 + h3*5*r2 + h2*5*r3 + h1*5*r4 + + @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ + @ inp[2:3]*r^2 + + vadd.i32 d24,d24,d14 @ accumulate inp[0:1] + vmull.u32 q7,d25,d0[1] + vadd.i32 d20,d20,d10 + vmull.u32 q5,d21,d0[1] + vadd.i32 d26,d26,d16 + vmull.u32 q8,d27,d0[1] + vmlal.u32 q7,d23,d1[1] + vadd.i32 d22,d22,d12 + vmull.u32 q6,d23,d0[1] + + vadd.i32 d28,d28,d18 + vmull.u32 q9,d29,d0[1] + subs r2,r2,#64 + vmlal.u32 q5,d29,d2[1] + it lo + movlo r4,r5 + vmlal.u32 q8,d25,d1[1] + vld1.32 d8[1],[r7,:32] + vmlal.u32 q6,d21,d1[1] + vmlal.u32 q9,d27,d1[1] + + vmlal.u32 q5,d27,d4[1] + vmlal.u32 q8,d23,d3[1] + vmlal.u32 q9,d25,d3[1] + vmlal.u32 q6,d29,d4[1] + vmlal.u32 q7,d21,d3[1] + + vmlal.u32 q8,d21,d5[1] + vmlal.u32 q5,d25,d6[1] + vmlal.u32 q9,d23,d5[1] + vmlal.u32 q6,d27,d6[1] + vmlal.u32 q7,d29,d6[1] + + vmlal.u32 q8,d29,d8[1] + vmlal.u32 q5,d23,d8[1] + vmlal.u32 q9,d21,d7[1] + vmlal.u32 q6,d25,d8[1] + vmlal.u32 q7,d27,d8[1] + + vld4.32 {d21,d23,d25,d27},[r4] @ inp[2:3] (or 0) + add r4,r4,#64 + + @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ + @ (hash+inp[0:1])*r^4 and accumulate + + vmlal.u32 q8,d26,d0[0] + vmlal.u32 q5,d20,d0[0] + vmlal.u32 q9,d28,d0[0] + vmlal.u32 q6,d22,d0[0] + vmlal.u32 q7,d24,d0[0] + vld1.32 d8[0],[r6,:32] + + vmlal.u32 q8,d24,d1[0] + vmlal.u32 q5,d28,d2[0] + vmlal.u32 q9,d26,d1[0] + vmlal.u32 q6,d20,d1[0] + vmlal.u32 q7,d22,d1[0] + + vmlal.u32 q8,d22,d3[0] + vmlal.u32 q5,d26,d4[0] + vmlal.u32 q9,d24,d3[0] + vmlal.u32 q6,d28,d4[0] + vmlal.u32 q7,d20,d3[0] + + vmlal.u32 q8,d20,d5[0] + vmlal.u32 q5,d24,d6[0] + vmlal.u32 q9,d22,d5[0] + vmlal.u32 q6,d26,d6[0] + vmlal.u32 q8,d28,d8[0] + + vmlal.u32 q7,d28,d6[0] + vmlal.u32 q5,d22,d8[0] + vmlal.u32 q9,d20,d7[0] + vmov.i32 q14,#1<<24 @ padbit, yes, always + vmlal.u32 q6,d24,d8[0] + vmlal.u32 q7,d26,d8[0] + + vld4.32 {d20,d22,d24,d26},[r1] @ inp[0:1] + add r1,r1,#64 +# ifdef __ARMEB__ + vrev32.8 q10,q10 + vrev32.8 q11,q11 + vrev32.8 q12,q12 + vrev32.8 q13,q13 +# endif + + @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ + @ lazy reduction interleaved with base 2^32 -> base 2^26 of + @ inp[0:3] previously loaded to q10-q13 and smashed to q10-q14. 
+ + vshr.u64 q15,q8,#26 + vmovn.i64 d16,q8 + vshr.u64 q4,q5,#26 + vmovn.i64 d10,q5 + vadd.i64 q9,q9,q15 @ h3 -> h4 + vbic.i32 d16,#0xfc000000 + vsri.u32 q14,q13,#8 @ base 2^32 -> base 2^26 + vadd.i64 q6,q6,q4 @ h0 -> h1 + vshl.u32 q13,q13,#18 + vbic.i32 d10,#0xfc000000 + + vshrn.u64 d30,q9,#26 + vmovn.i64 d18,q9 + vshr.u64 q4,q6,#26 + vmovn.i64 d12,q6 + vadd.i64 q7,q7,q4 @ h1 -> h2 + vsri.u32 q13,q12,#14 + vbic.i32 d18,#0xfc000000 + vshl.u32 q12,q12,#12 + vbic.i32 d12,#0xfc000000 + + vadd.i32 d10,d10,d30 + vshl.u32 d30,d30,#2 + vbic.i32 q13,#0xfc000000 + vshrn.u64 d8,q7,#26 + vmovn.i64 d14,q7 + vaddl.u32 q5,d10,d30 @ h4 -> h0 [widen for a sec] + vsri.u32 q12,q11,#20 + vadd.i32 d16,d16,d8 @ h2 -> h3 + vshl.u32 q11,q11,#6 + vbic.i32 d14,#0xfc000000 + vbic.i32 q12,#0xfc000000 + + vshrn.u64 d30,q5,#26 @ re-narrow + vmovn.i64 d10,q5 + vsri.u32 q11,q10,#26 + vbic.i32 q10,#0xfc000000 + vshr.u32 d8,d16,#26 + vbic.i32 d16,#0xfc000000 + vbic.i32 d10,#0xfc000000 + vadd.i32 d12,d12,d30 @ h0 -> h1 + vadd.i32 d18,d18,d8 @ h3 -> h4 + vbic.i32 q11,#0xfc000000 + + bhi .Loop_neon + +.Lskip_loop: + @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ + @ multiply (inp[0:1]+hash) or inp[2:3] by r^2:r^1 + + add r7,r0,#(48+0*9*4) + add r6,r0,#(48+1*9*4) + adds r2,r2,#32 + it ne + movne r2,#0 + bne .Long_tail + + vadd.i32 d25,d24,d14 @ add hash value and move to #hi + vadd.i32 d21,d20,d10 + vadd.i32 d27,d26,d16 + vadd.i32 d23,d22,d12 + vadd.i32 d29,d28,d18 + +.Long_tail: + vld4.32 {d0[1],d1[1],d2[1],d3[1]},[r7]! @ load r^1 + vld4.32 {d0[0],d1[0],d2[0],d3[0]},[r6]! @ load r^2 + + vadd.i32 d24,d24,d14 @ can be redundant + vmull.u32 q7,d25,d0 + vadd.i32 d20,d20,d10 + vmull.u32 q5,d21,d0 + vadd.i32 d26,d26,d16 + vmull.u32 q8,d27,d0 + vadd.i32 d22,d22,d12 + vmull.u32 q6,d23,d0 + vadd.i32 d28,d28,d18 + vmull.u32 q9,d29,d0 + + vmlal.u32 q5,d29,d2 + vld4.32 {d4[1],d5[1],d6[1],d7[1]},[r7]! + vmlal.u32 q8,d25,d1 + vld4.32 {d4[0],d5[0],d6[0],d7[0]},[r6]! + vmlal.u32 q6,d21,d1 + vmlal.u32 q9,d27,d1 + vmlal.u32 q7,d23,d1 + + vmlal.u32 q8,d23,d3 + vld1.32 d8[1],[r7,:32] + vmlal.u32 q5,d27,d4 + vld1.32 d8[0],[r6,:32] + vmlal.u32 q9,d25,d3 + vmlal.u32 q6,d29,d4 + vmlal.u32 q7,d21,d3 + + vmlal.u32 q8,d21,d5 + it ne + addne r7,r0,#(48+2*9*4) + vmlal.u32 q5,d25,d6 + it ne + addne r6,r0,#(48+3*9*4) + vmlal.u32 q9,d23,d5 + vmlal.u32 q6,d27,d6 + vmlal.u32 q7,d29,d6 + + vmlal.u32 q8,d29,d8 + vorn q0,q0,q0 @ all-ones, can be redundant + vmlal.u32 q5,d23,d8 + vshr.u64 q0,q0,#38 + vmlal.u32 q9,d21,d7 + vmlal.u32 q6,d25,d8 + vmlal.u32 q7,d27,d8 + + beq .Lshort_tail + + @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ + @ (hash+inp[0:1])*r^4:r^3 and accumulate + + vld4.32 {d0[1],d1[1],d2[1],d3[1]},[r7]! @ load r^3 + vld4.32 {d0[0],d1[0],d2[0],d3[0]},[r6]! @ load r^4 + + vmlal.u32 q7,d24,d0 + vmlal.u32 q5,d20,d0 + vmlal.u32 q8,d26,d0 + vmlal.u32 q6,d22,d0 + vmlal.u32 q9,d28,d0 + + vmlal.u32 q5,d28,d2 + vld4.32 {d4[1],d5[1],d6[1],d7[1]},[r7]! + vmlal.u32 q8,d24,d1 + vld4.32 {d4[0],d5[0],d6[0],d7[0]},[r6]! 
+ vmlal.u32 q6,d20,d1 + vmlal.u32 q9,d26,d1 + vmlal.u32 q7,d22,d1 + + vmlal.u32 q8,d22,d3 + vld1.32 d8[1],[r7,:32] + vmlal.u32 q5,d26,d4 + vld1.32 d8[0],[r6,:32] + vmlal.u32 q9,d24,d3 + vmlal.u32 q6,d28,d4 + vmlal.u32 q7,d20,d3 + + vmlal.u32 q8,d20,d5 + vmlal.u32 q5,d24,d6 + vmlal.u32 q9,d22,d5 + vmlal.u32 q6,d26,d6 + vmlal.u32 q7,d28,d6 + + vmlal.u32 q8,d28,d8 + vorn q0,q0,q0 @ all-ones + vmlal.u32 q5,d22,d8 + vshr.u64 q0,q0,#38 + vmlal.u32 q9,d20,d7 + vmlal.u32 q6,d24,d8 + vmlal.u32 q7,d26,d8 + +.Lshort_tail: + @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ + @ horizontal addition + + vadd.i64 d16,d16,d17 + vadd.i64 d10,d10,d11 + vadd.i64 d18,d18,d19 + vadd.i64 d12,d12,d13 + vadd.i64 d14,d14,d15 + + @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ + @ lazy reduction, but without narrowing + + vshr.u64 q15,q8,#26 + vand.i64 q8,q8,q0 + vshr.u64 q4,q5,#26 + vand.i64 q5,q5,q0 + vadd.i64 q9,q9,q15 @ h3 -> h4 + vadd.i64 q6,q6,q4 @ h0 -> h1 + + vshr.u64 q15,q9,#26 + vand.i64 q9,q9,q0 + vshr.u64 q4,q6,#26 + vand.i64 q6,q6,q0 + vadd.i64 q7,q7,q4 @ h1 -> h2 + + vadd.i64 q5,q5,q15 + vshl.u64 q15,q15,#2 + vshr.u64 q4,q7,#26 + vand.i64 q7,q7,q0 + vadd.i64 q5,q5,q15 @ h4 -> h0 + vadd.i64 q8,q8,q4 @ h2 -> h3 + + vshr.u64 q15,q5,#26 + vand.i64 q5,q5,q0 + vshr.u64 q4,q8,#26 + vand.i64 q8,q8,q0 + vadd.i64 q6,q6,q15 @ h0 -> h1 + vadd.i64 q9,q9,q4 @ h3 -> h4 + + cmp r2,#0 + bne .Leven + + @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ + @ store hash value + + vst4.32 {d10[0],d12[0],d14[0],d16[0]},[r0]! + vst1.32 {d18[0]},[r0] + + vldmia sp!,{d8-d15} @ epilogue + ldmia sp!,{r4-r7} +.Lno_data_neon: + bx lr @ bx lr +.size poly1305_blocks_neon,.-poly1305_blocks_neon + +.type poly1305_emit_neon,%function +.align 5 +poly1305_emit_neon: + ldr ip,[r0,#36] @ is_base2_26 + + stmdb sp!,{r4-r11} + + tst ip,ip + beq .Lpoly1305_emit_enter + + ldmia r0,{r3-r7} + eor r8,r8,r8 + + adds r3,r3,r4,lsl#26 @ base 2^26 -> base 2^32 + mov r4,r4,lsr#6 + adcs r4,r4,r5,lsl#20 + mov r5,r5,lsr#12 + adcs r5,r5,r6,lsl#14 + mov r6,r6,lsr#18 + adcs r6,r6,r7,lsl#8 + adc r7,r8,r7,lsr#24 @ can be partially reduced ... + + and r8,r7,#-4 @ ... so reduce + and r7,r6,#3 + add r8,r8,r8,lsr#2 @ *= 5 + adds r3,r3,r8 + adcs r4,r4,#0 + adcs r5,r5,#0 + adcs r6,r6,#0 + adc r7,r7,#0 + + adds r8,r3,#5 @ compare to modulus + adcs r9,r4,#0 + adcs r10,r5,#0 + adcs r11,r6,#0 + adc r7,r7,#0 + tst r7,#4 @ did it carry/borrow? + + it ne + movne r3,r8 + ldr r8,[r2,#0] + it ne + movne r4,r9 + ldr r9,[r2,#4] + it ne + movne r5,r10 + ldr r10,[r2,#8] + it ne + movne r6,r11 + ldr r11,[r2,#12] + + adds r3,r3,r8 @ accumulate nonce + adcs r4,r4,r9 + adcs r5,r5,r10 + adc r6,r6,r11 + +# ifdef __ARMEB__ + rev r3,r3 + rev r4,r4 + rev r5,r5 + rev r6,r6 +# endif + str r3,[r1,#0] @ store the result + str r4,[r1,#4] + str r5,[r1,#8] + str r6,[r1,#12] + + ldmia sp!,{r4-r11} + bx lr @ bx lr +.size poly1305_emit_neon,.-poly1305_emit_neon + +.align 5 +.Lzeros: +.long 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 +.LOPENSSL_armcap: +.word OPENSSL_armcap_P-.Lpoly1305_init +#endif +.asciz "Poly1305 for ARMv4/NEON, CRYPTOGAMS by " +.align 2 +#if __ARM_MAX_ARCH__>=7 +.comm OPENSSL_armcap_P,4,4 +#endif diff --git a/lib/zinc/poly1305/poly1305-arm64-cryptogams.S b/lib/zinc/poly1305/poly1305-arm64-cryptogams.S new file mode 100644 index 000000000000..0ecb50a83ec0 --- /dev/null +++ b/lib/zinc/poly1305/poly1305-arm64-cryptogams.S @@ -0,0 +1,869 @@ +/* SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause */ +/* + * Copyright (C) 2006-2017 CRYPTOGAMS by . 
All Rights Reserved. + */ + +#include "arm_arch.h" + +.text + +// forward "declarations" are required for Apple + +.globl poly1305_blocks +.globl poly1305_emit + +.globl poly1305_init +.type poly1305_init,%function +.align 5 +poly1305_init: + cmp x1,xzr + stp xzr,xzr,[x0] // zero hash value + stp xzr,xzr,[x0,#16] // [along with is_base2_26] + + csel x0,xzr,x0,eq + b.eq .Lno_key + +#ifdef __ILP32__ + ldrsw x11,.LOPENSSL_armcap_P +#else + ldr x11,.LOPENSSL_armcap_P +#endif + adr x10,.LOPENSSL_armcap_P + + ldp x7,x8,[x1] // load key + mov x9,#0xfffffffc0fffffff + movk x9,#0x0fff,lsl#48 + ldr w17,[x10,x11] +#ifdef __ARMEB__ + rev x7,x7 // flip bytes + rev x8,x8 +#endif + and x7,x7,x9 // &=0ffffffc0fffffff + and x9,x9,#-4 + and x8,x8,x9 // &=0ffffffc0ffffffc + stp x7,x8,[x0,#32] // save key value + + tst w17,#ARMV7_NEON + + adr x12,poly1305_blocks + adr x7,poly1305_blocks_neon + adr x13,poly1305_emit + adr x8,poly1305_emit_neon + + csel x12,x12,x7,eq + csel x13,x13,x8,eq + +#ifdef __ILP32__ + stp w12,w13,[x2] +#else + stp x12,x13,[x2] +#endif + + mov x0,#1 +.Lno_key: + ret +.size poly1305_init,.-poly1305_init + +.type poly1305_blocks,%function +.align 5 +poly1305_blocks: + ands x2,x2,#-16 + b.eq .Lno_data + + ldp x4,x5,[x0] // load hash value + ldp x7,x8,[x0,#32] // load key value + ldr x6,[x0,#16] + add x9,x8,x8,lsr#2 // s1 = r1 + (r1 >> 2) + b .Loop + +.align 5 +.Loop: + ldp x10,x11,[x1],#16 // load input + sub x2,x2,#16 +#ifdef __ARMEB__ + rev x10,x10 + rev x11,x11 +#endif + adds x4,x4,x10 // accumulate input + adcs x5,x5,x11 + + mul x12,x4,x7 // h0*r0 + adc x6,x6,x3 + umulh x13,x4,x7 + + mul x10,x5,x9 // h1*5*r1 + umulh x11,x5,x9 + + adds x12,x12,x10 + mul x10,x4,x8 // h0*r1 + adc x13,x13,x11 + umulh x14,x4,x8 + + adds x13,x13,x10 + mul x10,x5,x7 // h1*r0 + adc x14,x14,xzr + umulh x11,x5,x7 + + adds x13,x13,x10 + mul x10,x6,x9 // h2*5*r1 + adc x14,x14,x11 + mul x11,x6,x7 // h2*r0 + + adds x13,x13,x10 + adc x14,x14,x11 + + and x10,x14,#-4 // final reduction + and x6,x14,#3 + add x10,x10,x14,lsr#2 + adds x4,x12,x10 + adcs x5,x13,xzr + adc x6,x6,xzr + + cbnz x2,.Loop + + stp x4,x5,[x0] // store hash value + str x6,[x0,#16] + +.Lno_data: + ret +.size poly1305_blocks,.-poly1305_blocks + +.type poly1305_emit,%function +.align 5 +poly1305_emit: + ldp x4,x5,[x0] // load hash base 2^64 + ldr x6,[x0,#16] + ldp x10,x11,[x2] // load nonce + + adds x12,x4,#5 // compare to modulus + adcs x13,x5,xzr + adc x14,x6,xzr + + tst x14,#-4 // see if it's carried/borrowed + + csel x4,x4,x12,eq + csel x5,x5,x13,eq + +#ifdef __ARMEB__ + ror x10,x10,#32 // flip nonce words + ror x11,x11,#32 +#endif + adds x4,x4,x10 // accumulate nonce + adc x5,x5,x11 +#ifdef __ARMEB__ + rev x4,x4 // flip output bytes + rev x5,x5 +#endif + stp x4,x5,[x1] // write result + + ret +.size poly1305_emit,.-poly1305_emit +.type poly1305_mult,%function +.align 5 +poly1305_mult: + mul x12,x4,x7 // h0*r0 + umulh x13,x4,x7 + + mul x10,x5,x9 // h1*5*r1 + umulh x11,x5,x9 + + adds x12,x12,x10 + mul x10,x4,x8 // h0*r1 + adc x13,x13,x11 + umulh x14,x4,x8 + + adds x13,x13,x10 + mul x10,x5,x7 // h1*r0 + adc x14,x14,xzr + umulh x11,x5,x7 + + adds x13,x13,x10 + mul x10,x6,x9 // h2*5*r1 + adc x14,x14,x11 + mul x11,x6,x7 // h2*r0 + + adds x13,x13,x10 + adc x14,x14,x11 + + and x10,x14,#-4 // final reduction + and x6,x14,#3 + add x10,x10,x14,lsr#2 + adds x4,x12,x10 + adcs x5,x13,xzr + adc x6,x6,xzr + + ret +.size poly1305_mult,.-poly1305_mult + +.type poly1305_splat,%function +.align 5 +poly1305_splat: + and x12,x4,#0x03ffffff // base 2^64 -> base 2^26 + 
ubfx x13,x4,#26,#26 + extr x14,x5,x4,#52 + and x14,x14,#0x03ffffff + ubfx x15,x5,#14,#26 + extr x16,x6,x5,#40 + + str w12,[x0,#16*0] // r0 + add w12,w13,w13,lsl#2 // r1*5 + str w13,[x0,#16*1] // r1 + add w13,w14,w14,lsl#2 // r2*5 + str w12,[x0,#16*2] // s1 + str w14,[x0,#16*3] // r2 + add w14,w15,w15,lsl#2 // r3*5 + str w13,[x0,#16*4] // s2 + str w15,[x0,#16*5] // r3 + add w15,w16,w16,lsl#2 // r4*5 + str w14,[x0,#16*6] // s3 + str w16,[x0,#16*7] // r4 + str w15,[x0,#16*8] // s4 + + ret +.size poly1305_splat,.-poly1305_splat + +.type poly1305_blocks_neon,%function +.align 5 +poly1305_blocks_neon: + ldr x17,[x0,#24] + cmp x2,#128 + b.hs .Lblocks_neon + cbz x17,poly1305_blocks + +.Lblocks_neon: + stp x29,x30,[sp,#-80]! + add x29,sp,#0 + + ands x2,x2,#-16 + b.eq .Lno_data_neon + + cbz x17,.Lbase2_64_neon + + ldp w10,w11,[x0] // load hash value base 2^26 + ldp w12,w13,[x0,#8] + ldr w14,[x0,#16] + + tst x2,#31 + b.eq .Leven_neon + + ldp x7,x8,[x0,#32] // load key value + + add x4,x10,x11,lsl#26 // base 2^26 -> base 2^64 + lsr x5,x12,#12 + adds x4,x4,x12,lsl#52 + add x5,x5,x13,lsl#14 + adc x5,x5,xzr + lsr x6,x14,#24 + adds x5,x5,x14,lsl#40 + adc x14,x6,xzr // can be partially reduced... + + ldp x12,x13,[x1],#16 // load input + sub x2,x2,#16 + add x9,x8,x8,lsr#2 // s1 = r1 + (r1 >> 2) + + and x10,x14,#-4 // ... so reduce + and x6,x14,#3 + add x10,x10,x14,lsr#2 + adds x4,x4,x10 + adcs x5,x5,xzr + adc x6,x6,xzr + +#ifdef __ARMEB__ + rev x12,x12 + rev x13,x13 +#endif + adds x4,x4,x12 // accumulate input + adcs x5,x5,x13 + adc x6,x6,x3 + + bl poly1305_mult + ldr x30,[sp,#8] + + cbz x3,.Lstore_base2_64_neon + + and x10,x4,#0x03ffffff // base 2^64 -> base 2^26 + ubfx x11,x4,#26,#26 + extr x12,x5,x4,#52 + and x12,x12,#0x03ffffff + ubfx x13,x5,#14,#26 + extr x14,x6,x5,#40 + + cbnz x2,.Leven_neon + + stp w10,w11,[x0] // store hash value base 2^26 + stp w12,w13,[x0,#8] + str w14,[x0,#16] + b .Lno_data_neon + +.align 4 +.Lstore_base2_64_neon: + stp x4,x5,[x0] // store hash value base 2^64 + stp x6,xzr,[x0,#16] // note that is_base2_26 is zeroed + b .Lno_data_neon + +.align 4 +.Lbase2_64_neon: + ldp x7,x8,[x0,#32] // load key value + + ldp x4,x5,[x0] // load hash value base 2^64 + ldr x6,[x0,#16] + + tst x2,#31 + b.eq .Linit_neon + + ldp x12,x13,[x1],#16 // load input + sub x2,x2,#16 + add x9,x8,x8,lsr#2 // s1 = r1 + (r1 >> 2) +#ifdef __ARMEB__ + rev x12,x12 + rev x13,x13 +#endif + adds x4,x4,x12 // accumulate input + adcs x5,x5,x13 + adc x6,x6,x3 + + bl poly1305_mult + +.Linit_neon: + and x10,x4,#0x03ffffff // base 2^64 -> base 2^26 + ubfx x11,x4,#26,#26 + extr x12,x5,x4,#52 + and x12,x12,#0x03ffffff + ubfx x13,x5,#14,#26 + extr x14,x6,x5,#40 + + stp d8,d9,[sp,#16] // meet ABI requirements + stp d10,d11,[sp,#32] + stp d12,d13,[sp,#48] + stp d14,d15,[sp,#64] + + fmov d24,x10 + fmov d25,x11 + fmov d26,x12 + fmov d27,x13 + fmov d28,x14 + + ////////////////////////////////// initialize r^n table + mov x4,x7 // r^1 + add x9,x8,x8,lsr#2 // s1 = r1 + (r1 >> 2) + mov x5,x8 + mov x6,xzr + add x0,x0,#48+12 + bl poly1305_splat + + bl poly1305_mult // r^2 + sub x0,x0,#4 + bl poly1305_splat + + bl poly1305_mult // r^3 + sub x0,x0,#4 + bl poly1305_splat + + bl poly1305_mult // r^4 + sub x0,x0,#4 + bl poly1305_splat + ldr x30,[sp,#8] + + add x16,x1,#32 + adr x17,.Lzeros + subs x2,x2,#64 + csel x16,x17,x16,lo + + mov x4,#1 + str x4,[x0,#-24] // set is_base2_26 + sub x0,x0,#48 // restore original x0 + b .Ldo_neon + +.align 4 +.Leven_neon: + add x16,x1,#32 + adr x17,.Lzeros + subs x2,x2,#64 + csel x16,x17,x16,lo + + stp 
d8,d9,[sp,#16] // meet ABI requirements + stp d10,d11,[sp,#32] + stp d12,d13,[sp,#48] + stp d14,d15,[sp,#64] + + fmov d24,x10 + fmov d25,x11 + fmov d26,x12 + fmov d27,x13 + fmov d28,x14 + +.Ldo_neon: + ldp x8,x12,[x16],#16 // inp[2:3] (or zero) + ldp x9,x13,[x16],#48 + + lsl x3,x3,#24 + add x15,x0,#48 + +#ifdef __ARMEB__ + rev x8,x8 + rev x12,x12 + rev x9,x9 + rev x13,x13 +#endif + and x4,x8,#0x03ffffff // base 2^64 -> base 2^26 + and x5,x9,#0x03ffffff + ubfx x6,x8,#26,#26 + ubfx x7,x9,#26,#26 + add x4,x4,x5,lsl#32 // bfi x4,x5,#32,#32 + extr x8,x12,x8,#52 + extr x9,x13,x9,#52 + add x6,x6,x7,lsl#32 // bfi x6,x7,#32,#32 + fmov d14,x4 + and x8,x8,#0x03ffffff + and x9,x9,#0x03ffffff + ubfx x10,x12,#14,#26 + ubfx x11,x13,#14,#26 + add x12,x3,x12,lsr#40 + add x13,x3,x13,lsr#40 + add x8,x8,x9,lsl#32 // bfi x8,x9,#32,#32 + fmov d15,x6 + add x10,x10,x11,lsl#32 // bfi x10,x11,#32,#32 + add x12,x12,x13,lsl#32 // bfi x12,x13,#32,#32 + fmov d16,x8 + fmov d17,x10 + fmov d18,x12 + + ldp x8,x12,[x1],#16 // inp[0:1] + ldp x9,x13,[x1],#48 + + ld1 {v0.4s,v1.4s,v2.4s,v3.4s},[x15],#64 + ld1 {v4.4s,v5.4s,v6.4s,v7.4s},[x15],#64 + ld1 {v8.4s},[x15] + +#ifdef __ARMEB__ + rev x8,x8 + rev x12,x12 + rev x9,x9 + rev x13,x13 +#endif + and x4,x8,#0x03ffffff // base 2^64 -> base 2^26 + and x5,x9,#0x03ffffff + ubfx x6,x8,#26,#26 + ubfx x7,x9,#26,#26 + add x4,x4,x5,lsl#32 // bfi x4,x5,#32,#32 + extr x8,x12,x8,#52 + extr x9,x13,x9,#52 + add x6,x6,x7,lsl#32 // bfi x6,x7,#32,#32 + fmov d9,x4 + and x8,x8,#0x03ffffff + and x9,x9,#0x03ffffff + ubfx x10,x12,#14,#26 + ubfx x11,x13,#14,#26 + add x12,x3,x12,lsr#40 + add x13,x3,x13,lsr#40 + add x8,x8,x9,lsl#32 // bfi x8,x9,#32,#32 + fmov d10,x6 + add x10,x10,x11,lsl#32 // bfi x10,x11,#32,#32 + add x12,x12,x13,lsl#32 // bfi x12,x13,#32,#32 + movi v31.2d,#-1 + fmov d11,x8 + fmov d12,x10 + fmov d13,x12 + ushr v31.2d,v31.2d,#38 + + b.ls .Lskip_loop + +.align 4 +.Loop_neon: + //////////////////////////////////////////////////////////////// + // ((inp[0]*r^4+inp[2]*r^2+inp[4])*r^4+inp[6]*r^2 + // ((inp[1]*r^4+inp[3]*r^2+inp[5])*r^3+inp[7]*r + // ___________________/ + // ((inp[0]*r^4+inp[2]*r^2+inp[4])*r^4+inp[6]*r^2+inp[8])*r^2 + // ((inp[1]*r^4+inp[3]*r^2+inp[5])*r^4+inp[7]*r^2+inp[9])*r + // ___________________/ ____________________/ + // + // Note that we start with inp[2:3]*r^2. This is because it + // doesn't depend on reduction in previous iteration. 
+ //////////////////////////////////////////////////////////////// + // d4 = h0*r4 + h1*r3 + h2*r2 + h3*r1 + h4*r0 + // d3 = h0*r3 + h1*r2 + h2*r1 + h3*r0 + h4*5*r4 + // d2 = h0*r2 + h1*r1 + h2*r0 + h3*5*r4 + h4*5*r3 + // d1 = h0*r1 + h1*r0 + h2*5*r4 + h3*5*r3 + h4*5*r2 + // d0 = h0*r0 + h1*5*r4 + h2*5*r3 + h3*5*r2 + h4*5*r1 + + subs x2,x2,#64 + umull v23.2d,v14.2s,v7.s[2] + csel x16,x17,x16,lo + umull v22.2d,v14.2s,v5.s[2] + umull v21.2d,v14.2s,v3.s[2] + ldp x8,x12,[x16],#16 // inp[2:3] (or zero) + umull v20.2d,v14.2s,v1.s[2] + ldp x9,x13,[x16],#48 + umull v19.2d,v14.2s,v0.s[2] +#ifdef __ARMEB__ + rev x8,x8 + rev x12,x12 + rev x9,x9 + rev x13,x13 +#endif + + umlal v23.2d,v15.2s,v5.s[2] + and x4,x8,#0x03ffffff // base 2^64 -> base 2^26 + umlal v22.2d,v15.2s,v3.s[2] + and x5,x9,#0x03ffffff + umlal v21.2d,v15.2s,v1.s[2] + ubfx x6,x8,#26,#26 + umlal v20.2d,v15.2s,v0.s[2] + ubfx x7,x9,#26,#26 + umlal v19.2d,v15.2s,v8.s[2] + add x4,x4,x5,lsl#32 // bfi x4,x5,#32,#32 + + umlal v23.2d,v16.2s,v3.s[2] + extr x8,x12,x8,#52 + umlal v22.2d,v16.2s,v1.s[2] + extr x9,x13,x9,#52 + umlal v21.2d,v16.2s,v0.s[2] + add x6,x6,x7,lsl#32 // bfi x6,x7,#32,#32 + umlal v20.2d,v16.2s,v8.s[2] + fmov d14,x4 + umlal v19.2d,v16.2s,v6.s[2] + and x8,x8,#0x03ffffff + + umlal v23.2d,v17.2s,v1.s[2] + and x9,x9,#0x03ffffff + umlal v22.2d,v17.2s,v0.s[2] + ubfx x10,x12,#14,#26 + umlal v21.2d,v17.2s,v8.s[2] + ubfx x11,x13,#14,#26 + umlal v20.2d,v17.2s,v6.s[2] + add x8,x8,x9,lsl#32 // bfi x8,x9,#32,#32 + umlal v19.2d,v17.2s,v4.s[2] + fmov d15,x6 + + add v11.2s,v11.2s,v26.2s + add x12,x3,x12,lsr#40 + umlal v23.2d,v18.2s,v0.s[2] + add x13,x3,x13,lsr#40 + umlal v22.2d,v18.2s,v8.s[2] + add x10,x10,x11,lsl#32 // bfi x10,x11,#32,#32 + umlal v21.2d,v18.2s,v6.s[2] + add x12,x12,x13,lsl#32 // bfi x12,x13,#32,#32 + umlal v20.2d,v18.2s,v4.s[2] + fmov d16,x8 + umlal v19.2d,v18.2s,v2.s[2] + fmov d17,x10 + + //////////////////////////////////////////////////////////////// + // (hash+inp[0:1])*r^4 and accumulate + + add v9.2s,v9.2s,v24.2s + fmov d18,x12 + umlal v22.2d,v11.2s,v1.s[0] + ldp x8,x12,[x1],#16 // inp[0:1] + umlal v19.2d,v11.2s,v6.s[0] + ldp x9,x13,[x1],#48 + umlal v23.2d,v11.2s,v3.s[0] + umlal v20.2d,v11.2s,v8.s[0] + umlal v21.2d,v11.2s,v0.s[0] +#ifdef __ARMEB__ + rev x8,x8 + rev x12,x12 + rev x9,x9 + rev x13,x13 +#endif + + add v10.2s,v10.2s,v25.2s + umlal v22.2d,v9.2s,v5.s[0] + umlal v23.2d,v9.2s,v7.s[0] + and x4,x8,#0x03ffffff // base 2^64 -> base 2^26 + umlal v21.2d,v9.2s,v3.s[0] + and x5,x9,#0x03ffffff + umlal v19.2d,v9.2s,v0.s[0] + ubfx x6,x8,#26,#26 + umlal v20.2d,v9.2s,v1.s[0] + ubfx x7,x9,#26,#26 + + add v12.2s,v12.2s,v27.2s + add x4,x4,x5,lsl#32 // bfi x4,x5,#32,#32 + umlal v22.2d,v10.2s,v3.s[0] + extr x8,x12,x8,#52 + umlal v23.2d,v10.2s,v5.s[0] + extr x9,x13,x9,#52 + umlal v19.2d,v10.2s,v8.s[0] + add x6,x6,x7,lsl#32 // bfi x6,x7,#32,#32 + umlal v21.2d,v10.2s,v1.s[0] + fmov d9,x4 + umlal v20.2d,v10.2s,v0.s[0] + and x8,x8,#0x03ffffff + + add v13.2s,v13.2s,v28.2s + and x9,x9,#0x03ffffff + umlal v22.2d,v12.2s,v0.s[0] + ubfx x10,x12,#14,#26 + umlal v19.2d,v12.2s,v4.s[0] + ubfx x11,x13,#14,#26 + umlal v23.2d,v12.2s,v1.s[0] + add x8,x8,x9,lsl#32 // bfi x8,x9,#32,#32 + umlal v20.2d,v12.2s,v6.s[0] + fmov d10,x6 + umlal v21.2d,v12.2s,v8.s[0] + add x12,x3,x12,lsr#40 + + umlal v22.2d,v13.2s,v8.s[0] + add x13,x3,x13,lsr#40 + umlal v19.2d,v13.2s,v2.s[0] + add x10,x10,x11,lsl#32 // bfi x10,x11,#32,#32 + umlal v23.2d,v13.2s,v0.s[0] + add x12,x12,x13,lsl#32 // bfi x12,x13,#32,#32 + umlal v20.2d,v13.2s,v4.s[0] + fmov d11,x8 + umlal 
v21.2d,v13.2s,v6.s[0] + fmov d12,x10 + fmov d13,x12 + + ///////////////////////////////////////////////////////////////// + // lazy reduction as discussed in "NEON crypto" by D.J. Bernstein + // and P. Schwabe + // + // [see discussion in poly1305-armv4 module] + + ushr v29.2d,v22.2d,#26 + xtn v27.2s,v22.2d + ushr v30.2d,v19.2d,#26 + and v19.16b,v19.16b,v31.16b + add v23.2d,v23.2d,v29.2d // h3 -> h4 + bic v27.2s,#0xfc,lsl#24 // &=0x03ffffff + add v20.2d,v20.2d,v30.2d // h0 -> h1 + + ushr v29.2d,v23.2d,#26 + xtn v28.2s,v23.2d + ushr v30.2d,v20.2d,#26 + xtn v25.2s,v20.2d + bic v28.2s,#0xfc,lsl#24 + add v21.2d,v21.2d,v30.2d // h1 -> h2 + + add v19.2d,v19.2d,v29.2d + shl v29.2d,v29.2d,#2 + shrn v30.2s,v21.2d,#26 + xtn v26.2s,v21.2d + add v19.2d,v19.2d,v29.2d // h4 -> h0 + bic v25.2s,#0xfc,lsl#24 + add v27.2s,v27.2s,v30.2s // h2 -> h3 + bic v26.2s,#0xfc,lsl#24 + + shrn v29.2s,v19.2d,#26 + xtn v24.2s,v19.2d + ushr v30.2s,v27.2s,#26 + bic v27.2s,#0xfc,lsl#24 + bic v24.2s,#0xfc,lsl#24 + add v25.2s,v25.2s,v29.2s // h0 -> h1 + add v28.2s,v28.2s,v30.2s // h3 -> h4 + + b.hi .Loop_neon + +.Lskip_loop: + dup v16.2d,v16.d[0] + add v11.2s,v11.2s,v26.2s + + //////////////////////////////////////////////////////////////// + // multiply (inp[0:1]+hash) or inp[2:3] by r^2:r^1 + + adds x2,x2,#32 + b.ne .Long_tail + + dup v16.2d,v11.d[0] + add v14.2s,v9.2s,v24.2s + add v17.2s,v12.2s,v27.2s + add v15.2s,v10.2s,v25.2s + add v18.2s,v13.2s,v28.2s + +.Long_tail: + dup v14.2d,v14.d[0] + umull2 v19.2d,v16.4s,v6.4s + umull2 v22.2d,v16.4s,v1.4s + umull2 v23.2d,v16.4s,v3.4s + umull2 v21.2d,v16.4s,v0.4s + umull2 v20.2d,v16.4s,v8.4s + + dup v15.2d,v15.d[0] + umlal2 v19.2d,v14.4s,v0.4s + umlal2 v21.2d,v14.4s,v3.4s + umlal2 v22.2d,v14.4s,v5.4s + umlal2 v23.2d,v14.4s,v7.4s + umlal2 v20.2d,v14.4s,v1.4s + + dup v17.2d,v17.d[0] + umlal2 v19.2d,v15.4s,v8.4s + umlal2 v22.2d,v15.4s,v3.4s + umlal2 v21.2d,v15.4s,v1.4s + umlal2 v23.2d,v15.4s,v5.4s + umlal2 v20.2d,v15.4s,v0.4s + + dup v18.2d,v18.d[0] + umlal2 v22.2d,v17.4s,v0.4s + umlal2 v23.2d,v17.4s,v1.4s + umlal2 v19.2d,v17.4s,v4.4s + umlal2 v20.2d,v17.4s,v6.4s + umlal2 v21.2d,v17.4s,v8.4s + + umlal2 v22.2d,v18.4s,v8.4s + umlal2 v19.2d,v18.4s,v2.4s + umlal2 v23.2d,v18.4s,v0.4s + umlal2 v20.2d,v18.4s,v4.4s + umlal2 v21.2d,v18.4s,v6.4s + + b.eq .Lshort_tail + + //////////////////////////////////////////////////////////////// + // (hash+inp[0:1])*r^4:r^3 and accumulate + + add v9.2s,v9.2s,v24.2s + umlal v22.2d,v11.2s,v1.2s + umlal v19.2d,v11.2s,v6.2s + umlal v23.2d,v11.2s,v3.2s + umlal v20.2d,v11.2s,v8.2s + umlal v21.2d,v11.2s,v0.2s + + add v10.2s,v10.2s,v25.2s + umlal v22.2d,v9.2s,v5.2s + umlal v19.2d,v9.2s,v0.2s + umlal v23.2d,v9.2s,v7.2s + umlal v20.2d,v9.2s,v1.2s + umlal v21.2d,v9.2s,v3.2s + + add v12.2s,v12.2s,v27.2s + umlal v22.2d,v10.2s,v3.2s + umlal v19.2d,v10.2s,v8.2s + umlal v23.2d,v10.2s,v5.2s + umlal v20.2d,v10.2s,v0.2s + umlal v21.2d,v10.2s,v1.2s + + add v13.2s,v13.2s,v28.2s + umlal v22.2d,v12.2s,v0.2s + umlal v19.2d,v12.2s,v4.2s + umlal v23.2d,v12.2s,v1.2s + umlal v20.2d,v12.2s,v6.2s + umlal v21.2d,v12.2s,v8.2s + + umlal v22.2d,v13.2s,v8.2s + umlal v19.2d,v13.2s,v2.2s + umlal v23.2d,v13.2s,v0.2s + umlal v20.2d,v13.2s,v4.2s + umlal v21.2d,v13.2s,v6.2s + +.Lshort_tail: + //////////////////////////////////////////////////////////////// + // horizontal add + + addp v22.2d,v22.2d,v22.2d + ldp d8,d9,[sp,#16] // meet ABI requirements + addp v19.2d,v19.2d,v19.2d + ldp d10,d11,[sp,#32] + addp v23.2d,v23.2d,v23.2d + ldp d12,d13,[sp,#48] + addp v20.2d,v20.2d,v20.2d + ldp 
d14,d15,[sp,#64] + addp v21.2d,v21.2d,v21.2d + + //////////////////////////////////////////////////////////////// + // lazy reduction, but without narrowing + + ushr v29.2d,v22.2d,#26 + and v22.16b,v22.16b,v31.16b + ushr v30.2d,v19.2d,#26 + and v19.16b,v19.16b,v31.16b + + add v23.2d,v23.2d,v29.2d // h3 -> h4 + add v20.2d,v20.2d,v30.2d // h0 -> h1 + + ushr v29.2d,v23.2d,#26 + and v23.16b,v23.16b,v31.16b + ushr v30.2d,v20.2d,#26 + and v20.16b,v20.16b,v31.16b + add v21.2d,v21.2d,v30.2d // h1 -> h2 + + add v19.2d,v19.2d,v29.2d + shl v29.2d,v29.2d,#2 + ushr v30.2d,v21.2d,#26 + and v21.16b,v21.16b,v31.16b + add v19.2d,v19.2d,v29.2d // h4 -> h0 + add v22.2d,v22.2d,v30.2d // h2 -> h3 + + ushr v29.2d,v19.2d,#26 + and v19.16b,v19.16b,v31.16b + ushr v30.2d,v22.2d,#26 + and v22.16b,v22.16b,v31.16b + add v20.2d,v20.2d,v29.2d // h0 -> h1 + add v23.2d,v23.2d,v30.2d // h3 -> h4 + + //////////////////////////////////////////////////////////////// + // write the result, can be partially reduced + + st4 {v19.s,v20.s,v21.s,v22.s}[0],[x0],#16 + st1 {v23.s}[0],[x0] + +.Lno_data_neon: + ldr x29,[sp],#80 + ret +.size poly1305_blocks_neon,.-poly1305_blocks_neon + +.type poly1305_emit_neon,%function +.align 5 +poly1305_emit_neon: + ldr x17,[x0,#24] + cbz x17,poly1305_emit + + ldp w10,w11,[x0] // load hash value base 2^26 + ldp w12,w13,[x0,#8] + ldr w14,[x0,#16] + + add x4,x10,x11,lsl#26 // base 2^26 -> base 2^64 + lsr x5,x12,#12 + adds x4,x4,x12,lsl#52 + add x5,x5,x13,lsl#14 + adc x5,x5,xzr + lsr x6,x14,#24 + adds x5,x5,x14,lsl#40 + adc x6,x6,xzr // can be partially reduced... + + ldp x10,x11,[x2] // load nonce + + and x12,x6,#-4 // ... so reduce + add x12,x12,x6,lsr#2 + and x6,x6,#3 + adds x4,x4,x12 + adcs x5,x5,xzr + adc x6,x6,xzr + + adds x12,x4,#5 // compare to modulus + adcs x13,x5,xzr + adc x14,x6,xzr + + tst x14,#-4 // see if it's carried/borrowed + + csel x4,x4,x12,eq + csel x5,x5,x13,eq + +#ifdef __ARMEB__ + ror x10,x10,#32 // flip nonce words + ror x11,x11,#32 +#endif + adds x4,x4,x10 // accumulate nonce + adc x5,x5,x11 +#ifdef __ARMEB__ + rev x4,x4 // flip output bytes + rev x5,x5 +#endif + stp x4,x5,[x1] // write result + + ret +.size poly1305_emit_neon,.-poly1305_emit_neon + +.align 5 +.Lzeros: +.long 0,0,0,0,0,0,0,0 +.LOPENSSL_armcap_P: +#ifdef __ILP32__ +.long OPENSSL_armcap_P-. +#else +.quad OPENSSL_armcap_P-. +#endif +.byte 80,111,108,121,49,51,48,53,32,102,111,114,32,65,82,77,118,56,44,32,67,82,89,80,84,79,71,65,77,83,32,98,121,32,60,97,112,112,114,111,64,111,112,101,110,115,115,108,46,111,114,103,62,0 +.align 2 +.align 2 From patchwork Sat Oct 6 02:56:56 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Jason A. 
Donenfeld" X-Patchwork-Id: 148312 Delivered-To: patch@linaro.org Received: by 2002:a2e:8595:0:0:0:0:0 with SMTP id b21-v6csp1157681lji; Fri, 5 Oct 2018 19:58:22 -0700 (PDT) X-Google-Smtp-Source: ACcGV63PfCdZopvPvm+NWkU742FXSfjBoE1VHgeCrfyD4ZLmvjnkoQ9FJt+sC3y90YxiA+6doakC X-Received: by 2002:a17:902:29e3:: with SMTP id h90-v6mr14303779plb.215.1538794702721; Fri, 05 Oct 2018 19:58:22 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1538794702; cv=none; d=google.com; s=arc-20160816; b=KMFHNCpnxlT3lOu9P/f2a3ewSUaVlrvvuovLVExqQDDvYgE09qGT3QqSvIfR/waAr7 S30kV8XM+OEoDEnbSq628prRilS3X5CSONNc78Lzd6rQXFx7Eq5NF5EKEGUFYZAvizQL lBrkyD6n98AO7Bh7ZHGZD+sZ/SeLe1PtVS+V8er+EKBJdkDkeeFFs37hooSa5fbMSYZg kIgjvJ2NmNY1QyDrFEbhgI2WCPirUJEiJ/nV688ajdVy8yjDzvHIFbvJYmV1OLwpTo80 nJ/mjl+geyOU7dckAJZb39gMU3QURXfbi2cwzztAP3tKqEy617Gz5nT4qNQ7DEEwoRL+ K1RQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from :dkim-signature; bh=fo+8udEAZmrSqVhybajY2AwZMiTy4462VCDhFKMj2v8=; b=ey0SIQv8yWKUiPlff45YwHuLi/pe7mjgkK61lzFT/CNsBA6/AWAoVyhcusBCxXF+QS Zcv+AftHYGoROlMk50zZXKC40KoAFMM5dEirYJkCShplBven4/bIEe0gR0+GljQIHQdU tRetRApEst+5Ix4+Bu8a3/5s3qh+JxMV06mf18M4ZsYK01a3CMAF23+MkTIey956OrJG ghVLWuCEI6HtFsDPbIlvr/e1oe966J0bih+dHk6t3Ry46NELNGkeJ6PCELAw34EHXaF7 FC6BrPGBrDoVxIR0bExCwVxSw5jLJHw9P+OiQ3UXHPg14vW1qb5O1a/JlW5eyp+5qGwG +Vng== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@zx2c4.com header.s=mail header.b=2lOXaSCG; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=zx2c4.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. 
From: "Jason A. Donenfeld"
To: linux-kernel@vger.kernel.org, netdev@vger.kernel.org, davem@davemloft.net, gregkh@linuxfoundation.org
Cc: "Jason A. Donenfeld" , Russell King , linux-arm-kernel@lists.infradead.org, Samuel Neves , Jean-Philippe Aumasson , Andy Lutomirski , Andrew Morton , Linus Torvalds , kernel-hardening@lists.openwall.com, linux-crypto@vger.kernel.org
Subject: [PATCH net-next v7 15/28] zinc: Poly1305 ARM and ARM64 implementations
Date: Sat, 6 Oct 2018 04:56:56 +0200
Message-Id: <20181006025709.4019-16-Jason@zx2c4.com>
In-Reply-To: <20181006025709.4019-1-Jason@zx2c4.com>
References: <20181006025709.4019-1-Jason@zx2c4.com>

These wire Andy Polyakov's implementations up to the kernel. We make a few small changes to the assembly:

- Entries and exits use the proper kernel convention macro.
- CPU feature checking is done in C by the glue code, so it has been removed from the assembly.
- The function names have been renamed to fit kernel conventions.
- Labels have been renamed to fit kernel conventions.
- The NEON code can jump to the scalar code when it makes sense to do so.

The NEON code uses base 2^26, while the scalar code uses base 2^64 on 64-bit and base 2^32 on 32-bit. If we hit the unfortunate situation of using NEON and then having to go back to scalar -- because the user is silly and has called the update function from two separate contexts -- then we need to convert back to the original base before proceeding.
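To make the representation switch concrete, here is a minimal sketch of the limb repacking in isolation. This is an illustration only, not the kernel code: the helper name is hypothetical, and it assumes the five base-2^26 limbs are already fully carried (each below 2^26), whereas convert_to_base2_64() in the glue code below must first propagate carries, because the NEON representation allows limbs to grow wider than 26 bits.

#include <stdint.h>

/* Illustrative sketch only: repack a 130-bit Poly1305 accumulator from
 * five base-2^26 limbs (the NEON form) into two 64-bit words plus the
 * top two bits (the scalar base-2^64 form). Assumes each h26[i] < 2^26. */
static void repack_base2_26_to_base2_64(const uint32_t h26[5],
                                        uint64_t *h0, uint64_t *h1,
                                        uint64_t *h2)
{
        /* Bits 0..63: limb 0, limb 1, and the low 12 bits of limb 2. */
        *h0 = ((uint64_t)h26[2] << 52) | ((uint64_t)h26[1] << 26) | h26[0];
        /* Bits 64..127: the high 14 bits of limb 2, limb 3, and the low
         * 24 bits of limb 4. */
        *h1 = ((uint64_t)h26[4] << 40) | ((uint64_t)h26[3] << 14) |
              (h26[2] >> 12);
        /* Bits 128..129: the top two bits of limb 4. */
        *h2 = h26[4] >> 24;
}

These three packing expressions mirror the ones in convert_to_base2_64() below; the real routine additionally folds any bits at or above 2^130 back in by multiplying them by 5, and swaps 32-bit words for the big-endian 32-bit case.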
However, for an avoidance of doubt and because this is not performance critical, we do the full reduction anyway. This conversion is found in the glue code, and a proof of correctness may be easily obtained from Z3: . Signed-off-by: Jason A. Donenfeld Cc: Russell King Cc: linux-arm-kernel@lists.infradead.org Cc: Samuel Neves Cc: Jean-Philippe Aumasson Cc: Andy Lutomirski Cc: Greg KH Cc: Andrew Morton Cc: Linus Torvalds Cc: kernel-hardening@lists.openwall.com Cc: linux-crypto@vger.kernel.org --- lib/zinc/Makefile | 2 + lib/zinc/poly1305/poly1305-arm-glue.c | 140 +++++++++++++++++ ...ly1305-arm-cryptogams.S => poly1305-arm.S} | 147 ++++++------------ ...05-arm64-cryptogams.S => poly1305-arm64.S} | 127 +++++---------- lib/zinc/poly1305/poly1305.c | 2 + 5 files changed, 231 insertions(+), 187 deletions(-) create mode 100644 lib/zinc/poly1305/poly1305-arm-glue.c rename lib/zinc/poly1305/{poly1305-arm-cryptogams.S => poly1305-arm.S} (91%) rename lib/zinc/poly1305/{poly1305-arm64-cryptogams.S => poly1305-arm64.S} (89%) -- 2.19.0 diff --git a/lib/zinc/Makefile b/lib/zinc/Makefile index a8943d960b6a..c09fd3de60f9 100644 --- a/lib/zinc/Makefile +++ b/lib/zinc/Makefile @@ -12,4 +12,6 @@ obj-$(CONFIG_ZINC_CHACHA20) += zinc_chacha20.o zinc_poly1305-y := poly1305/poly1305.o zinc_poly1305-$(CONFIG_ZINC_ARCH_X86_64) += poly1305/poly1305-x86_64.o +zinc_poly1305-$(CONFIG_ZINC_ARCH_ARM) += poly1305/poly1305-arm.o +zinc_poly1305-$(CONFIG_ZINC_ARCH_ARM64) += poly1305/poly1305-arm64.o obj-$(CONFIG_ZINC_POLY1305) += zinc_poly1305.o diff --git a/lib/zinc/poly1305/poly1305-arm-glue.c b/lib/zinc/poly1305/poly1305-arm-glue.c new file mode 100644 index 000000000000..f4f08ecffbf6 --- /dev/null +++ b/lib/zinc/poly1305/poly1305-arm-glue.c @@ -0,0 +1,140 @@ +// SPDX-License-Identifier: GPL-2.0 OR MIT +/* + * Copyright (C) 2015-2018 Jason A. Donenfeld . All Rights Reserved. + */ + +#include +#include + +asmlinkage void poly1305_init_arm(void *ctx, const u8 key[16]); +asmlinkage void poly1305_blocks_arm(void *ctx, const u8 *inp, const size_t len, + const u32 padbit); +asmlinkage void poly1305_emit_arm(void *ctx, u8 mac[16], const u32 nonce[4]); +asmlinkage void poly1305_blocks_neon(void *ctx, const u8 *inp, const size_t len, + const u32 padbit); +asmlinkage void poly1305_emit_neon(void *ctx, u8 mac[16], const u32 nonce[4]); + +static bool poly1305_use_neon __ro_after_init; +static bool *const poly1305_nobs[] __initconst = { &poly1305_use_neon }; + +static void __init poly1305_fpu_init(void) +{ +#if defined(CONFIG_ZINC_ARCH_ARM64) + poly1305_use_neon = elf_hwcap & HWCAP_ASIMD; +#elif defined(CONFIG_ZINC_ARCH_ARM) + poly1305_use_neon = elf_hwcap & HWCAP_NEON; +#endif +} + +#if defined(CONFIG_ZINC_ARCH_ARM64) +struct poly1305_arch_internal { + union { + u32 h[5]; + struct { + u64 h0, h1, h2; + }; + }; + u64 is_base2_26; + u64 r[2]; +}; +#elif defined(CONFIG_ZINC_ARCH_ARM) +struct poly1305_arch_internal { + union { + u32 h[5]; + struct { + u64 h0, h1; + u32 h2; + } __packed; + }; + u32 r[4]; + u32 is_base2_26; +}; +#endif + +/* The NEON code uses base 2^26, while the scalar code uses base 2^64 on 64-bit + * and base 2^32 on 32-bit. If we hit the unfortunate situation of using NEON + * and then having to go back to scalar -- because the user is silly and has + * called the update function from two separate contexts -- then we need to + * convert back to the original base before proceeding. The below function is + * written for 64-bit integers, and so we have to swap words at the end on + * big-endian 32-bit. 
It is possible to reason that the initial reduction below + * is sufficient given the implementation invariants. However, for an avoidance + * of doubt and because this is not performance critical, we do the full + * reduction anyway. + */ +static void convert_to_base2_64(void *ctx) +{ + struct poly1305_arch_internal *state = ctx; + u32 cy; + + if (!IS_ENABLED(CONFIG_KERNEL_MODE_NEON) || !state->is_base2_26) + return; + + cy = state->h[0] >> 26; state->h[0] &= 0x3ffffff; state->h[1] += cy; + cy = state->h[1] >> 26; state->h[1] &= 0x3ffffff; state->h[2] += cy; + cy = state->h[2] >> 26; state->h[2] &= 0x3ffffff; state->h[3] += cy; + cy = state->h[3] >> 26; state->h[3] &= 0x3ffffff; state->h[4] += cy; + state->h0 = ((u64)state->h[2] << 52) | ((u64)state->h[1] << 26) | state->h[0]; + state->h1 = ((u64)state->h[4] << 40) | ((u64)state->h[3] << 14) | (state->h[2] >> 12); + state->h2 = state->h[4] >> 24; + if (IS_ENABLED(CONFIG_ZINC_ARCH_ARM) && IS_ENABLED(CONFIG_CPU_BIG_ENDIAN)) { + state->h0 = rol64(state->h0, 32); + state->h1 = rol64(state->h1, 32); + } +#define ULT(a, b) ((a ^ ((a ^ b) | ((a - b) ^ b))) >> (sizeof(a) * 8 - 1)) + cy = (state->h2 >> 2) + (state->h2 & ~3ULL); + state->h2 &= 3; + state->h0 += cy; + state->h1 += (cy = ULT(state->h0, cy)); + state->h2 += ULT(state->h1, cy); +#undef ULT + state->is_base2_26 = 0; +} + +static inline bool poly1305_init_arch(void *ctx, + const u8 key[POLY1305_KEY_SIZE]) +{ + poly1305_init_arm(ctx, key); + return true; +} + +static inline bool poly1305_blocks_arch(void *ctx, const u8 *inp, + size_t len, const u32 padbit, + simd_context_t *simd_context) +{ + /* SIMD disables preemption, so relax after processing each page. */ + BUILD_BUG_ON(PAGE_SIZE < POLY1305_BLOCK_SIZE || + PAGE_SIZE % POLY1305_BLOCK_SIZE); + + if (!IS_ENABLED(CONFIG_KERNEL_MODE_NEON) || !poly1305_use_neon || + !simd_use(simd_context)) { + convert_to_base2_64(ctx); + poly1305_blocks_arm(ctx, inp, len, padbit); + return true; + } + + for (;;) { + const size_t bytes = min_t(size_t, len, PAGE_SIZE); + + poly1305_blocks_neon(ctx, inp, bytes, padbit); + len -= bytes; + if (!len) + break; + inp += bytes; + simd_relax(simd_context); + } + return true; +} + +static inline bool poly1305_emit_arch(void *ctx, u8 mac[POLY1305_MAC_SIZE], + const u32 nonce[4], + simd_context_t *simd_context) +{ + if (!IS_ENABLED(CONFIG_KERNEL_MODE_NEON) || !poly1305_use_neon || + !simd_use(simd_context)) { + convert_to_base2_64(ctx); + poly1305_emit_arm(ctx, mac, nonce); + } else + poly1305_emit_neon(ctx, mac, nonce); + return true; +} diff --git a/lib/zinc/poly1305/poly1305-arm-cryptogams.S b/lib/zinc/poly1305/poly1305-arm.S similarity index 91% rename from lib/zinc/poly1305/poly1305-arm-cryptogams.S rename to lib/zinc/poly1305/poly1305-arm.S index 884b465030e4..4a0e9d451119 100644 --- a/lib/zinc/poly1305/poly1305-arm-cryptogams.S +++ b/lib/zinc/poly1305/poly1305-arm.S @@ -1,9 +1,12 @@ /* SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause */ /* + * Copyright (C) 2015-2018 Jason A. Donenfeld . All Rights Reserved. * Copyright (C) 2006-2017 CRYPTOGAMS by . All Rights Reserved. + * + * This is based in part on Andy Polyakov's implementation from CRYPTOGAMS. 
*/ -#include "arm_arch.h" +#include .text #if defined(__thumb2__) @@ -13,13 +16,8 @@ .code 32 #endif -.globl poly1305_emit -.globl poly1305_blocks -.globl poly1305_init -.type poly1305_init,%function .align 5 -poly1305_init: -.Lpoly1305_init: +ENTRY(poly1305_init_arm) stmdb sp!,{r4-r11} eor r3,r3,r3 @@ -38,10 +36,6 @@ poly1305_init: moveq r0,#0 beq .Lno_key -#if __ARM_MAX_ARCH__>=7 - adr r11,.Lpoly1305_init - ldr r12,.LOPENSSL_armcap -#endif ldrb r4,[r1,#0] mov r10,#0x0fffffff ldrb r5,[r1,#1] @@ -56,12 +50,6 @@ poly1305_init: ldrb r7,[r1,#6] and r4,r4,r10 -#if __ARM_MAX_ARCH__>=7 - ldr r12,[r11,r12] @ OPENSSL_armcap_P -# ifdef __APPLE__ - ldr r12,[r12] -# endif -#endif ldrb r8,[r1,#7] orr r5,r5,r6,lsl#8 ldrb r6,[r1,#8] @@ -71,35 +59,6 @@ poly1305_init: ldrb r8,[r1,#10] and r5,r5,r3 -#if __ARM_MAX_ARCH__>=7 - tst r12,#ARMV7_NEON @ check for NEON -# ifdef __APPLE__ - adr r9,poly1305_blocks_neon - adr r11,poly1305_blocks -# ifdef __thumb2__ - it ne -# endif - movne r11,r9 - adr r12,poly1305_emit - adr r10,poly1305_emit_neon -# ifdef __thumb2__ - it ne -# endif - movne r12,r10 -# else -# ifdef __thumb2__ - itete eq -# endif - addeq r12,r11,#(poly1305_emit-.Lpoly1305_init) - addne r12,r11,#(poly1305_emit_neon-.Lpoly1305_init) - addeq r11,r11,#(poly1305_blocks-.Lpoly1305_init) - addne r11,r11,#(poly1305_blocks_neon-.Lpoly1305_init) -# endif -# ifdef __thumb2__ - orr r12,r12,#1 @ thumb-ify address - orr r11,r11,#1 -# endif -#endif ldrb r9,[r1,#11] orr r6,r6,r7,lsl#8 ldrb r7,[r1,#12] @@ -118,26 +77,20 @@ poly1305_init: str r6,[r0,#8] and r7,r7,r3 str r7,[r0,#12] -#if __ARM_MAX_ARCH__>=7 - stmia r2,{r11,r12} @ fill functions table - mov r0,#1 -#else - mov r0,#0 -#endif .Lno_key: ldmia sp!,{r4-r11} -#if __ARM_ARCH__>=5 +#if __LINUX_ARM_ARCH__ >= 5 bx lr @ bx lr #else tst lr,#1 moveq pc,lr @ be binary compatible with V4, yet .word 0xe12fff1e @ interoperable with Thumb ISA:-) #endif -.size poly1305_init,.-poly1305_init -.type poly1305_blocks,%function +ENDPROC(poly1305_init_arm) + .align 5 -poly1305_blocks: -.Lpoly1305_blocks: +ENTRY(poly1305_blocks_arm) +.Lpoly1305_blocks_arm: stmdb sp!,{r3-r11,lr} ands r2,r2,#-16 @@ -158,11 +111,11 @@ poly1305_blocks: b .Loop .Loop: -#if __ARM_ARCH__<7 +#if __LINUX_ARM_ARCH__ < 7 ldrb r0,[lr],#16 @ load input -# ifdef __thumb2__ +#ifdef __thumb2__ it hi -# endif +#endif addhi r8,r8,#1 @ 1<<128 ldrb r1,[lr,#-15] ldrb r2,[lr,#-14] @@ -201,19 +154,19 @@ poly1305_blocks: orr r3,r2,r3,lsl#24 #else ldr r0,[lr],#16 @ load input -# ifdef __thumb2__ +#ifdef __thumb2__ it hi -# endif +#endif addhi r8,r8,#1 @ padbit ldr r1,[lr,#-12] ldr r2,[lr,#-8] ldr r3,[lr,#-4] -# ifdef __ARMEB__ +#ifdef __ARMEB__ rev r0,r0 rev r1,r1 rev r2,r2 rev r3,r3 -# endif +#endif adds r4,r4,r0 @ accumulate input str lr,[sp,#8] @ offload input pointer adcs r5,r5,r1 @@ -283,7 +236,7 @@ poly1305_blocks: stmia r0,{r4-r8} @ store the result .Lno_data: -#if __ARM_ARCH__>=5 +#if __LINUX_ARM_ARCH__ >= 5 ldmia sp!,{r3-r11,pc} #else ldmia sp!,{r3-r11,lr} @@ -291,13 +244,12 @@ poly1305_blocks: moveq pc,lr @ be binary compatible with V4, yet .word 0xe12fff1e @ interoperable with Thumb ISA:-) #endif -.size poly1305_blocks,.-poly1305_blocks -.type poly1305_emit,%function +ENDPROC(poly1305_blocks_arm) + .align 5 -poly1305_emit: +ENTRY(poly1305_emit_arm) stmdb sp!,{r4-r11} .Lpoly1305_emit_enter: - ldmia r0,{r3-r7} adds r8,r3,#5 @ compare to modulus adcs r9,r4,#0 @@ -332,13 +284,13 @@ poly1305_emit: adcs r5,r5,r10 adc r6,r6,r11 -#if __ARM_ARCH__>=7 -# ifdef __ARMEB__ +#if __LINUX_ARM_ARCH__ >= 7 +#ifdef __ARMEB__ rev 
r3,r3 rev r4,r4 rev r5,r5 rev r6,r6 -# endif +#endif str r3,[r1,#0] str r4,[r1,#4] str r5,[r1,#8] @@ -377,20 +329,22 @@ poly1305_emit: strb r6,[r1,#15] #endif ldmia sp!,{r4-r11} -#if __ARM_ARCH__>=5 +#if __LINUX_ARM_ARCH__ >= 5 bx lr @ bx lr #else tst lr,#1 moveq pc,lr @ be binary compatible with V4, yet .word 0xe12fff1e @ interoperable with Thumb ISA:-) #endif -.size poly1305_emit,.-poly1305_emit -#if __ARM_MAX_ARCH__>=7 +ENDPROC(poly1305_emit_arm) + + +#ifdef CONFIG_KERNEL_MODE_NEON .fpu neon -.type poly1305_init_neon,%function .align 5 -poly1305_init_neon: +ENTRY(poly1305_init_neon) +.Lpoly1305_init_neon: ldr r4,[r0,#20] @ load key base 2^32 ldr r5,[r0,#24] ldr r6,[r0,#28] @@ -600,11 +554,10 @@ poly1305_init_neon: vst1.32 {d8[1]},[r7] bx lr @ bx lr -.size poly1305_init_neon,.-poly1305_init_neon +ENDPROC(poly1305_init_neon) -.type poly1305_blocks_neon,%function .align 5 -poly1305_blocks_neon: +ENTRY(poly1305_blocks_neon) ldr ip,[r0,#36] @ is_base2_26 ands r2,r2,#-16 beq .Lno_data_neon @@ -612,7 +565,7 @@ poly1305_blocks_neon: cmp r2,#64 bhs .Lenter_neon tst ip,ip @ is_base2_26? - beq .Lpoly1305_blocks + beq .Lpoly1305_blocks_arm .Lenter_neon: stmdb sp!,{r4-r7} @@ -622,7 +575,7 @@ poly1305_blocks_neon: bne .Lbase2_26_neon stmdb sp!,{r1-r3,lr} - bl poly1305_init_neon + bl .Lpoly1305_init_neon ldr r4,[r0,#0] @ load hash value base 2^32 ldr r5,[r0,#4] @@ -686,12 +639,12 @@ poly1305_blocks_neon: sub r2,r2,#16 add r4,r1,#32 -# ifdef __ARMEB__ +#ifdef __ARMEB__ vrev32.8 q10,q10 vrev32.8 q13,q13 vrev32.8 q11,q11 vrev32.8 q12,q12 -# endif +#endif vsri.u32 d28,d26,#8 @ base 2^32 -> base 2^26 vshl.u32 d26,d26,#18 @@ -735,12 +688,12 @@ poly1305_blocks_neon: addhi r7,r0,#(48+1*9*4) addhi r6,r0,#(48+3*9*4) -# ifdef __ARMEB__ +#ifdef __ARMEB__ vrev32.8 q10,q10 vrev32.8 q13,q13 vrev32.8 q11,q11 vrev32.8 q12,q12 -# endif +#endif vsri.u32 q14,q13,#8 @ base 2^32 -> base 2^26 vshl.u32 q13,q13,#18 @@ -866,12 +819,12 @@ poly1305_blocks_neon: vld4.32 {d20,d22,d24,d26},[r1] @ inp[0:1] add r1,r1,#64 -# ifdef __ARMEB__ +#ifdef __ARMEB__ vrev32.8 q10,q10 vrev32.8 q11,q11 vrev32.8 q12,q12 vrev32.8 q13,q13 -# endif +#endif @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ @ lazy reduction interleaved with base 2^32 -> base 2^26 of @@ -1086,11 +1039,10 @@ poly1305_blocks_neon: ldmia sp!,{r4-r7} .Lno_data_neon: bx lr @ bx lr -.size poly1305_blocks_neon,.-poly1305_blocks_neon +ENDPROC(poly1305_blocks_neon) -.type poly1305_emit_neon,%function .align 5 -poly1305_emit_neon: +ENTRY(poly1305_emit_neon) ldr ip,[r0,#36] @ is_base2_26 stmdb sp!,{r4-r11} @@ -1144,12 +1096,12 @@ poly1305_emit_neon: adcs r5,r5,r10 adc r6,r6,r11 -# ifdef __ARMEB__ +#ifdef __ARMEB__ rev r3,r3 rev r4,r4 rev r5,r5 rev r6,r6 -# endif +#endif str r3,[r1,#0] @ store the result str r4,[r1,#4] str r5,[r1,#8] @@ -1157,16 +1109,9 @@ poly1305_emit_neon: ldmia sp!,{r4-r11} bx lr @ bx lr -.size poly1305_emit_neon,.-poly1305_emit_neon +ENDPROC(poly1305_emit_neon) .align 5 .Lzeros: .long 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 -.LOPENSSL_armcap: -.word OPENSSL_armcap_P-.Lpoly1305_init -#endif -.asciz "Poly1305 for ARMv4/NEON, CRYPTOGAMS by " -.align 2 -#if __ARM_MAX_ARCH__>=7 -.comm OPENSSL_armcap_P,4,4 #endif diff --git a/lib/zinc/poly1305/poly1305-arm64-cryptogams.S b/lib/zinc/poly1305/poly1305-arm64.S similarity index 89% rename from lib/zinc/poly1305/poly1305-arm64-cryptogams.S rename to lib/zinc/poly1305/poly1305-arm64.S index 0ecb50a83ec0..5f4e7fb0a836 100644 --- a/lib/zinc/poly1305/poly1305-arm64-cryptogams.S +++ 
b/lib/zinc/poly1305/poly1305-arm64.S @@ -1,21 +1,16 @@ /* SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause */ /* + * Copyright (C) 2015-2018 Jason A. Donenfeld . All Rights Reserved. * Copyright (C) 2006-2017 CRYPTOGAMS by . All Rights Reserved. + * + * This is based in part on Andy Polyakov's implementation from CRYPTOGAMS. */ -#include "arm_arch.h" - +#include .text -// forward "declarations" are required for Apple - -.globl poly1305_blocks -.globl poly1305_emit - -.globl poly1305_init -.type poly1305_init,%function .align 5 -poly1305_init: +ENTRY(poly1305_init_arm) cmp x1,xzr stp xzr,xzr,[x0] // zero hash value stp xzr,xzr,[x0,#16] // [along with is_base2_26] @@ -23,18 +18,10 @@ poly1305_init: csel x0,xzr,x0,eq b.eq .Lno_key -#ifdef __ILP32__ - ldrsw x11,.LOPENSSL_armcap_P -#else - ldr x11,.LOPENSSL_armcap_P -#endif - adr x10,.LOPENSSL_armcap_P - ldp x7,x8,[x1] // load key mov x9,#0xfffffffc0fffffff movk x9,#0x0fff,lsl#48 - ldr w17,[x10,x11] -#ifdef __ARMEB__ +#ifdef __AARCH64EB__ rev x7,x7 // flip bytes rev x8,x8 #endif @@ -43,30 +30,12 @@ poly1305_init: and x8,x8,x9 // &=0ffffffc0ffffffc stp x7,x8,[x0,#32] // save key value - tst w17,#ARMV7_NEON - - adr x12,poly1305_blocks - adr x7,poly1305_blocks_neon - adr x13,poly1305_emit - adr x8,poly1305_emit_neon - - csel x12,x12,x7,eq - csel x13,x13,x8,eq - -#ifdef __ILP32__ - stp w12,w13,[x2] -#else - stp x12,x13,[x2] -#endif - - mov x0,#1 .Lno_key: ret -.size poly1305_init,.-poly1305_init +ENDPROC(poly1305_init_arm) -.type poly1305_blocks,%function .align 5 -poly1305_blocks: +ENTRY(poly1305_blocks_arm) ands x2,x2,#-16 b.eq .Lno_data @@ -80,7 +49,7 @@ poly1305_blocks: .Loop: ldp x10,x11,[x1],#16 // load input sub x2,x2,#16 -#ifdef __ARMEB__ +#ifdef __AARCH64EB__ rev x10,x10 rev x11,x11 #endif @@ -126,11 +95,10 @@ poly1305_blocks: .Lno_data: ret -.size poly1305_blocks,.-poly1305_blocks +ENDPROC(poly1305_blocks_arm) -.type poly1305_emit,%function .align 5 -poly1305_emit: +ENTRY(poly1305_emit_arm) ldp x4,x5,[x0] // load hash base 2^64 ldr x6,[x0,#16] ldp x10,x11,[x2] // load nonce @@ -144,23 +112,23 @@ poly1305_emit: csel x4,x4,x12,eq csel x5,x5,x13,eq -#ifdef __ARMEB__ +#ifdef __AARCH64EB__ ror x10,x10,#32 // flip nonce words ror x11,x11,#32 #endif adds x4,x4,x10 // accumulate nonce adc x5,x5,x11 -#ifdef __ARMEB__ +#ifdef __AARCH64EB__ rev x4,x4 // flip output bytes rev x5,x5 #endif stp x4,x5,[x1] // write result ret -.size poly1305_emit,.-poly1305_emit -.type poly1305_mult,%function +ENDPROC(poly1305_emit_arm) + .align 5 -poly1305_mult: +__poly1305_mult: mul x12,x4,x7 // h0*r0 umulh x13,x4,x7 @@ -193,11 +161,8 @@ poly1305_mult: adc x6,x6,xzr ret -.size poly1305_mult,.-poly1305_mult -.type poly1305_splat,%function -.align 5 -poly1305_splat: +__poly1305_splat: and x12,x4,#0x03ffffff // base 2^64 -> base 2^26 ubfx x13,x4,#26,#26 extr x14,x5,x4,#52 @@ -220,15 +185,14 @@ poly1305_splat: str w15,[x0,#16*8] // s4 ret -.size poly1305_splat,.-poly1305_splat -.type poly1305_blocks_neon,%function +#ifdef CONFIG_KERNEL_MODE_NEON .align 5 -poly1305_blocks_neon: +ENTRY(poly1305_blocks_neon) ldr x17,[x0,#24] cmp x2,#128 b.hs .Lblocks_neon - cbz x17,poly1305_blocks + cbz x17,poly1305_blocks_arm .Lblocks_neon: stp x29,x30,[sp,#-80]! 
@@ -268,7 +232,7 @@ poly1305_blocks_neon: adcs x5,x5,xzr adc x6,x6,xzr -#ifdef __ARMEB__ +#ifdef __AARCH64EB__ rev x12,x12 rev x13,x13 #endif @@ -276,7 +240,7 @@ poly1305_blocks_neon: adcs x5,x5,x13 adc x6,x6,x3 - bl poly1305_mult + bl __poly1305_mult ldr x30,[sp,#8] cbz x3,.Lstore_base2_64_neon @@ -314,7 +278,7 @@ poly1305_blocks_neon: ldp x12,x13,[x1],#16 // load input sub x2,x2,#16 add x9,x8,x8,lsr#2 // s1 = r1 + (r1 >> 2) -#ifdef __ARMEB__ +#ifdef __AARCH64EB__ rev x12,x12 rev x13,x13 #endif @@ -322,7 +286,7 @@ poly1305_blocks_neon: adcs x5,x5,x13 adc x6,x6,x3 - bl poly1305_mult + bl __poly1305_mult .Linit_neon: and x10,x4,#0x03ffffff // base 2^64 -> base 2^26 @@ -349,19 +313,19 @@ poly1305_blocks_neon: mov x5,x8 mov x6,xzr add x0,x0,#48+12 - bl poly1305_splat + bl __poly1305_splat - bl poly1305_mult // r^2 + bl __poly1305_mult // r^2 sub x0,x0,#4 - bl poly1305_splat + bl __poly1305_splat - bl poly1305_mult // r^3 + bl __poly1305_mult // r^3 sub x0,x0,#4 - bl poly1305_splat + bl __poly1305_splat - bl poly1305_mult // r^4 + bl __poly1305_mult // r^4 sub x0,x0,#4 - bl poly1305_splat + bl __poly1305_splat ldr x30,[sp,#8] add x16,x1,#32 @@ -399,7 +363,7 @@ poly1305_blocks_neon: lsl x3,x3,#24 add x15,x0,#48 -#ifdef __ARMEB__ +#ifdef __AARCH64EB__ rev x8,x8 rev x12,x12 rev x9,x9 @@ -435,7 +399,7 @@ poly1305_blocks_neon: ld1 {v4.4s,v5.4s,v6.4s,v7.4s},[x15],#64 ld1 {v8.4s},[x15] -#ifdef __ARMEB__ +#ifdef __AARCH64EB__ rev x8,x8 rev x12,x12 rev x9,x9 @@ -496,7 +460,7 @@ poly1305_blocks_neon: umull v20.2d,v14.2s,v1.s[2] ldp x9,x13,[x16],#48 umull v19.2d,v14.2s,v0.s[2] -#ifdef __ARMEB__ +#ifdef __AARCH64EB__ rev x8,x8 rev x12,x12 rev x9,x9 @@ -561,7 +525,7 @@ poly1305_blocks_neon: umlal v23.2d,v11.2s,v3.s[0] umlal v20.2d,v11.2s,v8.s[0] umlal v21.2d,v11.2s,v0.s[0] -#ifdef __ARMEB__ +#ifdef __AARCH64EB__ rev x8,x8 rev x12,x12 rev x9,x9 @@ -801,13 +765,12 @@ poly1305_blocks_neon: .Lno_data_neon: ldr x29,[sp],#80 ret -.size poly1305_blocks_neon,.-poly1305_blocks_neon +ENDPROC(poly1305_blocks_neon) -.type poly1305_emit_neon,%function .align 5 -poly1305_emit_neon: +ENTRY(poly1305_emit_neon) ldr x17,[x0,#24] - cbz x17,poly1305_emit + cbz x17,poly1305_emit_arm ldp w10,w11,[x0] // load hash value base 2^26 ldp w12,w13,[x0,#8] @@ -840,30 +803,22 @@ poly1305_emit_neon: csel x4,x4,x12,eq csel x5,x5,x13,eq -#ifdef __ARMEB__ +#ifdef __AARCH64EB__ ror x10,x10,#32 // flip nonce words ror x11,x11,#32 #endif adds x4,x4,x10 // accumulate nonce adc x5,x5,x11 -#ifdef __ARMEB__ +#ifdef __AARCH64EB__ rev x4,x4 // flip output bytes rev x5,x5 #endif stp x4,x5,[x1] // write result ret -.size poly1305_emit_neon,.-poly1305_emit_neon +ENDPROC(poly1305_emit_neon) .align 5 .Lzeros: .long 0,0,0,0,0,0,0,0 -.LOPENSSL_armcap_P: -#ifdef __ILP32__ -.long OPENSSL_armcap_P-. -#else -.quad OPENSSL_armcap_P-. 
#endif
-.byte 80,111,108,121,49,51,48,53,32,102,111,114,32,65,82,77,118,56,44,32,67,82,89,80,84,79,71,65,77,83,32,98,121,32,60,97,112,112,114,111,64,111,112,101,110,115,115,108,46,111,114,103,62,0
-.align 2
-.align 2
diff --git a/lib/zinc/poly1305/poly1305.c b/lib/zinc/poly1305/poly1305.c
index 51af7045cac8..9dc85f62e806 100644
--- a/lib/zinc/poly1305/poly1305.c
+++ b/lib/zinc/poly1305/poly1305.c
@@ -18,6 +18,8 @@
 #if defined(CONFIG_ZINC_ARCH_X86_64)
 #include "poly1305-x86_64-glue.c"
+#elif defined(CONFIG_ZINC_ARCH_ARM) || defined(CONFIG_ZINC_ARCH_ARM64)
+#include "poly1305-arm-glue.c"
 #else
 static inline bool poly1305_init_arch(void *ctx,
 				      const u8 key[POLY1305_KEY_SIZE])
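
A note on the ARM glue in the patch above: convert_to_base2_64() first repacks the five 26-bit NEON limbs into 64-bit words, and then finishes the carry propagation with the branchless ULT() macro, so that nothing branches on the secret accumulator. Below is a tiny userspace sketch -- not part of the patch; the file name, test values, and harness are mine -- that spot-checks the ULT() expression against the native unsigned < operator, including the wrap-around cases:

/* ult_check.c: sanity-check the branchless unsigned less-than used by
 * convert_to_base2_64(). Build: cc -O2 ult_check.c && ./a.out
 */
#include <stdint.h>
#include <stdio.h>

/* Same expression as in the glue code: the result is 1 exactly when
 * a < b (unsigned), computed without a data-dependent branch. */
#define ULT(a, b) ((a ^ ((a ^ b) | ((a - b) ^ b))) >> (sizeof(a) * 8 - 1))

int main(void)
{
	static const uint64_t v[] = {
		0, 1, 2, 3, 5, 0x3ffffff, 0x4000000,
		0xffffffffull, 0x100000000ull,
		0x7fffffffffffffffull, 0x8000000000000000ull,
		0xfffffffffffffffeull, 0xffffffffffffffffull
	};
	size_t i, j, n = sizeof(v) / sizeof(v[0]);

	for (i = 0; i < n; ++i)
		for (j = 0; j < n; ++j)
			if (ULT(v[i], v[j]) != (v[i] < v[j])) {
				printf("mismatch at %zu,%zu\n", i, j);
				return 1;
			}
	puts("ULT agrees with < on all test pairs");
	return 0;
}
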
From patchwork Sat Oct 6 02:56:57 2018
X-Patchwork-Submitter: "Jason A. Donenfeld"
X-Patchwork-Id: 148314
From: "Jason A. Donenfeld"
To: linux-kernel@vger.kernel.org, netdev@vger.kernel.org, davem@davemloft.net, gregkh@linuxfoundation.org
Cc: "Jason A. Donenfeld", Andy Polyakov, Ralf Baechle, Paul Burton, James Hogan, linux-mips@linux-mips.org, Samuel Neves, Jean-Philippe Aumasson, Andy Lutomirski, Andrew Morton, Linus Torvalds, kernel-hardening@lists.openwall.com, linux-crypto@vger.kernel.org
Subject: [PATCH net-next v7 16/28] zinc: import Andy Polyakov's Poly1305 MIPS64 implementation
Date: Sat, 6 Oct 2018 04:56:57 +0200
Message-Id: <20181006025709.4019-17-Jason@zx2c4.com>
In-Reply-To: <20181006025709.4019-1-Jason@zx2c4.com>
References: <20181006025709.4019-1-Jason@zx2c4.com>

This MIPS64 accelerated implementation comes from Andy Polyakov, and is included here in raw form without modification, so that subsequent commits that fix it up for the kernel can see how it has changed. While this is CRYPTOGAMS code, the originating code happens to be the same as OpenSSL's commit 947716c1872d210828122212d076d503ae68b928.

Signed-off-by: Jason A.
Donenfeld Based-on-code-from: Andy Polyakov Cc: Andy Polyakov Cc: Ralf Baechle Cc: Paul Burton Cc: James Hogan Cc: linux-mips@linux-mips.org Cc: Samuel Neves Cc: Jean-Philippe Aumasson Cc: Andy Lutomirski Cc: Greg KH Cc: Andrew Morton Cc: Linus Torvalds Cc: kernel-hardening@lists.openwall.com Cc: linux-crypto@vger.kernel.org --- .../poly1305/poly1305-mips64-cryptogams.S | 338 ++++++++++++++++++ 1 file changed, 338 insertions(+) create mode 100644 lib/zinc/poly1305/poly1305-mips64-cryptogams.S -- 2.19.0 diff --git a/lib/zinc/poly1305/poly1305-mips64-cryptogams.S b/lib/zinc/poly1305/poly1305-mips64-cryptogams.S new file mode 100644 index 000000000000..24a6005884c3 --- /dev/null +++ b/lib/zinc/poly1305/poly1305-mips64-cryptogams.S @@ -0,0 +1,338 @@ +/* SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause */ +/* + * Copyright (C) 2006-2017 CRYPTOGAMS by . All Rights Reserved. + */ + +#include "mips_arch.h" + +#ifdef MIPSEB +# define MSB 0 +# define LSB 7 +#else +# define MSB 7 +# define LSB 0 +#endif + +.text +.set noat +.set noreorder + +.align 5 +.globl poly1305_init +.ent poly1305_init +poly1305_init: + .frame $29,0,$31 + .set reorder + + sd $0,0($4) + sd $0,8($4) + sd $0,16($4) + + beqz $5,.Lno_key + +#if defined(_MIPS_ARCH_MIPS64R6) + ld $8,0($5) + ld $9,8($5) +#else + ldl $8,0+MSB($5) + ldl $9,8+MSB($5) + ldr $8,0+LSB($5) + ldr $9,8+LSB($5) +#endif +#ifdef MIPSEB +# if defined(_MIPS_ARCH_MIPS64R2) + dsbh $8,$8 # byte swap + dsbh $9,$9 + dshd $8,$8 + dshd $9,$9 +# else + ori $10,$0,0xFF + dsll $1,$10,32 + or $10,$1 # 0x000000FF000000FF + + and $11,$8,$10 # byte swap + and $2,$9,$10 + dsrl $1,$8,24 + dsrl $24,$9,24 + dsll $11,24 + dsll $2,24 + and $1,$10 + and $24,$10 + dsll $10,8 # 0x0000FF000000FF00 + or $11,$1 + or $2,$24 + and $1,$8,$10 + and $24,$9,$10 + dsrl $8,8 + dsrl $9,8 + dsll $1,8 + dsll $24,8 + and $8,$10 + and $9,$10 + or $11,$1 + or $2,$24 + or $8,$11 + or $9,$2 + dsrl $11,$8,32 + dsrl $2,$9,32 + dsll $8,32 + dsll $9,32 + or $8,$11 + or $9,$2 +# endif +#endif + li $10,1 + dsll $10,32 + daddiu $10,-63 + dsll $10,28 + daddiu $10,-1 # 0ffffffc0fffffff + + and $8,$10 + daddiu $10,-3 # 0ffffffc0ffffffc + and $9,$10 + + sd $8,24($4) + dsrl $10,$9,2 + sd $9,32($4) + daddu $10,$9 # s1 = r1 + (r1 >> 2) + sd $10,40($4) + +.Lno_key: + li $2,0 # return 0 + jr $31 +.end poly1305_init +.align 5 +.globl poly1305_blocks +.ent poly1305_blocks +poly1305_blocks: + .set noreorder + dsrl $6,4 # number of complete blocks + bnez $6,poly1305_blocks_internal + nop + jr $31 + nop +.end poly1305_blocks + +.align 5 +.ent poly1305_blocks_internal +poly1305_blocks_internal: + .frame $29,6*8,$31 + .mask 0x00030000,-8 + .set noreorder + dsubu $29,6*8 + sd $17,40($29) + sd $16,32($29) + .set reorder + + ld $12,0($4) # load hash value + ld $13,8($4) + ld $14,16($4) + + ld $15,24($4) # load key + ld $16,32($4) + ld $17,40($4) + +.Loop: +#if defined(_MIPS_ARCH_MIPS64R6) + ld $8,0($5) # load input + ld $9,8($5) +#else + ldl $8,0+MSB($5) # load input + ldl $9,8+MSB($5) + ldr $8,0+LSB($5) + ldr $9,8+LSB($5) +#endif + daddiu $6,-1 + daddiu $5,16 +#ifdef MIPSEB +# if defined(_MIPS_ARCH_MIPS64R2) + dsbh $8,$8 # byte swap + dsbh $9,$9 + dshd $8,$8 + dshd $9,$9 +# else + ori $10,$0,0xFF + dsll $1,$10,32 + or $10,$1 # 0x000000FF000000FF + + and $11,$8,$10 # byte swap + and $2,$9,$10 + dsrl $1,$8,24 + dsrl $24,$9,24 + dsll $11,24 + dsll $2,24 + and $1,$10 + and $24,$10 + dsll $10,8 # 0x0000FF000000FF00 + or $11,$1 + or $2,$24 + and $1,$8,$10 + and $24,$9,$10 + dsrl $8,8 + dsrl $9,8 + dsll $1,8 + dsll $24,8 + and $8,$10 + 
and $9,$10 + or $11,$1 + or $2,$24 + or $8,$11 + or $9,$2 + dsrl $11,$8,32 + dsrl $2,$9,32 + dsll $8,32 + dsll $9,32 + or $8,$11 + or $9,$2 +# endif +#endif + daddu $12,$8 # accumulate input + daddu $13,$9 + sltu $10,$12,$8 + sltu $11,$13,$9 + daddu $13,$10 + + dmultu ($15,$12) # h0*r0 + daddu $14,$7 + sltu $10,$13,$10 + mflo ($8,$15,$12) + mfhi ($9,$15,$12) + + dmultu ($17,$13) # h1*5*r1 + daddu $10,$11 + daddu $14,$10 + mflo ($10,$17,$13) + mfhi ($11,$17,$13) + + dmultu ($16,$12) # h0*r1 + daddu $8,$10 + daddu $9,$11 + mflo ($1,$16,$12) + mfhi ($25,$16,$12) + sltu $10,$8,$10 + daddu $9,$10 + + dmultu ($15,$13) # h1*r0 + daddu $9,$1 + sltu $1,$9,$1 + mflo ($10,$15,$13) + mfhi ($11,$15,$13) + daddu $25,$1 + + dmultu ($17,$14) # h2*5*r1 + daddu $9,$10 + daddu $25,$11 + mflo ($1,$17,$14) + + dmultu ($15,$14) # h2*r0 + sltu $10,$9,$10 + daddu $25,$10 + mflo ($2,$15,$14) + + daddu $9,$1 + daddu $25,$2 + sltu $1,$9,$1 + daddu $25,$1 + + li $10,-4 # final reduction + and $10,$25 + dsrl $11,$25,2 + andi $14,$25,3 + daddu $10,$11 + daddu $12,$8,$10 + sltu $10,$12,$10 + daddu $13,$9,$10 + sltu $10,$13,$10 + daddu $14,$14,$10 + + bnez $6,.Loop + + sd $12,0($4) # store hash value + sd $13,8($4) + sd $14,16($4) + + .set noreorder + ld $17,40($29) # epilogue + ld $16,32($29) + jr $31 + daddu $29,6*8 +.end poly1305_blocks_internal +.align 5 +.globl poly1305_emit +.ent poly1305_emit +poly1305_emit: + .frame $29,0,$31 + .set reorder + + ld $10,0($4) + ld $11,8($4) + ld $1,16($4) + + daddiu $8,$10,5 # compare to modulus + sltiu $2,$8,5 + daddu $9,$11,$2 + sltu $2,$9,$2 + daddu $1,$1,$2 + + dsrl $1,2 # see if it carried/borrowed + dsubu $1,$0,$1 + nor $2,$0,$1 + + and $8,$1 + and $10,$2 + and $9,$1 + and $11,$2 + or $8,$10 + or $9,$11 + + lwu $10,0($6) # load nonce + lwu $11,4($6) + lwu $1,8($6) + lwu $2,12($6) + dsll $11,32 + dsll $2,32 + or $10,$11 + or $1,$2 + + daddu $8,$10 # accumulate nonce + daddu $9,$1 + sltu $10,$8,$10 + daddu $9,$10 + + dsrl $10,$8,8 # write mac value + dsrl $11,$8,16 + dsrl $1,$8,24 + sb $8,0($5) + dsrl $2,$8,32 + sb $10,1($5) + dsrl $10,$8,40 + sb $11,2($5) + dsrl $11,$8,48 + sb $1,3($5) + dsrl $1,$8,56 + sb $2,4($5) + dsrl $2,$9,8 + sb $10,5($5) + dsrl $10,$9,16 + sb $11,6($5) + dsrl $11,$9,24 + sb $1,7($5) + + sb $9,8($5) + dsrl $1,$9,32 + sb $2,9($5) + dsrl $2,$9,40 + sb $10,10($5) + dsrl $10,$9,48 + sb $11,11($5) + dsrl $11,$9,56 + sb $1,12($5) + sb $2,13($5) + sb $10,14($5) + sb $11,15($5) + + jr $31 +.end poly1305_emit +.rdata +.asciiz "Poly1305 for MIPS64, CRYPTOGAMS by " +.align 2 From patchwork Sat Oct 6 02:56:58 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: "Jason A. 
Donenfeld" X-Patchwork-Id: 148313 Delivered-To: patch@linaro.org Received: by 2002:a2e:8595:0:0:0:0:0 with SMTP id b21-v6csp1157722lji; Fri, 5 Oct 2018 19:58:26 -0700 (PDT) X-Google-Smtp-Source: ACcGV60mztm69MDpaTlmfucfhLpF4T1f1thnjDc3iTn+HAa8MbPab+uC5OzZ7ChorN6zmb5ig81x X-Received: by 2002:a63:6849:: with SMTP id d70-v6mr12266662pgc.7.1538794706333; Fri, 05 Oct 2018 19:58:26 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1538794706; cv=none; d=google.com; s=arc-20160816; b=RYktJr89MvORBOUfZvnCEATMk5L07ZTgYyL+4fCCxDePwNXA4hU/rcxmYl7fAckM7V 42KOETG18ByGV/EBLgqLrX1rMlI1UEtRU9CcohhyS+trvYXlprb6NShPZUp6etv7vmeq GGZq5KLh5tJPUTMJqCYE3gtFrpixiYOPR5J9XSAgWo4uY5QBaP73TG8o24F01lpJcBSW Yuk1FtHpjI2ptrohDQY0qWcCc6HbuCoyzORPSrDveDN3ZO2Aa4Uhn9JqVf6A3kkiWTj1 UUJf/uVp7ONfIXHKfASm9RdZNyh4zCWwwjdA4HXHldnpU5+hq35gCY3gPseKCGZ193yR oWsg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from :dkim-signature; bh=OdythF4/a44Ph6YjjI5xXi8dg8xpaMgkXX/lGZq1/+0=; b=GEcjSWLrrMMVKtKo8W11lPbdOlDCEdGWBSzLPWFaBGZAe77NTSxOYTrUyY20bbSPS9 bqATTypTF6hslzSHTvuJuaHs/NpqKZUxGM8Uge6Llcv0Kq62etg/HNRg5P+ankzRUDi5 vHPo/OpdkK5q9D/1w9Z5MAxEQuj5pGE1LZcmUexG9CoDC5CWQ2Hl7b/zyDuuqD92qIx6 0SL5zp2cZdvZCPuQCykzvDVnSy6CCY53IcbI0+bUwn16pGI4+ubqKqT1zT8efTzRbcSD vatZFkMFS4r9ypLO11gcMbzBRAs/js8GrQZUxihyacSoehDaWcFsOxn3VbfOlecJ6sDy 5SqQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@zx2c4.com header.s=mail header.b=2ym1KLpI; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=zx2c4.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. 
[209.132.180.67]) by mx.google.com with ESMTP id k5-v6si8650571pgo.312.2018.10.05.19.58.26; Fri, 05 Oct 2018 19:58:26 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@zx2c4.com header.s=mail header.b=2ym1KLpI; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=zx2c4.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729817AbeJFJ75 (ORCPT + 32 others); Sat, 6 Oct 2018 05:59:57 -0400 Received: from frisell.zx2c4.com ([192.95.5.64]:50073 "EHLO frisell.zx2c4.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729562AbeJFJ74 (ORCPT ); Sat, 6 Oct 2018 05:59:56 -0400 Received: by frisell.zx2c4.com (ZX2C4 Mail Server) with ESMTP id c753649e; Sat, 6 Oct 2018 02:57:51 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=zx2c4.com; h=from:to:cc :subject:date:message-id:in-reply-to:references:mime-version :content-type:content-transfer-encoding; s=mail; bh=md7nW6t9RHwF DkFxeazXCOUeY/c=; b=2ym1KLpINmGIVFTxBkko0eIrQ9FA7/Uf8KXiv5ehQV8T wRtypXbhAge4gX6+VXsdmqRhxNL14QP32crGoPFxlz0jR+gyo2+pHrx7EsAJdsyC HgWv9Ej5hng0jGkNtDwwya27XLhed5Y6jXTSNSLb5QK+SyEi5qp9z5e8TNbJOiR5 0jeV3wHbDkaywoNAZdQ6v6TcOvLwVHNZKIeb6YrBPg2rDs//ntyjC31Fid6UZF+n z7NExwc0mswvY4nDl3tqFvkvZI8JwK1Z6o5J52/GGjXdWePO9flY7fhwCo0Ntn9y aj2AQNviDug7Mw8RencXy76Kndby/U6GDysJ5VLhOg== Received: by frisell.zx2c4.com (ZX2C4 Mail Server) with ESMTPSA id 3db43530 (TLSv1.2:ECDHE-RSA-AES256-GCM-SHA384:256:NO); Sat, 6 Oct 2018 02:57:50 +0000 (UTC) From: "Jason A. Donenfeld" To: linux-kernel@vger.kernel.org, netdev@vger.kernel.org, davem@davemloft.net, gregkh@linuxfoundation.org Cc: "Jason A. Donenfeld" , =?utf-8?q?Ren=C3=A9_van_Dorst?= , Ralf Baechle , Paul Burton , James Hogan , linux-mips@linux-mips.org, Samuel Neves , Jean-Philippe Aumasson , Andy Lutomirski , Andrew Morton , Linus Torvalds , kernel-hardening@lists.openwall.com, linux-crypto@vger.kernel.org Subject: [PATCH net-next v7 17/28] zinc: Poly1305 MIPS32r2 and MIPS64 implementations Date: Sat, 6 Oct 2018 04:56:58 +0200 Message-Id: <20181006025709.4019-18-Jason@zx2c4.com> In-Reply-To: <20181006025709.4019-1-Jason@zx2c4.com> References: <20181006025709.4019-1-Jason@zx2c4.com> MIME-Version: 1.0 Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org This MIPS32r2 implementation comes from René van Dorst and me and results in a nice speedup on the usual OpenWRT targets. The MIPS64 implementation from Andy Polyakov ported here results in a nice speedup on commodity Octeon hardware, and has been modified slightly from the original: - The function names have been renamed to fit kernel conventions. - A comment has been added. Signed-off-by: Jason A. 
Donenfeld Signed-off-by: René van Dorst Co-developed-by: René van Dorst Cc: Ralf Baechle Cc: Paul Burton Cc: James Hogan Cc: linux-mips@linux-mips.org Cc: Samuel Neves Cc: Jean-Philippe Aumasson Cc: Andy Lutomirski Cc: Greg KH Cc: Andrew Morton Cc: Linus Torvalds Cc: kernel-hardening@lists.openwall.com Cc: linux-crypto@vger.kernel.org --- lib/zinc/Makefile | 3 + lib/zinc/poly1305/poly1305-mips-glue.c | 37 ++ lib/zinc/poly1305/poly1305-mips.S | 407 ++++++++++++++++++ ...-mips64-cryptogams.S => poly1305-mips64.S} | 80 ++-- lib/zinc/poly1305/poly1305.c | 2 + 5 files changed, 500 insertions(+), 29 deletions(-) create mode 100644 lib/zinc/poly1305/poly1305-mips-glue.c create mode 100644 lib/zinc/poly1305/poly1305-mips.S rename lib/zinc/poly1305/{poly1305-mips64-cryptogams.S => poly1305-mips64.S} (75%) -- 2.19.0 diff --git a/lib/zinc/Makefile b/lib/zinc/Makefile index c09fd3de60f9..5c4b1d51cb03 100644 --- a/lib/zinc/Makefile +++ b/lib/zinc/Makefile @@ -14,4 +14,7 @@ zinc_poly1305-y := poly1305/poly1305.o zinc_poly1305-$(CONFIG_ZINC_ARCH_X86_64) += poly1305/poly1305-x86_64.o zinc_poly1305-$(CONFIG_ZINC_ARCH_ARM) += poly1305/poly1305-arm.o zinc_poly1305-$(CONFIG_ZINC_ARCH_ARM64) += poly1305/poly1305-arm64.o +zinc_poly1305-$(CONFIG_ZINC_ARCH_MIPS) += poly1305/poly1305-mips.o +AFLAGS_poly1305-mips.o += -O2 # This is required to fill the branch delay slots +zinc_poly1305-$(CONFIG_ZINC_ARCH_MIPS64) += poly1305/poly1305-mips64.o obj-$(CONFIG_ZINC_POLY1305) += zinc_poly1305.o diff --git a/lib/zinc/poly1305/poly1305-mips-glue.c b/lib/zinc/poly1305/poly1305-mips-glue.c new file mode 100644 index 000000000000..1eba9512a05c --- /dev/null +++ b/lib/zinc/poly1305/poly1305-mips-glue.c @@ -0,0 +1,37 @@ +// SPDX-License-Identifier: GPL-2.0 OR MIT +/* + * Copyright (C) 2015-2018 Jason A. Donenfeld . All Rights Reserved. + */ + +asmlinkage void poly1305_init_mips(void *ctx, const u8 key[16]); +asmlinkage void poly1305_blocks_mips(void *ctx, const u8 *inp, const size_t len, + const u32 padbit); +asmlinkage void poly1305_emit_mips(void *ctx, u8 mac[16], const u32 nonce[4]); + +static bool *const poly1305_nobs[] __initconst = { }; +static void __init poly1305_fpu_init(void) +{ +} + +static inline bool poly1305_init_arch(void *ctx, + const u8 key[POLY1305_KEY_SIZE]) +{ + poly1305_init_mips(ctx, key); + return true; +} + +static inline bool poly1305_blocks_arch(void *ctx, const u8 *inp, + size_t len, const u32 padbit, + simd_context_t *simd_context) +{ + poly1305_blocks_mips(ctx, inp, len, padbit); + return true; +} + +static inline bool poly1305_emit_arch(void *ctx, u8 mac[POLY1305_MAC_SIZE], + const u32 nonce[4], + simd_context_t *simd_context) +{ + poly1305_emit_mips(ctx, mac, nonce); + return true; +} diff --git a/lib/zinc/poly1305/poly1305-mips.S b/lib/zinc/poly1305/poly1305-mips.S new file mode 100644 index 000000000000..4d695eef1091 --- /dev/null +++ b/lib/zinc/poly1305/poly1305-mips.S @@ -0,0 +1,407 @@ +/* SPDX-License-Identifier: GPL-2.0 OR MIT */ +/* + * Copyright (C) 2016-2018 René van Dorst All Rights Reserved. + * Copyright (C) 2015-2018 Jason A. Donenfeld . All Rights Reserved. 
+ */ + +#if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__ +#define MSB 0 +#define LSB 3 +#else +#define MSB 3 +#define LSB 0 +#endif + +#define POLY1305_BLOCK_SIZE 16 +.text +#define H0 $t0 +#define H1 $t1 +#define H2 $t2 +#define H3 $t3 +#define H4 $t4 + +#define R0 $t5 +#define R1 $t6 +#define R2 $t7 +#define R3 $t8 + +#define O0 $s0 +#define O1 $s4 +#define O2 $v1 +#define O3 $t9 +#define O4 $s5 + +#define S1 $s1 +#define S2 $s2 +#define S3 $s3 + +#define SC $at +#define CA $v0 + +/* Input arguments */ +#define poly $a0 +#define src $a1 +#define srclen $a2 +#define hibit $a3 + +/* Location in the opaque buffer + * R[0..3], CA, H[0..4] + */ +#define PTR_POLY1305_R(n) ( 0 + (n*4)) ## ($a0) +#define PTR_POLY1305_CA (16 ) ## ($a0) +#define PTR_POLY1305_H(n) (20 + (n*4)) ## ($a0) + +#define POLY1305_BLOCK_SIZE 16 +#define POLY1305_STACK_SIZE 32 + +.set noat +.align 4 +.globl poly1305_blocks_mips +.ent poly1305_blocks_mips +poly1305_blocks_mips: + .frame $sp, POLY1305_STACK_SIZE, $ra + /* srclen &= 0xFFFFFFF0 */ + ins srclen, $zero, 0, 4 + + addiu $sp, -(POLY1305_STACK_SIZE) + + /* check srclen >= 16 bytes */ + beqz srclen, .Lpoly1305_blocks_mips_end + + /* Calculate last round based on src address pointer. + * last round src ptr (srclen) = src + (srclen & 0xFFFFFFF0) + */ + addu srclen, src + + lw R0, PTR_POLY1305_R(0) + lw R1, PTR_POLY1305_R(1) + lw R2, PTR_POLY1305_R(2) + lw R3, PTR_POLY1305_R(3) + + /* store the used save registers. */ + sw $s0, 0($sp) + sw $s1, 4($sp) + sw $s2, 8($sp) + sw $s3, 12($sp) + sw $s4, 16($sp) + sw $s5, 20($sp) + + /* load Hx and Carry */ + lw CA, PTR_POLY1305_CA + lw H0, PTR_POLY1305_H(0) + lw H1, PTR_POLY1305_H(1) + lw H2, PTR_POLY1305_H(2) + lw H3, PTR_POLY1305_H(3) + lw H4, PTR_POLY1305_H(4) + + /* Sx = Rx + (Rx >> 2) */ + srl S1, R1, 2 + srl S2, R2, 2 + srl S3, R3, 2 + addu S1, R1 + addu S2, R2 + addu S3, R3 + + addiu SC, $zero, 1 + +.Lpoly1305_loop: + lwl O0, 0+MSB(src) + lwl O1, 4+MSB(src) + lwl O2, 8+MSB(src) + lwl O3,12+MSB(src) + lwr O0, 0+LSB(src) + lwr O1, 4+LSB(src) + lwr O2, 8+LSB(src) + lwr O3,12+LSB(src) + +#if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__ + wsbh O0 + wsbh O1 + wsbh O2 + wsbh O3 + rotr O0, 16 + rotr O1, 16 + rotr O2, 16 + rotr O3, 16 +#endif + + /* h0 = (u32)(d0 = (u64)h0 + inp[0] + c 'Carry_previous cycle'); */ + addu H0, CA + sltu CA, H0, CA + addu O0, H0 + sltu H0, O0, H0 + addu CA, H0 + + /* h1 = (u32)(d1 = (u64)h1 + (d0 >> 32) + inp[4]); */ + addu H1, CA + sltu CA, H1, CA + addu O1, H1 + sltu H1, O1, H1 + addu CA, H1 + + /* h2 = (u32)(d2 = (u64)h2 + (d1 >> 32) + inp[8]); */ + addu H2, CA + sltu CA, H2, CA + addu O2, H2 + sltu H2, O2, H2 + addu CA, H2 + + /* h3 = (u32)(d3 = (u64)h3 + (d2 >> 32) + inp[12]); */ + addu H3, CA + sltu CA, H3, CA + addu O3, H3 + sltu H3, O3, H3 + addu CA, H3 + + /* h4 += (u32)(d3 >> 32) + padbit; */ + addu H4, hibit + addu O4, H4, CA + + /* D0 */ + multu O0, R0 + maddu O1, S3 + maddu O2, S2 + maddu O3, S1 + mfhi CA + mflo H0 + + /* D1 */ + multu O0, R1 + maddu O1, R0 + maddu O2, S3 + maddu O3, S2 + maddu O4, S1 + maddu CA, SC + mfhi CA + mflo H1 + + /* D2 */ + multu O0, R2 + maddu O1, R1 + maddu O2, R0 + maddu O3, S3 + maddu O4, S2 + maddu CA, SC + mfhi CA + mflo H2 + + /* D4 */ + mul H4, O4, R0 + + /* D3 */ + multu O0, R3 + maddu O1, R2 + maddu O2, R1 + maddu O3, R0 + maddu O4, S3 + maddu CA, SC + mfhi CA + mflo H3 + + addiu src, POLY1305_BLOCK_SIZE + + /* h4 += (u32)(d3 >> 32); */ + addu O4, H4, CA + /* h4 &= 3 */ + andi H4, O4, 3 + /* c = (h4 >> 2) + (h4 & ~3U); */ + srl CA, O4, 2 + ins O4, $zero, 0, 2 + + 
addu CA, O4 + + /* able to do a 16 byte block. */ + bne src, srclen, .Lpoly1305_loop + + /* restore the used save registers. */ + lw $s0, 0($sp) + lw $s1, 4($sp) + lw $s2, 8($sp) + lw $s3, 12($sp) + lw $s4, 16($sp) + lw $s5, 20($sp) + + /* store Hx and Carry */ + sw CA, PTR_POLY1305_CA + sw H0, PTR_POLY1305_H(0) + sw H1, PTR_POLY1305_H(1) + sw H2, PTR_POLY1305_H(2) + sw H3, PTR_POLY1305_H(3) + sw H4, PTR_POLY1305_H(4) + +.Lpoly1305_blocks_mips_end: + addiu $sp, POLY1305_STACK_SIZE + + /* Jump Back */ + jr $ra +.end poly1305_blocks_mips +.set at + +/* Input arguments CTX=$a0, MAC=$a1, NONCE=$a2 */ +#define MAC $a1 +#define NONCE $a2 + +#define G0 $t5 +#define G1 $t6 +#define G2 $t7 +#define G3 $t8 +#define G4 $t9 + +.set noat +.align 4 +.globl poly1305_emit_mips +.ent poly1305_emit_mips +poly1305_emit_mips: + /* load Hx and Carry */ + lw CA, PTR_POLY1305_CA + lw H0, PTR_POLY1305_H(0) + lw H1, PTR_POLY1305_H(1) + lw H2, PTR_POLY1305_H(2) + lw H3, PTR_POLY1305_H(3) + lw H4, PTR_POLY1305_H(4) + + /* Add left over carry */ + addu H0, CA + sltu CA, H0, CA + addu H1, CA + sltu CA, H1, CA + addu H2, CA + sltu CA, H2, CA + addu H3, CA + sltu CA, H3, CA + addu H4, CA + + /* compare to modulus by computing h + -p */ + addiu G0, H0, 5 + sltu CA, G0, H0 + addu G1, H1, CA + sltu CA, G1, H1 + addu G2, H2, CA + sltu CA, G2, H2 + addu G3, H3, CA + sltu CA, G3, H3 + addu G4, H4, CA + + srl SC, G4, 2 + + /* if there was carry into 131st bit, h3:h0 = g3:g0 */ + movn H0, G0, SC + movn H1, G1, SC + movn H2, G2, SC + movn H3, G3, SC + + lwl G0, 0+MSB(NONCE) + lwl G1, 4+MSB(NONCE) + lwl G2, 8+MSB(NONCE) + lwl G3,12+MSB(NONCE) + lwr G0, 0+LSB(NONCE) + lwr G1, 4+LSB(NONCE) + lwr G2, 8+LSB(NONCE) + lwr G3,12+LSB(NONCE) + + /* mac = (h + nonce) % (2^128) */ + addu H0, G0 + sltu CA, H0, G0 + + /* H1 */ + addu H1, CA + sltu CA, H1, CA + addu H1, G1 + sltu G1, H1, G1 + addu CA, G1 + + /* H2 */ + addu H2, CA + sltu CA, H2, CA + addu H2, G2 + sltu G2, H2, G2 + addu CA, G2 + + /* H3 */ + addu H3, CA + addu H3, G3 + +#if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__ + wsbh H0 + wsbh H1 + wsbh H2 + wsbh H3 + rotr H0, 16 + rotr H1, 16 + rotr H2, 16 + rotr H3, 16 +#endif + + /* store MAC */ + swl H0, 0+MSB(MAC) + swl H1, 4+MSB(MAC) + swl H2, 8+MSB(MAC) + swl H3,12+MSB(MAC) + swr H0, 0+LSB(MAC) + swr H1, 4+LSB(MAC) + swr H2, 8+LSB(MAC) + swr H3,12+LSB(MAC) + + jr $ra +.end poly1305_emit_mips + +#define PR0 $t0 +#define PR1 $t1 +#define PR2 $t2 +#define PR3 $t3 +#define PT0 $t4 + +/* Input arguments CTX=$a0, KEY=$a1 */ + +.align 4 +.globl poly1305_init_mips +.ent poly1305_init_mips +poly1305_init_mips: + lwl PR0, 0+MSB($a1) + lwl PR1, 4+MSB($a1) + lwl PR2, 8+MSB($a1) + lwl PR3,12+MSB($a1) + lwr PR0, 0+LSB($a1) + lwr PR1, 4+LSB($a1) + lwr PR2, 8+LSB($a1) + lwr PR3,12+LSB($a1) + + /* store Hx and Carry */ + sw $zero, PTR_POLY1305_CA + sw $zero, PTR_POLY1305_H(0) + sw $zero, PTR_POLY1305_H(1) + sw $zero, PTR_POLY1305_H(2) + sw $zero, PTR_POLY1305_H(3) + sw $zero, PTR_POLY1305_H(4) + +#if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__ + wsbh PR0 + wsbh PR1 + wsbh PR2 + wsbh PR3 + rotr PR0, 16 + rotr PR1, 16 + rotr PR2, 16 + rotr PR3, 16 +#endif + + lui PT0, 0x0FFF + ori PT0, 0xFFFC + + /* AND 0x0fffffff; */ + ext PR0, PR0, 0, (32-4) + + /* AND 0x0ffffffc; */ + and PR1, PT0 + and PR2, PT0 + and PR3, PT0 + + /* store Rx */ + sw PR0, PTR_POLY1305_R(0) + sw PR1, PTR_POLY1305_R(1) + sw PR2, PTR_POLY1305_R(2) + sw PR3, PTR_POLY1305_R(3) + + /* Jump Back */ + jr $ra +.end poly1305_init_mips diff --git a/lib/zinc/poly1305/poly1305-mips64-cryptogams.S 
b/lib/zinc/poly1305/poly1305-mips64.S similarity index 75% rename from lib/zinc/poly1305/poly1305-mips64-cryptogams.S rename to lib/zinc/poly1305/poly1305-mips64.S index 24a6005884c3..272a86c47bcb 100644 --- a/lib/zinc/poly1305/poly1305-mips64-cryptogams.S +++ b/lib/zinc/poly1305/poly1305-mips64.S @@ -1,26 +1,49 @@ /* SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause */ /* + * Copyright (C) 2015-2018 Jason A. Donenfeld . All Rights Reserved. * Copyright (C) 2006-2017 CRYPTOGAMS by . All Rights Reserved. + * + * This is based in part on Andy Polyakov's implementation from CRYPTOGAMS. */ -#include "mips_arch.h" +#if (defined(_MIPS_ARCH_MIPS64R3) || defined(_MIPS_ARCH_MIPS64R5) || \ + defined(_MIPS_ARCH_MIPS64R6)) && !defined(_MIPS_ARCH_MIPS64R2) +#define _MIPS_ARCH_MIPS64R2 +#endif + +#ifdef __MIPSEB__ +#define MSB 0 +#define LSB 7 +#else +#define MSB 7 +#define LSB 0 +#endif -#ifdef MIPSEB -# define MSB 0 -# define LSB 7 +#if defined(_MIPS_ARCH_MIPS64R6) +#define dmultu(rs,rt) +#define mflo(rd,rs,rt) dmulu rd,rs,rt +#define mfhi(rd,rs,rt) dmuhu rd,rs,rt #else -# define MSB 7 -# define LSB 0 +#define dmultu(rs,rt) dmultu rs,rt +#define multu(rs,rt) multu rs,rt +#define mflo(rd,rs,rt) mflo rd +#define mfhi(rd,rs,rt) mfhi rd #endif .text .set noat .set noreorder +/* While most of the assembly in the kernel prefers ENTRY() and ENDPROC(), + * there is no existing MIPS assembly that uses it, and MIPS assembler seems + * to like its own .ent/.end notation, which the MIPS include files don't + * provide in a MIPS-specific ENTRY/ENDPROC definition. So, we skip these + * for now, until somebody complains. */ + .align 5 -.globl poly1305_init -.ent poly1305_init -poly1305_init: +.globl poly1305_init_mips +.ent poly1305_init_mips +poly1305_init_mips: .frame $29,0,$31 .set reorder @@ -39,13 +62,13 @@ poly1305_init: ldr $8,0+LSB($5) ldr $9,8+LSB($5) #endif -#ifdef MIPSEB -# if defined(_MIPS_ARCH_MIPS64R2) +#ifdef __MIPSEB__ +#if defined(_MIPS_ARCH_MIPS64R2) dsbh $8,$8 # byte swap dsbh $9,$9 dshd $8,$8 dshd $9,$9 -# else +#else ori $10,$0,0xFF dsll $1,$10,32 or $10,$1 # 0x000000FF000000FF @@ -79,7 +102,7 @@ poly1305_init: dsll $9,32 or $8,$11 or $9,$2 -# endif +#endif #endif li $10,1 dsll $10,32 @@ -100,18 +123,19 @@ poly1305_init: .Lno_key: li $2,0 # return 0 jr $31 -.end poly1305_init +.end poly1305_init_mips + .align 5 -.globl poly1305_blocks -.ent poly1305_blocks -poly1305_blocks: +.globl poly1305_blocks_mips +.ent poly1305_blocks_mips +poly1305_blocks_mips: .set noreorder dsrl $6,4 # number of complete blocks bnez $6,poly1305_blocks_internal nop jr $31 nop -.end poly1305_blocks +.end poly1305_blocks_mips .align 5 .ent poly1305_blocks_internal @@ -144,13 +168,13 @@ poly1305_blocks_internal: #endif daddiu $6,-1 daddiu $5,16 -#ifdef MIPSEB -# if defined(_MIPS_ARCH_MIPS64R2) +#ifdef __MIPSEB__ +#if defined(_MIPS_ARCH_MIPS64R2) dsbh $8,$8 # byte swap dsbh $9,$9 dshd $8,$8 dshd $9,$9 -# else +#else ori $10,$0,0xFF dsll $1,$10,32 or $10,$1 # 0x000000FF000000FF @@ -184,7 +208,7 @@ poly1305_blocks_internal: dsll $9,32 or $8,$11 or $9,$2 -# endif +#endif #endif daddu $12,$8 # accumulate input daddu $13,$9 @@ -257,10 +281,11 @@ poly1305_blocks_internal: jr $31 daddu $29,6*8 .end poly1305_blocks_internal + .align 5 -.globl poly1305_emit -.ent poly1305_emit -poly1305_emit: +.globl poly1305_emit_mips +.ent poly1305_emit_mips +poly1305_emit_mips: .frame $29,0,$31 .set reorder @@ -332,7 +357,4 @@ poly1305_emit: sb $11,15($5) jr $31 -.end poly1305_emit -.rdata -.asciiz "Poly1305 for MIPS64, CRYPTOGAMS by " -.align 2 
+.end poly1305_emit_mips
diff --git a/lib/zinc/poly1305/poly1305.c b/lib/zinc/poly1305/poly1305.c
index 9dc85f62e806..e3386a0e1554 100644
--- a/lib/zinc/poly1305/poly1305.c
+++ b/lib/zinc/poly1305/poly1305.c
@@ -20,6 +20,8 @@
 #include "poly1305-x86_64-glue.c"
 #elif defined(CONFIG_ZINC_ARCH_ARM) || defined(CONFIG_ZINC_ARCH_ARM64)
 #include "poly1305-arm-glue.c"
+#elif defined(CONFIG_ZINC_ARCH_MIPS) || defined(CONFIG_ZINC_ARCH_MIPS64)
+#include "poly1305-mips-glue.c"
 #else
 static inline bool poly1305_init_arch(void *ctx,
 				      const u8 key[POLY1305_KEY_SIZE])
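
A note for readers of the MIPS32r2 assembly in the patch above: the multu/maddu chains labeled D0 through D4 compute h = (h + inp) * r mod 2^130 - 5 in radix 2^32, where s_i = r_i + (r_i >> 2) = 5*r_i/4 -- exact because key clamping clears the low two bits of r1..r3 -- folds product terms of weight 2^128 and above back under the modulus. Below is a rough userspace C model of that schedule, checked against the Poly1305 test vector from RFC 8439, section 2.5.2; it is an illustration of the math, not the kernel code, and the file name, le32() helper, and harness are mine:

/* poly1305_radix32.c: model of the radix-2^32 block schedule above,
 * checked against the Poly1305 vector from RFC 8439, section 2.5.2.
 * Build: cc -O2 poly1305_radix32.c && ./a.out
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef uint32_t u32;
typedef uint64_t u64;

static u32 le32(const unsigned char *p)
{
	return (u32)p[0] | (u32)p[1] << 8 | (u32)p[2] << 16 | (u32)p[3] << 24;
}

int main(void)
{
	static const unsigned char key[32] = {
		0x85, 0xd6, 0xbe, 0x78, 0x57, 0x55, 0x6d, 0x33,
		0x7f, 0x44, 0x52, 0xfe, 0x42, 0xd5, 0x06, 0xa8,
		0x01, 0x03, 0x80, 0x8a, 0xfb, 0x0d, 0xb2, 0xfd,
		0x4a, 0xbf, 0xf6, 0xaf, 0x41, 0x49, 0xf5, 0x1b
	};
	static const unsigned char expected[16] = {
		0xa8, 0x06, 0x1d, 0xc1, 0x30, 0x51, 0x36, 0xc6,
		0xc2, 0x2b, 0x8b, 0xaf, 0x0c, 0x01, 0x27, 0xa9
	};
	const unsigned char *msg =
		(const unsigned char *)"Cryptographic Forum Research Group";
	size_t len = strlen((const char *)msg), i;
	unsigned char tag[16];
	u32 t[4];
	u64 d;

	/* Clamp r as poly1305_init_mips does. */
	u32 r0 = le32(key +  0) & 0x0fffffff;
	u32 r1 = le32(key +  4) & 0x0ffffffc;
	u32 r2 = le32(key +  8) & 0x0ffffffc;
	u32 r3 = le32(key + 12) & 0x0ffffffc;
	/* s_i = r_i + (r_i >> 2) = 5*r_i/4, exact since r_i % 4 == 0. */
	u32 s1 = r1 + (r1 >> 2), s2 = r2 + (r2 >> 2), s3 = r3 + (r3 >> 2);
	u32 h0 = 0, h1 = 0, h2 = 0, h3 = 0, h4 = 0;

	while (len) {
		unsigned char block[16] = { 0 };
		size_t n = len < 16 ? len : 16;
		u32 hibit = n == 16;

		memcpy(block, msg, n);
		if (n < 16)
			block[n] = 1;	/* pad the final partial block */
		msg += n;
		len -= n;

		/* h += block: the sltu carry chains in the assembly. */
		d = (u64)h0 + le32(block +  0);             h0 = (u32)d;
		d = (u64)h1 + le32(block +  4) + (d >> 32); h1 = (u32)d;
		d = (u64)h2 + le32(block +  8) + (d >> 32); h2 = (u32)d;
		d = (u64)h3 + le32(block + 12) + (d >> 32); h3 = (u32)d;
		h4 += hibit + (u32)(d >> 32);

		/* h *= r mod 2^130-5: the multu/maddu chains D0..D4. */
		u64 d0 = (u64)h0*r0 + (u64)h1*s3 + (u64)h2*s2 + (u64)h3*s1;
		u64 d1 = (u64)h0*r1 + (u64)h1*r0 + (u64)h2*s3 +
			 (u64)h3*s2 + (u64)h4*s1 + (d0 >> 32);
		u64 d2 = (u64)h0*r2 + (u64)h1*r1 + (u64)h2*r0 +
			 (u64)h3*s3 + (u64)h4*s2 + (d1 >> 32);
		u64 d3 = (u64)h0*r3 + (u64)h1*r2 + (u64)h2*r1 +
			 (u64)h3*r0 + (u64)h4*s3 + (d2 >> 32);
		u64 d4 = (u64)h4*r0 + (d3 >> 32);

		/* Fold bits 130 and up: 2^130 = 5 (mod 2^130-5). */
		u64 c = (d4 >> 2) * 5;
		h4 = (u32)d4 & 3;
		d = (u64)(u32)d0 + c;         h0 = (u32)d;
		d = (u64)(u32)d1 + (d >> 32); h1 = (u32)d;
		d = (u64)(u32)d2 + (d >> 32); h2 = (u32)d;
		d = (u64)(u32)d3 + (d >> 32); h3 = (u32)d;
		h4 += (u32)(d >> 32);
	}

	/* If h + 5 carries into bit 130, then h >= 2^130 - 5, so take
	 * h - p instead (the assembly selects with movn to stay
	 * constant-time; a plain branch is fine for this model). */
	d = (u64)h0 + 5;         u32 g0 = (u32)d;
	d = (u64)h1 + (d >> 32); u32 g1 = (u32)d;
	d = (u64)h2 + (d >> 32); u32 g2 = (u32)d;
	d = (u64)h3 + (d >> 32); u32 g3 = (u32)d;
	if ((h4 + (u32)(d >> 32)) >> 2) {
		h0 = g0; h1 = g1; h2 = g2; h3 = g3;
	}

	/* tag = (h + s) mod 2^128, serialized little-endian. */
	d = (u64)h0 + le32(key + 16);             t[0] = (u32)d;
	d = (u64)h1 + le32(key + 20) + (d >> 32); t[1] = (u32)d;
	d = (u64)h2 + le32(key + 24) + (d >> 32); t[2] = (u32)d;
	d = (u64)h3 + le32(key + 28) + (d >> 32); t[3] = (u32)d;
	for (i = 0; i < 16; ++i)
		tag[i] = (unsigned char)(t[i / 4] >> (8 * (i % 4)));

	puts(memcmp(tag, expected, 16) ? "tag MISMATCH" : "tag matches RFC 8439");
	return 0;
}
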
From patchwork Sat Oct 6 02:57:00 2018
X-Patchwork-Submitter: "Jason A. Donenfeld"
X-Patchwork-Id: 148324
From: "Jason A. Donenfeld"
To: linux-kernel@vger.kernel.org, netdev@vger.kernel.org, davem@davemloft.net, gregkh@linuxfoundation.org
Cc: "Jason A. Donenfeld", Samuel Neves, Jean-Philippe Aumasson, Andy Lutomirski, Andrew Morton, Linus Torvalds, kernel-hardening@lists.openwall.com, linux-crypto@vger.kernel.org
Subject: [PATCH net-next v7 19/28] zinc: BLAKE2s generic C implementation and selftest
Date: Sat, 6 Oct 2018 04:57:00 +0200
Message-Id: <20181006025709.4019-20-Jason@zx2c4.com>
In-Reply-To: <20181006025709.4019-1-Jason@zx2c4.com>
References: <20181006025709.4019-1-Jason@zx2c4.com>

The C implementation was originally based on Samuel Neves' public domain reference implementation but has since been heavily modified for the kernel. We're able to do compile-time optimizations by moving some scaffolding around the final function into the header file.

Information: https://blake2.net/

Signed-off-by: Jason A.
Donenfeld Signed-off-by: Samuel Neves Co-developed-by: Samuel Neves Cc: Jean-Philippe Aumasson Cc: Andy Lutomirski Cc: Greg KH Cc: Andrew Morton Cc: Linus Torvalds Cc: kernel-hardening@lists.openwall.com Cc: linux-crypto@vger.kernel.org --- include/zinc/blake2s.h | 56 + lib/zinc/Kconfig | 3 + lib/zinc/Makefile | 3 + lib/zinc/blake2s/blake2s.c | 295 +++++ lib/zinc/selftest/blake2s.c | 2090 +++++++++++++++++++++++++++++++++++ 5 files changed, 2447 insertions(+) create mode 100644 include/zinc/blake2s.h create mode 100644 lib/zinc/blake2s/blake2s.c create mode 100644 lib/zinc/selftest/blake2s.c -- 2.19.0 diff --git a/include/zinc/blake2s.h b/include/zinc/blake2s.h new file mode 100644 index 000000000000..701a08ba47c1 --- /dev/null +++ b/include/zinc/blake2s.h @@ -0,0 +1,56 @@ +/* SPDX-License-Identifier: GPL-2.0 OR MIT */ +/* + * Copyright (C) 2015-2018 Jason A. Donenfeld . All Rights Reserved. + */ + +#ifndef _ZINC_BLAKE2S_H +#define _ZINC_BLAKE2S_H + +#include +#include +#include + +enum blake2s_lengths { + BLAKE2S_BLOCK_SIZE = 64, + BLAKE2S_HASH_SIZE = 32, + BLAKE2S_KEY_SIZE = 32 +}; + +struct blake2s_state { + u32 h[8]; + u32 t[2]; + u32 f[2]; + u8 buf[BLAKE2S_BLOCK_SIZE]; + size_t buflen; + u8 last_node; +}; + +void blake2s_init(struct blake2s_state *state, const size_t outlen); +void blake2s_init_key(struct blake2s_state *state, const size_t outlen, + const void *key, const size_t keylen); +void blake2s_update(struct blake2s_state *state, const u8 *in, size_t inlen); +void blake2s_final(struct blake2s_state *state, u8 *out, const size_t outlen); + +static inline void blake2s(u8 *out, const u8 *in, const u8 *key, + const size_t outlen, const size_t inlen, + const size_t keylen) +{ + struct blake2s_state state; + + WARN_ON(IS_ENABLED(DEBUG) && ((!in && inlen > 0) || !out || !outlen || + outlen > BLAKE2S_HASH_SIZE || keylen > BLAKE2S_KEY_SIZE || + (!key && keylen))); + + if (keylen) + blake2s_init_key(&state, outlen, key, keylen); + else + blake2s_init(&state, outlen); + + blake2s_update(&state, in, inlen); + blake2s_final(&state, out, outlen); +} + +void blake2s_hmac(u8 *out, const u8 *in, const u8 *key, const size_t outlen, + const size_t inlen, const size_t keylen); + +#endif /* _ZINC_BLAKE2S_H */ diff --git a/lib/zinc/Kconfig b/lib/zinc/Kconfig index 765eba3267c9..9fc21f93ee9f 100644 --- a/lib/zinc/Kconfig +++ b/lib/zinc/Kconfig @@ -11,6 +11,9 @@ config ZINC_CHACHA20POLY1305 select ZINC_POLY1305 select CRYPTO_BLKCIPHER +config ZINC_BLAKE2S + tristate + config ZINC_SELFTEST bool "Zinc cryptography library self-tests" help diff --git a/lib/zinc/Makefile b/lib/zinc/Makefile index c31186b491e8..d2ec55c33ef0 100644 --- a/lib/zinc/Makefile +++ b/lib/zinc/Makefile @@ -21,3 +21,6 @@ obj-$(CONFIG_ZINC_POLY1305) += zinc_poly1305.o zinc_chacha20poly1305-y := chacha20poly1305.o obj-$(CONFIG_ZINC_CHACHA20POLY1305) += zinc_chacha20poly1305.o + +zinc_blake2s-y := blake2s/blake2s.o +obj-$(CONFIG_ZINC_BLAKE2S) += zinc_blake2s.o diff --git a/lib/zinc/blake2s/blake2s.c b/lib/zinc/blake2s/blake2s.c new file mode 100644 index 000000000000..58d7e9378bd4 --- /dev/null +++ b/lib/zinc/blake2s/blake2s.c @@ -0,0 +1,295 @@ +// SPDX-License-Identifier: GPL-2.0 OR MIT +/* + * Copyright (C) 2012 Samuel Neves . All Rights Reserved. + * Copyright (C) 2015-2018 Jason A. Donenfeld . All Rights Reserved. + * + * This is an implementation of the BLAKE2s hash and PRF functions. 
diff --git a/lib/zinc/Kconfig b/lib/zinc/Kconfig
index 765eba3267c9..9fc21f93ee9f 100644
--- a/lib/zinc/Kconfig
+++ b/lib/zinc/Kconfig
@@ -11,6 +11,9 @@ config ZINC_CHACHA20POLY1305
 	select ZINC_POLY1305
 	select CRYPTO_BLKCIPHER
 
+config ZINC_BLAKE2S
+	tristate
+
 config ZINC_SELFTEST
 	bool "Zinc cryptography library self-tests"
 	help
diff --git a/lib/zinc/Makefile b/lib/zinc/Makefile
index c31186b491e8..d2ec55c33ef0 100644
--- a/lib/zinc/Makefile
+++ b/lib/zinc/Makefile
@@ -21,3 +21,6 @@ obj-$(CONFIG_ZINC_POLY1305) += zinc_poly1305.o
 
 zinc_chacha20poly1305-y := chacha20poly1305.o
 obj-$(CONFIG_ZINC_CHACHA20POLY1305) += zinc_chacha20poly1305.o
+
+zinc_blake2s-y := blake2s/blake2s.o
+obj-$(CONFIG_ZINC_BLAKE2S) += zinc_blake2s.o
diff --git a/lib/zinc/blake2s/blake2s.c b/lib/zinc/blake2s/blake2s.c
new file mode 100644
index 000000000000..58d7e9378bd4
--- /dev/null
+++ b/lib/zinc/blake2s/blake2s.c
@@ -0,0 +1,295 @@
+// SPDX-License-Identifier: GPL-2.0 OR MIT
+/*
+ * Copyright (C) 2012 Samuel Neves. All Rights Reserved.
+ * Copyright (C) 2015-2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ *
+ * This is an implementation of the BLAKE2s hash and PRF functions.
+ *
+ * Information: https://blake2.net/
+ *
+ */
+
+#include <zinc/blake2s.h>
+#include "../selftest/run.h"
+
+#include <linux/types.h>
+#include <linux/string.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/bug.h>
+#include <asm/unaligned.h>
+
+typedef union {
+	struct {
+		u8 digest_length;
+		u8 key_length;
+		u8 fanout;
+		u8 depth;
+		u32 leaf_length;
+		u32 node_offset;
+		u16 xof_length;
+		u8 node_depth;
+		u8 inner_length;
+		u8 salt[8];
+		u8 personal[8];
+	};
+	__le32 words[8];
+} __packed blake2s_param;
+
+static const u32 blake2s_iv[8] = {
+	0x6A09E667UL, 0xBB67AE85UL, 0x3C6EF372UL, 0xA54FF53AUL,
+	0x510E527FUL, 0x9B05688CUL, 0x1F83D9ABUL, 0x5BE0CD19UL
+};
+
+static const u8 blake2s_sigma[10][16] = {
+	{ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 },
+	{ 14, 10, 4, 8, 9, 15, 13, 6, 1, 12, 0, 2, 11, 7, 5, 3 },
+	{ 11, 8, 12, 0, 5, 2, 15, 13, 10, 14, 3, 6, 7, 1, 9, 4 },
+	{ 7, 9, 3, 1, 13, 12, 11, 14, 2, 6, 5, 10, 4, 0, 15, 8 },
+	{ 9, 0, 5, 7, 2, 4, 10, 15, 14, 1, 11, 12, 6, 8, 3, 13 },
+	{ 2, 12, 6, 10, 0, 11, 8, 3, 4, 13, 7, 5, 15, 14, 1, 9 },
+	{ 12, 5, 1, 15, 14, 13, 4, 10, 0, 7, 6, 3, 9, 2, 8, 11 },
+	{ 13, 11, 7, 14, 12, 1, 3, 9, 5, 0, 15, 4, 8, 6, 2, 10 },
+	{ 6, 15, 14, 9, 11, 3, 0, 8, 12, 2, 13, 7, 1, 4, 10, 5 },
+	{ 10, 2, 8, 4, 7, 6, 1, 5, 15, 11, 9, 14, 3, 12, 13, 0 },
+};
+
+static inline void blake2s_set_lastblock(struct blake2s_state *state)
+{
+	if (state->last_node)
+		state->f[1] = -1;
+	state->f[0] = -1;
+}
+
+static inline void blake2s_increment_counter(struct blake2s_state *state,
+					     const u32 inc)
+{
+	state->t[0] += inc;
+	state->t[1] += (state->t[0] < inc);
+}
+
+static inline void blake2s_init_param(struct blake2s_state *state,
+				      const blake2s_param *param)
+{
+	int i;
+
+	memset(state, 0, sizeof(*state));
+	for (i = 0; i < 8; ++i)
+		state->h[i] = blake2s_iv[i] ^ le32_to_cpu(param->words[i]);
+}
+
+void blake2s_init(struct blake2s_state *state, const size_t outlen)
+{
+	blake2s_param param __aligned(__alignof__(u32)) = {
+		.digest_length = outlen,
+		.fanout = 1,
+		.depth = 1
+	};
+
+	WARN_ON(IS_ENABLED(DEBUG) && (!outlen || outlen > BLAKE2S_HASH_SIZE));
+	blake2s_init_param(state, &param);
+}
+EXPORT_SYMBOL(blake2s_init);
+
+void blake2s_init_key(struct blake2s_state *state, const size_t outlen,
+		      const void *key, const size_t keylen)
+{
+	blake2s_param param = { .digest_length = outlen,
+				.key_length = keylen,
+				.fanout = 1,
+				.depth = 1 };
+	u8 block[BLAKE2S_BLOCK_SIZE] = { 0 };
+
+	WARN_ON(IS_ENABLED(DEBUG) && (!outlen || outlen > BLAKE2S_HASH_SIZE ||
+		!key || !keylen || keylen > BLAKE2S_KEY_SIZE));
+	blake2s_init_param(state, &param);
+	memcpy(block, key, keylen);
+	blake2s_update(state, block, BLAKE2S_BLOCK_SIZE);
+	memzero_explicit(block, BLAKE2S_BLOCK_SIZE);
+}
+EXPORT_SYMBOL(blake2s_init_key);
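[ Commentary, not part of the patch: per the BLAKE2 specification, keying is
implemented by recording keylen in the parameter block and then hashing the
key, zero-padded to a full 64-byte block, as the very first message block;
blake2s_init_key() above does exactly that, and scrubs the padded stack copy
with memzero_explicit() afterwards. Schematically:

    /* What blake2s_init_key(state, outlen, key, keylen) amounts to:
     *
     *   param.key_length    = keylen;  recorded in the XORed param block
     *   block[0..keylen-1]  = key;     the key itself, then
     *   block[keylen..63]   = 0;       zero padding to one full block
     *   blake2s_update(state, block, BLAKE2S_BLOCK_SIZE);
     *   memzero_explicit(block, BLAKE2S_BLOCK_SIZE);
     */
]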
+
+static bool *const blake2s_nobs[] __initconst = { };
+static void __init blake2s_fpu_init(void)
+{
+}
+static inline bool blake2s_compress_arch(struct blake2s_state *state,
+					 const u8 *block, size_t nblocks,
+					 const u32 inc)
+{
+	return false;
+}
+
+static inline void blake2s_compress(struct blake2s_state *state,
+				    const u8 *block, size_t nblocks,
+				    const u32 inc)
+{
+	u32 m[16];
+	u32 v[16];
+	int i;
+
+	WARN_ON(IS_ENABLED(DEBUG) &&
+		(nblocks > 1 && inc != BLAKE2S_BLOCK_SIZE));
+
+	if (blake2s_compress_arch(state, block, nblocks, inc))
+		return;
+
+	while (nblocks > 0) {
+		blake2s_increment_counter(state, inc);
+		memcpy(m, block, BLAKE2S_BLOCK_SIZE);
+		le32_to_cpu_array(m, ARRAY_SIZE(m));
+		memcpy(v, state->h, 32);
+		v[ 8] = blake2s_iv[0];
+		v[ 9] = blake2s_iv[1];
+		v[10] = blake2s_iv[2];
+		v[11] = blake2s_iv[3];
+		v[12] = blake2s_iv[4] ^ state->t[0];
+		v[13] = blake2s_iv[5] ^ state->t[1];
+		v[14] = blake2s_iv[6] ^ state->f[0];
+		v[15] = blake2s_iv[7] ^ state->f[1];
+
+#define G(r, i, a, b, c, d) do { \
+	a += b + m[blake2s_sigma[r][2 * i + 0]]; \
+	d = ror32(d ^ a, 16); \
+	c += d; \
+	b = ror32(b ^ c, 12); \
+	a += b + m[blake2s_sigma[r][2 * i + 1]]; \
+	d = ror32(d ^ a, 8); \
+	c += d; \
+	b = ror32(b ^ c, 7); \
+} while (0)
+
+#define ROUND(r) do { \
+	G(r, 0, v[0], v[ 4], v[ 8], v[12]); \
+	G(r, 1, v[1], v[ 5], v[ 9], v[13]); \
+	G(r, 2, v[2], v[ 6], v[10], v[14]); \
+	G(r, 3, v[3], v[ 7], v[11], v[15]); \
+	G(r, 4, v[0], v[ 5], v[10], v[15]); \
+	G(r, 5, v[1], v[ 6], v[11], v[12]); \
+	G(r, 6, v[2], v[ 7], v[ 8], v[13]); \
+	G(r, 7, v[3], v[ 4], v[ 9], v[14]); \
+} while (0)
+		ROUND(0);
+		ROUND(1);
+		ROUND(2);
+		ROUND(3);
+		ROUND(4);
+		ROUND(5);
+		ROUND(6);
+		ROUND(7);
+		ROUND(8);
+		ROUND(9);
+
+#undef G
+#undef ROUND
+
+		for (i = 0; i < 8; ++i)
+			state->h[i] ^= v[i] ^ v[i + 8];
+
+		block += BLAKE2S_BLOCK_SIZE;
+		--nblocks;
+	}
+}
+
+void blake2s_update(struct blake2s_state *state, const u8 *in, size_t inlen)
+{
+	const size_t fill = BLAKE2S_BLOCK_SIZE - state->buflen;
+
+	if (unlikely(!inlen))
+		return;
+	if (inlen > fill) {
+		memcpy(state->buf + state->buflen, in, fill);
+		blake2s_compress(state, state->buf, 1, BLAKE2S_BLOCK_SIZE);
+		state->buflen = 0;
+		in += fill;
+		inlen -= fill;
+	}
+	if (inlen > BLAKE2S_BLOCK_SIZE) {
+		const size_t nblocks =
+			(inlen + BLAKE2S_BLOCK_SIZE - 1) / BLAKE2S_BLOCK_SIZE;
+		/* Hash one less (full) block than strictly possible */
+		blake2s_compress(state, in, nblocks - 1, BLAKE2S_BLOCK_SIZE);
+		in += BLAKE2S_BLOCK_SIZE * (nblocks - 1);
+		inlen -= BLAKE2S_BLOCK_SIZE * (nblocks - 1);
+	}
+	memcpy(state->buf + state->buflen, in, inlen);
+	state->buflen += inlen;
+}
+EXPORT_SYMBOL(blake2s_update);
+
+void blake2s_final(struct blake2s_state *state, u8 *out, const size_t outlen)
+{
+	WARN_ON(IS_ENABLED(DEBUG) &&
+		(!out || !outlen || outlen > BLAKE2S_HASH_SIZE));
+	blake2s_set_lastblock(state);
+	memset(state->buf + state->buflen, 0,
+	       BLAKE2S_BLOCK_SIZE - state->buflen); /* Padding */
+	blake2s_compress(state, state->buf, 1, state->buflen);
+	cpu_to_le32_array(state->h, ARRAY_SIZE(state->h));
+	memcpy(out, state->h, outlen);
+	memzero_explicit(state, sizeof(*state));
+}
+EXPORT_SYMBOL(blake2s_final);
+
+void blake2s_hmac(u8 *out, const u8 *in, const u8 *key, const size_t outlen,
+		  const size_t inlen, const size_t keylen)
+{
+	struct blake2s_state state;
+	u8 x_key[BLAKE2S_BLOCK_SIZE] __aligned(__alignof__(u32)) = { 0 };
+	u8 i_hash[BLAKE2S_HASH_SIZE] __aligned(__alignof__(u32));
+	int i;
+
+	if (keylen > BLAKE2S_BLOCK_SIZE) {
+		blake2s_init(&state, BLAKE2S_HASH_SIZE);
+		blake2s_update(&state, key, keylen);
+		blake2s_final(&state, x_key, BLAKE2S_HASH_SIZE);
+	} else
+		memcpy(x_key, key, keylen);
+
+	for (i = 0; i < BLAKE2S_BLOCK_SIZE; ++i)
+		x_key[i] ^= 0x36;
+
+	blake2s_init(&state, BLAKE2S_HASH_SIZE);
+	blake2s_update(&state, x_key, BLAKE2S_BLOCK_SIZE);
+	blake2s_update(&state, in, inlen);
+	blake2s_final(&state, i_hash, BLAKE2S_HASH_SIZE);
+
+	for (i = 0; i < BLAKE2S_BLOCK_SIZE; ++i)
+		x_key[i] ^= 0x5c ^ 0x36;
+
+	blake2s_init(&state, BLAKE2S_HASH_SIZE);
+	blake2s_update(&state, x_key, BLAKE2S_BLOCK_SIZE);
+	blake2s_update(&state, i_hash, BLAKE2S_HASH_SIZE);
+	blake2s_final(&state, i_hash, BLAKE2S_HASH_SIZE);
+
+	memcpy(out, i_hash, outlen);
+	memzero_explicit(x_key, BLAKE2S_BLOCK_SIZE);
+	memzero_explicit(i_hash, BLAKE2S_HASH_SIZE);
+}
+EXPORT_SYMBOL(blake2s_hmac);
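[ Commentary, not part of the patch: blake2s_hmac() above is the standard
HMAC construction, H((K ^ opad) || H((K ^ ipad) || m)), instantiated with
BLAKE2s. The second whitening loop XORs with 0x5c ^ 0x36 rather than 0x5c
because at that point x_key already holds K ^ 0x36, so one further XOR with
(0x5c ^ 0x36) turns it into K ^ 0x5c in place. A tiny standalone check of
that identity:

    #include <assert.h>
    #include <stdint.h>

    int main(void)
    {
            uint8_t k = 0xab;          /* an arbitrary key byte */
            uint8_t x = k ^ 0x36;      /* state after the ipad loop */

            x ^= 0x5c ^ 0x36;          /* the in-place re-whitening trick */
            assert(x == (k ^ 0x5c));   /* exactly the opad-whitened key */
            return 0;
    }
]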
"../selftest/blake2s.c" + +static bool nosimd __initdata = false; + +static int __init mod_init(void) +{ + if (!nosimd) + blake2s_fpu_init(); + if (!selftest_run("blake2s", blake2s_selftest, blake2s_nobs, + ARRAY_SIZE(blake2s_nobs))) + return -ENOTRECOVERABLE; + return 0; +} + +static void __exit mod_exit(void) +{ +} + +module_param(nosimd, bool, 0); +module_init(mod_init); +module_exit(mod_exit); +MODULE_LICENSE("GPL v2"); +MODULE_DESCRIPTION("BLAKE2s hash function"); +MODULE_AUTHOR("Jason A. Donenfeld "); diff --git a/lib/zinc/selftest/blake2s.c b/lib/zinc/selftest/blake2s.c new file mode 100644 index 000000000000..7325a42334aa --- /dev/null +++ b/lib/zinc/selftest/blake2s.c @@ -0,0 +1,2090 @@ +// SPDX-License-Identifier: GPL-2.0 OR MIT +/* + * Copyright (C) 2015-2018 Jason A. Donenfeld . All Rights Reserved. + */ + +static const u8 blake2s_testvecs[][BLAKE2S_HASH_SIZE] __initconst = { + { 0x69, 0x21, 0x7a, 0x30, 0x79, 0x90, 0x80, 0x94, + 0xe1, 0x11, 0x21, 0xd0, 0x42, 0x35, 0x4a, 0x7c, + 0x1f, 0x55, 0xb6, 0x48, 0x2c, 0xa1, 0xa5, 0x1e, + 0x1b, 0x25, 0x0d, 0xfd, 0x1e, 0xd0, 0xee, 0xf9 }, + { 0xe3, 0x4d, 0x74, 0xdb, 0xaf, 0x4f, 0xf4, 0xc6, + 0xab, 0xd8, 0x71, 0xcc, 0x22, 0x04, 0x51, 0xd2, + 0xea, 0x26, 0x48, 0x84, 0x6c, 0x77, 0x57, 0xfb, + 0xaa, 0xc8, 0x2f, 0xe5, 0x1a, 0xd6, 0x4b, 0xea }, + { 0xdd, 0xad, 0x9a, 0xb1, 0x5d, 0xac, 0x45, 0x49, + 0xba, 0x42, 0xf4, 0x9d, 0x26, 0x24, 0x96, 0xbe, + 0xf6, 0xc0, 0xba, 0xe1, 0xdd, 0x34, 0x2a, 0x88, + 0x08, 0xf8, 0xea, 0x26, 0x7c, 0x6e, 0x21, 0x0c }, + { 0xe8, 0xf9, 0x1c, 0x6e, 0xf2, 0x32, 0xa0, 0x41, + 0x45, 0x2a, 0xb0, 0xe1, 0x49, 0x07, 0x0c, 0xdd, + 0x7d, 0xd1, 0x76, 0x9e, 0x75, 0xb3, 0xa5, 0x92, + 0x1b, 0xe3, 0x78, 0x76, 0xc4, 0x5c, 0x99, 0x00 }, + { 0x0c, 0xc7, 0x0e, 0x00, 0x34, 0x8b, 0x86, 0xba, + 0x29, 0x44, 0xd0, 0xc3, 0x20, 0x38, 0xb2, 0x5c, + 0x55, 0x58, 0x4f, 0x90, 0xdf, 0x23, 0x04, 0xf5, + 0x5f, 0xa3, 0x32, 0xaf, 0x5f, 0xb0, 0x1e, 0x20 }, + { 0xec, 0x19, 0x64, 0x19, 0x10, 0x87, 0xa4, 0xfe, + 0x9d, 0xf1, 0xc7, 0x95, 0x34, 0x2a, 0x02, 0xff, + 0xc1, 0x91, 0xa5, 0xb2, 0x51, 0x76, 0x48, 0x56, + 0xae, 0x5b, 0x8b, 0x57, 0x69, 0xf0, 0xc6, 0xcd }, + { 0xe1, 0xfa, 0x51, 0x61, 0x8d, 0x7d, 0xf4, 0xeb, + 0x70, 0xcf, 0x0d, 0x5a, 0x9e, 0x90, 0x6f, 0x80, + 0x6e, 0x9d, 0x19, 0xf7, 0xf4, 0xf0, 0x1e, 0x3b, + 0x62, 0x12, 0x88, 0xe4, 0x12, 0x04, 0x05, 0xd6 }, + { 0x59, 0x80, 0x01, 0xfa, 0xfb, 0xe8, 0xf9, 0x4e, + 0xc6, 0x6d, 0xc8, 0x27, 0xd0, 0x12, 0xcf, 0xcb, + 0xba, 0x22, 0x28, 0x56, 0x9f, 0x44, 0x8e, 0x89, + 0xea, 0x22, 0x08, 0xc8, 0xbf, 0x76, 0x92, 0x93 }, + { 0xc7, 0xe8, 0x87, 0xb5, 0x46, 0x62, 0x36, 0x35, + 0xe9, 0x3e, 0x04, 0x95, 0x59, 0x8f, 0x17, 0x26, + 0x82, 0x19, 0x96, 0xc2, 0x37, 0x77, 0x05, 0xb9, + 0x3a, 0x1f, 0x63, 0x6f, 0x87, 0x2b, 0xfa, 0x2d }, + { 0xc3, 0x15, 0xa4, 0x37, 0xdd, 0x28, 0x06, 0x2a, + 0x77, 0x0d, 0x48, 0x19, 0x67, 0x13, 0x6b, 0x1b, + 0x5e, 0xb8, 0x8b, 0x21, 0xee, 0x53, 0xd0, 0x32, + 0x9c, 0x58, 0x97, 0x12, 0x6e, 0x9d, 0xb0, 0x2c }, + { 0xbb, 0x47, 0x3d, 0xed, 0xdc, 0x05, 0x5f, 0xea, + 0x62, 0x28, 0xf2, 0x07, 0xda, 0x57, 0x53, 0x47, + 0xbb, 0x00, 0x40, 0x4c, 0xd3, 0x49, 0xd3, 0x8c, + 0x18, 0x02, 0x63, 0x07, 0xa2, 0x24, 0xcb, 0xff }, + { 0x68, 0x7e, 0x18, 0x73, 0xa8, 0x27, 0x75, 0x91, + 0xbb, 0x33, 0xd9, 0xad, 0xf9, 0xa1, 0x39, 0x12, + 0xef, 0xef, 0xe5, 0x57, 0xca, 0xfc, 0x39, 0xa7, + 0x95, 0x26, 0x23, 0xe4, 0x72, 0x55, 0xf1, 0x6d }, + { 0x1a, 0xc7, 0xba, 0x75, 0x4d, 0x6e, 0x2f, 0x94, + 0xe0, 0xe8, 0x6c, 0x46, 0xbf, 0xb2, 0x62, 0xab, + 0xbb, 0x74, 0xf4, 0x50, 0xef, 0x45, 0x6d, 0x6b, + 0x4d, 0x97, 0xaa, 0x80, 0xce, 0x6d, 0xa7, 0x67 }, + 
{ 0x01, 0x2c, 0x97, 0x80, 0x96, 0x14, 0x81, 0x6b, + 0x5d, 0x94, 0x94, 0x47, 0x7d, 0x4b, 0x68, 0x7d, + 0x15, 0xb9, 0x6e, 0xb6, 0x9c, 0x0e, 0x80, 0x74, + 0xa8, 0x51, 0x6f, 0x31, 0x22, 0x4b, 0x5c, 0x98 }, + { 0x91, 0xff, 0xd2, 0x6c, 0xfa, 0x4d, 0xa5, 0x13, + 0x4c, 0x7e, 0xa2, 0x62, 0xf7, 0x88, 0x9c, 0x32, + 0x9f, 0x61, 0xf6, 0xa6, 0x57, 0x22, 0x5c, 0xc2, + 0x12, 0xf4, 0x00, 0x56, 0xd9, 0x86, 0xb3, 0xf4 }, + { 0xd9, 0x7c, 0x82, 0x8d, 0x81, 0x82, 0xa7, 0x21, + 0x80, 0xa0, 0x6a, 0x78, 0x26, 0x83, 0x30, 0x67, + 0x3f, 0x7c, 0x4e, 0x06, 0x35, 0x94, 0x7c, 0x04, + 0xc0, 0x23, 0x23, 0xfd, 0x45, 0xc0, 0xa5, 0x2d }, + { 0xef, 0xc0, 0x4c, 0xdc, 0x39, 0x1c, 0x7e, 0x91, + 0x19, 0xbd, 0x38, 0x66, 0x8a, 0x53, 0x4e, 0x65, + 0xfe, 0x31, 0x03, 0x6d, 0x6a, 0x62, 0x11, 0x2e, + 0x44, 0xeb, 0xeb, 0x11, 0xf9, 0xc5, 0x70, 0x80 }, + { 0x99, 0x2c, 0xf5, 0xc0, 0x53, 0x44, 0x2a, 0x5f, + 0xbc, 0x4f, 0xaf, 0x58, 0x3e, 0x04, 0xe5, 0x0b, + 0xb7, 0x0d, 0x2f, 0x39, 0xfb, 0xb6, 0xa5, 0x03, + 0xf8, 0x9e, 0x56, 0xa6, 0x3e, 0x18, 0x57, 0x8a }, + { 0x38, 0x64, 0x0e, 0x9f, 0x21, 0x98, 0x3e, 0x67, + 0xb5, 0x39, 0xca, 0xcc, 0xae, 0x5e, 0xcf, 0x61, + 0x5a, 0xe2, 0x76, 0x4f, 0x75, 0xa0, 0x9c, 0x9c, + 0x59, 0xb7, 0x64, 0x83, 0xc1, 0xfb, 0xc7, 0x35 }, + { 0x21, 0x3d, 0xd3, 0x4c, 0x7e, 0xfe, 0x4f, 0xb2, + 0x7a, 0x6b, 0x35, 0xf6, 0xb4, 0x00, 0x0d, 0x1f, + 0xe0, 0x32, 0x81, 0xaf, 0x3c, 0x72, 0x3e, 0x5c, + 0x9f, 0x94, 0x74, 0x7a, 0x5f, 0x31, 0xcd, 0x3b }, + { 0xec, 0x24, 0x6e, 0xee, 0xb9, 0xce, 0xd3, 0xf7, + 0xad, 0x33, 0xed, 0x28, 0x66, 0x0d, 0xd9, 0xbb, + 0x07, 0x32, 0x51, 0x3d, 0xb4, 0xe2, 0xfa, 0x27, + 0x8b, 0x60, 0xcd, 0xe3, 0x68, 0x2a, 0x4c, 0xcd }, + { 0xac, 0x9b, 0x61, 0xd4, 0x46, 0x64, 0x8c, 0x30, + 0x05, 0xd7, 0x89, 0x2b, 0xf3, 0xa8, 0x71, 0x9f, + 0x4c, 0x81, 0x81, 0xcf, 0xdc, 0xbc, 0x2b, 0x79, + 0xfe, 0xf1, 0x0a, 0x27, 0x9b, 0x91, 0x10, 0x95 }, + { 0x7b, 0xf8, 0xb2, 0x29, 0x59, 0xe3, 0x4e, 0x3a, + 0x43, 0xf7, 0x07, 0x92, 0x23, 0xe8, 0x3a, 0x97, + 0x54, 0x61, 0x7d, 0x39, 0x1e, 0x21, 0x3d, 0xfd, + 0x80, 0x8e, 0x41, 0xb9, 0xbe, 0xad, 0x4c, 0xe7 }, + { 0x68, 0xd4, 0xb5, 0xd4, 0xfa, 0x0e, 0x30, 0x2b, + 0x64, 0xcc, 0xc5, 0xaf, 0x79, 0x29, 0x13, 0xac, + 0x4c, 0x88, 0xec, 0x95, 0xc0, 0x7d, 0xdf, 0x40, + 0x69, 0x42, 0x56, 0xeb, 0x88, 0xce, 0x9f, 0x3d }, + { 0xb2, 0xc2, 0x42, 0x0f, 0x05, 0xf9, 0xab, 0xe3, + 0x63, 0x15, 0x91, 0x93, 0x36, 0xb3, 0x7e, 0x4e, + 0x0f, 0xa3, 0x3f, 0xf7, 0xe7, 0x6a, 0x49, 0x27, + 0x67, 0x00, 0x6f, 0xdb, 0x5d, 0x93, 0x54, 0x62 }, + { 0x13, 0x4f, 0x61, 0xbb, 0xd0, 0xbb, 0xb6, 0x9a, + 0xed, 0x53, 0x43, 0x90, 0x45, 0x51, 0xa3, 0xe6, + 0xc1, 0xaa, 0x7d, 0xcd, 0xd7, 0x7e, 0x90, 0x3e, + 0x70, 0x23, 0xeb, 0x7c, 0x60, 0x32, 0x0a, 0xa7 }, + { 0x46, 0x93, 0xf9, 0xbf, 0xf7, 0xd4, 0xf3, 0x98, + 0x6a, 0x7d, 0x17, 0x6e, 0x6e, 0x06, 0xf7, 0x2a, + 0xd1, 0x49, 0x0d, 0x80, 0x5c, 0x99, 0xe2, 0x53, + 0x47, 0xb8, 0xde, 0x77, 0xb4, 0xdb, 0x6d, 0x9b }, + { 0x85, 0x3e, 0x26, 0xf7, 0x41, 0x95, 0x3b, 0x0f, + 0xd5, 0xbd, 0xb4, 0x24, 0xe8, 0xab, 0x9e, 0x8b, + 0x37, 0x50, 0xea, 0xa8, 0xef, 0x61, 0xe4, 0x79, + 0x02, 0xc9, 0x1e, 0x55, 0x4e, 0x9c, 0x73, 0xb9 }, + { 0xf7, 0xde, 0x53, 0x63, 0x61, 0xab, 0xaa, 0x0e, + 0x15, 0x81, 0x56, 0xcf, 0x0e, 0xa4, 0xf6, 0x3a, + 0x99, 0xb5, 0xe4, 0x05, 0x4f, 0x8f, 0xa4, 0xc9, + 0xd4, 0x5f, 0x62, 0x85, 0xca, 0xd5, 0x56, 0x94 }, + { 0x4c, 0x23, 0x06, 0x08, 0x86, 0x0a, 0x99, 0xae, + 0x8d, 0x7b, 0xd5, 0xc2, 0xcc, 0x17, 0xfa, 0x52, + 0x09, 0x6b, 0x9a, 0x61, 0xbe, 0xdb, 0x17, 0xcb, + 0x76, 0x17, 0x86, 0x4a, 0xd2, 0x9c, 0xa7, 0xa6 }, + { 0xae, 0xb9, 0x20, 0xea, 0x87, 0x95, 0x2d, 0xad, + 0xb1, 0xfb, 0x75, 0x92, 0x91, 
0xe3, 0x38, 0x81, + 0x39, 0xa8, 0x72, 0x86, 0x50, 0x01, 0x88, 0x6e, + 0xd8, 0x47, 0x52, 0xe9, 0x3c, 0x25, 0x0c, 0x2a }, + { 0xab, 0xa4, 0xad, 0x9b, 0x48, 0x0b, 0x9d, 0xf3, + 0xd0, 0x8c, 0xa5, 0xe8, 0x7b, 0x0c, 0x24, 0x40, + 0xd4, 0xe4, 0xea, 0x21, 0x22, 0x4c, 0x2e, 0xb4, + 0x2c, 0xba, 0xe4, 0x69, 0xd0, 0x89, 0xb9, 0x31 }, + { 0x05, 0x82, 0x56, 0x07, 0xd7, 0xfd, 0xf2, 0xd8, + 0x2e, 0xf4, 0xc3, 0xc8, 0xc2, 0xae, 0xa9, 0x61, + 0xad, 0x98, 0xd6, 0x0e, 0xdf, 0xf7, 0xd0, 0x18, + 0x98, 0x3e, 0x21, 0x20, 0x4c, 0x0d, 0x93, 0xd1 }, + { 0xa7, 0x42, 0xf8, 0xb6, 0xaf, 0x82, 0xd8, 0xa6, + 0xca, 0x23, 0x57, 0xc5, 0xf1, 0xcf, 0x91, 0xde, + 0xfb, 0xd0, 0x66, 0x26, 0x7d, 0x75, 0xc0, 0x48, + 0xb3, 0x52, 0x36, 0x65, 0x85, 0x02, 0x59, 0x62 }, + { 0x2b, 0xca, 0xc8, 0x95, 0x99, 0x00, 0x0b, 0x42, + 0xc9, 0x5a, 0xe2, 0x38, 0x35, 0xa7, 0x13, 0x70, + 0x4e, 0xd7, 0x97, 0x89, 0xc8, 0x4f, 0xef, 0x14, + 0x9a, 0x87, 0x4f, 0xf7, 0x33, 0xf0, 0x17, 0xa2 }, + { 0xac, 0x1e, 0xd0, 0x7d, 0x04, 0x8f, 0x10, 0x5a, + 0x9e, 0x5b, 0x7a, 0xb8, 0x5b, 0x09, 0xa4, 0x92, + 0xd5, 0xba, 0xff, 0x14, 0xb8, 0xbf, 0xb0, 0xe9, + 0xfd, 0x78, 0x94, 0x86, 0xee, 0xa2, 0xb9, 0x74 }, + { 0xe4, 0x8d, 0x0e, 0xcf, 0xaf, 0x49, 0x7d, 0x5b, + 0x27, 0xc2, 0x5d, 0x99, 0xe1, 0x56, 0xcb, 0x05, + 0x79, 0xd4, 0x40, 0xd6, 0xe3, 0x1f, 0xb6, 0x24, + 0x73, 0x69, 0x6d, 0xbf, 0x95, 0xe0, 0x10, 0xe4 }, + { 0x12, 0xa9, 0x1f, 0xad, 0xf8, 0xb2, 0x16, 0x44, + 0xfd, 0x0f, 0x93, 0x4f, 0x3c, 0x4a, 0x8f, 0x62, + 0xba, 0x86, 0x2f, 0xfd, 0x20, 0xe8, 0xe9, 0x61, + 0x15, 0x4c, 0x15, 0xc1, 0x38, 0x84, 0xed, 0x3d }, + { 0x7c, 0xbe, 0xe9, 0x6e, 0x13, 0x98, 0x97, 0xdc, + 0x98, 0xfb, 0xef, 0x3b, 0xe8, 0x1a, 0xd4, 0xd9, + 0x64, 0xd2, 0x35, 0xcb, 0x12, 0x14, 0x1f, 0xb6, + 0x67, 0x27, 0xe6, 0xe5, 0xdf, 0x73, 0xa8, 0x78 }, + { 0xeb, 0xf6, 0x6a, 0xbb, 0x59, 0x7a, 0xe5, 0x72, + 0xa7, 0x29, 0x7c, 0xb0, 0x87, 0x1e, 0x35, 0x5a, + 0xcc, 0xaf, 0xad, 0x83, 0x77, 0xb8, 0xe7, 0x8b, + 0xf1, 0x64, 0xce, 0x2a, 0x18, 0xde, 0x4b, 0xaf }, + { 0x71, 0xb9, 0x33, 0xb0, 0x7e, 0x4f, 0xf7, 0x81, + 0x8c, 0xe0, 0x59, 0xd0, 0x08, 0x82, 0x9e, 0x45, + 0x3c, 0x6f, 0xf0, 0x2e, 0xc0, 0xa7, 0xdb, 0x39, + 0x3f, 0xc2, 0xd8, 0x70, 0xf3, 0x7a, 0x72, 0x86 }, + { 0x7c, 0xf7, 0xc5, 0x13, 0x31, 0x22, 0x0b, 0x8d, + 0x3e, 0xba, 0xed, 0x9c, 0x29, 0x39, 0x8a, 0x16, + 0xd9, 0x81, 0x56, 0xe2, 0x61, 0x3c, 0xb0, 0x88, + 0xf2, 0xb0, 0xe0, 0x8a, 0x1b, 0xe4, 0xcf, 0x4f }, + { 0x3e, 0x41, 0xa1, 0x08, 0xe0, 0xf6, 0x4a, 0xd2, + 0x76, 0xb9, 0x79, 0xe1, 0xce, 0x06, 0x82, 0x79, + 0xe1, 0x6f, 0x7b, 0xc7, 0xe4, 0xaa, 0x1d, 0x21, + 0x1e, 0x17, 0xb8, 0x11, 0x61, 0xdf, 0x16, 0x02 }, + { 0x88, 0x65, 0x02, 0xa8, 0x2a, 0xb4, 0x7b, 0xa8, + 0xd8, 0x67, 0x10, 0xaa, 0x9d, 0xe3, 0xd4, 0x6e, + 0xa6, 0x5c, 0x47, 0xaf, 0x6e, 0xe8, 0xde, 0x45, + 0x0c, 0xce, 0xb8, 0xb1, 0x1b, 0x04, 0x5f, 0x50 }, + { 0xc0, 0x21, 0xbc, 0x5f, 0x09, 0x54, 0xfe, 0xe9, + 0x4f, 0x46, 0xea, 0x09, 0x48, 0x7e, 0x10, 0xa8, + 0x48, 0x40, 0xd0, 0x2f, 0x64, 0x81, 0x0b, 0xc0, + 0x8d, 0x9e, 0x55, 0x1f, 0x7d, 0x41, 0x68, 0x14 }, + { 0x20, 0x30, 0x51, 0x6e, 0x8a, 0x5f, 0xe1, 0x9a, + 0xe7, 0x9c, 0x33, 0x6f, 0xce, 0x26, 0x38, 0x2a, + 0x74, 0x9d, 0x3f, 0xd0, 0xec, 0x91, 0xe5, 0x37, + 0xd4, 0xbd, 0x23, 0x58, 0xc1, 0x2d, 0xfb, 0x22 }, + { 0x55, 0x66, 0x98, 0xda, 0xc8, 0x31, 0x7f, 0xd3, + 0x6d, 0xfb, 0xdf, 0x25, 0xa7, 0x9c, 0xb1, 0x12, + 0xd5, 0x42, 0x58, 0x60, 0x60, 0x5c, 0xba, 0xf5, + 0x07, 0xf2, 0x3b, 0xf7, 0xe9, 0xf4, 0x2a, 0xfe }, + { 0x2f, 0x86, 0x7b, 0xa6, 0x77, 0x73, 0xfd, 0xc3, + 0xe9, 0x2f, 0xce, 0xd9, 0x9a, 0x64, 0x09, 0xad, + 0x39, 0xd0, 0xb8, 0x80, 0xfd, 0xe8, 0xf1, 0x09, + 0xa8, 0x17, 
0x30, 0xc4, 0x45, 0x1d, 0x01, 0x78 }, + { 0x17, 0x2e, 0xc2, 0x18, 0xf1, 0x19, 0xdf, 0xae, + 0x98, 0x89, 0x6d, 0xff, 0x29, 0xdd, 0x98, 0x76, + 0xc9, 0x4a, 0xf8, 0x74, 0x17, 0xf9, 0xae, 0x4c, + 0x70, 0x14, 0xbb, 0x4e, 0x4b, 0x96, 0xaf, 0xc7 }, + { 0x3f, 0x85, 0x81, 0x4a, 0x18, 0x19, 0x5f, 0x87, + 0x9a, 0xa9, 0x62, 0xf9, 0x5d, 0x26, 0xbd, 0x82, + 0xa2, 0x78, 0xf2, 0xb8, 0x23, 0x20, 0x21, 0x8f, + 0x6b, 0x3b, 0xd6, 0xf7, 0xf6, 0x67, 0xa6, 0xd9 }, + { 0x1b, 0x61, 0x8f, 0xba, 0xa5, 0x66, 0xb3, 0xd4, + 0x98, 0xc1, 0x2e, 0x98, 0x2c, 0x9e, 0xc5, 0x2e, + 0x4d, 0xa8, 0x5a, 0x8c, 0x54, 0xf3, 0x8f, 0x34, + 0xc0, 0x90, 0x39, 0x4f, 0x23, 0xc1, 0x84, 0xc1 }, + { 0x0c, 0x75, 0x8f, 0xb5, 0x69, 0x2f, 0xfd, 0x41, + 0xa3, 0x57, 0x5d, 0x0a, 0xf0, 0x0c, 0xc7, 0xfb, + 0xf2, 0xcb, 0xe5, 0x90, 0x5a, 0x58, 0x32, 0x3a, + 0x88, 0xae, 0x42, 0x44, 0xf6, 0xe4, 0xc9, 0x93 }, + { 0xa9, 0x31, 0x36, 0x0c, 0xad, 0x62, 0x8c, 0x7f, + 0x12, 0xa6, 0xc1, 0xc4, 0xb7, 0x53, 0xb0, 0xf4, + 0x06, 0x2a, 0xef, 0x3c, 0xe6, 0x5a, 0x1a, 0xe3, + 0xf1, 0x93, 0x69, 0xda, 0xdf, 0x3a, 0xe2, 0x3d }, + { 0xcb, 0xac, 0x7d, 0x77, 0x3b, 0x1e, 0x3b, 0x3c, + 0x66, 0x91, 0xd7, 0xab, 0xb7, 0xe9, 0xdf, 0x04, + 0x5c, 0x8b, 0xa1, 0x92, 0x68, 0xde, 0xd1, 0x53, + 0x20, 0x7f, 0x5e, 0x80, 0x43, 0x52, 0xec, 0x5d }, + { 0x23, 0xa1, 0x96, 0xd3, 0x80, 0x2e, 0xd3, 0xc1, + 0xb3, 0x84, 0x01, 0x9a, 0x82, 0x32, 0x58, 0x40, + 0xd3, 0x2f, 0x71, 0x95, 0x0c, 0x45, 0x80, 0xb0, + 0x34, 0x45, 0xe0, 0x89, 0x8e, 0x14, 0x05, 0x3c }, + { 0xf4, 0x49, 0x54, 0x70, 0xf2, 0x26, 0xc8, 0xc2, + 0x14, 0xbe, 0x08, 0xfd, 0xfa, 0xd4, 0xbc, 0x4a, + 0x2a, 0x9d, 0xbe, 0xa9, 0x13, 0x6a, 0x21, 0x0d, + 0xf0, 0xd4, 0xb6, 0x49, 0x29, 0xe6, 0xfc, 0x14 }, + { 0xe2, 0x90, 0xdd, 0x27, 0x0b, 0x46, 0x7f, 0x34, + 0xab, 0x1c, 0x00, 0x2d, 0x34, 0x0f, 0xa0, 0x16, + 0x25, 0x7f, 0xf1, 0x9e, 0x58, 0x33, 0xfd, 0xbb, + 0xf2, 0xcb, 0x40, 0x1c, 0x3b, 0x28, 0x17, 0xde }, + { 0x9f, 0xc7, 0xb5, 0xde, 0xd3, 0xc1, 0x50, 0x42, + 0xb2, 0xa6, 0x58, 0x2d, 0xc3, 0x9b, 0xe0, 0x16, + 0xd2, 0x4a, 0x68, 0x2d, 0x5e, 0x61, 0xad, 0x1e, + 0xff, 0x9c, 0x63, 0x30, 0x98, 0x48, 0xf7, 0x06 }, + { 0x8c, 0xca, 0x67, 0xa3, 0x6d, 0x17, 0xd5, 0xe6, + 0x34, 0x1c, 0xb5, 0x92, 0xfd, 0x7b, 0xef, 0x99, + 0x26, 0xc9, 0xe3, 0xaa, 0x10, 0x27, 0xea, 0x11, + 0xa7, 0xd8, 0xbd, 0x26, 0x0b, 0x57, 0x6e, 0x04 }, + { 0x40, 0x93, 0x92, 0xf5, 0x60, 0xf8, 0x68, 0x31, + 0xda, 0x43, 0x73, 0xee, 0x5e, 0x00, 0x74, 0x26, + 0x05, 0x95, 0xd7, 0xbc, 0x24, 0x18, 0x3b, 0x60, + 0xed, 0x70, 0x0d, 0x45, 0x83, 0xd3, 0xf6, 0xf0 }, + { 0x28, 0x02, 0x16, 0x5d, 0xe0, 0x90, 0x91, 0x55, + 0x46, 0xf3, 0x39, 0x8c, 0xd8, 0x49, 0x16, 0x4a, + 0x19, 0xf9, 0x2a, 0xdb, 0xc3, 0x61, 0xad, 0xc9, + 0x9b, 0x0f, 0x20, 0xc8, 0xea, 0x07, 0x10, 0x54 }, + { 0xad, 0x83, 0x91, 0x68, 0xd9, 0xf8, 0xa4, 0xbe, + 0x95, 0xba, 0x9e, 0xf9, 0xa6, 0x92, 0xf0, 0x72, + 0x56, 0xae, 0x43, 0xfe, 0x6f, 0x98, 0x64, 0xe2, + 0x90, 0x69, 0x1b, 0x02, 0x56, 0xce, 0x50, 0xa9 }, + { 0x75, 0xfd, 0xaa, 0x50, 0x38, 0xc2, 0x84, 0xb8, + 0x6d, 0x6e, 0x8a, 0xff, 0xe8, 0xb2, 0x80, 0x7e, + 0x46, 0x7b, 0x86, 0x60, 0x0e, 0x79, 0xaf, 0x36, + 0x89, 0xfb, 0xc0, 0x63, 0x28, 0xcb, 0xf8, 0x94 }, + { 0xe5, 0x7c, 0xb7, 0x94, 0x87, 0xdd, 0x57, 0x90, + 0x24, 0x32, 0xb2, 0x50, 0x73, 0x38, 0x13, 0xbd, + 0x96, 0xa8, 0x4e, 0xfc, 0xe5, 0x9f, 0x65, 0x0f, + 0xac, 0x26, 0xe6, 0x69, 0x6a, 0xef, 0xaf, 0xc3 }, + { 0x56, 0xf3, 0x4e, 0x8b, 0x96, 0x55, 0x7e, 0x90, + 0xc1, 0xf2, 0x4b, 0x52, 0xd0, 0xc8, 0x9d, 0x51, + 0x08, 0x6a, 0xcf, 0x1b, 0x00, 0xf6, 0x34, 0xcf, + 0x1d, 0xde, 0x92, 0x33, 0xb8, 0xea, 0xaa, 0x3e }, + { 0x1b, 0x53, 0xee, 0x94, 0xaa, 0xf3, 0x4e, 
0x4b, + 0x15, 0x9d, 0x48, 0xde, 0x35, 0x2c, 0x7f, 0x06, + 0x61, 0xd0, 0xa4, 0x0e, 0xdf, 0xf9, 0x5a, 0x0b, + 0x16, 0x39, 0xb4, 0x09, 0x0e, 0x97, 0x44, 0x72 }, + { 0x05, 0x70, 0x5e, 0x2a, 0x81, 0x75, 0x7c, 0x14, + 0xbd, 0x38, 0x3e, 0xa9, 0x8d, 0xda, 0x54, 0x4e, + 0xb1, 0x0e, 0x6b, 0xc0, 0x7b, 0xae, 0x43, 0x5e, + 0x25, 0x18, 0xdb, 0xe1, 0x33, 0x52, 0x53, 0x75 }, + { 0xd8, 0xb2, 0x86, 0x6e, 0x8a, 0x30, 0x9d, 0xb5, + 0x3e, 0x52, 0x9e, 0xc3, 0x29, 0x11, 0xd8, 0x2f, + 0x5c, 0xa1, 0x6c, 0xff, 0x76, 0x21, 0x68, 0x91, + 0xa9, 0x67, 0x6a, 0xa3, 0x1a, 0xaa, 0x6c, 0x42 }, + { 0xf5, 0x04, 0x1c, 0x24, 0x12, 0x70, 0xeb, 0x04, + 0xc7, 0x1e, 0xc2, 0xc9, 0x5d, 0x4c, 0x38, 0xd8, + 0x03, 0xb1, 0x23, 0x7b, 0x0f, 0x29, 0xfd, 0x4d, + 0xb3, 0xeb, 0x39, 0x76, 0x69, 0xe8, 0x86, 0x99 }, + { 0x9a, 0x4c, 0xe0, 0x77, 0xc3, 0x49, 0x32, 0x2f, + 0x59, 0x5e, 0x0e, 0xe7, 0x9e, 0xd0, 0xda, 0x5f, + 0xab, 0x66, 0x75, 0x2c, 0xbf, 0xef, 0x8f, 0x87, + 0xd0, 0xe9, 0xd0, 0x72, 0x3c, 0x75, 0x30, 0xdd }, + { 0x65, 0x7b, 0x09, 0xf3, 0xd0, 0xf5, 0x2b, 0x5b, + 0x8f, 0x2f, 0x97, 0x16, 0x3a, 0x0e, 0xdf, 0x0c, + 0x04, 0xf0, 0x75, 0x40, 0x8a, 0x07, 0xbb, 0xeb, + 0x3a, 0x41, 0x01, 0xa8, 0x91, 0x99, 0x0d, 0x62 }, + { 0x1e, 0x3f, 0x7b, 0xd5, 0xa5, 0x8f, 0xa5, 0x33, + 0x34, 0x4a, 0xa8, 0xed, 0x3a, 0xc1, 0x22, 0xbb, + 0x9e, 0x70, 0xd4, 0xef, 0x50, 0xd0, 0x04, 0x53, + 0x08, 0x21, 0x94, 0x8f, 0x5f, 0xe6, 0x31, 0x5a }, + { 0x80, 0xdc, 0xcf, 0x3f, 0xd8, 0x3d, 0xfd, 0x0d, + 0x35, 0xaa, 0x28, 0x58, 0x59, 0x22, 0xab, 0x89, + 0xd5, 0x31, 0x39, 0x97, 0x67, 0x3e, 0xaf, 0x90, + 0x5c, 0xea, 0x9c, 0x0b, 0x22, 0x5c, 0x7b, 0x5f }, + { 0x8a, 0x0d, 0x0f, 0xbf, 0x63, 0x77, 0xd8, 0x3b, + 0xb0, 0x8b, 0x51, 0x4b, 0x4b, 0x1c, 0x43, 0xac, + 0xc9, 0x5d, 0x75, 0x17, 0x14, 0xf8, 0x92, 0x56, + 0x45, 0xcb, 0x6b, 0xc8, 0x56, 0xca, 0x15, 0x0a }, + { 0x9f, 0xa5, 0xb4, 0x87, 0x73, 0x8a, 0xd2, 0x84, + 0x4c, 0xc6, 0x34, 0x8a, 0x90, 0x19, 0x18, 0xf6, + 0x59, 0xa3, 0xb8, 0x9e, 0x9c, 0x0d, 0xfe, 0xea, + 0xd3, 0x0d, 0xd9, 0x4b, 0xcf, 0x42, 0xef, 0x8e }, + { 0x80, 0x83, 0x2c, 0x4a, 0x16, 0x77, 0xf5, 0xea, + 0x25, 0x60, 0xf6, 0x68, 0xe9, 0x35, 0x4d, 0xd3, + 0x69, 0x97, 0xf0, 0x37, 0x28, 0xcf, 0xa5, 0x5e, + 0x1b, 0x38, 0x33, 0x7c, 0x0c, 0x9e, 0xf8, 0x18 }, + { 0xab, 0x37, 0xdd, 0xb6, 0x83, 0x13, 0x7e, 0x74, + 0x08, 0x0d, 0x02, 0x6b, 0x59, 0x0b, 0x96, 0xae, + 0x9b, 0xb4, 0x47, 0x72, 0x2f, 0x30, 0x5a, 0x5a, + 0xc5, 0x70, 0xec, 0x1d, 0xf9, 0xb1, 0x74, 0x3c }, + { 0x3e, 0xe7, 0x35, 0xa6, 0x94, 0xc2, 0x55, 0x9b, + 0x69, 0x3a, 0xa6, 0x86, 0x29, 0x36, 0x1e, 0x15, + 0xd1, 0x22, 0x65, 0xad, 0x6a, 0x3d, 0xed, 0xf4, + 0x88, 0xb0, 0xb0, 0x0f, 0xac, 0x97, 0x54, 0xba }, + { 0xd6, 0xfc, 0xd2, 0x32, 0x19, 0xb6, 0x47, 0xe4, + 0xcb, 0xd5, 0xeb, 0x2d, 0x0a, 0xd0, 0x1e, 0xc8, + 0x83, 0x8a, 0x4b, 0x29, 0x01, 0xfc, 0x32, 0x5c, + 0xc3, 0x70, 0x19, 0x81, 0xca, 0x6c, 0x88, 0x8b }, + { 0x05, 0x20, 0xec, 0x2f, 0x5b, 0xf7, 0xa7, 0x55, + 0xda, 0xcb, 0x50, 0xc6, 0xbf, 0x23, 0x3e, 0x35, + 0x15, 0x43, 0x47, 0x63, 0xdb, 0x01, 0x39, 0xcc, + 0xd9, 0xfa, 0xef, 0xbb, 0x82, 0x07, 0x61, 0x2d }, + { 0xaf, 0xf3, 0xb7, 0x5f, 0x3f, 0x58, 0x12, 0x64, + 0xd7, 0x66, 0x16, 0x62, 0xb9, 0x2f, 0x5a, 0xd3, + 0x7c, 0x1d, 0x32, 0xbd, 0x45, 0xff, 0x81, 0xa4, + 0xed, 0x8a, 0xdc, 0x9e, 0xf3, 0x0d, 0xd9, 0x89 }, + { 0xd0, 0xdd, 0x65, 0x0b, 0xef, 0xd3, 0xba, 0x63, + 0xdc, 0x25, 0x10, 0x2c, 0x62, 0x7c, 0x92, 0x1b, + 0x9c, 0xbe, 0xb0, 0xb1, 0x30, 0x68, 0x69, 0x35, + 0xb5, 0xc9, 0x27, 0xcb, 0x7c, 0xcd, 0x5e, 0x3b }, + { 0xe1, 0x14, 0x98, 0x16, 0xb1, 0x0a, 0x85, 0x14, + 0xfb, 0x3e, 0x2c, 0xab, 0x2c, 0x08, 0xbe, 0xe9, + 0xf7, 0x3c, 0xe7, 0x62, 
0x21, 0x70, 0x12, 0x46, + 0xa5, 0x89, 0xbb, 0xb6, 0x73, 0x02, 0xd8, 0xa9 }, + { 0x7d, 0xa3, 0xf4, 0x41, 0xde, 0x90, 0x54, 0x31, + 0x7e, 0x72, 0xb5, 0xdb, 0xf9, 0x79, 0xda, 0x01, + 0xe6, 0xbc, 0xee, 0xbb, 0x84, 0x78, 0xea, 0xe6, + 0xa2, 0x28, 0x49, 0xd9, 0x02, 0x92, 0x63, 0x5c }, + { 0x12, 0x30, 0xb1, 0xfc, 0x8a, 0x7d, 0x92, 0x15, + 0xed, 0xc2, 0xd4, 0xa2, 0xde, 0xcb, 0xdd, 0x0a, + 0x6e, 0x21, 0x6c, 0x92, 0x42, 0x78, 0xc9, 0x1f, + 0xc5, 0xd1, 0x0e, 0x7d, 0x60, 0x19, 0x2d, 0x94 }, + { 0x57, 0x50, 0xd7, 0x16, 0xb4, 0x80, 0x8f, 0x75, + 0x1f, 0xeb, 0xc3, 0x88, 0x06, 0xba, 0x17, 0x0b, + 0xf6, 0xd5, 0x19, 0x9a, 0x78, 0x16, 0xbe, 0x51, + 0x4e, 0x3f, 0x93, 0x2f, 0xbe, 0x0c, 0xb8, 0x71 }, + { 0x6f, 0xc5, 0x9b, 0x2f, 0x10, 0xfe, 0xba, 0x95, + 0x4a, 0xa6, 0x82, 0x0b, 0x3c, 0xa9, 0x87, 0xee, + 0x81, 0xd5, 0xcc, 0x1d, 0xa3, 0xc6, 0x3c, 0xe8, + 0x27, 0x30, 0x1c, 0x56, 0x9d, 0xfb, 0x39, 0xce }, + { 0xc7, 0xc3, 0xfe, 0x1e, 0xeb, 0xdc, 0x7b, 0x5a, + 0x93, 0x93, 0x26, 0xe8, 0xdd, 0xb8, 0x3e, 0x8b, + 0xf2, 0xb7, 0x80, 0xb6, 0x56, 0x78, 0xcb, 0x62, + 0xf2, 0x08, 0xb0, 0x40, 0xab, 0xdd, 0x35, 0xe2 }, + { 0x0c, 0x75, 0xc1, 0xa1, 0x5c, 0xf3, 0x4a, 0x31, + 0x4e, 0xe4, 0x78, 0xf4, 0xa5, 0xce, 0x0b, 0x8a, + 0x6b, 0x36, 0x52, 0x8e, 0xf7, 0xa8, 0x20, 0x69, + 0x6c, 0x3e, 0x42, 0x46, 0xc5, 0xa1, 0x58, 0x64 }, + { 0x21, 0x6d, 0xc1, 0x2a, 0x10, 0x85, 0x69, 0xa3, + 0xc7, 0xcd, 0xde, 0x4a, 0xed, 0x43, 0xa6, 0xc3, + 0x30, 0x13, 0x9d, 0xda, 0x3c, 0xcc, 0x4a, 0x10, + 0x89, 0x05, 0xdb, 0x38, 0x61, 0x89, 0x90, 0x50 }, + { 0xa5, 0x7b, 0xe6, 0xae, 0x67, 0x56, 0xf2, 0x8b, + 0x02, 0xf5, 0x9d, 0xad, 0xf7, 0xe0, 0xd7, 0xd8, + 0x80, 0x7f, 0x10, 0xfa, 0x15, 0xce, 0xd1, 0xad, + 0x35, 0x85, 0x52, 0x1a, 0x1d, 0x99, 0x5a, 0x89 }, + { 0x81, 0x6a, 0xef, 0x87, 0x59, 0x53, 0x71, 0x6c, + 0xd7, 0xa5, 0x81, 0xf7, 0x32, 0xf5, 0x3d, 0xd4, + 0x35, 0xda, 0xb6, 0x6d, 0x09, 0xc3, 0x61, 0xd2, + 0xd6, 0x59, 0x2d, 0xe1, 0x77, 0x55, 0xd8, 0xa8 }, + { 0x9a, 0x76, 0x89, 0x32, 0x26, 0x69, 0x3b, 0x6e, + 0xa9, 0x7e, 0x6a, 0x73, 0x8f, 0x9d, 0x10, 0xfb, + 0x3d, 0x0b, 0x43, 0xae, 0x0e, 0x8b, 0x7d, 0x81, + 0x23, 0xea, 0x76, 0xce, 0x97, 0x98, 0x9c, 0x7e }, + { 0x8d, 0xae, 0xdb, 0x9a, 0x27, 0x15, 0x29, 0xdb, + 0xb7, 0xdc, 0x3b, 0x60, 0x7f, 0xe5, 0xeb, 0x2d, + 0x32, 0x11, 0x77, 0x07, 0x58, 0xdd, 0x3b, 0x0a, + 0x35, 0x93, 0xd2, 0xd7, 0x95, 0x4e, 0x2d, 0x5b }, + { 0x16, 0xdb, 0xc0, 0xaa, 0x5d, 0xd2, 0xc7, 0x74, + 0xf5, 0x05, 0x10, 0x0f, 0x73, 0x37, 0x86, 0xd8, + 0xa1, 0x75, 0xfc, 0xbb, 0xb5, 0x9c, 0x43, 0xe1, + 0xfb, 0xff, 0x3e, 0x1e, 0xaf, 0x31, 0xcb, 0x4a }, + { 0x86, 0x06, 0xcb, 0x89, 0x9c, 0x6a, 0xea, 0xf5, + 0x1b, 0x9d, 0xb0, 0xfe, 0x49, 0x24, 0xa9, 0xfd, + 0x5d, 0xab, 0xc1, 0x9f, 0x88, 0x26, 0xf2, 0xbc, + 0x1c, 0x1d, 0x7d, 0xa1, 0x4d, 0x2c, 0x2c, 0x99 }, + { 0x84, 0x79, 0x73, 0x1a, 0xed, 0xa5, 0x7b, 0xd3, + 0x7e, 0xad, 0xb5, 0x1a, 0x50, 0x7e, 0x30, 0x7f, + 0x3b, 0xd9, 0x5e, 0x69, 0xdb, 0xca, 0x94, 0xf3, + 0xbc, 0x21, 0x72, 0x60, 0x66, 0xad, 0x6d, 0xfd }, + { 0x58, 0x47, 0x3a, 0x9e, 0xa8, 0x2e, 0xfa, 0x3f, + 0x3b, 0x3d, 0x8f, 0xc8, 0x3e, 0xd8, 0x86, 0x31, + 0x27, 0xb3, 0x3a, 0xe8, 0xde, 0xae, 0x63, 0x07, + 0x20, 0x1e, 0xdb, 0x6d, 0xde, 0x61, 0xde, 0x29 }, + { 0x9a, 0x92, 0x55, 0xd5, 0x3a, 0xf1, 0x16, 0xde, + 0x8b, 0xa2, 0x7c, 0xe3, 0x5b, 0x4c, 0x7e, 0x15, + 0x64, 0x06, 0x57, 0xa0, 0xfc, 0xb8, 0x88, 0xc7, + 0x0d, 0x95, 0x43, 0x1d, 0xac, 0xd8, 0xf8, 0x30 }, + { 0x9e, 0xb0, 0x5f, 0xfb, 0xa3, 0x9f, 0xd8, 0x59, + 0x6a, 0x45, 0x49, 0x3e, 0x18, 0xd2, 0x51, 0x0b, + 0xf3, 0xef, 0x06, 0x5c, 0x51, 0xd6, 0xe1, 0x3a, + 0xbe, 0x66, 0xaa, 0x57, 0xe0, 0x5c, 0xfd, 0xb7 }, + { 0x81, 
0xdc, 0xc3, 0xa5, 0x05, 0xea, 0xce, 0x3f, + 0x87, 0x9d, 0x8f, 0x70, 0x27, 0x76, 0x77, 0x0f, + 0x9d, 0xf5, 0x0e, 0x52, 0x1d, 0x14, 0x28, 0xa8, + 0x5d, 0xaf, 0x04, 0xf9, 0xad, 0x21, 0x50, 0xe0 }, + { 0xe3, 0xe3, 0xc4, 0xaa, 0x3a, 0xcb, 0xbc, 0x85, + 0x33, 0x2a, 0xf9, 0xd5, 0x64, 0xbc, 0x24, 0x16, + 0x5e, 0x16, 0x87, 0xf6, 0xb1, 0xad, 0xcb, 0xfa, + 0xe7, 0x7a, 0x8f, 0x03, 0xc7, 0x2a, 0xc2, 0x8c }, + { 0x67, 0x46, 0xc8, 0x0b, 0x4e, 0xb5, 0x6a, 0xea, + 0x45, 0xe6, 0x4e, 0x72, 0x89, 0xbb, 0xa3, 0xed, + 0xbf, 0x45, 0xec, 0xf8, 0x20, 0x64, 0x81, 0xff, + 0x63, 0x02, 0x12, 0x29, 0x84, 0xcd, 0x52, 0x6a }, + { 0x2b, 0x62, 0x8e, 0x52, 0x76, 0x4d, 0x7d, 0x62, + 0xc0, 0x86, 0x8b, 0x21, 0x23, 0x57, 0xcd, 0xd1, + 0x2d, 0x91, 0x49, 0x82, 0x2f, 0x4e, 0x98, 0x45, + 0xd9, 0x18, 0xa0, 0x8d, 0x1a, 0xe9, 0x90, 0xc0 }, + { 0xe4, 0xbf, 0xe8, 0x0d, 0x58, 0xc9, 0x19, 0x94, + 0x61, 0x39, 0x09, 0xdc, 0x4b, 0x1a, 0x12, 0x49, + 0x68, 0x96, 0xc0, 0x04, 0xaf, 0x7b, 0x57, 0x01, + 0x48, 0x3d, 0xe4, 0x5d, 0x28, 0x23, 0xd7, 0x8e }, + { 0xeb, 0xb4, 0xba, 0x15, 0x0c, 0xef, 0x27, 0x34, + 0x34, 0x5b, 0x5d, 0x64, 0x1b, 0xbe, 0xd0, 0x3a, + 0x21, 0xea, 0xfa, 0xe9, 0x33, 0xc9, 0x9e, 0x00, + 0x92, 0x12, 0xef, 0x04, 0x57, 0x4a, 0x85, 0x30 }, + { 0x39, 0x66, 0xec, 0x73, 0xb1, 0x54, 0xac, 0xc6, + 0x97, 0xac, 0x5c, 0xf5, 0xb2, 0x4b, 0x40, 0xbd, + 0xb0, 0xdb, 0x9e, 0x39, 0x88, 0x36, 0xd7, 0x6d, + 0x4b, 0x88, 0x0e, 0x3b, 0x2a, 0xf1, 0xaa, 0x27 }, + { 0xef, 0x7e, 0x48, 0x31, 0xb3, 0xa8, 0x46, 0x36, + 0x51, 0x8d, 0x6e, 0x4b, 0xfc, 0xe6, 0x4a, 0x43, + 0xdb, 0x2a, 0x5d, 0xda, 0x9c, 0xca, 0x2b, 0x44, + 0xf3, 0x90, 0x33, 0xbd, 0xc4, 0x0d, 0x62, 0x43 }, + { 0x7a, 0xbf, 0x6a, 0xcf, 0x5c, 0x8e, 0x54, 0x9d, + 0xdb, 0xb1, 0x5a, 0xe8, 0xd8, 0xb3, 0x88, 0xc1, + 0xc1, 0x97, 0xe6, 0x98, 0x73, 0x7c, 0x97, 0x85, + 0x50, 0x1e, 0xd1, 0xf9, 0x49, 0x30, 0xb7, 0xd9 }, + { 0x88, 0x01, 0x8d, 0xed, 0x66, 0x81, 0x3f, 0x0c, + 0xa9, 0x5d, 0xef, 0x47, 0x4c, 0x63, 0x06, 0x92, + 0x01, 0x99, 0x67, 0xb9, 0xe3, 0x68, 0x88, 0xda, + 0xdd, 0x94, 0x12, 0x47, 0x19, 0xb6, 0x82, 0xf6 }, + { 0x39, 0x30, 0x87, 0x6b, 0x9f, 0xc7, 0x52, 0x90, + 0x36, 0xb0, 0x08, 0xb1, 0xb8, 0xbb, 0x99, 0x75, + 0x22, 0xa4, 0x41, 0x63, 0x5a, 0x0c, 0x25, 0xec, + 0x02, 0xfb, 0x6d, 0x90, 0x26, 0xe5, 0x5a, 0x97 }, + { 0x0a, 0x40, 0x49, 0xd5, 0x7e, 0x83, 0x3b, 0x56, + 0x95, 0xfa, 0xc9, 0x3d, 0xd1, 0xfb, 0xef, 0x31, + 0x66, 0xb4, 0x4b, 0x12, 0xad, 0x11, 0x24, 0x86, + 0x62, 0x38, 0x3a, 0xe0, 0x51, 0xe1, 0x58, 0x27 }, + { 0x81, 0xdc, 0xc0, 0x67, 0x8b, 0xb6, 0xa7, 0x65, + 0xe4, 0x8c, 0x32, 0x09, 0x65, 0x4f, 0xe9, 0x00, + 0x89, 0xce, 0x44, 0xff, 0x56, 0x18, 0x47, 0x7e, + 0x39, 0xab, 0x28, 0x64, 0x76, 0xdf, 0x05, 0x2b }, + { 0xe6, 0x9b, 0x3a, 0x36, 0xa4, 0x46, 0x19, 0x12, + 0xdc, 0x08, 0x34, 0x6b, 0x11, 0xdd, 0xcb, 0x9d, + 0xb7, 0x96, 0xf8, 0x85, 0xfd, 0x01, 0x93, 0x6e, + 0x66, 0x2f, 0xe2, 0x92, 0x97, 0xb0, 0x99, 0xa4 }, + { 0x5a, 0xc6, 0x50, 0x3b, 0x0d, 0x8d, 0xa6, 0x91, + 0x76, 0x46, 0xe6, 0xdc, 0xc8, 0x7e, 0xdc, 0x58, + 0xe9, 0x42, 0x45, 0x32, 0x4c, 0xc2, 0x04, 0xf4, + 0xdd, 0x4a, 0xf0, 0x15, 0x63, 0xac, 0xd4, 0x27 }, + { 0xdf, 0x6d, 0xda, 0x21, 0x35, 0x9a, 0x30, 0xbc, + 0x27, 0x17, 0x80, 0x97, 0x1c, 0x1a, 0xbd, 0x56, + 0xa6, 0xef, 0x16, 0x7e, 0x48, 0x08, 0x87, 0x88, + 0x8e, 0x73, 0xa8, 0x6d, 0x3b, 0xf6, 0x05, 0xe9 }, + { 0xe8, 0xe6, 0xe4, 0x70, 0x71, 0xe7, 0xb7, 0xdf, + 0x25, 0x80, 0xf2, 0x25, 0xcf, 0xbb, 0xed, 0xf8, + 0x4c, 0xe6, 0x77, 0x46, 0x62, 0x66, 0x28, 0xd3, + 0x30, 0x97, 0xe4, 0xb7, 0xdc, 0x57, 0x11, 0x07 }, + { 0x53, 0xe4, 0x0e, 0xad, 0x62, 0x05, 0x1e, 0x19, + 0xcb, 0x9b, 0xa8, 0x13, 0x3e, 0x3e, 0x5c, 
0x1c, + 0xe0, 0x0d, 0xdc, 0xad, 0x8a, 0xcf, 0x34, 0x2a, + 0x22, 0x43, 0x60, 0xb0, 0xac, 0xc1, 0x47, 0x77 }, + { 0x9c, 0xcd, 0x53, 0xfe, 0x80, 0xbe, 0x78, 0x6a, + 0xa9, 0x84, 0x63, 0x84, 0x62, 0xfb, 0x28, 0xaf, + 0xdf, 0x12, 0x2b, 0x34, 0xd7, 0x8f, 0x46, 0x87, + 0xec, 0x63, 0x2b, 0xb1, 0x9d, 0xe2, 0x37, 0x1a }, + { 0xcb, 0xd4, 0x80, 0x52, 0xc4, 0x8d, 0x78, 0x84, + 0x66, 0xa3, 0xe8, 0x11, 0x8c, 0x56, 0xc9, 0x7f, + 0xe1, 0x46, 0xe5, 0x54, 0x6f, 0xaa, 0xf9, 0x3e, + 0x2b, 0xc3, 0xc4, 0x7e, 0x45, 0x93, 0x97, 0x53 }, + { 0x25, 0x68, 0x83, 0xb1, 0x4e, 0x2a, 0xf4, 0x4d, + 0xad, 0xb2, 0x8e, 0x1b, 0x34, 0xb2, 0xac, 0x0f, + 0x0f, 0x4c, 0x91, 0xc3, 0x4e, 0xc9, 0x16, 0x9e, + 0x29, 0x03, 0x61, 0x58, 0xac, 0xaa, 0x95, 0xb9 }, + { 0x44, 0x71, 0xb9, 0x1a, 0xb4, 0x2d, 0xb7, 0xc4, + 0xdd, 0x84, 0x90, 0xab, 0x95, 0xa2, 0xee, 0x8d, + 0x04, 0xe3, 0xef, 0x5c, 0x3d, 0x6f, 0xc7, 0x1a, + 0xc7, 0x4b, 0x2b, 0x26, 0x91, 0x4d, 0x16, 0x41 }, + { 0xa5, 0xeb, 0x08, 0x03, 0x8f, 0x8f, 0x11, 0x55, + 0xed, 0x86, 0xe6, 0x31, 0x90, 0x6f, 0xc1, 0x30, + 0x95, 0xf6, 0xbb, 0xa4, 0x1d, 0xe5, 0xd4, 0xe7, + 0x95, 0x75, 0x8e, 0xc8, 0xc8, 0xdf, 0x8a, 0xf1 }, + { 0xdc, 0x1d, 0xb6, 0x4e, 0xd8, 0xb4, 0x8a, 0x91, + 0x0e, 0x06, 0x0a, 0x6b, 0x86, 0x63, 0x74, 0xc5, + 0x78, 0x78, 0x4e, 0x9a, 0xc4, 0x9a, 0xb2, 0x77, + 0x40, 0x92, 0xac, 0x71, 0x50, 0x19, 0x34, 0xac }, + { 0x28, 0x54, 0x13, 0xb2, 0xf2, 0xee, 0x87, 0x3d, + 0x34, 0x31, 0x9e, 0xe0, 0xbb, 0xfb, 0xb9, 0x0f, + 0x32, 0xda, 0x43, 0x4c, 0xc8, 0x7e, 0x3d, 0xb5, + 0xed, 0x12, 0x1b, 0xb3, 0x98, 0xed, 0x96, 0x4b }, + { 0x02, 0x16, 0xe0, 0xf8, 0x1f, 0x75, 0x0f, 0x26, + 0xf1, 0x99, 0x8b, 0xc3, 0x93, 0x4e, 0x3e, 0x12, + 0x4c, 0x99, 0x45, 0xe6, 0x85, 0xa6, 0x0b, 0x25, + 0xe8, 0xfb, 0xd9, 0x62, 0x5a, 0xb6, 0xb5, 0x99 }, + { 0x38, 0xc4, 0x10, 0xf5, 0xb9, 0xd4, 0x07, 0x20, + 0x50, 0x75, 0x5b, 0x31, 0xdc, 0xa8, 0x9f, 0xd5, + 0x39, 0x5c, 0x67, 0x85, 0xee, 0xb3, 0xd7, 0x90, + 0xf3, 0x20, 0xff, 0x94, 0x1c, 0x5a, 0x93, 0xbf }, + { 0xf1, 0x84, 0x17, 0xb3, 0x9d, 0x61, 0x7a, 0xb1, + 0xc1, 0x8f, 0xdf, 0x91, 0xeb, 0xd0, 0xfc, 0x6d, + 0x55, 0x16, 0xbb, 0x34, 0xcf, 0x39, 0x36, 0x40, + 0x37, 0xbc, 0xe8, 0x1f, 0xa0, 0x4c, 0xec, 0xb1 }, + { 0x1f, 0xa8, 0x77, 0xde, 0x67, 0x25, 0x9d, 0x19, + 0x86, 0x3a, 0x2a, 0x34, 0xbc, 0xc6, 0x96, 0x2a, + 0x2b, 0x25, 0xfc, 0xbf, 0x5c, 0xbe, 0xcd, 0x7e, + 0xde, 0x8f, 0x1f, 0xa3, 0x66, 0x88, 0xa7, 0x96 }, + { 0x5b, 0xd1, 0x69, 0xe6, 0x7c, 0x82, 0xc2, 0xc2, + 0xe9, 0x8e, 0xf7, 0x00, 0x8b, 0xdf, 0x26, 0x1f, + 0x2d, 0xdf, 0x30, 0xb1, 0xc0, 0x0f, 0x9e, 0x7f, + 0x27, 0x5b, 0xb3, 0xe8, 0xa2, 0x8d, 0xc9, 0xa2 }, + { 0xc8, 0x0a, 0xbe, 0xeb, 0xb6, 0x69, 0xad, 0x5d, + 0xee, 0xb5, 0xf5, 0xec, 0x8e, 0xa6, 0xb7, 0xa0, + 0x5d, 0xdf, 0x7d, 0x31, 0xec, 0x4c, 0x0a, 0x2e, + 0xe2, 0x0b, 0x0b, 0x98, 0xca, 0xec, 0x67, 0x46 }, + { 0xe7, 0x6d, 0x3f, 0xbd, 0xa5, 0xba, 0x37, 0x4e, + 0x6b, 0xf8, 0xe5, 0x0f, 0xad, 0xc3, 0xbb, 0xb9, + 0xba, 0x5c, 0x20, 0x6e, 0xbd, 0xec, 0x89, 0xa3, + 0xa5, 0x4c, 0xf3, 0xdd, 0x84, 0xa0, 0x70, 0x16 }, + { 0x7b, 0xba, 0x9d, 0xc5, 0xb5, 0xdb, 0x20, 0x71, + 0xd1, 0x77, 0x52, 0xb1, 0x04, 0x4c, 0x1e, 0xce, + 0xd9, 0x6a, 0xaf, 0x2d, 0xd4, 0x6e, 0x9b, 0x43, + 0x37, 0x50, 0xe8, 0xea, 0x0d, 0xcc, 0x18, 0x70 }, + { 0xf2, 0x9b, 0x1b, 0x1a, 0xb9, 0xba, 0xb1, 0x63, + 0x01, 0x8e, 0xe3, 0xda, 0x15, 0x23, 0x2c, 0xca, + 0x78, 0xec, 0x52, 0xdb, 0xc3, 0x4e, 0xda, 0x5b, + 0x82, 0x2e, 0xc1, 0xd8, 0x0f, 0xc2, 0x1b, 0xd0 }, + { 0x9e, 0xe3, 0xe3, 0xe7, 0xe9, 0x00, 0xf1, 0xe1, + 0x1d, 0x30, 0x8c, 0x4b, 0x2b, 0x30, 0x76, 0xd2, + 0x72, 0xcf, 0x70, 0x12, 0x4f, 0x9f, 0x51, 0xe1, + 0xda, 0x60, 0xf3, 0x78, 
0x46, 0xcd, 0xd2, 0xf4 }, + { 0x70, 0xea, 0x3b, 0x01, 0x76, 0x92, 0x7d, 0x90, + 0x96, 0xa1, 0x85, 0x08, 0xcd, 0x12, 0x3a, 0x29, + 0x03, 0x25, 0x92, 0x0a, 0x9d, 0x00, 0xa8, 0x9b, + 0x5d, 0xe0, 0x42, 0x73, 0xfb, 0xc7, 0x6b, 0x85 }, + { 0x67, 0xde, 0x25, 0xc0, 0x2a, 0x4a, 0xab, 0xa2, + 0x3b, 0xdc, 0x97, 0x3c, 0x8b, 0xb0, 0xb5, 0x79, + 0x6d, 0x47, 0xcc, 0x06, 0x59, 0xd4, 0x3d, 0xff, + 0x1f, 0x97, 0xde, 0x17, 0x49, 0x63, 0xb6, 0x8e }, + { 0xb2, 0x16, 0x8e, 0x4e, 0x0f, 0x18, 0xb0, 0xe6, + 0x41, 0x00, 0xb5, 0x17, 0xed, 0x95, 0x25, 0x7d, + 0x73, 0xf0, 0x62, 0x0d, 0xf8, 0x85, 0xc1, 0x3d, + 0x2e, 0xcf, 0x79, 0x36, 0x7b, 0x38, 0x4c, 0xee }, + { 0x2e, 0x7d, 0xec, 0x24, 0x28, 0x85, 0x3b, 0x2c, + 0x71, 0x76, 0x07, 0x45, 0x54, 0x1f, 0x7a, 0xfe, + 0x98, 0x25, 0xb5, 0xdd, 0x77, 0xdf, 0x06, 0x51, + 0x1d, 0x84, 0x41, 0xa9, 0x4b, 0xac, 0xc9, 0x27 }, + { 0xca, 0x9f, 0xfa, 0xc4, 0xc4, 0x3f, 0x0b, 0x48, + 0x46, 0x1d, 0xc5, 0xc2, 0x63, 0xbe, 0xa3, 0xf6, + 0xf0, 0x06, 0x11, 0xce, 0xac, 0xab, 0xf6, 0xf8, + 0x95, 0xba, 0x2b, 0x01, 0x01, 0xdb, 0xb6, 0x8d }, + { 0x74, 0x10, 0xd4, 0x2d, 0x8f, 0xd1, 0xd5, 0xe9, + 0xd2, 0xf5, 0x81, 0x5c, 0xb9, 0x34, 0x17, 0x99, + 0x88, 0x28, 0xef, 0x3c, 0x42, 0x30, 0xbf, 0xbd, + 0x41, 0x2d, 0xf0, 0xa4, 0xa7, 0xa2, 0x50, 0x7a }, + { 0x50, 0x10, 0xf6, 0x84, 0x51, 0x6d, 0xcc, 0xd0, + 0xb6, 0xee, 0x08, 0x52, 0xc2, 0x51, 0x2b, 0x4d, + 0xc0, 0x06, 0x6c, 0xf0, 0xd5, 0x6f, 0x35, 0x30, + 0x29, 0x78, 0xdb, 0x8a, 0xe3, 0x2c, 0x6a, 0x81 }, + { 0xac, 0xaa, 0xb5, 0x85, 0xf7, 0xb7, 0x9b, 0x71, + 0x99, 0x35, 0xce, 0xb8, 0x95, 0x23, 0xdd, 0xc5, + 0x48, 0x27, 0xf7, 0x5c, 0x56, 0x88, 0x38, 0x56, + 0x15, 0x4a, 0x56, 0xcd, 0xcd, 0x5e, 0xe9, 0x88 }, + { 0x66, 0x6d, 0xe5, 0xd1, 0x44, 0x0f, 0xee, 0x73, + 0x31, 0xaa, 0xf0, 0x12, 0x3a, 0x62, 0xef, 0x2d, + 0x8b, 0xa5, 0x74, 0x53, 0xa0, 0x76, 0x96, 0x35, + 0xac, 0x6c, 0xd0, 0x1e, 0x63, 0x3f, 0x77, 0x12 }, + { 0xa6, 0xf9, 0x86, 0x58, 0xf6, 0xea, 0xba, 0xf9, + 0x02, 0xd8, 0xb3, 0x87, 0x1a, 0x4b, 0x10, 0x1d, + 0x16, 0x19, 0x6e, 0x8a, 0x4b, 0x24, 0x1e, 0x15, + 0x58, 0xfe, 0x29, 0x96, 0x6e, 0x10, 0x3e, 0x8d }, + { 0x89, 0x15, 0x46, 0xa8, 0xb2, 0x9f, 0x30, 0x47, + 0xdd, 0xcf, 0xe5, 0xb0, 0x0e, 0x45, 0xfd, 0x55, + 0x75, 0x63, 0x73, 0x10, 0x5e, 0xa8, 0x63, 0x7d, + 0xfc, 0xff, 0x54, 0x7b, 0x6e, 0xa9, 0x53, 0x5f }, + { 0x18, 0xdf, 0xbc, 0x1a, 0xc5, 0xd2, 0x5b, 0x07, + 0x61, 0x13, 0x7d, 0xbd, 0x22, 0xc1, 0x7c, 0x82, + 0x9d, 0x0f, 0x0e, 0xf1, 0xd8, 0x23, 0x44, 0xe9, + 0xc8, 0x9c, 0x28, 0x66, 0x94, 0xda, 0x24, 0xe8 }, + { 0xb5, 0x4b, 0x9b, 0x67, 0xf8, 0xfe, 0xd5, 0x4b, + 0xbf, 0x5a, 0x26, 0x66, 0xdb, 0xdf, 0x4b, 0x23, + 0xcf, 0xf1, 0xd1, 0xb6, 0xf4, 0xaf, 0xc9, 0x85, + 0xb2, 0xe6, 0xd3, 0x30, 0x5a, 0x9f, 0xf8, 0x0f }, + { 0x7d, 0xb4, 0x42, 0xe1, 0x32, 0xba, 0x59, 0xbc, + 0x12, 0x89, 0xaa, 0x98, 0xb0, 0xd3, 0xe8, 0x06, + 0x00, 0x4f, 0x8e, 0xc1, 0x28, 0x11, 0xaf, 0x1e, + 0x2e, 0x33, 0xc6, 0x9b, 0xfd, 0xe7, 0x29, 0xe1 }, + { 0x25, 0x0f, 0x37, 0xcd, 0xc1, 0x5e, 0x81, 0x7d, + 0x2f, 0x16, 0x0d, 0x99, 0x56, 0xc7, 0x1f, 0xe3, + 0xeb, 0x5d, 0xb7, 0x45, 0x56, 0xe4, 0xad, 0xf9, + 0xa4, 0xff, 0xaf, 0xba, 0x74, 0x01, 0x03, 0x96 }, + { 0x4a, 0xb8, 0xa3, 0xdd, 0x1d, 0xdf, 0x8a, 0xd4, + 0x3d, 0xab, 0x13, 0xa2, 0x7f, 0x66, 0xa6, 0x54, + 0x4f, 0x29, 0x05, 0x97, 0xfa, 0x96, 0x04, 0x0e, + 0x0e, 0x1d, 0xb9, 0x26, 0x3a, 0xa4, 0x79, 0xf8 }, + { 0xee, 0x61, 0x72, 0x7a, 0x07, 0x66, 0xdf, 0x93, + 0x9c, 0xcd, 0xc8, 0x60, 0x33, 0x40, 0x44, 0xc7, + 0x9a, 0x3c, 0x9b, 0x15, 0x62, 0x00, 0xbc, 0x3a, + 0xa3, 0x29, 0x73, 0x48, 0x3d, 0x83, 0x41, 0xae }, + { 0x3f, 0x68, 0xc7, 0xec, 0x63, 0xac, 0x11, 0xeb, + 0xb9, 
0x8f, 0x94, 0xb3, 0x39, 0xb0, 0x5c, 0x10, + 0x49, 0x84, 0xfd, 0xa5, 0x01, 0x03, 0x06, 0x01, + 0x44, 0xe5, 0xa2, 0xbf, 0xcc, 0xc9, 0xda, 0x95 }, + { 0x05, 0x6f, 0x29, 0x81, 0x6b, 0x8a, 0xf8, 0xf5, + 0x66, 0x82, 0xbc, 0x4d, 0x7c, 0xf0, 0x94, 0x11, + 0x1d, 0xa7, 0x73, 0x3e, 0x72, 0x6c, 0xd1, 0x3d, + 0x6b, 0x3e, 0x8e, 0xa0, 0x3e, 0x92, 0xa0, 0xd5 }, + { 0xf5, 0xec, 0x43, 0xa2, 0x8a, 0xcb, 0xef, 0xf1, + 0xf3, 0x31, 0x8a, 0x5b, 0xca, 0xc7, 0xc6, 0x6d, + 0xdb, 0x52, 0x30, 0xb7, 0x9d, 0xb2, 0xd1, 0x05, + 0xbc, 0xbe, 0x15, 0xf3, 0xc1, 0x14, 0x8d, 0x69 }, + { 0x2a, 0x69, 0x60, 0xad, 0x1d, 0x8d, 0xd5, 0x47, + 0x55, 0x5c, 0xfb, 0xd5, 0xe4, 0x60, 0x0f, 0x1e, + 0xaa, 0x1c, 0x8e, 0xda, 0x34, 0xde, 0x03, 0x74, + 0xec, 0x4a, 0x26, 0xea, 0xaa, 0xa3, 0x3b, 0x4e }, + { 0xdc, 0xc1, 0xea, 0x7b, 0xaa, 0xb9, 0x33, 0x84, + 0xf7, 0x6b, 0x79, 0x68, 0x66, 0x19, 0x97, 0x54, + 0x74, 0x2f, 0x7b, 0x96, 0xd6, 0xb4, 0xc1, 0x20, + 0x16, 0x5c, 0x04, 0xa6, 0xc4, 0xf5, 0xce, 0x10 }, + { 0x13, 0xd5, 0xdf, 0x17, 0x92, 0x21, 0x37, 0x9c, + 0x6a, 0x78, 0xc0, 0x7c, 0x79, 0x3f, 0xf5, 0x34, + 0x87, 0xca, 0xe6, 0xbf, 0x9f, 0xe8, 0x82, 0x54, + 0x1a, 0xb0, 0xe7, 0x35, 0xe3, 0xea, 0xda, 0x3b }, + { 0x8c, 0x59, 0xe4, 0x40, 0x76, 0x41, 0xa0, 0x1e, + 0x8f, 0xf9, 0x1f, 0x99, 0x80, 0xdc, 0x23, 0x6f, + 0x4e, 0xcd, 0x6f, 0xcf, 0x52, 0x58, 0x9a, 0x09, + 0x9a, 0x96, 0x16, 0x33, 0x96, 0x77, 0x14, 0xe1 }, + { 0x83, 0x3b, 0x1a, 0xc6, 0xa2, 0x51, 0xfd, 0x08, + 0xfd, 0x6d, 0x90, 0x8f, 0xea, 0x2a, 0x4e, 0xe1, + 0xe0, 0x40, 0xbc, 0xa9, 0x3f, 0xc1, 0xa3, 0x8e, + 0xc3, 0x82, 0x0e, 0x0c, 0x10, 0xbd, 0x82, 0xea }, + { 0xa2, 0x44, 0xf9, 0x27, 0xf3, 0xb4, 0x0b, 0x8f, + 0x6c, 0x39, 0x15, 0x70, 0xc7, 0x65, 0x41, 0x8f, + 0x2f, 0x6e, 0x70, 0x8e, 0xac, 0x90, 0x06, 0xc5, + 0x1a, 0x7f, 0xef, 0xf4, 0xaf, 0x3b, 0x2b, 0x9e }, + { 0x3d, 0x99, 0xed, 0x95, 0x50, 0xcf, 0x11, 0x96, + 0xe6, 0xc4, 0xd2, 0x0c, 0x25, 0x96, 0x20, 0xf8, + 0x58, 0xc3, 0xd7, 0x03, 0x37, 0x4c, 0x12, 0x8c, + 0xe7, 0xb5, 0x90, 0x31, 0x0c, 0x83, 0x04, 0x6d }, + { 0x2b, 0x35, 0xc4, 0x7d, 0x7b, 0x87, 0x76, 0x1f, + 0x0a, 0xe4, 0x3a, 0xc5, 0x6a, 0xc2, 0x7b, 0x9f, + 0x25, 0x83, 0x03, 0x67, 0xb5, 0x95, 0xbe, 0x8c, + 0x24, 0x0e, 0x94, 0x60, 0x0c, 0x6e, 0x33, 0x12 }, + { 0x5d, 0x11, 0xed, 0x37, 0xd2, 0x4d, 0xc7, 0x67, + 0x30, 0x5c, 0xb7, 0xe1, 0x46, 0x7d, 0x87, 0xc0, + 0x65, 0xac, 0x4b, 0xc8, 0xa4, 0x26, 0xde, 0x38, + 0x99, 0x1f, 0xf5, 0x9a, 0xa8, 0x73, 0x5d, 0x02 }, + { 0xb8, 0x36, 0x47, 0x8e, 0x1c, 0xa0, 0x64, 0x0d, + 0xce, 0x6f, 0xd9, 0x10, 0xa5, 0x09, 0x62, 0x72, + 0xc8, 0x33, 0x09, 0x90, 0xcd, 0x97, 0x86, 0x4a, + 0xc2, 0xbf, 0x14, 0xef, 0x6b, 0x23, 0x91, 0x4a }, + { 0x91, 0x00, 0xf9, 0x46, 0xd6, 0xcc, 0xde, 0x3a, + 0x59, 0x7f, 0x90, 0xd3, 0x9f, 0xc1, 0x21, 0x5b, + 0xad, 0xdc, 0x74, 0x13, 0x64, 0x3d, 0x85, 0xc2, + 0x1c, 0x3e, 0xee, 0x5d, 0x2d, 0xd3, 0x28, 0x94 }, + { 0xda, 0x70, 0xee, 0xdd, 0x23, 0xe6, 0x63, 0xaa, + 0x1a, 0x74, 0xb9, 0x76, 0x69, 0x35, 0xb4, 0x79, + 0x22, 0x2a, 0x72, 0xaf, 0xba, 0x5c, 0x79, 0x51, + 0x58, 0xda, 0xd4, 0x1a, 0x3b, 0xd7, 0x7e, 0x40 }, + { 0xf0, 0x67, 0xed, 0x6a, 0x0d, 0xbd, 0x43, 0xaa, + 0x0a, 0x92, 0x54, 0xe6, 0x9f, 0xd6, 0x6b, 0xdd, + 0x8a, 0xcb, 0x87, 0xde, 0x93, 0x6c, 0x25, 0x8c, + 0xfb, 0x02, 0x28, 0x5f, 0x2c, 0x11, 0xfa, 0x79 }, + { 0x71, 0x5c, 0x99, 0xc7, 0xd5, 0x75, 0x80, 0xcf, + 0x97, 0x53, 0xb4, 0xc1, 0xd7, 0x95, 0xe4, 0x5a, + 0x83, 0xfb, 0xb2, 0x28, 0xc0, 0xd3, 0x6f, 0xbe, + 0x20, 0xfa, 0xf3, 0x9b, 0xdd, 0x6d, 0x4e, 0x85 }, + { 0xe4, 0x57, 0xd6, 0xad, 0x1e, 0x67, 0xcb, 0x9b, + 0xbd, 0x17, 0xcb, 0xd6, 0x98, 0xfa, 0x6d, 0x7d, + 0xae, 0x0c, 0x9b, 0x7a, 0xd6, 0xcb, 0xd6, 
0x53, + 0x96, 0x34, 0xe3, 0x2a, 0x71, 0x9c, 0x84, 0x92 }, + { 0xec, 0xe3, 0xea, 0x81, 0x03, 0xe0, 0x24, 0x83, + 0xc6, 0x4a, 0x70, 0xa4, 0xbd, 0xce, 0xe8, 0xce, + 0xb6, 0x27, 0x8f, 0x25, 0x33, 0xf3, 0xf4, 0x8d, + 0xbe, 0xed, 0xfb, 0xa9, 0x45, 0x31, 0xd4, 0xae }, + { 0x38, 0x8a, 0xa5, 0xd3, 0x66, 0x7a, 0x97, 0xc6, + 0x8d, 0x3d, 0x56, 0xf8, 0xf3, 0xee, 0x8d, 0x3d, + 0x36, 0x09, 0x1f, 0x17, 0xfe, 0x5d, 0x1b, 0x0d, + 0x5d, 0x84, 0xc9, 0x3b, 0x2f, 0xfe, 0x40, 0xbd }, + { 0x8b, 0x6b, 0x31, 0xb9, 0xad, 0x7c, 0x3d, 0x5c, + 0xd8, 0x4b, 0xf9, 0x89, 0x47, 0xb9, 0xcd, 0xb5, + 0x9d, 0xf8, 0xa2, 0x5f, 0xf7, 0x38, 0x10, 0x10, + 0x13, 0xbe, 0x4f, 0xd6, 0x5e, 0x1d, 0xd1, 0xa3 }, + { 0x06, 0x62, 0x91, 0xf6, 0xbb, 0xd2, 0x5f, 0x3c, + 0x85, 0x3d, 0xb7, 0xd8, 0xb9, 0x5c, 0x9a, 0x1c, + 0xfb, 0x9b, 0xf1, 0xc1, 0xc9, 0x9f, 0xb9, 0x5a, + 0x9b, 0x78, 0x69, 0xd9, 0x0f, 0x1c, 0x29, 0x03 }, + { 0xa7, 0x07, 0xef, 0xbc, 0xcd, 0xce, 0xed, 0x42, + 0x96, 0x7a, 0x66, 0xf5, 0x53, 0x9b, 0x93, 0xed, + 0x75, 0x60, 0xd4, 0x67, 0x30, 0x40, 0x16, 0xc4, + 0x78, 0x0d, 0x77, 0x55, 0xa5, 0x65, 0xd4, 0xc4 }, + { 0x38, 0xc5, 0x3d, 0xfb, 0x70, 0xbe, 0x7e, 0x79, + 0x2b, 0x07, 0xa6, 0xa3, 0x5b, 0x8a, 0x6a, 0x0a, + 0xba, 0x02, 0xc5, 0xc5, 0xf3, 0x8b, 0xaf, 0x5c, + 0x82, 0x3f, 0xdf, 0xd9, 0xe4, 0x2d, 0x65, 0x7e }, + { 0xf2, 0x91, 0x13, 0x86, 0x50, 0x1d, 0x9a, 0xb9, + 0xd7, 0x20, 0xcf, 0x8a, 0xd1, 0x05, 0x03, 0xd5, + 0x63, 0x4b, 0xf4, 0xb7, 0xd1, 0x2b, 0x56, 0xdf, + 0xb7, 0x4f, 0xec, 0xc6, 0xe4, 0x09, 0x3f, 0x68 }, + { 0xc6, 0xf2, 0xbd, 0xd5, 0x2b, 0x81, 0xe6, 0xe4, + 0xf6, 0x59, 0x5a, 0xbd, 0x4d, 0x7f, 0xb3, 0x1f, + 0x65, 0x11, 0x69, 0xd0, 0x0f, 0xf3, 0x26, 0x92, + 0x6b, 0x34, 0x94, 0x7b, 0x28, 0xa8, 0x39, 0x59 }, + { 0x29, 0x3d, 0x94, 0xb1, 0x8c, 0x98, 0xbb, 0x32, + 0x23, 0x36, 0x6b, 0x8c, 0xe7, 0x4c, 0x28, 0xfb, + 0xdf, 0x28, 0xe1, 0xf8, 0x4a, 0x33, 0x50, 0xb0, + 0xeb, 0x2d, 0x18, 0x04, 0xa5, 0x77, 0x57, 0x9b }, + { 0x2c, 0x2f, 0xa5, 0xc0, 0xb5, 0x15, 0x33, 0x16, + 0x5b, 0xc3, 0x75, 0xc2, 0x2e, 0x27, 0x81, 0x76, + 0x82, 0x70, 0xa3, 0x83, 0x98, 0x5d, 0x13, 0xbd, + 0x6b, 0x67, 0xb6, 0xfd, 0x67, 0xf8, 0x89, 0xeb }, + { 0xca, 0xa0, 0x9b, 0x82, 0xb7, 0x25, 0x62, 0xe4, + 0x3f, 0x4b, 0x22, 0x75, 0xc0, 0x91, 0x91, 0x8e, + 0x62, 0x4d, 0x91, 0x16, 0x61, 0xcc, 0x81, 0x1b, + 0xb5, 0xfa, 0xec, 0x51, 0xf6, 0x08, 0x8e, 0xf7 }, + { 0x24, 0x76, 0x1e, 0x45, 0xe6, 0x74, 0x39, 0x53, + 0x79, 0xfb, 0x17, 0x72, 0x9c, 0x78, 0xcb, 0x93, + 0x9e, 0x6f, 0x74, 0xc5, 0xdf, 0xfb, 0x9c, 0x96, + 0x1f, 0x49, 0x59, 0x82, 0xc3, 0xed, 0x1f, 0xe3 }, + { 0x55, 0xb7, 0x0a, 0x82, 0x13, 0x1e, 0xc9, 0x48, + 0x88, 0xd7, 0xab, 0x54, 0xa7, 0xc5, 0x15, 0x25, + 0x5c, 0x39, 0x38, 0xbb, 0x10, 0xbc, 0x78, 0x4d, + 0xc9, 0xb6, 0x7f, 0x07, 0x6e, 0x34, 0x1a, 0x73 }, + { 0x6a, 0xb9, 0x05, 0x7b, 0x97, 0x7e, 0xbc, 0x3c, + 0xa4, 0xd4, 0xce, 0x74, 0x50, 0x6c, 0x25, 0xcc, + 0xcd, 0xc5, 0x66, 0x49, 0x7c, 0x45, 0x0b, 0x54, + 0x15, 0xa3, 0x94, 0x86, 0xf8, 0x65, 0x7a, 0x03 }, + { 0x24, 0x06, 0x6d, 0xee, 0xe0, 0xec, 0xee, 0x15, + 0xa4, 0x5f, 0x0a, 0x32, 0x6d, 0x0f, 0x8d, 0xbc, + 0x79, 0x76, 0x1e, 0xbb, 0x93, 0xcf, 0x8c, 0x03, + 0x77, 0xaf, 0x44, 0x09, 0x78, 0xfc, 0xf9, 0x94 }, + { 0x20, 0x00, 0x0d, 0x3f, 0x66, 0xba, 0x76, 0x86, + 0x0d, 0x5a, 0x95, 0x06, 0x88, 0xb9, 0xaa, 0x0d, + 0x76, 0xcf, 0xea, 0x59, 0xb0, 0x05, 0xd8, 0x59, + 0x91, 0x4b, 0x1a, 0x46, 0x65, 0x3a, 0x93, 0x9b }, + { 0xb9, 0x2d, 0xaa, 0x79, 0x60, 0x3e, 0x3b, 0xdb, + 0xc3, 0xbf, 0xe0, 0xf4, 0x19, 0xe4, 0x09, 0xb2, + 0xea, 0x10, 0xdc, 0x43, 0x5b, 0xee, 0xfe, 0x29, + 0x59, 0xda, 0x16, 0x89, 0x5d, 0x5d, 0xca, 0x1c }, + { 0xe9, 0x47, 0x94, 0x87, 
0x05, 0xb2, 0x06, 0xd5, + 0x72, 0xb0, 0xe8, 0xf6, 0x2f, 0x66, 0xa6, 0x55, + 0x1c, 0xbd, 0x6b, 0xc3, 0x05, 0xd2, 0x6c, 0xe7, + 0x53, 0x9a, 0x12, 0xf9, 0xaa, 0xdf, 0x75, 0x71 }, + { 0x3d, 0x67, 0xc1, 0xb3, 0xf9, 0xb2, 0x39, 0x10, + 0xe3, 0xd3, 0x5e, 0x6b, 0x0f, 0x2c, 0xcf, 0x44, + 0xa0, 0xb5, 0x40, 0xa4, 0x5c, 0x18, 0xba, 0x3c, + 0x36, 0x26, 0x4d, 0xd4, 0x8e, 0x96, 0xaf, 0x6a }, + { 0xc7, 0x55, 0x8b, 0xab, 0xda, 0x04, 0xbc, 0xcb, + 0x76, 0x4d, 0x0b, 0xbf, 0x33, 0x58, 0x42, 0x51, + 0x41, 0x90, 0x2d, 0x22, 0x39, 0x1d, 0x9f, 0x8c, + 0x59, 0x15, 0x9f, 0xec, 0x9e, 0x49, 0xb1, 0x51 }, + { 0x0b, 0x73, 0x2b, 0xb0, 0x35, 0x67, 0x5a, 0x50, + 0xff, 0x58, 0xf2, 0xc2, 0x42, 0xe4, 0x71, 0x0a, + 0xec, 0xe6, 0x46, 0x70, 0x07, 0x9c, 0x13, 0x04, + 0x4c, 0x79, 0xc9, 0xb7, 0x49, 0x1f, 0x70, 0x00 }, + { 0xd1, 0x20, 0xb5, 0xef, 0x6d, 0x57, 0xeb, 0xf0, + 0x6e, 0xaf, 0x96, 0xbc, 0x93, 0x3c, 0x96, 0x7b, + 0x16, 0xcb, 0xe6, 0xe2, 0xbf, 0x00, 0x74, 0x1c, + 0x30, 0xaa, 0x1c, 0x54, 0xba, 0x64, 0x80, 0x1f }, + { 0x58, 0xd2, 0x12, 0xad, 0x6f, 0x58, 0xae, 0xf0, + 0xf8, 0x01, 0x16, 0xb4, 0x41, 0xe5, 0x7f, 0x61, + 0x95, 0xbf, 0xef, 0x26, 0xb6, 0x14, 0x63, 0xed, + 0xec, 0x11, 0x83, 0xcd, 0xb0, 0x4f, 0xe7, 0x6d }, + { 0xb8, 0x83, 0x6f, 0x51, 0xd1, 0xe2, 0x9b, 0xdf, + 0xdb, 0xa3, 0x25, 0x56, 0x53, 0x60, 0x26, 0x8b, + 0x8f, 0xad, 0x62, 0x74, 0x73, 0xed, 0xec, 0xef, + 0x7e, 0xae, 0xfe, 0xe8, 0x37, 0xc7, 0x40, 0x03 }, + { 0xc5, 0x47, 0xa3, 0xc1, 0x24, 0xae, 0x56, 0x85, + 0xff, 0xa7, 0xb8, 0xed, 0xaf, 0x96, 0xec, 0x86, + 0xf8, 0xb2, 0xd0, 0xd5, 0x0c, 0xee, 0x8b, 0xe3, + 0xb1, 0xf0, 0xc7, 0x67, 0x63, 0x06, 0x9d, 0x9c }, + { 0x5d, 0x16, 0x8b, 0x76, 0x9a, 0x2f, 0x67, 0x85, + 0x3d, 0x62, 0x95, 0xf7, 0x56, 0x8b, 0xe4, 0x0b, + 0xb7, 0xa1, 0x6b, 0x8d, 0x65, 0xba, 0x87, 0x63, + 0x5d, 0x19, 0x78, 0xd2, 0xab, 0x11, 0xba, 0x2a }, + { 0xa2, 0xf6, 0x75, 0xdc, 0x73, 0x02, 0x63, 0x8c, + 0xb6, 0x02, 0x01, 0x06, 0x4c, 0xa5, 0x50, 0x77, + 0x71, 0x4d, 0x71, 0xfe, 0x09, 0x6a, 0x31, 0x5f, + 0x2f, 0xe7, 0x40, 0x12, 0x77, 0xca, 0xa5, 0xaf }, + { 0xc8, 0xaa, 0xb5, 0xcd, 0x01, 0x60, 0xae, 0x78, + 0xcd, 0x2e, 0x8a, 0xc5, 0xfb, 0x0e, 0x09, 0x3c, + 0xdb, 0x5c, 0x4b, 0x60, 0x52, 0xa0, 0xa9, 0x7b, + 0xb0, 0x42, 0x16, 0x82, 0x6f, 0xa7, 0xa4, 0x37 }, + { 0xff, 0x68, 0xca, 0x40, 0x35, 0xbf, 0xeb, 0x43, + 0xfb, 0xf1, 0x45, 0xfd, 0xdd, 0x5e, 0x43, 0xf1, + 0xce, 0xa5, 0x4f, 0x11, 0xf7, 0xbe, 0xe1, 0x30, + 0x58, 0xf0, 0x27, 0x32, 0x9a, 0x4a, 0x5f, 0xa4 }, + { 0x1d, 0x4e, 0x54, 0x87, 0xae, 0x3c, 0x74, 0x0f, + 0x2b, 0xa6, 0xe5, 0x41, 0xac, 0x91, 0xbc, 0x2b, + 0xfc, 0xd2, 0x99, 0x9c, 0x51, 0x8d, 0x80, 0x7b, + 0x42, 0x67, 0x48, 0x80, 0x3a, 0x35, 0x0f, 0xd4 }, + { 0x6d, 0x24, 0x4e, 0x1a, 0x06, 0xce, 0x4e, 0xf5, + 0x78, 0xdd, 0x0f, 0x63, 0xaf, 0xf0, 0x93, 0x67, + 0x06, 0x73, 0x51, 0x19, 0xca, 0x9c, 0x8d, 0x22, + 0xd8, 0x6c, 0x80, 0x14, 0x14, 0xab, 0x97, 0x41 }, + { 0xde, 0xcf, 0x73, 0x29, 0xdb, 0xcc, 0x82, 0x7b, + 0x8f, 0xc5, 0x24, 0xc9, 0x43, 0x1e, 0x89, 0x98, + 0x02, 0x9e, 0xce, 0x12, 0xce, 0x93, 0xb7, 0xb2, + 0xf3, 0xe7, 0x69, 0xa9, 0x41, 0xfb, 0x8c, 0xea }, + { 0x2f, 0xaf, 0xcc, 0x0f, 0x2e, 0x63, 0xcb, 0xd0, + 0x77, 0x55, 0xbe, 0x7b, 0x75, 0xec, 0xea, 0x0a, + 0xdf, 0xf9, 0xaa, 0x5e, 0xde, 0x2a, 0x52, 0xfd, + 0xab, 0x4d, 0xfd, 0x03, 0x74, 0xcd, 0x48, 0x3f }, + { 0xaa, 0x85, 0x01, 0x0d, 0xd4, 0x6a, 0x54, 0x6b, + 0x53, 0x5e, 0xf4, 0xcf, 0x5f, 0x07, 0xd6, 0x51, + 0x61, 0xe8, 0x98, 0x28, 0xf3, 0xa7, 0x7d, 0xb7, + 0xb9, 0xb5, 0x6f, 0x0d, 0xf5, 0x9a, 0xae, 0x45 }, + { 0x07, 0xe8, 0xe1, 0xee, 0x73, 0x2c, 0xb0, 0xd3, + 0x56, 0xc9, 0xc0, 0xd1, 0x06, 0x9c, 0x89, 0xd1, + 0x7a, 
0xdf, 0x6a, 0x9a, 0x33, 0x4f, 0x74, 0x5e, + 0xc7, 0x86, 0x73, 0x32, 0x54, 0x8c, 0xa8, 0xe9 }, + { 0x0e, 0x01, 0xe8, 0x1c, 0xad, 0xa8, 0x16, 0x2b, + 0xfd, 0x5f, 0x8a, 0x8c, 0x81, 0x8a, 0x6c, 0x69, + 0xfe, 0xdf, 0x02, 0xce, 0xb5, 0x20, 0x85, 0x23, + 0xcb, 0xe5, 0x31, 0x3b, 0x89, 0xca, 0x10, 0x53 }, + { 0x6b, 0xb6, 0xc6, 0x47, 0x26, 0x55, 0x08, 0x43, + 0x99, 0x85, 0x2e, 0x00, 0x24, 0x9f, 0x8c, 0xb2, + 0x47, 0x89, 0x6d, 0x39, 0x2b, 0x02, 0xd7, 0x3b, + 0x7f, 0x0d, 0xd8, 0x18, 0xe1, 0xe2, 0x9b, 0x07 }, + { 0x42, 0xd4, 0x63, 0x6e, 0x20, 0x60, 0xf0, 0x8f, + 0x41, 0xc8, 0x82, 0xe7, 0x6b, 0x39, 0x6b, 0x11, + 0x2e, 0xf6, 0x27, 0xcc, 0x24, 0xc4, 0x3d, 0xd5, + 0xf8, 0x3a, 0x1d, 0x1a, 0x7e, 0xad, 0x71, 0x1a }, + { 0x48, 0x58, 0xc9, 0xa1, 0x88, 0xb0, 0x23, 0x4f, + 0xb9, 0xa8, 0xd4, 0x7d, 0x0b, 0x41, 0x33, 0x65, + 0x0a, 0x03, 0x0b, 0xd0, 0x61, 0x1b, 0x87, 0xc3, + 0x89, 0x2e, 0x94, 0x95, 0x1f, 0x8d, 0xf8, 0x52 }, + { 0x3f, 0xab, 0x3e, 0x36, 0x98, 0x8d, 0x44, 0x5a, + 0x51, 0xc8, 0x78, 0x3e, 0x53, 0x1b, 0xe3, 0xa0, + 0x2b, 0xe4, 0x0c, 0xd0, 0x47, 0x96, 0xcf, 0xb6, + 0x1d, 0x40, 0x34, 0x74, 0x42, 0xd3, 0xf7, 0x94 }, + { 0xeb, 0xab, 0xc4, 0x96, 0x36, 0xbd, 0x43, 0x3d, + 0x2e, 0xc8, 0xf0, 0xe5, 0x18, 0x73, 0x2e, 0xf8, + 0xfa, 0x21, 0xd4, 0xd0, 0x71, 0xcc, 0x3b, 0xc4, + 0x6c, 0xd7, 0x9f, 0xa3, 0x8a, 0x28, 0xb8, 0x10 }, + { 0xa1, 0xd0, 0x34, 0x35, 0x23, 0xb8, 0x93, 0xfc, + 0xa8, 0x4f, 0x47, 0xfe, 0xb4, 0xa6, 0x4d, 0x35, + 0x0a, 0x17, 0xd8, 0xee, 0xf5, 0x49, 0x7e, 0xce, + 0x69, 0x7d, 0x02, 0xd7, 0x91, 0x78, 0xb5, 0x91 }, + { 0x26, 0x2e, 0xbf, 0xd9, 0x13, 0x0b, 0x7d, 0x28, + 0x76, 0x0d, 0x08, 0xef, 0x8b, 0xfd, 0x3b, 0x86, + 0xcd, 0xd3, 0xb2, 0x11, 0x3d, 0x2c, 0xae, 0xf7, + 0xea, 0x95, 0x1a, 0x30, 0x3d, 0xfa, 0x38, 0x46 }, + { 0xf7, 0x61, 0x58, 0xed, 0xd5, 0x0a, 0x15, 0x4f, + 0xa7, 0x82, 0x03, 0xed, 0x23, 0x62, 0x93, 0x2f, + 0xcb, 0x82, 0x53, 0xaa, 0xe3, 0x78, 0x90, 0x3e, + 0xde, 0xd1, 0xe0, 0x3f, 0x70, 0x21, 0xa2, 0x57 }, + { 0x26, 0x17, 0x8e, 0x95, 0x0a, 0xc7, 0x22, 0xf6, + 0x7a, 0xe5, 0x6e, 0x57, 0x1b, 0x28, 0x4c, 0x02, + 0x07, 0x68, 0x4a, 0x63, 0x34, 0xa1, 0x77, 0x48, + 0xa9, 0x4d, 0x26, 0x0b, 0xc5, 0xf5, 0x52, 0x74 }, + { 0xc3, 0x78, 0xd1, 0xe4, 0x93, 0xb4, 0x0e, 0xf1, + 0x1f, 0xe6, 0xa1, 0x5d, 0x9c, 0x27, 0x37, 0xa3, + 0x78, 0x09, 0x63, 0x4c, 0x5a, 0xba, 0xd5, 0xb3, + 0x3d, 0x7e, 0x39, 0x3b, 0x4a, 0xe0, 0x5d, 0x03 }, + { 0x98, 0x4b, 0xd8, 0x37, 0x91, 0x01, 0xbe, 0x8f, + 0xd8, 0x06, 0x12, 0xd8, 0xea, 0x29, 0x59, 0xa7, + 0x86, 0x5e, 0xc9, 0x71, 0x85, 0x23, 0x55, 0x01, + 0x07, 0xae, 0x39, 0x38, 0xdf, 0x32, 0x01, 0x1b }, + { 0xc6, 0xf2, 0x5a, 0x81, 0x2a, 0x14, 0x48, 0x58, + 0xac, 0x5c, 0xed, 0x37, 0xa9, 0x3a, 0x9f, 0x47, + 0x59, 0xba, 0x0b, 0x1c, 0x0f, 0xdc, 0x43, 0x1d, + 0xce, 0x35, 0xf9, 0xec, 0x1f, 0x1f, 0x4a, 0x99 }, + { 0x92, 0x4c, 0x75, 0xc9, 0x44, 0x24, 0xff, 0x75, + 0xe7, 0x4b, 0x8b, 0x4e, 0x94, 0x35, 0x89, 0x58, + 0xb0, 0x27, 0xb1, 0x71, 0xdf, 0x5e, 0x57, 0x89, + 0x9a, 0xd0, 0xd4, 0xda, 0xc3, 0x73, 0x53, 0xb6 }, + { 0x0a, 0xf3, 0x58, 0x92, 0xa6, 0x3f, 0x45, 0x93, + 0x1f, 0x68, 0x46, 0xed, 0x19, 0x03, 0x61, 0xcd, + 0x07, 0x30, 0x89, 0xe0, 0x77, 0x16, 0x57, 0x14, + 0xb5, 0x0b, 0x81, 0xa2, 0xe3, 0xdd, 0x9b, 0xa1 }, + { 0xcc, 0x80, 0xce, 0xfb, 0x26, 0xc3, 0xb2, 0xb0, + 0xda, 0xef, 0x23, 0x3e, 0x60, 0x6d, 0x5f, 0xfc, + 0x80, 0xfa, 0x17, 0x42, 0x7d, 0x18, 0xe3, 0x04, + 0x89, 0x67, 0x3e, 0x06, 0xef, 0x4b, 0x87, 0xf7 }, + { 0xc2, 0xf8, 0xc8, 0x11, 0x74, 0x47, 0xf3, 0x97, + 0x8b, 0x08, 0x18, 0xdc, 0xf6, 0xf7, 0x01, 0x16, + 0xac, 0x56, 0xfd, 0x18, 0x4d, 0xd1, 0x27, 0x84, + 0x94, 0xe1, 0x03, 0xfc, 0x6d, 0x74, 0xa8, 
0x87 }, + { 0xbd, 0xec, 0xf6, 0xbf, 0xc1, 0xba, 0x0d, 0xf6, + 0xe8, 0x62, 0xc8, 0x31, 0x99, 0x22, 0x07, 0x79, + 0x6a, 0xcc, 0x79, 0x79, 0x68, 0x35, 0x88, 0x28, + 0xc0, 0x6e, 0x7a, 0x51, 0xe0, 0x90, 0x09, 0x8f }, + { 0x24, 0xd1, 0xa2, 0x6e, 0x3d, 0xab, 0x02, 0xfe, + 0x45, 0x72, 0xd2, 0xaa, 0x7d, 0xbd, 0x3e, 0xc3, + 0x0f, 0x06, 0x93, 0xdb, 0x26, 0xf2, 0x73, 0xd0, + 0xab, 0x2c, 0xb0, 0xc1, 0x3b, 0x5e, 0x64, 0x51 }, + { 0xec, 0x56, 0xf5, 0x8b, 0x09, 0x29, 0x9a, 0x30, + 0x0b, 0x14, 0x05, 0x65, 0xd7, 0xd3, 0xe6, 0x87, + 0x82, 0xb6, 0xe2, 0xfb, 0xeb, 0x4b, 0x7e, 0xa9, + 0x7a, 0xc0, 0x57, 0x98, 0x90, 0x61, 0xdd, 0x3f }, + { 0x11, 0xa4, 0x37, 0xc1, 0xab, 0xa3, 0xc1, 0x19, + 0xdd, 0xfa, 0xb3, 0x1b, 0x3e, 0x8c, 0x84, 0x1d, + 0xee, 0xeb, 0x91, 0x3e, 0xf5, 0x7f, 0x7e, 0x48, + 0xf2, 0xc9, 0xcf, 0x5a, 0x28, 0xfa, 0x42, 0xbc }, + { 0x53, 0xc7, 0xe6, 0x11, 0x4b, 0x85, 0x0a, 0x2c, + 0xb4, 0x96, 0xc9, 0xb3, 0xc6, 0x9a, 0x62, 0x3e, + 0xae, 0xa2, 0xcb, 0x1d, 0x33, 0xdd, 0x81, 0x7e, + 0x47, 0x65, 0xed, 0xaa, 0x68, 0x23, 0xc2, 0x28 }, + { 0x15, 0x4c, 0x3e, 0x96, 0xfe, 0xe5, 0xdb, 0x14, + 0xf8, 0x77, 0x3e, 0x18, 0xaf, 0x14, 0x85, 0x79, + 0x13, 0x50, 0x9d, 0xa9, 0x99, 0xb4, 0x6c, 0xdd, + 0x3d, 0x4c, 0x16, 0x97, 0x60, 0xc8, 0x3a, 0xd2 }, + { 0x40, 0xb9, 0x91, 0x6f, 0x09, 0x3e, 0x02, 0x7a, + 0x87, 0x86, 0x64, 0x18, 0x18, 0x92, 0x06, 0x20, + 0x47, 0x2f, 0xbc, 0xf6, 0x8f, 0x70, 0x1d, 0x1b, + 0x68, 0x06, 0x32, 0xe6, 0x99, 0x6b, 0xde, 0xd3 }, + { 0x24, 0xc4, 0xcb, 0xba, 0x07, 0x11, 0x98, 0x31, + 0xa7, 0x26, 0xb0, 0x53, 0x05, 0xd9, 0x6d, 0xa0, + 0x2f, 0xf8, 0xb1, 0x48, 0xf0, 0xda, 0x44, 0x0f, + 0xe2, 0x33, 0xbc, 0xaa, 0x32, 0xc7, 0x2f, 0x6f }, + { 0x5d, 0x20, 0x15, 0x10, 0x25, 0x00, 0x20, 0xb7, + 0x83, 0x68, 0x96, 0x88, 0xab, 0xbf, 0x8e, 0xcf, + 0x25, 0x94, 0xa9, 0x6a, 0x08, 0xf2, 0xbf, 0xec, + 0x6c, 0xe0, 0x57, 0x44, 0x65, 0xdd, 0xed, 0x71 }, + { 0x04, 0x3b, 0x97, 0xe3, 0x36, 0xee, 0x6f, 0xdb, + 0xbe, 0x2b, 0x50, 0xf2, 0x2a, 0xf8, 0x32, 0x75, + 0xa4, 0x08, 0x48, 0x05, 0xd2, 0xd5, 0x64, 0x59, + 0x62, 0x45, 0x4b, 0x6c, 0x9b, 0x80, 0x53, 0xa0 }, + { 0x56, 0x48, 0x35, 0xcb, 0xae, 0xa7, 0x74, 0x94, + 0x85, 0x68, 0xbe, 0x36, 0xcf, 0x52, 0xfc, 0xdd, + 0x83, 0x93, 0x4e, 0xb0, 0xa2, 0x75, 0x12, 0xdb, + 0xe3, 0xe2, 0xdb, 0x47, 0xb9, 0xe6, 0x63, 0x5a }, + { 0xf2, 0x1c, 0x33, 0xf4, 0x7b, 0xde, 0x40, 0xa2, + 0xa1, 0x01, 0xc9, 0xcd, 0xe8, 0x02, 0x7a, 0xaf, + 0x61, 0xa3, 0x13, 0x7d, 0xe2, 0x42, 0x2b, 0x30, + 0x03, 0x5a, 0x04, 0xc2, 0x70, 0x89, 0x41, 0x83 }, + { 0x9d, 0xb0, 0xef, 0x74, 0xe6, 0x6c, 0xbb, 0x84, + 0x2e, 0xb0, 0xe0, 0x73, 0x43, 0xa0, 0x3c, 0x5c, + 0x56, 0x7e, 0x37, 0x2b, 0x3f, 0x23, 0xb9, 0x43, + 0xc7, 0x88, 0xa4, 0xf2, 0x50, 0xf6, 0x78, 0x91 }, + { 0xab, 0x8d, 0x08, 0x65, 0x5f, 0xf1, 0xd3, 0xfe, + 0x87, 0x58, 0xd5, 0x62, 0x23, 0x5f, 0xd2, 0x3e, + 0x7c, 0xf9, 0xdc, 0xaa, 0xd6, 0x58, 0x87, 0x2a, + 0x49, 0xe5, 0xd3, 0x18, 0x3b, 0x6c, 0xce, 0xbd }, + { 0x6f, 0x27, 0xf7, 0x7e, 0x7b, 0xcf, 0x46, 0xa1, + 0xe9, 0x63, 0xad, 0xe0, 0x30, 0x97, 0x33, 0x54, + 0x30, 0x31, 0xdc, 0xcd, 0xd4, 0x7c, 0xaa, 0xc1, + 0x74, 0xd7, 0xd2, 0x7c, 0xe8, 0x07, 0x7e, 0x8b }, + { 0xe3, 0xcd, 0x54, 0xda, 0x7e, 0x44, 0x4c, 0xaa, + 0x62, 0x07, 0x56, 0x95, 0x25, 0xa6, 0x70, 0xeb, + 0xae, 0x12, 0x78, 0xde, 0x4e, 0x3f, 0xe2, 0x68, + 0x4b, 0x3e, 0x33, 0xf5, 0xef, 0x90, 0xcc, 0x1b }, + { 0xb2, 0xc3, 0xe3, 0x3a, 0x51, 0xd2, 0x2c, 0x4c, + 0x08, 0xfc, 0x09, 0x89, 0xc8, 0x73, 0xc9, 0xcc, + 0x41, 0x50, 0x57, 0x9b, 0x1e, 0x61, 0x63, 0xfa, + 0x69, 0x4a, 0xd5, 0x1d, 0x53, 0xd7, 0x12, 0xdc }, + { 0xbe, 0x7f, 0xda, 0x98, 0x3e, 0x13, 0x18, 0x9b, + 0x4c, 0x77, 0xe0, 0xa8, 
0x09, 0x20, 0xb6, 0xe0, + 0xe0, 0xea, 0x80, 0xc3, 0xb8, 0x4d, 0xbe, 0x7e, + 0x71, 0x17, 0xd2, 0x53, 0xf4, 0x81, 0x12, 0xf4 }, + { 0xb6, 0x00, 0x8c, 0x28, 0xfa, 0xe0, 0x8a, 0xa4, + 0x27, 0xe5, 0xbd, 0x3a, 0xad, 0x36, 0xf1, 0x00, + 0x21, 0xf1, 0x6c, 0x77, 0xcf, 0xea, 0xbe, 0xd0, + 0x7f, 0x97, 0xcc, 0x7d, 0xc1, 0xf1, 0x28, 0x4a }, + { 0x6e, 0x4e, 0x67, 0x60, 0xc5, 0x38, 0xf2, 0xe9, + 0x7b, 0x3a, 0xdb, 0xfb, 0xbc, 0xde, 0x57, 0xf8, + 0x96, 0x6b, 0x7e, 0xa8, 0xfc, 0xb5, 0xbf, 0x7e, + 0xfe, 0xc9, 0x13, 0xfd, 0x2a, 0x2b, 0x0c, 0x55 }, + { 0x4a, 0xe5, 0x1f, 0xd1, 0x83, 0x4a, 0xa5, 0xbd, + 0x9a, 0x6f, 0x7e, 0xc3, 0x9f, 0xc6, 0x63, 0x33, + 0x8d, 0xc5, 0xd2, 0xe2, 0x07, 0x61, 0x56, 0x6d, + 0x90, 0xcc, 0x68, 0xb1, 0xcb, 0x87, 0x5e, 0xd8 }, + { 0xb6, 0x73, 0xaa, 0xd7, 0x5a, 0xb1, 0xfd, 0xb5, + 0x40, 0x1a, 0xbf, 0xa1, 0xbf, 0x89, 0xf3, 0xad, + 0xd2, 0xeb, 0xc4, 0x68, 0xdf, 0x36, 0x24, 0xa4, + 0x78, 0xf4, 0xfe, 0x85, 0x9d, 0x8d, 0x55, 0xe2 }, + { 0x13, 0xc9, 0x47, 0x1a, 0x98, 0x55, 0x91, 0x35, + 0x39, 0x83, 0x66, 0x60, 0x39, 0x8d, 0xa0, 0xf3, + 0xf9, 0x9a, 0xda, 0x08, 0x47, 0x9c, 0x69, 0xd1, + 0xb7, 0xfc, 0xaa, 0x34, 0x61, 0xdd, 0x7e, 0x59 }, + { 0x2c, 0x11, 0xf4, 0xa7, 0xf9, 0x9a, 0x1d, 0x23, + 0xa5, 0x8b, 0xb6, 0x36, 0x35, 0x0f, 0xe8, 0x49, + 0xf2, 0x9c, 0xba, 0xc1, 0xb2, 0xa1, 0x11, 0x2d, + 0x9f, 0x1e, 0xd5, 0xbc, 0x5b, 0x31, 0x3c, 0xcd }, + { 0xc7, 0xd3, 0xc0, 0x70, 0x6b, 0x11, 0xae, 0x74, + 0x1c, 0x05, 0xa1, 0xef, 0x15, 0x0d, 0xd6, 0x5b, + 0x54, 0x94, 0xd6, 0xd5, 0x4c, 0x9a, 0x86, 0xe2, + 0x61, 0x78, 0x54, 0xe6, 0xae, 0xee, 0xbb, 0xd9 }, + { 0x19, 0x4e, 0x10, 0xc9, 0x38, 0x93, 0xaf, 0xa0, + 0x64, 0xc3, 0xac, 0x04, 0xc0, 0xdd, 0x80, 0x8d, + 0x79, 0x1c, 0x3d, 0x4b, 0x75, 0x56, 0xe8, 0x9d, + 0x8d, 0x9c, 0xb2, 0x25, 0xc4, 0xb3, 0x33, 0x39 }, + { 0x6f, 0xc4, 0x98, 0x8b, 0x8f, 0x78, 0x54, 0x6b, + 0x16, 0x88, 0x99, 0x18, 0x45, 0x90, 0x8f, 0x13, + 0x4b, 0x6a, 0x48, 0x2e, 0x69, 0x94, 0xb3, 0xd4, + 0x83, 0x17, 0xbf, 0x08, 0xdb, 0x29, 0x21, 0x85 }, + { 0x56, 0x65, 0xbe, 0xb8, 0xb0, 0x95, 0x55, 0x25, + 0x81, 0x3b, 0x59, 0x81, 0xcd, 0x14, 0x2e, 0xd4, + 0xd0, 0x3f, 0xba, 0x38, 0xa6, 0xf3, 0xe5, 0xad, + 0x26, 0x8e, 0x0c, 0xc2, 0x70, 0xd1, 0xcd, 0x11 }, + { 0xb8, 0x83, 0xd6, 0x8f, 0x5f, 0xe5, 0x19, 0x36, + 0x43, 0x1b, 0xa4, 0x25, 0x67, 0x38, 0x05, 0x3b, + 0x1d, 0x04, 0x26, 0xd4, 0xcb, 0x64, 0xb1, 0x6e, + 0x83, 0xba, 0xdc, 0x5e, 0x9f, 0xbe, 0x3b, 0x81 }, + { 0x53, 0xe7, 0xb2, 0x7e, 0xa5, 0x9c, 0x2f, 0x6d, + 0xbb, 0x50, 0x76, 0x9e, 0x43, 0x55, 0x4d, 0xf3, + 0x5a, 0xf8, 0x9f, 0x48, 0x22, 0xd0, 0x46, 0x6b, + 0x00, 0x7d, 0xd6, 0xf6, 0xde, 0xaf, 0xff, 0x02 }, + { 0x1f, 0x1a, 0x02, 0x29, 0xd4, 0x64, 0x0f, 0x01, + 0x90, 0x15, 0x88, 0xd9, 0xde, 0xc2, 0x2d, 0x13, + 0xfc, 0x3e, 0xb3, 0x4a, 0x61, 0xb3, 0x29, 0x38, + 0xef, 0xbf, 0x53, 0x34, 0xb2, 0x80, 0x0a, 0xfa }, + { 0xc2, 0xb4, 0x05, 0xaf, 0xa0, 0xfa, 0x66, 0x68, + 0x85, 0x2a, 0xee, 0x4d, 0x88, 0x04, 0x08, 0x53, + 0xfa, 0xb8, 0x00, 0xe7, 0x2b, 0x57, 0x58, 0x14, + 0x18, 0xe5, 0x50, 0x6f, 0x21, 0x4c, 0x7d, 0x1f }, + { 0xc0, 0x8a, 0xa1, 0xc2, 0x86, 0xd7, 0x09, 0xfd, + 0xc7, 0x47, 0x37, 0x44, 0x97, 0x71, 0x88, 0xc8, + 0x95, 0xba, 0x01, 0x10, 0x14, 0x24, 0x7e, 0x4e, + 0xfa, 0x8d, 0x07, 0xe7, 0x8f, 0xec, 0x69, 0x5c }, + { 0xf0, 0x3f, 0x57, 0x89, 0xd3, 0x33, 0x6b, 0x80, + 0xd0, 0x02, 0xd5, 0x9f, 0xdf, 0x91, 0x8b, 0xdb, + 0x77, 0x5b, 0x00, 0x95, 0x6e, 0xd5, 0x52, 0x8e, + 0x86, 0xaa, 0x99, 0x4a, 0xcb, 0x38, 0xfe, 0x2d } +}; + +static const u8 blake2s_keyed_testvecs[][BLAKE2S_HASH_SIZE] __initconst = { + { 0x48, 0xa8, 0x99, 0x7d, 0xa4, 0x07, 0x87, 0x6b, + 0x3d, 0x79, 0xc0, 0xd9, 
0x23, 0x25, 0xad, 0x3b, + 0x89, 0xcb, 0xb7, 0x54, 0xd8, 0x6a, 0xb7, 0x1a, + 0xee, 0x04, 0x7a, 0xd3, 0x45, 0xfd, 0x2c, 0x49 }, + { 0x40, 0xd1, 0x5f, 0xee, 0x7c, 0x32, 0x88, 0x30, + 0x16, 0x6a, 0xc3, 0xf9, 0x18, 0x65, 0x0f, 0x80, + 0x7e, 0x7e, 0x01, 0xe1, 0x77, 0x25, 0x8c, 0xdc, + 0x0a, 0x39, 0xb1, 0x1f, 0x59, 0x80, 0x66, 0xf1 }, + { 0x6b, 0xb7, 0x13, 0x00, 0x64, 0x4c, 0xd3, 0x99, + 0x1b, 0x26, 0xcc, 0xd4, 0xd2, 0x74, 0xac, 0xd1, + 0xad, 0xea, 0xb8, 0xb1, 0xd7, 0x91, 0x45, 0x46, + 0xc1, 0x19, 0x8b, 0xbe, 0x9f, 0xc9, 0xd8, 0x03 }, + { 0x1d, 0x22, 0x0d, 0xbe, 0x2e, 0xe1, 0x34, 0x66, + 0x1f, 0xdf, 0x6d, 0x9e, 0x74, 0xb4, 0x17, 0x04, + 0x71, 0x05, 0x56, 0xf2, 0xf6, 0xe5, 0xa0, 0x91, + 0xb2, 0x27, 0x69, 0x74, 0x45, 0xdb, 0xea, 0x6b }, + { 0xf6, 0xc3, 0xfb, 0xad, 0xb4, 0xcc, 0x68, 0x7a, + 0x00, 0x64, 0xa5, 0xbe, 0x6e, 0x79, 0x1b, 0xec, + 0x63, 0xb8, 0x68, 0xad, 0x62, 0xfb, 0xa6, 0x1b, + 0x37, 0x57, 0xef, 0x9c, 0xa5, 0x2e, 0x05, 0xb2 }, + { 0x49, 0xc1, 0xf2, 0x11, 0x88, 0xdf, 0xd7, 0x69, + 0xae, 0xa0, 0xe9, 0x11, 0xdd, 0x6b, 0x41, 0xf1, + 0x4d, 0xab, 0x10, 0x9d, 0x2b, 0x85, 0x97, 0x7a, + 0xa3, 0x08, 0x8b, 0x5c, 0x70, 0x7e, 0x85, 0x98 }, + { 0xfd, 0xd8, 0x99, 0x3d, 0xcd, 0x43, 0xf6, 0x96, + 0xd4, 0x4f, 0x3c, 0xea, 0x0f, 0xf3, 0x53, 0x45, + 0x23, 0x4e, 0xc8, 0xee, 0x08, 0x3e, 0xb3, 0xca, + 0xda, 0x01, 0x7c, 0x7f, 0x78, 0xc1, 0x71, 0x43 }, + { 0xe6, 0xc8, 0x12, 0x56, 0x37, 0x43, 0x8d, 0x09, + 0x05, 0xb7, 0x49, 0xf4, 0x65, 0x60, 0xac, 0x89, + 0xfd, 0x47, 0x1c, 0xf8, 0x69, 0x2e, 0x28, 0xfa, + 0xb9, 0x82, 0xf7, 0x3f, 0x01, 0x9b, 0x83, 0xa9 }, + { 0x19, 0xfc, 0x8c, 0xa6, 0x97, 0x9d, 0x60, 0xe6, + 0xed, 0xd3, 0xb4, 0x54, 0x1e, 0x2f, 0x96, 0x7c, + 0xed, 0x74, 0x0d, 0xf6, 0xec, 0x1e, 0xae, 0xbb, + 0xfe, 0x81, 0x38, 0x32, 0xe9, 0x6b, 0x29, 0x74 }, + { 0xa6, 0xad, 0x77, 0x7c, 0xe8, 0x81, 0xb5, 0x2b, + 0xb5, 0xa4, 0x42, 0x1a, 0xb6, 0xcd, 0xd2, 0xdf, + 0xba, 0x13, 0xe9, 0x63, 0x65, 0x2d, 0x4d, 0x6d, + 0x12, 0x2a, 0xee, 0x46, 0x54, 0x8c, 0x14, 0xa7 }, + { 0xf5, 0xc4, 0xb2, 0xba, 0x1a, 0x00, 0x78, 0x1b, + 0x13, 0xab, 0xa0, 0x42, 0x52, 0x42, 0xc6, 0x9c, + 0xb1, 0x55, 0x2f, 0x3f, 0x71, 0xa9, 0xa3, 0xbb, + 0x22, 0xb4, 0xa6, 0xb4, 0x27, 0x7b, 0x46, 0xdd }, + { 0xe3, 0x3c, 0x4c, 0x9b, 0xd0, 0xcc, 0x7e, 0x45, + 0xc8, 0x0e, 0x65, 0xc7, 0x7f, 0xa5, 0x99, 0x7f, + 0xec, 0x70, 0x02, 0x73, 0x85, 0x41, 0x50, 0x9e, + 0x68, 0xa9, 0x42, 0x38, 0x91, 0xe8, 0x22, 0xa3 }, + { 0xfb, 0xa1, 0x61, 0x69, 0xb2, 0xc3, 0xee, 0x10, + 0x5b, 0xe6, 0xe1, 0xe6, 0x50, 0xe5, 0xcb, 0xf4, + 0x07, 0x46, 0xb6, 0x75, 0x3d, 0x03, 0x6a, 0xb5, + 0x51, 0x79, 0x01, 0x4a, 0xd7, 0xef, 0x66, 0x51 }, + { 0xf5, 0xc4, 0xbe, 0xc6, 0xd6, 0x2f, 0xc6, 0x08, + 0xbf, 0x41, 0xcc, 0x11, 0x5f, 0x16, 0xd6, 0x1c, + 0x7e, 0xfd, 0x3f, 0xf6, 0xc6, 0x56, 0x92, 0xbb, + 0xe0, 0xaf, 0xff, 0xb1, 0xfe, 0xde, 0x74, 0x75 }, + { 0xa4, 0x86, 0x2e, 0x76, 0xdb, 0x84, 0x7f, 0x05, + 0xba, 0x17, 0xed, 0xe5, 0xda, 0x4e, 0x7f, 0x91, + 0xb5, 0x92, 0x5c, 0xf1, 0xad, 0x4b, 0xa1, 0x27, + 0x32, 0xc3, 0x99, 0x57, 0x42, 0xa5, 0xcd, 0x6e }, + { 0x65, 0xf4, 0xb8, 0x60, 0xcd, 0x15, 0xb3, 0x8e, + 0xf8, 0x14, 0xa1, 0xa8, 0x04, 0x31, 0x4a, 0x55, + 0xbe, 0x95, 0x3c, 0xaa, 0x65, 0xfd, 0x75, 0x8a, + 0xd9, 0x89, 0xff, 0x34, 0xa4, 0x1c, 0x1e, 0xea }, + { 0x19, 0xba, 0x23, 0x4f, 0x0a, 0x4f, 0x38, 0x63, + 0x7d, 0x18, 0x39, 0xf9, 0xd9, 0xf7, 0x6a, 0xd9, + 0x1c, 0x85, 0x22, 0x30, 0x71, 0x43, 0xc9, 0x7d, + 0x5f, 0x93, 0xf6, 0x92, 0x74, 0xce, 0xc9, 0xa7 }, + { 0x1a, 0x67, 0x18, 0x6c, 0xa4, 0xa5, 0xcb, 0x8e, + 0x65, 0xfc, 0xa0, 0xe2, 0xec, 0xbc, 0x5d, 0xdc, + 0x14, 0xae, 0x38, 0x1b, 0xb8, 0xbf, 0xfe, 0xb9, + 0xe0, 
0xa1, 0x03, 0x44, 0x9e, 0x3e, 0xf0, 0x3c }, + { 0xaf, 0xbe, 0xa3, 0x17, 0xb5, 0xa2, 0xe8, 0x9c, + 0x0b, 0xd9, 0x0c, 0xcf, 0x5d, 0x7f, 0xd0, 0xed, + 0x57, 0xfe, 0x58, 0x5e, 0x4b, 0xe3, 0x27, 0x1b, + 0x0a, 0x6b, 0xf0, 0xf5, 0x78, 0x6b, 0x0f, 0x26 }, + { 0xf1, 0xb0, 0x15, 0x58, 0xce, 0x54, 0x12, 0x62, + 0xf5, 0xec, 0x34, 0x29, 0x9d, 0x6f, 0xb4, 0x09, + 0x00, 0x09, 0xe3, 0x43, 0x4b, 0xe2, 0xf4, 0x91, + 0x05, 0xcf, 0x46, 0xaf, 0x4d, 0x2d, 0x41, 0x24 }, + { 0x13, 0xa0, 0xa0, 0xc8, 0x63, 0x35, 0x63, 0x5e, + 0xaa, 0x74, 0xca, 0x2d, 0x5d, 0x48, 0x8c, 0x79, + 0x7b, 0xbb, 0x4f, 0x47, 0xdc, 0x07, 0x10, 0x50, + 0x15, 0xed, 0x6a, 0x1f, 0x33, 0x09, 0xef, 0xce }, + { 0x15, 0x80, 0xaf, 0xee, 0xbe, 0xbb, 0x34, 0x6f, + 0x94, 0xd5, 0x9f, 0xe6, 0x2d, 0xa0, 0xb7, 0x92, + 0x37, 0xea, 0xd7, 0xb1, 0x49, 0x1f, 0x56, 0x67, + 0xa9, 0x0e, 0x45, 0xed, 0xf6, 0xca, 0x8b, 0x03 }, + { 0x20, 0xbe, 0x1a, 0x87, 0x5b, 0x38, 0xc5, 0x73, + 0xdd, 0x7f, 0xaa, 0xa0, 0xde, 0x48, 0x9d, 0x65, + 0x5c, 0x11, 0xef, 0xb6, 0xa5, 0x52, 0x69, 0x8e, + 0x07, 0xa2, 0xd3, 0x31, 0xb5, 0xf6, 0x55, 0xc3 }, + { 0xbe, 0x1f, 0xe3, 0xc4, 0xc0, 0x40, 0x18, 0xc5, + 0x4c, 0x4a, 0x0f, 0x6b, 0x9a, 0x2e, 0xd3, 0xc5, + 0x3a, 0xbe, 0x3a, 0x9f, 0x76, 0xb4, 0xd2, 0x6d, + 0xe5, 0x6f, 0xc9, 0xae, 0x95, 0x05, 0x9a, 0x99 }, + { 0xe3, 0xe3, 0xac, 0xe5, 0x37, 0xeb, 0x3e, 0xdd, + 0x84, 0x63, 0xd9, 0xad, 0x35, 0x82, 0xe1, 0x3c, + 0xf8, 0x65, 0x33, 0xff, 0xde, 0x43, 0xd6, 0x68, + 0xdd, 0x2e, 0x93, 0xbb, 0xdb, 0xd7, 0x19, 0x5a }, + { 0x11, 0x0c, 0x50, 0xc0, 0xbf, 0x2c, 0x6e, 0x7a, + 0xeb, 0x7e, 0x43, 0x5d, 0x92, 0xd1, 0x32, 0xab, + 0x66, 0x55, 0x16, 0x8e, 0x78, 0xa2, 0xde, 0xcd, + 0xec, 0x33, 0x30, 0x77, 0x76, 0x84, 0xd9, 0xc1 }, + { 0xe9, 0xba, 0x8f, 0x50, 0x5c, 0x9c, 0x80, 0xc0, + 0x86, 0x66, 0xa7, 0x01, 0xf3, 0x36, 0x7e, 0x6c, + 0xc6, 0x65, 0xf3, 0x4b, 0x22, 0xe7, 0x3c, 0x3c, + 0x04, 0x17, 0xeb, 0x1c, 0x22, 0x06, 0x08, 0x2f }, + { 0x26, 0xcd, 0x66, 0xfc, 0xa0, 0x23, 0x79, 0xc7, + 0x6d, 0xf1, 0x23, 0x17, 0x05, 0x2b, 0xca, 0xfd, + 0x6c, 0xd8, 0xc3, 0xa7, 0xb8, 0x90, 0xd8, 0x05, + 0xf3, 0x6c, 0x49, 0x98, 0x97, 0x82, 0x43, 0x3a }, + { 0x21, 0x3f, 0x35, 0x96, 0xd6, 0xe3, 0xa5, 0xd0, + 0xe9, 0x93, 0x2c, 0xd2, 0x15, 0x91, 0x46, 0x01, + 0x5e, 0x2a, 0xbc, 0x94, 0x9f, 0x47, 0x29, 0xee, + 0x26, 0x32, 0xfe, 0x1e, 0xdb, 0x78, 0xd3, 0x37 }, + { 0x10, 0x15, 0xd7, 0x01, 0x08, 0xe0, 0x3b, 0xe1, + 0xc7, 0x02, 0xfe, 0x97, 0x25, 0x36, 0x07, 0xd1, + 0x4a, 0xee, 0x59, 0x1f, 0x24, 0x13, 0xea, 0x67, + 0x87, 0x42, 0x7b, 0x64, 0x59, 0xff, 0x21, 0x9a }, + { 0x3c, 0xa9, 0x89, 0xde, 0x10, 0xcf, 0xe6, 0x09, + 0x90, 0x94, 0x72, 0xc8, 0xd3, 0x56, 0x10, 0x80, + 0x5b, 0x2f, 0x97, 0x77, 0x34, 0xcf, 0x65, 0x2c, + 0xc6, 0x4b, 0x3b, 0xfc, 0x88, 0x2d, 0x5d, 0x89 }, + { 0xb6, 0x15, 0x6f, 0x72, 0xd3, 0x80, 0xee, 0x9e, + 0xa6, 0xac, 0xd1, 0x90, 0x46, 0x4f, 0x23, 0x07, + 0xa5, 0xc1, 0x79, 0xef, 0x01, 0xfd, 0x71, 0xf9, + 0x9f, 0x2d, 0x0f, 0x7a, 0x57, 0x36, 0x0a, 0xea }, + { 0xc0, 0x3b, 0xc6, 0x42, 0xb2, 0x09, 0x59, 0xcb, + 0xe1, 0x33, 0xa0, 0x30, 0x3e, 0x0c, 0x1a, 0xbf, + 0xf3, 0xe3, 0x1e, 0xc8, 0xe1, 0xa3, 0x28, 0xec, + 0x85, 0x65, 0xc3, 0x6d, 0xec, 0xff, 0x52, 0x65 }, + { 0x2c, 0x3e, 0x08, 0x17, 0x6f, 0x76, 0x0c, 0x62, + 0x64, 0xc3, 0xa2, 0xcd, 0x66, 0xfe, 0xc6, 0xc3, + 0xd7, 0x8d, 0xe4, 0x3f, 0xc1, 0x92, 0x45, 0x7b, + 0x2a, 0x4a, 0x66, 0x0a, 0x1e, 0x0e, 0xb2, 0x2b }, + { 0xf7, 0x38, 0xc0, 0x2f, 0x3c, 0x1b, 0x19, 0x0c, + 0x51, 0x2b, 0x1a, 0x32, 0xde, 0xab, 0xf3, 0x53, + 0x72, 0x8e, 0x0e, 0x9a, 0xb0, 0x34, 0x49, 0x0e, + 0x3c, 0x34, 0x09, 0x94, 0x6a, 0x97, 0xae, 0xec }, + { 0x8b, 0x18, 0x80, 0xdf, 0x30, 0x1c, 
0xc9, 0x63, + 0x41, 0x88, 0x11, 0x08, 0x89, 0x64, 0x83, 0x92, + 0x87, 0xff, 0x7f, 0xe3, 0x1c, 0x49, 0xea, 0x6e, + 0xbd, 0x9e, 0x48, 0xbd, 0xee, 0xe4, 0x97, 0xc5 }, + { 0x1e, 0x75, 0xcb, 0x21, 0xc6, 0x09, 0x89, 0x02, + 0x03, 0x75, 0xf1, 0xa7, 0xa2, 0x42, 0x83, 0x9f, + 0x0b, 0x0b, 0x68, 0x97, 0x3a, 0x4c, 0x2a, 0x05, + 0xcf, 0x75, 0x55, 0xed, 0x5a, 0xae, 0xc4, 0xc1 }, + { 0x62, 0xbf, 0x8a, 0x9c, 0x32, 0xa5, 0xbc, 0xcf, + 0x29, 0x0b, 0x6c, 0x47, 0x4d, 0x75, 0xb2, 0xa2, + 0xa4, 0x09, 0x3f, 0x1a, 0x9e, 0x27, 0x13, 0x94, + 0x33, 0xa8, 0xf2, 0xb3, 0xbc, 0xe7, 0xb8, 0xd7 }, + { 0x16, 0x6c, 0x83, 0x50, 0xd3, 0x17, 0x3b, 0x5e, + 0x70, 0x2b, 0x78, 0x3d, 0xfd, 0x33, 0xc6, 0x6e, + 0xe0, 0x43, 0x27, 0x42, 0xe9, 0xb9, 0x2b, 0x99, + 0x7f, 0xd2, 0x3c, 0x60, 0xdc, 0x67, 0x56, 0xca }, + { 0x04, 0x4a, 0x14, 0xd8, 0x22, 0xa9, 0x0c, 0xac, + 0xf2, 0xf5, 0xa1, 0x01, 0x42, 0x8a, 0xdc, 0x8f, + 0x41, 0x09, 0x38, 0x6c, 0xcb, 0x15, 0x8b, 0xf9, + 0x05, 0xc8, 0x61, 0x8b, 0x8e, 0xe2, 0x4e, 0xc3 }, + { 0x38, 0x7d, 0x39, 0x7e, 0xa4, 0x3a, 0x99, 0x4b, + 0xe8, 0x4d, 0x2d, 0x54, 0x4a, 0xfb, 0xe4, 0x81, + 0xa2, 0x00, 0x0f, 0x55, 0x25, 0x26, 0x96, 0xbb, + 0xa2, 0xc5, 0x0c, 0x8e, 0xbd, 0x10, 0x13, 0x47 }, + { 0x56, 0xf8, 0xcc, 0xf1, 0xf8, 0x64, 0x09, 0xb4, + 0x6c, 0xe3, 0x61, 0x66, 0xae, 0x91, 0x65, 0x13, + 0x84, 0x41, 0x57, 0x75, 0x89, 0xdb, 0x08, 0xcb, + 0xc5, 0xf6, 0x6c, 0xa2, 0x97, 0x43, 0xb9, 0xfd }, + { 0x97, 0x06, 0xc0, 0x92, 0xb0, 0x4d, 0x91, 0xf5, + 0x3d, 0xff, 0x91, 0xfa, 0x37, 0xb7, 0x49, 0x3d, + 0x28, 0xb5, 0x76, 0xb5, 0xd7, 0x10, 0x46, 0x9d, + 0xf7, 0x94, 0x01, 0x66, 0x22, 0x36, 0xfc, 0x03 }, + { 0x87, 0x79, 0x68, 0x68, 0x6c, 0x06, 0x8c, 0xe2, + 0xf7, 0xe2, 0xad, 0xcf, 0xf6, 0x8b, 0xf8, 0x74, + 0x8e, 0xdf, 0x3c, 0xf8, 0x62, 0xcf, 0xb4, 0xd3, + 0x94, 0x7a, 0x31, 0x06, 0x95, 0x80, 0x54, 0xe3 }, + { 0x88, 0x17, 0xe5, 0x71, 0x98, 0x79, 0xac, 0xf7, + 0x02, 0x47, 0x87, 0xec, 0xcd, 0xb2, 0x71, 0x03, + 0x55, 0x66, 0xcf, 0xa3, 0x33, 0xe0, 0x49, 0x40, + 0x7c, 0x01, 0x78, 0xcc, 0xc5, 0x7a, 0x5b, 0x9f }, + { 0x89, 0x38, 0x24, 0x9e, 0x4b, 0x50, 0xca, 0xda, + 0xcc, 0xdf, 0x5b, 0x18, 0x62, 0x13, 0x26, 0xcb, + 0xb1, 0x52, 0x53, 0xe3, 0x3a, 0x20, 0xf5, 0x63, + 0x6e, 0x99, 0x5d, 0x72, 0x47, 0x8d, 0xe4, 0x72 }, + { 0xf1, 0x64, 0xab, 0xba, 0x49, 0x63, 0xa4, 0x4d, + 0x10, 0x72, 0x57, 0xe3, 0x23, 0x2d, 0x90, 0xac, + 0xa5, 0xe6, 0x6a, 0x14, 0x08, 0x24, 0x8c, 0x51, + 0x74, 0x1e, 0x99, 0x1d, 0xb5, 0x22, 0x77, 0x56 }, + { 0xd0, 0x55, 0x63, 0xe2, 0xb1, 0xcb, 0xa0, 0xc4, + 0xa2, 0xa1, 0xe8, 0xbd, 0xe3, 0xa1, 0xa0, 0xd9, + 0xf5, 0xb4, 0x0c, 0x85, 0xa0, 0x70, 0xd6, 0xf5, + 0xfb, 0x21, 0x06, 0x6e, 0xad, 0x5d, 0x06, 0x01 }, + { 0x03, 0xfb, 0xb1, 0x63, 0x84, 0xf0, 0xa3, 0x86, + 0x6f, 0x4c, 0x31, 0x17, 0x87, 0x76, 0x66, 0xef, + 0xbf, 0x12, 0x45, 0x97, 0x56, 0x4b, 0x29, 0x3d, + 0x4a, 0xab, 0x0d, 0x26, 0x9f, 0xab, 0xdd, 0xfa }, + { 0x5f, 0xa8, 0x48, 0x6a, 0xc0, 0xe5, 0x29, 0x64, + 0xd1, 0x88, 0x1b, 0xbe, 0x33, 0x8e, 0xb5, 0x4b, + 0xe2, 0xf7, 0x19, 0x54, 0x92, 0x24, 0x89, 0x20, + 0x57, 0xb4, 0xda, 0x04, 0xba, 0x8b, 0x34, 0x75 }, + { 0xcd, 0xfa, 0xbc, 0xee, 0x46, 0x91, 0x11, 0x11, + 0x23, 0x6a, 0x31, 0x70, 0x8b, 0x25, 0x39, 0xd7, + 0x1f, 0xc2, 0x11, 0xd9, 0xb0, 0x9c, 0x0d, 0x85, + 0x30, 0xa1, 0x1e, 0x1d, 0xbf, 0x6e, 0xed, 0x01 }, + { 0x4f, 0x82, 0xde, 0x03, 0xb9, 0x50, 0x47, 0x93, + 0xb8, 0x2a, 0x07, 0xa0, 0xbd, 0xcd, 0xff, 0x31, + 0x4d, 0x75, 0x9e, 0x7b, 0x62, 0xd2, 0x6b, 0x78, + 0x49, 0x46, 0xb0, 0xd3, 0x6f, 0x91, 0x6f, 0x52 }, + { 0x25, 0x9e, 0xc7, 0xf1, 0x73, 0xbc, 0xc7, 0x6a, + 0x09, 0x94, 0xc9, 0x67, 0xb4, 0xf5, 0xf0, 0x24, + 0xc5, 0x60, 0x57, 
0xfb, 0x79, 0xc9, 0x65, 0xc4, + 0xfa, 0xe4, 0x18, 0x75, 0xf0, 0x6a, 0x0e, 0x4c }, + { 0x19, 0x3c, 0xc8, 0xe7, 0xc3, 0xe0, 0x8b, 0xb3, + 0x0f, 0x54, 0x37, 0xaa, 0x27, 0xad, 0xe1, 0xf1, + 0x42, 0x36, 0x9b, 0x24, 0x6a, 0x67, 0x5b, 0x23, + 0x83, 0xe6, 0xda, 0x9b, 0x49, 0xa9, 0x80, 0x9e }, + { 0x5c, 0x10, 0x89, 0x6f, 0x0e, 0x28, 0x56, 0xb2, + 0xa2, 0xee, 0xe0, 0xfe, 0x4a, 0x2c, 0x16, 0x33, + 0x56, 0x5d, 0x18, 0xf0, 0xe9, 0x3e, 0x1f, 0xab, + 0x26, 0xc3, 0x73, 0xe8, 0xf8, 0x29, 0x65, 0x4d }, + { 0xf1, 0x60, 0x12, 0xd9, 0x3f, 0x28, 0x85, 0x1a, + 0x1e, 0xb9, 0x89, 0xf5, 0xd0, 0xb4, 0x3f, 0x3f, + 0x39, 0xca, 0x73, 0xc9, 0xa6, 0x2d, 0x51, 0x81, + 0xbf, 0xf2, 0x37, 0x53, 0x6b, 0xd3, 0x48, 0xc3 }, + { 0x29, 0x66, 0xb3, 0xcf, 0xae, 0x1e, 0x44, 0xea, + 0x99, 0x6d, 0xc5, 0xd6, 0x86, 0xcf, 0x25, 0xfa, + 0x05, 0x3f, 0xb6, 0xf6, 0x72, 0x01, 0xb9, 0xe4, + 0x6e, 0xad, 0xe8, 0x5d, 0x0a, 0xd6, 0xb8, 0x06 }, + { 0xdd, 0xb8, 0x78, 0x24, 0x85, 0xe9, 0x00, 0xbc, + 0x60, 0xbc, 0xf4, 0xc3, 0x3a, 0x6f, 0xd5, 0x85, + 0x68, 0x0c, 0xc6, 0x83, 0xd5, 0x16, 0xef, 0xa0, + 0x3e, 0xb9, 0x98, 0x5f, 0xad, 0x87, 0x15, 0xfb }, + { 0x4c, 0x4d, 0x6e, 0x71, 0xae, 0xa0, 0x57, 0x86, + 0x41, 0x31, 0x48, 0xfc, 0x7a, 0x78, 0x6b, 0x0e, + 0xca, 0xf5, 0x82, 0xcf, 0xf1, 0x20, 0x9f, 0x5a, + 0x80, 0x9f, 0xba, 0x85, 0x04, 0xce, 0x66, 0x2c }, + { 0xfb, 0x4c, 0x5e, 0x86, 0xd7, 0xb2, 0x22, 0x9b, + 0x99, 0xb8, 0xba, 0x6d, 0x94, 0xc2, 0x47, 0xef, + 0x96, 0x4a, 0xa3, 0xa2, 0xba, 0xe8, 0xed, 0xc7, + 0x75, 0x69, 0xf2, 0x8d, 0xbb, 0xff, 0x2d, 0x4e }, + { 0xe9, 0x4f, 0x52, 0x6d, 0xe9, 0x01, 0x96, 0x33, + 0xec, 0xd5, 0x4a, 0xc6, 0x12, 0x0f, 0x23, 0x95, + 0x8d, 0x77, 0x18, 0xf1, 0xe7, 0x71, 0x7b, 0xf3, + 0x29, 0x21, 0x1a, 0x4f, 0xae, 0xed, 0x4e, 0x6d }, + { 0xcb, 0xd6, 0x66, 0x0a, 0x10, 0xdb, 0x3f, 0x23, + 0xf7, 0xa0, 0x3d, 0x4b, 0x9d, 0x40, 0x44, 0xc7, + 0x93, 0x2b, 0x28, 0x01, 0xac, 0x89, 0xd6, 0x0b, + 0xc9, 0xeb, 0x92, 0xd6, 0x5a, 0x46, 0xc2, 0xa0 }, + { 0x88, 0x18, 0xbb, 0xd3, 0xdb, 0x4d, 0xc1, 0x23, + 0xb2, 0x5c, 0xbb, 0xa5, 0xf5, 0x4c, 0x2b, 0xc4, + 0xb3, 0xfc, 0xf9, 0xbf, 0x7d, 0x7a, 0x77, 0x09, + 0xf4, 0xae, 0x58, 0x8b, 0x26, 0x7c, 0x4e, 0xce }, + { 0xc6, 0x53, 0x82, 0x51, 0x3f, 0x07, 0x46, 0x0d, + 0xa3, 0x98, 0x33, 0xcb, 0x66, 0x6c, 0x5e, 0xd8, + 0x2e, 0x61, 0xb9, 0xe9, 0x98, 0xf4, 0xb0, 0xc4, + 0x28, 0x7c, 0xee, 0x56, 0xc3, 0xcc, 0x9b, 0xcd }, + { 0x89, 0x75, 0xb0, 0x57, 0x7f, 0xd3, 0x55, 0x66, + 0xd7, 0x50, 0xb3, 0x62, 0xb0, 0x89, 0x7a, 0x26, + 0xc3, 0x99, 0x13, 0x6d, 0xf0, 0x7b, 0xab, 0xab, + 0xbd, 0xe6, 0x20, 0x3f, 0xf2, 0x95, 0x4e, 0xd4 }, + { 0x21, 0xfe, 0x0c, 0xeb, 0x00, 0x52, 0xbe, 0x7f, + 0xb0, 0xf0, 0x04, 0x18, 0x7c, 0xac, 0xd7, 0xde, + 0x67, 0xfa, 0x6e, 0xb0, 0x93, 0x8d, 0x92, 0x76, + 0x77, 0xf2, 0x39, 0x8c, 0x13, 0x23, 0x17, 0xa8 }, + { 0x2e, 0xf7, 0x3f, 0x3c, 0x26, 0xf1, 0x2d, 0x93, + 0x88, 0x9f, 0x3c, 0x78, 0xb6, 0xa6, 0x6c, 0x1d, + 0x52, 0xb6, 0x49, 0xdc, 0x9e, 0x85, 0x6e, 0x2c, + 0x17, 0x2e, 0xa7, 0xc5, 0x8a, 0xc2, 0xb5, 0xe3 }, + { 0x38, 0x8a, 0x3c, 0xd5, 0x6d, 0x73, 0x86, 0x7a, + 0xbb, 0x5f, 0x84, 0x01, 0x49, 0x2b, 0x6e, 0x26, + 0x81, 0xeb, 0x69, 0x85, 0x1e, 0x76, 0x7f, 0xd8, + 0x42, 0x10, 0xa5, 0x60, 0x76, 0xfb, 0x3d, 0xd3 }, + { 0xaf, 0x53, 0x3e, 0x02, 0x2f, 0xc9, 0x43, 0x9e, + 0x4e, 0x3c, 0xb8, 0x38, 0xec, 0xd1, 0x86, 0x92, + 0x23, 0x2a, 0xdf, 0x6f, 0xe9, 0x83, 0x95, 0x26, + 0xd3, 0xc3, 0xdd, 0x1b, 0x71, 0x91, 0x0b, 0x1a }, + { 0x75, 0x1c, 0x09, 0xd4, 0x1a, 0x93, 0x43, 0x88, + 0x2a, 0x81, 0xcd, 0x13, 0xee, 0x40, 0x81, 0x8d, + 0x12, 0xeb, 0x44, 0xc6, 0xc7, 0xf4, 0x0d, 0xf1, + 0x6e, 0x4a, 0xea, 0x8f, 0xab, 0x91, 0x97, 0x2a }, + { 
0x5b, 0x73, 0xdd, 0xb6, 0x8d, 0x9d, 0x2b, 0x0a, + 0xa2, 0x65, 0xa0, 0x79, 0x88, 0xd6, 0xb8, 0x8a, + 0xe9, 0xaa, 0xc5, 0x82, 0xaf, 0x83, 0x03, 0x2f, + 0x8a, 0x9b, 0x21, 0xa2, 0xe1, 0xb7, 0xbf, 0x18 }, + { 0x3d, 0xa2, 0x91, 0x26, 0xc7, 0xc5, 0xd7, 0xf4, + 0x3e, 0x64, 0x24, 0x2a, 0x79, 0xfe, 0xaa, 0x4e, + 0xf3, 0x45, 0x9c, 0xde, 0xcc, 0xc8, 0x98, 0xed, + 0x59, 0xa9, 0x7f, 0x6e, 0xc9, 0x3b, 0x9d, 0xab }, + { 0x56, 0x6d, 0xc9, 0x20, 0x29, 0x3d, 0xa5, 0xcb, + 0x4f, 0xe0, 0xaa, 0x8a, 0xbd, 0xa8, 0xbb, 0xf5, + 0x6f, 0x55, 0x23, 0x13, 0xbf, 0xf1, 0x90, 0x46, + 0x64, 0x1e, 0x36, 0x15, 0xc1, 0xe3, 0xed, 0x3f }, + { 0x41, 0x15, 0xbe, 0xa0, 0x2f, 0x73, 0xf9, 0x7f, + 0x62, 0x9e, 0x5c, 0x55, 0x90, 0x72, 0x0c, 0x01, + 0xe7, 0xe4, 0x49, 0xae, 0x2a, 0x66, 0x97, 0xd4, + 0xd2, 0x78, 0x33, 0x21, 0x30, 0x36, 0x92, 0xf9 }, + { 0x4c, 0xe0, 0x8f, 0x47, 0x62, 0x46, 0x8a, 0x76, + 0x70, 0x01, 0x21, 0x64, 0x87, 0x8d, 0x68, 0x34, + 0x0c, 0x52, 0xa3, 0x5e, 0x66, 0xc1, 0x88, 0x4d, + 0x5c, 0x86, 0x48, 0x89, 0xab, 0xc9, 0x66, 0x77 }, + { 0x81, 0xea, 0x0b, 0x78, 0x04, 0x12, 0x4e, 0x0c, + 0x22, 0xea, 0x5f, 0xc7, 0x11, 0x04, 0xa2, 0xaf, + 0xcb, 0x52, 0xa1, 0xfa, 0x81, 0x6f, 0x3e, 0xcb, + 0x7d, 0xcb, 0x5d, 0x9d, 0xea, 0x17, 0x86, 0xd0 }, + { 0xfe, 0x36, 0x27, 0x33, 0xb0, 0x5f, 0x6b, 0xed, + 0xaf, 0x93, 0x79, 0xd7, 0xf7, 0x93, 0x6e, 0xde, + 0x20, 0x9b, 0x1f, 0x83, 0x23, 0xc3, 0x92, 0x25, + 0x49, 0xd9, 0xe7, 0x36, 0x81, 0xb5, 0xdb, 0x7b }, + { 0xef, 0xf3, 0x7d, 0x30, 0xdf, 0xd2, 0x03, 0x59, + 0xbe, 0x4e, 0x73, 0xfd, 0xf4, 0x0d, 0x27, 0x73, + 0x4b, 0x3d, 0xf9, 0x0a, 0x97, 0xa5, 0x5e, 0xd7, + 0x45, 0x29, 0x72, 0x94, 0xca, 0x85, 0xd0, 0x9f }, + { 0x17, 0x2f, 0xfc, 0x67, 0x15, 0x3d, 0x12, 0xe0, + 0xca, 0x76, 0xa8, 0xb6, 0xcd, 0x5d, 0x47, 0x31, + 0x88, 0x5b, 0x39, 0xce, 0x0c, 0xac, 0x93, 0xa8, + 0x97, 0x2a, 0x18, 0x00, 0x6c, 0x8b, 0x8b, 0xaf }, + { 0xc4, 0x79, 0x57, 0xf1, 0xcc, 0x88, 0xe8, 0x3e, + 0xf9, 0x44, 0x58, 0x39, 0x70, 0x9a, 0x48, 0x0a, + 0x03, 0x6b, 0xed, 0x5f, 0x88, 0xac, 0x0f, 0xcc, + 0x8e, 0x1e, 0x70, 0x3f, 0xfa, 0xac, 0x13, 0x2c }, + { 0x30, 0xf3, 0x54, 0x83, 0x70, 0xcf, 0xdc, 0xed, + 0xa5, 0xc3, 0x7b, 0x56, 0x9b, 0x61, 0x75, 0xe7, + 0x99, 0xee, 0xf1, 0xa6, 0x2a, 0xaa, 0x94, 0x32, + 0x45, 0xae, 0x76, 0x69, 0xc2, 0x27, 0xa7, 0xb5 }, + { 0xc9, 0x5d, 0xcb, 0x3c, 0xf1, 0xf2, 0x7d, 0x0e, + 0xef, 0x2f, 0x25, 0xd2, 0x41, 0x38, 0x70, 0x90, + 0x4a, 0x87, 0x7c, 0x4a, 0x56, 0xc2, 0xde, 0x1e, + 0x83, 0xe2, 0xbc, 0x2a, 0xe2, 0xe4, 0x68, 0x21 }, + { 0xd5, 0xd0, 0xb5, 0xd7, 0x05, 0x43, 0x4c, 0xd4, + 0x6b, 0x18, 0x57, 0x49, 0xf6, 0x6b, 0xfb, 0x58, + 0x36, 0xdc, 0xdf, 0x6e, 0xe5, 0x49, 0xa2, 0xb7, + 0xa4, 0xae, 0xe7, 0xf5, 0x80, 0x07, 0xca, 0xaf }, + { 0xbb, 0xc1, 0x24, 0xa7, 0x12, 0xf1, 0x5d, 0x07, + 0xc3, 0x00, 0xe0, 0x5b, 0x66, 0x83, 0x89, 0xa4, + 0x39, 0xc9, 0x17, 0x77, 0xf7, 0x21, 0xf8, 0x32, + 0x0c, 0x1c, 0x90, 0x78, 0x06, 0x6d, 0x2c, 0x7e }, + { 0xa4, 0x51, 0xb4, 0x8c, 0x35, 0xa6, 0xc7, 0x85, + 0x4c, 0xfa, 0xae, 0x60, 0x26, 0x2e, 0x76, 0x99, + 0x08, 0x16, 0x38, 0x2a, 0xc0, 0x66, 0x7e, 0x5a, + 0x5c, 0x9e, 0x1b, 0x46, 0xc4, 0x34, 0x2d, 0xdf }, + { 0xb0, 0xd1, 0x50, 0xfb, 0x55, 0xe7, 0x78, 0xd0, + 0x11, 0x47, 0xf0, 0xb5, 0xd8, 0x9d, 0x99, 0xec, + 0xb2, 0x0f, 0xf0, 0x7e, 0x5e, 0x67, 0x60, 0xd6, + 0xb6, 0x45, 0xeb, 0x5b, 0x65, 0x4c, 0x62, 0x2b }, + { 0x34, 0xf7, 0x37, 0xc0, 0xab, 0x21, 0x99, 0x51, + 0xee, 0xe8, 0x9a, 0x9f, 0x8d, 0xac, 0x29, 0x9c, + 0x9d, 0x4c, 0x38, 0xf3, 0x3f, 0xa4, 0x94, 0xc5, + 0xc6, 0xee, 0xfc, 0x92, 0xb6, 0xdb, 0x08, 0xbc }, + { 0x1a, 0x62, 0xcc, 0x3a, 0x00, 0x80, 0x0d, 0xcb, + 0xd9, 0x98, 0x91, 0x08, 0x0c, 0x1e, 
0x09, 0x84, + 0x58, 0x19, 0x3a, 0x8c, 0xc9, 0xf9, 0x70, 0xea, + 0x99, 0xfb, 0xef, 0xf0, 0x03, 0x18, 0xc2, 0x89 }, + { 0xcf, 0xce, 0x55, 0xeb, 0xaf, 0xc8, 0x40, 0xd7, + 0xae, 0x48, 0x28, 0x1c, 0x7f, 0xd5, 0x7e, 0xc8, + 0xb4, 0x82, 0xd4, 0xb7, 0x04, 0x43, 0x74, 0x95, + 0x49, 0x5a, 0xc4, 0x14, 0xcf, 0x4a, 0x37, 0x4b }, + { 0x67, 0x46, 0xfa, 0xcf, 0x71, 0x14, 0x6d, 0x99, + 0x9d, 0xab, 0xd0, 0x5d, 0x09, 0x3a, 0xe5, 0x86, + 0x64, 0x8d, 0x1e, 0xe2, 0x8e, 0x72, 0x61, 0x7b, + 0x99, 0xd0, 0xf0, 0x08, 0x6e, 0x1e, 0x45, 0xbf }, + { 0x57, 0x1c, 0xed, 0x28, 0x3b, 0x3f, 0x23, 0xb4, + 0xe7, 0x50, 0xbf, 0x12, 0xa2, 0xca, 0xf1, 0x78, + 0x18, 0x47, 0xbd, 0x89, 0x0e, 0x43, 0x60, 0x3c, + 0xdc, 0x59, 0x76, 0x10, 0x2b, 0x7b, 0xb1, 0x1b }, + { 0xcf, 0xcb, 0x76, 0x5b, 0x04, 0x8e, 0x35, 0x02, + 0x2c, 0x5d, 0x08, 0x9d, 0x26, 0xe8, 0x5a, 0x36, + 0xb0, 0x05, 0xa2, 0xb8, 0x04, 0x93, 0xd0, 0x3a, + 0x14, 0x4e, 0x09, 0xf4, 0x09, 0xb6, 0xaf, 0xd1 }, + { 0x40, 0x50, 0xc7, 0xa2, 0x77, 0x05, 0xbb, 0x27, + 0xf4, 0x20, 0x89, 0xb2, 0x99, 0xf3, 0xcb, 0xe5, + 0x05, 0x4e, 0xad, 0x68, 0x72, 0x7e, 0x8e, 0xf9, + 0x31, 0x8c, 0xe6, 0xf2, 0x5c, 0xd6, 0xf3, 0x1d }, + { 0x18, 0x40, 0x70, 0xbd, 0x5d, 0x26, 0x5f, 0xbd, + 0xc1, 0x42, 0xcd, 0x1c, 0x5c, 0xd0, 0xd7, 0xe4, + 0x14, 0xe7, 0x03, 0x69, 0xa2, 0x66, 0xd6, 0x27, + 0xc8, 0xfb, 0xa8, 0x4f, 0xa5, 0xe8, 0x4c, 0x34 }, + { 0x9e, 0xdd, 0xa9, 0xa4, 0x44, 0x39, 0x02, 0xa9, + 0x58, 0x8c, 0x0d, 0x0c, 0xcc, 0x62, 0xb9, 0x30, + 0x21, 0x84, 0x79, 0xa6, 0x84, 0x1e, 0x6f, 0xe7, + 0xd4, 0x30, 0x03, 0xf0, 0x4b, 0x1f, 0xd6, 0x43 }, + { 0xe4, 0x12, 0xfe, 0xef, 0x79, 0x08, 0x32, 0x4a, + 0x6d, 0xa1, 0x84, 0x16, 0x29, 0xf3, 0x5d, 0x3d, + 0x35, 0x86, 0x42, 0x01, 0x93, 0x10, 0xec, 0x57, + 0xc6, 0x14, 0x83, 0x6b, 0x63, 0xd3, 0x07, 0x63 }, + { 0x1a, 0x2b, 0x8e, 0xdf, 0xf3, 0xf9, 0xac, 0xc1, + 0x55, 0x4f, 0xcb, 0xae, 0x3c, 0xf1, 0xd6, 0x29, + 0x8c, 0x64, 0x62, 0xe2, 0x2e, 0x5e, 0xb0, 0x25, + 0x96, 0x84, 0xf8, 0x35, 0x01, 0x2b, 0xd1, 0x3f }, + { 0x28, 0x8c, 0x4a, 0xd9, 0xb9, 0x40, 0x97, 0x62, + 0xea, 0x07, 0xc2, 0x4a, 0x41, 0xf0, 0x4f, 0x69, + 0xa7, 0xd7, 0x4b, 0xee, 0x2d, 0x95, 0x43, 0x53, + 0x74, 0xbd, 0xe9, 0x46, 0xd7, 0x24, 0x1c, 0x7b }, + { 0x80, 0x56, 0x91, 0xbb, 0x28, 0x67, 0x48, 0xcf, + 0xb5, 0x91, 0xd3, 0xae, 0xbe, 0x7e, 0x6f, 0x4e, + 0x4d, 0xc6, 0xe2, 0x80, 0x8c, 0x65, 0x14, 0x3c, + 0xc0, 0x04, 0xe4, 0xeb, 0x6f, 0xd0, 0x9d, 0x43 }, + { 0xd4, 0xac, 0x8d, 0x3a, 0x0a, 0xfc, 0x6c, 0xfa, + 0x7b, 0x46, 0x0a, 0xe3, 0x00, 0x1b, 0xae, 0xb3, + 0x6d, 0xad, 0xb3, 0x7d, 0xa0, 0x7d, 0x2e, 0x8a, + 0xc9, 0x18, 0x22, 0xdf, 0x34, 0x8a, 0xed, 0x3d }, + { 0xc3, 0x76, 0x61, 0x70, 0x14, 0xd2, 0x01, 0x58, + 0xbc, 0xed, 0x3d, 0x3b, 0xa5, 0x52, 0xb6, 0xec, + 0xcf, 0x84, 0xe6, 0x2a, 0xa3, 0xeb, 0x65, 0x0e, + 0x90, 0x02, 0x9c, 0x84, 0xd1, 0x3e, 0xea, 0x69 }, + { 0xc4, 0x1f, 0x09, 0xf4, 0x3c, 0xec, 0xae, 0x72, + 0x93, 0xd6, 0x00, 0x7c, 0xa0, 0xa3, 0x57, 0x08, + 0x7d, 0x5a, 0xe5, 0x9b, 0xe5, 0x00, 0xc1, 0xcd, + 0x5b, 0x28, 0x9e, 0xe8, 0x10, 0xc7, 0xb0, 0x82 }, + { 0x03, 0xd1, 0xce, 0xd1, 0xfb, 0xa5, 0xc3, 0x91, + 0x55, 0xc4, 0x4b, 0x77, 0x65, 0xcb, 0x76, 0x0c, + 0x78, 0x70, 0x8d, 0xcf, 0xc8, 0x0b, 0x0b, 0xd8, + 0xad, 0xe3, 0xa5, 0x6d, 0xa8, 0x83, 0x0b, 0x29 }, + { 0x09, 0xbd, 0xe6, 0xf1, 0x52, 0x21, 0x8d, 0xc9, + 0x2c, 0x41, 0xd7, 0xf4, 0x53, 0x87, 0xe6, 0x3e, + 0x58, 0x69, 0xd8, 0x07, 0xec, 0x70, 0xb8, 0x21, + 0x40, 0x5d, 0xbd, 0x88, 0x4b, 0x7f, 0xcf, 0x4b }, + { 0x71, 0xc9, 0x03, 0x6e, 0x18, 0x17, 0x9b, 0x90, + 0xb3, 0x7d, 0x39, 0xe9, 0xf0, 0x5e, 0xb8, 0x9c, + 0xc5, 0xfc, 0x34, 0x1f, 0xd7, 0xc4, 0x77, 0xd0, + 0xd7, 0x49, 0x32, 
0x85, 0xfa, 0xca, 0x08, 0xa4 }, + { 0x59, 0x16, 0x83, 0x3e, 0xbb, 0x05, 0xcd, 0x91, + 0x9c, 0xa7, 0xfe, 0x83, 0xb6, 0x92, 0xd3, 0x20, + 0x5b, 0xef, 0x72, 0x39, 0x2b, 0x2c, 0xf6, 0xbb, + 0x0a, 0x6d, 0x43, 0xf9, 0x94, 0xf9, 0x5f, 0x11 }, + { 0xf6, 0x3a, 0xab, 0x3e, 0xc6, 0x41, 0xb3, 0xb0, + 0x24, 0x96, 0x4c, 0x2b, 0x43, 0x7c, 0x04, 0xf6, + 0x04, 0x3c, 0x4c, 0x7e, 0x02, 0x79, 0x23, 0x99, + 0x95, 0x40, 0x19, 0x58, 0xf8, 0x6b, 0xbe, 0x54 }, + { 0xf1, 0x72, 0xb1, 0x80, 0xbf, 0xb0, 0x97, 0x40, + 0x49, 0x31, 0x20, 0xb6, 0x32, 0x6c, 0xbd, 0xc5, + 0x61, 0xe4, 0x77, 0xde, 0xf9, 0xbb, 0xcf, 0xd2, + 0x8c, 0xc8, 0xc1, 0xc5, 0xe3, 0x37, 0x9a, 0x31 }, + { 0xcb, 0x9b, 0x89, 0xcc, 0x18, 0x38, 0x1d, 0xd9, + 0x14, 0x1a, 0xde, 0x58, 0x86, 0x54, 0xd4, 0xe6, + 0xa2, 0x31, 0xd5, 0xbf, 0x49, 0xd4, 0xd5, 0x9a, + 0xc2, 0x7d, 0x86, 0x9c, 0xbe, 0x10, 0x0c, 0xf3 }, + { 0x7b, 0xd8, 0x81, 0x50, 0x46, 0xfd, 0xd8, 0x10, + 0xa9, 0x23, 0xe1, 0x98, 0x4a, 0xae, 0xbd, 0xcd, + 0xf8, 0x4d, 0x87, 0xc8, 0x99, 0x2d, 0x68, 0xb5, + 0xee, 0xb4, 0x60, 0xf9, 0x3e, 0xb3, 0xc8, 0xd7 }, + { 0x60, 0x7b, 0xe6, 0x68, 0x62, 0xfd, 0x08, 0xee, + 0x5b, 0x19, 0xfa, 0xca, 0xc0, 0x9d, 0xfd, 0xbc, + 0xd4, 0x0c, 0x31, 0x21, 0x01, 0xd6, 0x6e, 0x6e, + 0xbd, 0x2b, 0x84, 0x1f, 0x1b, 0x9a, 0x93, 0x25 }, + { 0x9f, 0xe0, 0x3b, 0xbe, 0x69, 0xab, 0x18, 0x34, + 0xf5, 0x21, 0x9b, 0x0d, 0xa8, 0x8a, 0x08, 0xb3, + 0x0a, 0x66, 0xc5, 0x91, 0x3f, 0x01, 0x51, 0x96, + 0x3c, 0x36, 0x05, 0x60, 0xdb, 0x03, 0x87, 0xb3 }, + { 0x90, 0xa8, 0x35, 0x85, 0x71, 0x7b, 0x75, 0xf0, + 0xe9, 0xb7, 0x25, 0xe0, 0x55, 0xee, 0xee, 0xb9, + 0xe7, 0xa0, 0x28, 0xea, 0x7e, 0x6c, 0xbc, 0x07, + 0xb2, 0x09, 0x17, 0xec, 0x03, 0x63, 0xe3, 0x8c }, + { 0x33, 0x6e, 0xa0, 0x53, 0x0f, 0x4a, 0x74, 0x69, + 0x12, 0x6e, 0x02, 0x18, 0x58, 0x7e, 0xbb, 0xde, + 0x33, 0x58, 0xa0, 0xb3, 0x1c, 0x29, 0xd2, 0x00, + 0xf7, 0xdc, 0x7e, 0xb1, 0x5c, 0x6a, 0xad, 0xd8 }, + { 0xa7, 0x9e, 0x76, 0xdc, 0x0a, 0xbc, 0xa4, 0x39, + 0x6f, 0x07, 0x47, 0xcd, 0x7b, 0x74, 0x8d, 0xf9, + 0x13, 0x00, 0x76, 0x26, 0xb1, 0xd6, 0x59, 0xda, + 0x0c, 0x1f, 0x78, 0xb9, 0x30, 0x3d, 0x01, 0xa3 }, + { 0x44, 0xe7, 0x8a, 0x77, 0x37, 0x56, 0xe0, 0x95, + 0x15, 0x19, 0x50, 0x4d, 0x70, 0x38, 0xd2, 0x8d, + 0x02, 0x13, 0xa3, 0x7e, 0x0c, 0xe3, 0x75, 0x37, + 0x17, 0x57, 0xbc, 0x99, 0x63, 0x11, 0xe3, 0xb8 }, + { 0x77, 0xac, 0x01, 0x2a, 0x3f, 0x75, 0x4d, 0xcf, + 0xea, 0xb5, 0xeb, 0x99, 0x6b, 0xe9, 0xcd, 0x2d, + 0x1f, 0x96, 0x11, 0x1b, 0x6e, 0x49, 0xf3, 0x99, + 0x4d, 0xf1, 0x81, 0xf2, 0x85, 0x69, 0xd8, 0x25 }, + { 0xce, 0x5a, 0x10, 0xdb, 0x6f, 0xcc, 0xda, 0xf1, + 0x40, 0xaa, 0xa4, 0xde, 0xd6, 0x25, 0x0a, 0x9c, + 0x06, 0xe9, 0x22, 0x2b, 0xc9, 0xf9, 0xf3, 0x65, + 0x8a, 0x4a, 0xff, 0x93, 0x5f, 0x2b, 0x9f, 0x3a }, + { 0xec, 0xc2, 0x03, 0xa7, 0xfe, 0x2b, 0xe4, 0xab, + 0xd5, 0x5b, 0xb5, 0x3e, 0x6e, 0x67, 0x35, 0x72, + 0xe0, 0x07, 0x8d, 0xa8, 0xcd, 0x37, 0x5e, 0xf4, + 0x30, 0xcc, 0x97, 0xf9, 0xf8, 0x00, 0x83, 0xaf }, + { 0x14, 0xa5, 0x18, 0x6d, 0xe9, 0xd7, 0xa1, 0x8b, + 0x04, 0x12, 0xb8, 0x56, 0x3e, 0x51, 0xcc, 0x54, + 0x33, 0x84, 0x0b, 0x4a, 0x12, 0x9a, 0x8f, 0xf9, + 0x63, 0xb3, 0x3a, 0x3c, 0x4a, 0xfe, 0x8e, 0xbb }, + { 0x13, 0xf8, 0xef, 0x95, 0xcb, 0x86, 0xe6, 0xa6, + 0x38, 0x93, 0x1c, 0x8e, 0x10, 0x76, 0x73, 0xeb, + 0x76, 0xba, 0x10, 0xd7, 0xc2, 0xcd, 0x70, 0xb9, + 0xd9, 0x92, 0x0b, 0xbe, 0xed, 0x92, 0x94, 0x09 }, + { 0x0b, 0x33, 0x8f, 0x4e, 0xe1, 0x2f, 0x2d, 0xfc, + 0xb7, 0x87, 0x13, 0x37, 0x79, 0x41, 0xe0, 0xb0, + 0x63, 0x21, 0x52, 0x58, 0x1d, 0x13, 0x32, 0x51, + 0x6e, 0x4a, 0x2c, 0xab, 0x19, 0x42, 0xcc, 0xa4 }, + { 0xea, 0xab, 0x0e, 0xc3, 0x7b, 0x3b, 0x8a, 0xb7, + 
0x96, 0xe9, 0xf5, 0x72, 0x38, 0xde, 0x14, 0xa2, + 0x64, 0xa0, 0x76, 0xf3, 0x88, 0x7d, 0x86, 0xe2, + 0x9b, 0xb5, 0x90, 0x6d, 0xb5, 0xa0, 0x0e, 0x02 }, + { 0x23, 0xcb, 0x68, 0xb8, 0xc0, 0xe6, 0xdc, 0x26, + 0xdc, 0x27, 0x76, 0x6d, 0xdc, 0x0a, 0x13, 0xa9, + 0x94, 0x38, 0xfd, 0x55, 0x61, 0x7a, 0xa4, 0x09, + 0x5d, 0x8f, 0x96, 0x97, 0x20, 0xc8, 0x72, 0xdf }, + { 0x09, 0x1d, 0x8e, 0xe3, 0x0d, 0x6f, 0x29, 0x68, + 0xd4, 0x6b, 0x68, 0x7d, 0xd6, 0x52, 0x92, 0x66, + 0x57, 0x42, 0xde, 0x0b, 0xb8, 0x3d, 0xcc, 0x00, + 0x04, 0xc7, 0x2c, 0xe1, 0x00, 0x07, 0xa5, 0x49 }, + { 0x7f, 0x50, 0x7a, 0xbc, 0x6d, 0x19, 0xba, 0x00, + 0xc0, 0x65, 0xa8, 0x76, 0xec, 0x56, 0x57, 0x86, + 0x88, 0x82, 0xd1, 0x8a, 0x22, 0x1b, 0xc4, 0x6c, + 0x7a, 0x69, 0x12, 0x54, 0x1f, 0x5b, 0xc7, 0xba }, + { 0xa0, 0x60, 0x7c, 0x24, 0xe1, 0x4e, 0x8c, 0x22, + 0x3d, 0xb0, 0xd7, 0x0b, 0x4d, 0x30, 0xee, 0x88, + 0x01, 0x4d, 0x60, 0x3f, 0x43, 0x7e, 0x9e, 0x02, + 0xaa, 0x7d, 0xaf, 0xa3, 0xcd, 0xfb, 0xad, 0x94 }, + { 0xdd, 0xbf, 0xea, 0x75, 0xcc, 0x46, 0x78, 0x82, + 0xeb, 0x34, 0x83, 0xce, 0x5e, 0x2e, 0x75, 0x6a, + 0x4f, 0x47, 0x01, 0xb7, 0x6b, 0x44, 0x55, 0x19, + 0xe8, 0x9f, 0x22, 0xd6, 0x0f, 0xa8, 0x6e, 0x06 }, + { 0x0c, 0x31, 0x1f, 0x38, 0xc3, 0x5a, 0x4f, 0xb9, + 0x0d, 0x65, 0x1c, 0x28, 0x9d, 0x48, 0x68, 0x56, + 0xcd, 0x14, 0x13, 0xdf, 0x9b, 0x06, 0x77, 0xf5, + 0x3e, 0xce, 0x2c, 0xd9, 0xe4, 0x77, 0xc6, 0x0a }, + { 0x46, 0xa7, 0x3a, 0x8d, 0xd3, 0xe7, 0x0f, 0x59, + 0xd3, 0x94, 0x2c, 0x01, 0xdf, 0x59, 0x9d, 0xef, + 0x78, 0x3c, 0x9d, 0xa8, 0x2f, 0xd8, 0x32, 0x22, + 0xcd, 0x66, 0x2b, 0x53, 0xdc, 0xe7, 0xdb, 0xdf }, + { 0xad, 0x03, 0x8f, 0xf9, 0xb1, 0x4d, 0xe8, 0x4a, + 0x80, 0x1e, 0x4e, 0x62, 0x1c, 0xe5, 0xdf, 0x02, + 0x9d, 0xd9, 0x35, 0x20, 0xd0, 0xc2, 0xfa, 0x38, + 0xbf, 0xf1, 0x76, 0xa8, 0xb1, 0xd1, 0x69, 0x8c }, + { 0xab, 0x70, 0xc5, 0xdf, 0xbd, 0x1e, 0xa8, 0x17, + 0xfe, 0xd0, 0xcd, 0x06, 0x72, 0x93, 0xab, 0xf3, + 0x19, 0xe5, 0xd7, 0x90, 0x1c, 0x21, 0x41, 0xd5, + 0xd9, 0x9b, 0x23, 0xf0, 0x3a, 0x38, 0xe7, 0x48 }, + { 0x1f, 0xff, 0xda, 0x67, 0x93, 0x2b, 0x73, 0xc8, + 0xec, 0xaf, 0x00, 0x9a, 0x34, 0x91, 0xa0, 0x26, + 0x95, 0x3b, 0xab, 0xfe, 0x1f, 0x66, 0x3b, 0x06, + 0x97, 0xc3, 0xc4, 0xae, 0x8b, 0x2e, 0x7d, 0xcb }, + { 0xb0, 0xd2, 0xcc, 0x19, 0x47, 0x2d, 0xd5, 0x7f, + 0x2b, 0x17, 0xef, 0xc0, 0x3c, 0x8d, 0x58, 0xc2, + 0x28, 0x3d, 0xbb, 0x19, 0xda, 0x57, 0x2f, 0x77, + 0x55, 0x85, 0x5a, 0xa9, 0x79, 0x43, 0x17, 0xa0 }, + { 0xa0, 0xd1, 0x9a, 0x6e, 0xe3, 0x39, 0x79, 0xc3, + 0x25, 0x51, 0x0e, 0x27, 0x66, 0x22, 0xdf, 0x41, + 0xf7, 0x15, 0x83, 0xd0, 0x75, 0x01, 0xb8, 0x70, + 0x71, 0x12, 0x9a, 0x0a, 0xd9, 0x47, 0x32, 0xa5 }, + { 0x72, 0x46, 0x42, 0xa7, 0x03, 0x2d, 0x10, 0x62, + 0xb8, 0x9e, 0x52, 0xbe, 0xa3, 0x4b, 0x75, 0xdf, + 0x7d, 0x8f, 0xe7, 0x72, 0xd9, 0xfe, 0x3c, 0x93, + 0xdd, 0xf3, 0xc4, 0x54, 0x5a, 0xb5, 0xa9, 0x9b }, + { 0xad, 0xe5, 0xea, 0xa7, 0xe6, 0x1f, 0x67, 0x2d, + 0x58, 0x7e, 0xa0, 0x3d, 0xae, 0x7d, 0x7b, 0x55, + 0x22, 0x9c, 0x01, 0xd0, 0x6b, 0xc0, 0xa5, 0x70, + 0x14, 0x36, 0xcb, 0xd1, 0x83, 0x66, 0xa6, 0x26 }, + { 0x01, 0x3b, 0x31, 0xeb, 0xd2, 0x28, 0xfc, 0xdd, + 0xa5, 0x1f, 0xab, 0xb0, 0x3b, 0xb0, 0x2d, 0x60, + 0xac, 0x20, 0xca, 0x21, 0x5a, 0xaf, 0xa8, 0x3b, + 0xdd, 0x85, 0x5e, 0x37, 0x55, 0xa3, 0x5f, 0x0b }, + { 0x33, 0x2e, 0xd4, 0x0b, 0xb1, 0x0d, 0xde, 0x3c, + 0x95, 0x4a, 0x75, 0xd7, 0xb8, 0x99, 0x9d, 0x4b, + 0x26, 0xa1, 0xc0, 0x63, 0xc1, 0xdc, 0x6e, 0x32, + 0xc1, 0xd9, 0x1b, 0xab, 0x7b, 0xbb, 0x7d, 0x16 }, + { 0xc7, 0xa1, 0x97, 0xb3, 0xa0, 0x5b, 0x56, 0x6b, + 0xcc, 0x9f, 0xac, 0xd2, 0x0e, 0x44, 0x1d, 0x6f, + 0x6c, 0x28, 0x60, 0xac, 0x96, 0x51, 
0xcd, 0x51, + 0xd6, 0xb9, 0xd2, 0xcd, 0xee, 0xea, 0x03, 0x90 }, + { 0xbd, 0x9c, 0xf6, 0x4e, 0xa8, 0x95, 0x3c, 0x03, + 0x71, 0x08, 0xe6, 0xf6, 0x54, 0x91, 0x4f, 0x39, + 0x58, 0xb6, 0x8e, 0x29, 0xc1, 0x67, 0x00, 0xdc, + 0x18, 0x4d, 0x94, 0xa2, 0x17, 0x08, 0xff, 0x60 }, + { 0x88, 0x35, 0xb0, 0xac, 0x02, 0x11, 0x51, 0xdf, + 0x71, 0x64, 0x74, 0xce, 0x27, 0xce, 0x4d, 0x3c, + 0x15, 0xf0, 0xb2, 0xda, 0xb4, 0x80, 0x03, 0xcf, + 0x3f, 0x3e, 0xfd, 0x09, 0x45, 0x10, 0x6b, 0x9a }, + { 0x3b, 0xfe, 0xfa, 0x33, 0x01, 0xaa, 0x55, 0xc0, + 0x80, 0x19, 0x0c, 0xff, 0xda, 0x8e, 0xae, 0x51, + 0xd9, 0xaf, 0x48, 0x8b, 0x4c, 0x1f, 0x24, 0xc3, + 0xd9, 0xa7, 0x52, 0x42, 0xfd, 0x8e, 0xa0, 0x1d }, + { 0x08, 0x28, 0x4d, 0x14, 0x99, 0x3c, 0xd4, 0x7d, + 0x53, 0xeb, 0xae, 0xcf, 0x0d, 0xf0, 0x47, 0x8c, + 0xc1, 0x82, 0xc8, 0x9c, 0x00, 0xe1, 0x85, 0x9c, + 0x84, 0x85, 0x16, 0x86, 0xdd, 0xf2, 0xc1, 0xb7 }, + { 0x1e, 0xd7, 0xef, 0x9f, 0x04, 0xc2, 0xac, 0x8d, + 0xb6, 0xa8, 0x64, 0xdb, 0x13, 0x10, 0x87, 0xf2, + 0x70, 0x65, 0x09, 0x8e, 0x69, 0xc3, 0xfe, 0x78, + 0x71, 0x8d, 0x9b, 0x94, 0x7f, 0x4a, 0x39, 0xd0 }, + { 0xc1, 0x61, 0xf2, 0xdc, 0xd5, 0x7e, 0x9c, 0x14, + 0x39, 0xb3, 0x1a, 0x9d, 0xd4, 0x3d, 0x8f, 0x3d, + 0x7d, 0xd8, 0xf0, 0xeb, 0x7c, 0xfa, 0xc6, 0xfb, + 0x25, 0xa0, 0xf2, 0x8e, 0x30, 0x6f, 0x06, 0x61 }, + { 0xc0, 0x19, 0x69, 0xad, 0x34, 0xc5, 0x2c, 0xaf, + 0x3d, 0xc4, 0xd8, 0x0d, 0x19, 0x73, 0x5c, 0x29, + 0x73, 0x1a, 0xc6, 0xe7, 0xa9, 0x20, 0x85, 0xab, + 0x92, 0x50, 0xc4, 0x8d, 0xea, 0x48, 0xa3, 0xfc }, + { 0x17, 0x20, 0xb3, 0x65, 0x56, 0x19, 0xd2, 0xa5, + 0x2b, 0x35, 0x21, 0xae, 0x0e, 0x49, 0xe3, 0x45, + 0xcb, 0x33, 0x89, 0xeb, 0xd6, 0x20, 0x8a, 0xca, + 0xf9, 0xf1, 0x3f, 0xda, 0xcc, 0xa8, 0xbe, 0x49 }, + { 0x75, 0x62, 0x88, 0x36, 0x1c, 0x83, 0xe2, 0x4c, + 0x61, 0x7c, 0xf9, 0x5c, 0x90, 0x5b, 0x22, 0xd0, + 0x17, 0xcd, 0xc8, 0x6f, 0x0b, 0xf1, 0xd6, 0x58, + 0xf4, 0x75, 0x6c, 0x73, 0x79, 0x87, 0x3b, 0x7f }, + { 0xe7, 0xd0, 0xed, 0xa3, 0x45, 0x26, 0x93, 0xb7, + 0x52, 0xab, 0xcd, 0xa1, 0xb5, 0x5e, 0x27, 0x6f, + 0x82, 0x69, 0x8f, 0x5f, 0x16, 0x05, 0x40, 0x3e, + 0xff, 0x83, 0x0b, 0xea, 0x00, 0x71, 0xa3, 0x94 }, + { 0x2c, 0x82, 0xec, 0xaa, 0x6b, 0x84, 0x80, 0x3e, + 0x04, 0x4a, 0xf6, 0x31, 0x18, 0xaf, 0xe5, 0x44, + 0x68, 0x7c, 0xb6, 0xe6, 0xc7, 0xdf, 0x49, 0xed, + 0x76, 0x2d, 0xfd, 0x7c, 0x86, 0x93, 0xa1, 0xbc }, + { 0x61, 0x36, 0xcb, 0xf4, 0xb4, 0x41, 0x05, 0x6f, + 0xa1, 0xe2, 0x72, 0x24, 0x98, 0x12, 0x5d, 0x6d, + 0xed, 0x45, 0xe1, 0x7b, 0x52, 0x14, 0x39, 0x59, + 0xc7, 0xf4, 0xd4, 0xe3, 0x95, 0x21, 0x8a, 0xc2 }, + { 0x72, 0x1d, 0x32, 0x45, 0xaa, 0xfe, 0xf2, 0x7f, + 0x6a, 0x62, 0x4f, 0x47, 0x95, 0x4b, 0x6c, 0x25, + 0x50, 0x79, 0x52, 0x6f, 0xfa, 0x25, 0xe9, 0xff, + 0x77, 0xe5, 0xdc, 0xff, 0x47, 0x3b, 0x15, 0x97 }, + { 0x9d, 0xd2, 0xfb, 0xd8, 0xce, 0xf1, 0x6c, 0x35, + 0x3c, 0x0a, 0xc2, 0x11, 0x91, 0xd5, 0x09, 0xeb, + 0x28, 0xdd, 0x9e, 0x3e, 0x0d, 0x8c, 0xea, 0x5d, + 0x26, 0xca, 0x83, 0x93, 0x93, 0x85, 0x1c, 0x3a }, + { 0xb2, 0x39, 0x4c, 0xea, 0xcd, 0xeb, 0xf2, 0x1b, + 0xf9, 0xdf, 0x2c, 0xed, 0x98, 0xe5, 0x8f, 0x1c, + 0x3a, 0x4b, 0xbb, 0xff, 0x66, 0x0d, 0xd9, 0x00, + 0xf6, 0x22, 0x02, 0xd6, 0x78, 0x5c, 0xc4, 0x6e }, + { 0x57, 0x08, 0x9f, 0x22, 0x27, 0x49, 0xad, 0x78, + 0x71, 0x76, 0x5f, 0x06, 0x2b, 0x11, 0x4f, 0x43, + 0xba, 0x20, 0xec, 0x56, 0x42, 0x2a, 0x8b, 0x1e, + 0x3f, 0x87, 0x19, 0x2c, 0x0e, 0xa7, 0x18, 0xc6 }, + { 0xe4, 0x9a, 0x94, 0x59, 0x96, 0x1c, 0xd3, 0x3c, + 0xdf, 0x4a, 0xae, 0x1b, 0x10, 0x78, 0xa5, 0xde, + 0xa7, 0xc0, 0x40, 0xe0, 0xfe, 0xa3, 0x40, 0xc9, + 0x3a, 0x72, 0x48, 0x72, 0xfc, 0x4a, 0xf8, 0x06 }, + { 0xed, 0xe6, 0x7f, 
0x72, 0x0e, 0xff, 0xd2, 0xca, + 0x9c, 0x88, 0x99, 0x41, 0x52, 0xd0, 0x20, 0x1d, + 0xee, 0x6b, 0x0a, 0x2d, 0x2c, 0x07, 0x7a, 0xca, + 0x6d, 0xae, 0x29, 0xf7, 0x3f, 0x8b, 0x63, 0x09 }, + { 0xe0, 0xf4, 0x34, 0xbf, 0x22, 0xe3, 0x08, 0x80, + 0x39, 0xc2, 0x1f, 0x71, 0x9f, 0xfc, 0x67, 0xf0, + 0xf2, 0xcb, 0x5e, 0x98, 0xa7, 0xa0, 0x19, 0x4c, + 0x76, 0xe9, 0x6b, 0xf4, 0xe8, 0xe1, 0x7e, 0x61 }, + { 0x27, 0x7c, 0x04, 0xe2, 0x85, 0x34, 0x84, 0xa4, + 0xeb, 0xa9, 0x10, 0xad, 0x33, 0x6d, 0x01, 0xb4, + 0x77, 0xb6, 0x7c, 0xc2, 0x00, 0xc5, 0x9f, 0x3c, + 0x8d, 0x77, 0xee, 0xf8, 0x49, 0x4f, 0x29, 0xcd }, + { 0x15, 0x6d, 0x57, 0x47, 0xd0, 0xc9, 0x9c, 0x7f, + 0x27, 0x09, 0x7d, 0x7b, 0x7e, 0x00, 0x2b, 0x2e, + 0x18, 0x5c, 0xb7, 0x2d, 0x8d, 0xd7, 0xeb, 0x42, + 0x4a, 0x03, 0x21, 0x52, 0x81, 0x61, 0x21, 0x9f }, + { 0x20, 0xdd, 0xd1, 0xed, 0x9b, 0x1c, 0xa8, 0x03, + 0x94, 0x6d, 0x64, 0xa8, 0x3a, 0xe4, 0x65, 0x9d, + 0xa6, 0x7f, 0xba, 0x7a, 0x1a, 0x3e, 0xdd, 0xb1, + 0xe1, 0x03, 0xc0, 0xf5, 0xe0, 0x3e, 0x3a, 0x2c }, + { 0xf0, 0xaf, 0x60, 0x4d, 0x3d, 0xab, 0xbf, 0x9a, + 0x0f, 0x2a, 0x7d, 0x3d, 0xda, 0x6b, 0xd3, 0x8b, + 0xba, 0x72, 0xc6, 0xd0, 0x9b, 0xe4, 0x94, 0xfc, + 0xef, 0x71, 0x3f, 0xf1, 0x01, 0x89, 0xb6, 0xe6 }, + { 0x98, 0x02, 0xbb, 0x87, 0xde, 0xf4, 0xcc, 0x10, + 0xc4, 0xa5, 0xfd, 0x49, 0xaa, 0x58, 0xdf, 0xe2, + 0xf3, 0xfd, 0xdb, 0x46, 0xb4, 0x70, 0x88, 0x14, + 0xea, 0xd8, 0x1d, 0x23, 0xba, 0x95, 0x13, 0x9b }, + { 0x4f, 0x8c, 0xe1, 0xe5, 0x1d, 0x2f, 0xe7, 0xf2, + 0x40, 0x43, 0xa9, 0x04, 0xd8, 0x98, 0xeb, 0xfc, + 0x91, 0x97, 0x54, 0x18, 0x75, 0x34, 0x13, 0xaa, + 0x09, 0x9b, 0x79, 0x5e, 0xcb, 0x35, 0xce, 0xdb }, + { 0xbd, 0xdc, 0x65, 0x14, 0xd7, 0xee, 0x6a, 0xce, + 0x0a, 0x4a, 0xc1, 0xd0, 0xe0, 0x68, 0x11, 0x22, + 0x88, 0xcb, 0xcf, 0x56, 0x04, 0x54, 0x64, 0x27, + 0x05, 0x63, 0x01, 0x77, 0xcb, 0xa6, 0x08, 0xbd }, + { 0xd6, 0x35, 0x99, 0x4f, 0x62, 0x91, 0x51, 0x7b, + 0x02, 0x81, 0xff, 0xdd, 0x49, 0x6a, 0xfa, 0x86, + 0x27, 0x12, 0xe5, 0xb3, 0xc4, 0xe5, 0x2e, 0x4c, + 0xd5, 0xfd, 0xae, 0x8c, 0x0e, 0x72, 0xfb, 0x08 }, + { 0x87, 0x8d, 0x9c, 0xa6, 0x00, 0xcf, 0x87, 0xe7, + 0x69, 0xcc, 0x30, 0x5c, 0x1b, 0x35, 0x25, 0x51, + 0x86, 0x61, 0x5a, 0x73, 0xa0, 0xda, 0x61, 0x3b, + 0x5f, 0x1c, 0x98, 0xdb, 0xf8, 0x12, 0x83, 0xea }, + { 0xa6, 0x4e, 0xbe, 0x5d, 0xc1, 0x85, 0xde, 0x9f, + 0xdd, 0xe7, 0x60, 0x7b, 0x69, 0x98, 0x70, 0x2e, + 0xb2, 0x34, 0x56, 0x18, 0x49, 0x57, 0x30, 0x7d, + 0x2f, 0xa7, 0x2e, 0x87, 0xa4, 0x77, 0x02, 0xd6 }, + { 0xce, 0x50, 0xea, 0xb7, 0xb5, 0xeb, 0x52, 0xbd, + 0xc9, 0xad, 0x8e, 0x5a, 0x48, 0x0a, 0xb7, 0x80, + 0xca, 0x93, 0x20, 0xe4, 0x43, 0x60, 0xb1, 0xfe, + 0x37, 0xe0, 0x3f, 0x2f, 0x7a, 0xd7, 0xde, 0x01 }, + { 0xee, 0xdd, 0xb7, 0xc0, 0xdb, 0x6e, 0x30, 0xab, + 0xe6, 0x6d, 0x79, 0xe3, 0x27, 0x51, 0x1e, 0x61, + 0xfc, 0xeb, 0xbc, 0x29, 0xf1, 0x59, 0xb4, 0x0a, + 0x86, 0xb0, 0x46, 0xec, 0xf0, 0x51, 0x38, 0x23 }, + { 0x78, 0x7f, 0xc9, 0x34, 0x40, 0xc1, 0xec, 0x96, + 0xb5, 0xad, 0x01, 0xc1, 0x6c, 0xf7, 0x79, 0x16, + 0xa1, 0x40, 0x5f, 0x94, 0x26, 0x35, 0x6e, 0xc9, + 0x21, 0xd8, 0xdf, 0xf3, 0xea, 0x63, 0xb7, 0xe0 }, + { 0x7f, 0x0d, 0x5e, 0xab, 0x47, 0xee, 0xfd, 0xa6, + 0x96, 0xc0, 0xbf, 0x0f, 0xbf, 0x86, 0xab, 0x21, + 0x6f, 0xce, 0x46, 0x1e, 0x93, 0x03, 0xab, 0xa6, + 0xac, 0x37, 0x41, 0x20, 0xe8, 0x90, 0xe8, 0xdf }, + { 0xb6, 0x80, 0x04, 0xb4, 0x2f, 0x14, 0xad, 0x02, + 0x9f, 0x4c, 0x2e, 0x03, 0xb1, 0xd5, 0xeb, 0x76, + 0xd5, 0x71, 0x60, 0xe2, 0x64, 0x76, 0xd2, 0x11, + 0x31, 0xbe, 0xf2, 0x0a, 0xda, 0x7d, 0x27, 0xf4 }, + { 0xb0, 0xc4, 0xeb, 0x18, 0xae, 0x25, 0x0b, 0x51, + 0xa4, 0x13, 0x82, 0xea, 0xd9, 0x2d, 0x0d, 0xc7, + 
0x45, 0x5f, 0x93, 0x79, 0xfc, 0x98, 0x84, 0x42, + 0x8e, 0x47, 0x70, 0x60, 0x8d, 0xb0, 0xfa, 0xec }, + { 0xf9, 0x2b, 0x7a, 0x87, 0x0c, 0x05, 0x9f, 0x4d, + 0x46, 0x46, 0x4c, 0x82, 0x4e, 0xc9, 0x63, 0x55, + 0x14, 0x0b, 0xdc, 0xe6, 0x81, 0x32, 0x2c, 0xc3, + 0xa9, 0x92, 0xff, 0x10, 0x3e, 0x3f, 0xea, 0x52 }, + { 0x53, 0x64, 0x31, 0x26, 0x14, 0x81, 0x33, 0x98, + 0xcc, 0x52, 0x5d, 0x4c, 0x4e, 0x14, 0x6e, 0xde, + 0xb3, 0x71, 0x26, 0x5f, 0xba, 0x19, 0x13, 0x3a, + 0x2c, 0x3d, 0x21, 0x59, 0x29, 0x8a, 0x17, 0x42 }, + { 0xf6, 0x62, 0x0e, 0x68, 0xd3, 0x7f, 0xb2, 0xaf, + 0x50, 0x00, 0xfc, 0x28, 0xe2, 0x3b, 0x83, 0x22, + 0x97, 0xec, 0xd8, 0xbc, 0xe9, 0x9e, 0x8b, 0xe4, + 0xd0, 0x4e, 0x85, 0x30, 0x9e, 0x3d, 0x33, 0x74 }, + { 0x53, 0x16, 0xa2, 0x79, 0x69, 0xd7, 0xfe, 0x04, + 0xff, 0x27, 0xb2, 0x83, 0x96, 0x1b, 0xff, 0xc3, + 0xbf, 0x5d, 0xfb, 0x32, 0xfb, 0x6a, 0x89, 0xd1, + 0x01, 0xc6, 0xc3, 0xb1, 0x93, 0x7c, 0x28, 0x71 }, + { 0x81, 0xd1, 0x66, 0x4f, 0xdf, 0x3c, 0xb3, 0x3c, + 0x24, 0xee, 0xba, 0xc0, 0xbd, 0x64, 0x24, 0x4b, + 0x77, 0xc4, 0xab, 0xea, 0x90, 0xbb, 0xe8, 0xb5, + 0xee, 0x0b, 0x2a, 0xaf, 0xcf, 0x2d, 0x6a, 0x53 }, + { 0x34, 0x57, 0x82, 0xf2, 0x95, 0xb0, 0x88, 0x03, + 0x52, 0xe9, 0x24, 0xa0, 0x46, 0x7b, 0x5f, 0xbc, + 0x3e, 0x8f, 0x3b, 0xfb, 0xc3, 0xc7, 0xe4, 0x8b, + 0x67, 0x09, 0x1f, 0xb5, 0xe8, 0x0a, 0x94, 0x42 }, + { 0x79, 0x41, 0x11, 0xea, 0x6c, 0xd6, 0x5e, 0x31, + 0x1f, 0x74, 0xee, 0x41, 0xd4, 0x76, 0xcb, 0x63, + 0x2c, 0xe1, 0xe4, 0xb0, 0x51, 0xdc, 0x1d, 0x9e, + 0x9d, 0x06, 0x1a, 0x19, 0xe1, 0xd0, 0xbb, 0x49 }, + { 0x2a, 0x85, 0xda, 0xf6, 0x13, 0x88, 0x16, 0xb9, + 0x9b, 0xf8, 0xd0, 0x8b, 0xa2, 0x11, 0x4b, 0x7a, + 0xb0, 0x79, 0x75, 0xa7, 0x84, 0x20, 0xc1, 0xa3, + 0xb0, 0x6a, 0x77, 0x7c, 0x22, 0xdd, 0x8b, 0xcb }, + { 0x89, 0xb0, 0xd5, 0xf2, 0x89, 0xec, 0x16, 0x40, + 0x1a, 0x06, 0x9a, 0x96, 0x0d, 0x0b, 0x09, 0x3e, + 0x62, 0x5d, 0xa3, 0xcf, 0x41, 0xee, 0x29, 0xb5, + 0x9b, 0x93, 0x0c, 0x58, 0x20, 0x14, 0x54, 0x55 }, + { 0xd0, 0xfd, 0xcb, 0x54, 0x39, 0x43, 0xfc, 0x27, + 0xd2, 0x08, 0x64, 0xf5, 0x21, 0x81, 0x47, 0x1b, + 0x94, 0x2c, 0xc7, 0x7c, 0xa6, 0x75, 0xbc, 0xb3, + 0x0d, 0xf3, 0x1d, 0x35, 0x8e, 0xf7, 0xb1, 0xeb }, + { 0xb1, 0x7e, 0xa8, 0xd7, 0x70, 0x63, 0xc7, 0x09, + 0xd4, 0xdc, 0x6b, 0x87, 0x94, 0x13, 0xc3, 0x43, + 0xe3, 0x79, 0x0e, 0x9e, 0x62, 0xca, 0x85, 0xb7, + 0x90, 0x0b, 0x08, 0x6f, 0x6b, 0x75, 0xc6, 0x72 }, + { 0xe7, 0x1a, 0x3e, 0x2c, 0x27, 0x4d, 0xb8, 0x42, + 0xd9, 0x21, 0x14, 0xf2, 0x17, 0xe2, 0xc0, 0xea, + 0xc8, 0xb4, 0x50, 0x93, 0xfd, 0xfd, 0x9d, 0xf4, + 0xca, 0x71, 0x62, 0x39, 0x48, 0x62, 0xd5, 0x01 }, + { 0xc0, 0x47, 0x67, 0x59, 0xab, 0x7a, 0xa3, 0x33, + 0x23, 0x4f, 0x6b, 0x44, 0xf5, 0xfd, 0x85, 0x83, + 0x90, 0xec, 0x23, 0x69, 0x4c, 0x62, 0x2c, 0xb9, + 0x86, 0xe7, 0x69, 0xc7, 0x8e, 0xdd, 0x73, 0x3e }, + { 0x9a, 0xb8, 0xea, 0xbb, 0x14, 0x16, 0x43, 0x4d, + 0x85, 0x39, 0x13, 0x41, 0xd5, 0x69, 0x93, 0xc5, + 0x54, 0x58, 0x16, 0x7d, 0x44, 0x18, 0xb1, 0x9a, + 0x0f, 0x2a, 0xd8, 0xb7, 0x9a, 0x83, 0xa7, 0x5b }, + { 0x79, 0x92, 0xd0, 0xbb, 0xb1, 0x5e, 0x23, 0x82, + 0x6f, 0x44, 0x3e, 0x00, 0x50, 0x5d, 0x68, 0xd3, + 0xed, 0x73, 0x72, 0x99, 0x5a, 0x5c, 0x3e, 0x49, + 0x86, 0x54, 0x10, 0x2f, 0xbc, 0xd0, 0x96, 0x4e }, + { 0xc0, 0x21, 0xb3, 0x00, 0x85, 0x15, 0x14, 0x35, + 0xdf, 0x33, 0xb0, 0x07, 0xcc, 0xec, 0xc6, 0x9d, + 0xf1, 0x26, 0x9f, 0x39, 0xba, 0x25, 0x09, 0x2b, + 0xed, 0x59, 0xd9, 0x32, 0xac, 0x0f, 0xdc, 0x28 }, + { 0x91, 0xa2, 0x5e, 0xc0, 0xec, 0x0d, 0x9a, 0x56, + 0x7f, 0x89, 0xc4, 0xbf, 0xe1, 0xa6, 0x5a, 0x0e, + 0x43, 0x2d, 0x07, 0x06, 0x4b, 0x41, 0x90, 0xe2, + 0x7d, 0xfb, 0x81, 0x90, 0x1f, 0xd3, 
0x13, 0x9b }, + { 0x59, 0x50, 0xd3, 0x9a, 0x23, 0xe1, 0x54, 0x5f, + 0x30, 0x12, 0x70, 0xaa, 0x1a, 0x12, 0xf2, 0xe6, + 0xc4, 0x53, 0x77, 0x6e, 0x4d, 0x63, 0x55, 0xde, + 0x42, 0x5c, 0xc1, 0x53, 0xf9, 0x81, 0x88, 0x67 }, + { 0xd7, 0x9f, 0x14, 0x72, 0x0c, 0x61, 0x0a, 0xf1, + 0x79, 0xa3, 0x76, 0x5d, 0x4b, 0x7c, 0x09, 0x68, + 0xf9, 0x77, 0x96, 0x2d, 0xbf, 0x65, 0x5b, 0x52, + 0x12, 0x72, 0xb6, 0xf1, 0xe1, 0x94, 0x48, 0x8e }, + { 0xe9, 0x53, 0x1b, 0xfc, 0x8b, 0x02, 0x99, 0x5a, + 0xea, 0xa7, 0x5b, 0xa2, 0x70, 0x31, 0xfa, 0xdb, + 0xcb, 0xf4, 0xa0, 0xda, 0xb8, 0x96, 0x1d, 0x92, + 0x96, 0xcd, 0x7e, 0x84, 0xd2, 0x5d, 0x60, 0x06 }, + { 0x34, 0xe9, 0xc2, 0x6a, 0x01, 0xd7, 0xf1, 0x61, + 0x81, 0xb4, 0x54, 0xa9, 0xd1, 0x62, 0x3c, 0x23, + 0x3c, 0xb9, 0x9d, 0x31, 0xc6, 0x94, 0x65, 0x6e, + 0x94, 0x13, 0xac, 0xa3, 0xe9, 0x18, 0x69, 0x2f }, + { 0xd9, 0xd7, 0x42, 0x2f, 0x43, 0x7b, 0xd4, 0x39, + 0xdd, 0xd4, 0xd8, 0x83, 0xda, 0xe2, 0xa0, 0x83, + 0x50, 0x17, 0x34, 0x14, 0xbe, 0x78, 0x15, 0x51, + 0x33, 0xff, 0xf1, 0x96, 0x4c, 0x3d, 0x79, 0x72 }, + { 0x4a, 0xee, 0x0c, 0x7a, 0xaf, 0x07, 0x54, 0x14, + 0xff, 0x17, 0x93, 0xea, 0xd7, 0xea, 0xca, 0x60, + 0x17, 0x75, 0xc6, 0x15, 0xdb, 0xd6, 0x0b, 0x64, + 0x0b, 0x0a, 0x9f, 0x0c, 0xe5, 0x05, 0xd4, 0x35 }, + { 0x6b, 0xfd, 0xd1, 0x54, 0x59, 0xc8, 0x3b, 0x99, + 0xf0, 0x96, 0xbf, 0xb4, 0x9e, 0xe8, 0x7b, 0x06, + 0x3d, 0x69, 0xc1, 0x97, 0x4c, 0x69, 0x28, 0xac, + 0xfc, 0xfb, 0x40, 0x99, 0xf8, 0xc4, 0xef, 0x67 }, + { 0x9f, 0xd1, 0xc4, 0x08, 0xfd, 0x75, 0xc3, 0x36, + 0x19, 0x3a, 0x2a, 0x14, 0xd9, 0x4f, 0x6a, 0xf5, + 0xad, 0xf0, 0x50, 0xb8, 0x03, 0x87, 0xb4, 0xb0, + 0x10, 0xfb, 0x29, 0xf4, 0xcc, 0x72, 0x70, 0x7c }, + { 0x13, 0xc8, 0x84, 0x80, 0xa5, 0xd0, 0x0d, 0x6c, + 0x8c, 0x7a, 0xd2, 0x11, 0x0d, 0x76, 0xa8, 0x2d, + 0x9b, 0x70, 0xf4, 0xfa, 0x66, 0x96, 0xd4, 0xe5, + 0xdd, 0x42, 0xa0, 0x66, 0xdc, 0xaf, 0x99, 0x20 }, + { 0x82, 0x0e, 0x72, 0x5e, 0xe2, 0x5f, 0xe8, 0xfd, + 0x3a, 0x8d, 0x5a, 0xbe, 0x4c, 0x46, 0xc3, 0xba, + 0x88, 0x9d, 0xe6, 0xfa, 0x91, 0x91, 0xaa, 0x22, + 0xba, 0x67, 0xd5, 0x70, 0x54, 0x21, 0x54, 0x2b }, + { 0x32, 0xd9, 0x3a, 0x0e, 0xb0, 0x2f, 0x42, 0xfb, + 0xbc, 0xaf, 0x2b, 0xad, 0x00, 0x85, 0xb2, 0x82, + 0xe4, 0x60, 0x46, 0xa4, 0xdf, 0x7a, 0xd1, 0x06, + 0x57, 0xc9, 0xd6, 0x47, 0x63, 0x75, 0xb9, 0x3e }, + { 0xad, 0xc5, 0x18, 0x79, 0x05, 0xb1, 0x66, 0x9c, + 0xd8, 0xec, 0x9c, 0x72, 0x1e, 0x19, 0x53, 0x78, + 0x6b, 0x9d, 0x89, 0xa9, 0xba, 0xe3, 0x07, 0x80, + 0xf1, 0xe1, 0xea, 0xb2, 0x4a, 0x00, 0x52, 0x3c }, + { 0xe9, 0x07, 0x56, 0xff, 0x7f, 0x9a, 0xd8, 0x10, + 0xb2, 0x39, 0xa1, 0x0c, 0xed, 0x2c, 0xf9, 0xb2, + 0x28, 0x43, 0x54, 0xc1, 0xf8, 0xc7, 0xe0, 0xac, + 0xcc, 0x24, 0x61, 0xdc, 0x79, 0x6d, 0x6e, 0x89 }, + { 0x12, 0x51, 0xf7, 0x6e, 0x56, 0x97, 0x84, 0x81, + 0x87, 0x53, 0x59, 0x80, 0x1d, 0xb5, 0x89, 0xa0, + 0xb2, 0x2f, 0x86, 0xd8, 0xd6, 0x34, 0xdc, 0x04, + 0x50, 0x6f, 0x32, 0x2e, 0xd7, 0x8f, 0x17, 0xe8 }, + { 0x3a, 0xfa, 0x89, 0x9f, 0xd9, 0x80, 0xe7, 0x3e, + 0xcb, 0x7f, 0x4d, 0x8b, 0x8f, 0x29, 0x1d, 0xc9, + 0xaf, 0x79, 0x6b, 0xc6, 0x5d, 0x27, 0xf9, 0x74, + 0xc6, 0xf1, 0x93, 0xc9, 0x19, 0x1a, 0x09, 0xfd }, + { 0xaa, 0x30, 0x5b, 0xe2, 0x6e, 0x5d, 0xed, 0xdc, + 0x3c, 0x10, 0x10, 0xcb, 0xc2, 0x13, 0xf9, 0x5f, + 0x05, 0x1c, 0x78, 0x5c, 0x5b, 0x43, 0x1e, 0x6a, + 0x7c, 0xd0, 0x48, 0xf1, 0x61, 0x78, 0x75, 0x28 }, + { 0x8e, 0xa1, 0x88, 0x4f, 0xf3, 0x2e, 0x9d, 0x10, + 0xf0, 0x39, 0xb4, 0x07, 0xd0, 0xd4, 0x4e, 0x7e, + 0x67, 0x0a, 0xbd, 0x88, 0x4a, 0xee, 0xe0, 0xfb, + 0x75, 0x7a, 0xe9, 0x4e, 0xaa, 0x97, 0x37, 0x3d }, + { 0xd4, 0x82, 0xb2, 0x15, 0x5d, 0x4d, 0xec, 0x6b, + 0x47, 0x36, 0xa1, 
0xf1, 0x61, 0x7b, 0x53, 0xaa, + 0xa3, 0x73, 0x10, 0x27, 0x7d, 0x3f, 0xef, 0x0c, + 0x37, 0xad, 0x41, 0x76, 0x8f, 0xc2, 0x35, 0xb4 }, + { 0x4d, 0x41, 0x39, 0x71, 0x38, 0x7e, 0x7a, 0x88, + 0x98, 0xa8, 0xdc, 0x2a, 0x27, 0x50, 0x07, 0x78, + 0x53, 0x9e, 0xa2, 0x14, 0xa2, 0xdf, 0xe9, 0xb3, + 0xd7, 0xe8, 0xeb, 0xdc, 0xe5, 0xcf, 0x3d, 0xb3 }, + { 0x69, 0x6e, 0x5d, 0x46, 0xe6, 0xc5, 0x7e, 0x87, + 0x96, 0xe4, 0x73, 0x5d, 0x08, 0x91, 0x6e, 0x0b, + 0x79, 0x29, 0xb3, 0xcf, 0x29, 0x8c, 0x29, 0x6d, + 0x22, 0xe9, 0xd3, 0x01, 0x96, 0x53, 0x37, 0x1c }, + { 0x1f, 0x56, 0x47, 0xc1, 0xd3, 0xb0, 0x88, 0x22, + 0x88, 0x85, 0x86, 0x5c, 0x89, 0x40, 0x90, 0x8b, + 0xf4, 0x0d, 0x1a, 0x82, 0x72, 0x82, 0x19, 0x73, + 0xb1, 0x60, 0x00, 0x8e, 0x7a, 0x3c, 0xe2, 0xeb }, + { 0xb6, 0xe7, 0x6c, 0x33, 0x0f, 0x02, 0x1a, 0x5b, + 0xda, 0x65, 0x87, 0x50, 0x10, 0xb0, 0xed, 0xf0, + 0x91, 0x26, 0xc0, 0xf5, 0x10, 0xea, 0x84, 0x90, + 0x48, 0x19, 0x20, 0x03, 0xae, 0xf4, 0xc6, 0x1c }, + { 0x3c, 0xd9, 0x52, 0xa0, 0xbe, 0xad, 0xa4, 0x1a, + 0xbb, 0x42, 0x4c, 0xe4, 0x7f, 0x94, 0xb4, 0x2b, + 0xe6, 0x4e, 0x1f, 0xfb, 0x0f, 0xd0, 0x78, 0x22, + 0x76, 0x80, 0x79, 0x46, 0xd0, 0xd0, 0xbc, 0x55 }, + { 0x98, 0xd9, 0x26, 0x77, 0x43, 0x9b, 0x41, 0xb7, + 0xbb, 0x51, 0x33, 0x12, 0xaf, 0xb9, 0x2b, 0xcc, + 0x8e, 0xe9, 0x68, 0xb2, 0xe3, 0xb2, 0x38, 0xce, + 0xcb, 0x9b, 0x0f, 0x34, 0xc9, 0xbb, 0x63, 0xd0 }, + { 0xec, 0xbc, 0xa2, 0xcf, 0x08, 0xae, 0x57, 0xd5, + 0x17, 0xad, 0x16, 0x15, 0x8a, 0x32, 0xbf, 0xa7, + 0xdc, 0x03, 0x82, 0xea, 0xed, 0xa1, 0x28, 0xe9, + 0x18, 0x86, 0x73, 0x4c, 0x24, 0xa0, 0xb2, 0x9d }, + { 0x94, 0x2c, 0xc7, 0xc0, 0xb5, 0x2e, 0x2b, 0x16, + 0xa4, 0xb8, 0x9f, 0xa4, 0xfc, 0x7e, 0x0b, 0xf6, + 0x09, 0xe2, 0x9a, 0x08, 0xc1, 0xa8, 0x54, 0x34, + 0x52, 0xb7, 0x7c, 0x7b, 0xfd, 0x11, 0xbb, 0x28 }, + { 0x8a, 0x06, 0x5d, 0x8b, 0x61, 0xa0, 0xdf, 0xfb, + 0x17, 0x0d, 0x56, 0x27, 0x73, 0x5a, 0x76, 0xb0, + 0xe9, 0x50, 0x60, 0x37, 0x80, 0x8c, 0xba, 0x16, + 0xc3, 0x45, 0x00, 0x7c, 0x9f, 0x79, 0xcf, 0x8f }, + { 0x1b, 0x9f, 0xa1, 0x97, 0x14, 0x65, 0x9c, 0x78, + 0xff, 0x41, 0x38, 0x71, 0x84, 0x92, 0x15, 0x36, + 0x10, 0x29, 0xac, 0x80, 0x2b, 0x1c, 0xbc, 0xd5, + 0x4e, 0x40, 0x8b, 0xd8, 0x72, 0x87, 0xf8, 0x1f }, + { 0x8d, 0xab, 0x07, 0x1b, 0xcd, 0x6c, 0x72, 0x92, + 0xa9, 0xef, 0x72, 0x7b, 0x4a, 0xe0, 0xd8, 0x67, + 0x13, 0x30, 0x1d, 0xa8, 0x61, 0x8d, 0x9a, 0x48, + 0xad, 0xce, 0x55, 0xf3, 0x03, 0xa8, 0x69, 0xa1 }, + { 0x82, 0x53, 0xe3, 0xe7, 0xc7, 0xb6, 0x84, 0xb9, + 0xcb, 0x2b, 0xeb, 0x01, 0x4c, 0xe3, 0x30, 0xff, + 0x3d, 0x99, 0xd1, 0x7a, 0xbb, 0xdb, 0xab, 0xe4, + 0xf4, 0xd6, 0x74, 0xde, 0xd5, 0x3f, 0xfc, 0x6b }, + { 0xf1, 0x95, 0xf3, 0x21, 0xe9, 0xe3, 0xd6, 0xbd, + 0x7d, 0x07, 0x45, 0x04, 0xdd, 0x2a, 0xb0, 0xe6, + 0x24, 0x1f, 0x92, 0xe7, 0x84, 0xb1, 0xaa, 0x27, + 0x1f, 0xf6, 0x48, 0xb1, 0xca, 0xb6, 0xd7, 0xf6 }, + { 0x27, 0xe4, 0xcc, 0x72, 0x09, 0x0f, 0x24, 0x12, + 0x66, 0x47, 0x6a, 0x7c, 0x09, 0x49, 0x5f, 0x2d, + 0xb1, 0x53, 0xd5, 0xbc, 0xbd, 0x76, 0x19, 0x03, + 0xef, 0x79, 0x27, 0x5e, 0xc5, 0x6b, 0x2e, 0xd8 }, + { 0x89, 0x9c, 0x24, 0x05, 0x78, 0x8e, 0x25, 0xb9, + 0x9a, 0x18, 0x46, 0x35, 0x5e, 0x64, 0x6d, 0x77, + 0xcf, 0x40, 0x00, 0x83, 0x41, 0x5f, 0x7d, 0xc5, + 0xaf, 0xe6, 0x9d, 0x6e, 0x17, 0xc0, 0x00, 0x23 }, + { 0xa5, 0x9b, 0x78, 0xc4, 0x90, 0x57, 0x44, 0x07, + 0x6b, 0xfe, 0xe8, 0x94, 0xde, 0x70, 0x7d, 0x4f, + 0x12, 0x0b, 0x5c, 0x68, 0x93, 0xea, 0x04, 0x00, + 0x29, 0x7d, 0x0b, 0xb8, 0x34, 0x72, 0x76, 0x32 }, + { 0x59, 0xdc, 0x78, 0xb1, 0x05, 0x64, 0x97, 0x07, + 0xa2, 0xbb, 0x44, 0x19, 0xc4, 0x8f, 0x00, 0x54, + 0x00, 0xd3, 0x97, 0x3d, 0xe3, 0x73, 0x66, 0x10, + 
0x23, 0x04, 0x35, 0xb1, 0x04, 0x24, 0xb2, 0x4f }, + { 0xc0, 0x14, 0x9d, 0x1d, 0x7e, 0x7a, 0x63, 0x53, + 0xa6, 0xd9, 0x06, 0xef, 0xe7, 0x28, 0xf2, 0xf3, + 0x29, 0xfe, 0x14, 0xa4, 0x14, 0x9a, 0x3e, 0xa7, + 0x76, 0x09, 0xbc, 0x42, 0xb9, 0x75, 0xdd, 0xfa }, + { 0xa3, 0x2f, 0x24, 0x14, 0x74, 0xa6, 0xc1, 0x69, + 0x32, 0xe9, 0x24, 0x3b, 0xe0, 0xcf, 0x09, 0xbc, + 0xdc, 0x7e, 0x0c, 0xa0, 0xe7, 0xa6, 0xa1, 0xb9, + 0xb1, 0xa0, 0xf0, 0x1e, 0x41, 0x50, 0x23, 0x77 }, + { 0xb2, 0x39, 0xb2, 0xe4, 0xf8, 0x18, 0x41, 0x36, + 0x1c, 0x13, 0x39, 0xf6, 0x8e, 0x2c, 0x35, 0x9f, + 0x92, 0x9a, 0xf9, 0xad, 0x9f, 0x34, 0xe0, 0x1a, + 0xab, 0x46, 0x31, 0xad, 0x6d, 0x55, 0x00, 0xb0 }, + { 0x85, 0xfb, 0x41, 0x9c, 0x70, 0x02, 0xa3, 0xe0, + 0xb4, 0xb6, 0xea, 0x09, 0x3b, 0x4c, 0x1a, 0xc6, + 0x93, 0x66, 0x45, 0xb6, 0x5d, 0xac, 0x5a, 0xc1, + 0x5a, 0x85, 0x28, 0xb7, 0xb9, 0x4c, 0x17, 0x54 }, + { 0x96, 0x19, 0x72, 0x06, 0x25, 0xf1, 0x90, 0xb9, + 0x3a, 0x3f, 0xad, 0x18, 0x6a, 0xb3, 0x14, 0x18, + 0x96, 0x33, 0xc0, 0xd3, 0xa0, 0x1e, 0x6f, 0x9b, + 0xc8, 0xc4, 0xa8, 0xf8, 0x2f, 0x38, 0x3d, 0xbf }, + { 0x7d, 0x62, 0x0d, 0x90, 0xfe, 0x69, 0xfa, 0x46, + 0x9a, 0x65, 0x38, 0x38, 0x89, 0x70, 0xa1, 0xaa, + 0x09, 0xbb, 0x48, 0xa2, 0xd5, 0x9b, 0x34, 0x7b, + 0x97, 0xe8, 0xce, 0x71, 0xf4, 0x8c, 0x7f, 0x46 }, + { 0x29, 0x43, 0x83, 0x56, 0x85, 0x96, 0xfb, 0x37, + 0xc7, 0x5b, 0xba, 0xcd, 0x97, 0x9c, 0x5f, 0xf6, + 0xf2, 0x0a, 0x55, 0x6b, 0xf8, 0x87, 0x9c, 0xc7, + 0x29, 0x24, 0x85, 0x5d, 0xf9, 0xb8, 0x24, 0x0e }, + { 0x16, 0xb1, 0x8a, 0xb3, 0x14, 0x35, 0x9c, 0x2b, + 0x83, 0x3c, 0x1c, 0x69, 0x86, 0xd4, 0x8c, 0x55, + 0xa9, 0xfc, 0x97, 0xcd, 0xe9, 0xa3, 0xc1, 0xf1, + 0x0a, 0x31, 0x77, 0x14, 0x0f, 0x73, 0xf7, 0x38 }, + { 0x8c, 0xbb, 0xdd, 0x14, 0xbc, 0x33, 0xf0, 0x4c, + 0xf4, 0x58, 0x13, 0xe4, 0xa1, 0x53, 0xa2, 0x73, + 0xd3, 0x6a, 0xda, 0xd5, 0xce, 0x71, 0xf4, 0x99, + 0xee, 0xb8, 0x7f, 0xb8, 0xac, 0x63, 0xb7, 0x29 }, + { 0x69, 0xc9, 0xa4, 0x98, 0xdb, 0x17, 0x4e, 0xca, + 0xef, 0xcc, 0x5a, 0x3a, 0xc9, 0xfd, 0xed, 0xf0, + 0xf8, 0x13, 0xa5, 0xbe, 0xc7, 0x27, 0xf1, 0xe7, + 0x75, 0xba, 0xbd, 0xec, 0x77, 0x18, 0x81, 0x6e }, + { 0xb4, 0x62, 0xc3, 0xbe, 0x40, 0x44, 0x8f, 0x1d, + 0x4f, 0x80, 0x62, 0x62, 0x54, 0xe5, 0x35, 0xb0, + 0x8b, 0xc9, 0xcd, 0xcf, 0xf5, 0x99, 0xa7, 0x68, + 0x57, 0x8d, 0x4b, 0x28, 0x81, 0xa8, 0xe3, 0xf0 }, + { 0x55, 0x3e, 0x9d, 0x9c, 0x5f, 0x36, 0x0a, 0xc0, + 0xb7, 0x4a, 0x7d, 0x44, 0xe5, 0xa3, 0x91, 0xda, + 0xd4, 0xce, 0xd0, 0x3e, 0x0c, 0x24, 0x18, 0x3b, + 0x7e, 0x8e, 0xca, 0xbd, 0xf1, 0x71, 0x5a, 0x64 }, + { 0x7a, 0x7c, 0x55, 0xa5, 0x6f, 0xa9, 0xae, 0x51, + 0xe6, 0x55, 0xe0, 0x19, 0x75, 0xd8, 0xa6, 0xff, + 0x4a, 0xe9, 0xe4, 0xb4, 0x86, 0xfc, 0xbe, 0x4e, + 0xac, 0x04, 0x45, 0x88, 0xf2, 0x45, 0xeb, 0xea }, + { 0x2a, 0xfd, 0xf3, 0xc8, 0x2a, 0xbc, 0x48, 0x67, + 0xf5, 0xde, 0x11, 0x12, 0x86, 0xc2, 0xb3, 0xbe, + 0x7d, 0x6e, 0x48, 0x65, 0x7b, 0xa9, 0x23, 0xcf, + 0xbf, 0x10, 0x1a, 0x6d, 0xfc, 0xf9, 0xdb, 0x9a }, + { 0x41, 0x03, 0x7d, 0x2e, 0xdc, 0xdc, 0xe0, 0xc4, + 0x9b, 0x7f, 0xb4, 0xa6, 0xaa, 0x09, 0x99, 0xca, + 0x66, 0x97, 0x6c, 0x74, 0x83, 0xaf, 0xe6, 0x31, + 0xd4, 0xed, 0xa2, 0x83, 0x14, 0x4f, 0x6d, 0xfc }, + { 0xc4, 0x46, 0x6f, 0x84, 0x97, 0xca, 0x2e, 0xeb, + 0x45, 0x83, 0xa0, 0xb0, 0x8e, 0x9d, 0x9a, 0xc7, + 0x43, 0x95, 0x70, 0x9f, 0xda, 0x10, 0x9d, 0x24, + 0xf2, 0xe4, 0x46, 0x21, 0x96, 0x77, 0x9c, 0x5d }, + { 0x75, 0xf6, 0x09, 0x33, 0x8a, 0xa6, 0x7d, 0x96, + 0x9a, 0x2a, 0xe2, 0xa2, 0x36, 0x2b, 0x2d, 0xa9, + 0xd7, 0x7c, 0x69, 0x5d, 0xfd, 0x1d, 0xf7, 0x22, + 0x4a, 0x69, 0x01, 0xdb, 0x93, 0x2c, 0x33, 0x64 }, + { 0x68, 0x60, 0x6c, 0xeb, 0x98, 
0x9d, 0x54, 0x88,
+	  0xfc, 0x7c, 0xf6, 0x49, 0xf3, 0xd7, 0xc2, 0x72,
+	  0xef, 0x05, 0x5d, 0xa1, 0xa9, 0x3f, 0xae, 0xcd,
+	  0x55, 0xfe, 0x06, 0xf6, 0x96, 0x70, 0x98, 0xca },
+	{ 0x44, 0x34, 0x6b, 0xde, 0xb7, 0xe0, 0x52, 0xf6,
+	  0x25, 0x50, 0x48, 0xf0, 0xd9, 0xb4, 0x2c, 0x42,
+	  0x5b, 0xab, 0x9c, 0x3d, 0xd2, 0x41, 0x68, 0x21,
+	  0x2c, 0x3e, 0xcf, 0x1e, 0xbf, 0x34, 0xe6, 0xae },
+	{ 0x8e, 0x9c, 0xf6, 0xe1, 0xf3, 0x66, 0x47, 0x1f,
+	  0x2a, 0xc7, 0xd2, 0xee, 0x9b, 0x5e, 0x62, 0x66,
+	  0xfd, 0xa7, 0x1f, 0x8f, 0x2e, 0x41, 0x09, 0xf2,
+	  0x23, 0x7e, 0xd5, 0xf8, 0x81, 0x3f, 0xc7, 0x18 },
+	{ 0x84, 0xbb, 0xeb, 0x84, 0x06, 0xd2, 0x50, 0x95,
+	  0x1f, 0x8c, 0x1b, 0x3e, 0x86, 0xa7, 0xc0, 0x10,
+	  0x08, 0x29, 0x21, 0x83, 0x3d, 0xfd, 0x95, 0x55,
+	  0xa2, 0xf9, 0x09, 0xb1, 0x08, 0x6e, 0xb4, 0xb8 },
+	{ 0xee, 0x66, 0x6f, 0x3e, 0xef, 0x0f, 0x7e, 0x2a,
+	  0x9c, 0x22, 0x29, 0x58, 0xc9, 0x7e, 0xaf, 0x35,
+	  0xf5, 0x1c, 0xed, 0x39, 0x3d, 0x71, 0x44, 0x85,
+	  0xab, 0x09, 0xa0, 0x69, 0x34, 0x0f, 0xdf, 0x88 },
+	{ 0xc1, 0x53, 0xd3, 0x4a, 0x65, 0xc4, 0x7b, 0x4a,
+	  0x62, 0xc5, 0xca, 0xcf, 0x24, 0x01, 0x09, 0x75,
+	  0xd0, 0x35, 0x6b, 0x2f, 0x32, 0xc8, 0xf5, 0xda,
+	  0x53, 0x0d, 0x33, 0x88, 0x16, 0xad, 0x5d, 0xe6 },
+	{ 0x9f, 0xc5, 0x45, 0x01, 0x09, 0xe1, 0xb7, 0x79,
+	  0xf6, 0xc7, 0xae, 0x79, 0xd5, 0x6c, 0x27, 0x63,
+	  0x5c, 0x8d, 0xd4, 0x26, 0xc5, 0xa9, 0xd5, 0x4e,
+	  0x25, 0x78, 0xdb, 0x98, 0x9b, 0x8c, 0x3b, 0x4e },
+	{ 0xd1, 0x2b, 0xf3, 0x73, 0x2e, 0xf4, 0xaf, 0x5c,
+	  0x22, 0xfa, 0x90, 0x35, 0x6a, 0xf8, 0xfc, 0x50,
+	  0xfc, 0xb4, 0x0f, 0x8f, 0x2e, 0xa5, 0xc8, 0x59,
+	  0x47, 0x37, 0xa3, 0xb3, 0xd5, 0xab, 0xdb, 0xd7 },
+	{ 0x11, 0x03, 0x0b, 0x92, 0x89, 0xbb, 0xa5, 0xaf,
+	  0x65, 0x26, 0x06, 0x72, 0xab, 0x6f, 0xee, 0x88,
+	  0xb8, 0x74, 0x20, 0xac, 0xef, 0x4a, 0x17, 0x89,
+	  0xa2, 0x07, 0x3b, 0x7e, 0xc2, 0xf2, 0xa0, 0x9e },
+	{ 0x69, 0xcb, 0x19, 0x2b, 0x84, 0x44, 0x00, 0x5c,
+	  0x8c, 0x0c, 0xeb, 0x12, 0xc8, 0x46, 0x86, 0x07,
+	  0x68, 0x18, 0x8c, 0xda, 0x0a, 0xec, 0x27, 0xa9,
+	  0xc8, 0xa5, 0x5c, 0xde, 0xe2, 0x12, 0x36, 0x32 },
+	{ 0xdb, 0x44, 0x4c, 0x15, 0x59, 0x7b, 0x5f, 0x1a,
+	  0x03, 0xd1, 0xf9, 0xed, 0xd1, 0x6e, 0x4a, 0x9f,
+	  0x43, 0xa6, 0x67, 0xcc, 0x27, 0x51, 0x75, 0xdf,
+	  0xa2, 0xb7, 0x04, 0xe3, 0xbb, 0x1a, 0x9b, 0x83 },
+	{ 0x3f, 0xb7, 0x35, 0x06, 0x1a, 0xbc, 0x51, 0x9d,
+	  0xfe, 0x97, 0x9e, 0x54, 0xc1, 0xee, 0x5b, 0xfa,
+	  0xd0, 0xa9, 0xd8, 0x58, 0xb3, 0x31, 0x5b, 0xad,
+	  0x34, 0xbd, 0xe9, 0x99, 0xef, 0xd7, 0x24, 0xdd }
+};
+
+static bool __init blake2s_selftest(void)
+{
+	u8 key[BLAKE2S_KEY_SIZE];
+	u8 buf[ARRAY_SIZE(blake2s_testvecs)];
+	u8 hash[BLAKE2S_HASH_SIZE];
+	size_t i;
+	bool success = true;
+
+	/* The key and the message are the fixed byte patterns 0, 1, 2, ... */
+	for (i = 0; i < BLAKE2S_KEY_SIZE; ++i)
+		key[i] = (u8)i;
+
+	for (i = 0; i < ARRAY_SIZE(blake2s_testvecs); ++i)
+		buf[i] = (u8)i;
+
+	/* Vector i corresponds to hashing the message prefix of length i. */
+	for (i = 0; i < ARRAY_SIZE(blake2s_keyed_testvecs); ++i) {
+		blake2s(hash, buf, key, BLAKE2S_HASH_SIZE, i, BLAKE2S_KEY_SIZE);
+		if (memcmp(hash, blake2s_keyed_testvecs[i], BLAKE2S_HASH_SIZE)) {
+			pr_err("blake2s keyed self-test %zu: FAIL\n", i + 1);
+			success = false;
+		}
+	}
+
+	for (i = 0; i < ARRAY_SIZE(blake2s_testvecs); ++i) {
+		blake2s(hash, buf, NULL, BLAKE2S_HASH_SIZE, i, 0);
+		if (memcmp(hash, blake2s_testvecs[i], BLAKE2S_HASH_SIZE)) {
+			pr_err("blake2s unkeyed self-test %zu: FAIL\n", i + 1);
+			success = false;
+		}
+	}
+	return success;
+}
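
What this hunk does not show is the self-test's caller. In the zinc pattern, each primitive runs its vectors once at initialization and declines to register itself on failure; a minimal sketch of such a caller follows — the function name, the use of module_init, and the -ENODEV error code are illustrative assumptions here, not the series' actual init plumbing:

/* Hypothetical sketch: gate initialization on the self-test result.
 * The real zinc wiring lives elsewhere in this series; this assumes
 * an ordinary <linux/module.h> context. */
static int __init blake2s_mod_init(void)
{
	if (WARN_ON(!blake2s_selftest()))
		return -ENODEV;	/* refuse to expose a broken primitive */
	return 0;
}
module_init(blake2s_mod_init);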
Donenfeld" X-Patchwork-Id: 148316 Delivered-To: patch@linaro.org Received: by 2002:a2e:8595:0:0:0:0:0 with SMTP id b21-v6csp1157999lji; Fri, 5 Oct 2018 19:58:51 -0700 (PDT) X-Google-Smtp-Source: ACcGV60LpMZWtK08fqQThqaoP+SwB61NyLOPf/DvKGEtVmhPnWmpwYAvLRfRN8wWqW0CHcePgVhx X-Received: by 2002:a63:5f03:: with SMTP id t3-v6mr12732678pgb.68.1538794731701; Fri, 05 Oct 2018 19:58:51 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1538794731; cv=none; d=google.com; s=arc-20160816; b=ckzDAuZRT0HLEddSvlGXFd1+w4pdh2ebJLlrHkUek5qwAO79hFVP3akdBKnD1pq4CI Tk6i4b4YW0lGFMXA9n2yjgZNbPqNU76W0HqrNGHNmjgC/mjIZ0QU6PArFermbmmw0tJY +S57dg7iCD/SZQuCZxhXTwR2vQWxP+ZJuEzAThM0rymu/u+3ft4MNZnzliyxPuVcjrcN uTfJjs+Yta0KKGMZ3IJIH2FdLhS1pkHqbxEx1TgDcSRArE8DJKBnHjnQ6mkUGbeo7hG6 jscuBFgMCKmvGOG82dFEV6DW7iQTZcswH00AVBFO1zH7OsjGIe9/eELa89YNr9hijFMl uB2A== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from :dkim-signature; bh=Pm8cNuNFk5ASRZiWcJ0QM4MH2T0VjdNrihH6p8LWvdc=; b=pI28oNypviA/llrO8uHu7Nn2vyU93N0aFTOHVKwOrKxZeviufN9oKvwPhv1zdugElS p9FO6z3uJAzfOGbH2EfmTSaiUEKSj+xtzrhuVF1D4ym8bZZ1WC3Amo/gnL3WHyFsNTKW LU9p/BxOgKwUwmKAE6YfmOt/V7XAaocObvVg/y0dvAnQfWLzCrHhOu05GlgZHUr2ruLw 8NbanGpHt821Ci7OvNoKsTv8E4Le1rdJQIpIgCCJN3QzFf+W1W0nb4VXETJ9c1AgzDXe /zUVUa5xfW3RFtsQKLm0boLqwKfFE0F0ZHgFlFE/AV7CrA4AnKV1Akf6AUonlqrTdKLM wTVg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@zx2c4.com header.s=mail header.b=u99zfx6Z; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=zx2c4.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. 
[209.132.180.67]) by mx.google.com with ESMTP id l136-v6si11700745pfd.132.2018.10.05.19.58.51; Fri, 05 Oct 2018 19:58:51 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@zx2c4.com header.s=mail header.b=u99zfx6Z; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=zx2c4.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729906AbeJFKAU (ORCPT + 32 others); Sat, 6 Oct 2018 06:00:20 -0400 Received: from frisell.zx2c4.com ([192.95.5.64]:60175 "EHLO frisell.zx2c4.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729834AbeJFKAT (ORCPT ); Sat, 6 Oct 2018 06:00:19 -0400 Received: by frisell.zx2c4.com (ZX2C4 Mail Server) with ESMTP id a1884ef4; Sat, 6 Oct 2018 02:58:07 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=zx2c4.com; h=from:to:cc :subject:date:message-id:in-reply-to:references:mime-version :content-type:content-transfer-encoding; s=mail; bh=gVoJwH4VTDY/ yc+qDEhVUt8vCeM=; b=u99zfx6ZQWsi3vg0IGS0vR7zpS/IDUakWI5XiptRuRW1 OSDOjR0TckEqnxv47jmcXXoVuIWO1CZchj5kABHWfT4icchhM3zcOTwH8tcARv26 IK+V16UrE+Q7O+1pfQiuq53dt8j9KLmnb6MFesDmPIiqQIHQSX0LIPX0x2xtq025 U/rpBkBkNwncXuNSZGKuxFosuHelwAwHfl0kJCHSVPpYvv48/28KtMPJVLyvqoTC 2HierRmhaP+hBTDpa26EE65Bn9At8BNtEFVAGuSUPxh2TwJFiwiY8T+ATCJI6Ahp 9VySPXNSfI/YdSSHeZl8sInEcV2UM5NCOUW77fbHwg== Received: by frisell.zx2c4.com (ZX2C4 Mail Server) with ESMTPSA id c4c34fdb (TLSv1.2:ECDHE-RSA-AES256-GCM-SHA384:256:NO); Sat, 6 Oct 2018 02:58:05 +0000 (UTC) From: "Jason A. Donenfeld" To: linux-kernel@vger.kernel.org, netdev@vger.kernel.org, davem@davemloft.net, gregkh@linuxfoundation.org Cc: "Jason A. Donenfeld" , Samuel Neves , =?utf-8?q?Armando_Faz-Hern=C3=A1ndez?= , Thomas Gleixner , Ingo Molnar , x86@kernel.org, Jean-Philippe Aumasson , Andy Lutomirski , Andrew Morton , Linus Torvalds , kernel-hardening@lists.openwall.com, linux-crypto@vger.kernel.org Subject: [PATCH net-next v7 22/28] zinc: Curve25519 x86_64 implementation Date: Sat, 6 Oct 2018 04:57:03 +0200 Message-Id: <20181006025709.4019-23-Jason@zx2c4.com> In-Reply-To: <20181006025709.4019-1-Jason@zx2c4.com> References: <20181006025709.4019-1-Jason@zx2c4.com> MIME-Version: 1.0 Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org This implementation is the fastest available x86_64 implementation, and unlike Sandy2x, it doesn't requie use of the floating point registers at all. Instead it makes use of BMI2 and ADX, available on recent microarchitectures. The implementation was written by Armando Faz-Hernández with contributions (upstream) from Samuel Neves and me, in addition to further changes in the kernel implementation from us. Signed-off-by: Jason A. 
Signed-off-by: Jason A. Donenfeld
Signed-off-by: Samuel Neves
Cc: Armando Faz-Hernández
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: x86@kernel.org
Cc: Samuel Neves
Cc: Jean-Philippe Aumasson
Cc: Andy Lutomirski
Cc: Greg KH
Cc: Andrew Morton
Cc: Linus Torvalds
Cc: kernel-hardening@lists.openwall.com
Cc: linux-crypto@vger.kernel.org
---
 lib/zinc/curve25519/curve25519-x86_64-glue.c |   48 +
 lib/zinc/curve25519/curve25519-x86_64.h      | 2333 ++++++++++++++++++
 lib/zinc/curve25519/curve25519.c             |    4 +
 3 files changed, 2385 insertions(+)
 create mode 100644 lib/zinc/curve25519/curve25519-x86_64-glue.c
 create mode 100644 lib/zinc/curve25519/curve25519-x86_64.h
--
2.19.0

diff --git a/lib/zinc/curve25519/curve25519-x86_64-glue.c b/lib/zinc/curve25519/curve25519-x86_64-glue.c
new file mode 100644
index 000000000000..a0e35bb41683
--- /dev/null
+++ b/lib/zinc/curve25519/curve25519-x86_64-glue.c
@@ -0,0 +1,48 @@
+// SPDX-License-Identifier: GPL-2.0 OR MIT
+/*
+ * Copyright (C) 2015-2018 Jason A. Donenfeld. All Rights Reserved.
+ */
+
+#include <asm/cpufeature.h>
+#include <asm/processor.h>
+
+#include "curve25519-x86_64.h"
+
+static bool curve25519_use_bmi2 __ro_after_init;
+static bool curve25519_use_adx __ro_after_init;
+static bool *const curve25519_nobs[] __initconst = {
+	&curve25519_use_bmi2, &curve25519_use_adx };
+
+static void __init curve25519_fpu_init(void)
+{
+	curve25519_use_bmi2 = boot_cpu_has(X86_FEATURE_BMI2);
+	curve25519_use_adx = boot_cpu_has(X86_FEATURE_BMI2) &&
+			     boot_cpu_has(X86_FEATURE_ADX);
+}
+
+static inline bool curve25519_arch(u8 mypublic[CURVE25519_KEY_SIZE],
+				   const u8 secret[CURVE25519_KEY_SIZE],
+				   const u8 basepoint[CURVE25519_KEY_SIZE])
+{
+	if (curve25519_use_adx) {
+		curve25519_adx(mypublic, secret, basepoint);
+		return true;
+	} else if (curve25519_use_bmi2) {
+		curve25519_bmi2(mypublic, secret, basepoint);
+		return true;
+	}
+	return false;
+}
+
+static inline bool curve25519_base_arch(u8 pub[CURVE25519_KEY_SIZE],
+					const u8 secret[CURVE25519_KEY_SIZE])
+{
+	if (curve25519_use_adx) {
+		curve25519_adx_base(pub, secret);
+		return true;
+	} else if (curve25519_use_bmi2) {
+		curve25519_bmi2_base(pub, secret);
+		return true;
+	}
+	return false;
+}
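[Editor's note: the header that follows implements field arithmetic in
GF(2^255 - 19) on four 64-bit limbs. The recurring literal 38 in its
reduction routines comes from the identity 2^256 ≡ 38 (mod 2^255 - 19),
since 2^256 = 2*(2^255 - 19) + 38. Below is a clarity-only portable sketch
of the weak reduction that red_eltfp25519_1w_adx/_bmi2 perform with
mulx/adcx/adox carry chains; red_sketch is a name invented for this note,
and it relies on the GCC/Clang unsigned __int128 extension.]

	/*
	 * Sketch only, not from the patch: fold an eight-limb 512-bit
	 * product t back to four limbs, using 2^256 == 38 (mod 2^255-19).
	 * The result is < 2^256 and congruent mod p ("weak" reduction);
	 * a separate final reduction (cf. fred_eltfp25519_1w) makes it
	 * canonical.
	 */
	static void red_sketch(u64 c[4], const u64 t[8])
	{
		unsigned __int128 acc = 0;
		u64 carry;
		int i;

		/* c = t_lo + 38 * t_hi, with a running carry chain. */
		for (i = 0; i < 4; ++i) {
			acc += (unsigned __int128)38 * t[4 + i] + t[i];
			c[i] = (u64)acc;
			acc >>= 64;
		}
		/* Bits above 2^256 are worth 38 each; fold into c[0]. */
		acc = (unsigned __int128)(u64)acc * 38 + c[0];
		c[0] = (u64)acc;
		carry = (u64)(acc >> 64);
		for (i = 1; i < 4; ++i) {
			acc = (unsigned __int128)c[i] + carry;
			c[i] = (u64)acc;
			carry = (u64)(acc >> 64);
		}
		/* A last carry-out is again worth 38; it cannot cascade,
		 * because a carry implies c[0] wrapped to a tiny value. */
		c[0] += 38 * carry;
	}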
diff --git a/lib/zinc/curve25519/curve25519-x86_64.h b/lib/zinc/curve25519/curve25519-x86_64.h
new file mode 100644
index 000000000000..258a30dbe66c
--- /dev/null
+++ b/lib/zinc/curve25519/curve25519-x86_64.h
@@ -0,0 +1,2333 @@
+/* SPDX-License-Identifier: GPL-2.0 OR LGPL-2.1 */
+/*
+ * Copyright (c) 2017 Armando Faz. All Rights Reserved.
+ * Copyright (C) 2018 Jason A. Donenfeld. All Rights Reserved.
+ * Copyright (C) 2018 Samuel Neves. All Rights Reserved.
+ */
+
+enum { NUM_WORDS_ELTFP25519 = 4 };
+typedef __aligned(32) u64 eltfp25519_1w[NUM_WORDS_ELTFP25519];
+typedef __aligned(32) u64 eltfp25519_1w_buffer[2 * NUM_WORDS_ELTFP25519];
+
+#define mul_eltfp25519_1w_adx(c, a, b) do { \
+	mul_256x256_integer_adx(m.buffer, a, b); \
+	red_eltfp25519_1w_adx(c, m.buffer); \
+} while (0)
+
+#define mul_eltfp25519_1w_bmi2(c, a, b) do { \
+	mul_256x256_integer_bmi2(m.buffer, a, b); \
+	red_eltfp25519_1w_bmi2(c, m.buffer); \
+} while (0)
+
+#define sqr_eltfp25519_1w_adx(a) do { \
+	sqr_256x256_integer_adx(m.buffer, a); \
+	red_eltfp25519_1w_adx(a, m.buffer); \
+} while (0)
+
+#define sqr_eltfp25519_1w_bmi2(a) do { \
+	sqr_256x256_integer_bmi2(m.buffer, a); \
+	red_eltfp25519_1w_bmi2(a, m.buffer); \
+} while (0)
+
+#define mul_eltfp25519_2w_adx(c, a, b) do { \
+	mul2_256x256_integer_adx(m.buffer, a, b); \
+	red_eltfp25519_2w_adx(c, m.buffer); \
+} while (0)
+
+#define mul_eltfp25519_2w_bmi2(c, a, b) do { \
+	mul2_256x256_integer_bmi2(m.buffer, a, b); \
+	red_eltfp25519_2w_bmi2(c, m.buffer); \
+} while (0)
+
+#define sqr_eltfp25519_2w_adx(a) do { \
+	sqr2_256x256_integer_adx(m.buffer, a); \
+	red_eltfp25519_2w_adx(a, m.buffer); \
+} while (0)
+
+#define sqr_eltfp25519_2w_bmi2(a) do { \
+	sqr2_256x256_integer_bmi2(m.buffer, a); \
+	red_eltfp25519_2w_bmi2(a, m.buffer); \
+} while (0)
+
+#define sqrn_eltfp25519_1w_adx(a, times) do { \
+	int ____counter = (times); \
+	while (____counter-- > 0) \
+		sqr_eltfp25519_1w_adx(a); \
+} while (0)
+
+#define sqrn_eltfp25519_1w_bmi2(a, times) do { \
+	int ____counter = (times); \
+	while (____counter-- > 0) \
+		sqr_eltfp25519_1w_bmi2(a); \
+} while (0)
+
+#define copy_eltfp25519_1w(C, A) do { \
+	(C)[0] = (A)[0]; \
+	(C)[1] = (A)[1]; \
+	(C)[2] = (A)[2]; \
+	(C)[3] = (A)[3]; \
+} while (0)
+
+#define setzero_eltfp25519_1w(C) do { \
+	(C)[0] = 0; \
+	(C)[1] = 0; \
+	(C)[2] = 0; \
+	(C)[3] = 0; \
+} while (0)
+
+__aligned(32) static const u64 table_ladder_8k[252 * NUM_WORDS_ELTFP25519] = {
+	/*  1 */ 0xfffffffffffffff3UL, 0xffffffffffffffffUL,
+		 0xffffffffffffffffUL, 0x5fffffffffffffffUL,
+	/*  2 */ 0x6b8220f416aafe96UL, 0x82ebeb2b4f566a34UL,
+		 0xd5a9a5b075a5950fUL, 0x5142b2cf4b2488f4UL,
+	/*  3 */ 0x6aaebc750069680cUL, 0x89cf7820a0f99c41UL,
+		 0x2a58d9183b56d0f4UL, 0x4b5aca80e36011a4UL,
+	/*  4 */ 0x329132348c29745dUL, 0xf4a2e616e1642fd7UL,
+		 0x1e45bb03ff67bc34UL, 0x306912d0f42a9b4aUL,
+	/*  5 */ 0xff886507e6af7154UL, 0x04f50e13dfeec82fUL,
+		 0xaa512fe82abab5ceUL, 0x174e251a68d5f222UL,
+	/*  6 */ 0xcf96700d82028898UL, 0x1743e3370a2c02c5UL,
+		 0x379eec98b4e86eaaUL, 0x0c59888a51e0482eUL,
+	/*  7 */ 0xfbcbf1d699b5d189UL, 0xacaef0d58e9fdc84UL,
+		 0xc1c20d06231f7614UL, 0x2938218da274f972UL,
+	/*  8 */ 0xf6af49beff1d7f18UL, 0xcc541c22387ac9c2UL,
+		 0x96fcc9ef4015c56bUL, 0x69c1627c690913a9UL,
+	/*  9 */ 0x7a86fd2f4733db0eUL, 0xfdb8c4f29e087de9UL,
+		 0x095e4b1a8ea2a229UL, 0x1ad7a7c829b37a79UL,
+	/* 10 */ 0x342d89cad17ea0c0UL, 0x67bedda6cced2051UL,
+		 0x19ca31bf2bb42f74UL, 0x3df7b4c84980acbbUL,
+	/* 11 */ 0xa8c6444dc80ad883UL, 0xb91e440366e3ab85UL,
+		 0xc215cda00164f6d8UL, 0x3d867c6ef247e668UL,
+	/* 12 */ 0xc7dd582bcc3e658cUL, 0xfd2c4748ee0e5528UL,
+		 0xa0fd9b95cc9f4f71UL, 0x7529d871b0675ddfUL,
+	/* 13 */ 0xb8f568b42d3cbd78UL, 0x1233011b91f3da82UL,
+		 0x2dce6ccd4a7c3b62UL, 0x75e7fc8e9e498603UL,
+	/* 14 */ 0x2f4f13f1fcd0b6ecUL, 0xf1a8ca1f29ff7a45UL,
+		 0xc249c1a72981e29bUL, 0x6ebe0dbb8c83b56aUL,
+	/* 15 */ 0x7114fa8d170bb222UL, 0x65a2dcd5bf93935fUL,
+		 0xbdc41f68b59c979aUL, 0x2f0eef79a2ce9289UL,
+	/* 16 */ 0x42ecbf0c083c37ceUL, 0x2930bc09ec496322UL,
+
0xf294b0c19cfeac0dUL, 0x3780aa4bedfabb80UL, + /* 17 */ 0x56c17d3e7cead929UL, 0xe7cb4beb2e5722c5UL, + 0x0ce931732dbfe15aUL, 0x41b883c7621052f8UL, + /* 18 */ 0xdbf75ca0c3d25350UL, 0x2936be086eb1e351UL, + 0xc936e03cb4a9b212UL, 0x1d45bf82322225aaUL, + /* 19 */ 0xe81ab1036a024cc5UL, 0xe212201c304c9a72UL, + 0xc5d73fba6832b1fcUL, 0x20ffdb5a4d839581UL, + /* 20 */ 0xa283d367be5d0fadUL, 0x6c2b25ca8b164475UL, + 0x9d4935467caaf22eUL, 0x5166408eee85ff49UL, + /* 21 */ 0x3c67baa2fab4e361UL, 0xb3e433c67ef35cefUL, + 0x5259729241159b1cUL, 0x6a621892d5b0ab33UL, + /* 22 */ 0x20b74a387555cdcbUL, 0x532aa10e1208923fUL, + 0xeaa17b7762281dd1UL, 0x61ab3443f05c44bfUL, + /* 23 */ 0x257a6c422324def8UL, 0x131c6c1017e3cf7fUL, + 0x23758739f630a257UL, 0x295a407a01a78580UL, + /* 24 */ 0xf8c443246d5da8d9UL, 0x19d775450c52fa5dUL, + 0x2afcfc92731bf83dUL, 0x7d10c8e81b2b4700UL, + /* 25 */ 0xc8e0271f70baa20bUL, 0x993748867ca63957UL, + 0x5412efb3cb7ed4bbUL, 0x3196d36173e62975UL, + /* 26 */ 0xde5bcad141c7dffcUL, 0x47cc8cd2b395c848UL, + 0xa34cd942e11af3cbUL, 0x0256dbf2d04ecec2UL, + /* 27 */ 0x875ab7e94b0e667fUL, 0xcad4dd83c0850d10UL, + 0x47f12e8f4e72c79fUL, 0x5f1a87bb8c85b19bUL, + /* 28 */ 0x7ae9d0b6437f51b8UL, 0x12c7ce5518879065UL, + 0x2ade09fe5cf77aeeUL, 0x23a05a2f7d2c5627UL, + /* 29 */ 0x5908e128f17c169aUL, 0xf77498dd8ad0852dUL, + 0x74b4c4ceab102f64UL, 0x183abadd10139845UL, + /* 30 */ 0xb165ba8daa92aaacUL, 0xd5c5ef9599386705UL, + 0xbe2f8f0cf8fc40d1UL, 0x2701e635ee204514UL, + /* 31 */ 0x629fa80020156514UL, 0xf223868764a8c1ceUL, + 0x5b894fff0b3f060eUL, 0x60d9944cf708a3faUL, + /* 32 */ 0xaeea001a1c7a201fUL, 0xebf16a633ee2ce63UL, + 0x6f7709594c7a07e1UL, 0x79b958150d0208cbUL, + /* 33 */ 0x24b55e5301d410e7UL, 0xe3a34edff3fdc84dUL, + 0xd88768e4904032d8UL, 0x131384427b3aaeecUL, + /* 34 */ 0x8405e51286234f14UL, 0x14dc4739adb4c529UL, + 0xb8a2b5b250634ffdUL, 0x2fe2a94ad8a7ff93UL, + /* 35 */ 0xec5c57efe843faddUL, 0x2843ce40f0bb9918UL, + 0xa4b561d6cf3d6305UL, 0x743629bde8fb777eUL, + /* 36 */ 0x343edd46bbaf738fUL, 0xed981828b101a651UL, + 0xa401760b882c797aUL, 0x1fc223e28dc88730UL, + /* 37 */ 0x48604e91fc0fba0eUL, 0xb637f78f052c6fa4UL, + 0x91ccac3d09e9239cUL, 0x23f7eed4437a687cUL, + /* 38 */ 0x5173b1118d9bd800UL, 0x29d641b63189d4a7UL, + 0xfdbf177988bbc586UL, 0x2959894fcad81df5UL, + /* 39 */ 0xaebc8ef3b4bbc899UL, 0x4148995ab26992b9UL, + 0x24e20b0134f92cfbUL, 0x40d158894a05dee8UL, + /* 40 */ 0x46b00b1185af76f6UL, 0x26bac77873187a79UL, + 0x3dc0bf95ab8fff5fUL, 0x2a608bd8945524d7UL, + /* 41 */ 0x26449588bd446302UL, 0x7c4bc21c0388439cUL, + 0x8e98a4f383bd11b2UL, 0x26218d7bc9d876b9UL, + /* 42 */ 0xe3081542997c178aUL, 0x3c2d29a86fb6606fUL, + 0x5c217736fa279374UL, 0x7dde05734afeb1faUL, + /* 43 */ 0x3bf10e3906d42babUL, 0xe4f7803e1980649cUL, + 0xe6053bf89595bf7aUL, 0x394faf38da245530UL, + /* 44 */ 0x7a8efb58896928f4UL, 0xfbc778e9cc6a113cUL, + 0x72670ce330af596fUL, 0x48f222a81d3d6cf7UL, + /* 45 */ 0xf01fce410d72caa7UL, 0x5a20ecc7213b5595UL, + 0x7bc21165c1fa1483UL, 0x07f89ae31da8a741UL, + /* 46 */ 0x05d2c2b4c6830ff9UL, 0xd43e330fc6316293UL, + 0xa5a5590a96d3a904UL, 0x705edb91a65333b6UL, + /* 47 */ 0x048ee15e0bb9a5f7UL, 0x3240cfca9e0aaf5dUL, + 0x8f4b71ceedc4a40bUL, 0x621c0da3de544a6dUL, + /* 48 */ 0x92872836a08c4091UL, 0xce8375b010c91445UL, + 0x8a72eb524f276394UL, 0x2667fcfa7ec83635UL, + /* 49 */ 0x7f4c173345e8752aUL, 0x061b47feee7079a5UL, + 0x25dd9afa9f86ff34UL, 0x3780cef5425dc89cUL, + /* 50 */ 0x1a46035a513bb4e9UL, 0x3e1ef379ac575adaUL, + 0xc78c5f1c5fa24b50UL, 0x321a967634fd9f22UL, + /* 51 */ 0x946707b8826e27faUL, 0x3dca84d64c506fd0UL, + 
0xc189218075e91436UL, 0x6d9284169b3b8484UL, + /* 52 */ 0x3a67e840383f2ddfUL, 0x33eec9a30c4f9b75UL, + 0x3ec7c86fa783ef47UL, 0x26ec449fbac9fbc4UL, + /* 53 */ 0x5c0f38cba09b9e7dUL, 0x81168cc762a3478cUL, + 0x3e23b0d306fc121cUL, 0x5a238aa0a5efdcddUL, + /* 54 */ 0x1ba26121c4ea43ffUL, 0x36f8c77f7c8832b5UL, + 0x88fbea0b0adcf99aUL, 0x5ca9938ec25bebf9UL, + /* 55 */ 0xd5436a5e51fccda0UL, 0x1dbc4797c2cd893bUL, + 0x19346a65d3224a08UL, 0x0f5034e49b9af466UL, + /* 56 */ 0xf23c3967a1e0b96eUL, 0xe58b08fa867a4d88UL, + 0xfb2fabc6a7341679UL, 0x2a75381eb6026946UL, + /* 57 */ 0xc80a3be4c19420acUL, 0x66b1f6c681f2b6dcUL, + 0x7cf7036761e93388UL, 0x25abbbd8a660a4c4UL, + /* 58 */ 0x91ea12ba14fd5198UL, 0x684950fc4a3cffa9UL, + 0xf826842130f5ad28UL, 0x3ea988f75301a441UL, + /* 59 */ 0xc978109a695f8c6fUL, 0x1746eb4a0530c3f3UL, + 0x444d6d77b4459995UL, 0x75952b8c054e5cc7UL, + /* 60 */ 0xa3703f7915f4d6aaUL, 0x66c346202f2647d8UL, + 0xd01469df811d644bUL, 0x77fea47d81a5d71fUL, + /* 61 */ 0xc5e9529ef57ca381UL, 0x6eeeb4b9ce2f881aUL, + 0xb6e91a28e8009bd6UL, 0x4b80be3e9afc3fecUL, + /* 62 */ 0x7e3773c526aed2c5UL, 0x1b4afcb453c9a49dUL, + 0xa920bdd7baffb24dUL, 0x7c54699f122d400eUL, + /* 63 */ 0xef46c8e14fa94bc8UL, 0xe0b074ce2952ed5eUL, + 0xbea450e1dbd885d5UL, 0x61b68649320f712cUL, + /* 64 */ 0x8a485f7309ccbdd1UL, 0xbd06320d7d4d1a2dUL, + 0x25232973322dbef4UL, 0x445dc4758c17f770UL, + /* 65 */ 0xdb0434177cc8933cUL, 0xed6fe82175ea059fUL, + 0x1efebefdc053db34UL, 0x4adbe867c65daf99UL, + /* 66 */ 0x3acd71a2a90609dfUL, 0xe5e991856dd04050UL, + 0x1ec69b688157c23cUL, 0x697427f6885cfe4dUL, + /* 67 */ 0xd7be7b9b65e1a851UL, 0xa03d28d522c536ddUL, + 0x28399d658fd2b645UL, 0x49e5b7e17c2641e1UL, + /* 68 */ 0x6f8c3a98700457a4UL, 0x5078f0a25ebb6778UL, + 0xd13c3ccbc382960fUL, 0x2e003258a7df84b1UL, + /* 69 */ 0x8ad1f39be6296a1cUL, 0xc1eeaa652a5fbfb2UL, + 0x33ee0673fd26f3cbUL, 0x59256173a69d2cccUL, + /* 70 */ 0x41ea07aa4e18fc41UL, 0xd9fc19527c87a51eUL, + 0xbdaacb805831ca6fUL, 0x445b652dc916694fUL, + /* 71 */ 0xce92a3a7f2172315UL, 0x1edc282de11b9964UL, + 0xa1823aafe04c314aUL, 0x790a2d94437cf586UL, + /* 72 */ 0x71c447fb93f6e009UL, 0x8922a56722845276UL, + 0xbf70903b204f5169UL, 0x2f7a89891ba319feUL, + /* 73 */ 0x02a08eb577e2140cUL, 0xed9a4ed4427bdcf4UL, + 0x5253ec44e4323cd1UL, 0x3e88363c14e9355bUL, + /* 74 */ 0xaa66c14277110b8cUL, 0x1ae0391610a23390UL, + 0x2030bd12c93fc2a2UL, 0x3ee141579555c7abUL, + /* 75 */ 0x9214de3a6d6e7d41UL, 0x3ccdd88607f17efeUL, + 0x674f1288f8e11217UL, 0x5682250f329f93d0UL, + /* 76 */ 0x6cf00b136d2e396eUL, 0x6e4cf86f1014debfUL, + 0x5930b1b5bfcc4e83UL, 0x047069b48aba16b6UL, + /* 77 */ 0x0d4ce4ab69b20793UL, 0xb24db91a97d0fb9eUL, + 0xcdfa50f54e00d01dUL, 0x221b1085368bddb5UL, + /* 78 */ 0xe7e59468b1e3d8d2UL, 0x53c56563bd122f93UL, + 0xeee8a903e0663f09UL, 0x61efa662cbbe3d42UL, + /* 79 */ 0x2cf8ddddde6eab2aUL, 0x9bf80ad51435f231UL, + 0x5deadacec9f04973UL, 0x29275b5d41d29b27UL, + /* 80 */ 0xcfde0f0895ebf14fUL, 0xb9aab96b054905a7UL, + 0xcae80dd9a1c420fdUL, 0x0a63bf2f1673bbc7UL, + /* 81 */ 0x092f6e11958fbc8cUL, 0x672a81e804822fadUL, + 0xcac8351560d52517UL, 0x6f3f7722c8f192f8UL, + /* 82 */ 0xf8ba90ccc2e894b7UL, 0x2c7557a438ff9f0dUL, + 0x894d1d855ae52359UL, 0x68e122157b743d69UL, + /* 83 */ 0xd87e5570cfb919f3UL, 0x3f2cdecd95798db9UL, + 0x2121154710c0a2ceUL, 0x3c66a115246dc5b2UL, + /* 84 */ 0xcbedc562294ecb72UL, 0xba7143c36a280b16UL, + 0x9610c2efd4078b67UL, 0x6144735d946a4b1eUL, + /* 85 */ 0x536f111ed75b3350UL, 0x0211db8c2041d81bUL, + 0xf93cb1000e10413cUL, 0x149dfd3c039e8876UL, + /* 86 */ 0xd479dde46b63155bUL, 0xb66e15e93c837976UL, + 
0xdafde43b1f13e038UL, 0x5fafda1a2e4b0b35UL, + /* 87 */ 0x3600bbdf17197581UL, 0x3972050bbe3cd2c2UL, + 0x5938906dbdd5be86UL, 0x34fce5e43f9b860fUL, + /* 88 */ 0x75a8a4cd42d14d02UL, 0x828dabc53441df65UL, + 0x33dcabedd2e131d3UL, 0x3ebad76fb814d25fUL, + /* 89 */ 0xd4906f566f70e10fUL, 0x5d12f7aa51690f5aUL, + 0x45adb16e76cefcf2UL, 0x01f768aead232999UL, + /* 90 */ 0x2b6cc77b6248febdUL, 0x3cd30628ec3aaffdUL, + 0xce1c0b80d4ef486aUL, 0x4c3bff2ea6f66c23UL, + /* 91 */ 0x3f2ec4094aeaeb5fUL, 0x61b19b286e372ca7UL, + 0x5eefa966de2a701dUL, 0x23b20565de55e3efUL, + /* 92 */ 0xe301ca5279d58557UL, 0x07b2d4ce27c2874fUL, + 0xa532cd8a9dcf1d67UL, 0x2a52fee23f2bff56UL, + /* 93 */ 0x8624efb37cd8663dUL, 0xbbc7ac20ffbd7594UL, + 0x57b85e9c82d37445UL, 0x7b3052cb86a6ec66UL, + /* 94 */ 0x3482f0ad2525e91eUL, 0x2cb68043d28edca0UL, + 0xaf4f6d052e1b003aUL, 0x185f8c2529781b0aUL, + /* 95 */ 0xaa41de5bd80ce0d6UL, 0x9407b2416853e9d6UL, + 0x563ec36e357f4c3aUL, 0x4cc4b8dd0e297bceUL, + /* 96 */ 0xa2fc1a52ffb8730eUL, 0x1811f16e67058e37UL, + 0x10f9a366cddf4ee1UL, 0x72f4a0c4a0b9f099UL, + /* 97 */ 0x8c16c06f663f4ea7UL, 0x693b3af74e970fbaUL, + 0x2102e7f1d69ec345UL, 0x0ba53cbc968a8089UL, + /* 98 */ 0xca3d9dc7fea15537UL, 0x4c6824bb51536493UL, + 0xb9886314844006b1UL, 0x40d2a72ab454cc60UL, + /* 99 */ 0x5936a1b712570975UL, 0x91b9d648debda657UL, + 0x3344094bb64330eaUL, 0x006ba10d12ee51d0UL, + /* 100 */ 0x19228468f5de5d58UL, 0x0eb12f4c38cc05b0UL, + 0xa1039f9dd5601990UL, 0x4502d4ce4fff0e0bUL, + /* 101 */ 0xeb2054106837c189UL, 0xd0f6544c6dd3b93cUL, + 0x40727064c416d74fUL, 0x6e15c6114b502ef0UL, + /* 102 */ 0x4df2a398cfb1a76bUL, 0x11256c7419f2f6b1UL, + 0x4a497962066e6043UL, 0x705b3aab41355b44UL, + /* 103 */ 0x365ef536d797b1d8UL, 0x00076bd622ddf0dbUL, + 0x3bbf33b0e0575a88UL, 0x3777aa05c8e4ca4dUL, + /* 104 */ 0x392745c85578db5fUL, 0x6fda4149dbae5ae2UL, + 0xb1f0b00b8adc9867UL, 0x09963437d36f1da3UL, + /* 105 */ 0x7e824e90a5dc3853UL, 0xccb5f6641f135cbdUL, + 0x6736d86c87ce8fccUL, 0x625f3ce26604249fUL, + /* 106 */ 0xaf8ac8059502f63fUL, 0x0c05e70a2e351469UL, + 0x35292e9c764b6305UL, 0x1a394360c7e23ac3UL, + /* 107 */ 0xd5c6d53251183264UL, 0x62065abd43c2b74fUL, + 0xb5fbf5d03b973f9bUL, 0x13a3da3661206e5eUL, + /* 108 */ 0xc6bd5837725d94e5UL, 0x18e30912205016c5UL, + 0x2088ce1570033c68UL, 0x7fba1f495c837987UL, + /* 109 */ 0x5a8c7423f2f9079dUL, 0x1735157b34023fc5UL, + 0xe4f9b49ad2fab351UL, 0x6691ff72c878e33cUL, + /* 110 */ 0x122c2adedc5eff3eUL, 0xf8dd4bf1d8956cf4UL, + 0xeb86205d9e9e5bdaUL, 0x049b92b9d975c743UL, + /* 111 */ 0xa5379730b0f6c05aUL, 0x72a0ffacc6f3a553UL, + 0xb0032c34b20dcd6dUL, 0x470e9dbc88d5164aUL, + /* 112 */ 0xb19cf10ca237c047UL, 0xb65466711f6c81a2UL, + 0xb3321bd16dd80b43UL, 0x48c14f600c5fbe8eUL, + /* 113 */ 0x66451c264aa6c803UL, 0xb66e3904a4fa7da6UL, + 0xd45f19b0b3128395UL, 0x31602627c3c9bc10UL, + /* 114 */ 0x3120dc4832e4e10dUL, 0xeb20c46756c717f7UL, + 0x00f52e3f67280294UL, 0x566d4fc14730c509UL, + /* 115 */ 0x7e3a5d40fd837206UL, 0xc1e926dc7159547aUL, + 0x216730fba68d6095UL, 0x22e8c3843f69cea7UL, + /* 116 */ 0x33d074e8930e4b2bUL, 0xb6e4350e84d15816UL, + 0x5534c26ad6ba2365UL, 0x7773c12f89f1f3f3UL, + /* 117 */ 0x8cba404da57962aaUL, 0x5b9897a81999ce56UL, + 0x508e862f121692fcUL, 0x3a81907fa093c291UL, + /* 118 */ 0x0dded0ff4725a510UL, 0x10d8cc10673fc503UL, + 0x5b9d151c9f1f4e89UL, 0x32a5c1d5cb09a44cUL, + /* 119 */ 0x1e0aa442b90541fbUL, 0x5f85eb7cc1b485dbUL, + 0xbee595ce8a9df2e5UL, 0x25e496c722422236UL, + /* 120 */ 0x5edf3c46cd0fe5b9UL, 0x34e75a7ed2a43388UL, + 0xe488de11d761e352UL, 0x0e878a01a085545cUL, + /* 121 */ 0xba493c77e021bb04UL, 0x2b4d1843c7df899aUL, 
+ 0x9ea37a487ae80d67UL, 0x67a9958011e41794UL, + /* 122 */ 0x4b58051a6697b065UL, 0x47e33f7d8d6ba6d4UL, + 0xbb4da8d483ca46c1UL, 0x68becaa181c2db0dUL, + /* 123 */ 0x8d8980e90b989aa5UL, 0xf95eb14a2c93c99bUL, + 0x51c6c7c4796e73a2UL, 0x6e228363b5efb569UL, + /* 124 */ 0xc6bbc0b02dd624c8UL, 0x777eb47dec8170eeUL, + 0x3cde15a004cfafa9UL, 0x1dc6bc087160bf9bUL, + /* 125 */ 0x2e07e043eec34002UL, 0x18e9fc677a68dc7fUL, + 0xd8da03188bd15b9aUL, 0x48fbc3bb00568253UL, + /* 126 */ 0x57547d4cfb654ce1UL, 0xd3565b82a058e2adUL, + 0xf63eaf0bbf154478UL, 0x47531ef114dfbb18UL, + /* 127 */ 0xe1ec630a4278c587UL, 0x5507d546ca8e83f3UL, + 0x85e135c63adc0c2bUL, 0x0aa7efa85682844eUL, + /* 128 */ 0x72691ba8b3e1f615UL, 0x32b4e9701fbe3ffaUL, + 0x97b6d92e39bb7868UL, 0x2cfe53dea02e39e8UL, + /* 129 */ 0x687392cd85cd52b0UL, 0x27ff66c910e29831UL, + 0x97134556a9832d06UL, 0x269bb0360a84f8a0UL, + /* 130 */ 0x706e55457643f85cUL, 0x3734a48c9b597d1bUL, + 0x7aee91e8c6efa472UL, 0x5cd6abc198a9d9e0UL, + /* 131 */ 0x0e04de06cb3ce41aUL, 0xd8c6eb893402e138UL, + 0x904659bb686e3772UL, 0x7215c371746ba8c8UL, + /* 132 */ 0xfd12a97eeae4a2d9UL, 0x9514b7516394f2c5UL, + 0x266fd5809208f294UL, 0x5c847085619a26b9UL, + /* 133 */ 0x52985410fed694eaUL, 0x3c905b934a2ed254UL, + 0x10bb47692d3be467UL, 0x063b3d2d69e5e9e1UL, + /* 134 */ 0x472726eedda57debUL, 0xefb6c4ae10f41891UL, + 0x2b1641917b307614UL, 0x117c554fc4f45b7cUL, + /* 135 */ 0xc07cf3118f9d8812UL, 0x01dbd82050017939UL, + 0xd7e803f4171b2827UL, 0x1015e87487d225eaUL, + /* 136 */ 0xc58de3fed23acc4dUL, 0x50db91c294a7be2dUL, + 0x0b94d43d1c9cf457UL, 0x6b1640fa6e37524aUL, + /* 137 */ 0x692f346c5fda0d09UL, 0x200b1c59fa4d3151UL, + 0xb8c46f760777a296UL, 0x4b38395f3ffdfbcfUL, + /* 138 */ 0x18d25e00be54d671UL, 0x60d50582bec8aba6UL, + 0x87ad8f263b78b982UL, 0x50fdf64e9cda0432UL, + /* 139 */ 0x90f567aac578dcf0UL, 0xef1e9b0ef2a3133bUL, + 0x0eebba9242d9de71UL, 0x15473c9bf03101c7UL, + /* 140 */ 0x7c77e8ae56b78095UL, 0xb678e7666e6f078eUL, + 0x2da0b9615348ba1fUL, 0x7cf931c1ff733f0bUL, + /* 141 */ 0x26b357f50a0a366cUL, 0xe9708cf42b87d732UL, + 0xc13aeea5f91cb2c0UL, 0x35d90c991143bb4cUL, + /* 142 */ 0x47c1c404a9a0d9dcUL, 0x659e58451972d251UL, + 0x3875a8c473b38c31UL, 0x1fbd9ed379561f24UL, + /* 143 */ 0x11fabc6fd41ec28dUL, 0x7ef8dfe3cd2a2dcaUL, + 0x72e73b5d8c404595UL, 0x6135fa4954b72f27UL, + /* 144 */ 0xccfc32a2de24b69cUL, 0x3f55698c1f095d88UL, + 0xbe3350ed5ac3f929UL, 0x5e9bf806ca477eebUL, + /* 145 */ 0xe9ce8fb63c309f68UL, 0x5376f63565e1f9f4UL, + 0xd1afcfb35a6393f1UL, 0x6632a1ede5623506UL, + /* 146 */ 0x0b7d6c390c2ded4cUL, 0x56cb3281df04cb1fUL, + 0x66305a1249ecc3c7UL, 0x5d588b60a38ca72aUL, + /* 147 */ 0xa6ecbf78e8e5f42dUL, 0x86eeb44b3c8a3eecUL, + 0xec219c48fbd21604UL, 0x1aaf1af517c36731UL, + /* 148 */ 0xc306a2836769bde7UL, 0x208280622b1e2adbUL, + 0x8027f51ffbff94a6UL, 0x76cfa1ce1124f26bUL, + /* 149 */ 0x18eb00562422abb6UL, 0xf377c4d58f8c29c3UL, + 0x4dbbc207f531561aUL, 0x0253b7f082128a27UL, + /* 150 */ 0x3d1f091cb62c17e0UL, 0x4860e1abd64628a9UL, + 0x52d17436309d4253UL, 0x356f97e13efae576UL, + /* 151 */ 0xd351e11aa150535bUL, 0x3e6b45bb1dd878ccUL, + 0x0c776128bed92c98UL, 0x1d34ae93032885b8UL, + /* 152 */ 0x4ba0488ca85ba4c3UL, 0x985348c33c9ce6ceUL, + 0x66124c6f97bda770UL, 0x0f81a0290654124aUL, + /* 153 */ 0x9ed09ca6569b86fdUL, 0x811009fd18af9a2dUL, + 0xff08d03f93d8c20aUL, 0x52a148199faef26bUL, + /* 154 */ 0x3e03f9dc2d8d1b73UL, 0x4205801873961a70UL, + 0xc0d987f041a35970UL, 0x07aa1f15a1c0d549UL, + /* 155 */ 0xdfd46ce08cd27224UL, 0x6d0a024f934e4239UL, + 0x808a7a6399897b59UL, 0x0a4556e9e13d95a2UL, + /* 156 */ 0xd21a991fe9c13045UL, 
0x9b0e8548fe7751b8UL, + 0x5da643cb4bf30035UL, 0x77db28d63940f721UL, + /* 157 */ 0xfc5eeb614adc9011UL, 0x5229419ae8c411ebUL, + 0x9ec3e7787d1dcf74UL, 0x340d053e216e4cb5UL, + /* 158 */ 0xcac7af39b48df2b4UL, 0xc0faec2871a10a94UL, + 0x140a69245ca575edUL, 0x0cf1c37134273a4cUL, + /* 159 */ 0xc8ee306ac224b8a5UL, 0x57eaee7ccb4930b0UL, + 0xa1e806bdaacbe74fUL, 0x7d9a62742eeb657dUL, + /* 160 */ 0x9eb6b6ef546c4830UL, 0x885cca1fddb36e2eUL, + 0xe6b9f383ef0d7105UL, 0x58654fef9d2e0412UL, + /* 161 */ 0xa905c4ffbe0e8e26UL, 0x942de5df9b31816eUL, + 0x497d723f802e88e1UL, 0x30684dea602f408dUL, + /* 162 */ 0x21e5a278a3e6cb34UL, 0xaefb6e6f5b151dc4UL, + 0xb30b8e049d77ca15UL, 0x28c3c9cf53b98981UL, + /* 163 */ 0x287fb721556cdd2aUL, 0x0d317ca897022274UL, + 0x7468c7423a543258UL, 0x4a7f11464eb5642fUL, + /* 164 */ 0xa237a4774d193aa6UL, 0xd865986ea92129a1UL, + 0x24c515ecf87c1a88UL, 0x604003575f39f5ebUL, + /* 165 */ 0x47b9f189570a9b27UL, 0x2b98cede465e4b78UL, + 0x026df551dbb85c20UL, 0x74fcd91047e21901UL, + /* 166 */ 0x13e2a90a23c1bfa3UL, 0x0cb0074e478519f6UL, + 0x5ff1cbbe3af6cf44UL, 0x67fe5438be812dbeUL, + /* 167 */ 0xd13cf64fa40f05b0UL, 0x054dfb2f32283787UL, + 0x4173915b7f0d2aeaUL, 0x482f144f1f610d4eUL, + /* 168 */ 0xf6210201b47f8234UL, 0x5d0ae1929e70b990UL, + 0xdcd7f455b049567cUL, 0x7e93d0f1f0916f01UL, + /* 169 */ 0xdd79cbf18a7db4faUL, 0xbe8391bf6f74c62fUL, + 0x027145d14b8291bdUL, 0x585a73ea2cbf1705UL, + /* 170 */ 0x485ca03e928a0db2UL, 0x10fc01a5742857e7UL, + 0x2f482edbd6d551a7UL, 0x0f0433b5048fdb8aUL, + /* 171 */ 0x60da2e8dd7dc6247UL, 0x88b4c9d38cd4819aUL, + 0x13033ac001f66697UL, 0x273b24fe3b367d75UL, + /* 172 */ 0xc6e8f66a31b3b9d4UL, 0x281514a494df49d5UL, + 0xd1726fdfc8b23da7UL, 0x4b3ae7d103dee548UL, + /* 173 */ 0xc6256e19ce4b9d7eUL, 0xff5c5cf186e3c61cUL, + 0xacc63ca34b8ec145UL, 0x74621888fee66574UL, + /* 174 */ 0x956f409645290a1eUL, 0xef0bf8e3263a962eUL, + 0xed6a50eb5ec2647bUL, 0x0694283a9dca7502UL, + /* 175 */ 0x769b963643a2dcd1UL, 0x42b7c8ea09fc5353UL, + 0x4f002aee13397eabUL, 0x63005e2c19b7d63aUL, + /* 176 */ 0xca6736da63023beaUL, 0x966c7f6db12a99b7UL, + 0xace09390c537c5e1UL, 0x0b696063a1aa89eeUL, + /* 177 */ 0xebb03e97288c56e5UL, 0x432a9f9f938c8be8UL, + 0xa6a5a93d5b717f71UL, 0x1a5fb4c3e18f9d97UL, + /* 178 */ 0x1c94e7ad1c60cdceUL, 0xee202a43fc02c4a0UL, + 0x8dafe4d867c46a20UL, 0x0a10263c8ac27b58UL, + /* 179 */ 0xd0dea9dfe4432a4aUL, 0x856af87bbe9277c5UL, + 0xce8472acc212c71aUL, 0x6f151b6d9bbb1e91UL, + /* 180 */ 0x26776c527ceed56aUL, 0x7d211cb7fbf8faecUL, + 0x37ae66a6fd4609ccUL, 0x1f81b702d2770c42UL, + /* 181 */ 0x2fb0b057eac58392UL, 0xe1dd89fe29744e9dUL, + 0xc964f8eb17beb4f8UL, 0x29571073c9a2d41eUL, + /* 182 */ 0xa948a18981c0e254UL, 0x2df6369b65b22830UL, + 0xa33eb2d75fcfd3c6UL, 0x078cd6ec4199a01fUL, + /* 183 */ 0x4a584a41ad900d2fUL, 0x32142b78e2c74c52UL, + 0x68c4e8338431c978UL, 0x7f69ea9008689fc2UL, + /* 184 */ 0x52f2c81e46a38265UL, 0xfd78072d04a832fdUL, + 0x8cd7d5fa25359e94UL, 0x4de71b7454cc29d2UL, + /* 185 */ 0x42eb60ad1eda6ac9UL, 0x0aad37dfdbc09c3aUL, + 0x81004b71e33cc191UL, 0x44e6be345122803cUL, + /* 186 */ 0x03fe8388ba1920dbUL, 0xf5d57c32150db008UL, + 0x49c8c4281af60c29UL, 0x21edb518de701aeeUL, + /* 187 */ 0x7fb63e418f06dc99UL, 0xa4460d99c166d7b8UL, + 0x24dd5248ce520a83UL, 0x5ec3ad712b928358UL, + /* 188 */ 0x15022a5fbd17930fUL, 0xa4f64a77d82570e3UL, + 0x12bc8d6915783712UL, 0x498194c0fc620abbUL, + /* 189 */ 0x38a2d9d255686c82UL, 0x785c6bd9193e21f0UL, + 0xe4d5c81ab24a5484UL, 0x56307860b2e20989UL, + /* 190 */ 0x429d55f78b4d74c4UL, 0x22f1834643350131UL, + 0x1e60c24598c71fffUL, 0x59f2f014979983efUL, + /* 191 */ 
0x46a47d56eb494a44UL, 0x3e22a854d636a18eUL, + 0xb346e15274491c3bUL, 0x2ceafd4e5390cde7UL, + /* 192 */ 0xba8a8538be0d6675UL, 0x4b9074bb50818e23UL, + 0xcbdab89085d304c3UL, 0x61a24fe0e56192c4UL, + /* 193 */ 0xcb7615e6db525bcbUL, 0xdd7d8c35a567e4caUL, + 0xe6b4153acafcdd69UL, 0x2d668e097f3c9766UL, + /* 194 */ 0xa57e7e265ce55ef0UL, 0x5d9f4e527cd4b967UL, + 0xfbc83606492fd1e5UL, 0x090d52beb7c3f7aeUL, + /* 195 */ 0x09b9515a1e7b4d7cUL, 0x1f266a2599da44c0UL, + 0xa1c49548e2c55504UL, 0x7ef04287126f15ccUL, + /* 196 */ 0xfed1659dbd30ef15UL, 0x8b4ab9eec4e0277bUL, + 0x884d6236a5df3291UL, 0x1fd96ea6bf5cf788UL, + /* 197 */ 0x42a161981f190d9aUL, 0x61d849507e6052c1UL, + 0x9fe113bf285a2cd5UL, 0x7c22d676dbad85d8UL, + /* 198 */ 0x82e770ed2bfbd27dUL, 0x4c05b2ece996f5a5UL, + 0xcd40a9c2b0900150UL, 0x5895319213d9bf64UL, + /* 199 */ 0xe7cc5d703fea2e08UL, 0xb50c491258e2188cUL, + 0xcce30baa48205bf0UL, 0x537c659ccfa32d62UL, + /* 200 */ 0x37b6623a98cfc088UL, 0xfe9bed1fa4d6aca4UL, + 0x04d29b8e56a8d1b0UL, 0x725f71c40b519575UL, + /* 201 */ 0x28c7f89cd0339ce6UL, 0x8367b14469ddc18bUL, + 0x883ada83a6a1652cUL, 0x585f1974034d6c17UL, + /* 202 */ 0x89cfb266f1b19188UL, 0xe63b4863e7c35217UL, + 0xd88c9da6b4c0526aUL, 0x3e035c9df0954635UL, + /* 203 */ 0xdd9d5412fb45de9dUL, 0xdd684532e4cff40dUL, + 0x4b5c999b151d671cUL, 0x2d8c2cc811e7f690UL, + /* 204 */ 0x7f54be1d90055d40UL, 0xa464c5df464aaf40UL, + 0x33979624f0e917beUL, 0x2c018dc527356b30UL, + /* 205 */ 0xa5415024e330b3d4UL, 0x73ff3d96691652d3UL, + 0x94ec42c4ef9b59f1UL, 0x0747201618d08e5aUL, + /* 206 */ 0x4d6ca48aca411c53UL, 0x66415f2fcfa66119UL, + 0x9c4dd40051e227ffUL, 0x59810bc09a02f7ebUL, + /* 207 */ 0x2a7eb171b3dc101dUL, 0x441c5ab99ffef68eUL, + 0x32025c9b93b359eaUL, 0x5e8ce0a71e9d112fUL, + /* 208 */ 0xbfcccb92429503fdUL, 0xd271ba752f095d55UL, + 0x345ead5e972d091eUL, 0x18c8df11a83103baUL, + /* 209 */ 0x90cd949a9aed0f4cUL, 0xc5d1f4cb6660e37eUL, + 0xb8cac52d56c52e0bUL, 0x6e42e400c5808e0dUL, + /* 210 */ 0xa3b46966eeaefd23UL, 0x0c4f1f0be39ecdcaUL, + 0x189dc8c9d683a51dUL, 0x51f27f054c09351bUL, + /* 211 */ 0x4c487ccd2a320682UL, 0x587ea95bb3df1c96UL, + 0xc8ccf79e555cb8e8UL, 0x547dc829a206d73dUL, + /* 212 */ 0xb822a6cd80c39b06UL, 0xe96d54732000d4c6UL, + 0x28535b6f91463b4dUL, 0x228f4660e2486e1dUL, + /* 213 */ 0x98799538de8d3abfUL, 0x8cd8330045ebca6eUL, + 0x79952a008221e738UL, 0x4322e1a7535cd2bbUL, + /* 214 */ 0xb114c11819d1801cUL, 0x2016e4d84f3f5ec7UL, + 0xdd0e2df409260f4cUL, 0x5ec362c0ae5f7266UL, + /* 215 */ 0xc0462b18b8b2b4eeUL, 0x7cc8d950274d1afbUL, + 0xf25f7105436b02d2UL, 0x43bbf8dcbff9ccd3UL, + /* 216 */ 0xb6ad1767a039e9dfUL, 0xb0714da8f69d3583UL, + 0x5e55fa18b42931f5UL, 0x4ed5558f33c60961UL, + /* 217 */ 0x1fe37901c647a5ddUL, 0x593ddf1f8081d357UL, + 0x0249a4fd813fd7a6UL, 0x69acca274e9caf61UL, + /* 218 */ 0x047ba3ea330721c9UL, 0x83423fc20e7e1ea0UL, + 0x1df4c0af01314a60UL, 0x09a62dab89289527UL, + /* 219 */ 0xa5b325a49cc6cb00UL, 0xe94b5dc654b56cb6UL, + 0x3be28779adc994a0UL, 0x4296e8f8ba3a4aadUL, + /* 220 */ 0x328689761e451eabUL, 0x2e4d598bff59594aUL, + 0x49b96853d7a7084aUL, 0x4980a319601420a8UL, + /* 221 */ 0x9565b9e12f552c42UL, 0x8a5318db7100fe96UL, + 0x05c90b4d43add0d7UL, 0x538b4cd66a5d4edaUL, + /* 222 */ 0xf4e94fc3e89f039fUL, 0x592c9af26f618045UL, + 0x08a36eb5fd4b9550UL, 0x25fffaf6c2ed1419UL, + /* 223 */ 0x34434459cc79d354UL, 0xeeecbfb4b1d5476bUL, + 0xddeb34a061615d99UL, 0x5129cecceb64b773UL, + /* 224 */ 0xee43215894993520UL, 0x772f9c7cf14c0b3bUL, + 0xd2e2fce306bedad5UL, 0x715f42b546f06a97UL, + /* 225 */ 0x434ecdceda5b5f1aUL, 0x0da17115a49741a9UL, + 0x680bd77c73edad2eUL, 
0x487c02354edd9041UL, + /* 226 */ 0xb8efeff3a70ed9c4UL, 0x56a32aa3e857e302UL, + 0xdf3a68bd48a2a5a0UL, 0x07f650b73176c444UL, + /* 227 */ 0xe38b9b1626e0ccb1UL, 0x79e053c18b09fb36UL, + 0x56d90319c9f94964UL, 0x1ca941e7ac9ff5c4UL, + /* 228 */ 0x49c4df29162fa0bbUL, 0x8488cf3282b33305UL, + 0x95dfda14cabb437dUL, 0x3391f78264d5ad86UL, + /* 229 */ 0x729ae06ae2b5095dUL, 0xd58a58d73259a946UL, + 0xe9834262d13921edUL, 0x27fedafaa54bb592UL, + /* 230 */ 0xa99dc5b829ad48bbUL, 0x5f025742499ee260UL, + 0x802c8ecd5d7513fdUL, 0x78ceb3ef3f6dd938UL, + /* 231 */ 0xc342f44f8a135d94UL, 0x7b9edb44828cdda3UL, + 0x9436d11a0537cfe7UL, 0x5064b164ec1ab4c8UL, + /* 232 */ 0x7020eccfd37eb2fcUL, 0x1f31ea3ed90d25fcUL, + 0x1b930d7bdfa1bb34UL, 0x5344467a48113044UL, + /* 233 */ 0x70073170f25e6dfbUL, 0xe385dc1a50114cc8UL, + 0x2348698ac8fc4f00UL, 0x2a77a55284dd40d8UL, + /* 234 */ 0xfe06afe0c98c6ce4UL, 0xc235df96dddfd6e4UL, + 0x1428d01e33bf1ed3UL, 0x785768ec9300bdafUL, + /* 235 */ 0x9702e57a91deb63bUL, 0x61bdb8bfe5ce8b80UL, + 0x645b426f3d1d58acUL, 0x4804a82227a557bcUL, + /* 236 */ 0x8e57048ab44d2601UL, 0x68d6501a4b3a6935UL, + 0xc39c9ec3f9e1c293UL, 0x4172f257d4de63e2UL, + /* 237 */ 0xd368b450330c6401UL, 0x040d3017418f2391UL, + 0x2c34bb6090b7d90dUL, 0x16f649228fdfd51fUL, + /* 238 */ 0xbea6818e2b928ef5UL, 0xe28ccf91cdc11e72UL, + 0x594aaa68e77a36cdUL, 0x313034806c7ffd0fUL, + /* 239 */ 0x8a9d27ac2249bd65UL, 0x19a3b464018e9512UL, + 0xc26ccff352b37ec7UL, 0x056f68341d797b21UL, + /* 240 */ 0x5e79d6757efd2327UL, 0xfabdbcb6553afe15UL, + 0xd3e7222c6eaf5a60UL, 0x7046c76d4dae743bUL, + /* 241 */ 0x660be872b18d4a55UL, 0x19992518574e1496UL, + 0xc103053a302bdcbbUL, 0x3ed8e9800b218e8eUL, + /* 242 */ 0x7b0b9239fa75e03eUL, 0xefe9fb684633c083UL, + 0x98a35fbe391a7793UL, 0x6065510fe2d0fe34UL, + /* 243 */ 0x55cb668548abad0cUL, 0xb4584548da87e527UL, + 0x2c43ecea0107c1ddUL, 0x526028809372de35UL, + /* 244 */ 0x3415c56af9213b1fUL, 0x5bee1a4d017e98dbUL, + 0x13f6b105b5cf709bUL, 0x5ff20e3482b29ab6UL, + /* 245 */ 0x0aa29c75cc2e6c90UL, 0xfc7d73ca3a70e206UL, + 0x899fc38fc4b5c515UL, 0x250386b124ffc207UL, + /* 246 */ 0x54ea28d5ae3d2b56UL, 0x9913149dd6de60ceUL, + 0x16694fc58f06d6c1UL, 0x46b23975eb018fc7UL, + /* 247 */ 0x470a6a0fb4b7b4e2UL, 0x5d92475a8f7253deUL, + 0xabeee5b52fbd3adbUL, 0x7fa20801a0806968UL, + /* 248 */ 0x76f3faf19f7714d2UL, 0xb3e840c12f4660c3UL, + 0x0fb4cd8df212744eUL, 0x4b065a251d3a2dd2UL, + /* 249 */ 0x5cebde383d77cd4aUL, 0x6adf39df882c9cb1UL, + 0xa2dd242eb09af759UL, 0x3147c0e50e5f6422UL, + /* 250 */ 0x164ca5101d1350dbUL, 0xf8d13479c33fc962UL, + 0xe640ce4d13e5da08UL, 0x4bdee0c45061f8baUL, + /* 251 */ 0xd7c46dc1a4edb1c9UL, 0x5514d7b6437fd98aUL, + 0x58942f6bb2a1c00bUL, 0x2dffb2ab1d70710eUL, + /* 252 */ 0xccdfcf2fc18b6d68UL, 0xa8ebcba8b7806167UL, + 0x980697f95e2937e3UL, 0x02fbba1cd0126e8cUL +}; + +/* c is two 512-bit products: c0[0:7]=a0[0:3]*b0[0:3] and c1[8:15]=a1[4:7]*b1[4:7] + * a is two 256-bit integers: a0[0:3] and a1[4:7] + * b is two 256-bit integers: b0[0:3] and b1[4:7] + */ +static void mul2_256x256_integer_adx(u64 *const c, const u64 *const a, + const u64 *const b) +{ + asm volatile( + "xorl %%r14d, %%r14d ;" + "movq (%1), %%rdx; " /* A[0] */ + "mulx (%2), %%r8, %%r15; " /* A[0]*B[0] */ + "xorl %%r10d, %%r10d ;" + "movq %%r8, (%0) ;" + "mulx 8(%2), %%r10, %%rax; " /* A[0]*B[1] */ + "adox %%r10, %%r15 ;" + "mulx 16(%2), %%r8, %%rbx; " /* A[0]*B[2] */ + "adox %%r8, %%rax ;" + "mulx 24(%2), %%r10, %%rcx; " /* A[0]*B[3] */ + "adox %%r10, %%rbx ;" + /******************************************/ + "adox %%r14, %%rcx ;" + + "movq 8(%1), %%rdx; " /* 
A[1] */ + "mulx (%2), %%r8, %%r9; " /* A[1]*B[0] */ + "adox %%r15, %%r8 ;" + "movq %%r8, 8(%0) ;" + "mulx 8(%2), %%r10, %%r11; " /* A[1]*B[1] */ + "adox %%r10, %%r9 ;" + "adcx %%r9, %%rax ;" + "mulx 16(%2), %%r8, %%r13; " /* A[1]*B[2] */ + "adox %%r8, %%r11 ;" + "adcx %%r11, %%rbx ;" + "mulx 24(%2), %%r10, %%r15; " /* A[1]*B[3] */ + "adox %%r10, %%r13 ;" + "adcx %%r13, %%rcx ;" + /******************************************/ + "adox %%r14, %%r15 ;" + "adcx %%r14, %%r15 ;" + + "movq 16(%1), %%rdx; " /* A[2] */ + "xorl %%r10d, %%r10d ;" + "mulx (%2), %%r8, %%r9; " /* A[2]*B[0] */ + "adox %%rax, %%r8 ;" + "movq %%r8, 16(%0) ;" + "mulx 8(%2), %%r10, %%r11; " /* A[2]*B[1] */ + "adox %%r10, %%r9 ;" + "adcx %%r9, %%rbx ;" + "mulx 16(%2), %%r8, %%r13; " /* A[2]*B[2] */ + "adox %%r8, %%r11 ;" + "adcx %%r11, %%rcx ;" + "mulx 24(%2), %%r10, %%rax; " /* A[2]*B[3] */ + "adox %%r10, %%r13 ;" + "adcx %%r13, %%r15 ;" + /******************************************/ + "adox %%r14, %%rax ;" + "adcx %%r14, %%rax ;" + + "movq 24(%1), %%rdx; " /* A[3] */ + "xorl %%r10d, %%r10d ;" + "mulx (%2), %%r8, %%r9; " /* A[3]*B[0] */ + "adox %%rbx, %%r8 ;" + "movq %%r8, 24(%0) ;" + "mulx 8(%2), %%r10, %%r11; " /* A[3]*B[1] */ + "adox %%r10, %%r9 ;" + "adcx %%r9, %%rcx ;" + "movq %%rcx, 32(%0) ;" + "mulx 16(%2), %%r8, %%r13; " /* A[3]*B[2] */ + "adox %%r8, %%r11 ;" + "adcx %%r11, %%r15 ;" + "movq %%r15, 40(%0) ;" + "mulx 24(%2), %%r10, %%rbx; " /* A[3]*B[3] */ + "adox %%r10, %%r13 ;" + "adcx %%r13, %%rax ;" + "movq %%rax, 48(%0) ;" + /******************************************/ + "adox %%r14, %%rbx ;" + "adcx %%r14, %%rbx ;" + "movq %%rbx, 56(%0) ;" + + "movq 32(%1), %%rdx; " /* C[0] */ + "mulx 32(%2), %%r8, %%r15; " /* C[0]*D[0] */ + "xorl %%r10d, %%r10d ;" + "movq %%r8, 64(%0);" + "mulx 40(%2), %%r10, %%rax; " /* C[0]*D[1] */ + "adox %%r10, %%r15 ;" + "mulx 48(%2), %%r8, %%rbx; " /* C[0]*D[2] */ + "adox %%r8, %%rax ;" + "mulx 56(%2), %%r10, %%rcx; " /* C[0]*D[3] */ + "adox %%r10, %%rbx ;" + /******************************************/ + "adox %%r14, %%rcx ;" + + "movq 40(%1), %%rdx; " /* C[1] */ + "xorl %%r10d, %%r10d ;" + "mulx 32(%2), %%r8, %%r9; " /* C[1]*D[0] */ + "adox %%r15, %%r8 ;" + "movq %%r8, 72(%0);" + "mulx 40(%2), %%r10, %%r11; " /* C[1]*D[1] */ + "adox %%r10, %%r9 ;" + "adcx %%r9, %%rax ;" + "mulx 48(%2), %%r8, %%r13; " /* C[1]*D[2] */ + "adox %%r8, %%r11 ;" + "adcx %%r11, %%rbx ;" + "mulx 56(%2), %%r10, %%r15; " /* C[1]*D[3] */ + "adox %%r10, %%r13 ;" + "adcx %%r13, %%rcx ;" + /******************************************/ + "adox %%r14, %%r15 ;" + "adcx %%r14, %%r15 ;" + + "movq 48(%1), %%rdx; " /* C[2] */ + "xorl %%r10d, %%r10d ;" + "mulx 32(%2), %%r8, %%r9; " /* C[2]*D[0] */ + "adox %%rax, %%r8 ;" + "movq %%r8, 80(%0);" + "mulx 40(%2), %%r10, %%r11; " /* C[2]*D[1] */ + "adox %%r10, %%r9 ;" + "adcx %%r9, %%rbx ;" + "mulx 48(%2), %%r8, %%r13; " /* C[2]*D[2] */ + "adox %%r8, %%r11 ;" + "adcx %%r11, %%rcx ;" + "mulx 56(%2), %%r10, %%rax; " /* C[2]*D[3] */ + "adox %%r10, %%r13 ;" + "adcx %%r13, %%r15 ;" + /******************************************/ + "adox %%r14, %%rax ;" + "adcx %%r14, %%rax ;" + + "movq 56(%1), %%rdx; " /* C[3] */ + "xorl %%r10d, %%r10d ;" + "mulx 32(%2), %%r8, %%r9; " /* C[3]*D[0] */ + "adox %%rbx, %%r8 ;" + "movq %%r8, 88(%0);" + "mulx 40(%2), %%r10, %%r11; " /* C[3]*D[1] */ + "adox %%r10, %%r9 ;" + "adcx %%r9, %%rcx ;" + "movq %%rcx, 96(%0) ;" + "mulx 48(%2), %%r8, %%r13; " /* C[3]*D[2] */ + "adox %%r8, %%r11 ;" + "adcx %%r11, %%r15 ;" + "movq %%r15, 104(%0) ;" + "mulx 56(%2), %%r10, 
%%rbx; " /* C[3]*D[3] */ + "adox %%r10, %%r13 ;" + "adcx %%r13, %%rax ;" + "movq %%rax, 112(%0) ;" + /******************************************/ + "adox %%r14, %%rbx ;" + "adcx %%r14, %%rbx ;" + "movq %%rbx, 120(%0) ;" + : + : "r"(c), "r"(a), "r"(b) + : "memory", "cc", "%rax", "%rbx", "%rcx", "%rdx", "%r8", "%r9", + "%r10", "%r11", "%r13", "%r14", "%r15"); +} + +static void mul2_256x256_integer_bmi2(u64 *const c, const u64 *const a, + const u64 *const b) +{ + asm volatile( + "movq (%1), %%rdx; " /* A[0] */ + "mulx (%2), %%r8, %%r15; " /* A[0]*B[0] */ + "movq %%r8, (%0) ;" + "mulx 8(%2), %%r10, %%rax; " /* A[0]*B[1] */ + "addq %%r10, %%r15 ;" + "mulx 16(%2), %%r8, %%rbx; " /* A[0]*B[2] */ + "adcq %%r8, %%rax ;" + "mulx 24(%2), %%r10, %%rcx; " /* A[0]*B[3] */ + "adcq %%r10, %%rbx ;" + /******************************************/ + "adcq $0, %%rcx ;" + + "movq 8(%1), %%rdx; " /* A[1] */ + "mulx (%2), %%r8, %%r9; " /* A[1]*B[0] */ + "addq %%r15, %%r8 ;" + "movq %%r8, 8(%0) ;" + "mulx 8(%2), %%r10, %%r11; " /* A[1]*B[1] */ + "adcq %%r10, %%r9 ;" + "mulx 16(%2), %%r8, %%r13; " /* A[1]*B[2] */ + "adcq %%r8, %%r11 ;" + "mulx 24(%2), %%r10, %%r15; " /* A[1]*B[3] */ + "adcq %%r10, %%r13 ;" + /******************************************/ + "adcq $0, %%r15 ;" + + "addq %%r9, %%rax ;" + "adcq %%r11, %%rbx ;" + "adcq %%r13, %%rcx ;" + "adcq $0, %%r15 ;" + + "movq 16(%1), %%rdx; " /* A[2] */ + "mulx (%2), %%r8, %%r9; " /* A[2]*B[0] */ + "addq %%rax, %%r8 ;" + "movq %%r8, 16(%0) ;" + "mulx 8(%2), %%r10, %%r11; " /* A[2]*B[1] */ + "adcq %%r10, %%r9 ;" + "mulx 16(%2), %%r8, %%r13; " /* A[2]*B[2] */ + "adcq %%r8, %%r11 ;" + "mulx 24(%2), %%r10, %%rax; " /* A[2]*B[3] */ + "adcq %%r10, %%r13 ;" + /******************************************/ + "adcq $0, %%rax ;" + + "addq %%r9, %%rbx ;" + "adcq %%r11, %%rcx ;" + "adcq %%r13, %%r15 ;" + "adcq $0, %%rax ;" + + "movq 24(%1), %%rdx; " /* A[3] */ + "mulx (%2), %%r8, %%r9; " /* A[3]*B[0] */ + "addq %%rbx, %%r8 ;" + "movq %%r8, 24(%0) ;" + "mulx 8(%2), %%r10, %%r11; " /* A[3]*B[1] */ + "adcq %%r10, %%r9 ;" + "mulx 16(%2), %%r8, %%r13; " /* A[3]*B[2] */ + "adcq %%r8, %%r11 ;" + "mulx 24(%2), %%r10, %%rbx; " /* A[3]*B[3] */ + "adcq %%r10, %%r13 ;" + /******************************************/ + "adcq $0, %%rbx ;" + + "addq %%r9, %%rcx ;" + "movq %%rcx, 32(%0) ;" + "adcq %%r11, %%r15 ;" + "movq %%r15, 40(%0) ;" + "adcq %%r13, %%rax ;" + "movq %%rax, 48(%0) ;" + "adcq $0, %%rbx ;" + "movq %%rbx, 56(%0) ;" + + "movq 32(%1), %%rdx; " /* C[0] */ + "mulx 32(%2), %%r8, %%r15; " /* C[0]*D[0] */ + "movq %%r8, 64(%0) ;" + "mulx 40(%2), %%r10, %%rax; " /* C[0]*D[1] */ + "addq %%r10, %%r15 ;" + "mulx 48(%2), %%r8, %%rbx; " /* C[0]*D[2] */ + "adcq %%r8, %%rax ;" + "mulx 56(%2), %%r10, %%rcx; " /* C[0]*D[3] */ + "adcq %%r10, %%rbx ;" + /******************************************/ + "adcq $0, %%rcx ;" + + "movq 40(%1), %%rdx; " /* C[1] */ + "mulx 32(%2), %%r8, %%r9; " /* C[1]*D[0] */ + "addq %%r15, %%r8 ;" + "movq %%r8, 72(%0) ;" + "mulx 40(%2), %%r10, %%r11; " /* C[1]*D[1] */ + "adcq %%r10, %%r9 ;" + "mulx 48(%2), %%r8, %%r13; " /* C[1]*D[2] */ + "adcq %%r8, %%r11 ;" + "mulx 56(%2), %%r10, %%r15; " /* C[1]*D[3] */ + "adcq %%r10, %%r13 ;" + /******************************************/ + "adcq $0, %%r15 ;" + + "addq %%r9, %%rax ;" + "adcq %%r11, %%rbx ;" + "adcq %%r13, %%rcx ;" + "adcq $0, %%r15 ;" + + "movq 48(%1), %%rdx; " /* C[2] */ + "mulx 32(%2), %%r8, %%r9; " /* C[2]*D[0] */ + "addq %%rax, %%r8 ;" + "movq %%r8, 80(%0) ;" + "mulx 40(%2), %%r10, %%r11; " /* C[2]*D[1] */ + "adcq 
%%r10, %%r9 ;" + "mulx 48(%2), %%r8, %%r13; " /* C[2]*D[2] */ + "adcq %%r8, %%r11 ;" + "mulx 56(%2), %%r10, %%rax; " /* C[2]*D[3] */ + "adcq %%r10, %%r13 ;" + /******************************************/ + "adcq $0, %%rax ;" + + "addq %%r9, %%rbx ;" + "adcq %%r11, %%rcx ;" + "adcq %%r13, %%r15 ;" + "adcq $0, %%rax ;" + + "movq 56(%1), %%rdx; " /* C[3] */ + "mulx 32(%2), %%r8, %%r9; " /* C[3]*D[0] */ + "addq %%rbx, %%r8 ;" + "movq %%r8, 88(%0) ;" + "mulx 40(%2), %%r10, %%r11; " /* C[3]*D[1] */ + "adcq %%r10, %%r9 ;" + "mulx 48(%2), %%r8, %%r13; " /* C[3]*D[2] */ + "adcq %%r8, %%r11 ;" + "mulx 56(%2), %%r10, %%rbx; " /* C[3]*D[3] */ + "adcq %%r10, %%r13 ;" + /******************************************/ + "adcq $0, %%rbx ;" + + "addq %%r9, %%rcx ;" + "movq %%rcx, 96(%0) ;" + "adcq %%r11, %%r15 ;" + "movq %%r15, 104(%0) ;" + "adcq %%r13, %%rax ;" + "movq %%rax, 112(%0) ;" + "adcq $0, %%rbx ;" + "movq %%rbx, 120(%0) ;" + : + : "r"(c), "r"(a), "r"(b) + : "memory", "cc", "%rax", "%rbx", "%rcx", "%rdx", "%r8", "%r9", + "%r10", "%r11", "%r13", "%r15"); +} + +static void sqr2_256x256_integer_adx(u64 *const c, const u64 *const a) +{ + asm volatile( + "movq (%1), %%rdx ;" /* A[0] */ + "mulx 8(%1), %%r8, %%r14 ;" /* A[1]*A[0] */ + "xorl %%r15d, %%r15d;" + "mulx 16(%1), %%r9, %%r10 ;" /* A[2]*A[0] */ + "adcx %%r14, %%r9 ;" + "mulx 24(%1), %%rax, %%rcx ;" /* A[3]*A[0] */ + "adcx %%rax, %%r10 ;" + "movq 24(%1), %%rdx ;" /* A[3] */ + "mulx 8(%1), %%r11, %%rbx ;" /* A[1]*A[3] */ + "adcx %%rcx, %%r11 ;" + "mulx 16(%1), %%rax, %%r13 ;" /* A[2]*A[3] */ + "adcx %%rax, %%rbx ;" + "movq 8(%1), %%rdx ;" /* A[1] */ + "adcx %%r15, %%r13 ;" + "mulx 16(%1), %%rax, %%rcx ;" /* A[2]*A[1] */ + "movq $0, %%r14 ;" + /******************************************/ + "adcx %%r15, %%r14 ;" + + "xorl %%r15d, %%r15d;" + "adox %%rax, %%r10 ;" + "adcx %%r8, %%r8 ;" + "adox %%rcx, %%r11 ;" + "adcx %%r9, %%r9 ;" + "adox %%r15, %%rbx ;" + "adcx %%r10, %%r10 ;" + "adox %%r15, %%r13 ;" + "adcx %%r11, %%r11 ;" + "adox %%r15, %%r14 ;" + "adcx %%rbx, %%rbx ;" + "adcx %%r13, %%r13 ;" + "adcx %%r14, %%r14 ;" + + "movq (%1), %%rdx ;" + "mulx %%rdx, %%rax, %%rcx ;" /* A[0]^2 */ + /*******************/ + "movq %%rax, 0(%0) ;" + "addq %%rcx, %%r8 ;" + "movq %%r8, 8(%0) ;" + "movq 8(%1), %%rdx ;" + "mulx %%rdx, %%rax, %%rcx ;" /* A[1]^2 */ + "adcq %%rax, %%r9 ;" + "movq %%r9, 16(%0) ;" + "adcq %%rcx, %%r10 ;" + "movq %%r10, 24(%0) ;" + "movq 16(%1), %%rdx ;" + "mulx %%rdx, %%rax, %%rcx ;" /* A[2]^2 */ + "adcq %%rax, %%r11 ;" + "movq %%r11, 32(%0) ;" + "adcq %%rcx, %%rbx ;" + "movq %%rbx, 40(%0) ;" + "movq 24(%1), %%rdx ;" + "mulx %%rdx, %%rax, %%rcx ;" /* A[3]^2 */ + "adcq %%rax, %%r13 ;" + "movq %%r13, 48(%0) ;" + "adcq %%rcx, %%r14 ;" + "movq %%r14, 56(%0) ;" + + + "movq 32(%1), %%rdx ;" /* B[0] */ + "mulx 40(%1), %%r8, %%r14 ;" /* B[1]*B[0] */ + "xorl %%r15d, %%r15d;" + "mulx 48(%1), %%r9, %%r10 ;" /* B[2]*B[0] */ + "adcx %%r14, %%r9 ;" + "mulx 56(%1), %%rax, %%rcx ;" /* B[3]*B[0] */ + "adcx %%rax, %%r10 ;" + "movq 56(%1), %%rdx ;" /* B[3] */ + "mulx 40(%1), %%r11, %%rbx ;" /* B[1]*B[3] */ + "adcx %%rcx, %%r11 ;" + "mulx 48(%1), %%rax, %%r13 ;" /* B[2]*B[3] */ + "adcx %%rax, %%rbx ;" + "movq 40(%1), %%rdx ;" /* B[1] */ + "adcx %%r15, %%r13 ;" + "mulx 48(%1), %%rax, %%rcx ;" /* B[2]*B[1] */ + "movq $0, %%r14 ;" + /******************************************/ + "adcx %%r15, %%r14 ;" + + "xorl %%r15d, %%r15d;" + "adox %%rax, %%r10 ;" + "adcx %%r8, %%r8 ;" + "adox %%rcx, %%r11 ;" + "adcx %%r9, %%r9 ;" + "adox %%r15, %%rbx ;" + "adcx %%r10, %%r10 ;" 
+ "adox %%r15, %%r13 ;" + "adcx %%r11, %%r11 ;" + "adox %%r15, %%r14 ;" + "adcx %%rbx, %%rbx ;" + "adcx %%r13, %%r13 ;" + "adcx %%r14, %%r14 ;" + + "movq 32(%1), %%rdx ;" + "mulx %%rdx, %%rax, %%rcx ;" /* B[0]^2 */ + /*******************/ + "movq %%rax, 64(%0) ;" + "addq %%rcx, %%r8 ;" + "movq %%r8, 72(%0) ;" + "movq 40(%1), %%rdx ;" + "mulx %%rdx, %%rax, %%rcx ;" /* B[1]^2 */ + "adcq %%rax, %%r9 ;" + "movq %%r9, 80(%0) ;" + "adcq %%rcx, %%r10 ;" + "movq %%r10, 88(%0) ;" + "movq 48(%1), %%rdx ;" + "mulx %%rdx, %%rax, %%rcx ;" /* B[2]^2 */ + "adcq %%rax, %%r11 ;" + "movq %%r11, 96(%0) ;" + "adcq %%rcx, %%rbx ;" + "movq %%rbx, 104(%0) ;" + "movq 56(%1), %%rdx ;" + "mulx %%rdx, %%rax, %%rcx ;" /* B[3]^2 */ + "adcq %%rax, %%r13 ;" + "movq %%r13, 112(%0) ;" + "adcq %%rcx, %%r14 ;" + "movq %%r14, 120(%0) ;" + : + : "r"(c), "r"(a) + : "memory", "cc", "%rax", "%rbx", "%rcx", "%rdx", "%r8", "%r9", + "%r10", "%r11", "%r13", "%r14", "%r15"); +} + +static void sqr2_256x256_integer_bmi2(u64 *const c, const u64 *const a) +{ + asm volatile( + "movq 8(%1), %%rdx ;" /* A[1] */ + "mulx (%1), %%r8, %%r9 ;" /* A[0]*A[1] */ + "mulx 16(%1), %%r10, %%r11 ;" /* A[2]*A[1] */ + "mulx 24(%1), %%rcx, %%r14 ;" /* A[3]*A[1] */ + + "movq 16(%1), %%rdx ;" /* A[2] */ + "mulx 24(%1), %%r15, %%r13 ;" /* A[3]*A[2] */ + "mulx (%1), %%rax, %%rdx ;" /* A[0]*A[2] */ + + "addq %%rax, %%r9 ;" + "adcq %%rdx, %%r10 ;" + "adcq %%rcx, %%r11 ;" + "adcq %%r14, %%r15 ;" + "adcq $0, %%r13 ;" + "movq $0, %%r14 ;" + "adcq $0, %%r14 ;" + + "movq (%1), %%rdx ;" /* A[0] */ + "mulx 24(%1), %%rax, %%rcx ;" /* A[0]*A[3] */ + + "addq %%rax, %%r10 ;" + "adcq %%rcx, %%r11 ;" + "adcq $0, %%r15 ;" + "adcq $0, %%r13 ;" + "adcq $0, %%r14 ;" + + "shldq $1, %%r13, %%r14 ;" + "shldq $1, %%r15, %%r13 ;" + "shldq $1, %%r11, %%r15 ;" + "shldq $1, %%r10, %%r11 ;" + "shldq $1, %%r9, %%r10 ;" + "shldq $1, %%r8, %%r9 ;" + "shlq $1, %%r8 ;" + + /*******************/ + "mulx %%rdx, %%rax, %%rcx ; " /* A[0]^2 */ + /*******************/ + "movq %%rax, 0(%0) ;" + "addq %%rcx, %%r8 ;" + "movq %%r8, 8(%0) ;" + "movq 8(%1), %%rdx ;" + "mulx %%rdx, %%rax, %%rcx ; " /* A[1]^2 */ + "adcq %%rax, %%r9 ;" + "movq %%r9, 16(%0) ;" + "adcq %%rcx, %%r10 ;" + "movq %%r10, 24(%0) ;" + "movq 16(%1), %%rdx ;" + "mulx %%rdx, %%rax, %%rcx ; " /* A[2]^2 */ + "adcq %%rax, %%r11 ;" + "movq %%r11, 32(%0) ;" + "adcq %%rcx, %%r15 ;" + "movq %%r15, 40(%0) ;" + "movq 24(%1), %%rdx ;" + "mulx %%rdx, %%rax, %%rcx ; " /* A[3]^2 */ + "adcq %%rax, %%r13 ;" + "movq %%r13, 48(%0) ;" + "adcq %%rcx, %%r14 ;" + "movq %%r14, 56(%0) ;" + + "movq 40(%1), %%rdx ;" /* B[1] */ + "mulx 32(%1), %%r8, %%r9 ;" /* B[0]*B[1] */ + "mulx 48(%1), %%r10, %%r11 ;" /* B[2]*B[1] */ + "mulx 56(%1), %%rcx, %%r14 ;" /* B[3]*B[1] */ + + "movq 48(%1), %%rdx ;" /* B[2] */ + "mulx 56(%1), %%r15, %%r13 ;" /* B[3]*B[2] */ + "mulx 32(%1), %%rax, %%rdx ;" /* B[0]*B[2] */ + + "addq %%rax, %%r9 ;" + "adcq %%rdx, %%r10 ;" + "adcq %%rcx, %%r11 ;" + "adcq %%r14, %%r15 ;" + "adcq $0, %%r13 ;" + "movq $0, %%r14 ;" + "adcq $0, %%r14 ;" + + "movq 32(%1), %%rdx ;" /* B[0] */ + "mulx 56(%1), %%rax, %%rcx ;" /* B[0]*B[3] */ + + "addq %%rax, %%r10 ;" + "adcq %%rcx, %%r11 ;" + "adcq $0, %%r15 ;" + "adcq $0, %%r13 ;" + "adcq $0, %%r14 ;" + + "shldq $1, %%r13, %%r14 ;" + "shldq $1, %%r15, %%r13 ;" + "shldq $1, %%r11, %%r15 ;" + "shldq $1, %%r10, %%r11 ;" + "shldq $1, %%r9, %%r10 ;" + "shldq $1, %%r8, %%r9 ;" + "shlq $1, %%r8 ;" + + /*******************/ + "mulx %%rdx, %%rax, %%rcx ; " /* B[0]^2 */ + /*******************/ + "movq %%rax, 64(%0) ;" + 
"addq %%rcx, %%r8 ;" + "movq %%r8, 72(%0) ;" + "movq 40(%1), %%rdx ;" + "mulx %%rdx, %%rax, %%rcx ; " /* B[1]^2 */ + "adcq %%rax, %%r9 ;" + "movq %%r9, 80(%0) ;" + "adcq %%rcx, %%r10 ;" + "movq %%r10, 88(%0) ;" + "movq 48(%1), %%rdx ;" + "mulx %%rdx, %%rax, %%rcx ; " /* B[2]^2 */ + "adcq %%rax, %%r11 ;" + "movq %%r11, 96(%0) ;" + "adcq %%rcx, %%r15 ;" + "movq %%r15, 104(%0) ;" + "movq 56(%1), %%rdx ;" + "mulx %%rdx, %%rax, %%rcx ; " /* B[3]^2 */ + "adcq %%rax, %%r13 ;" + "movq %%r13, 112(%0) ;" + "adcq %%rcx, %%r14 ;" + "movq %%r14, 120(%0) ;" + : + : "r"(c), "r"(a) + : "memory", "cc", "%rax", "%rcx", "%rdx", "%r8", "%r9", "%r10", + "%r11", "%r13", "%r14", "%r15"); +} + +static void red_eltfp25519_2w_adx(u64 *const c, const u64 *const a) +{ + asm volatile( + "movl $38, %%edx; " /* 2*c = 38 = 2^256 */ + "mulx 32(%1), %%r8, %%r10; " /* c*C[4] */ + "xorl %%ebx, %%ebx ;" + "adox (%1), %%r8 ;" + "mulx 40(%1), %%r9, %%r11; " /* c*C[5] */ + "adcx %%r10, %%r9 ;" + "adox 8(%1), %%r9 ;" + "mulx 48(%1), %%r10, %%rax; " /* c*C[6] */ + "adcx %%r11, %%r10 ;" + "adox 16(%1), %%r10 ;" + "mulx 56(%1), %%r11, %%rcx; " /* c*C[7] */ + "adcx %%rax, %%r11 ;" + "adox 24(%1), %%r11 ;" + /***************************************/ + "adcx %%rbx, %%rcx ;" + "adox %%rbx, %%rcx ;" + "imul %%rdx, %%rcx ;" /* c*C[4], cf=0, of=0 */ + "adcx %%rcx, %%r8 ;" + "adcx %%rbx, %%r9 ;" + "movq %%r9, 8(%0) ;" + "adcx %%rbx, %%r10 ;" + "movq %%r10, 16(%0) ;" + "adcx %%rbx, %%r11 ;" + "movq %%r11, 24(%0) ;" + "mov $0, %%ecx ;" + "cmovc %%edx, %%ecx ;" + "addq %%rcx, %%r8 ;" + "movq %%r8, (%0) ;" + + "mulx 96(%1), %%r8, %%r10; " /* c*C[4] */ + "xorl %%ebx, %%ebx ;" + "adox 64(%1), %%r8 ;" + "mulx 104(%1), %%r9, %%r11; " /* c*C[5] */ + "adcx %%r10, %%r9 ;" + "adox 72(%1), %%r9 ;" + "mulx 112(%1), %%r10, %%rax; " /* c*C[6] */ + "adcx %%r11, %%r10 ;" + "adox 80(%1), %%r10 ;" + "mulx 120(%1), %%r11, %%rcx; " /* c*C[7] */ + "adcx %%rax, %%r11 ;" + "adox 88(%1), %%r11 ;" + /****************************************/ + "adcx %%rbx, %%rcx ;" + "adox %%rbx, %%rcx ;" + "imul %%rdx, %%rcx ;" /* c*C[4], cf=0, of=0 */ + "adcx %%rcx, %%r8 ;" + "adcx %%rbx, %%r9 ;" + "movq %%r9, 40(%0) ;" + "adcx %%rbx, %%r10 ;" + "movq %%r10, 48(%0) ;" + "adcx %%rbx, %%r11 ;" + "movq %%r11, 56(%0) ;" + "mov $0, %%ecx ;" + "cmovc %%edx, %%ecx ;" + "addq %%rcx, %%r8 ;" + "movq %%r8, 32(%0) ;" + : + : "r"(c), "r"(a) + : "memory", "cc", "%rax", "%rbx", "%rcx", "%rdx", "%r8", "%r9", + "%r10", "%r11"); +} + +static void red_eltfp25519_2w_bmi2(u64 *const c, const u64 *const a) +{ + asm volatile( + "movl $38, %%edx ; " /* 2*c = 38 = 2^256 */ + "mulx 32(%1), %%r8, %%r10 ;" /* c*C[4] */ + "mulx 40(%1), %%r9, %%r11 ;" /* c*C[5] */ + "addq %%r10, %%r9 ;" + "mulx 48(%1), %%r10, %%rax ;" /* c*C[6] */ + "adcq %%r11, %%r10 ;" + "mulx 56(%1), %%r11, %%rcx ;" /* c*C[7] */ + "adcq %%rax, %%r11 ;" + /***************************************/ + "adcq $0, %%rcx ;" + "addq (%1), %%r8 ;" + "adcq 8(%1), %%r9 ;" + "adcq 16(%1), %%r10 ;" + "adcq 24(%1), %%r11 ;" + "adcq $0, %%rcx ;" + "imul %%rdx, %%rcx ;" /* c*C[4], cf=0 */ + "addq %%rcx, %%r8 ;" + "adcq $0, %%r9 ;" + "movq %%r9, 8(%0) ;" + "adcq $0, %%r10 ;" + "movq %%r10, 16(%0) ;" + "adcq $0, %%r11 ;" + "movq %%r11, 24(%0) ;" + "mov $0, %%ecx ;" + "cmovc %%edx, %%ecx ;" + "addq %%rcx, %%r8 ;" + "movq %%r8, (%0) ;" + + "mulx 96(%1), %%r8, %%r10 ;" /* c*C[4] */ + "mulx 104(%1), %%r9, %%r11 ;" /* c*C[5] */ + "addq %%r10, %%r9 ;" + "mulx 112(%1), %%r10, %%rax ;" /* c*C[6] */ + "adcq %%r11, %%r10 ;" + "mulx 120(%1), %%r11, %%rcx ;" /* c*C[7] */ 
+ "adcq %%rax, %%r11 ;" + /****************************************/ + "adcq $0, %%rcx ;" + "addq 64(%1), %%r8 ;" + "adcq 72(%1), %%r9 ;" + "adcq 80(%1), %%r10 ;" + "adcq 88(%1), %%r11 ;" + "adcq $0, %%rcx ;" + "imul %%rdx, %%rcx ;" /* c*C[4], cf=0 */ + "addq %%rcx, %%r8 ;" + "adcq $0, %%r9 ;" + "movq %%r9, 40(%0) ;" + "adcq $0, %%r10 ;" + "movq %%r10, 48(%0) ;" + "adcq $0, %%r11 ;" + "movq %%r11, 56(%0) ;" + "mov $0, %%ecx ;" + "cmovc %%edx, %%ecx ;" + "addq %%rcx, %%r8 ;" + "movq %%r8, 32(%0) ;" + : + : "r"(c), "r"(a) + : "memory", "cc", "%rax", "%rcx", "%rdx", "%r8", "%r9", "%r10", + "%r11"); +} + +static void mul_256x256_integer_adx(u64 *const c, const u64 *const a, + const u64 *const b) +{ + asm volatile( + "movq (%1), %%rdx; " /* A[0] */ + "mulx (%2), %%r8, %%r9; " /* A[0]*B[0] */ + "xorl %%r10d, %%r10d ;" + "movq %%r8, (%0) ;" + "mulx 8(%2), %%r10, %%r11; " /* A[0]*B[1] */ + "adox %%r9, %%r10 ;" + "movq %%r10, 8(%0) ;" + "mulx 16(%2), %%r15, %%r13; " /* A[0]*B[2] */ + "adox %%r11, %%r15 ;" + "mulx 24(%2), %%r14, %%rdx; " /* A[0]*B[3] */ + "adox %%r13, %%r14 ;" + "movq $0, %%rax ;" + /******************************************/ + "adox %%rdx, %%rax ;" + + "movq 8(%1), %%rdx; " /* A[1] */ + "mulx (%2), %%r8, %%r9; " /* A[1]*B[0] */ + "xorl %%r10d, %%r10d ;" + "adcx 8(%0), %%r8 ;" + "movq %%r8, 8(%0) ;" + "mulx 8(%2), %%r10, %%r11; " /* A[1]*B[1] */ + "adox %%r9, %%r10 ;" + "adcx %%r15, %%r10 ;" + "movq %%r10, 16(%0) ;" + "mulx 16(%2), %%r15, %%r13; " /* A[1]*B[2] */ + "adox %%r11, %%r15 ;" + "adcx %%r14, %%r15 ;" + "movq $0, %%r8 ;" + "mulx 24(%2), %%r14, %%rdx; " /* A[1]*B[3] */ + "adox %%r13, %%r14 ;" + "adcx %%rax, %%r14 ;" + "movq $0, %%rax ;" + /******************************************/ + "adox %%rdx, %%rax ;" + "adcx %%r8, %%rax ;" + + "movq 16(%1), %%rdx; " /* A[2] */ + "mulx (%2), %%r8, %%r9; " /* A[2]*B[0] */ + "xorl %%r10d, %%r10d ;" + "adcx 16(%0), %%r8 ;" + "movq %%r8, 16(%0) ;" + "mulx 8(%2), %%r10, %%r11; " /* A[2]*B[1] */ + "adox %%r9, %%r10 ;" + "adcx %%r15, %%r10 ;" + "movq %%r10, 24(%0) ;" + "mulx 16(%2), %%r15, %%r13; " /* A[2]*B[2] */ + "adox %%r11, %%r15 ;" + "adcx %%r14, %%r15 ;" + "movq $0, %%r8 ;" + "mulx 24(%2), %%r14, %%rdx; " /* A[2]*B[3] */ + "adox %%r13, %%r14 ;" + "adcx %%rax, %%r14 ;" + "movq $0, %%rax ;" + /******************************************/ + "adox %%rdx, %%rax ;" + "adcx %%r8, %%rax ;" + + "movq 24(%1), %%rdx; " /* A[3] */ + "mulx (%2), %%r8, %%r9; " /* A[3]*B[0] */ + "xorl %%r10d, %%r10d ;" + "adcx 24(%0), %%r8 ;" + "movq %%r8, 24(%0) ;" + "mulx 8(%2), %%r10, %%r11; " /* A[3]*B[1] */ + "adox %%r9, %%r10 ;" + "adcx %%r15, %%r10 ;" + "movq %%r10, 32(%0) ;" + "mulx 16(%2), %%r15, %%r13; " /* A[3]*B[2] */ + "adox %%r11, %%r15 ;" + "adcx %%r14, %%r15 ;" + "movq %%r15, 40(%0) ;" + "movq $0, %%r8 ;" + "mulx 24(%2), %%r14, %%rdx; " /* A[3]*B[3] */ + "adox %%r13, %%r14 ;" + "adcx %%rax, %%r14 ;" + "movq %%r14, 48(%0) ;" + "movq $0, %%rax ;" + /******************************************/ + "adox %%rdx, %%rax ;" + "adcx %%r8, %%rax ;" + "movq %%rax, 56(%0) ;" + : + : "r"(c), "r"(a), "r"(b) + : "memory", "cc", "%rax", "%rdx", "%r8", "%r9", "%r10", "%r11", + "%r13", "%r14", "%r15"); +} + +static void mul_256x256_integer_bmi2(u64 *const c, const u64 *const a, + const u64 *const b) +{ + asm volatile( + "movq (%1), %%rdx; " /* A[0] */ + "mulx (%2), %%r8, %%r15; " /* A[0]*B[0] */ + "movq %%r8, (%0) ;" + "mulx 8(%2), %%r10, %%rax; " /* A[0]*B[1] */ + "addq %%r10, %%r15 ;" + "mulx 16(%2), %%r8, %%rbx; " /* A[0]*B[2] */ + "adcq %%r8, %%rax ;" + "mulx 24(%2), 
%%r10, %%rcx; " /* A[0]*B[3] */ + "adcq %%r10, %%rbx ;" + /******************************************/ + "adcq $0, %%rcx ;" + + "movq 8(%1), %%rdx; " /* A[1] */ + "mulx (%2), %%r8, %%r9; " /* A[1]*B[0] */ + "addq %%r15, %%r8 ;" + "movq %%r8, 8(%0) ;" + "mulx 8(%2), %%r10, %%r11; " /* A[1]*B[1] */ + "adcq %%r10, %%r9 ;" + "mulx 16(%2), %%r8, %%r13; " /* A[1]*B[2] */ + "adcq %%r8, %%r11 ;" + "mulx 24(%2), %%r10, %%r15; " /* A[1]*B[3] */ + "adcq %%r10, %%r13 ;" + /******************************************/ + "adcq $0, %%r15 ;" + + "addq %%r9, %%rax ;" + "adcq %%r11, %%rbx ;" + "adcq %%r13, %%rcx ;" + "adcq $0, %%r15 ;" + + "movq 16(%1), %%rdx; " /* A[2] */ + "mulx (%2), %%r8, %%r9; " /* A[2]*B[0] */ + "addq %%rax, %%r8 ;" + "movq %%r8, 16(%0) ;" + "mulx 8(%2), %%r10, %%r11; " /* A[2]*B[1] */ + "adcq %%r10, %%r9 ;" + "mulx 16(%2), %%r8, %%r13; " /* A[2]*B[2] */ + "adcq %%r8, %%r11 ;" + "mulx 24(%2), %%r10, %%rax; " /* A[2]*B[3] */ + "adcq %%r10, %%r13 ;" + /******************************************/ + "adcq $0, %%rax ;" + + "addq %%r9, %%rbx ;" + "adcq %%r11, %%rcx ;" + "adcq %%r13, %%r15 ;" + "adcq $0, %%rax ;" + + "movq 24(%1), %%rdx; " /* A[3] */ + "mulx (%2), %%r8, %%r9; " /* A[3]*B[0] */ + "addq %%rbx, %%r8 ;" + "movq %%r8, 24(%0) ;" + "mulx 8(%2), %%r10, %%r11; " /* A[3]*B[1] */ + "adcq %%r10, %%r9 ;" + "mulx 16(%2), %%r8, %%r13; " /* A[3]*B[2] */ + "adcq %%r8, %%r11 ;" + "mulx 24(%2), %%r10, %%rbx; " /* A[3]*B[3] */ + "adcq %%r10, %%r13 ;" + /******************************************/ + "adcq $0, %%rbx ;" + + "addq %%r9, %%rcx ;" + "movq %%rcx, 32(%0) ;" + "adcq %%r11, %%r15 ;" + "movq %%r15, 40(%0) ;" + "adcq %%r13, %%rax ;" + "movq %%rax, 48(%0) ;" + "adcq $0, %%rbx ;" + "movq %%rbx, 56(%0) ;" + : + : "r"(c), "r"(a), "r"(b) + : "memory", "cc", "%rax", "%rbx", "%rcx", "%rdx", "%r8", "%r9", + "%r10", "%r11", "%r13", "%r15"); +} + +static void sqr_256x256_integer_adx(u64 *const c, const u64 *const a) +{ + asm volatile( + "movq (%1), %%rdx ;" /* A[0] */ + "mulx 8(%1), %%r8, %%r14 ;" /* A[1]*A[0] */ + "xorl %%r15d, %%r15d;" + "mulx 16(%1), %%r9, %%r10 ;" /* A[2]*A[0] */ + "adcx %%r14, %%r9 ;" + "mulx 24(%1), %%rax, %%rcx ;" /* A[3]*A[0] */ + "adcx %%rax, %%r10 ;" + "movq 24(%1), %%rdx ;" /* A[3] */ + "mulx 8(%1), %%r11, %%rbx ;" /* A[1]*A[3] */ + "adcx %%rcx, %%r11 ;" + "mulx 16(%1), %%rax, %%r13 ;" /* A[2]*A[3] */ + "adcx %%rax, %%rbx ;" + "movq 8(%1), %%rdx ;" /* A[1] */ + "adcx %%r15, %%r13 ;" + "mulx 16(%1), %%rax, %%rcx ;" /* A[2]*A[1] */ + "movq $0, %%r14 ;" + /******************************************/ + "adcx %%r15, %%r14 ;" + + "xorl %%r15d, %%r15d;" + "adox %%rax, %%r10 ;" + "adcx %%r8, %%r8 ;" + "adox %%rcx, %%r11 ;" + "adcx %%r9, %%r9 ;" + "adox %%r15, %%rbx ;" + "adcx %%r10, %%r10 ;" + "adox %%r15, %%r13 ;" + "adcx %%r11, %%r11 ;" + "adox %%r15, %%r14 ;" + "adcx %%rbx, %%rbx ;" + "adcx %%r13, %%r13 ;" + "adcx %%r14, %%r14 ;" + + "movq (%1), %%rdx ;" + "mulx %%rdx, %%rax, %%rcx ;" /* A[0]^2 */ + /*******************/ + "movq %%rax, 0(%0) ;" + "addq %%rcx, %%r8 ;" + "movq %%r8, 8(%0) ;" + "movq 8(%1), %%rdx ;" + "mulx %%rdx, %%rax, %%rcx ;" /* A[1]^2 */ + "adcq %%rax, %%r9 ;" + "movq %%r9, 16(%0) ;" + "adcq %%rcx, %%r10 ;" + "movq %%r10, 24(%0) ;" + "movq 16(%1), %%rdx ;" + "mulx %%rdx, %%rax, %%rcx ;" /* A[2]^2 */ + "adcq %%rax, %%r11 ;" + "movq %%r11, 32(%0) ;" + "adcq %%rcx, %%rbx ;" + "movq %%rbx, 40(%0) ;" + "movq 24(%1), %%rdx ;" + "mulx %%rdx, %%rax, %%rcx ;" /* A[3]^2 */ + "adcq %%rax, %%r13 ;" + "movq %%r13, 48(%0) ;" + "adcq %%rcx, %%r14 ;" + "movq %%r14, 56(%0) ;" 
+ : + : "r"(c), "r"(a) + : "memory", "cc", "%rax", "%rbx", "%rcx", "%rdx", "%r8", "%r9", + "%r10", "%r11", "%r13", "%r14", "%r15"); +} + +static void sqr_256x256_integer_bmi2(u64 *const c, const u64 *const a) +{ + asm volatile( + "movq 8(%1), %%rdx ;" /* A[1] */ + "mulx (%1), %%r8, %%r9 ;" /* A[0]*A[1] */ + "mulx 16(%1), %%r10, %%r11 ;" /* A[2]*A[1] */ + "mulx 24(%1), %%rcx, %%r14 ;" /* A[3]*A[1] */ + + "movq 16(%1), %%rdx ;" /* A[2] */ + "mulx 24(%1), %%r15, %%r13 ;" /* A[3]*A[2] */ + "mulx (%1), %%rax, %%rdx ;" /* A[0]*A[2] */ + + "addq %%rax, %%r9 ;" + "adcq %%rdx, %%r10 ;" + "adcq %%rcx, %%r11 ;" + "adcq %%r14, %%r15 ;" + "adcq $0, %%r13 ;" + "movq $0, %%r14 ;" + "adcq $0, %%r14 ;" + + "movq (%1), %%rdx ;" /* A[0] */ + "mulx 24(%1), %%rax, %%rcx ;" /* A[0]*A[3] */ + + "addq %%rax, %%r10 ;" + "adcq %%rcx, %%r11 ;" + "adcq $0, %%r15 ;" + "adcq $0, %%r13 ;" + "adcq $0, %%r14 ;" + + "shldq $1, %%r13, %%r14 ;" + "shldq $1, %%r15, %%r13 ;" + "shldq $1, %%r11, %%r15 ;" + "shldq $1, %%r10, %%r11 ;" + "shldq $1, %%r9, %%r10 ;" + "shldq $1, %%r8, %%r9 ;" + "shlq $1, %%r8 ;" + + /*******************/ + "mulx %%rdx, %%rax, %%rcx ;" /* A[0]^2 */ + /*******************/ + "movq %%rax, 0(%0) ;" + "addq %%rcx, %%r8 ;" + "movq %%r8, 8(%0) ;" + "movq 8(%1), %%rdx ;" + "mulx %%rdx, %%rax, %%rcx ;" /* A[1]^2 */ + "adcq %%rax, %%r9 ;" + "movq %%r9, 16(%0) ;" + "adcq %%rcx, %%r10 ;" + "movq %%r10, 24(%0) ;" + "movq 16(%1), %%rdx ;" + "mulx %%rdx, %%rax, %%rcx ;" /* A[2]^2 */ + "adcq %%rax, %%r11 ;" + "movq %%r11, 32(%0) ;" + "adcq %%rcx, %%r15 ;" + "movq %%r15, 40(%0) ;" + "movq 24(%1), %%rdx ;" + "mulx %%rdx, %%rax, %%rcx ;" /* A[3]^2 */ + "adcq %%rax, %%r13 ;" + "movq %%r13, 48(%0) ;" + "adcq %%rcx, %%r14 ;" + "movq %%r14, 56(%0) ;" + : + : "r"(c), "r"(a) + : "memory", "cc", "%rax", "%rcx", "%rdx", "%r8", "%r9", "%r10", + "%r11", "%r13", "%r14", "%r15"); +} + +static void red_eltfp25519_1w_adx(u64 *const c, const u64 *const a) +{ + asm volatile( + "movl $38, %%edx ;" /* 2*c = 38 = 2^256 */ + "mulx 32(%1), %%r8, %%r10 ;" /* c*C[4] */ + "xorl %%ebx, %%ebx ;" + "adox (%1), %%r8 ;" + "mulx 40(%1), %%r9, %%r11 ;" /* c*C[5] */ + "adcx %%r10, %%r9 ;" + "adox 8(%1), %%r9 ;" + "mulx 48(%1), %%r10, %%rax ;" /* c*C[6] */ + "adcx %%r11, %%r10 ;" + "adox 16(%1), %%r10 ;" + "mulx 56(%1), %%r11, %%rcx ;" /* c*C[7] */ + "adcx %%rax, %%r11 ;" + "adox 24(%1), %%r11 ;" + /***************************************/ + "adcx %%rbx, %%rcx ;" + "adox %%rbx, %%rcx ;" + "imul %%rdx, %%rcx ;" /* c*C[4], cf=0, of=0 */ + "adcx %%rcx, %%r8 ;" + "adcx %%rbx, %%r9 ;" + "movq %%r9, 8(%0) ;" + "adcx %%rbx, %%r10 ;" + "movq %%r10, 16(%0) ;" + "adcx %%rbx, %%r11 ;" + "movq %%r11, 24(%0) ;" + "mov $0, %%ecx ;" + "cmovc %%edx, %%ecx ;" + "addq %%rcx, %%r8 ;" + "movq %%r8, (%0) ;" + : + : "r"(c), "r"(a) + : "memory", "cc", "%rax", "%rbx", "%rcx", "%rdx", "%r8", "%r9", + "%r10", "%r11"); +} + +static void red_eltfp25519_1w_bmi2(u64 *const c, const u64 *const a) +{ + asm volatile( + "movl $38, %%edx ;" /* 2*c = 38 = 2^256 */ + "mulx 32(%1), %%r8, %%r10 ;" /* c*C[4] */ + "mulx 40(%1), %%r9, %%r11 ;" /* c*C[5] */ + "addq %%r10, %%r9 ;" + "mulx 48(%1), %%r10, %%rax ;" /* c*C[6] */ + "adcq %%r11, %%r10 ;" + "mulx 56(%1), %%r11, %%rcx ;" /* c*C[7] */ + "adcq %%rax, %%r11 ;" + /***************************************/ + "adcq $0, %%rcx ;" + "addq (%1), %%r8 ;" + "adcq 8(%1), %%r9 ;" + "adcq 16(%1), %%r10 ;" + "adcq 24(%1), %%r11 ;" + "adcq $0, %%rcx ;" + "imul %%rdx, %%rcx ;" /* c*C[4], cf=0 */ + "addq %%rcx, %%r8 ;" + "adcq $0, %%r9 ;" + "movq %%r9, 
8(%0) ;" + "adcq $0, %%r10 ;" + "movq %%r10, 16(%0) ;" + "adcq $0, %%r11 ;" + "movq %%r11, 24(%0) ;" + "mov $0, %%ecx ;" + "cmovc %%edx, %%ecx ;" + "addq %%rcx, %%r8 ;" + "movq %%r8, (%0) ;" + : + : "r"(c), "r"(a) + : "memory", "cc", "%rax", "%rcx", "%rdx", "%r8", "%r9", "%r10", + "%r11"); +} + +static __always_inline void +add_eltfp25519_1w_adx(u64 *const c, const u64 *const a, const u64 *const b) +{ + asm volatile( + "mov $38, %%eax ;" + "xorl %%ecx, %%ecx ;" + "movq (%2), %%r8 ;" + "adcx (%1), %%r8 ;" + "movq 8(%2), %%r9 ;" + "adcx 8(%1), %%r9 ;" + "movq 16(%2), %%r10 ;" + "adcx 16(%1), %%r10 ;" + "movq 24(%2), %%r11 ;" + "adcx 24(%1), %%r11 ;" + "cmovc %%eax, %%ecx ;" + "xorl %%eax, %%eax ;" + "adcx %%rcx, %%r8 ;" + "adcx %%rax, %%r9 ;" + "movq %%r9, 8(%0) ;" + "adcx %%rax, %%r10 ;" + "movq %%r10, 16(%0) ;" + "adcx %%rax, %%r11 ;" + "movq %%r11, 24(%0) ;" + "mov $38, %%ecx ;" + "cmovc %%ecx, %%eax ;" + "addq %%rax, %%r8 ;" + "movq %%r8, (%0) ;" + : + : "r"(c), "r"(a), "r"(b) + : "memory", "cc", "%rax", "%rcx", "%r8", "%r9", "%r10", "%r11"); +} + +static __always_inline void +add_eltfp25519_1w_bmi2(u64 *const c, const u64 *const a, const u64 *const b) +{ + asm volatile( + "mov $38, %%eax ;" + "movq (%2), %%r8 ;" + "addq (%1), %%r8 ;" + "movq 8(%2), %%r9 ;" + "adcq 8(%1), %%r9 ;" + "movq 16(%2), %%r10 ;" + "adcq 16(%1), %%r10 ;" + "movq 24(%2), %%r11 ;" + "adcq 24(%1), %%r11 ;" + "mov $0, %%ecx ;" + "cmovc %%eax, %%ecx ;" + "addq %%rcx, %%r8 ;" + "adcq $0, %%r9 ;" + "movq %%r9, 8(%0) ;" + "adcq $0, %%r10 ;" + "movq %%r10, 16(%0) ;" + "adcq $0, %%r11 ;" + "movq %%r11, 24(%0) ;" + "mov $0, %%ecx ;" + "cmovc %%eax, %%ecx ;" + "addq %%rcx, %%r8 ;" + "movq %%r8, (%0) ;" + : + : "r"(c), "r"(a), "r"(b) + : "memory", "cc", "%rax", "%rcx", "%r8", "%r9", "%r10", "%r11"); +} + +static __always_inline void +sub_eltfp25519_1w(u64 *const c, const u64 *const a, const u64 *const b) +{ + asm volatile( + "mov $38, %%eax ;" + "movq (%1), %%r8 ;" + "subq (%2), %%r8 ;" + "movq 8(%1), %%r9 ;" + "sbbq 8(%2), %%r9 ;" + "movq 16(%1), %%r10 ;" + "sbbq 16(%2), %%r10 ;" + "movq 24(%1), %%r11 ;" + "sbbq 24(%2), %%r11 ;" + "mov $0, %%ecx ;" + "cmovc %%eax, %%ecx ;" + "subq %%rcx, %%r8 ;" + "sbbq $0, %%r9 ;" + "movq %%r9, 8(%0) ;" + "sbbq $0, %%r10 ;" + "movq %%r10, 16(%0) ;" + "sbbq $0, %%r11 ;" + "movq %%r11, 24(%0) ;" + "mov $0, %%ecx ;" + "cmovc %%eax, %%ecx ;" + "subq %%rcx, %%r8 ;" + "movq %%r8, (%0) ;" + : + : "r"(c), "r"(a), "r"(b) + : "memory", "cc", "%rax", "%rcx", "%r8", "%r9", "%r10", "%r11"); +} + +/* Multiplication by a24 = (A+2)/4 = (486662+2)/4 = 121666 */ +static __always_inline void +mul_a24_eltfp25519_1w(u64 *const c, const u64 *const a) +{ + const u64 a24 = 121666; + asm volatile( + "movq %2, %%rdx ;" + "mulx (%1), %%r8, %%r10 ;" + "mulx 8(%1), %%r9, %%r11 ;" + "addq %%r10, %%r9 ;" + "mulx 16(%1), %%r10, %%rax ;" + "adcq %%r11, %%r10 ;" + "mulx 24(%1), %%r11, %%rcx ;" + "adcq %%rax, %%r11 ;" + /**************************/ + "adcq $0, %%rcx ;" + "movl $38, %%edx ;" /* 2*c = 38 = 2^256 mod 2^255-19*/ + "imul %%rdx, %%rcx ;" + "addq %%rcx, %%r8 ;" + "adcq $0, %%r9 ;" + "movq %%r9, 8(%0) ;" + "adcq $0, %%r10 ;" + "movq %%r10, 16(%0) ;" + "adcq $0, %%r11 ;" + "movq %%r11, 24(%0) ;" + "mov $0, %%ecx ;" + "cmovc %%edx, %%ecx ;" + "addq %%rcx, %%r8 ;" + "movq %%r8, (%0) ;" + : + : "r"(c), "r"(a), "r"(a24) + : "memory", "cc", "%rax", "%rcx", "%rdx", "%r8", "%r9", "%r10", + "%r11"); +} + +static void inv_eltfp25519_1w_adx(u64 *const c, const u64 *const a) +{ + struct { + eltfp25519_1w_buffer buffer; + 
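/* x0, x1 and x2 back the temporaries T[0], T[2] and T[3] below, while
 * T[1] aliases the output c. The fixed chain that follows computes
 * a^(p-2) mod p for p = 2^255-19, which by Fermat's little theorem is
 * a^(-1); it costs 254 squarings and 11 multiplications. */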
eltfp25519_1w x0, x1, x2; + } __aligned(32) m; + u64 *T[4]; + + T[0] = m.x0; + T[1] = c; /* x^(-1) */ + T[2] = m.x1; + T[3] = m.x2; + + copy_eltfp25519_1w(T[1], a); + sqrn_eltfp25519_1w_adx(T[1], 1); + copy_eltfp25519_1w(T[2], T[1]); + sqrn_eltfp25519_1w_adx(T[2], 2); + mul_eltfp25519_1w_adx(T[0], a, T[2]); + mul_eltfp25519_1w_adx(T[1], T[1], T[0]); + copy_eltfp25519_1w(T[2], T[1]); + sqrn_eltfp25519_1w_adx(T[2], 1); + mul_eltfp25519_1w_adx(T[0], T[0], T[2]); + copy_eltfp25519_1w(T[2], T[0]); + sqrn_eltfp25519_1w_adx(T[2], 5); + mul_eltfp25519_1w_adx(T[0], T[0], T[2]); + copy_eltfp25519_1w(T[2], T[0]); + sqrn_eltfp25519_1w_adx(T[2], 10); + mul_eltfp25519_1w_adx(T[2], T[2], T[0]); + copy_eltfp25519_1w(T[3], T[2]); + sqrn_eltfp25519_1w_adx(T[3], 20); + mul_eltfp25519_1w_adx(T[3], T[3], T[2]); + sqrn_eltfp25519_1w_adx(T[3], 10); + mul_eltfp25519_1w_adx(T[3], T[3], T[0]); + copy_eltfp25519_1w(T[0], T[3]); + sqrn_eltfp25519_1w_adx(T[0], 50); + mul_eltfp25519_1w_adx(T[0], T[0], T[3]); + copy_eltfp25519_1w(T[2], T[0]); + sqrn_eltfp25519_1w_adx(T[2], 100); + mul_eltfp25519_1w_adx(T[2], T[2], T[0]); + sqrn_eltfp25519_1w_adx(T[2], 50); + mul_eltfp25519_1w_adx(T[2], T[2], T[3]); + sqrn_eltfp25519_1w_adx(T[2], 5); + mul_eltfp25519_1w_adx(T[1], T[1], T[2]); + + memzero_explicit(&m, sizeof(m)); +} + +static void inv_eltfp25519_1w_bmi2(u64 *const c, const u64 *const a) +{ + struct { + eltfp25519_1w_buffer buffer; + eltfp25519_1w x0, x1, x2; + } __aligned(32) m; + u64 *T[4]; + + T[0] = m.x0; + T[1] = c; /* x^(-1) */ + T[2] = m.x1; + T[3] = m.x2; + + copy_eltfp25519_1w(T[1], a); + sqrn_eltfp25519_1w_bmi2(T[1], 1); + copy_eltfp25519_1w(T[2], T[1]); + sqrn_eltfp25519_1w_bmi2(T[2], 2); + mul_eltfp25519_1w_bmi2(T[0], a, T[2]); + mul_eltfp25519_1w_bmi2(T[1], T[1], T[0]); + copy_eltfp25519_1w(T[2], T[1]); + sqrn_eltfp25519_1w_bmi2(T[2], 1); + mul_eltfp25519_1w_bmi2(T[0], T[0], T[2]); + copy_eltfp25519_1w(T[2], T[0]); + sqrn_eltfp25519_1w_bmi2(T[2], 5); + mul_eltfp25519_1w_bmi2(T[0], T[0], T[2]); + copy_eltfp25519_1w(T[2], T[0]); + sqrn_eltfp25519_1w_bmi2(T[2], 10); + mul_eltfp25519_1w_bmi2(T[2], T[2], T[0]); + copy_eltfp25519_1w(T[3], T[2]); + sqrn_eltfp25519_1w_bmi2(T[3], 20); + mul_eltfp25519_1w_bmi2(T[3], T[3], T[2]); + sqrn_eltfp25519_1w_bmi2(T[3], 10); + mul_eltfp25519_1w_bmi2(T[3], T[3], T[0]); + copy_eltfp25519_1w(T[0], T[3]); + sqrn_eltfp25519_1w_bmi2(T[0], 50); + mul_eltfp25519_1w_bmi2(T[0], T[0], T[3]); + copy_eltfp25519_1w(T[2], T[0]); + sqrn_eltfp25519_1w_bmi2(T[2], 100); + mul_eltfp25519_1w_bmi2(T[2], T[2], T[0]); + sqrn_eltfp25519_1w_bmi2(T[2], 50); + mul_eltfp25519_1w_bmi2(T[2], T[2], T[3]); + sqrn_eltfp25519_1w_bmi2(T[2], 5); + mul_eltfp25519_1w_bmi2(T[1], T[1], T[2]); + + memzero_explicit(&m, sizeof(m)); +} + +/* Given c, a 256-bit number, fred_eltfp25519_1w updates c + * with a number such that 0 <= c < 2**255-19. + */ +static __always_inline void fred_eltfp25519_1w(u64 *const c) +{ + u64 tmp0 = 38, tmp1 = 19; + asm volatile( + "btrq $63, %3 ;" /* Put bit 255 in carry flag and clear */ + "cmovncl %k5, %k4 ;" /* c[255] ? 38 : 19 */ + + /* Add either 19 or 38 to c */ + "addq %4, %0 ;" + "adcq $0, %1 ;" + "adcq $0, %2 ;" + "adcq $0, %3 ;" + + /* Test for bit 255 again; only triggered on overflow modulo 2^255-19 */ + "movl $0, %k4 ;" + "cmovnsl %k5, %k4 ;" /* c[255] ? 
0 : 19 */ + "btrq $63, %3 ;" /* Clear bit 255 */ + + /* Subtract 19 if necessary */ + "subq %4, %0 ;" + "sbbq $0, %1 ;" + "sbbq $0, %2 ;" + "sbbq $0, %3 ;" + + : "+r"(c[0]), "+r"(c[1]), "+r"(c[2]), "+r"(c[3]), "+r"(tmp0), + "+r"(tmp1) + : + : "memory", "cc"); +} + +static __always_inline void cswap(u8 bit, u64 *const px, u64 *const py) +{ + u64 temp; + asm volatile( + "test %9, %9 ;" + "movq %0, %8 ;" + "cmovnzq %4, %0 ;" + "cmovnzq %8, %4 ;" + "movq %1, %8 ;" + "cmovnzq %5, %1 ;" + "cmovnzq %8, %5 ;" + "movq %2, %8 ;" + "cmovnzq %6, %2 ;" + "cmovnzq %8, %6 ;" + "movq %3, %8 ;" + "cmovnzq %7, %3 ;" + "cmovnzq %8, %7 ;" + : "+r"(px[0]), "+r"(px[1]), "+r"(px[2]), "+r"(px[3]), + "+r"(py[0]), "+r"(py[1]), "+r"(py[2]), "+r"(py[3]), + "=r"(temp) + : "r"(bit) + : "cc" + ); +} + +static __always_inline void cselect(u8 bit, u64 *const px, const u64 *const py) +{ + asm volatile( + "test %4, %4 ;" + "cmovnzq %5, %0 ;" + "cmovnzq %6, %1 ;" + "cmovnzq %7, %2 ;" + "cmovnzq %8, %3 ;" + : "+r"(px[0]), "+r"(px[1]), "+r"(px[2]), "+r"(px[3]) + : "r"(bit), "rm"(py[0]), "rm"(py[1]), "rm"(py[2]), "rm"(py[3]) + : "cc" + ); +} + +static __always_inline void clamp_secret(u8 secret[CURVE25519_KEY_SIZE]) +{ + secret[0] &= 248; + secret[31] &= 127; + secret[31] |= 64; +} + +static void curve25519_adx(u8 shared[CURVE25519_KEY_SIZE], + const u8 private_key[CURVE25519_KEY_SIZE], + const u8 session_key[CURVE25519_KEY_SIZE]) +{ + struct { + u64 buffer[4 * NUM_WORDS_ELTFP25519]; + u64 coordinates[4 * NUM_WORDS_ELTFP25519]; + u64 workspace[6 * NUM_WORDS_ELTFP25519]; + u8 session[CURVE25519_KEY_SIZE]; + u8 private[CURVE25519_KEY_SIZE]; + } __aligned(32) m; + + int i = 0, j = 0; + u64 prev = 0; + u64 *const X1 = (u64 *)m.session; + u64 *const key = (u64 *)m.private; + u64 *const Px = m.coordinates + 0; + u64 *const Pz = m.coordinates + 4; + u64 *const Qx = m.coordinates + 8; + u64 *const Qz = m.coordinates + 12; + u64 *const X2 = Qx; + u64 *const Z2 = Qz; + u64 *const X3 = Px; + u64 *const Z3 = Pz; + u64 *const X2Z2 = Qx; + u64 *const X3Z3 = Px; + + u64 *const A = m.workspace + 0; + u64 *const B = m.workspace + 4; + u64 *const D = m.workspace + 8; + u64 *const C = m.workspace + 12; + u64 *const DA = m.workspace + 16; + u64 *const CB = m.workspace + 20; + u64 *const AB = A; + u64 *const DC = D; + u64 *const DACB = DA; + + memcpy(m.private, private_key, sizeof(m.private)); + memcpy(m.session, session_key, sizeof(m.session)); + + clamp_secret(m.private); + + /* As in the draft: + * When receiving such an array, implementations of curve25519 + * MUST mask the most-significant bit in the final byte. 
This + * is done to preserve compatibility with point formats which + * reserve the sign bit for use in other protocols and to + * increase resistance to implementation fingerprinting + */ + m.session[CURVE25519_KEY_SIZE - 1] &= (1 << (255 % 8)) - 1; + + copy_eltfp25519_1w(Px, X1); + setzero_eltfp25519_1w(Pz); + setzero_eltfp25519_1w(Qx); + setzero_eltfp25519_1w(Qz); + + Pz[0] = 1; + Qx[0] = 1; + + /* main-loop */ + prev = 0; + j = 62; + for (i = 3; i >= 0; --i) { + while (j >= 0) { + u64 bit = (key[i] >> j) & 0x1; + u64 swap = bit ^ prev; + prev = bit; + + add_eltfp25519_1w_adx(A, X2, Z2); /* A = (X2+Z2) */ + sub_eltfp25519_1w(B, X2, Z2); /* B = (X2-Z2) */ + add_eltfp25519_1w_adx(C, X3, Z3); /* C = (X3+Z3) */ + sub_eltfp25519_1w(D, X3, Z3); /* D = (X3-Z3) */ + mul_eltfp25519_2w_adx(DACB, AB, DC); /* [DA|CB] = [A|B]*[D|C] */ + + cselect(swap, A, C); + cselect(swap, B, D); + + sqr_eltfp25519_2w_adx(AB); /* [AA|BB] = [A^2|B^2] */ + add_eltfp25519_1w_adx(X3, DA, CB); /* X3 = (DA+CB) */ + sub_eltfp25519_1w(Z3, DA, CB); /* Z3 = (DA-CB) */ + sqr_eltfp25519_2w_adx(X3Z3); /* [X3|Z3] = [(DA+CB)|(DA+CB)]^2 */ + + copy_eltfp25519_1w(X2, B); /* X2 = B^2 */ + sub_eltfp25519_1w(Z2, A, B); /* Z2 = E = AA-BB */ + + mul_a24_eltfp25519_1w(B, Z2); /* B = a24*E */ + add_eltfp25519_1w_adx(B, B, X2); /* B = a24*E+B */ + mul_eltfp25519_2w_adx(X2Z2, X2Z2, AB); /* [X2|Z2] = [B|E]*[A|a24*E+B] */ + mul_eltfp25519_1w_adx(Z3, Z3, X1); /* Z3 = Z3*X1 */ + --j; + } + j = 63; + } + + inv_eltfp25519_1w_adx(A, Qz); + mul_eltfp25519_1w_adx((u64 *)shared, Qx, A); + fred_eltfp25519_1w((u64 *)shared); + + memzero_explicit(&m, sizeof(m)); +} + +static void curve25519_adx_base(u8 session_key[CURVE25519_KEY_SIZE], + const u8 private_key[CURVE25519_KEY_SIZE]) +{ + struct { + u64 buffer[4 * NUM_WORDS_ELTFP25519]; + u64 coordinates[4 * NUM_WORDS_ELTFP25519]; + u64 workspace[4 * NUM_WORDS_ELTFP25519]; + u8 private[CURVE25519_KEY_SIZE]; + } __aligned(32) m; + + const int ite[4] = { 64, 64, 64, 63 }; + const int q = 3; + u64 swap = 1; + + int i = 0, j = 0, k = 0; + u64 *const key = (u64 *)m.private; + u64 *const Ur1 = m.coordinates + 0; + u64 *const Zr1 = m.coordinates + 4; + u64 *const Ur2 = m.coordinates + 8; + u64 *const Zr2 = m.coordinates + 12; + + u64 *const UZr1 = m.coordinates + 0; + u64 *const ZUr2 = m.coordinates + 8; + + u64 *const A = m.workspace + 0; + u64 *const B = m.workspace + 4; + u64 *const C = m.workspace + 8; + u64 *const D = m.workspace + 12; + + u64 *const AB = m.workspace + 0; + u64 *const CD = m.workspace + 8; + + const u64 *const P = table_ladder_8k; + + memcpy(m.private, private_key, sizeof(m.private)); + + clamp_secret(m.private); + + setzero_eltfp25519_1w(Ur1); + setzero_eltfp25519_1w(Zr1); + setzero_eltfp25519_1w(Zr2); + Ur1[0] = 1; + Zr1[0] = 1; + Zr2[0] = 1; + + /* G-S */ + Ur2[3] = 0x1eaecdeee27cab34UL; + Ur2[2] = 0xadc7a0b9235d48e2UL; + Ur2[1] = 0xbbf095ae14b2edf8UL; + Ur2[0] = 0x7e94e1fec82faabdUL; + + /* main-loop */ + j = q; + for (i = 0; i < NUM_WORDS_ELTFP25519; ++i) { + while (j < ite[i]) { + u64 bit = (key[i] >> j) & 0x1; + k = (64 * i + j - q); + swap = swap ^ bit; + cswap(swap, Ur1, Ur2); + cswap(swap, Zr1, Zr2); + swap = bit; + /* Addition */ + sub_eltfp25519_1w(B, Ur1, Zr1); /* B = Ur1-Zr1 */ + add_eltfp25519_1w_adx(A, Ur1, Zr1); /* A = Ur1+Zr1 */ + mul_eltfp25519_1w_adx(C, &P[4 * k], B); /* C = M0-B */ + sub_eltfp25519_1w(B, A, C); /* B = (Ur1+Zr1) - M*(Ur1-Zr1) */ + add_eltfp25519_1w_adx(A, A, C); /* A = (Ur1+Zr1) + M*(Ur1-Zr1) */ + sqr_eltfp25519_2w_adx(AB); /* A = A^2 | B = B^2 */ + 
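/* Net effect of one fixed-base ladder step (the squaring above together
 * with the multiplication below): with M the k-th entry of the
 * precomputed table_ladder_8k, the new projective pair is
 *   Ur1' = Zr2 * ((Ur1+Zr1) + M*(Ur1-Zr1))^2
 *   Zr1' = Ur2 * ((Ur1+Zr1) - M*(Ur1-Zr1))^2
 * so each secret-key bit costs one table load, one two-way squaring and
 * one two-way multiplication, rather than a full double-and-add step. */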
mul_eltfp25519_2w_adx(UZr1, ZUr2, AB); /* Ur1 = Zr2*A | Zr1 = Ur2*B */ + ++j; + } + j = 0; + } + + /* Doubling */ + for (i = 0; i < q; ++i) { + add_eltfp25519_1w_adx(A, Ur1, Zr1); /* A = Ur1+Zr1 */ + sub_eltfp25519_1w(B, Ur1, Zr1); /* B = Ur1-Zr1 */ + sqr_eltfp25519_2w_adx(AB); /* A = A**2 B = B**2 */ + copy_eltfp25519_1w(C, B); /* C = B */ + sub_eltfp25519_1w(B, A, B); /* B = A-B */ + mul_a24_eltfp25519_1w(D, B); /* D = my_a24*B */ + add_eltfp25519_1w_adx(D, D, C); /* D = D+C */ + mul_eltfp25519_2w_adx(UZr1, AB, CD); /* Ur1 = A*B Zr1 = Zr1*A */ + } + + /* Convert to affine coordinates */ + inv_eltfp25519_1w_adx(A, Zr1); + mul_eltfp25519_1w_adx((u64 *)session_key, Ur1, A); + fred_eltfp25519_1w((u64 *)session_key); + + memzero_explicit(&m, sizeof(m)); +} + +static void curve25519_bmi2(u8 shared[CURVE25519_KEY_SIZE], + const u8 private_key[CURVE25519_KEY_SIZE], + const u8 session_key[CURVE25519_KEY_SIZE]) +{ + struct { + u64 buffer[4 * NUM_WORDS_ELTFP25519]; + u64 coordinates[4 * NUM_WORDS_ELTFP25519]; + u64 workspace[6 * NUM_WORDS_ELTFP25519]; + u8 session[CURVE25519_KEY_SIZE]; + u8 private[CURVE25519_KEY_SIZE]; + } __aligned(32) m; + + int i = 0, j = 0; + u64 prev = 0; + u64 *const X1 = (u64 *)m.session; + u64 *const key = (u64 *)m.private; + u64 *const Px = m.coordinates + 0; + u64 *const Pz = m.coordinates + 4; + u64 *const Qx = m.coordinates + 8; + u64 *const Qz = m.coordinates + 12; + u64 *const X2 = Qx; + u64 *const Z2 = Qz; + u64 *const X3 = Px; + u64 *const Z3 = Pz; + u64 *const X2Z2 = Qx; + u64 *const X3Z3 = Px; + + u64 *const A = m.workspace + 0; + u64 *const B = m.workspace + 4; + u64 *const D = m.workspace + 8; + u64 *const C = m.workspace + 12; + u64 *const DA = m.workspace + 16; + u64 *const CB = m.workspace + 20; + u64 *const AB = A; + u64 *const DC = D; + u64 *const DACB = DA; + + memcpy(m.private, private_key, sizeof(m.private)); + memcpy(m.session, session_key, sizeof(m.session)); + + clamp_secret(m.private); + + /* As in the draft: + * When receiving such an array, implementations of curve25519 + * MUST mask the most-significant bit in the final byte. 
This + * is done to preserve compatibility with point formats which + * reserve the sign bit for use in other protocols and to + * increase resistance to implementation fingerprinting + */ + m.session[CURVE25519_KEY_SIZE - 1] &= (1 << (255 % 8)) - 1; + + copy_eltfp25519_1w(Px, X1); + setzero_eltfp25519_1w(Pz); + setzero_eltfp25519_1w(Qx); + setzero_eltfp25519_1w(Qz); + + Pz[0] = 1; + Qx[0] = 1; + + /* main-loop */ + prev = 0; + j = 62; + for (i = 3; i >= 0; --i) { + while (j >= 0) { + u64 bit = (key[i] >> j) & 0x1; + u64 swap = bit ^ prev; + prev = bit; + + add_eltfp25519_1w_bmi2(A, X2, Z2); /* A = (X2+Z2) */ + sub_eltfp25519_1w(B, X2, Z2); /* B = (X2-Z2) */ + add_eltfp25519_1w_bmi2(C, X3, Z3); /* C = (X3+Z3) */ + sub_eltfp25519_1w(D, X3, Z3); /* D = (X3-Z3) */ + mul_eltfp25519_2w_bmi2(DACB, AB, DC); /* [DA|CB] = [A|B]*[D|C] */ + + cselect(swap, A, C); + cselect(swap, B, D); + + sqr_eltfp25519_2w_bmi2(AB); /* [AA|BB] = [A^2|B^2] */ + add_eltfp25519_1w_bmi2(X3, DA, CB); /* X3 = (DA+CB) */ + sub_eltfp25519_1w(Z3, DA, CB); /* Z3 = (DA-CB) */ + sqr_eltfp25519_2w_bmi2(X3Z3); /* [X3|Z3] = [(DA+CB)|(DA+CB)]^2 */ + + copy_eltfp25519_1w(X2, B); /* X2 = B^2 */ + sub_eltfp25519_1w(Z2, A, B); /* Z2 = E = AA-BB */ + + mul_a24_eltfp25519_1w(B, Z2); /* B = a24*E */ + add_eltfp25519_1w_bmi2(B, B, X2); /* B = a24*E+B */ + mul_eltfp25519_2w_bmi2(X2Z2, X2Z2, AB); /* [X2|Z2] = [B|E]*[A|a24*E+B] */ + mul_eltfp25519_1w_bmi2(Z3, Z3, X1); /* Z3 = Z3*X1 */ + --j; + } + j = 63; + } + + inv_eltfp25519_1w_bmi2(A, Qz); + mul_eltfp25519_1w_bmi2((u64 *)shared, Qx, A); + fred_eltfp25519_1w((u64 *)shared); + + memzero_explicit(&m, sizeof(m)); +} + +static void curve25519_bmi2_base(u8 session_key[CURVE25519_KEY_SIZE], + const u8 private_key[CURVE25519_KEY_SIZE]) +{ + struct { + u64 buffer[4 * NUM_WORDS_ELTFP25519]; + u64 coordinates[4 * NUM_WORDS_ELTFP25519]; + u64 workspace[4 * NUM_WORDS_ELTFP25519]; + u8 private[CURVE25519_KEY_SIZE]; + } __aligned(32) m; + + const int ite[4] = { 64, 64, 64, 63 }; + const int q = 3; + u64 swap = 1; + + int i = 0, j = 0, k = 0; + u64 *const key = (u64 *)m.private; + u64 *const Ur1 = m.coordinates + 0; + u64 *const Zr1 = m.coordinates + 4; + u64 *const Ur2 = m.coordinates + 8; + u64 *const Zr2 = m.coordinates + 12; + + u64 *const UZr1 = m.coordinates + 0; + u64 *const ZUr2 = m.coordinates + 8; + + u64 *const A = m.workspace + 0; + u64 *const B = m.workspace + 4; + u64 *const C = m.workspace + 8; + u64 *const D = m.workspace + 12; + + u64 *const AB = m.workspace + 0; + u64 *const CD = m.workspace + 8; + + const u64 *const P = table_ladder_8k; + + memcpy(m.private, private_key, sizeof(m.private)); + + clamp_secret(m.private); + + setzero_eltfp25519_1w(Ur1); + setzero_eltfp25519_1w(Zr1); + setzero_eltfp25519_1w(Zr2); + Ur1[0] = 1; + Zr1[0] = 1; + Zr2[0] = 1; + + /* G-S */ + Ur2[3] = 0x1eaecdeee27cab34UL; + Ur2[2] = 0xadc7a0b9235d48e2UL; + Ur2[1] = 0xbbf095ae14b2edf8UL; + Ur2[0] = 0x7e94e1fec82faabdUL; + + /* main-loop */ + j = q; + for (i = 0; i < NUM_WORDS_ELTFP25519; ++i) { + while (j < ite[i]) { + u64 bit = (key[i] >> j) & 0x1; + k = (64 * i + j - q); + swap = swap ^ bit; + cswap(swap, Ur1, Ur2); + cswap(swap, Zr1, Zr2); + swap = bit; + /* Addition */ + sub_eltfp25519_1w(B, Ur1, Zr1); /* B = Ur1-Zr1 */ + add_eltfp25519_1w_bmi2(A, Ur1, Zr1); /* A = Ur1+Zr1 */ + mul_eltfp25519_1w_bmi2(C, &P[4 * k], B);/* C = M0-B */ + sub_eltfp25519_1w(B, A, C); /* B = (Ur1+Zr1) - M*(Ur1-Zr1) */ + add_eltfp25519_1w_bmi2(A, A, C); /* A = (Ur1+Zr1) + M*(Ur1-Zr1) */ + sqr_eltfp25519_2w_bmi2(AB); /* A = A^2 | B = 
B^2 */ + mul_eltfp25519_2w_bmi2(UZr1, ZUr2, AB); /* Ur1 = Zr2*A | Zr1 = Ur2*B */ + ++j; + } + j = 0; + } + + /* Doubling */ + for (i = 0; i < q; ++i) { + add_eltfp25519_1w_bmi2(A, Ur1, Zr1); /* A = Ur1+Zr1 */ + sub_eltfp25519_1w(B, Ur1, Zr1); /* B = Ur1-Zr1 */ + sqr_eltfp25519_2w_bmi2(AB); /* A = A**2 B = B**2 */ + copy_eltfp25519_1w(C, B); /* C = B */ + sub_eltfp25519_1w(B, A, B); /* B = A-B */ + mul_a24_eltfp25519_1w(D, B); /* D = my_a24*B */ + add_eltfp25519_1w_bmi2(D, D, C); /* D = D+C */ + mul_eltfp25519_2w_bmi2(UZr1, AB, CD); /* Ur1 = A*B Zr1 = Zr1*A */ + } + + /* Convert to affine coordinates */ + inv_eltfp25519_1w_bmi2(A, Zr1); + mul_eltfp25519_1w_bmi2((u64 *)session_key, Ur1, A); + fred_eltfp25519_1w((u64 *)session_key); + + memzero_explicit(&m, sizeof(m)); +} diff --git a/lib/zinc/curve25519/curve25519.c b/lib/zinc/curve25519/curve25519.c index 2f613d2a7519..4f9c45ba126d 100644 --- a/lib/zinc/curve25519/curve25519.c +++ b/lib/zinc/curve25519/curve25519.c @@ -20,6 +20,9 @@ #include #include // For crypto_memneq. +#if defined(CONFIG_ZINC_ARCH_X86_64) +#include "curve25519-x86_64-glue.c" +#else static bool *const curve25519_nobs[] __initconst = { }; static void __init curve25519_fpu_init(void) { @@ -35,6 +38,7 @@ static inline bool curve25519_base_arch(u8 pub[CURVE25519_KEY_SIZE], { return false; } +#endif static __always_inline void normalize_secret(u8 secret[CURVE25519_KEY_SIZE]) { 
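The hunk above swaps the generic no-op stubs for an #include of the x86_64 glue file whenever CONFIG_ZINC_ARCH_X86_64 is set. That glue file is not part of this excerpt; what follows is a minimal sketch of the dispatch it needs to provide, matching the stub signatures replaced above and the entry points defined in this patch. The use of boot_cpu_has() with X86_FEATURE_BMI2 and X86_FEATURE_ADX is an assumption here, chosen because the patch defines separate _bmi2 and _adx function families; the actual file may differ. Types such as u8 and CURVE25519_KEY_SIZE come from the surrounding curve25519.c into which the glue is included.

#include <asm/cpufeature.h> /* boot_cpu_has() -- assumed feature probe */

static bool curve25519_use_bmi2 __ro_after_init;
static bool curve25519_use_adx __ro_after_init;

static void __init curve25519_fpu_init(void)
{
	curve25519_use_bmi2 = boot_cpu_has(X86_FEATURE_BMI2);
	/* The _adx bodies also use mulx, so require BMI2 as well. */
	curve25519_use_adx = boot_cpu_has(X86_FEATURE_BMI2) &&
			     boot_cpu_has(X86_FEATURE_ADX);
}

static inline bool curve25519_arch(u8 shared[CURVE25519_KEY_SIZE],
				   const u8 private_key[CURVE25519_KEY_SIZE],
				   const u8 session_key[CURVE25519_KEY_SIZE])
{
	if (curve25519_use_adx)
		curve25519_adx(shared, private_key, session_key);
	else if (curve25519_use_bmi2)
		curve25519_bmi2(shared, private_key, session_key);
	else
		return false; /* Fall back to the generic C implementation. */
	return true;
}

static inline bool curve25519_base_arch(u8 pub[CURVE25519_KEY_SIZE],
					const u8 secret[CURVE25519_KEY_SIZE])
{
	if (curve25519_use_adx)
		curve25519_adx_base(pub, secret);
	else if (curve25519_use_bmi2)
		curve25519_bmi2_base(pub, secret);
	else
		return false;
	return true;
}

Returning false from either dispatcher lets the caller in curve25519.c fall through to the portable implementation, which is why the stubs above simply return false on non-x86_64 builds.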
From patchwork Sat Oct 6 02:57:04 2018 X-Patchwork-Submitter: "Jason A. Donenfeld" X-Patchwork-Id: 148318 From: "Jason A. Donenfeld" To: linux-kernel@vger.kernel.org, netdev@vger.kernel.org, davem@davemloft.net, gregkh@linuxfoundation.org Cc: "Jason A. Donenfeld" , Russell King , linux-arm-kernel@lists.infradead.org, Samuel Neves , Jean-Philippe Aumasson , Andy Lutomirski , Andrew Morton , Linus Torvalds , kernel-hardening@lists.openwall.com, linux-crypto@vger.kernel.org Subject: [PATCH net-next v7 23/28] zinc: import Bernstein and Schwabe's Curve25519 ARM implementation Date: Sat, 6 Oct 2018 04:57:04 +0200 Message-Id: <20181006025709.4019-24-Jason@zx2c4.com> In-Reply-To: <20181006025709.4019-1-Jason@zx2c4.com> References: <20181006025709.4019-1-Jason@zx2c4.com> This comes from Dan Bernstein and Peter Schwabe's public domain NEON code, and is included here in raw form so that the subsequent commit, which adapts it for the kernel, can show exactly what has changed. Compared with the SUPERCOP original, this import does have some entirely cosmetic formatting differences, such as added indentation, so that when the code is actually ported for kernel use in the subsequent commit, it's obvious what changed in the process. This code originates from SUPERCOP 20180818, available at . Signed-off-by: Jason A. 
Donenfeld Cc: Russell King Cc: linux-arm-kernel@lists.infradead.org Cc: Samuel Neves Cc: Jean-Philippe Aumasson Cc: Andy Lutomirski Cc: Greg KH Cc: Andrew Morton Cc: Linus Torvalds Cc: kernel-hardening@lists.openwall.com Cc: linux-crypto@vger.kernel.org --- lib/zinc/curve25519/curve25519-arm-supercop.S | 2105 +++++++++++++++++ 1 file changed, 2105 insertions(+) create mode 100644 lib/zinc/curve25519/curve25519-arm-supercop.S -- 2.19.0 diff --git a/lib/zinc/curve25519/curve25519-arm-supercop.S b/lib/zinc/curve25519/curve25519-arm-supercop.S new file mode 100644 index 000000000000..f33b85fef382 --- /dev/null +++ b/lib/zinc/curve25519/curve25519-arm-supercop.S @@ -0,0 +1,2105 @@ +/* + * Public domain code from Daniel J. Bernstein and Peter Schwabe, from + * SUPERCOP's curve25519/neon2/scalarmult.s. + */ + +.fpu neon +.text +.align 4 +.global _crypto_scalarmult_curve25519_neon2 +.global crypto_scalarmult_curve25519_neon2 +.type _crypto_scalarmult_curve25519_neon2 STT_FUNC +.type crypto_scalarmult_curve25519_neon2 STT_FUNC + _crypto_scalarmult_curve25519_neon2: + crypto_scalarmult_curve25519_neon2: + vpush {q4, q5, q6, q7} + mov r12, sp + sub sp, sp, #736 + and sp, sp, #0xffffffe0 + strd r4, [sp, #0] + strd r6, [sp, #8] + strd r8, [sp, #16] + strd r10, [sp, #24] + str r12, [sp, #480] + str r14, [sp, #484] + mov r0, r0 + mov r1, r1 + mov r2, r2 + add r3, sp, #32 + ldr r4, =0 + ldr r5, =254 + vmov.i32 q0, #1 + vshr.u64 q1, q0, #7 + vshr.u64 q0, q0, #8 + vmov.i32 d4, #19 + vmov.i32 d5, #38 + add r6, sp, #512 + vst1.8 {d2-d3}, [r6, : 128] + add r6, sp, #528 + vst1.8 {d0-d1}, [r6, : 128] + add r6, sp, #544 + vst1.8 {d4-d5}, [r6, : 128] + add r6, r3, #0 + vmov.i32 q2, #0 + vst1.8 {d4-d5}, [r6, : 128]! + vst1.8 {d4-d5}, [r6, : 128]! + vst1.8 d4, [r6, : 64] + add r6, r3, #0 + ldr r7, =960 + sub r7, r7, #2 + neg r7, r7 + sub r7, r7, r7, LSL #7 + str r7, [r6] + add r6, sp, #704 + vld1.8 {d4-d5}, [r1]! + vld1.8 {d6-d7}, [r1] + vst1.8 {d4-d5}, [r6, : 128]! 
+ vst1.8 {d6-d7}, [r6, : 128] + sub r1, r6, #16 + ldrb r6, [r1] + and r6, r6, #248 + strb r6, [r1] + ldrb r6, [r1, #31] + and r6, r6, #127 + orr r6, r6, #64 + strb r6, [r1, #31] + vmov.i64 q2, #0xffffffff + vshr.u64 q3, q2, #7 + vshr.u64 q2, q2, #6 + vld1.8 {d8}, [r2] + vld1.8 {d10}, [r2] + add r2, r2, #6 + vld1.8 {d12}, [r2] + vld1.8 {d14}, [r2] + add r2, r2, #6 + vld1.8 {d16}, [r2] + add r2, r2, #4 + vld1.8 {d18}, [r2] + vld1.8 {d20}, [r2] + add r2, r2, #6 + vld1.8 {d22}, [r2] + add r2, r2, #2 + vld1.8 {d24}, [r2] + vld1.8 {d26}, [r2] + vshr.u64 q5, q5, #26 + vshr.u64 q6, q6, #3 + vshr.u64 q7, q7, #29 + vshr.u64 q8, q8, #6 + vshr.u64 q10, q10, #25 + vshr.u64 q11, q11, #3 + vshr.u64 q12, q12, #12 + vshr.u64 q13, q13, #38 + vand q4, q4, q2 + vand q6, q6, q2 + vand q8, q8, q2 + vand q10, q10, q2 + vand q2, q12, q2 + vand q5, q5, q3 + vand q7, q7, q3 + vand q9, q9, q3 + vand q11, q11, q3 + vand q3, q13, q3 + add r2, r3, #48 + vadd.i64 q12, q4, q1 + vadd.i64 q13, q10, q1 + vshr.s64 q12, q12, #26 + vshr.s64 q13, q13, #26 + vadd.i64 q5, q5, q12 + vshl.i64 q12, q12, #26 + vadd.i64 q14, q5, q0 + vadd.i64 q11, q11, q13 + vshl.i64 q13, q13, #26 + vadd.i64 q15, q11, q0 + vsub.i64 q4, q4, q12 + vshr.s64 q12, q14, #25 + vsub.i64 q10, q10, q13 + vshr.s64 q13, q15, #25 + vadd.i64 q6, q6, q12 + vshl.i64 q12, q12, #25 + vadd.i64 q14, q6, q1 + vadd.i64 q2, q2, q13 + vsub.i64 q5, q5, q12 + vshr.s64 q12, q14, #26 + vshl.i64 q13, q13, #25 + vadd.i64 q14, q2, q1 + vadd.i64 q7, q7, q12 + vshl.i64 q12, q12, #26 + vadd.i64 q15, q7, q0 + vsub.i64 q11, q11, q13 + vshr.s64 q13, q14, #26 + vsub.i64 q6, q6, q12 + vshr.s64 q12, q15, #25 + vadd.i64 q3, q3, q13 + vshl.i64 q13, q13, #26 + vadd.i64 q14, q3, q0 + vadd.i64 q8, q8, q12 + vshl.i64 q12, q12, #25 + vadd.i64 q15, q8, q1 + add r2, r2, #8 + vsub.i64 q2, q2, q13 + vshr.s64 q13, q14, #25 + vsub.i64 q7, q7, q12 + vshr.s64 q12, q15, #26 + vadd.i64 q14, q13, q13 + vadd.i64 q9, q9, q12 + vtrn.32 d12, d14 + vshl.i64 q12, q12, #26 + vtrn.32 d13, d15 + vadd.i64 q0, q9, q0 + vadd.i64 q4, q4, q14 + vst1.8 d12, [r2, : 64]! + vshl.i64 q6, q13, #4 + vsub.i64 q7, q8, q12 + vshr.s64 q0, q0, #25 + vadd.i64 q4, q4, q6 + vadd.i64 q6, q10, q0 + vshl.i64 q0, q0, #25 + vadd.i64 q8, q6, q1 + vadd.i64 q4, q4, q13 + vshl.i64 q10, q13, #25 + vadd.i64 q1, q4, q1 + vsub.i64 q0, q9, q0 + vshr.s64 q8, q8, #26 + vsub.i64 q3, q3, q10 + vtrn.32 d14, d0 + vshr.s64 q1, q1, #26 + vtrn.32 d15, d1 + vadd.i64 q0, q11, q8 + vst1.8 d14, [r2, : 64] + vshl.i64 q7, q8, #26 + vadd.i64 q5, q5, q1 + vtrn.32 d4, d6 + vshl.i64 q1, q1, #26 + vtrn.32 d5, d7 + vsub.i64 q3, q6, q7 + add r2, r2, #16 + vsub.i64 q1, q4, q1 + vst1.8 d4, [r2, : 64] + vtrn.32 d6, d0 + vtrn.32 d7, d1 + sub r2, r2, #8 + vtrn.32 d2, d10 + vtrn.32 d3, d11 + vst1.8 d6, [r2, : 64] + sub r2, r2, #24 + vst1.8 d2, [r2, : 64] + add r2, r3, #96 + vmov.i32 q0, #0 + vmov.i64 d2, #0xff + vmov.i64 d3, #0 + vshr.u32 q1, q1, #7 + vst1.8 {d2-d3}, [r2, : 128]! + vst1.8 {d0-d1}, [r2, : 128]! + vst1.8 d0, [r2, : 64] + add r2, r3, #144 + vmov.i32 q0, #0 + vst1.8 {d0-d1}, [r2, : 128]! + vst1.8 {d0-d1}, [r2, : 128]! + vst1.8 d0, [r2, : 64] + add r2, r3, #240 + vmov.i32 q0, #0 + vmov.i64 d2, #0xff + vmov.i64 d3, #0 + vshr.u32 q1, q1, #7 + vst1.8 {d2-d3}, [r2, : 128]! + vst1.8 {d0-d1}, [r2, : 128]! + vst1.8 d0, [r2, : 64] + add r2, r3, #48 + add r6, r3, #192 + vld1.8 {d0-d1}, [r2, : 128]! + vld1.8 {d2-d3}, [r2, : 128]! + vld1.8 {d4}, [r2, : 64] + vst1.8 {d0-d1}, [r6, : 128]! + vst1.8 {d2-d3}, [r6, : 128]! 
+ vst1.8 d4, [r6, : 64] +._mainloop: + mov r2, r5, LSR #3 + and r6, r5, #7 + ldrb r2, [r1, r2] + mov r2, r2, LSR r6 + and r2, r2, #1 + str r5, [sp, #488] + eor r4, r4, r2 + str r2, [sp, #492] + neg r2, r4 + add r4, r3, #96 + add r5, r3, #192 + add r6, r3, #144 + vld1.8 {d8-d9}, [r4, : 128]! + add r7, r3, #240 + vld1.8 {d10-d11}, [r5, : 128]! + veor q6, q4, q5 + vld1.8 {d14-d15}, [r6, : 128]! + vdup.i32 q8, r2 + vld1.8 {d18-d19}, [r7, : 128]! + veor q10, q7, q9 + vld1.8 {d22-d23}, [r4, : 128]! + vand q6, q6, q8 + vld1.8 {d24-d25}, [r5, : 128]! + vand q10, q10, q8 + vld1.8 {d26-d27}, [r6, : 128]! + veor q4, q4, q6 + vld1.8 {d28-d29}, [r7, : 128]! + veor q5, q5, q6 + vld1.8 {d0}, [r4, : 64] + veor q6, q7, q10 + vld1.8 {d2}, [r5, : 64] + veor q7, q9, q10 + vld1.8 {d4}, [r6, : 64] + veor q9, q11, q12 + vld1.8 {d6}, [r7, : 64] + veor q10, q0, q1 + sub r2, r4, #32 + vand q9, q9, q8 + sub r4, r5, #32 + vand q10, q10, q8 + sub r5, r6, #32 + veor q11, q11, q9 + sub r6, r7, #32 + veor q0, q0, q10 + veor q9, q12, q9 + veor q1, q1, q10 + veor q10, q13, q14 + veor q12, q2, q3 + vand q10, q10, q8 + vand q8, q12, q8 + veor q12, q13, q10 + veor q2, q2, q8 + veor q10, q14, q10 + veor q3, q3, q8 + vadd.i32 q8, q4, q6 + vsub.i32 q4, q4, q6 + vst1.8 {d16-d17}, [r2, : 128]! + vadd.i32 q6, q11, q12 + vst1.8 {d8-d9}, [r5, : 128]! + vsub.i32 q4, q11, q12 + vst1.8 {d12-d13}, [r2, : 128]! + vadd.i32 q6, q0, q2 + vst1.8 {d8-d9}, [r5, : 128]! + vsub.i32 q0, q0, q2 + vst1.8 d12, [r2, : 64] + vadd.i32 q2, q5, q7 + vst1.8 d0, [r5, : 64] + vsub.i32 q0, q5, q7 + vst1.8 {d4-d5}, [r4, : 128]! + vadd.i32 q2, q9, q10 + vst1.8 {d0-d1}, [r6, : 128]! + vsub.i32 q0, q9, q10 + vst1.8 {d4-d5}, [r4, : 128]! + vadd.i32 q2, q1, q3 + vst1.8 {d0-d1}, [r6, : 128]! + vsub.i32 q0, q1, q3 + vst1.8 d4, [r4, : 64] + vst1.8 d0, [r6, : 64] + add r2, sp, #544 + add r4, r3, #96 + add r5, r3, #144 + vld1.8 {d0-d1}, [r2, : 128] + vld1.8 {d2-d3}, [r4, : 128]! + vld1.8 {d4-d5}, [r5, : 128]! + vzip.i32 q1, q2 + vld1.8 {d6-d7}, [r4, : 128]! + vld1.8 {d8-d9}, [r5, : 128]! 
+ vshl.i32 q5, q1, #1 + vzip.i32 q3, q4 + vshl.i32 q6, q2, #1 + vld1.8 {d14}, [r4, : 64] + vshl.i32 q8, q3, #1 + vld1.8 {d15}, [r5, : 64] + vshl.i32 q9, q4, #1 + vmul.i32 d21, d7, d1 + vtrn.32 d14, d15 + vmul.i32 q11, q4, q0 + vmul.i32 q0, q7, q0 + vmull.s32 q12, d2, d2 + vmlal.s32 q12, d11, d1 + vmlal.s32 q12, d12, d0 + vmlal.s32 q12, d13, d23 + vmlal.s32 q12, d16, d22 + vmlal.s32 q12, d7, d21 + vmull.s32 q10, d2, d11 + vmlal.s32 q10, d4, d1 + vmlal.s32 q10, d13, d0 + vmlal.s32 q10, d6, d23 + vmlal.s32 q10, d17, d22 + vmull.s32 q13, d10, d4 + vmlal.s32 q13, d11, d3 + vmlal.s32 q13, d13, d1 + vmlal.s32 q13, d16, d0 + vmlal.s32 q13, d17, d23 + vmlal.s32 q13, d8, d22 + vmull.s32 q1, d10, d5 + vmlal.s32 q1, d11, d4 + vmlal.s32 q1, d6, d1 + vmlal.s32 q1, d17, d0 + vmlal.s32 q1, d8, d23 + vmull.s32 q14, d10, d6 + vmlal.s32 q14, d11, d13 + vmlal.s32 q14, d4, d4 + vmlal.s32 q14, d17, d1 + vmlal.s32 q14, d18, d0 + vmlal.s32 q14, d9, d23 + vmull.s32 q11, d10, d7 + vmlal.s32 q11, d11, d6 + vmlal.s32 q11, d12, d5 + vmlal.s32 q11, d8, d1 + vmlal.s32 q11, d19, d0 + vmull.s32 q15, d10, d8 + vmlal.s32 q15, d11, d17 + vmlal.s32 q15, d12, d6 + vmlal.s32 q15, d13, d5 + vmlal.s32 q15, d19, d1 + vmlal.s32 q15, d14, d0 + vmull.s32 q2, d10, d9 + vmlal.s32 q2, d11, d8 + vmlal.s32 q2, d12, d7 + vmlal.s32 q2, d13, d6 + vmlal.s32 q2, d14, d1 + vmull.s32 q0, d15, d1 + vmlal.s32 q0, d10, d14 + vmlal.s32 q0, d11, d19 + vmlal.s32 q0, d12, d8 + vmlal.s32 q0, d13, d17 + vmlal.s32 q0, d6, d6 + add r2, sp, #512 + vld1.8 {d18-d19}, [r2, : 128] + vmull.s32 q3, d16, d7 + vmlal.s32 q3, d10, d15 + vmlal.s32 q3, d11, d14 + vmlal.s32 q3, d12, d9 + vmlal.s32 q3, d13, d8 + add r2, sp, #528 + vld1.8 {d8-d9}, [r2, : 128] + vadd.i64 q5, q12, q9 + vadd.i64 q6, q15, q9 + vshr.s64 q5, q5, #26 + vshr.s64 q6, q6, #26 + vadd.i64 q7, q10, q5 + vshl.i64 q5, q5, #26 + vadd.i64 q8, q7, q4 + vadd.i64 q2, q2, q6 + vshl.i64 q6, q6, #26 + vadd.i64 q10, q2, q4 + vsub.i64 q5, q12, q5 + vshr.s64 q8, q8, #25 + vsub.i64 q6, q15, q6 + vshr.s64 q10, q10, #25 + vadd.i64 q12, q13, q8 + vshl.i64 q8, q8, #25 + vadd.i64 q13, q12, q9 + vadd.i64 q0, q0, q10 + vsub.i64 q7, q7, q8 + vshr.s64 q8, q13, #26 + vshl.i64 q10, q10, #25 + vadd.i64 q13, q0, q9 + vadd.i64 q1, q1, q8 + vshl.i64 q8, q8, #26 + vadd.i64 q15, q1, q4 + vsub.i64 q2, q2, q10 + vshr.s64 q10, q13, #26 + vsub.i64 q8, q12, q8 + vshr.s64 q12, q15, #25 + vadd.i64 q3, q3, q10 + vshl.i64 q10, q10, #26 + vadd.i64 q13, q3, q4 + vadd.i64 q14, q14, q12 + add r2, r3, #288 + vshl.i64 q12, q12, #25 + add r4, r3, #336 + vadd.i64 q15, q14, q9 + add r2, r2, #8 + vsub.i64 q0, q0, q10 + add r4, r4, #8 + vshr.s64 q10, q13, #25 + vsub.i64 q1, q1, q12 + vshr.s64 q12, q15, #26 + vadd.i64 q13, q10, q10 + vadd.i64 q11, q11, q12 + vtrn.32 d16, d2 + vshl.i64 q12, q12, #26 + vtrn.32 d17, d3 + vadd.i64 q1, q11, q4 + vadd.i64 q4, q5, q13 + vst1.8 d16, [r2, : 64]! + vshl.i64 q5, q10, #4 + vst1.8 d17, [r4, : 64]! 
+ vsub.i64 q8, q14, q12 + vshr.s64 q1, q1, #25 + vadd.i64 q4, q4, q5 + vadd.i64 q5, q6, q1 + vshl.i64 q1, q1, #25 + vadd.i64 q6, q5, q9 + vadd.i64 q4, q4, q10 + vshl.i64 q10, q10, #25 + vadd.i64 q9, q4, q9 + vsub.i64 q1, q11, q1 + vshr.s64 q6, q6, #26 + vsub.i64 q3, q3, q10 + vtrn.32 d16, d2 + vshr.s64 q9, q9, #26 + vtrn.32 d17, d3 + vadd.i64 q1, q2, q6 + vst1.8 d16, [r2, : 64] + vshl.i64 q2, q6, #26 + vst1.8 d17, [r4, : 64] + vadd.i64 q6, q7, q9 + vtrn.32 d0, d6 + vshl.i64 q7, q9, #26 + vtrn.32 d1, d7 + vsub.i64 q2, q5, q2 + add r2, r2, #16 + vsub.i64 q3, q4, q7 + vst1.8 d0, [r2, : 64] + add r4, r4, #16 + vst1.8 d1, [r4, : 64] + vtrn.32 d4, d2 + vtrn.32 d5, d3 + sub r2, r2, #8 + sub r4, r4, #8 + vtrn.32 d6, d12 + vtrn.32 d7, d13 + vst1.8 d4, [r2, : 64] + vst1.8 d5, [r4, : 64] + sub r2, r2, #24 + sub r4, r4, #24 + vst1.8 d6, [r2, : 64] + vst1.8 d7, [r4, : 64] + add r2, r3, #240 + add r4, r3, #96 + vld1.8 {d0-d1}, [r4, : 128]! + vld1.8 {d2-d3}, [r4, : 128]! + vld1.8 {d4}, [r4, : 64] + add r4, r3, #144 + vld1.8 {d6-d7}, [r4, : 128]! + vtrn.32 q0, q3 + vld1.8 {d8-d9}, [r4, : 128]! + vshl.i32 q5, q0, #4 + vtrn.32 q1, q4 + vshl.i32 q6, q3, #4 + vadd.i32 q5, q5, q0 + vadd.i32 q6, q6, q3 + vshl.i32 q7, q1, #4 + vld1.8 {d5}, [r4, : 64] + vshl.i32 q8, q4, #4 + vtrn.32 d4, d5 + vadd.i32 q7, q7, q1 + vadd.i32 q8, q8, q4 + vld1.8 {d18-d19}, [r2, : 128]! + vshl.i32 q10, q2, #4 + vld1.8 {d22-d23}, [r2, : 128]! + vadd.i32 q10, q10, q2 + vld1.8 {d24}, [r2, : 64] + vadd.i32 q5, q5, q0 + add r2, r3, #192 + vld1.8 {d26-d27}, [r2, : 128]! + vadd.i32 q6, q6, q3 + vld1.8 {d28-d29}, [r2, : 128]! + vadd.i32 q8, q8, q4 + vld1.8 {d25}, [r2, : 64] + vadd.i32 q10, q10, q2 + vtrn.32 q9, q13 + vadd.i32 q7, q7, q1 + vadd.i32 q5, q5, q0 + vtrn.32 q11, q14 + vadd.i32 q6, q6, q3 + add r2, sp, #560 + vadd.i32 q10, q10, q2 + vtrn.32 d24, d25 + vst1.8 {d12-d13}, [r2, : 128] + vshl.i32 q6, q13, #1 + add r2, sp, #576 + vst1.8 {d20-d21}, [r2, : 128] + vshl.i32 q10, q14, #1 + add r2, sp, #592 + vst1.8 {d12-d13}, [r2, : 128] + vshl.i32 q15, q12, #1 + vadd.i32 q8, q8, q4 + vext.32 d10, d31, d30, #0 + vadd.i32 q7, q7, q1 + add r2, sp, #608 + vst1.8 {d16-d17}, [r2, : 128] + vmull.s32 q8, d18, d5 + vmlal.s32 q8, d26, d4 + vmlal.s32 q8, d19, d9 + vmlal.s32 q8, d27, d3 + vmlal.s32 q8, d22, d8 + vmlal.s32 q8, d28, d2 + vmlal.s32 q8, d23, d7 + vmlal.s32 q8, d29, d1 + vmlal.s32 q8, d24, d6 + vmlal.s32 q8, d25, d0 + add r2, sp, #624 + vst1.8 {d14-d15}, [r2, : 128] + vmull.s32 q2, d18, d4 + vmlal.s32 q2, d12, d9 + vmlal.s32 q2, d13, d8 + vmlal.s32 q2, d19, d3 + vmlal.s32 q2, d22, d2 + vmlal.s32 q2, d23, d1 + vmlal.s32 q2, d24, d0 + add r2, sp, #640 + vst1.8 {d20-d21}, [r2, : 128] + vmull.s32 q7, d18, d9 + vmlal.s32 q7, d26, d3 + vmlal.s32 q7, d19, d8 + vmlal.s32 q7, d27, d2 + vmlal.s32 q7, d22, d7 + vmlal.s32 q7, d28, d1 + vmlal.s32 q7, d23, d6 + vmlal.s32 q7, d29, d0 + add r2, sp, #656 + vst1.8 {d10-d11}, [r2, : 128] + vmull.s32 q5, d18, d3 + vmlal.s32 q5, d19, d2 + vmlal.s32 q5, d22, d1 + vmlal.s32 q5, d23, d0 + vmlal.s32 q5, d12, d8 + add r2, sp, #672 + vst1.8 {d16-d17}, [r2, : 128] + vmull.s32 q4, d18, d8 + vmlal.s32 q4, d26, d2 + vmlal.s32 q4, d19, d7 + vmlal.s32 q4, d27, d1 + vmlal.s32 q4, d22, d6 + vmlal.s32 q4, d28, d0 + vmull.s32 q8, d18, d7 + vmlal.s32 q8, d26, d1 + vmlal.s32 q8, d19, d6 + vmlal.s32 q8, d27, d0 + add r2, sp, #576 + vld1.8 {d20-d21}, [r2, : 128] + vmlal.s32 q7, d24, d21 + vmlal.s32 q7, d25, d20 + vmlal.s32 q4, d23, d21 + vmlal.s32 q4, d29, d20 + vmlal.s32 q8, d22, d21 + vmlal.s32 q8, d28, d20 + vmlal.s32 q5, d24, 
d20 + add r2, sp, #576 + vst1.8 {d14-d15}, [r2, : 128] + vmull.s32 q7, d18, d6 + vmlal.s32 q7, d26, d0 + add r2, sp, #656 + vld1.8 {d30-d31}, [r2, : 128] + vmlal.s32 q2, d30, d21 + vmlal.s32 q7, d19, d21 + vmlal.s32 q7, d27, d20 + add r2, sp, #624 + vld1.8 {d26-d27}, [r2, : 128] + vmlal.s32 q4, d25, d27 + vmlal.s32 q8, d29, d27 + vmlal.s32 q8, d25, d26 + vmlal.s32 q7, d28, d27 + vmlal.s32 q7, d29, d26 + add r2, sp, #608 + vld1.8 {d28-d29}, [r2, : 128] + vmlal.s32 q4, d24, d29 + vmlal.s32 q8, d23, d29 + vmlal.s32 q8, d24, d28 + vmlal.s32 q7, d22, d29 + vmlal.s32 q7, d23, d28 + add r2, sp, #608 + vst1.8 {d8-d9}, [r2, : 128] + add r2, sp, #560 + vld1.8 {d8-d9}, [r2, : 128] + vmlal.s32 q7, d24, d9 + vmlal.s32 q7, d25, d31 + vmull.s32 q1, d18, d2 + vmlal.s32 q1, d19, d1 + vmlal.s32 q1, d22, d0 + vmlal.s32 q1, d24, d27 + vmlal.s32 q1, d23, d20 + vmlal.s32 q1, d12, d7 + vmlal.s32 q1, d13, d6 + vmull.s32 q6, d18, d1 + vmlal.s32 q6, d19, d0 + vmlal.s32 q6, d23, d27 + vmlal.s32 q6, d22, d20 + vmlal.s32 q6, d24, d26 + vmull.s32 q0, d18, d0 + vmlal.s32 q0, d22, d27 + vmlal.s32 q0, d23, d26 + vmlal.s32 q0, d24, d31 + vmlal.s32 q0, d19, d20 + add r2, sp, #640 + vld1.8 {d18-d19}, [r2, : 128] + vmlal.s32 q2, d18, d7 + vmlal.s32 q2, d19, d6 + vmlal.s32 q5, d18, d6 + vmlal.s32 q5, d19, d21 + vmlal.s32 q1, d18, d21 + vmlal.s32 q1, d19, d29 + vmlal.s32 q0, d18, d28 + vmlal.s32 q0, d19, d9 + vmlal.s32 q6, d18, d29 + vmlal.s32 q6, d19, d28 + add r2, sp, #592 + vld1.8 {d18-d19}, [r2, : 128] + add r2, sp, #512 + vld1.8 {d22-d23}, [r2, : 128] + vmlal.s32 q5, d19, d7 + vmlal.s32 q0, d18, d21 + vmlal.s32 q0, d19, d29 + vmlal.s32 q6, d18, d6 + add r2, sp, #528 + vld1.8 {d6-d7}, [r2, : 128] + vmlal.s32 q6, d19, d21 + add r2, sp, #576 + vld1.8 {d18-d19}, [r2, : 128] + vmlal.s32 q0, d30, d8 + add r2, sp, #672 + vld1.8 {d20-d21}, [r2, : 128] + vmlal.s32 q5, d30, d29 + add r2, sp, #608 + vld1.8 {d24-d25}, [r2, : 128] + vmlal.s32 q1, d30, d28 + vadd.i64 q13, q0, q11 + vadd.i64 q14, q5, q11 + vmlal.s32 q6, d30, d9 + vshr.s64 q4, q13, #26 + vshr.s64 q13, q14, #26 + vadd.i64 q7, q7, q4 + vshl.i64 q4, q4, #26 + vadd.i64 q14, q7, q3 + vadd.i64 q9, q9, q13 + vshl.i64 q13, q13, #26 + vadd.i64 q15, q9, q3 + vsub.i64 q0, q0, q4 + vshr.s64 q4, q14, #25 + vsub.i64 q5, q5, q13 + vshr.s64 q13, q15, #25 + vadd.i64 q6, q6, q4 + vshl.i64 q4, q4, #25 + vadd.i64 q14, q6, q11 + vadd.i64 q2, q2, q13 + vsub.i64 q4, q7, q4 + vshr.s64 q7, q14, #26 + vshl.i64 q13, q13, #25 + vadd.i64 q14, q2, q11 + vadd.i64 q8, q8, q7 + vshl.i64 q7, q7, #26 + vadd.i64 q15, q8, q3 + vsub.i64 q9, q9, q13 + vshr.s64 q13, q14, #26 + vsub.i64 q6, q6, q7 + vshr.s64 q7, q15, #25 + vadd.i64 q10, q10, q13 + vshl.i64 q13, q13, #26 + vadd.i64 q14, q10, q3 + vadd.i64 q1, q1, q7 + add r2, r3, #144 + vshl.i64 q7, q7, #25 + add r4, r3, #96 + vadd.i64 q15, q1, q11 + add r2, r2, #8 + vsub.i64 q2, q2, q13 + add r4, r4, #8 + vshr.s64 q13, q14, #25 + vsub.i64 q7, q8, q7 + vshr.s64 q8, q15, #26 + vadd.i64 q14, q13, q13 + vadd.i64 q12, q12, q8 + vtrn.32 d12, d14 + vshl.i64 q8, q8, #26 + vtrn.32 d13, d15 + vadd.i64 q3, q12, q3 + vadd.i64 q0, q0, q14 + vst1.8 d12, [r2, : 64]! + vshl.i64 q7, q13, #4 + vst1.8 d13, [r4, : 64]! 
+ vsub.i64 q1, q1, q8 + vshr.s64 q3, q3, #25 + vadd.i64 q0, q0, q7 + vadd.i64 q5, q5, q3 + vshl.i64 q3, q3, #25 + vadd.i64 q6, q5, q11 + vadd.i64 q0, q0, q13 + vshl.i64 q7, q13, #25 + vadd.i64 q8, q0, q11 + vsub.i64 q3, q12, q3 + vshr.s64 q6, q6, #26 + vsub.i64 q7, q10, q7 + vtrn.32 d2, d6 + vshr.s64 q8, q8, #26 + vtrn.32 d3, d7 + vadd.i64 q3, q9, q6 + vst1.8 d2, [r2, : 64] + vshl.i64 q6, q6, #26 + vst1.8 d3, [r4, : 64] + vadd.i64 q1, q4, q8 + vtrn.32 d4, d14 + vshl.i64 q4, q8, #26 + vtrn.32 d5, d15 + vsub.i64 q5, q5, q6 + add r2, r2, #16 + vsub.i64 q0, q0, q4 + vst1.8 d4, [r2, : 64] + add r4, r4, #16 + vst1.8 d5, [r4, : 64] + vtrn.32 d10, d6 + vtrn.32 d11, d7 + sub r2, r2, #8 + sub r4, r4, #8 + vtrn.32 d0, d2 + vtrn.32 d1, d3 + vst1.8 d10, [r2, : 64] + vst1.8 d11, [r4, : 64] + sub r2, r2, #24 + sub r4, r4, #24 + vst1.8 d0, [r2, : 64] + vst1.8 d1, [r4, : 64] + add r2, r3, #288 + add r4, r3, #336 + vld1.8 {d0-d1}, [r2, : 128]! + vld1.8 {d2-d3}, [r4, : 128]! + vsub.i32 q0, q0, q1 + vld1.8 {d2-d3}, [r2, : 128]! + vld1.8 {d4-d5}, [r4, : 128]! + vsub.i32 q1, q1, q2 + add r5, r3, #240 + vld1.8 {d4}, [r2, : 64] + vld1.8 {d6}, [r4, : 64] + vsub.i32 q2, q2, q3 + vst1.8 {d0-d1}, [r5, : 128]! + vst1.8 {d2-d3}, [r5, : 128]! + vst1.8 d4, [r5, : 64] + add r2, r3, #144 + add r4, r3, #96 + add r5, r3, #144 + add r6, r3, #192 + vld1.8 {d0-d1}, [r2, : 128]! + vld1.8 {d2-d3}, [r4, : 128]! + vsub.i32 q2, q0, q1 + vadd.i32 q0, q0, q1 + vld1.8 {d2-d3}, [r2, : 128]! + vld1.8 {d6-d7}, [r4, : 128]! + vsub.i32 q4, q1, q3 + vadd.i32 q1, q1, q3 + vld1.8 {d6}, [r2, : 64] + vld1.8 {d10}, [r4, : 64] + vsub.i32 q6, q3, q5 + vadd.i32 q3, q3, q5 + vst1.8 {d4-d5}, [r5, : 128]! + vst1.8 {d0-d1}, [r6, : 128]! + vst1.8 {d8-d9}, [r5, : 128]! + vst1.8 {d2-d3}, [r6, : 128]! + vst1.8 d12, [r5, : 64] + vst1.8 d6, [r6, : 64] + add r2, r3, #0 + add r4, r3, #240 + vld1.8 {d0-d1}, [r4, : 128]! + vld1.8 {d2-d3}, [r4, : 128]! + vld1.8 {d4}, [r4, : 64] + add r4, r3, #336 + vld1.8 {d6-d7}, [r4, : 128]! + vtrn.32 q0, q3 + vld1.8 {d8-d9}, [r4, : 128]! + vshl.i32 q5, q0, #4 + vtrn.32 q1, q4 + vshl.i32 q6, q3, #4 + vadd.i32 q5, q5, q0 + vadd.i32 q6, q6, q3 + vshl.i32 q7, q1, #4 + vld1.8 {d5}, [r4, : 64] + vshl.i32 q8, q4, #4 + vtrn.32 d4, d5 + vadd.i32 q7, q7, q1 + vadd.i32 q8, q8, q4 + vld1.8 {d18-d19}, [r2, : 128]! + vshl.i32 q10, q2, #4 + vld1.8 {d22-d23}, [r2, : 128]! + vadd.i32 q10, q10, q2 + vld1.8 {d24}, [r2, : 64] + vadd.i32 q5, q5, q0 + add r2, r3, #288 + vld1.8 {d26-d27}, [r2, : 128]! + vadd.i32 q6, q6, q3 + vld1.8 {d28-d29}, [r2, : 128]! 
+ vadd.i32 q8, q8, q4 + vld1.8 {d25}, [r2, : 64] + vadd.i32 q10, q10, q2 + vtrn.32 q9, q13 + vadd.i32 q7, q7, q1 + vadd.i32 q5, q5, q0 + vtrn.32 q11, q14 + vadd.i32 q6, q6, q3 + add r2, sp, #560 + vadd.i32 q10, q10, q2 + vtrn.32 d24, d25 + vst1.8 {d12-d13}, [r2, : 128] + vshl.i32 q6, q13, #1 + add r2, sp, #576 + vst1.8 {d20-d21}, [r2, : 128] + vshl.i32 q10, q14, #1 + add r2, sp, #592 + vst1.8 {d12-d13}, [r2, : 128] + vshl.i32 q15, q12, #1 + vadd.i32 q8, q8, q4 + vext.32 d10, d31, d30, #0 + vadd.i32 q7, q7, q1 + add r2, sp, #608 + vst1.8 {d16-d17}, [r2, : 128] + vmull.s32 q8, d18, d5 + vmlal.s32 q8, d26, d4 + vmlal.s32 q8, d19, d9 + vmlal.s32 q8, d27, d3 + vmlal.s32 q8, d22, d8 + vmlal.s32 q8, d28, d2 + vmlal.s32 q8, d23, d7 + vmlal.s32 q8, d29, d1 + vmlal.s32 q8, d24, d6 + vmlal.s32 q8, d25, d0 + add r2, sp, #624 + vst1.8 {d14-d15}, [r2, : 128] + vmull.s32 q2, d18, d4 + vmlal.s32 q2, d12, d9 + vmlal.s32 q2, d13, d8 + vmlal.s32 q2, d19, d3 + vmlal.s32 q2, d22, d2 + vmlal.s32 q2, d23, d1 + vmlal.s32 q2, d24, d0 + add r2, sp, #640 + vst1.8 {d20-d21}, [r2, : 128] + vmull.s32 q7, d18, d9 + vmlal.s32 q7, d26, d3 + vmlal.s32 q7, d19, d8 + vmlal.s32 q7, d27, d2 + vmlal.s32 q7, d22, d7 + vmlal.s32 q7, d28, d1 + vmlal.s32 q7, d23, d6 + vmlal.s32 q7, d29, d0 + add r2, sp, #656 + vst1.8 {d10-d11}, [r2, : 128] + vmull.s32 q5, d18, d3 + vmlal.s32 q5, d19, d2 + vmlal.s32 q5, d22, d1 + vmlal.s32 q5, d23, d0 + vmlal.s32 q5, d12, d8 + add r2, sp, #672 + vst1.8 {d16-d17}, [r2, : 128] + vmull.s32 q4, d18, d8 + vmlal.s32 q4, d26, d2 + vmlal.s32 q4, d19, d7 + vmlal.s32 q4, d27, d1 + vmlal.s32 q4, d22, d6 + vmlal.s32 q4, d28, d0 + vmull.s32 q8, d18, d7 + vmlal.s32 q8, d26, d1 + vmlal.s32 q8, d19, d6 + vmlal.s32 q8, d27, d0 + add r2, sp, #576 + vld1.8 {d20-d21}, [r2, : 128] + vmlal.s32 q7, d24, d21 + vmlal.s32 q7, d25, d20 + vmlal.s32 q4, d23, d21 + vmlal.s32 q4, d29, d20 + vmlal.s32 q8, d22, d21 + vmlal.s32 q8, d28, d20 + vmlal.s32 q5, d24, d20 + add r2, sp, #576 + vst1.8 {d14-d15}, [r2, : 128] + vmull.s32 q7, d18, d6 + vmlal.s32 q7, d26, d0 + add r2, sp, #656 + vld1.8 {d30-d31}, [r2, : 128] + vmlal.s32 q2, d30, d21 + vmlal.s32 q7, d19, d21 + vmlal.s32 q7, d27, d20 + add r2, sp, #624 + vld1.8 {d26-d27}, [r2, : 128] + vmlal.s32 q4, d25, d27 + vmlal.s32 q8, d29, d27 + vmlal.s32 q8, d25, d26 + vmlal.s32 q7, d28, d27 + vmlal.s32 q7, d29, d26 + add r2, sp, #608 + vld1.8 {d28-d29}, [r2, : 128] + vmlal.s32 q4, d24, d29 + vmlal.s32 q8, d23, d29 + vmlal.s32 q8, d24, d28 + vmlal.s32 q7, d22, d29 + vmlal.s32 q7, d23, d28 + add r2, sp, #608 + vst1.8 {d8-d9}, [r2, : 128] + add r2, sp, #560 + vld1.8 {d8-d9}, [r2, : 128] + vmlal.s32 q7, d24, d9 + vmlal.s32 q7, d25, d31 + vmull.s32 q1, d18, d2 + vmlal.s32 q1, d19, d1 + vmlal.s32 q1, d22, d0 + vmlal.s32 q1, d24, d27 + vmlal.s32 q1, d23, d20 + vmlal.s32 q1, d12, d7 + vmlal.s32 q1, d13, d6 + vmull.s32 q6, d18, d1 + vmlal.s32 q6, d19, d0 + vmlal.s32 q6, d23, d27 + vmlal.s32 q6, d22, d20 + vmlal.s32 q6, d24, d26 + vmull.s32 q0, d18, d0 + vmlal.s32 q0, d22, d27 + vmlal.s32 q0, d23, d26 + vmlal.s32 q0, d24, d31 + vmlal.s32 q0, d19, d20 + add r2, sp, #640 + vld1.8 {d18-d19}, [r2, : 128] + vmlal.s32 q2, d18, d7 + vmlal.s32 q2, d19, d6 + vmlal.s32 q5, d18, d6 + vmlal.s32 q5, d19, d21 + vmlal.s32 q1, d18, d21 + vmlal.s32 q1, d19, d29 + vmlal.s32 q0, d18, d28 + vmlal.s32 q0, d19, d9 + vmlal.s32 q6, d18, d29 + vmlal.s32 q6, d19, d28 + add r2, sp, #592 + vld1.8 {d18-d19}, [r2, : 128] + add r2, sp, #512 + vld1.8 {d22-d23}, [r2, : 128] + vmlal.s32 q5, d19, d7 + vmlal.s32 q0, d18, d21 + 
vmlal.s32 q0, d19, d29 + vmlal.s32 q6, d18, d6 + add r2, sp, #528 + vld1.8 {d6-d7}, [r2, : 128] + vmlal.s32 q6, d19, d21 + add r2, sp, #576 + vld1.8 {d18-d19}, [r2, : 128] + vmlal.s32 q0, d30, d8 + add r2, sp, #672 + vld1.8 {d20-d21}, [r2, : 128] + vmlal.s32 q5, d30, d29 + add r2, sp, #608 + vld1.8 {d24-d25}, [r2, : 128] + vmlal.s32 q1, d30, d28 + vadd.i64 q13, q0, q11 + vadd.i64 q14, q5, q11 + vmlal.s32 q6, d30, d9 + vshr.s64 q4, q13, #26 + vshr.s64 q13, q14, #26 + vadd.i64 q7, q7, q4 + vshl.i64 q4, q4, #26 + vadd.i64 q14, q7, q3 + vadd.i64 q9, q9, q13 + vshl.i64 q13, q13, #26 + vadd.i64 q15, q9, q3 + vsub.i64 q0, q0, q4 + vshr.s64 q4, q14, #25 + vsub.i64 q5, q5, q13 + vshr.s64 q13, q15, #25 + vadd.i64 q6, q6, q4 + vshl.i64 q4, q4, #25 + vadd.i64 q14, q6, q11 + vadd.i64 q2, q2, q13 + vsub.i64 q4, q7, q4 + vshr.s64 q7, q14, #26 + vshl.i64 q13, q13, #25 + vadd.i64 q14, q2, q11 + vadd.i64 q8, q8, q7 + vshl.i64 q7, q7, #26 + vadd.i64 q15, q8, q3 + vsub.i64 q9, q9, q13 + vshr.s64 q13, q14, #26 + vsub.i64 q6, q6, q7 + vshr.s64 q7, q15, #25 + vadd.i64 q10, q10, q13 + vshl.i64 q13, q13, #26 + vadd.i64 q14, q10, q3 + vadd.i64 q1, q1, q7 + add r2, r3, #288 + vshl.i64 q7, q7, #25 + add r4, r3, #96 + vadd.i64 q15, q1, q11 + add r2, r2, #8 + vsub.i64 q2, q2, q13 + add r4, r4, #8 + vshr.s64 q13, q14, #25 + vsub.i64 q7, q8, q7 + vshr.s64 q8, q15, #26 + vadd.i64 q14, q13, q13 + vadd.i64 q12, q12, q8 + vtrn.32 d12, d14 + vshl.i64 q8, q8, #26 + vtrn.32 d13, d15 + vadd.i64 q3, q12, q3 + vadd.i64 q0, q0, q14 + vst1.8 d12, [r2, : 64]! + vshl.i64 q7, q13, #4 + vst1.8 d13, [r4, : 64]! + vsub.i64 q1, q1, q8 + vshr.s64 q3, q3, #25 + vadd.i64 q0, q0, q7 + vadd.i64 q5, q5, q3 + vshl.i64 q3, q3, #25 + vadd.i64 q6, q5, q11 + vadd.i64 q0, q0, q13 + vshl.i64 q7, q13, #25 + vadd.i64 q8, q0, q11 + vsub.i64 q3, q12, q3 + vshr.s64 q6, q6, #26 + vsub.i64 q7, q10, q7 + vtrn.32 d2, d6 + vshr.s64 q8, q8, #26 + vtrn.32 d3, d7 + vadd.i64 q3, q9, q6 + vst1.8 d2, [r2, : 64] + vshl.i64 q6, q6, #26 + vst1.8 d3, [r4, : 64] + vadd.i64 q1, q4, q8 + vtrn.32 d4, d14 + vshl.i64 q4, q8, #26 + vtrn.32 d5, d15 + vsub.i64 q5, q5, q6 + add r2, r2, #16 + vsub.i64 q0, q0, q4 + vst1.8 d4, [r2, : 64] + add r4, r4, #16 + vst1.8 d5, [r4, : 64] + vtrn.32 d10, d6 + vtrn.32 d11, d7 + sub r2, r2, #8 + sub r4, r4, #8 + vtrn.32 d0, d2 + vtrn.32 d1, d3 + vst1.8 d10, [r2, : 64] + vst1.8 d11, [r4, : 64] + sub r2, r2, #24 + sub r4, r4, #24 + vst1.8 d0, [r2, : 64] + vst1.8 d1, [r4, : 64] + add r2, sp, #544 + add r4, r3, #144 + add r5, r3, #192 + vld1.8 {d0-d1}, [r2, : 128] + vld1.8 {d2-d3}, [r4, : 128]! + vld1.8 {d4-d5}, [r5, : 128]! + vzip.i32 q1, q2 + vld1.8 {d6-d7}, [r4, : 128]! + vld1.8 {d8-d9}, [r5, : 128]! 
+ vshl.i32 q5, q1, #1 + vzip.i32 q3, q4 + vshl.i32 q6, q2, #1 + vld1.8 {d14}, [r4, : 64] + vshl.i32 q8, q3, #1 + vld1.8 {d15}, [r5, : 64] + vshl.i32 q9, q4, #1 + vmul.i32 d21, d7, d1 + vtrn.32 d14, d15 + vmul.i32 q11, q4, q0 + vmul.i32 q0, q7, q0 + vmull.s32 q12, d2, d2 + vmlal.s32 q12, d11, d1 + vmlal.s32 q12, d12, d0 + vmlal.s32 q12, d13, d23 + vmlal.s32 q12, d16, d22 + vmlal.s32 q12, d7, d21 + vmull.s32 q10, d2, d11 + vmlal.s32 q10, d4, d1 + vmlal.s32 q10, d13, d0 + vmlal.s32 q10, d6, d23 + vmlal.s32 q10, d17, d22 + vmull.s32 q13, d10, d4 + vmlal.s32 q13, d11, d3 + vmlal.s32 q13, d13, d1 + vmlal.s32 q13, d16, d0 + vmlal.s32 q13, d17, d23 + vmlal.s32 q13, d8, d22 + vmull.s32 q1, d10, d5 + vmlal.s32 q1, d11, d4 + vmlal.s32 q1, d6, d1 + vmlal.s32 q1, d17, d0 + vmlal.s32 q1, d8, d23 + vmull.s32 q14, d10, d6 + vmlal.s32 q14, d11, d13 + vmlal.s32 q14, d4, d4 + vmlal.s32 q14, d17, d1 + vmlal.s32 q14, d18, d0 + vmlal.s32 q14, d9, d23 + vmull.s32 q11, d10, d7 + vmlal.s32 q11, d11, d6 + vmlal.s32 q11, d12, d5 + vmlal.s32 q11, d8, d1 + vmlal.s32 q11, d19, d0 + vmull.s32 q15, d10, d8 + vmlal.s32 q15, d11, d17 + vmlal.s32 q15, d12, d6 + vmlal.s32 q15, d13, d5 + vmlal.s32 q15, d19, d1 + vmlal.s32 q15, d14, d0 + vmull.s32 q2, d10, d9 + vmlal.s32 q2, d11, d8 + vmlal.s32 q2, d12, d7 + vmlal.s32 q2, d13, d6 + vmlal.s32 q2, d14, d1 + vmull.s32 q0, d15, d1 + vmlal.s32 q0, d10, d14 + vmlal.s32 q0, d11, d19 + vmlal.s32 q0, d12, d8 + vmlal.s32 q0, d13, d17 + vmlal.s32 q0, d6, d6 + add r2, sp, #512 + vld1.8 {d18-d19}, [r2, : 128] + vmull.s32 q3, d16, d7 + vmlal.s32 q3, d10, d15 + vmlal.s32 q3, d11, d14 + vmlal.s32 q3, d12, d9 + vmlal.s32 q3, d13, d8 + add r2, sp, #528 + vld1.8 {d8-d9}, [r2, : 128] + vadd.i64 q5, q12, q9 + vadd.i64 q6, q15, q9 + vshr.s64 q5, q5, #26 + vshr.s64 q6, q6, #26 + vadd.i64 q7, q10, q5 + vshl.i64 q5, q5, #26 + vadd.i64 q8, q7, q4 + vadd.i64 q2, q2, q6 + vshl.i64 q6, q6, #26 + vadd.i64 q10, q2, q4 + vsub.i64 q5, q12, q5 + vshr.s64 q8, q8, #25 + vsub.i64 q6, q15, q6 + vshr.s64 q10, q10, #25 + vadd.i64 q12, q13, q8 + vshl.i64 q8, q8, #25 + vadd.i64 q13, q12, q9 + vadd.i64 q0, q0, q10 + vsub.i64 q7, q7, q8 + vshr.s64 q8, q13, #26 + vshl.i64 q10, q10, #25 + vadd.i64 q13, q0, q9 + vadd.i64 q1, q1, q8 + vshl.i64 q8, q8, #26 + vadd.i64 q15, q1, q4 + vsub.i64 q2, q2, q10 + vshr.s64 q10, q13, #26 + vsub.i64 q8, q12, q8 + vshr.s64 q12, q15, #25 + vadd.i64 q3, q3, q10 + vshl.i64 q10, q10, #26 + vadd.i64 q13, q3, q4 + vadd.i64 q14, q14, q12 + add r2, r3, #144 + vshl.i64 q12, q12, #25 + add r4, r3, #192 + vadd.i64 q15, q14, q9 + add r2, r2, #8 + vsub.i64 q0, q0, q10 + add r4, r4, #8 + vshr.s64 q10, q13, #25 + vsub.i64 q1, q1, q12 + vshr.s64 q12, q15, #26 + vadd.i64 q13, q10, q10 + vadd.i64 q11, q11, q12 + vtrn.32 d16, d2 + vshl.i64 q12, q12, #26 + vtrn.32 d17, d3 + vadd.i64 q1, q11, q4 + vadd.i64 q4, q5, q13 + vst1.8 d16, [r2, : 64]! + vshl.i64 q5, q10, #4 + vst1.8 d17, [r4, : 64]! 
+ vsub.i64 q8, q14, q12 + vshr.s64 q1, q1, #25 + vadd.i64 q4, q4, q5 + vadd.i64 q5, q6, q1 + vshl.i64 q1, q1, #25 + vadd.i64 q6, q5, q9 + vadd.i64 q4, q4, q10 + vshl.i64 q10, q10, #25 + vadd.i64 q9, q4, q9 + vsub.i64 q1, q11, q1 + vshr.s64 q6, q6, #26 + vsub.i64 q3, q3, q10 + vtrn.32 d16, d2 + vshr.s64 q9, q9, #26 + vtrn.32 d17, d3 + vadd.i64 q1, q2, q6 + vst1.8 d16, [r2, : 64] + vshl.i64 q2, q6, #26 + vst1.8 d17, [r4, : 64] + vadd.i64 q6, q7, q9 + vtrn.32 d0, d6 + vshl.i64 q7, q9, #26 + vtrn.32 d1, d7 + vsub.i64 q2, q5, q2 + add r2, r2, #16 + vsub.i64 q3, q4, q7 + vst1.8 d0, [r2, : 64] + add r4, r4, #16 + vst1.8 d1, [r4, : 64] + vtrn.32 d4, d2 + vtrn.32 d5, d3 + sub r2, r2, #8 + sub r4, r4, #8 + vtrn.32 d6, d12 + vtrn.32 d7, d13 + vst1.8 d4, [r2, : 64] + vst1.8 d5, [r4, : 64] + sub r2, r2, #24 + sub r4, r4, #24 + vst1.8 d6, [r2, : 64] + vst1.8 d7, [r4, : 64] + add r2, r3, #336 + add r4, r3, #288 + vld1.8 {d0-d1}, [r2, : 128]! + vld1.8 {d2-d3}, [r4, : 128]! + vadd.i32 q0, q0, q1 + vld1.8 {d2-d3}, [r2, : 128]! + vld1.8 {d4-d5}, [r4, : 128]! + vadd.i32 q1, q1, q2 + add r5, r3, #288 + vld1.8 {d4}, [r2, : 64] + vld1.8 {d6}, [r4, : 64] + vadd.i32 q2, q2, q3 + vst1.8 {d0-d1}, [r5, : 128]! + vst1.8 {d2-d3}, [r5, : 128]! + vst1.8 d4, [r5, : 64] + add r2, r3, #48 + add r4, r3, #144 + vld1.8 {d0-d1}, [r4, : 128]! + vld1.8 {d2-d3}, [r4, : 128]! + vld1.8 {d4}, [r4, : 64] + add r4, r3, #288 + vld1.8 {d6-d7}, [r4, : 128]! + vtrn.32 q0, q3 + vld1.8 {d8-d9}, [r4, : 128]! + vshl.i32 q5, q0, #4 + vtrn.32 q1, q4 + vshl.i32 q6, q3, #4 + vadd.i32 q5, q5, q0 + vadd.i32 q6, q6, q3 + vshl.i32 q7, q1, #4 + vld1.8 {d5}, [r4, : 64] + vshl.i32 q8, q4, #4 + vtrn.32 d4, d5 + vadd.i32 q7, q7, q1 + vadd.i32 q8, q8, q4 + vld1.8 {d18-d19}, [r2, : 128]! + vshl.i32 q10, q2, #4 + vld1.8 {d22-d23}, [r2, : 128]! + vadd.i32 q10, q10, q2 + vld1.8 {d24}, [r2, : 64] + vadd.i32 q5, q5, q0 + add r2, r3, #240 + vld1.8 {d26-d27}, [r2, : 128]! + vadd.i32 q6, q6, q3 + vld1.8 {d28-d29}, [r2, : 128]! 
+ vadd.i32 q8, q8, q4 + vld1.8 {d25}, [r2, : 64] + vadd.i32 q10, q10, q2 + vtrn.32 q9, q13 + vadd.i32 q7, q7, q1 + vadd.i32 q5, q5, q0 + vtrn.32 q11, q14 + vadd.i32 q6, q6, q3 + add r2, sp, #560 + vadd.i32 q10, q10, q2 + vtrn.32 d24, d25 + vst1.8 {d12-d13}, [r2, : 128] + vshl.i32 q6, q13, #1 + add r2, sp, #576 + vst1.8 {d20-d21}, [r2, : 128] + vshl.i32 q10, q14, #1 + add r2, sp, #592 + vst1.8 {d12-d13}, [r2, : 128] + vshl.i32 q15, q12, #1 + vadd.i32 q8, q8, q4 + vext.32 d10, d31, d30, #0 + vadd.i32 q7, q7, q1 + add r2, sp, #608 + vst1.8 {d16-d17}, [r2, : 128] + vmull.s32 q8, d18, d5 + vmlal.s32 q8, d26, d4 + vmlal.s32 q8, d19, d9 + vmlal.s32 q8, d27, d3 + vmlal.s32 q8, d22, d8 + vmlal.s32 q8, d28, d2 + vmlal.s32 q8, d23, d7 + vmlal.s32 q8, d29, d1 + vmlal.s32 q8, d24, d6 + vmlal.s32 q8, d25, d0 + add r2, sp, #624 + vst1.8 {d14-d15}, [r2, : 128] + vmull.s32 q2, d18, d4 + vmlal.s32 q2, d12, d9 + vmlal.s32 q2, d13, d8 + vmlal.s32 q2, d19, d3 + vmlal.s32 q2, d22, d2 + vmlal.s32 q2, d23, d1 + vmlal.s32 q2, d24, d0 + add r2, sp, #640 + vst1.8 {d20-d21}, [r2, : 128] + vmull.s32 q7, d18, d9 + vmlal.s32 q7, d26, d3 + vmlal.s32 q7, d19, d8 + vmlal.s32 q7, d27, d2 + vmlal.s32 q7, d22, d7 + vmlal.s32 q7, d28, d1 + vmlal.s32 q7, d23, d6 + vmlal.s32 q7, d29, d0 + add r2, sp, #656 + vst1.8 {d10-d11}, [r2, : 128] + vmull.s32 q5, d18, d3 + vmlal.s32 q5, d19, d2 + vmlal.s32 q5, d22, d1 + vmlal.s32 q5, d23, d0 + vmlal.s32 q5, d12, d8 + add r2, sp, #672 + vst1.8 {d16-d17}, [r2, : 128] + vmull.s32 q4, d18, d8 + vmlal.s32 q4, d26, d2 + vmlal.s32 q4, d19, d7 + vmlal.s32 q4, d27, d1 + vmlal.s32 q4, d22, d6 + vmlal.s32 q4, d28, d0 + vmull.s32 q8, d18, d7 + vmlal.s32 q8, d26, d1 + vmlal.s32 q8, d19, d6 + vmlal.s32 q8, d27, d0 + add r2, sp, #576 + vld1.8 {d20-d21}, [r2, : 128] + vmlal.s32 q7, d24, d21 + vmlal.s32 q7, d25, d20 + vmlal.s32 q4, d23, d21 + vmlal.s32 q4, d29, d20 + vmlal.s32 q8, d22, d21 + vmlal.s32 q8, d28, d20 + vmlal.s32 q5, d24, d20 + add r2, sp, #576 + vst1.8 {d14-d15}, [r2, : 128] + vmull.s32 q7, d18, d6 + vmlal.s32 q7, d26, d0 + add r2, sp, #656 + vld1.8 {d30-d31}, [r2, : 128] + vmlal.s32 q2, d30, d21 + vmlal.s32 q7, d19, d21 + vmlal.s32 q7, d27, d20 + add r2, sp, #624 + vld1.8 {d26-d27}, [r2, : 128] + vmlal.s32 q4, d25, d27 + vmlal.s32 q8, d29, d27 + vmlal.s32 q8, d25, d26 + vmlal.s32 q7, d28, d27 + vmlal.s32 q7, d29, d26 + add r2, sp, #608 + vld1.8 {d28-d29}, [r2, : 128] + vmlal.s32 q4, d24, d29 + vmlal.s32 q8, d23, d29 + vmlal.s32 q8, d24, d28 + vmlal.s32 q7, d22, d29 + vmlal.s32 q7, d23, d28 + add r2, sp, #608 + vst1.8 {d8-d9}, [r2, : 128] + add r2, sp, #560 + vld1.8 {d8-d9}, [r2, : 128] + vmlal.s32 q7, d24, d9 + vmlal.s32 q7, d25, d31 + vmull.s32 q1, d18, d2 + vmlal.s32 q1, d19, d1 + vmlal.s32 q1, d22, d0 + vmlal.s32 q1, d24, d27 + vmlal.s32 q1, d23, d20 + vmlal.s32 q1, d12, d7 + vmlal.s32 q1, d13, d6 + vmull.s32 q6, d18, d1 + vmlal.s32 q6, d19, d0 + vmlal.s32 q6, d23, d27 + vmlal.s32 q6, d22, d20 + vmlal.s32 q6, d24, d26 + vmull.s32 q0, d18, d0 + vmlal.s32 q0, d22, d27 + vmlal.s32 q0, d23, d26 + vmlal.s32 q0, d24, d31 + vmlal.s32 q0, d19, d20 + add r2, sp, #640 + vld1.8 {d18-d19}, [r2, : 128] + vmlal.s32 q2, d18, d7 + vmlal.s32 q2, d19, d6 + vmlal.s32 q5, d18, d6 + vmlal.s32 q5, d19, d21 + vmlal.s32 q1, d18, d21 + vmlal.s32 q1, d19, d29 + vmlal.s32 q0, d18, d28 + vmlal.s32 q0, d19, d9 + vmlal.s32 q6, d18, d29 + vmlal.s32 q6, d19, d28 + add r2, sp, #592 + vld1.8 {d18-d19}, [r2, : 128] + add r2, sp, #512 + vld1.8 {d22-d23}, [r2, : 128] + vmlal.s32 q5, d19, d7 + vmlal.s32 q0, d18, d21 + 
vmlal.s32 q0, d19, d29 + vmlal.s32 q6, d18, d6 + add r2, sp, #528 + vld1.8 {d6-d7}, [r2, : 128] + vmlal.s32 q6, d19, d21 + add r2, sp, #576 + vld1.8 {d18-d19}, [r2, : 128] + vmlal.s32 q0, d30, d8 + add r2, sp, #672 + vld1.8 {d20-d21}, [r2, : 128] + vmlal.s32 q5, d30, d29 + add r2, sp, #608 + vld1.8 {d24-d25}, [r2, : 128] + vmlal.s32 q1, d30, d28 + vadd.i64 q13, q0, q11 + vadd.i64 q14, q5, q11 + vmlal.s32 q6, d30, d9 + vshr.s64 q4, q13, #26 + vshr.s64 q13, q14, #26 + vadd.i64 q7, q7, q4 + vshl.i64 q4, q4, #26 + vadd.i64 q14, q7, q3 + vadd.i64 q9, q9, q13 + vshl.i64 q13, q13, #26 + vadd.i64 q15, q9, q3 + vsub.i64 q0, q0, q4 + vshr.s64 q4, q14, #25 + vsub.i64 q5, q5, q13 + vshr.s64 q13, q15, #25 + vadd.i64 q6, q6, q4 + vshl.i64 q4, q4, #25 + vadd.i64 q14, q6, q11 + vadd.i64 q2, q2, q13 + vsub.i64 q4, q7, q4 + vshr.s64 q7, q14, #26 + vshl.i64 q13, q13, #25 + vadd.i64 q14, q2, q11 + vadd.i64 q8, q8, q7 + vshl.i64 q7, q7, #26 + vadd.i64 q15, q8, q3 + vsub.i64 q9, q9, q13 + vshr.s64 q13, q14, #26 + vsub.i64 q6, q6, q7 + vshr.s64 q7, q15, #25 + vadd.i64 q10, q10, q13 + vshl.i64 q13, q13, #26 + vadd.i64 q14, q10, q3 + vadd.i64 q1, q1, q7 + add r2, r3, #240 + vshl.i64 q7, q7, #25 + add r4, r3, #144 + vadd.i64 q15, q1, q11 + add r2, r2, #8 + vsub.i64 q2, q2, q13 + add r4, r4, #8 + vshr.s64 q13, q14, #25 + vsub.i64 q7, q8, q7 + vshr.s64 q8, q15, #26 + vadd.i64 q14, q13, q13 + vadd.i64 q12, q12, q8 + vtrn.32 d12, d14 + vshl.i64 q8, q8, #26 + vtrn.32 d13, d15 + vadd.i64 q3, q12, q3 + vadd.i64 q0, q0, q14 + vst1.8 d12, [r2, : 64]! + vshl.i64 q7, q13, #4 + vst1.8 d13, [r4, : 64]! + vsub.i64 q1, q1, q8 + vshr.s64 q3, q3, #25 + vadd.i64 q0, q0, q7 + vadd.i64 q5, q5, q3 + vshl.i64 q3, q3, #25 + vadd.i64 q6, q5, q11 + vadd.i64 q0, q0, q13 + vshl.i64 q7, q13, #25 + vadd.i64 q8, q0, q11 + vsub.i64 q3, q12, q3 + vshr.s64 q6, q6, #26 + vsub.i64 q7, q10, q7 + vtrn.32 d2, d6 + vshr.s64 q8, q8, #26 + vtrn.32 d3, d7 + vadd.i64 q3, q9, q6 + vst1.8 d2, [r2, : 64] + vshl.i64 q6, q6, #26 + vst1.8 d3, [r4, : 64] + vadd.i64 q1, q4, q8 + vtrn.32 d4, d14 + vshl.i64 q4, q8, #26 + vtrn.32 d5, d15 + vsub.i64 q5, q5, q6 + add r2, r2, #16 + vsub.i64 q0, q0, q4 + vst1.8 d4, [r2, : 64] + add r4, r4, #16 + vst1.8 d5, [r4, : 64] + vtrn.32 d10, d6 + vtrn.32 d11, d7 + sub r2, r2, #8 + sub r4, r4, #8 + vtrn.32 d0, d2 + vtrn.32 d1, d3 + vst1.8 d10, [r2, : 64] + vst1.8 d11, [r4, : 64] + sub r2, r2, #24 + sub r4, r4, #24 + vst1.8 d0, [r2, : 64] + vst1.8 d1, [r4, : 64] + ldr r2, [sp, #488] + ldr r4, [sp, #492] + subs r5, r2, #1 + bge ._mainloop + add r1, r3, #144 + add r2, r3, #336 + vld1.8 {d0-d1}, [r1, : 128]! + vld1.8 {d2-d3}, [r1, : 128]! + vld1.8 {d4}, [r1, : 64] + vst1.8 {d0-d1}, [r2, : 128]! + vst1.8 {d2-d3}, [r2, : 128]! + vst1.8 d4, [r2, : 64] + ldr r1, =0 +._invertloop: + add r2, r3, #144 + ldr r4, =0 + ldr r5, =2 + cmp r1, #1 + ldreq r5, =1 + addeq r2, r3, #336 + addeq r4, r3, #48 + cmp r1, #2 + ldreq r5, =1 + addeq r2, r3, #48 + cmp r1, #3 + ldreq r5, =5 + addeq r4, r3, #336 + cmp r1, #4 + ldreq r5, =10 + cmp r1, #5 + ldreq r5, =20 + cmp r1, #6 + ldreq r5, =10 + addeq r2, r3, #336 + addeq r4, r3, #336 + cmp r1, #7 + ldreq r5, =50 + cmp r1, #8 + ldreq r5, =100 + cmp r1, #9 + ldreq r5, =50 + addeq r2, r3, #336 + cmp r1, #10 + ldreq r5, =5 + addeq r2, r3, #48 + cmp r1, #11 + ldreq r5, =0 + addeq r2, r3, #96 + add r6, r3, #144 + add r7, r3, #288 + vld1.8 {d0-d1}, [r6, : 128]! + vld1.8 {d2-d3}, [r6, : 128]! + vld1.8 {d4}, [r6, : 64] + vst1.8 {d0-d1}, [r7, : 128]! + vst1.8 {d2-d3}, [r7, : 128]! 
+ vst1.8 d4, [r7, : 64] + cmp r5, #0 + beq ._skipsquaringloop +._squaringloop: + add r6, r3, #288 + add r7, r3, #288 + add r8, r3, #288 + vmov.i32 q0, #19 + vmov.i32 q1, #0 + vmov.i32 q2, #1 + vzip.i32 q1, q2 + vld1.8 {d4-d5}, [r7, : 128]! + vld1.8 {d6-d7}, [r7, : 128]! + vld1.8 {d9}, [r7, : 64] + vld1.8 {d10-d11}, [r6, : 128]! + add r7, sp, #416 + vld1.8 {d12-d13}, [r6, : 128]! + vmul.i32 q7, q2, q0 + vld1.8 {d8}, [r6, : 64] + vext.32 d17, d11, d10, #1 + vmul.i32 q9, q3, q0 + vext.32 d16, d10, d8, #1 + vshl.u32 q10, q5, q1 + vext.32 d22, d14, d4, #1 + vext.32 d24, d18, d6, #1 + vshl.u32 q13, q6, q1 + vshl.u32 d28, d8, d2 + vrev64.i32 d22, d22 + vmul.i32 d1, d9, d1 + vrev64.i32 d24, d24 + vext.32 d29, d8, d13, #1 + vext.32 d0, d1, d9, #1 + vrev64.i32 d0, d0 + vext.32 d2, d9, d1, #1 + vext.32 d23, d15, d5, #1 + vmull.s32 q4, d20, d4 + vrev64.i32 d23, d23 + vmlal.s32 q4, d21, d1 + vrev64.i32 d2, d2 + vmlal.s32 q4, d26, d19 + vext.32 d3, d5, d15, #1 + vmlal.s32 q4, d27, d18 + vrev64.i32 d3, d3 + vmlal.s32 q4, d28, d15 + vext.32 d14, d12, d11, #1 + vmull.s32 q5, d16, d23 + vext.32 d15, d13, d12, #1 + vmlal.s32 q5, d17, d4 + vst1.8 d8, [r7, : 64]! + vmlal.s32 q5, d14, d1 + vext.32 d12, d9, d8, #0 + vmlal.s32 q5, d15, d19 + vmov.i64 d13, #0 + vmlal.s32 q5, d29, d18 + vext.32 d25, d19, d7, #1 + vmlal.s32 q6, d20, d5 + vrev64.i32 d25, d25 + vmlal.s32 q6, d21, d4 + vst1.8 d11, [r7, : 64]! + vmlal.s32 q6, d26, d1 + vext.32 d9, d10, d10, #0 + vmlal.s32 q6, d27, d19 + vmov.i64 d8, #0 + vmlal.s32 q6, d28, d18 + vmlal.s32 q4, d16, d24 + vmlal.s32 q4, d17, d5 + vmlal.s32 q4, d14, d4 + vst1.8 d12, [r7, : 64]! + vmlal.s32 q4, d15, d1 + vext.32 d10, d13, d12, #0 + vmlal.s32 q4, d29, d19 + vmov.i64 d11, #0 + vmlal.s32 q5, d20, d6 + vmlal.s32 q5, d21, d5 + vmlal.s32 q5, d26, d4 + vext.32 d13, d8, d8, #0 + vmlal.s32 q5, d27, d1 + vmov.i64 d12, #0 + vmlal.s32 q5, d28, d19 + vst1.8 d9, [r7, : 64]! + vmlal.s32 q6, d16, d25 + vmlal.s32 q6, d17, d6 + vst1.8 d10, [r7, : 64] + vmlal.s32 q6, d14, d5 + vext.32 d8, d11, d10, #0 + vmlal.s32 q6, d15, d4 + vmov.i64 d9, #0 + vmlal.s32 q6, d29, d1 + vmlal.s32 q4, d20, d7 + vmlal.s32 q4, d21, d6 + vmlal.s32 q4, d26, d5 + vext.32 d11, d12, d12, #0 + vmlal.s32 q4, d27, d4 + vmov.i64 d10, #0 + vmlal.s32 q4, d28, d1 + vmlal.s32 q5, d16, d0 + sub r6, r7, #32 + vmlal.s32 q5, d17, d7 + vmlal.s32 q5, d14, d6 + vext.32 d30, d9, d8, #0 + vmlal.s32 q5, d15, d5 + vld1.8 {d31}, [r6, : 64]! + vmlal.s32 q5, d29, d4 + vmlal.s32 q15, d20, d0 + vext.32 d0, d6, d18, #1 + vmlal.s32 q15, d21, d25 + vrev64.i32 d0, d0 + vmlal.s32 q15, d26, d24 + vext.32 d1, d7, d19, #1 + vext.32 d7, d10, d10, #0 + vmlal.s32 q15, d27, d23 + vrev64.i32 d1, d1 + vld1.8 {d6}, [r6, : 64] + vmlal.s32 q15, d28, d22 + vmlal.s32 q3, d16, d4 + add r6, r6, #24 + vmlal.s32 q3, d17, d2 + vext.32 d4, d31, d30, #0 + vmov d17, d11 + vmlal.s32 q3, d14, d1 + vext.32 d11, d13, d13, #0 + vext.32 d13, d30, d30, #0 + vmlal.s32 q3, d15, d0 + vext.32 d1, d8, d8, #0 + vmlal.s32 q3, d29, d3 + vld1.8 {d5}, [r6, : 64] + sub r6, r6, #16 + vext.32 d10, d6, d6, #0 + vmov.i32 q1, #0xffffffff + vshl.i64 q4, q1, #25 + add r7, sp, #512 + vld1.8 {d14-d15}, [r7, : 128] + vadd.i64 q9, q2, q7 + vshl.i64 q1, q1, #26 + vshr.s64 q10, q9, #26 + vld1.8 {d0}, [r6, : 64]! + vadd.i64 q5, q5, q10 + vand q9, q9, q1 + vld1.8 {d16}, [r6, : 64]! 
+ add r6, sp, #528 + vld1.8 {d20-d21}, [r6, : 128] + vadd.i64 q11, q5, q10 + vsub.i64 q2, q2, q9 + vshr.s64 q9, q11, #25 + vext.32 d12, d5, d4, #0 + vand q11, q11, q4 + vadd.i64 q0, q0, q9 + vmov d19, d7 + vadd.i64 q3, q0, q7 + vsub.i64 q5, q5, q11 + vshr.s64 q11, q3, #26 + vext.32 d18, d11, d10, #0 + vand q3, q3, q1 + vadd.i64 q8, q8, q11 + vadd.i64 q11, q8, q10 + vsub.i64 q0, q0, q3 + vshr.s64 q3, q11, #25 + vand q11, q11, q4 + vadd.i64 q3, q6, q3 + vadd.i64 q6, q3, q7 + vsub.i64 q8, q8, q11 + vshr.s64 q11, q6, #26 + vand q6, q6, q1 + vadd.i64 q9, q9, q11 + vadd.i64 d25, d19, d21 + vsub.i64 q3, q3, q6 + vshr.s64 d23, d25, #25 + vand q4, q12, q4 + vadd.i64 d21, d23, d23 + vshl.i64 d25, d23, #4 + vadd.i64 d21, d21, d23 + vadd.i64 d25, d25, d21 + vadd.i64 d4, d4, d25 + vzip.i32 q0, q8 + vadd.i64 d12, d4, d14 + add r6, r8, #8 + vst1.8 d0, [r6, : 64] + vsub.i64 d19, d19, d9 + add r6, r6, #16 + vst1.8 d16, [r6, : 64] + vshr.s64 d22, d12, #26 + vand q0, q6, q1 + vadd.i64 d10, d10, d22 + vzip.i32 q3, q9 + vsub.i64 d4, d4, d0 + sub r6, r6, #8 + vst1.8 d6, [r6, : 64] + add r6, r6, #16 + vst1.8 d18, [r6, : 64] + vzip.i32 q2, q5 + sub r6, r6, #32 + vst1.8 d4, [r6, : 64] + subs r5, r5, #1 + bhi ._squaringloop +._skipsquaringloop: + mov r2, r2 + add r5, r3, #288 + add r6, r3, #144 + vmov.i32 q0, #19 + vmov.i32 q1, #0 + vmov.i32 q2, #1 + vzip.i32 q1, q2 + vld1.8 {d4-d5}, [r5, : 128]! + vld1.8 {d6-d7}, [r5, : 128]! + vld1.8 {d9}, [r5, : 64] + vld1.8 {d10-d11}, [r2, : 128]! + add r5, sp, #416 + vld1.8 {d12-d13}, [r2, : 128]! + vmul.i32 q7, q2, q0 + vld1.8 {d8}, [r2, : 64] + vext.32 d17, d11, d10, #1 + vmul.i32 q9, q3, q0 + vext.32 d16, d10, d8, #1 + vshl.u32 q10, q5, q1 + vext.32 d22, d14, d4, #1 + vext.32 d24, d18, d6, #1 + vshl.u32 q13, q6, q1 + vshl.u32 d28, d8, d2 + vrev64.i32 d22, d22 + vmul.i32 d1, d9, d1 + vrev64.i32 d24, d24 + vext.32 d29, d8, d13, #1 + vext.32 d0, d1, d9, #1 + vrev64.i32 d0, d0 + vext.32 d2, d9, d1, #1 + vext.32 d23, d15, d5, #1 + vmull.s32 q4, d20, d4 + vrev64.i32 d23, d23 + vmlal.s32 q4, d21, d1 + vrev64.i32 d2, d2 + vmlal.s32 q4, d26, d19 + vext.32 d3, d5, d15, #1 + vmlal.s32 q4, d27, d18 + vrev64.i32 d3, d3 + vmlal.s32 q4, d28, d15 + vext.32 d14, d12, d11, #1 + vmull.s32 q5, d16, d23 + vext.32 d15, d13, d12, #1 + vmlal.s32 q5, d17, d4 + vst1.8 d8, [r5, : 64]! + vmlal.s32 q5, d14, d1 + vext.32 d12, d9, d8, #0 + vmlal.s32 q5, d15, d19 + vmov.i64 d13, #0 + vmlal.s32 q5, d29, d18 + vext.32 d25, d19, d7, #1 + vmlal.s32 q6, d20, d5 + vrev64.i32 d25, d25 + vmlal.s32 q6, d21, d4 + vst1.8 d11, [r5, : 64]! + vmlal.s32 q6, d26, d1 + vext.32 d9, d10, d10, #0 + vmlal.s32 q6, d27, d19 + vmov.i64 d8, #0 + vmlal.s32 q6, d28, d18 + vmlal.s32 q4, d16, d24 + vmlal.s32 q4, d17, d5 + vmlal.s32 q4, d14, d4 + vst1.8 d12, [r5, : 64]! + vmlal.s32 q4, d15, d1 + vext.32 d10, d13, d12, #0 + vmlal.s32 q4, d29, d19 + vmov.i64 d11, #0 + vmlal.s32 q5, d20, d6 + vmlal.s32 q5, d21, d5 + vmlal.s32 q5, d26, d4 + vext.32 d13, d8, d8, #0 + vmlal.s32 q5, d27, d1 + vmov.i64 d12, #0 + vmlal.s32 q5, d28, d19 + vst1.8 d9, [r5, : 64]! 
+ vmlal.s32 q6, d16, d25 + vmlal.s32 q6, d17, d6 + vst1.8 d10, [r5, : 64] + vmlal.s32 q6, d14, d5 + vext.32 d8, d11, d10, #0 + vmlal.s32 q6, d15, d4 + vmov.i64 d9, #0 + vmlal.s32 q6, d29, d1 + vmlal.s32 q4, d20, d7 + vmlal.s32 q4, d21, d6 + vmlal.s32 q4, d26, d5 + vext.32 d11, d12, d12, #0 + vmlal.s32 q4, d27, d4 + vmov.i64 d10, #0 + vmlal.s32 q4, d28, d1 + vmlal.s32 q5, d16, d0 + sub r2, r5, #32 + vmlal.s32 q5, d17, d7 + vmlal.s32 q5, d14, d6 + vext.32 d30, d9, d8, #0 + vmlal.s32 q5, d15, d5 + vld1.8 {d31}, [r2, : 64]! + vmlal.s32 q5, d29, d4 + vmlal.s32 q15, d20, d0 + vext.32 d0, d6, d18, #1 + vmlal.s32 q15, d21, d25 + vrev64.i32 d0, d0 + vmlal.s32 q15, d26, d24 + vext.32 d1, d7, d19, #1 + vext.32 d7, d10, d10, #0 + vmlal.s32 q15, d27, d23 + vrev64.i32 d1, d1 + vld1.8 {d6}, [r2, : 64] + vmlal.s32 q15, d28, d22 + vmlal.s32 q3, d16, d4 + add r2, r2, #24 + vmlal.s32 q3, d17, d2 + vext.32 d4, d31, d30, #0 + vmov d17, d11 + vmlal.s32 q3, d14, d1 + vext.32 d11, d13, d13, #0 + vext.32 d13, d30, d30, #0 + vmlal.s32 q3, d15, d0 + vext.32 d1, d8, d8, #0 + vmlal.s32 q3, d29, d3 + vld1.8 {d5}, [r2, : 64] + sub r2, r2, #16 + vext.32 d10, d6, d6, #0 + vmov.i32 q1, #0xffffffff + vshl.i64 q4, q1, #25 + add r5, sp, #512 + vld1.8 {d14-d15}, [r5, : 128] + vadd.i64 q9, q2, q7 + vshl.i64 q1, q1, #26 + vshr.s64 q10, q9, #26 + vld1.8 {d0}, [r2, : 64]! + vadd.i64 q5, q5, q10 + vand q9, q9, q1 + vld1.8 {d16}, [r2, : 64]! + add r2, sp, #528 + vld1.8 {d20-d21}, [r2, : 128] + vadd.i64 q11, q5, q10 + vsub.i64 q2, q2, q9 + vshr.s64 q9, q11, #25 + vext.32 d12, d5, d4, #0 + vand q11, q11, q4 + vadd.i64 q0, q0, q9 + vmov d19, d7 + vadd.i64 q3, q0, q7 + vsub.i64 q5, q5, q11 + vshr.s64 q11, q3, #26 + vext.32 d18, d11, d10, #0 + vand q3, q3, q1 + vadd.i64 q8, q8, q11 + vadd.i64 q11, q8, q10 + vsub.i64 q0, q0, q3 + vshr.s64 q3, q11, #25 + vand q11, q11, q4 + vadd.i64 q3, q6, q3 + vadd.i64 q6, q3, q7 + vsub.i64 q8, q8, q11 + vshr.s64 q11, q6, #26 + vand q6, q6, q1 + vadd.i64 q9, q9, q11 + vadd.i64 d25, d19, d21 + vsub.i64 q3, q3, q6 + vshr.s64 d23, d25, #25 + vand q4, q12, q4 + vadd.i64 d21, d23, d23 + vshl.i64 d25, d23, #4 + vadd.i64 d21, d21, d23 + vadd.i64 d25, d25, d21 + vadd.i64 d4, d4, d25 + vzip.i32 q0, q8 + vadd.i64 d12, d4, d14 + add r2, r6, #8 + vst1.8 d0, [r2, : 64] + vsub.i64 d19, d19, d9 + add r2, r2, #16 + vst1.8 d16, [r2, : 64] + vshr.s64 d22, d12, #26 + vand q0, q6, q1 + vadd.i64 d10, d10, d22 + vzip.i32 q3, q9 + vsub.i64 d4, d4, d0 + sub r2, r2, #8 + vst1.8 d6, [r2, : 64] + add r2, r2, #16 + vst1.8 d18, [r2, : 64] + vzip.i32 q2, q5 + sub r2, r2, #32 + vst1.8 d4, [r2, : 64] + cmp r4, #0 + beq ._skippostcopy + add r2, r3, #144 + mov r4, r4 + vld1.8 {d0-d1}, [r2, : 128]! + vld1.8 {d2-d3}, [r2, : 128]! + vld1.8 {d4}, [r2, : 64] + vst1.8 {d0-d1}, [r4, : 128]! + vst1.8 {d2-d3}, [r4, : 128]! + vst1.8 d4, [r4, : 64] +._skippostcopy: + cmp r1, #1 + bne ._skipfinalcopy + add r2, r3, #288 + add r4, r3, #144 + vld1.8 {d0-d1}, [r2, : 128]! + vld1.8 {d2-d3}, [r2, : 128]! + vld1.8 {d4}, [r2, : 64] + vst1.8 {d0-d1}, [r4, : 128]! + vst1.8 {d2-d3}, [r4, : 128]! 
+ vst1.8 d4, [r4, : 64] +._skipfinalcopy: + add r1, r1, #1 + cmp r1, #12 + blo ._invertloop + add r1, r3, #144 + ldr r2, [r1], #4 + ldr r3, [r1], #4 + ldr r4, [r1], #4 + ldr r5, [r1], #4 + ldr r6, [r1], #4 + ldr r7, [r1], #4 + ldr r8, [r1], #4 + ldr r9, [r1], #4 + ldr r10, [r1], #4 + ldr r1, [r1] + add r11, r1, r1, LSL #4 + add r11, r11, r1, LSL #1 + add r11, r11, #16777216 + mov r11, r11, ASR #25 + add r11, r11, r2 + mov r11, r11, ASR #26 + add r11, r11, r3 + mov r11, r11, ASR #25 + add r11, r11, r4 + mov r11, r11, ASR #26 + add r11, r11, r5 + mov r11, r11, ASR #25 + add r11, r11, r6 + mov r11, r11, ASR #26 + add r11, r11, r7 + mov r11, r11, ASR #25 + add r11, r11, r8 + mov r11, r11, ASR #26 + add r11, r11, r9 + mov r11, r11, ASR #25 + add r11, r11, r10 + mov r11, r11, ASR #26 + add r11, r11, r1 + mov r11, r11, ASR #25 + add r2, r2, r11 + add r2, r2, r11, LSL #1 + add r2, r2, r11, LSL #4 + mov r11, r2, ASR #26 + add r3, r3, r11 + sub r2, r2, r11, LSL #26 + mov r11, r3, ASR #25 + add r4, r4, r11 + sub r3, r3, r11, LSL #25 + mov r11, r4, ASR #26 + add r5, r5, r11 + sub r4, r4, r11, LSL #26 + mov r11, r5, ASR #25 + add r6, r6, r11 + sub r5, r5, r11, LSL #25 + mov r11, r6, ASR #26 + add r7, r7, r11 + sub r6, r6, r11, LSL #26 + mov r11, r7, ASR #25 + add r8, r8, r11 + sub r7, r7, r11, LSL #25 + mov r11, r8, ASR #26 + add r9, r9, r11 + sub r8, r8, r11, LSL #26 + mov r11, r9, ASR #25 + add r10, r10, r11 + sub r9, r9, r11, LSL #25 + mov r11, r10, ASR #26 + add r1, r1, r11 + sub r10, r10, r11, LSL #26 + mov r11, r1, ASR #25 + sub r1, r1, r11, LSL #25 + add r2, r2, r3, LSL #26 + mov r3, r3, LSR #6 + add r3, r3, r4, LSL #19 + mov r4, r4, LSR #13 + add r4, r4, r5, LSL #13 + mov r5, r5, LSR #19 + add r5, r5, r6, LSL #6 + add r6, r7, r8, LSL #25 + mov r7, r8, LSR #7 + add r7, r7, r9, LSL #19 + mov r8, r9, LSR #13 + add r8, r8, r10, LSL #12 + mov r9, r10, LSR #20 + add r1, r9, r1, LSL #6 + str r2, [r0], #4 + str r3, [r0], #4 + str r4, [r0], #4 + str r5, [r0], #4 + str r6, [r0], #4 + str r7, [r0], #4 + str r8, [r0], #4 + str r1, [r0] + ldrd r4, [sp, #0] + ldrd r6, [sp, #8] + ldrd r8, [sp, #16] + ldrd r10, [sp, #24] + ldr r12, [sp, #480] + ldr r14, [sp, #484] + ldr r0, =0 + mov sp, r12 + vpop {q4, q5, q6, q7} + bx lr From patchwork Sat Oct 6 02:57:05 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Jason A. 
Donenfeld" X-Patchwork-Id: 148323 Delivered-To: patch@linaro.org Received: by 2002:a2e:8595:0:0:0:0:0 with SMTP id b21-v6csp1158524lji; Fri, 5 Oct 2018 19:59:38 -0700 (PDT) X-Google-Smtp-Source: ACcGV61z8QOgIK4uYS7uBD7qp+KOWUUAy91Bm5BHQlo6s/gUpbyZ/V4SZH8EGjTI70w84T4KPXPQ X-Received: by 2002:a17:902:7290:: with SMTP id d16-v6mr397892pll.90.1538794778437; Fri, 05 Oct 2018 19:59:38 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1538794778; cv=none; d=google.com; s=arc-20160816; b=ZCKfQwo7Jlug30gQQlYRSV6/rfqMBmLnX3I0XPGo4j/YO83oITjOuLh58z5VztlQF0 oWASxPZYx+l51s1S2RC+hc7ZzdtXCKnaxx2lPxj11M53Zs9WynW+F0vq/eJj/O8IUYEp fBP/qhIheBTKOBMtPMi8xHLmkI8055R1F+I+NvN84oI2sumwJVThUavsTwxHEuncTkVn mQgLaGohIz/wLKwjOaJnPyxxGz7aqeJeVgpyQNTlONkvB+mGRtrtyW8ZkTT/xD+F2eIc SagKWbSJ+MY3PPMqPLSr+bZCV/jCRitQCW0a/ojCfcnrIKnGe24926AQbxVU9TOz2cTY WgUQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from :dkim-signature; bh=snp86H9Wt1k+LO0cwOfGW5XxsZ6qJ+/u2aNZyQLOGng=; b=o3kEj0cMVTcLjqHtSgzksNPmrg2rsruqcSo3tCImAuyNBco50i2skCWu5SS1Gcrx6a A26t4c/Qa+KmJnmpreYLxcwQiv7/zhFzUAjh86MPFcPL1GXokJqwnc00m9misYhvYupQ h7f85Mdmp3WWP1FlGBp6kZ//BLrrbaxV0D8uGjIsKJAkcMqwYkKdwsXP+J9blI472pii yvFvuzuac15mk80wOnKSs+yfy9+fFa2SibcBkacpKA/JPQR3tps7bn+qOtLNbmz2UnFP dFTM4r8wn0NNp7yLIxokMqdqkOKWPCouqsuih3lDbuPKUDHC+xCpCS1PUNyEgZ/OF7ON R/YA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@zx2c4.com header.s=mail header.b=HOdLk9dE; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=zx2c4.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. 
[209.132.180.67]) by mx.google.com with ESMTP id v11-v6si11637928pfl.233.2018.10.05.19.59.38; Fri, 05 Oct 2018 19:59:38 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@zx2c4.com header.s=mail header.b=HOdLk9dE; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=zx2c4.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1730012AbeJFKBJ (ORCPT + 32 others); Sat, 6 Oct 2018 06:01:09 -0400 Received: from frisell.zx2c4.com ([192.95.5.64]:60175 "EHLO frisell.zx2c4.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729724AbeJFKAV (ORCPT ); Sat, 6 Oct 2018 06:00:21 -0400 Received: by frisell.zx2c4.com (ZX2C4 Mail Server) with ESMTP id 6887831d; Sat, 6 Oct 2018 02:58:13 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=zx2c4.com; h=from:to:cc :subject:date:message-id:in-reply-to:references:mime-version :content-transfer-encoding; s=mail; bh=gpka6FQ0gcJFjS9SWNf8T8eVf c0=; b=HOdLk9dEbSCojSXuci3UUUMek/Coq88rT+/c/ZbJFLeD16oSHlZi51uSI EYBAHdDtQFx49o5H8FemG4Vh/qQE5nKCtgmlaP36FUXuO7LoyNRBxVQvdCNTPxXL AldZDOJUY7U6aMuucnXbJwNNyLoJAty+x/rb4fuubQgxFtHbMEb/s1cR6hMmmpON 2OY5cDspmz07tsbqqhFXuMlTESPz2R53LDc4oSbclaXVDSnX1sQdMJQkDPJ6Bt45 HxONmyozLg8IR45RtZTW7tz9ZY66rRnS24RDjtgxXrFWU7sebDGq5uCr5yJyfuPq /aqqJ2wvB44YlV3tXOQaOafPsqOEg== Received: by frisell.zx2c4.com (ZX2C4 Mail Server) with ESMTPSA id acda0fb3 (TLSv1.2:ECDHE-RSA-AES256-GCM-SHA384:256:NO); Sat, 6 Oct 2018 02:58:12 +0000 (UTC) From: "Jason A. Donenfeld" To: linux-kernel@vger.kernel.org, netdev@vger.kernel.org, davem@davemloft.net, gregkh@linuxfoundation.org Cc: "Jason A. Donenfeld" , Russell King , linux-arm-kernel@lists.infradead.org, Samuel Neves , Jean-Philippe Aumasson , Andy Lutomirski , Andrew Morton , Linus Torvalds , kernel-hardening@lists.openwall.com, linux-crypto@vger.kernel.org Subject: [PATCH net-next v7 24/28] zinc: Curve25519 ARM implementation Date: Sat, 6 Oct 2018 04:57:05 +0200 Message-Id: <20181006025709.4019-25-Jason@zx2c4.com> In-Reply-To: <20181006025709.4019-1-Jason@zx2c4.com> References: <20181006025709.4019-1-Jason@zx2c4.com> MIME-Version: 1.0 Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org This ports the SUPERCOP implementation for usage in kernel space. In addition to the usual header, macro, and style changes required for kernel space, it makes a few small changes to the code: - The stack alignment is relaxed to 16 bytes. - Superfluous mov statements have been removed. - ldr for constants has been replaced with movw. - ldreq has been replaced with moveq. - The str epilogue has been made more idiomatic. - SIMD registers are not pushed and popped at the beginning and end. - The prologue and epilogue have been made idiomatic. - A hole has been removed from the stack, saving 32 bytes. - We write-back the base register whenever possible for vld1.8. - Some multiplications have been reordered for better A7 performance. There are more opportunities for cleanup, since this code is from qhasm, which doesn't always do the most opportune thing. 
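For a sense of how per-call figures like the ones below can be gathered,
here is a minimal, purely hypothetical C sketch around the kernel's
get_cycles() interface. It is not the harness that produced these numbers;
the function name, the fixed inputs, and the single-shot measurement are
all illustrative (a real harness would average many calls on a
frequency-pinned CPU):

#include <linux/linkage.h>   /* asmlinkage */
#include <linux/timex.h>     /* get_cycles(), cycles_t */
#include <linux/types.h>     /* u8 */
#include <zinc/curve25519.h> /* CURVE25519_KEY_SIZE, added earlier in this series */

/* The NEON entry point exported by the assembly in this patch. */
asmlinkage void curve25519_neon(u8 mypublic[CURVE25519_KEY_SIZE],
				const u8 secret[CURVE25519_KEY_SIZE],
				const u8 basepoint[CURVE25519_KEY_SIZE]);

/* Cycle cost of one scalar multiplication. The buffers are placeholders;
 * the caller is assumed to hold a SIMD/NEON-usable context. */
static cycles_t curve25519_neon_one_call(void)
{
	u8 pub[CURVE25519_KEY_SIZE];
	static const u8 secret[CURVE25519_KEY_SIZE] = { 1 };
	static const u8 basepoint[CURVE25519_KEY_SIZE] = { 9 }; /* generator */
	cycles_t start = get_cycles();

	curve25519_neon(pub, secret, basepoint);
	return get_cycles() - start;
}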
But even prior to extensive hand optimizations, this code delivers
significant performance improvements (given in get_cycles() per call):

 ------------ ----------- -------------
 |            | generic C | this commit |
 ------------ ----------- -------------
 | Cortex-A7  |     49136 |       22395 |
 ------------ ----------- -------------
 | Cortex-A17 |     17326 |        4983 |
 ------------ ----------- -------------

Signed-off-by: Jason A. Donenfeld
Cc: Russell King
Cc: linux-arm-kernel@lists.infradead.org
Cc: Samuel Neves
Cc: Jean-Philippe Aumasson
Cc: Andy Lutomirski
Cc: Greg KH
Cc: Andrew Morton
Cc: Linus Torvalds
Cc: kernel-hardening@lists.openwall.com
Cc: linux-crypto@vger.kernel.org
---
 lib/zinc/Makefile                             |   1 +
 lib/zinc/curve25519/curve25519-arm-glue.c     |  43 +++
 ...e25519-arm-supercop.S => curve25519-arm.S} | 349 ++++++++----------
 lib/zinc/curve25519/curve25519.c              |   2 +
 4 files changed, 200 insertions(+), 195 deletions(-)
 create mode 100644 lib/zinc/curve25519/curve25519-arm-glue.c
 rename lib/zinc/curve25519/{curve25519-arm-supercop.S => curve25519-arm.S} (92%)

-- 
2.19.0

diff --git a/lib/zinc/Makefile b/lib/zinc/Makefile
index 65440438c6e5..be73c342f9ba 100644
--- a/lib/zinc/Makefile
+++ b/lib/zinc/Makefile
@@ -27,4 +27,5 @@ zinc_blake2s-$(CONFIG_ZINC_ARCH_X86_64) += blake2s/blake2s-x86_64.o
 obj-$(CONFIG_ZINC_BLAKE2S) += zinc_blake2s.o
 
 zinc_curve25519-y := curve25519/curve25519.o
+zinc_curve25519-$(CONFIG_ZINC_ARCH_ARM) += curve25519/curve25519-arm.o
 obj-$(CONFIG_ZINC_CURVE25519) += zinc_curve25519.o
diff --git a/lib/zinc/curve25519/curve25519-arm-glue.c b/lib/zinc/curve25519/curve25519-arm-glue.c
new file mode 100644
index 000000000000..c71c981c3ba9
--- /dev/null
+++ b/lib/zinc/curve25519/curve25519-arm-glue.c
@@ -0,0 +1,43 @@
+// SPDX-License-Identifier: GPL-2.0 OR MIT
+/*
+ * Copyright (C) 2015-2018 Jason A. Donenfeld . All Rights Reserved.
+ */
+
+#include
+#include
+#include
+
+asmlinkage void curve25519_neon(u8 mypublic[CURVE25519_KEY_SIZE],
+				const u8 secret[CURVE25519_KEY_SIZE],
+				const u8 basepoint[CURVE25519_KEY_SIZE]);
+
+static bool curve25519_use_neon __ro_after_init;
+static bool *const curve25519_nobs[] __initconst = { &curve25519_use_neon };
+static void __init curve25519_fpu_init(void)
+{
+	curve25519_use_neon = elf_hwcap & HWCAP_NEON;
+}
+
+static inline bool curve25519_arch(u8 mypublic[CURVE25519_KEY_SIZE],
+				   const u8 secret[CURVE25519_KEY_SIZE],
+				   const u8 basepoint[CURVE25519_KEY_SIZE])
+{
+	simd_context_t simd_context;
+	bool used_arch = false;
+
+	simd_get(&simd_context);
+	if (IS_ENABLED(CONFIG_KERNEL_MODE_NEON) &&
+	    !IS_ENABLED(CONFIG_CPU_BIG_ENDIAN) && curve25519_use_neon &&
+	    simd_use(&simd_context)) {
+		curve25519_neon(mypublic, secret, basepoint);
+		used_arch = true;
+	}
+	simd_put(&simd_context);
+	return used_arch;
+}
+
+static inline bool curve25519_base_arch(u8 pub[CURVE25519_KEY_SIZE],
+					const u8 secret[CURVE25519_KEY_SIZE])
+{
+	return false;
+}
diff --git a/lib/zinc/curve25519/curve25519-arm-supercop.S b/lib/zinc/curve25519/curve25519-arm.S
similarity index 92%
rename from lib/zinc/curve25519/curve25519-arm-supercop.S
rename to lib/zinc/curve25519/curve25519-arm.S
index f33b85fef382..b63ac48e7f8d 100644
--- a/lib/zinc/curve25519/curve25519-arm-supercop.S
+++ b/lib/zinc/curve25519/curve25519-arm.S
@@ -1,43 +1,36 @@
+/* SPDX-License-Identifier: GPL-2.0 OR MIT */
 /*
- * Public domain code from Daniel J. Bernstein and Peter Schwabe, from
- * SUPERCOP's curve25519/neon2/scalarmult.s.
+ * Copyright (C) 2015-2018 Jason A. Donenfeld . All Rights Reserved.
+ *
+ * Based on public domain code from Daniel J.
Bernstein and Peter Schwabe. This + * began from SUPERCOP's curve25519/neon2/scalarmult.s, but has subsequently been + * manually reworked for use in kernel space. */ -.fpu neon +#if defined(CONFIG_KERNEL_MODE_NEON) && !defined(__ARMEB__) +#include + .text +.fpu neon +.arch armv7-a .align 4 -.global _crypto_scalarmult_curve25519_neon2 -.global crypto_scalarmult_curve25519_neon2 -.type _crypto_scalarmult_curve25519_neon2 STT_FUNC -.type crypto_scalarmult_curve25519_neon2 STT_FUNC - _crypto_scalarmult_curve25519_neon2: - crypto_scalarmult_curve25519_neon2: - vpush {q4, q5, q6, q7} - mov r12, sp - sub sp, sp, #736 - and sp, sp, #0xffffffe0 - strd r4, [sp, #0] - strd r6, [sp, #8] - strd r8, [sp, #16] - strd r10, [sp, #24] - str r12, [sp, #480] - str r14, [sp, #484] - mov r0, r0 - mov r1, r1 - mov r2, r2 - add r3, sp, #32 - ldr r4, =0 - ldr r5, =254 + +ENTRY(curve25519_neon) + push {r4-r11, lr} + mov ip, sp + sub r3, sp, #704 + and r3, r3, #0xfffffff0 + mov sp, r3 + movw r4, #0 + movw r5, #254 vmov.i32 q0, #1 vshr.u64 q1, q0, #7 vshr.u64 q0, q0, #8 vmov.i32 d4, #19 vmov.i32 d5, #38 - add r6, sp, #512 - vst1.8 {d2-d3}, [r6, : 128] - add r6, sp, #528 - vst1.8 {d0-d1}, [r6, : 128] - add r6, sp, #544 + add r6, sp, #480 + vst1.8 {d2-d3}, [r6, : 128]! + vst1.8 {d0-d1}, [r6, : 128]! vst1.8 {d4-d5}, [r6, : 128] add r6, r3, #0 vmov.i32 q2, #0 @@ -45,12 +38,12 @@ vst1.8 {d4-d5}, [r6, : 128]! vst1.8 d4, [r6, : 64] add r6, r3, #0 - ldr r7, =960 + movw r7, #960 sub r7, r7, #2 neg r7, r7 sub r7, r7, r7, LSL #7 str r7, [r6] - add r6, sp, #704 + add r6, sp, #672 vld1.8 {d4-d5}, [r1]! vld1.8 {d6-d7}, [r1] vst1.8 {d4-d5}, [r6, : 128]! @@ -212,15 +205,15 @@ vst1.8 {d0-d1}, [r6, : 128]! vst1.8 {d2-d3}, [r6, : 128]! vst1.8 d4, [r6, : 64] -._mainloop: +.Lmainloop: mov r2, r5, LSR #3 and r6, r5, #7 ldrb r2, [r1, r2] mov r2, r2, LSR r6 and r2, r2, #1 - str r5, [sp, #488] + str r5, [sp, #456] eor r4, r4, r2 - str r2, [sp, #492] + str r2, [sp, #460] neg r2, r4 add r4, r3, #96 add r5, r3, #192 @@ -291,7 +284,7 @@ vsub.i32 q0, q1, q3 vst1.8 d4, [r4, : 64] vst1.8 d0, [r6, : 64] - add r2, sp, #544 + add r2, sp, #512 add r4, r3, #96 add r5, r3, #144 vld1.8 {d0-d1}, [r2, : 128] @@ -361,14 +354,13 @@ vmlal.s32 q0, d12, d8 vmlal.s32 q0, d13, d17 vmlal.s32 q0, d6, d6 - add r2, sp, #512 - vld1.8 {d18-d19}, [r2, : 128] + add r2, sp, #480 + vld1.8 {d18-d19}, [r2, : 128]! vmull.s32 q3, d16, d7 vmlal.s32 q3, d10, d15 vmlal.s32 q3, d11, d14 vmlal.s32 q3, d12, d9 vmlal.s32 q3, d13, d8 - add r2, sp, #528 vld1.8 {d8-d9}, [r2, : 128] vadd.i64 q5, q12, q9 vadd.i64 q6, q15, q9 @@ -502,22 +494,19 @@ vadd.i32 q5, q5, q0 vtrn.32 q11, q14 vadd.i32 q6, q6, q3 - add r2, sp, #560 + add r2, sp, #528 vadd.i32 q10, q10, q2 vtrn.32 d24, d25 - vst1.8 {d12-d13}, [r2, : 128] + vst1.8 {d12-d13}, [r2, : 128]! vshl.i32 q6, q13, #1 - add r2, sp, #576 - vst1.8 {d20-d21}, [r2, : 128] + vst1.8 {d20-d21}, [r2, : 128]! vshl.i32 q10, q14, #1 - add r2, sp, #592 - vst1.8 {d12-d13}, [r2, : 128] + vst1.8 {d12-d13}, [r2, : 128]! vshl.i32 q15, q12, #1 vadd.i32 q8, q8, q4 vext.32 d10, d31, d30, #0 vadd.i32 q7, q7, q1 - add r2, sp, #608 - vst1.8 {d16-d17}, [r2, : 128] + vst1.8 {d16-d17}, [r2, : 128]! vmull.s32 q8, d18, d5 vmlal.s32 q8, d26, d4 vmlal.s32 q8, d19, d9 @@ -528,8 +517,7 @@ vmlal.s32 q8, d29, d1 vmlal.s32 q8, d24, d6 vmlal.s32 q8, d25, d0 - add r2, sp, #624 - vst1.8 {d14-d15}, [r2, : 128] + vst1.8 {d14-d15}, [r2, : 128]! 
vmull.s32 q2, d18, d4 vmlal.s32 q2, d12, d9 vmlal.s32 q2, d13, d8 @@ -537,8 +525,7 @@ vmlal.s32 q2, d22, d2 vmlal.s32 q2, d23, d1 vmlal.s32 q2, d24, d0 - add r2, sp, #640 - vst1.8 {d20-d21}, [r2, : 128] + vst1.8 {d20-d21}, [r2, : 128]! vmull.s32 q7, d18, d9 vmlal.s32 q7, d26, d3 vmlal.s32 q7, d19, d8 @@ -547,14 +534,12 @@ vmlal.s32 q7, d28, d1 vmlal.s32 q7, d23, d6 vmlal.s32 q7, d29, d0 - add r2, sp, #656 - vst1.8 {d10-d11}, [r2, : 128] + vst1.8 {d10-d11}, [r2, : 128]! vmull.s32 q5, d18, d3 vmlal.s32 q5, d19, d2 vmlal.s32 q5, d22, d1 vmlal.s32 q5, d23, d0 vmlal.s32 q5, d12, d8 - add r2, sp, #672 vst1.8 {d16-d17}, [r2, : 128] vmull.s32 q4, d18, d8 vmlal.s32 q4, d26, d2 @@ -566,7 +551,7 @@ vmlal.s32 q8, d26, d1 vmlal.s32 q8, d19, d6 vmlal.s32 q8, d27, d0 - add r2, sp, #576 + add r2, sp, #544 vld1.8 {d20-d21}, [r2, : 128] vmlal.s32 q7, d24, d21 vmlal.s32 q7, d25, d20 @@ -575,32 +560,30 @@ vmlal.s32 q8, d22, d21 vmlal.s32 q8, d28, d20 vmlal.s32 q5, d24, d20 - add r2, sp, #576 vst1.8 {d14-d15}, [r2, : 128] vmull.s32 q7, d18, d6 vmlal.s32 q7, d26, d0 - add r2, sp, #656 + add r2, sp, #624 vld1.8 {d30-d31}, [r2, : 128] vmlal.s32 q2, d30, d21 vmlal.s32 q7, d19, d21 vmlal.s32 q7, d27, d20 - add r2, sp, #624 + add r2, sp, #592 vld1.8 {d26-d27}, [r2, : 128] vmlal.s32 q4, d25, d27 vmlal.s32 q8, d29, d27 vmlal.s32 q8, d25, d26 vmlal.s32 q7, d28, d27 vmlal.s32 q7, d29, d26 - add r2, sp, #608 + add r2, sp, #576 vld1.8 {d28-d29}, [r2, : 128] vmlal.s32 q4, d24, d29 vmlal.s32 q8, d23, d29 vmlal.s32 q8, d24, d28 vmlal.s32 q7, d22, d29 vmlal.s32 q7, d23, d28 - add r2, sp, #608 vst1.8 {d8-d9}, [r2, : 128] - add r2, sp, #560 + add r2, sp, #528 vld1.8 {d8-d9}, [r2, : 128] vmlal.s32 q7, d24, d9 vmlal.s32 q7, d25, d31 @@ -621,36 +604,36 @@ vmlal.s32 q0, d23, d26 vmlal.s32 q0, d24, d31 vmlal.s32 q0, d19, d20 - add r2, sp, #640 + add r2, sp, #608 vld1.8 {d18-d19}, [r2, : 128] vmlal.s32 q2, d18, d7 - vmlal.s32 q2, d19, d6 vmlal.s32 q5, d18, d6 - vmlal.s32 q5, d19, d21 vmlal.s32 q1, d18, d21 - vmlal.s32 q1, d19, d29 vmlal.s32 q0, d18, d28 - vmlal.s32 q0, d19, d9 vmlal.s32 q6, d18, d29 + vmlal.s32 q2, d19, d6 + vmlal.s32 q5, d19, d21 + vmlal.s32 q1, d19, d29 + vmlal.s32 q0, d19, d9 vmlal.s32 q6, d19, d28 - add r2, sp, #592 + add r2, sp, #560 vld1.8 {d18-d19}, [r2, : 128] - add r2, sp, #512 + add r2, sp, #480 vld1.8 {d22-d23}, [r2, : 128] vmlal.s32 q5, d19, d7 vmlal.s32 q0, d18, d21 vmlal.s32 q0, d19, d29 vmlal.s32 q6, d18, d6 - add r2, sp, #528 + add r2, sp, #496 vld1.8 {d6-d7}, [r2, : 128] vmlal.s32 q6, d19, d21 - add r2, sp, #576 + add r2, sp, #544 vld1.8 {d18-d19}, [r2, : 128] vmlal.s32 q0, d30, d8 - add r2, sp, #672 + add r2, sp, #640 vld1.8 {d20-d21}, [r2, : 128] vmlal.s32 q5, d30, d29 - add r2, sp, #608 + add r2, sp, #576 vld1.8 {d24-d25}, [r2, : 128] vmlal.s32 q1, d30, d28 vadd.i64 q13, q0, q11 @@ -823,22 +806,19 @@ vadd.i32 q5, q5, q0 vtrn.32 q11, q14 vadd.i32 q6, q6, q3 - add r2, sp, #560 + add r2, sp, #528 vadd.i32 q10, q10, q2 vtrn.32 d24, d25 - vst1.8 {d12-d13}, [r2, : 128] + vst1.8 {d12-d13}, [r2, : 128]! vshl.i32 q6, q13, #1 - add r2, sp, #576 - vst1.8 {d20-d21}, [r2, : 128] + vst1.8 {d20-d21}, [r2, : 128]! vshl.i32 q10, q14, #1 - add r2, sp, #592 - vst1.8 {d12-d13}, [r2, : 128] + vst1.8 {d12-d13}, [r2, : 128]! vshl.i32 q15, q12, #1 vadd.i32 q8, q8, q4 vext.32 d10, d31, d30, #0 vadd.i32 q7, q7, q1 - add r2, sp, #608 - vst1.8 {d16-d17}, [r2, : 128] + vst1.8 {d16-d17}, [r2, : 128]! 
vmull.s32 q8, d18, d5 vmlal.s32 q8, d26, d4 vmlal.s32 q8, d19, d9 @@ -849,8 +829,7 @@ vmlal.s32 q8, d29, d1 vmlal.s32 q8, d24, d6 vmlal.s32 q8, d25, d0 - add r2, sp, #624 - vst1.8 {d14-d15}, [r2, : 128] + vst1.8 {d14-d15}, [r2, : 128]! vmull.s32 q2, d18, d4 vmlal.s32 q2, d12, d9 vmlal.s32 q2, d13, d8 @@ -858,8 +837,7 @@ vmlal.s32 q2, d22, d2 vmlal.s32 q2, d23, d1 vmlal.s32 q2, d24, d0 - add r2, sp, #640 - vst1.8 {d20-d21}, [r2, : 128] + vst1.8 {d20-d21}, [r2, : 128]! vmull.s32 q7, d18, d9 vmlal.s32 q7, d26, d3 vmlal.s32 q7, d19, d8 @@ -868,15 +846,13 @@ vmlal.s32 q7, d28, d1 vmlal.s32 q7, d23, d6 vmlal.s32 q7, d29, d0 - add r2, sp, #656 - vst1.8 {d10-d11}, [r2, : 128] + vst1.8 {d10-d11}, [r2, : 128]! vmull.s32 q5, d18, d3 vmlal.s32 q5, d19, d2 vmlal.s32 q5, d22, d1 vmlal.s32 q5, d23, d0 vmlal.s32 q5, d12, d8 - add r2, sp, #672 - vst1.8 {d16-d17}, [r2, : 128] + vst1.8 {d16-d17}, [r2, : 128]! vmull.s32 q4, d18, d8 vmlal.s32 q4, d26, d2 vmlal.s32 q4, d19, d7 @@ -887,7 +863,7 @@ vmlal.s32 q8, d26, d1 vmlal.s32 q8, d19, d6 vmlal.s32 q8, d27, d0 - add r2, sp, #576 + add r2, sp, #544 vld1.8 {d20-d21}, [r2, : 128] vmlal.s32 q7, d24, d21 vmlal.s32 q7, d25, d20 @@ -896,32 +872,30 @@ vmlal.s32 q8, d22, d21 vmlal.s32 q8, d28, d20 vmlal.s32 q5, d24, d20 - add r2, sp, #576 vst1.8 {d14-d15}, [r2, : 128] vmull.s32 q7, d18, d6 vmlal.s32 q7, d26, d0 - add r2, sp, #656 + add r2, sp, #624 vld1.8 {d30-d31}, [r2, : 128] vmlal.s32 q2, d30, d21 vmlal.s32 q7, d19, d21 vmlal.s32 q7, d27, d20 - add r2, sp, #624 + add r2, sp, #592 vld1.8 {d26-d27}, [r2, : 128] vmlal.s32 q4, d25, d27 vmlal.s32 q8, d29, d27 vmlal.s32 q8, d25, d26 vmlal.s32 q7, d28, d27 vmlal.s32 q7, d29, d26 - add r2, sp, #608 + add r2, sp, #576 vld1.8 {d28-d29}, [r2, : 128] vmlal.s32 q4, d24, d29 vmlal.s32 q8, d23, d29 vmlal.s32 q8, d24, d28 vmlal.s32 q7, d22, d29 vmlal.s32 q7, d23, d28 - add r2, sp, #608 vst1.8 {d8-d9}, [r2, : 128] - add r2, sp, #560 + add r2, sp, #528 vld1.8 {d8-d9}, [r2, : 128] vmlal.s32 q7, d24, d9 vmlal.s32 q7, d25, d31 @@ -942,36 +916,36 @@ vmlal.s32 q0, d23, d26 vmlal.s32 q0, d24, d31 vmlal.s32 q0, d19, d20 - add r2, sp, #640 + add r2, sp, #608 vld1.8 {d18-d19}, [r2, : 128] vmlal.s32 q2, d18, d7 - vmlal.s32 q2, d19, d6 vmlal.s32 q5, d18, d6 - vmlal.s32 q5, d19, d21 vmlal.s32 q1, d18, d21 - vmlal.s32 q1, d19, d29 vmlal.s32 q0, d18, d28 - vmlal.s32 q0, d19, d9 vmlal.s32 q6, d18, d29 + vmlal.s32 q2, d19, d6 + vmlal.s32 q5, d19, d21 + vmlal.s32 q1, d19, d29 + vmlal.s32 q0, d19, d9 vmlal.s32 q6, d19, d28 - add r2, sp, #592 + add r2, sp, #560 vld1.8 {d18-d19}, [r2, : 128] - add r2, sp, #512 + add r2, sp, #480 vld1.8 {d22-d23}, [r2, : 128] vmlal.s32 q5, d19, d7 vmlal.s32 q0, d18, d21 vmlal.s32 q0, d19, d29 vmlal.s32 q6, d18, d6 - add r2, sp, #528 + add r2, sp, #496 vld1.8 {d6-d7}, [r2, : 128] vmlal.s32 q6, d19, d21 - add r2, sp, #576 + add r2, sp, #544 vld1.8 {d18-d19}, [r2, : 128] vmlal.s32 q0, d30, d8 - add r2, sp, #672 + add r2, sp, #640 vld1.8 {d20-d21}, [r2, : 128] vmlal.s32 q5, d30, d29 - add r2, sp, #608 + add r2, sp, #576 vld1.8 {d24-d25}, [r2, : 128] vmlal.s32 q1, d30, d28 vadd.i64 q13, q0, q11 @@ -1069,7 +1043,7 @@ sub r4, r4, #24 vst1.8 d0, [r2, : 64] vst1.8 d1, [r4, : 64] - add r2, sp, #544 + add r2, sp, #512 add r4, r3, #144 add r5, r3, #192 vld1.8 {d0-d1}, [r2, : 128] @@ -1139,14 +1113,13 @@ vmlal.s32 q0, d12, d8 vmlal.s32 q0, d13, d17 vmlal.s32 q0, d6, d6 - add r2, sp, #512 - vld1.8 {d18-d19}, [r2, : 128] + add r2, sp, #480 + vld1.8 {d18-d19}, [r2, : 128]! 
vmull.s32 q3, d16, d7 vmlal.s32 q3, d10, d15 vmlal.s32 q3, d11, d14 vmlal.s32 q3, d12, d9 vmlal.s32 q3, d13, d8 - add r2, sp, #528 vld1.8 {d8-d9}, [r2, : 128] vadd.i64 q5, q12, q9 vadd.i64 q6, q15, q9 @@ -1295,22 +1268,19 @@ vadd.i32 q5, q5, q0 vtrn.32 q11, q14 vadd.i32 q6, q6, q3 - add r2, sp, #560 + add r2, sp, #528 vadd.i32 q10, q10, q2 vtrn.32 d24, d25 - vst1.8 {d12-d13}, [r2, : 128] + vst1.8 {d12-d13}, [r2, : 128]! vshl.i32 q6, q13, #1 - add r2, sp, #576 - vst1.8 {d20-d21}, [r2, : 128] + vst1.8 {d20-d21}, [r2, : 128]! vshl.i32 q10, q14, #1 - add r2, sp, #592 - vst1.8 {d12-d13}, [r2, : 128] + vst1.8 {d12-d13}, [r2, : 128]! vshl.i32 q15, q12, #1 vadd.i32 q8, q8, q4 vext.32 d10, d31, d30, #0 vadd.i32 q7, q7, q1 - add r2, sp, #608 - vst1.8 {d16-d17}, [r2, : 128] + vst1.8 {d16-d17}, [r2, : 128]! vmull.s32 q8, d18, d5 vmlal.s32 q8, d26, d4 vmlal.s32 q8, d19, d9 @@ -1321,8 +1291,7 @@ vmlal.s32 q8, d29, d1 vmlal.s32 q8, d24, d6 vmlal.s32 q8, d25, d0 - add r2, sp, #624 - vst1.8 {d14-d15}, [r2, : 128] + vst1.8 {d14-d15}, [r2, : 128]! vmull.s32 q2, d18, d4 vmlal.s32 q2, d12, d9 vmlal.s32 q2, d13, d8 @@ -1330,8 +1299,7 @@ vmlal.s32 q2, d22, d2 vmlal.s32 q2, d23, d1 vmlal.s32 q2, d24, d0 - add r2, sp, #640 - vst1.8 {d20-d21}, [r2, : 128] + vst1.8 {d20-d21}, [r2, : 128]! vmull.s32 q7, d18, d9 vmlal.s32 q7, d26, d3 vmlal.s32 q7, d19, d8 @@ -1340,15 +1308,13 @@ vmlal.s32 q7, d28, d1 vmlal.s32 q7, d23, d6 vmlal.s32 q7, d29, d0 - add r2, sp, #656 - vst1.8 {d10-d11}, [r2, : 128] + vst1.8 {d10-d11}, [r2, : 128]! vmull.s32 q5, d18, d3 vmlal.s32 q5, d19, d2 vmlal.s32 q5, d22, d1 vmlal.s32 q5, d23, d0 vmlal.s32 q5, d12, d8 - add r2, sp, #672 - vst1.8 {d16-d17}, [r2, : 128] + vst1.8 {d16-d17}, [r2, : 128]! vmull.s32 q4, d18, d8 vmlal.s32 q4, d26, d2 vmlal.s32 q4, d19, d7 @@ -1359,7 +1325,7 @@ vmlal.s32 q8, d26, d1 vmlal.s32 q8, d19, d6 vmlal.s32 q8, d27, d0 - add r2, sp, #576 + add r2, sp, #544 vld1.8 {d20-d21}, [r2, : 128] vmlal.s32 q7, d24, d21 vmlal.s32 q7, d25, d20 @@ -1368,32 +1334,30 @@ vmlal.s32 q8, d22, d21 vmlal.s32 q8, d28, d20 vmlal.s32 q5, d24, d20 - add r2, sp, #576 vst1.8 {d14-d15}, [r2, : 128] vmull.s32 q7, d18, d6 vmlal.s32 q7, d26, d0 - add r2, sp, #656 + add r2, sp, #624 vld1.8 {d30-d31}, [r2, : 128] vmlal.s32 q2, d30, d21 vmlal.s32 q7, d19, d21 vmlal.s32 q7, d27, d20 - add r2, sp, #624 + add r2, sp, #592 vld1.8 {d26-d27}, [r2, : 128] vmlal.s32 q4, d25, d27 vmlal.s32 q8, d29, d27 vmlal.s32 q8, d25, d26 vmlal.s32 q7, d28, d27 vmlal.s32 q7, d29, d26 - add r2, sp, #608 + add r2, sp, #576 vld1.8 {d28-d29}, [r2, : 128] vmlal.s32 q4, d24, d29 vmlal.s32 q8, d23, d29 vmlal.s32 q8, d24, d28 vmlal.s32 q7, d22, d29 vmlal.s32 q7, d23, d28 - add r2, sp, #608 vst1.8 {d8-d9}, [r2, : 128] - add r2, sp, #560 + add r2, sp, #528 vld1.8 {d8-d9}, [r2, : 128] vmlal.s32 q7, d24, d9 vmlal.s32 q7, d25, d31 @@ -1414,36 +1378,36 @@ vmlal.s32 q0, d23, d26 vmlal.s32 q0, d24, d31 vmlal.s32 q0, d19, d20 - add r2, sp, #640 + add r2, sp, #608 vld1.8 {d18-d19}, [r2, : 128] vmlal.s32 q2, d18, d7 - vmlal.s32 q2, d19, d6 vmlal.s32 q5, d18, d6 - vmlal.s32 q5, d19, d21 vmlal.s32 q1, d18, d21 - vmlal.s32 q1, d19, d29 vmlal.s32 q0, d18, d28 - vmlal.s32 q0, d19, d9 vmlal.s32 q6, d18, d29 + vmlal.s32 q2, d19, d6 + vmlal.s32 q5, d19, d21 + vmlal.s32 q1, d19, d29 + vmlal.s32 q0, d19, d9 vmlal.s32 q6, d19, d28 - add r2, sp, #592 + add r2, sp, #560 vld1.8 {d18-d19}, [r2, : 128] - add r2, sp, #512 + add r2, sp, #480 vld1.8 {d22-d23}, [r2, : 128] vmlal.s32 q5, d19, d7 vmlal.s32 q0, d18, d21 vmlal.s32 q0, d19, d29 vmlal.s32 q6, d18, d6 - 
add r2, sp, #528 + add r2, sp, #496 vld1.8 {d6-d7}, [r2, : 128] vmlal.s32 q6, d19, d21 - add r2, sp, #576 + add r2, sp, #544 vld1.8 {d18-d19}, [r2, : 128] vmlal.s32 q0, d30, d8 - add r2, sp, #672 + add r2, sp, #640 vld1.8 {d20-d21}, [r2, : 128] vmlal.s32 q5, d30, d29 - add r2, sp, #608 + add r2, sp, #576 vld1.8 {d24-d25}, [r2, : 128] vmlal.s32 q1, d30, d28 vadd.i64 q13, q0, q11 @@ -1541,10 +1505,10 @@ sub r4, r4, #24 vst1.8 d0, [r2, : 64] vst1.8 d1, [r4, : 64] - ldr r2, [sp, #488] - ldr r4, [sp, #492] + ldr r2, [sp, #456] + ldr r4, [sp, #460] subs r5, r2, #1 - bge ._mainloop + bge .Lmainloop add r1, r3, #144 add r2, r3, #336 vld1.8 {d0-d1}, [r1, : 128]! @@ -1553,41 +1517,41 @@ vst1.8 {d0-d1}, [r2, : 128]! vst1.8 {d2-d3}, [r2, : 128]! vst1.8 d4, [r2, : 64] - ldr r1, =0 -._invertloop: + movw r1, #0 +.Linvertloop: add r2, r3, #144 - ldr r4, =0 - ldr r5, =2 + movw r4, #0 + movw r5, #2 cmp r1, #1 - ldreq r5, =1 + moveq r5, #1 addeq r2, r3, #336 addeq r4, r3, #48 cmp r1, #2 - ldreq r5, =1 + moveq r5, #1 addeq r2, r3, #48 cmp r1, #3 - ldreq r5, =5 + moveq r5, #5 addeq r4, r3, #336 cmp r1, #4 - ldreq r5, =10 + moveq r5, #10 cmp r1, #5 - ldreq r5, =20 + moveq r5, #20 cmp r1, #6 - ldreq r5, =10 + moveq r5, #10 addeq r2, r3, #336 addeq r4, r3, #336 cmp r1, #7 - ldreq r5, =50 + moveq r5, #50 cmp r1, #8 - ldreq r5, =100 + moveq r5, #100 cmp r1, #9 - ldreq r5, =50 + moveq r5, #50 addeq r2, r3, #336 cmp r1, #10 - ldreq r5, =5 + moveq r5, #5 addeq r2, r3, #48 cmp r1, #11 - ldreq r5, =0 + moveq r5, #0 addeq r2, r3, #96 add r6, r3, #144 add r7, r3, #288 @@ -1598,8 +1562,8 @@ vst1.8 {d2-d3}, [r7, : 128]! vst1.8 d4, [r7, : 64] cmp r5, #0 - beq ._skipsquaringloop -._squaringloop: + beq .Lskipsquaringloop +.Lsquaringloop: add r6, r3, #288 add r7, r3, #288 add r8, r3, #288 @@ -1611,7 +1575,7 @@ vld1.8 {d6-d7}, [r7, : 128]! vld1.8 {d9}, [r7, : 64] vld1.8 {d10-d11}, [r6, : 128]! - add r7, sp, #416 + add r7, sp, #384 vld1.8 {d12-d13}, [r6, : 128]! vmul.i32 q7, q2, q0 vld1.8 {d8}, [r6, : 64] @@ -1726,7 +1690,7 @@ vext.32 d10, d6, d6, #0 vmov.i32 q1, #0xffffffff vshl.i64 q4, q1, #25 - add r7, sp, #512 + add r7, sp, #480 vld1.8 {d14-d15}, [r7, : 128] vadd.i64 q9, q2, q7 vshl.i64 q1, q1, #26 @@ -1735,7 +1699,7 @@ vadd.i64 q5, q5, q10 vand q9, q9, q1 vld1.8 {d16}, [r6, : 64]! - add r6, sp, #528 + add r6, sp, #496 vld1.8 {d20-d21}, [r6, : 128] vadd.i64 q11, q5, q10 vsub.i64 q2, q2, q9 @@ -1789,8 +1753,8 @@ sub r6, r6, #32 vst1.8 d4, [r6, : 64] subs r5, r5, #1 - bhi ._squaringloop -._skipsquaringloop: + bhi .Lsquaringloop +.Lskipsquaringloop: mov r2, r2 add r5, r3, #288 add r6, r3, #144 @@ -1802,7 +1766,7 @@ vld1.8 {d6-d7}, [r5, : 128]! vld1.8 {d9}, [r5, : 64] vld1.8 {d10-d11}, [r2, : 128]! - add r5, sp, #416 + add r5, sp, #384 vld1.8 {d12-d13}, [r2, : 128]! vmul.i32 q7, q2, q0 vld1.8 {d8}, [r2, : 64] @@ -1917,7 +1881,7 @@ vext.32 d10, d6, d6, #0 vmov.i32 q1, #0xffffffff vshl.i64 q4, q1, #25 - add r5, sp, #512 + add r5, sp, #480 vld1.8 {d14-d15}, [r5, : 128] vadd.i64 q9, q2, q7 vshl.i64 q1, q1, #26 @@ -1926,7 +1890,7 @@ vadd.i64 q5, q5, q10 vand q9, q9, q1 vld1.8 {d16}, [r2, : 64]! - add r2, sp, #528 + add r2, sp, #496 vld1.8 {d20-d21}, [r2, : 128] vadd.i64 q11, q5, q10 vsub.i64 q2, q2, q9 @@ -1980,7 +1944,7 @@ sub r2, r2, #32 vst1.8 d4, [r2, : 64] cmp r4, #0 - beq ._skippostcopy + beq .Lskippostcopy add r2, r3, #144 mov r4, r4 vld1.8 {d0-d1}, [r2, : 128]! @@ -1989,9 +1953,9 @@ vst1.8 {d0-d1}, [r4, : 128]! vst1.8 {d2-d3}, [r4, : 128]! 
 vst1.8 d4, [r4, : 64]
-._skippostcopy:
+.Lskippostcopy:
 cmp r1, #1
- bne ._skipfinalcopy
+ bne .Lskipfinalcopy
 add r2, r3, #288
 add r4, r3, #144
 vld1.8 {d0-d1}, [r2, : 128]!
@@ -2000,10 +1964,10 @@ vst1.8 {d0-d1}, [r4, : 128]!
 vst1.8 {d2-d3}, [r4, : 128]!
 vst1.8 d4, [r4, : 64]
-._skipfinalcopy:
+.Lskipfinalcopy:
 add r1, r1, #1
 cmp r1, #12
- blo ._invertloop
+ blo .Linvertloop
 add r1, r3, #144
 ldr r2, [r1], #4
 ldr r3, [r1], #4
@@ -2085,21 +2049,16 @@ add r8, r8, r10, LSL #12
 mov r9, r10, LSR #20
 add r1, r9, r1, LSL #6
- str r2, [r0], #4
- str r3, [r0], #4
- str r4, [r0], #4
- str r5, [r0], #4
- str r6, [r0], #4
- str r7, [r0], #4
- str r8, [r0], #4
- str r1, [r0]
- ldrd r4, [sp, #0]
- ldrd r6, [sp, #8]
- ldrd r8, [sp, #16]
- ldrd r10, [sp, #24]
- ldr r12, [sp, #480]
- ldr r14, [sp, #484]
- ldr r0, =0
- mov sp, r12
- vpop {q4, q5, q6, q7}
- bx lr
+ str r2, [r0]
+ str r3, [r0, #4]
+ str r4, [r0, #8]
+ str r5, [r0, #12]
+ str r6, [r0, #16]
+ str r7, [r0, #20]
+ str r8, [r0, #24]
+ str r1, [r0, #28]
+ movw r0, #0
+ mov sp, ip
+ pop {r4-r11, pc}
+ENDPROC(curve25519_neon)
+#endif
diff --git a/lib/zinc/curve25519/curve25519.c b/lib/zinc/curve25519/curve25519.c
index 4f9c45ba126d..30dd5c93d130 100644
--- a/lib/zinc/curve25519/curve25519.c
+++ b/lib/zinc/curve25519/curve25519.c
@@ -22,6 +22,8 @@
 #if defined(CONFIG_ZINC_ARCH_X86_64)
 #include "curve25519-x86_64-glue.c"
+#elif defined(CONFIG_ZINC_ARCH_ARM)
+#include "curve25519-arm-glue.c"
 #else
 static bool *const curve25519_nobs[] __initconst = { };
 static void __init curve25519_fpu_init(void)

From patchwork Sat Oct 6 02:57:06 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Jason A. Donenfeld"
X-Patchwork-Id: 148319
Delivered-To: patch@linaro.org
From: "Jason A. Donenfeld"
To: linux-kernel@vger.kernel.org, netdev@vger.kernel.org, davem@davemloft.net, gregkh@linuxfoundation.org
Cc: "Jason A. Donenfeld", Samuel Neves, Andy Lutomirski, linux-crypto@vger.kernel.org
Subject: [PATCH net-next v7 25/28] crypto: port Poly1305 to Zinc
Date: Sat, 6 Oct 2018 04:57:06 +0200
Message-Id: <20181006025709.4019-26-Jason@zx2c4.com>
In-Reply-To: <20181006025709.4019-1-Jason@zx2c4.com>
References: <20181006025709.4019-1-Jason@zx2c4.com>

Now that Poly1305 is in Zinc, we can have the crypto API code simply
call into it. We have to do a little bit of bookkeeping here, because
the crypto API receives the key in the first few calls to update.
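The shape of that bookkeeping is roughly the following sketch. The struct
and function names here are hypothetical, not the actual crypto/poly1305_zinc.c
code below; it assumes the Zinc poly1305_init()/poly1305_update() and
simd_get()/simd_put() interfaces added earlier in this series. The idea is
simply to stage the leading 32 bytes of the update() stream as the one-time
key before passing the remainder to Zinc:

#include <linux/kernel.h>  /* min_t() */
#include <linux/simd.h>    /* simd_get()/simd_put(), added earlier in this series */
#include <linux/string.h>  /* memcpy() */
#include <zinc/poly1305.h> /* struct poly1305_ctx, POLY1305_KEY_SIZE */

/* Hypothetical shim state: Zinc's real context, plus a staging buffer
 * for the 32-byte one-time key that arrives as the leading bytes of
 * the update() stream. */
struct poly1305_shim_ctx {
	struct poly1305_ctx zinc_ctx;
	u8 key[POLY1305_KEY_SIZE];
	unsigned int key_filled;
};

static void poly1305_shim_update(struct poly1305_shim_ctx *ctx,
				 const u8 *src, unsigned int len)
{
	simd_context_t simd_context;

	/* Stage incoming bytes until the whole key has been seen. */
	if (ctx->key_filled < POLY1305_KEY_SIZE) {
		unsigned int n = min_t(unsigned int, len,
				       POLY1305_KEY_SIZE - ctx->key_filled);

		memcpy(ctx->key + ctx->key_filled, src, n);
		ctx->key_filled += n;
		src += n;
		len -= n;
		if (ctx->key_filled == POLY1305_KEY_SIZE)
			poly1305_init(&ctx->zinc_ctx, ctx->key);
	}
	/* Everything after the key is ordinary message data for Zinc. */
	if (len) {
		simd_get(&simd_context);
		poly1305_update(&ctx->zinc_ctx, src, len, &simd_context);
		simd_put(&simd_context);
	}
}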
Signed-off-by: Jason A. Donenfeld
Cc: Samuel Neves
Cc: Andy Lutomirski
Cc: Greg KH
Cc: linux-crypto@vger.kernel.org
---
 arch/x86/crypto/Makefile               |   3 -
 arch/x86/crypto/poly1305-avx2-x86_64.S | 388 ----------------
 arch/x86/crypto/poly1305-sse2-x86_64.S | 584 ------------------------
 arch/x86/crypto/poly1305_glue.c        | 205 ---------
 crypto/Kconfig                         |  15 +-
 crypto/Makefile                        |   2 +-
 crypto/chacha20poly1305.c              |  12 +-
 crypto/poly1305_generic.c              | 304 -------------
 crypto/poly1305_zinc.c                 |  98 +++++
 include/crypto/poly1305.h              |  40 --
 10 files changed, 107 insertions(+), 1544 deletions(-)
 delete mode 100644 arch/x86/crypto/poly1305-avx2-x86_64.S
 delete mode 100644 arch/x86/crypto/poly1305-sse2-x86_64.S
 delete mode 100644 arch/x86/crypto/poly1305_glue.c
 delete mode 100644 crypto/poly1305_generic.c
 create mode 100644 crypto/poly1305_zinc.c
 delete mode 100644 include/crypto/poly1305.h

-- 
2.19.0

diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile
index a450ad573dcb..cf830219846b 100644
--- a/arch/x86/crypto/Makefile
+++ b/arch/x86/crypto/Makefile
@@ -34,7 +34,6 @@ obj-$(CONFIG_CRYPTO_CRC32_PCLMUL) += crc32-pclmul.o
 obj-$(CONFIG_CRYPTO_SHA256_SSSE3) += sha256-ssse3.o
 obj-$(CONFIG_CRYPTO_SHA512_SSSE3) += sha512-ssse3.o
 obj-$(CONFIG_CRYPTO_CRCT10DIF_PCLMUL) += crct10dif-pclmul.o
-obj-$(CONFIG_CRYPTO_POLY1305_X86_64) += poly1305-x86_64.o
 obj-$(CONFIG_CRYPTO_AEGIS128_AESNI_SSE2) += aegis128-aesni.o
 obj-$(CONFIG_CRYPTO_AEGIS128L_AESNI_SSE2) += aegis128l-aesni.o
@@ -110,10 +109,8 @@ aesni-intel-y := aesni-intel_asm.o aesni-intel_glue.o fpu.o
 aesni-intel-$(CONFIG_64BIT) += aesni-intel_avx-x86_64.o aes_ctrby8_avx-x86_64.o
 ghash-clmulni-intel-y := ghash-clmulni-intel_asm.o ghash-clmulni-intel_glue.o
 sha1-ssse3-y := sha1_ssse3_asm.o sha1_ssse3_glue.o
-poly1305-x86_64-y := poly1305-sse2-x86_64.o poly1305_glue.o
 ifeq ($(avx2_supported),yes)
 sha1-ssse3-y += sha1_avx2_x86_64_asm.o
-poly1305-x86_64-y += poly1305-avx2-x86_64.o
 endif
 ifeq ($(sha1_ni_supported),yes)
 sha1-ssse3-y += sha1_ni_asm.o
diff --git a/arch/x86/crypto/poly1305-avx2-x86_64.S b/arch/x86/crypto/poly1305-avx2-x86_64.S
deleted file mode 100644
index 3b6e70d085da..000000000000
--- a/arch/x86/crypto/poly1305-avx2-x86_64.S
+++ /dev/null
@@ -1,388 +0,0 @@
-/*
- * Poly1305 authenticator algorithm, RFC7539, x64 AVX2 functions
- *
- * Copyright (C) 2015 Martin Willi
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License as published by
- * the Free Software Foundation; either version 2 of the License, or
- * (at your option) any later version.
- */ - -#include - -.section .rodata.cst32.ANMASK, "aM", @progbits, 32 -.align 32 -ANMASK: .octa 0x0000000003ffffff0000000003ffffff - .octa 0x0000000003ffffff0000000003ffffff - -.section .rodata.cst32.ORMASK, "aM", @progbits, 32 -.align 32 -ORMASK: .octa 0x00000000010000000000000001000000 - .octa 0x00000000010000000000000001000000 - -.text - -#define h0 0x00(%rdi) -#define h1 0x04(%rdi) -#define h2 0x08(%rdi) -#define h3 0x0c(%rdi) -#define h4 0x10(%rdi) -#define r0 0x00(%rdx) -#define r1 0x04(%rdx) -#define r2 0x08(%rdx) -#define r3 0x0c(%rdx) -#define r4 0x10(%rdx) -#define u0 0x00(%r8) -#define u1 0x04(%r8) -#define u2 0x08(%r8) -#define u3 0x0c(%r8) -#define u4 0x10(%r8) -#define w0 0x14(%r8) -#define w1 0x18(%r8) -#define w2 0x1c(%r8) -#define w3 0x20(%r8) -#define w4 0x24(%r8) -#define y0 0x28(%r8) -#define y1 0x2c(%r8) -#define y2 0x30(%r8) -#define y3 0x34(%r8) -#define y4 0x38(%r8) -#define m %rsi -#define hc0 %ymm0 -#define hc1 %ymm1 -#define hc2 %ymm2 -#define hc3 %ymm3 -#define hc4 %ymm4 -#define hc0x %xmm0 -#define hc1x %xmm1 -#define hc2x %xmm2 -#define hc3x %xmm3 -#define hc4x %xmm4 -#define t1 %ymm5 -#define t2 %ymm6 -#define t1x %xmm5 -#define t2x %xmm6 -#define ruwy0 %ymm7 -#define ruwy1 %ymm8 -#define ruwy2 %ymm9 -#define ruwy3 %ymm10 -#define ruwy4 %ymm11 -#define ruwy0x %xmm7 -#define ruwy1x %xmm8 -#define ruwy2x %xmm9 -#define ruwy3x %xmm10 -#define ruwy4x %xmm11 -#define svxz1 %ymm12 -#define svxz2 %ymm13 -#define svxz3 %ymm14 -#define svxz4 %ymm15 -#define d0 %r9 -#define d1 %r10 -#define d2 %r11 -#define d3 %r12 -#define d4 %r13 - -ENTRY(poly1305_4block_avx2) - # %rdi: Accumulator h[5] - # %rsi: 64 byte input block m - # %rdx: Poly1305 key r[5] - # %rcx: Quadblock count - # %r8: Poly1305 derived key r^2 u[5], r^3 w[5], r^4 y[5], - - # This four-block variant uses loop unrolled block processing. 
It - # requires 4 Poly1305 keys: r, r^2, r^3 and r^4: - # h = (h + m) * r => h = (h + m1) * r^4 + m2 * r^3 + m3 * r^2 + m4 * r - - vzeroupper - push %rbx - push %r12 - push %r13 - - # combine r0,u0,w0,y0 - vmovd y0,ruwy0x - vmovd w0,t1x - vpunpcklqdq t1,ruwy0,ruwy0 - vmovd u0,t1x - vmovd r0,t2x - vpunpcklqdq t2,t1,t1 - vperm2i128 $0x20,t1,ruwy0,ruwy0 - - # combine r1,u1,w1,y1 and s1=r1*5,v1=u1*5,x1=w1*5,z1=y1*5 - vmovd y1,ruwy1x - vmovd w1,t1x - vpunpcklqdq t1,ruwy1,ruwy1 - vmovd u1,t1x - vmovd r1,t2x - vpunpcklqdq t2,t1,t1 - vperm2i128 $0x20,t1,ruwy1,ruwy1 - vpslld $2,ruwy1,svxz1 - vpaddd ruwy1,svxz1,svxz1 - - # combine r2,u2,w2,y2 and s2=r2*5,v2=u2*5,x2=w2*5,z2=y2*5 - vmovd y2,ruwy2x - vmovd w2,t1x - vpunpcklqdq t1,ruwy2,ruwy2 - vmovd u2,t1x - vmovd r2,t2x - vpunpcklqdq t2,t1,t1 - vperm2i128 $0x20,t1,ruwy2,ruwy2 - vpslld $2,ruwy2,svxz2 - vpaddd ruwy2,svxz2,svxz2 - - # combine r3,u3,w3,y3 and s3=r3*5,v3=u3*5,x3=w3*5,z3=y3*5 - vmovd y3,ruwy3x - vmovd w3,t1x - vpunpcklqdq t1,ruwy3,ruwy3 - vmovd u3,t1x - vmovd r3,t2x - vpunpcklqdq t2,t1,t1 - vperm2i128 $0x20,t1,ruwy3,ruwy3 - vpslld $2,ruwy3,svxz3 - vpaddd ruwy3,svxz3,svxz3 - - # combine r4,u4,w4,y4 and s4=r4*5,v4=u4*5,x4=w4*5,z4=y4*5 - vmovd y4,ruwy4x - vmovd w4,t1x - vpunpcklqdq t1,ruwy4,ruwy4 - vmovd u4,t1x - vmovd r4,t2x - vpunpcklqdq t2,t1,t1 - vperm2i128 $0x20,t1,ruwy4,ruwy4 - vpslld $2,ruwy4,svxz4 - vpaddd ruwy4,svxz4,svxz4 - -.Ldoblock4: - # hc0 = [m[48-51] & 0x3ffffff, m[32-35] & 0x3ffffff, - # m[16-19] & 0x3ffffff, m[ 0- 3] & 0x3ffffff + h0] - vmovd 0x00(m),hc0x - vmovd 0x10(m),t1x - vpunpcklqdq t1,hc0,hc0 - vmovd 0x20(m),t1x - vmovd 0x30(m),t2x - vpunpcklqdq t2,t1,t1 - vperm2i128 $0x20,t1,hc0,hc0 - vpand ANMASK(%rip),hc0,hc0 - vmovd h0,t1x - vpaddd t1,hc0,hc0 - # hc1 = [(m[51-54] >> 2) & 0x3ffffff, (m[35-38] >> 2) & 0x3ffffff, - # (m[19-22] >> 2) & 0x3ffffff, (m[ 3- 6] >> 2) & 0x3ffffff + h1] - vmovd 0x03(m),hc1x - vmovd 0x13(m),t1x - vpunpcklqdq t1,hc1,hc1 - vmovd 0x23(m),t1x - vmovd 0x33(m),t2x - vpunpcklqdq t2,t1,t1 - vperm2i128 $0x20,t1,hc1,hc1 - vpsrld $2,hc1,hc1 - vpand ANMASK(%rip),hc1,hc1 - vmovd h1,t1x - vpaddd t1,hc1,hc1 - # hc2 = [(m[54-57] >> 4) & 0x3ffffff, (m[38-41] >> 4) & 0x3ffffff, - # (m[22-25] >> 4) & 0x3ffffff, (m[ 6- 9] >> 4) & 0x3ffffff + h2] - vmovd 0x06(m),hc2x - vmovd 0x16(m),t1x - vpunpcklqdq t1,hc2,hc2 - vmovd 0x26(m),t1x - vmovd 0x36(m),t2x - vpunpcklqdq t2,t1,t1 - vperm2i128 $0x20,t1,hc2,hc2 - vpsrld $4,hc2,hc2 - vpand ANMASK(%rip),hc2,hc2 - vmovd h2,t1x - vpaddd t1,hc2,hc2 - # hc3 = [(m[57-60] >> 6) & 0x3ffffff, (m[41-44] >> 6) & 0x3ffffff, - # (m[25-28] >> 6) & 0x3ffffff, (m[ 9-12] >> 6) & 0x3ffffff + h3] - vmovd 0x09(m),hc3x - vmovd 0x19(m),t1x - vpunpcklqdq t1,hc3,hc3 - vmovd 0x29(m),t1x - vmovd 0x39(m),t2x - vpunpcklqdq t2,t1,t1 - vperm2i128 $0x20,t1,hc3,hc3 - vpsrld $6,hc3,hc3 - vpand ANMASK(%rip),hc3,hc3 - vmovd h3,t1x - vpaddd t1,hc3,hc3 - # hc4 = [(m[60-63] >> 8) | (1<<24), (m[44-47] >> 8) | (1<<24), - # (m[28-31] >> 8) | (1<<24), (m[12-15] >> 8) | (1<<24) + h4] - vmovd 0x0c(m),hc4x - vmovd 0x1c(m),t1x - vpunpcklqdq t1,hc4,hc4 - vmovd 0x2c(m),t1x - vmovd 0x3c(m),t2x - vpunpcklqdq t2,t1,t1 - vperm2i128 $0x20,t1,hc4,hc4 - vpsrld $8,hc4,hc4 - vpor ORMASK(%rip),hc4,hc4 - vmovd h4,t1x - vpaddd t1,hc4,hc4 - - # t1 = [ hc0[3] * r0, hc0[2] * u0, hc0[1] * w0, hc0[0] * y0 ] - vpmuludq hc0,ruwy0,t1 - # t1 += [ hc1[3] * s4, hc1[2] * v4, hc1[1] * x4, hc1[0] * z4 ] - vpmuludq hc1,svxz4,t2 - vpaddq t2,t1,t1 - # t1 += [ hc2[3] * s3, hc2[2] * v3, hc2[1] * x3, hc2[0] * z3 ] - vpmuludq hc2,svxz3,t2 - vpaddq 
t2,t1,t1 - # t1 += [ hc3[3] * s2, hc3[2] * v2, hc3[1] * x2, hc3[0] * z2 ] - vpmuludq hc3,svxz2,t2 - vpaddq t2,t1,t1 - # t1 += [ hc4[3] * s1, hc4[2] * v1, hc4[1] * x1, hc4[0] * z1 ] - vpmuludq hc4,svxz1,t2 - vpaddq t2,t1,t1 - # d0 = t1[0] + t1[1] + t[2] + t[3] - vpermq $0xee,t1,t2 - vpaddq t2,t1,t1 - vpsrldq $8,t1,t2 - vpaddq t2,t1,t1 - vmovq t1x,d0 - - # t1 = [ hc0[3] * r1, hc0[2] * u1,hc0[1] * w1, hc0[0] * y1 ] - vpmuludq hc0,ruwy1,t1 - # t1 += [ hc1[3] * r0, hc1[2] * u0, hc1[1] * w0, hc1[0] * y0 ] - vpmuludq hc1,ruwy0,t2 - vpaddq t2,t1,t1 - # t1 += [ hc2[3] * s4, hc2[2] * v4, hc2[1] * x4, hc2[0] * z4 ] - vpmuludq hc2,svxz4,t2 - vpaddq t2,t1,t1 - # t1 += [ hc3[3] * s3, hc3[2] * v3, hc3[1] * x3, hc3[0] * z3 ] - vpmuludq hc3,svxz3,t2 - vpaddq t2,t1,t1 - # t1 += [ hc4[3] * s2, hc4[2] * v2, hc4[1] * x2, hc4[0] * z2 ] - vpmuludq hc4,svxz2,t2 - vpaddq t2,t1,t1 - # d1 = t1[0] + t1[1] + t1[3] + t1[4] - vpermq $0xee,t1,t2 - vpaddq t2,t1,t1 - vpsrldq $8,t1,t2 - vpaddq t2,t1,t1 - vmovq t1x,d1 - - # t1 = [ hc0[3] * r2, hc0[2] * u2, hc0[1] * w2, hc0[0] * y2 ] - vpmuludq hc0,ruwy2,t1 - # t1 += [ hc1[3] * r1, hc1[2] * u1, hc1[1] * w1, hc1[0] * y1 ] - vpmuludq hc1,ruwy1,t2 - vpaddq t2,t1,t1 - # t1 += [ hc2[3] * r0, hc2[2] * u0, hc2[1] * w0, hc2[0] * y0 ] - vpmuludq hc2,ruwy0,t2 - vpaddq t2,t1,t1 - # t1 += [ hc3[3] * s4, hc3[2] * v4, hc3[1] * x4, hc3[0] * z4 ] - vpmuludq hc3,svxz4,t2 - vpaddq t2,t1,t1 - # t1 += [ hc4[3] * s3, hc4[2] * v3, hc4[1] * x3, hc4[0] * z3 ] - vpmuludq hc4,svxz3,t2 - vpaddq t2,t1,t1 - # d2 = t1[0] + t1[1] + t1[2] + t1[3] - vpermq $0xee,t1,t2 - vpaddq t2,t1,t1 - vpsrldq $8,t1,t2 - vpaddq t2,t1,t1 - vmovq t1x,d2 - - # t1 = [ hc0[3] * r3, hc0[2] * u3, hc0[1] * w3, hc0[0] * y3 ] - vpmuludq hc0,ruwy3,t1 - # t1 += [ hc1[3] * r2, hc1[2] * u2, hc1[1] * w2, hc1[0] * y2 ] - vpmuludq hc1,ruwy2,t2 - vpaddq t2,t1,t1 - # t1 += [ hc2[3] * r1, hc2[2] * u1, hc2[1] * w1, hc2[0] * y1 ] - vpmuludq hc2,ruwy1,t2 - vpaddq t2,t1,t1 - # t1 += [ hc3[3] * r0, hc3[2] * u0, hc3[1] * w0, hc3[0] * y0 ] - vpmuludq hc3,ruwy0,t2 - vpaddq t2,t1,t1 - # t1 += [ hc4[3] * s4, hc4[2] * v4, hc4[1] * x4, hc4[0] * z4 ] - vpmuludq hc4,svxz4,t2 - vpaddq t2,t1,t1 - # d3 = t1[0] + t1[1] + t1[2] + t1[3] - vpermq $0xee,t1,t2 - vpaddq t2,t1,t1 - vpsrldq $8,t1,t2 - vpaddq t2,t1,t1 - vmovq t1x,d3 - - # t1 = [ hc0[3] * r4, hc0[2] * u4, hc0[1] * w4, hc0[0] * y4 ] - vpmuludq hc0,ruwy4,t1 - # t1 += [ hc1[3] * r3, hc1[2] * u3, hc1[1] * w3, hc1[0] * y3 ] - vpmuludq hc1,ruwy3,t2 - vpaddq t2,t1,t1 - # t1 += [ hc2[3] * r2, hc2[2] * u2, hc2[1] * w2, hc2[0] * y2 ] - vpmuludq hc2,ruwy2,t2 - vpaddq t2,t1,t1 - # t1 += [ hc3[3] * r1, hc3[2] * u1, hc3[1] * w1, hc3[0] * y1 ] - vpmuludq hc3,ruwy1,t2 - vpaddq t2,t1,t1 - # t1 += [ hc4[3] * r0, hc4[2] * u0, hc4[1] * w0, hc4[0] * y0 ] - vpmuludq hc4,ruwy0,t2 - vpaddq t2,t1,t1 - # d4 = t1[0] + t1[1] + t1[2] + t1[3] - vpermq $0xee,t1,t2 - vpaddq t2,t1,t1 - vpsrldq $8,t1,t2 - vpaddq t2,t1,t1 - vmovq t1x,d4 - - # d1 += d0 >> 26 - mov d0,%rax - shr $26,%rax - add %rax,d1 - # h0 = d0 & 0x3ffffff - mov d0,%rbx - and $0x3ffffff,%ebx - - # d2 += d1 >> 26 - mov d1,%rax - shr $26,%rax - add %rax,d2 - # h1 = d1 & 0x3ffffff - mov d1,%rax - and $0x3ffffff,%eax - mov %eax,h1 - - # d3 += d2 >> 26 - mov d2,%rax - shr $26,%rax - add %rax,d3 - # h2 = d2 & 0x3ffffff - mov d2,%rax - and $0x3ffffff,%eax - mov %eax,h2 - - # d4 += d3 >> 26 - mov d3,%rax - shr $26,%rax - add %rax,d4 - # h3 = d3 & 0x3ffffff - mov d3,%rax - and $0x3ffffff,%eax - mov %eax,h3 - - # h0 += (d4 >> 26) * 5 - mov d4,%rax - shr $26,%rax - lea 
(%eax,%eax,4),%eax - add %eax,%ebx - # h4 = d4 & 0x3ffffff - mov d4,%rax - and $0x3ffffff,%eax - mov %eax,h4 - - # h1 += h0 >> 26 - mov %ebx,%eax - shr $26,%eax - add %eax,h1 - # h0 = h0 & 0x3ffffff - andl $0x3ffffff,%ebx - mov %ebx,h0 - - add $0x40,m - dec %rcx - jnz .Ldoblock4 - - vzeroupper - pop %r13 - pop %r12 - pop %rbx - ret -ENDPROC(poly1305_4block_avx2) diff --git a/arch/x86/crypto/poly1305-sse2-x86_64.S b/arch/x86/crypto/poly1305-sse2-x86_64.S deleted file mode 100644 index c88c670cb5fc..000000000000 --- a/arch/x86/crypto/poly1305-sse2-x86_64.S +++ /dev/null @@ -1,584 +0,0 @@ -/* - * Poly1305 authenticator algorithm, RFC7539, x64 SSE2 functions - * - * Copyright (C) 2015 Martin Willi - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - */ - -#include - -.section .rodata.cst16.ANMASK, "aM", @progbits, 16 -.align 16 -ANMASK: .octa 0x0000000003ffffff0000000003ffffff - -.section .rodata.cst16.ORMASK, "aM", @progbits, 16 -.align 16 -ORMASK: .octa 0x00000000010000000000000001000000 - -.text - -#define h0 0x00(%rdi) -#define h1 0x04(%rdi) -#define h2 0x08(%rdi) -#define h3 0x0c(%rdi) -#define h4 0x10(%rdi) -#define r0 0x00(%rdx) -#define r1 0x04(%rdx) -#define r2 0x08(%rdx) -#define r3 0x0c(%rdx) -#define r4 0x10(%rdx) -#define s1 0x00(%rsp) -#define s2 0x04(%rsp) -#define s3 0x08(%rsp) -#define s4 0x0c(%rsp) -#define m %rsi -#define h01 %xmm0 -#define h23 %xmm1 -#define h44 %xmm2 -#define t1 %xmm3 -#define t2 %xmm4 -#define t3 %xmm5 -#define t4 %xmm6 -#define mask %xmm7 -#define d0 %r8 -#define d1 %r9 -#define d2 %r10 -#define d3 %r11 -#define d4 %r12 - -ENTRY(poly1305_block_sse2) - # %rdi: Accumulator h[5] - # %rsi: 16 byte input block m - # %rdx: Poly1305 key r[5] - # %rcx: Block count - - # This single block variant tries to improve performance by doing two - # multiplications in parallel using SSE instructions. There is quite - # some quadword packing involved, hence the speedup is marginal.
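For readers untangling the quadword packing that follows: the scalar arithmetic these SSE2 pairs implement is the same five-limb, radix-2^26 multiply and partial reduction used by the generic C implementation removed later in this patch. A rough standalone sketch (plain C; the function name is illustrative, and the message limbs are assumed to have already been added into h):

#include <stdint.h>

/* One h *= r step of Poly1305 in radix 2^26, mod p = 2^130 - 5.
 * s[i] = r[i] * 5 folds limbs that overflow 2^130 back in, since
 * 2^130 = 5 (mod p). The SSE2 code computes the pairs (d0,d1) and
 * (d2,d3) each in a single xmm register; that is the quadword packing. */
static void poly1305_mul_ref(uint32_t h[5], const uint32_t r[5])
{
        uint32_t s1 = r[1] * 5, s2 = r[2] * 5, s3 = r[3] * 5, s4 = r[4] * 5;
        uint64_t d0, d1, d2, d3, d4;

        d0 = (uint64_t)h[0] * r[0] + (uint64_t)h[1] * s4 + (uint64_t)h[2] * s3 +
             (uint64_t)h[3] * s2 + (uint64_t)h[4] * s1;
        d1 = (uint64_t)h[0] * r[1] + (uint64_t)h[1] * r[0] + (uint64_t)h[2] * s4 +
             (uint64_t)h[3] * s3 + (uint64_t)h[4] * s2;
        d2 = (uint64_t)h[0] * r[2] + (uint64_t)h[1] * r[1] + (uint64_t)h[2] * r[0] +
             (uint64_t)h[3] * s4 + (uint64_t)h[4] * s3;
        d3 = (uint64_t)h[0] * r[3] + (uint64_t)h[1] * r[2] + (uint64_t)h[2] * r[1] +
             (uint64_t)h[3] * r[0] + (uint64_t)h[4] * s4;
        d4 = (uint64_t)h[0] * r[4] + (uint64_t)h[1] * r[3] + (uint64_t)h[2] * r[2] +
             (uint64_t)h[3] * r[1] + (uint64_t)h[4] * r[0];

        /* Partial reduction: carry each sum into the next limb and fold
         * the top carry back into h0 multiplied by 5. */
        d1 += d0 >> 26; h[0] = (uint32_t)d0 & 0x3ffffff;
        d2 += d1 >> 26; h[1] = (uint32_t)d1 & 0x3ffffff;
        d3 += d2 >> 26; h[2] = (uint32_t)d2 & 0x3ffffff;
        d4 += d3 >> 26; h[3] = (uint32_t)d3 & 0x3ffffff;
        h[0] += (uint32_t)(d4 >> 26) * 5; h[4] = (uint32_t)d4 & 0x3ffffff;
        h[1] += h[0] >> 26; h[0] &= 0x3ffffff;
}

This is exactly the carry chain spelled out instruction by instruction at the end of each block routine in this file.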
- - push %rbx - push %r12 - sub $0x10,%rsp - - # s1..s4 = r1..r4 * 5 - mov r1,%eax - lea (%eax,%eax,4),%eax - mov %eax,s1 - mov r2,%eax - lea (%eax,%eax,4),%eax - mov %eax,s2 - mov r3,%eax - lea (%eax,%eax,4),%eax - mov %eax,s3 - mov r4,%eax - lea (%eax,%eax,4),%eax - mov %eax,s4 - - movdqa ANMASK(%rip),mask - -.Ldoblock: - # h01 = [0, h1, 0, h0] - # h23 = [0, h3, 0, h2] - # h44 = [0, h4, 0, h4] - movd h0,h01 - movd h1,t1 - movd h2,h23 - movd h3,t2 - movd h4,h44 - punpcklqdq t1,h01 - punpcklqdq t2,h23 - punpcklqdq h44,h44 - - # h01 += [ (m[3-6] >> 2) & 0x3ffffff, m[0-3] & 0x3ffffff ] - movd 0x00(m),t1 - movd 0x03(m),t2 - psrld $2,t2 - punpcklqdq t2,t1 - pand mask,t1 - paddd t1,h01 - # h23 += [ (m[9-12] >> 6) & 0x3ffffff, (m[6-9] >> 4) & 0x3ffffff ] - movd 0x06(m),t1 - movd 0x09(m),t2 - psrld $4,t1 - psrld $6,t2 - punpcklqdq t2,t1 - pand mask,t1 - paddd t1,h23 - # h44 += [ (m[12-15] >> 8) | (1 << 24), (m[12-15] >> 8) | (1 << 24) ] - mov 0x0c(m),%eax - shr $8,%eax - or $0x01000000,%eax - movd %eax,t1 - pshufd $0xc4,t1,t1 - paddd t1,h44 - - # t1[0] = h0 * r0 + h2 * s3 - # t1[1] = h1 * s4 + h3 * s2 - movd r0,t1 - movd s4,t2 - punpcklqdq t2,t1 - pmuludq h01,t1 - movd s3,t2 - movd s2,t3 - punpcklqdq t3,t2 - pmuludq h23,t2 - paddq t2,t1 - # t2[0] = h0 * r1 + h2 * s4 - # t2[1] = h1 * r0 + h3 * s3 - movd r1,t2 - movd r0,t3 - punpcklqdq t3,t2 - pmuludq h01,t2 - movd s4,t3 - movd s3,t4 - punpcklqdq t4,t3 - pmuludq h23,t3 - paddq t3,t2 - # t3[0] = h4 * s1 - # t3[1] = h4 * s2 - movd s1,t3 - movd s2,t4 - punpcklqdq t4,t3 - pmuludq h44,t3 - # d0 = t1[0] + t1[1] + t3[0] - # d1 = t2[0] + t2[1] + t3[1] - movdqa t1,t4 - punpcklqdq t2,t4 - punpckhqdq t2,t1 - paddq t4,t1 - paddq t3,t1 - movq t1,d0 - psrldq $8,t1 - movq t1,d1 - - # t1[0] = h0 * r2 + h2 * r0 - # t1[1] = h1 * r1 + h3 * s4 - movd r2,t1 - movd r1,t2 - punpcklqdq t2,t1 - pmuludq h01,t1 - movd r0,t2 - movd s4,t3 - punpcklqdq t3,t2 - pmuludq h23,t2 - paddq t2,t1 - # t2[0] = h0 * r3 + h2 * r1 - # t2[1] = h1 * r2 + h3 * r0 - movd r3,t2 - movd r2,t3 - punpcklqdq t3,t2 - pmuludq h01,t2 - movd r1,t3 - movd r0,t4 - punpcklqdq t4,t3 - pmuludq h23,t3 - paddq t3,t2 - # t3[0] = h4 * s3 - # t3[1] = h4 * s4 - movd s3,t3 - movd s4,t4 - punpcklqdq t4,t3 - pmuludq h44,t3 - # d2 = t1[0] + t1[1] + t3[0] - # d3 = t2[0] + t2[1] + t3[1] - movdqa t1,t4 - punpcklqdq t2,t4 - punpckhqdq t2,t1 - paddq t4,t1 - paddq t3,t1 - movq t1,d2 - psrldq $8,t1 - movq t1,d3 - - # t1[0] = h0 * r4 + h2 * r2 - # t1[1] = h1 * r3 + h3 * r1 - movd r4,t1 - movd r3,t2 - punpcklqdq t2,t1 - pmuludq h01,t1 - movd r2,t2 - movd r1,t3 - punpcklqdq t3,t2 - pmuludq h23,t2 - paddq t2,t1 - # t3[0] = h4 * r0 - movd r0,t3 - pmuludq h44,t3 - # d4 = t1[0] + t1[1] + t3[0] - movdqa t1,t4 - psrldq $8,t4 - paddq t4,t1 - paddq t3,t1 - movq t1,d4 - - # d1 += d0 >> 26 - mov d0,%rax - shr $26,%rax - add %rax,d1 - # h0 = d0 & 0x3ffffff - mov d0,%rbx - and $0x3ffffff,%ebx - - # d2 += d1 >> 26 - mov d1,%rax - shr $26,%rax - add %rax,d2 - # h1 = d1 & 0x3ffffff - mov d1,%rax - and $0x3ffffff,%eax - mov %eax,h1 - - # d3 += d2 >> 26 - mov d2,%rax - shr $26,%rax - add %rax,d3 - # h2 = d2 & 0x3ffffff - mov d2,%rax - and $0x3ffffff,%eax - mov %eax,h2 - - # d4 += d3 >> 26 - mov d3,%rax - shr $26,%rax - add %rax,d4 - # h3 = d3 & 0x3ffffff - mov d3,%rax - and $0x3ffffff,%eax - mov %eax,h3 - - # h0 += (d4 >> 26) * 5 - mov d4,%rax - shr $26,%rax - lea (%eax,%eax,4),%eax - add %eax,%ebx - # h4 = d4 & 0x3ffffff - mov d4,%rax - and $0x3ffffff,%eax - mov %eax,h4 - - # h1 += h0 >> 26 - mov %ebx,%eax - shr $26,%eax - add %eax,h1 - # h0 = 
h0 & 0x3ffffff - andl $0x3ffffff,%ebx - mov %ebx,h0 - - add $0x10,m - dec %rcx - jnz .Ldoblock - - add $0x10,%rsp - pop %r12 - pop %rbx - ret -ENDPROC(poly1305_block_sse2) - - -#define u0 0x00(%r8) -#define u1 0x04(%r8) -#define u2 0x08(%r8) -#define u3 0x0c(%r8) -#define u4 0x10(%r8) -#define hc0 %xmm0 -#define hc1 %xmm1 -#define hc2 %xmm2 -#define hc3 %xmm5 -#define hc4 %xmm6 -#define ru0 %xmm7 -#define ru1 %xmm8 -#define ru2 %xmm9 -#define ru3 %xmm10 -#define ru4 %xmm11 -#define sv1 %xmm12 -#define sv2 %xmm13 -#define sv3 %xmm14 -#define sv4 %xmm15 -#undef d0 -#define d0 %r13 - -ENTRY(poly1305_2block_sse2) - # %rdi: Accumulator h[5] - # %rsi: 16 byte input block m - # %rdx: Poly1305 key r[5] - # %rcx: Doubleblock count - # %r8: Poly1305 derived key r^2 u[5] - - # This two-block variant further improves performance by using loop - # unrolled block processing. This is more straightforward and does - # less byte shuffling, but requires a second Poly1305 key r^2: - # h = (h + m) * r => h = (h + m1) * r^2 + m2 * r - - push %rbx - push %r12 - push %r13 - - # combine r0,u0 - movd u0,ru0 - movd r0,t1 - punpcklqdq t1,ru0 - - # combine r1,u1 and s1=r1*5,v1=u1*5 - movd u1,ru1 - movd r1,t1 - punpcklqdq t1,ru1 - movdqa ru1,sv1 - pslld $2,sv1 - paddd ru1,sv1 - - # combine r2,u2 and s2=r2*5,v2=u2*5 - movd u2,ru2 - movd r2,t1 - punpcklqdq t1,ru2 - movdqa ru2,sv2 - pslld $2,sv2 - paddd ru2,sv2 - - # combine r3,u3 and s3=r3*5,v3=u3*5 - movd u3,ru3 - movd r3,t1 - punpcklqdq t1,ru3 - movdqa ru3,sv3 - pslld $2,sv3 - paddd ru3,sv3 - - # combine r4,u4 and s4=r4*5,v4=u4*5 - movd u4,ru4 - movd r4,t1 - punpcklqdq t1,ru4 - movdqa ru4,sv4 - pslld $2,sv4 - paddd ru4,sv4 - -.Ldoblock2: - # hc0 = [ m[16-19] & 0x3ffffff, h0 + m[0-3] & 0x3ffffff ] - movd 0x00(m),hc0 - movd 0x10(m),t1 - punpcklqdq t1,hc0 - pand ANMASK(%rip),hc0 - movd h0,t1 - paddd t1,hc0 - # hc1 = [ (m[19-22] >> 2) & 0x3ffffff, h1 + (m[3-6] >> 2) & 0x3ffffff ] - movd 0x03(m),hc1 - movd 0x13(m),t1 - punpcklqdq t1,hc1 - psrld $2,hc1 - pand ANMASK(%rip),hc1 - movd h1,t1 - paddd t1,hc1 - # hc2 = [ (m[22-25] >> 4) & 0x3ffffff, h2 + (m[6-9] >> 4) & 0x3ffffff ] - movd 0x06(m),hc2 - movd 0x16(m),t1 - punpcklqdq t1,hc2 - psrld $4,hc2 - pand ANMASK(%rip),hc2 - movd h2,t1 - paddd t1,hc2 - # hc3 = [ (m[25-28] >> 6) & 0x3ffffff, h3 + (m[9-12] >> 6) & 0x3ffffff ] - movd 0x09(m),hc3 - movd 0x19(m),t1 - punpcklqdq t1,hc3 - psrld $6,hc3 - pand ANMASK(%rip),hc3 - movd h3,t1 - paddd t1,hc3 - # hc4 = [ (m[28-31] >> 8) | (1<<24), h4 + (m[12-15] >> 8) | (1<<24) ] - movd 0x0c(m),hc4 - movd 0x1c(m),t1 - punpcklqdq t1,hc4 - psrld $8,hc4 - por ORMASK(%rip),hc4 - movd h4,t1 - paddd t1,hc4 - - # t1 = [ hc0[1] * r0, hc0[0] * u0 ] - movdqa ru0,t1 - pmuludq hc0,t1 - # t1 += [ hc1[1] * s4, hc1[0] * v4 ] - movdqa sv4,t2 - pmuludq hc1,t2 - paddq t2,t1 - # t1 += [ hc2[1] * s3, hc2[0] * v3 ] - movdqa sv3,t2 - pmuludq hc2,t2 - paddq t2,t1 - # t1 += [ hc3[1] * s2, hc3[0] * v2 ] - movdqa sv2,t2 - pmuludq hc3,t2 - paddq t2,t1 - # t1 += [ hc4[1] * s1, hc4[0] * v1 ] - movdqa sv1,t2 - pmuludq hc4,t2 - paddq t2,t1 - # d0 = t1[0] + t1[1] - movdqa t1,t2 - psrldq $8,t2 - paddq t2,t1 - movq t1,d0 - - # t1 = [ hc0[1] * r1, hc0[0] * u1 ] - movdqa ru1,t1 - pmuludq hc0,t1 - # t1 += [ hc1[1] * r0, hc1[0] * u0 ] - movdqa ru0,t2 - pmuludq hc1,t2 - paddq t2,t1 - # t1 += [ hc2[1] * s4, hc2[0] * v4 ] - movdqa sv4,t2 - pmuludq hc2,t2 - paddq t2,t1 - # t1 += [ hc3[1] * s3, hc3[0] * v3 ] - movdqa sv3,t2 - pmuludq hc3,t2 - paddq t2,t1 - # t1 += [ hc4[1] * s2, hc4[0] * v2 ] - movdqa sv2,t2 - pmuludq hc4,t2 -
paddq t2,t1 - # d1 = t1[0] + t1[1] - movdqa t1,t2 - psrldq $8,t2 - paddq t2,t1 - movq t1,d1 - - # t1 = [ hc0[1] * r2, hc0[0] * u2 ] - movdqa ru2,t1 - pmuludq hc0,t1 - # t1 += [ hc1[1] * r1, hc1[0] * u1 ] - movdqa ru1,t2 - pmuludq hc1,t2 - paddq t2,t1 - # t1 += [ hc2[1] * r0, hc2[0] * u0 ] - movdqa ru0,t2 - pmuludq hc2,t2 - paddq t2,t1 - # t1 += [ hc3[1] * s4, hc3[0] * v4 ] - movdqa sv4,t2 - pmuludq hc3,t2 - paddq t2,t1 - # t1 += [ hc4[1] * s3, hc4[0] * v3 ] - movdqa sv3,t2 - pmuludq hc4,t2 - paddq t2,t1 - # d2 = t1[0] + t1[1] - movdqa t1,t2 - psrldq $8,t2 - paddq t2,t1 - movq t1,d2 - - # t1 = [ hc0[1] * r3, hc0[0] * u3 ] - movdqa ru3,t1 - pmuludq hc0,t1 - # t1 += [ hc1[1] * r2, hc1[0] * u2 ] - movdqa ru2,t2 - pmuludq hc1,t2 - paddq t2,t1 - # t1 += [ hc2[1] * r1, hc2[0] * u1 ] - movdqa ru1,t2 - pmuludq hc2,t2 - paddq t2,t1 - # t1 += [ hc3[1] * r0, hc3[0] * u0 ] - movdqa ru0,t2 - pmuludq hc3,t2 - paddq t2,t1 - # t1 += [ hc4[1] * s4, hc4[0] * v4 ] - movdqa sv4,t2 - pmuludq hc4,t2 - paddq t2,t1 - # d3 = t1[0] + t1[1] - movdqa t1,t2 - psrldq $8,t2 - paddq t2,t1 - movq t1,d3 - - # t1 = [ hc0[1] * r4, hc0[0] * u4 ] - movdqa ru4,t1 - pmuludq hc0,t1 - # t1 += [ hc1[1] * r3, hc1[0] * u3 ] - movdqa ru3,t2 - pmuludq hc1,t2 - paddq t2,t1 - # t1 += [ hc2[1] * r2, hc2[0] * u2 ] - movdqa ru2,t2 - pmuludq hc2,t2 - paddq t2,t1 - # t1 += [ hc3[1] * r1, hc3[0] * u1 ] - movdqa ru1,t2 - pmuludq hc3,t2 - paddq t2,t1 - # t1 += [ hc4[1] * r0, hc4[0] * u0 ] - movdqa ru0,t2 - pmuludq hc4,t2 - paddq t2,t1 - # d4 = t1[0] + t1[1] - movdqa t1,t2 - psrldq $8,t2 - paddq t2,t1 - movq t1,d4 - - # d1 += d0 >> 26 - mov d0,%rax - shr $26,%rax - add %rax,d1 - # h0 = d0 & 0x3ffffff - mov d0,%rbx - and $0x3ffffff,%ebx - - # d2 += d1 >> 26 - mov d1,%rax - shr $26,%rax - add %rax,d2 - # h1 = d1 & 0x3ffffff - mov d1,%rax - and $0x3ffffff,%eax - mov %eax,h1 - - # d3 += d2 >> 26 - mov d2,%rax - shr $26,%rax - add %rax,d3 - # h2 = d2 & 0x3ffffff - mov d2,%rax - and $0x3ffffff,%eax - mov %eax,h2 - - # d4 += d3 >> 26 - mov d3,%rax - shr $26,%rax - add %rax,d4 - # h3 = d3 & 0x3ffffff - mov d3,%rax - and $0x3ffffff,%eax - mov %eax,h3 - - # h0 += (d4 >> 26) * 5 - mov d4,%rax - shr $26,%rax - lea (%eax,%eax,4),%eax - add %eax,%ebx - # h4 = d4 & 0x3ffffff - mov d4,%rax - and $0x3ffffff,%eax - mov %eax,h4 - - # h1 += h0 >> 26 - mov %ebx,%eax - shr $26,%eax - add %eax,h1 - # h0 = h0 & 0x3ffffff - andl $0x3ffffff,%ebx - mov %ebx,h0 - - add $0x20,m - dec %rcx - jnz .Ldoblock2 - - pop %r13 - pop %r12 - pop %rbx - ret -ENDPROC(poly1305_2block_sse2) diff --git a/arch/x86/crypto/poly1305_glue.c b/arch/x86/crypto/poly1305_glue.c deleted file mode 100644 index f012b7e28ad1..000000000000 --- a/arch/x86/crypto/poly1305_glue.c +++ /dev/null @@ -1,205 +0,0 @@ -/* - * Poly1305 authenticator algorithm, RFC7539, SIMD glue code - * - * Copyright (C) 2015 Martin Willi - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - */ - -#include -#include -#include -#include -#include -#include -#include -#include - -struct poly1305_simd_desc_ctx { - struct poly1305_desc_ctx base; - /* derived key u set? */ - bool uset; -#ifdef CONFIG_AS_AVX2 - /* derived keys r^3, r^4 set? */ - bool wset; -#endif - /* derived Poly1305 key r^2 */ - u32 u[5]; - /* ... 
silently appended r^3 and r^4 when using AVX2 */ -}; - -asmlinkage void poly1305_block_sse2(u32 *h, const u8 *src, - const u32 *r, unsigned int blocks); -asmlinkage void poly1305_2block_sse2(u32 *h, const u8 *src, const u32 *r, - unsigned int blocks, const u32 *u); -#ifdef CONFIG_AS_AVX2 -asmlinkage void poly1305_4block_avx2(u32 *h, const u8 *src, const u32 *r, - unsigned int blocks, const u32 *u); -static bool poly1305_use_avx2; -#endif - -static int poly1305_simd_init(struct shash_desc *desc) -{ - struct poly1305_simd_desc_ctx *sctx = shash_desc_ctx(desc); - - sctx->uset = false; -#ifdef CONFIG_AS_AVX2 - sctx->wset = false; -#endif - - return crypto_poly1305_init(desc); -} - -static void poly1305_simd_mult(u32 *a, const u32 *b) -{ - u8 m[POLY1305_BLOCK_SIZE]; - - memset(m, 0, sizeof(m)); - /* The poly1305 block function adds a hi-bit to the accumulator which - * we don't need for key multiplication; compensate for it. */ - a[4] -= 1 << 24; - poly1305_block_sse2(a, m, b, 1); -} - -static unsigned int poly1305_simd_blocks(struct poly1305_desc_ctx *dctx, - const u8 *src, unsigned int srclen) -{ - struct poly1305_simd_desc_ctx *sctx; - unsigned int blocks, datalen; - - BUILD_BUG_ON(offsetof(struct poly1305_simd_desc_ctx, base)); - sctx = container_of(dctx, struct poly1305_simd_desc_ctx, base); - - if (unlikely(!dctx->sset)) { - datalen = crypto_poly1305_setdesckey(dctx, src, srclen); - src += srclen - datalen; - srclen = datalen; - } - -#ifdef CONFIG_AS_AVX2 - if (poly1305_use_avx2 && srclen >= POLY1305_BLOCK_SIZE * 4) { - if (unlikely(!sctx->wset)) { - if (!sctx->uset) { - memcpy(sctx->u, dctx->r, sizeof(sctx->u)); - poly1305_simd_mult(sctx->u, dctx->r); - sctx->uset = true; - } - memcpy(sctx->u + 5, sctx->u, sizeof(sctx->u)); - poly1305_simd_mult(sctx->u + 5, dctx->r); - memcpy(sctx->u + 10, sctx->u + 5, sizeof(sctx->u)); - poly1305_simd_mult(sctx->u + 10, dctx->r); - sctx->wset = true; - } - blocks = srclen / (POLY1305_BLOCK_SIZE * 4); - poly1305_4block_avx2(dctx->h, src, dctx->r, blocks, sctx->u); - src += POLY1305_BLOCK_SIZE * 4 * blocks; - srclen -= POLY1305_BLOCK_SIZE * 4 * blocks; - } -#endif - if (likely(srclen >= POLY1305_BLOCK_SIZE * 2)) { - if (unlikely(!sctx->uset)) { - memcpy(sctx->u, dctx->r, sizeof(sctx->u)); - poly1305_simd_mult(sctx->u, dctx->r); - sctx->uset = true; - } - blocks = srclen / (POLY1305_BLOCK_SIZE * 2); - poly1305_2block_sse2(dctx->h, src, dctx->r, blocks, sctx->u); - src += POLY1305_BLOCK_SIZE * 2 * blocks; - srclen -= POLY1305_BLOCK_SIZE * 2 * blocks; - } - if (srclen >= POLY1305_BLOCK_SIZE) { - poly1305_block_sse2(dctx->h, src, dctx->r, 1); - srclen -= POLY1305_BLOCK_SIZE; - } - return srclen; -} - -static int poly1305_simd_update(struct shash_desc *desc, - const u8 *src, unsigned int srclen) -{ - struct poly1305_desc_ctx *dctx = shash_desc_ctx(desc); - unsigned int bytes; - - /* kernel_fpu_begin/end is costly, use fallback for small updates */ - if (srclen <= 288 || !may_use_simd()) - return crypto_poly1305_update(desc, src, srclen); - - kernel_fpu_begin(); - - if (unlikely(dctx->buflen)) { - bytes = min(srclen, POLY1305_BLOCK_SIZE - dctx->buflen); - memcpy(dctx->buf + dctx->buflen, src, bytes); - src += bytes; - srclen -= bytes; - dctx->buflen += bytes; - - if (dctx->buflen == POLY1305_BLOCK_SIZE) { - poly1305_simd_blocks(dctx, dctx->buf, - POLY1305_BLOCK_SIZE); - dctx->buflen = 0; - } - } - - if (likely(srclen >= POLY1305_BLOCK_SIZE)) { - bytes = poly1305_simd_blocks(dctx, src, srclen); - src += srclen - bytes; - srclen = bytes; - } - - kernel_fpu_end(); 
- - if (unlikely(srclen)) { - dctx->buflen = srclen; - memcpy(dctx->buf, src, srclen); - } - - return 0; -} - -static struct shash_alg alg = { - .digestsize = POLY1305_DIGEST_SIZE, - .init = poly1305_simd_init, - .update = poly1305_simd_update, - .final = crypto_poly1305_final, - .descsize = sizeof(struct poly1305_simd_desc_ctx), - .base = { - .cra_name = "poly1305", - .cra_driver_name = "poly1305-simd", - .cra_priority = 300, - .cra_blocksize = POLY1305_BLOCK_SIZE, - .cra_module = THIS_MODULE, - }, -}; - -static int __init poly1305_simd_mod_init(void) -{ - if (!boot_cpu_has(X86_FEATURE_XMM2)) - return -ENODEV; - -#ifdef CONFIG_AS_AVX2 - poly1305_use_avx2 = boot_cpu_has(X86_FEATURE_AVX) && - boot_cpu_has(X86_FEATURE_AVX2) && - cpu_has_xfeatures(XFEATURE_MASK_SSE | XFEATURE_MASK_YMM, NULL); - alg.descsize = sizeof(struct poly1305_simd_desc_ctx); - if (poly1305_use_avx2) - alg.descsize += 10 * sizeof(u32); -#endif - return crypto_register_shash(&alg); -} - -static void __exit poly1305_simd_mod_exit(void) -{ - crypto_unregister_shash(&alg); -} - -module_init(poly1305_simd_mod_init); -module_exit(poly1305_simd_mod_exit); - -MODULE_LICENSE("GPL"); -MODULE_AUTHOR("Martin Willi "); -MODULE_DESCRIPTION("Poly1305 authenticator"); -MODULE_ALIAS_CRYPTO("poly1305"); -MODULE_ALIAS_CRYPTO("poly1305-simd"); diff --git a/crypto/Kconfig b/crypto/Kconfig index f3e40ac56d93..47859a0f8052 100644 --- a/crypto/Kconfig +++ b/crypto/Kconfig @@ -656,24 +656,13 @@ config CRYPTO_GHASH config CRYPTO_POLY1305 tristate "Poly1305 authenticator algorithm" select CRYPTO_HASH + select ZINC_POLY1305 help Poly1305 authenticator algorithm, RFC7539. Poly1305 is an authenticator algorithm designed by Daniel J. Bernstein. It is used for the ChaCha20-Poly1305 AEAD, specified in RFC7539 for use - in IETF protocols. This is the portable C implementation of Poly1305. - -config CRYPTO_POLY1305_X86_64 - tristate "Poly1305 authenticator algorithm (x86_64/SSE2/AVX2)" - depends on X86 && 64BIT - select CRYPTO_POLY1305 - help - Poly1305 authenticator algorithm, RFC7539. - - Poly1305 is an authenticator algorithm designed by Daniel J. Bernstein. - It is used for the ChaCha20-Poly1305 AEAD, specified in RFC7539 for use - in IETF protocols. This is the x86_64 assembler implementation using SIMD - instructions. + in IETF protocols. 
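With CRYPTO_POLY1305 now merely selecting ZINC_POLY1305, crypto API consumers need no changes: the "poly1305" cra_name still resolves, it just lands on the Zinc-backed "poly1305-software" driver registered by poly1305_zinc.c above. A hedged sketch of the unchanged consumer side (in-kernel C, error handling elided, helper name illustrative):

#include <crypto/hash.h>

/* Resolves to whichever shash currently backs "poly1305"; after this
 * series that is the Zinc-backed "poly1305-software" implementation. */
static struct crypto_shash *poly1305_tfm_example(void)
{
        return crypto_alloc_shash("poly1305", 0, 0);
}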
config CRYPTO_MD4 tristate "MD4 digest algorithm" diff --git a/crypto/Makefile b/crypto/Makefile index 6d1d40eeb964..5e60348d02e2 100644 --- a/crypto/Makefile +++ b/crypto/Makefile @@ -118,7 +118,7 @@ obj-$(CONFIG_CRYPTO_SEED) += seed.o obj-$(CONFIG_CRYPTO_SPECK) += speck.o obj-$(CONFIG_CRYPTO_SALSA20) += salsa20_generic.o obj-$(CONFIG_CRYPTO_CHACHA20) += chacha20_generic.o -obj-$(CONFIG_CRYPTO_POLY1305) += poly1305_generic.o +obj-$(CONFIG_CRYPTO_POLY1305) += poly1305_zinc.o obj-$(CONFIG_CRYPTO_DEFLATE) += deflate.o obj-$(CONFIG_CRYPTO_MICHAEL_MIC) += michael_mic.o obj-$(CONFIG_CRYPTO_CRC32C) += crc32c_generic.o diff --git a/crypto/chacha20poly1305.c b/crypto/chacha20poly1305.c index 600afa99941f..bf523797bef3 100644 --- a/crypto/chacha20poly1305.c +++ b/crypto/chacha20poly1305.c @@ -14,7 +14,7 @@ #include #include #include -#include +#include #include #include #include @@ -62,7 +62,7 @@ struct chachapoly_req_ctx { /* the key we generate for Poly1305 using Chacha20 */ u8 key[POLY1305_KEY_SIZE]; /* calculated Poly1305 tag */ - u8 tag[POLY1305_DIGEST_SIZE]; + u8 tag[POLY1305_MAC_SIZE]; /* length of data to en/decrypt, without ICV */ unsigned int cryptlen; /* Actual AD, excluding IV */ @@ -471,7 +471,7 @@ static int chachapoly_decrypt(struct aead_request *req) { struct chachapoly_req_ctx *rctx = aead_request_ctx(req); - rctx->cryptlen = req->cryptlen - POLY1305_DIGEST_SIZE; + rctx->cryptlen = req->cryptlen - POLY1305_MAC_SIZE; /* decrypt call chain: * - poly_genkey/done() @@ -513,7 +513,7 @@ static int chachapoly_setkey(struct crypto_aead *aead, const u8 *key, static int chachapoly_setauthsize(struct crypto_aead *tfm, unsigned int authsize) { - if (authsize != POLY1305_DIGEST_SIZE) + if (authsize != POLY1305_MAC_SIZE) return -EINVAL; return 0; @@ -613,7 +613,7 @@ static int chachapoly_create(struct crypto_template *tmpl, struct rtattr **tb, poly_hash = __crypto_hash_alg_common(poly); err = -EINVAL; - if (poly_hash->digestsize != POLY1305_DIGEST_SIZE) + if (poly_hash->digestsize != POLY1305_MAC_SIZE) goto out_put_poly; err = -ENOMEM; @@ -666,7 +666,7 @@ static int chachapoly_create(struct crypto_template *tmpl, struct rtattr **tb, ctx->saltlen; inst->alg.ivsize = ivsize; inst->alg.chunksize = crypto_skcipher_alg_chunksize(chacha); - inst->alg.maxauthsize = POLY1305_DIGEST_SIZE; + inst->alg.maxauthsize = POLY1305_MAC_SIZE; inst->alg.init = chachapoly_init; inst->alg.exit = chachapoly_exit; inst->alg.encrypt = chachapoly_encrypt; diff --git a/crypto/poly1305_generic.c b/crypto/poly1305_generic.c deleted file mode 100644 index 47d3a6b83931..000000000000 --- a/crypto/poly1305_generic.c +++ /dev/null @@ -1,304 +0,0 @@ -/* - * Poly1305 authenticator algorithm, RFC7539 - * - * Copyright (C) 2015 Martin Willi - * - * Based on public domain code by Andrew Moon and Daniel J. Bernstein. - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. 
- */ - -#include -#include -#include -#include -#include -#include -#include - -static inline u64 mlt(u64 a, u64 b) -{ - return a * b; -} - -static inline u32 sr(u64 v, u_char n) -{ - return v >> n; -} - -static inline u32 and(u32 v, u32 mask) -{ - return v & mask; -} - -int crypto_poly1305_init(struct shash_desc *desc) -{ - struct poly1305_desc_ctx *dctx = shash_desc_ctx(desc); - - memset(dctx->h, 0, sizeof(dctx->h)); - dctx->buflen = 0; - dctx->rset = false; - dctx->sset = false; - - return 0; -} -EXPORT_SYMBOL_GPL(crypto_poly1305_init); - -static void poly1305_setrkey(struct poly1305_desc_ctx *dctx, const u8 *key) -{ - /* r &= 0xffffffc0ffffffc0ffffffc0fffffff */ - dctx->r[0] = (get_unaligned_le32(key + 0) >> 0) & 0x3ffffff; - dctx->r[1] = (get_unaligned_le32(key + 3) >> 2) & 0x3ffff03; - dctx->r[2] = (get_unaligned_le32(key + 6) >> 4) & 0x3ffc0ff; - dctx->r[3] = (get_unaligned_le32(key + 9) >> 6) & 0x3f03fff; - dctx->r[4] = (get_unaligned_le32(key + 12) >> 8) & 0x00fffff; -} - -static void poly1305_setskey(struct poly1305_desc_ctx *dctx, const u8 *key) -{ - dctx->s[0] = get_unaligned_le32(key + 0); - dctx->s[1] = get_unaligned_le32(key + 4); - dctx->s[2] = get_unaligned_le32(key + 8); - dctx->s[3] = get_unaligned_le32(key + 12); -} - -/* - * Poly1305 requires a unique key for each tag, which implies that we can't set - * it on the tfm that gets accessed by multiple users simultaneously. Instead we - * expect the key as the first 32 bytes in the update() call. - */ -unsigned int crypto_poly1305_setdesckey(struct poly1305_desc_ctx *dctx, - const u8 *src, unsigned int srclen) -{ - if (!dctx->sset) { - if (!dctx->rset && srclen >= POLY1305_BLOCK_SIZE) { - poly1305_setrkey(dctx, src); - src += POLY1305_BLOCK_SIZE; - srclen -= POLY1305_BLOCK_SIZE; - dctx->rset = true; - } - if (srclen >= POLY1305_BLOCK_SIZE) { - poly1305_setskey(dctx, src); - src += POLY1305_BLOCK_SIZE; - srclen -= POLY1305_BLOCK_SIZE; - dctx->sset = true; - } - } - return srclen; -} -EXPORT_SYMBOL_GPL(crypto_poly1305_setdesckey); - -static unsigned int poly1305_blocks(struct poly1305_desc_ctx *dctx, - const u8 *src, unsigned int srclen, - u32 hibit) -{ - u32 r0, r1, r2, r3, r4; - u32 s1, s2, s3, s4; - u32 h0, h1, h2, h3, h4; - u64 d0, d1, d2, d3, d4; - unsigned int datalen; - - if (unlikely(!dctx->sset)) { - datalen = crypto_poly1305_setdesckey(dctx, src, srclen); - src += srclen - datalen; - srclen = datalen; - } - - r0 = dctx->r[0]; - r1 = dctx->r[1]; - r2 = dctx->r[2]; - r3 = dctx->r[3]; - r4 = dctx->r[4]; - - s1 = r1 * 5; - s2 = r2 * 5; - s3 = r3 * 5; - s4 = r4 * 5; - - h0 = dctx->h[0]; - h1 = dctx->h[1]; - h2 = dctx->h[2]; - h3 = dctx->h[3]; - h4 = dctx->h[4]; - - while (likely(srclen >= POLY1305_BLOCK_SIZE)) { - - /* h += m[i] */ - h0 += (get_unaligned_le32(src + 0) >> 0) & 0x3ffffff; - h1 += (get_unaligned_le32(src + 3) >> 2) & 0x3ffffff; - h2 += (get_unaligned_le32(src + 6) >> 4) & 0x3ffffff; - h3 += (get_unaligned_le32(src + 9) >> 6) & 0x3ffffff; - h4 += (get_unaligned_le32(src + 12) >> 8) | hibit; - - /* h *= r */ - d0 = mlt(h0, r0) + mlt(h1, s4) + mlt(h2, s3) + - mlt(h3, s2) + mlt(h4, s1); - d1 = mlt(h0, r1) + mlt(h1, r0) + mlt(h2, s4) + - mlt(h3, s3) + mlt(h4, s2); - d2 = mlt(h0, r2) + mlt(h1, r1) + mlt(h2, r0) + - mlt(h3, s4) + mlt(h4, s3); - d3 = mlt(h0, r3) + mlt(h1, r2) + mlt(h2, r1) + - mlt(h3, r0) + mlt(h4, s4); - d4 = mlt(h0, r4) + mlt(h1, r3) + mlt(h2, r2) + - mlt(h3, r1) + mlt(h4, r0); - - /* (partial) h %= p */ - d1 += sr(d0, 26); h0 = and(d0, 0x3ffffff); - d2 += sr(d1, 26); h1 = and(d1, 0x3ffffff); - 
d3 += sr(d2, 26); h2 = and(d2, 0x3ffffff); - d4 += sr(d3, 26); h3 = and(d3, 0x3ffffff); - h0 += sr(d4, 26) * 5; h4 = and(d4, 0x3ffffff); - h1 += h0 >> 26; h0 = h0 & 0x3ffffff; - - src += POLY1305_BLOCK_SIZE; - srclen -= POLY1305_BLOCK_SIZE; - } - - dctx->h[0] = h0; - dctx->h[1] = h1; - dctx->h[2] = h2; - dctx->h[3] = h3; - dctx->h[4] = h4; - - return srclen; -} - -int crypto_poly1305_update(struct shash_desc *desc, - const u8 *src, unsigned int srclen) -{ - struct poly1305_desc_ctx *dctx = shash_desc_ctx(desc); - unsigned int bytes; - - if (unlikely(dctx->buflen)) { - bytes = min(srclen, POLY1305_BLOCK_SIZE - dctx->buflen); - memcpy(dctx->buf + dctx->buflen, src, bytes); - src += bytes; - srclen -= bytes; - dctx->buflen += bytes; - - if (dctx->buflen == POLY1305_BLOCK_SIZE) { - poly1305_blocks(dctx, dctx->buf, - POLY1305_BLOCK_SIZE, 1 << 24); - dctx->buflen = 0; - } - } - - if (likely(srclen >= POLY1305_BLOCK_SIZE)) { - bytes = poly1305_blocks(dctx, src, srclen, 1 << 24); - src += srclen - bytes; - srclen = bytes; - } - - if (unlikely(srclen)) { - dctx->buflen = srclen; - memcpy(dctx->buf, src, srclen); - } - - return 0; -} -EXPORT_SYMBOL_GPL(crypto_poly1305_update); - -int crypto_poly1305_final(struct shash_desc *desc, u8 *dst) -{ - struct poly1305_desc_ctx *dctx = shash_desc_ctx(desc); - u32 h0, h1, h2, h3, h4; - u32 g0, g1, g2, g3, g4; - u32 mask; - u64 f = 0; - - if (unlikely(!dctx->sset)) - return -ENOKEY; - - if (unlikely(dctx->buflen)) { - dctx->buf[dctx->buflen++] = 1; - memset(dctx->buf + dctx->buflen, 0, - POLY1305_BLOCK_SIZE - dctx->buflen); - poly1305_blocks(dctx, dctx->buf, POLY1305_BLOCK_SIZE, 0); - } - - /* fully carry h */ - h0 = dctx->h[0]; - h1 = dctx->h[1]; - h2 = dctx->h[2]; - h3 = dctx->h[3]; - h4 = dctx->h[4]; - - h2 += (h1 >> 26); h1 = h1 & 0x3ffffff; - h3 += (h2 >> 26); h2 = h2 & 0x3ffffff; - h4 += (h3 >> 26); h3 = h3 & 0x3ffffff; - h0 += (h4 >> 26) * 5; h4 = h4 & 0x3ffffff; - h1 += (h0 >> 26); h0 = h0 & 0x3ffffff; - - /* compute h + -p */ - g0 = h0 + 5; - g1 = h1 + (g0 >> 26); g0 &= 0x3ffffff; - g2 = h2 + (g1 >> 26); g1 &= 0x3ffffff; - g3 = h3 + (g2 >> 26); g2 &= 0x3ffffff; - g4 = h4 + (g3 >> 26) - (1 << 26); g3 &= 0x3ffffff; - - /* select h if h < p, or h + -p if h >= p */ - mask = (g4 >> ((sizeof(u32) * 8) - 1)) - 1; - g0 &= mask; - g1 &= mask; - g2 &= mask; - g3 &= mask; - g4 &= mask; - mask = ~mask; - h0 = (h0 & mask) | g0; - h1 = (h1 & mask) | g1; - h2 = (h2 & mask) | g2; - h3 = (h3 & mask) | g3; - h4 = (h4 & mask) | g4; - - /* h = h % (2^128) */ - h0 = (h0 >> 0) | (h1 << 26); - h1 = (h1 >> 6) | (h2 << 20); - h2 = (h2 >> 12) | (h3 << 14); - h3 = (h3 >> 18) | (h4 << 8); - - /* mac = (h + s) % (2^128) */ - f = (f >> 32) + h0 + dctx->s[0]; put_unaligned_le32(f, dst + 0); - f = (f >> 32) + h1 + dctx->s[1]; put_unaligned_le32(f, dst + 4); - f = (f >> 32) + h2 + dctx->s[2]; put_unaligned_le32(f, dst + 8); - f = (f >> 32) + h3 + dctx->s[3]; put_unaligned_le32(f, dst + 12); - - return 0; -} -EXPORT_SYMBOL_GPL(crypto_poly1305_final); - -static struct shash_alg poly1305_alg = { - .digestsize = POLY1305_DIGEST_SIZE, - .init = crypto_poly1305_init, - .update = crypto_poly1305_update, - .final = crypto_poly1305_final, - .descsize = sizeof(struct poly1305_desc_ctx), - .base = { - .cra_name = "poly1305", - .cra_driver_name = "poly1305-generic", - .cra_priority = 100, - .cra_blocksize = POLY1305_BLOCK_SIZE, - .cra_module = THIS_MODULE, - }, -}; - -static int __init poly1305_mod_init(void) -{ - return crypto_register_shash(&poly1305_alg); -} - -static void __exit 
poly1305_mod_exit(void) -{ - crypto_unregister_shash(&poly1305_alg); -} - -module_init(poly1305_mod_init); -module_exit(poly1305_mod_exit); - -MODULE_LICENSE("GPL"); -MODULE_AUTHOR("Martin Willi "); -MODULE_DESCRIPTION("Poly1305 authenticator"); -MODULE_ALIAS_CRYPTO("poly1305"); -MODULE_ALIAS_CRYPTO("poly1305-generic"); diff --git a/crypto/poly1305_zinc.c b/crypto/poly1305_zinc.c new file mode 100644 index 000000000000..4794442edf26 --- /dev/null +++ b/crypto/poly1305_zinc.c @@ -0,0 +1,98 @@ +/* SPDX-License-Identifier: GPL-2.0 + * + * Copyright (C) 2018 Jason A. Donenfeld . All Rights Reserved. + */ + +#include +#include +#include +#include +#include +#include +#include + +struct poly1305_desc_ctx { + struct poly1305_ctx ctx; + u8 key[POLY1305_KEY_SIZE]; + unsigned int rem_key_bytes; +}; + +static int crypto_poly1305_init(struct shash_desc *desc) +{ + struct poly1305_desc_ctx *dctx = shash_desc_ctx(desc); + dctx->rem_key_bytes = POLY1305_KEY_SIZE; + return 0; +} + +static int crypto_poly1305_update(struct shash_desc *desc, const u8 *src, + unsigned int srclen) +{ + struct poly1305_desc_ctx *dctx = shash_desc_ctx(desc); + simd_context_t simd_context; + + if (unlikely(dctx->rem_key_bytes)) { + unsigned int key_bytes = min(srclen, dctx->rem_key_bytes); + memcpy(dctx->key + (POLY1305_KEY_SIZE - dctx->rem_key_bytes), + src, key_bytes); + src += key_bytes; + srclen -= key_bytes; + dctx->rem_key_bytes -= key_bytes; + if (!dctx->rem_key_bytes) { + poly1305_init(&dctx->ctx, dctx->key); + memzero_explicit(dctx->key, sizeof(dctx->key)); + } + if (!srclen) + return 0; + } + + simd_get(&simd_context); + poly1305_update(&dctx->ctx, src, srclen, &simd_context); + simd_put(&simd_context); + + return 0; +} + +static int crypto_poly1305_final(struct shash_desc *desc, u8 *dst) +{ + struct poly1305_desc_ctx *dctx = shash_desc_ctx(desc); + simd_context_t simd_context; + + simd_get(&simd_context); + poly1305_final(&dctx->ctx, dst, &simd_context); + simd_put(&simd_context); + return 0; +} + +static struct shash_alg poly1305_alg = { + .digestsize = POLY1305_MAC_SIZE, + .init = crypto_poly1305_init, + .update = crypto_poly1305_update, + .final = crypto_poly1305_final, + .descsize = sizeof(struct poly1305_desc_ctx), + .base = { + .cra_name = "poly1305", + .cra_driver_name = "poly1305-software", + .cra_priority = 100, + .cra_blocksize = POLY1305_BLOCK_SIZE, + .cra_module = THIS_MODULE, + }, +}; + +static int __init poly1305_mod_init(void) +{ + return crypto_register_shash(&poly1305_alg); +} + +static void __exit poly1305_mod_exit(void) +{ + crypto_unregister_shash(&poly1305_alg); +} + +module_init(poly1305_mod_init); +module_exit(poly1305_mod_exit); + +MODULE_LICENSE("GPL"); +MODULE_AUTHOR("Jason A. 
Donenfeld "); +MODULE_DESCRIPTION("Poly1305 authenticator"); +MODULE_ALIAS_CRYPTO("poly1305"); +MODULE_ALIAS_CRYPTO("poly1305-software"); diff --git a/include/crypto/poly1305.h b/include/crypto/poly1305.h deleted file mode 100644 index f718a19da82f..000000000000 --- a/include/crypto/poly1305.h +++ /dev/null @@ -1,40 +0,0 @@ -/* SPDX-License-Identifier: GPL-2.0 */ -/* - * Common values for the Poly1305 algorithm - */ - -#ifndef _CRYPTO_POLY1305_H -#define _CRYPTO_POLY1305_H - -#include -#include - -#define POLY1305_BLOCK_SIZE 16 -#define POLY1305_KEY_SIZE 32 -#define POLY1305_DIGEST_SIZE 16 - -struct poly1305_desc_ctx { - /* key */ - u32 r[5]; - /* finalize key */ - u32 s[4]; - /* accumulator */ - u32 h[5]; - /* partial buffer */ - u8 buf[POLY1305_BLOCK_SIZE]; - /* bytes used in partial buffer */ - unsigned int buflen; - /* r key has been set */ - bool rset; - /* s key has been set */ - bool sset; -}; - -int crypto_poly1305_init(struct shash_desc *desc); -unsigned int crypto_poly1305_setdesckey(struct poly1305_desc_ctx *dctx, - const u8 *src, unsigned int srclen); -int crypto_poly1305_update(struct shash_desc *desc, - const u8 *src, unsigned int srclen); -int crypto_poly1305_final(struct shash_desc *desc, u8 *dst); - -#endif From patchwork Sat Oct 6 02:57:07 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Jason A. Donenfeld" X-Patchwork-Id: 148320 Delivered-To: patch@linaro.org Received: by 2002:a2e:8595:0:0:0:0:0 with SMTP id b21-v6csp1158057lji; Fri, 5 Oct 2018 19:58:57 -0700 (PDT) X-Google-Smtp-Source: ACcGV618Rg6TsSQ0xFTpZm7sT3pI0HV25Q76mjYfw9XVV4dUOxQZxec2YAfPuQZc5PB34Den66HD X-Received: by 2002:a17:902:b903:: with SMTP id bf3-v6mr14398963plb.54.1538794737550; Fri, 05 Oct 2018 19:58:57 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1538794737; cv=none; d=google.com; s=arc-20160816; b=q8GjI9NPi6qnKss8u2S+3QdG9mWUpUIZpKixd6WVyzkLsJZIi2urn/R8BUDrg6gPoJ YxHPtCgwTF/DxikPYnqPPhZS64lfG1KBrA94cdmmXeyQwMECYLS91vC3GPQHTtIGaHYi +ApA88bAUpY8HBq/hn5d1/TV3mm6LnOwfg+T/RixbNpsIkYlzKRURtj2KeHakE7ga8R3 kwCfzPzGOFL3M+qxz953lh7Git+3C+y1E4qpJLNKE9c6IS9WFGoQQ6Jpyb87TLf0DSDS SnuYRd/DlfvWJkGvyWEKrupXFEnf4dYkhu8I3/hx0SUDgQtXV8oGVca/WXOSQjTGSsz8 RCWA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from :dkim-signature; bh=EontgSI05TLNwVPcc0KrWCNSPY6TL1o5v2UgGslrPwU=; b=zUYQ5hImsceSpO2j4Sdy9MvJSCQZNdlHnvzBUP0LPqZEVziGv2ijGXW+h/xpBSAlT/ HkYzFgyCN8mk8OGbfzyeM9x3jyA2BlijEHBALHhO1XOWnsMnYiY3Ac0n0eBm57PQmG2I 4U1fC6/IERhCIfiQp6LtzquZZg/277Mu4t/fuLrtE20tsuhRigSYPIUAHQP2hFW3i6lE bXLBwh9Zaauu7s9GZjROUAjQf32y2NiKKyZ8wsyH1c8kmVWhU5M3IQpDrwx9vHGkQFMf BzdHJ9Ay1DGN4evPoHA23Y6SIEZPMMnQ1j9+wj2tvHBpK+GmXExX/m6F2EF80Z8d8QbR Zysw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@zx2c4.com header.s=mail header.b=U+jLJM32; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=zx2c4.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. 
[209.132.180.67]) by mx.google.com with ESMTP id bd1-v6si10633502plb.156.2018.10.05.19.58.56; Fri, 05 Oct 2018 19:58:57 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@zx2c4.com header.s=mail header.b=U+jLJM32; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=zx2c4.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729987AbeJFKA0 (ORCPT + 32 others); Sat, 6 Oct 2018 06:00:26 -0400 Received: from frisell.zx2c4.com ([192.95.5.64]:60175 "EHLO frisell.zx2c4.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729864AbeJFKAY (ORCPT ); Sat, 6 Oct 2018 06:00:24 -0400 Received: by frisell.zx2c4.com (ZX2C4 Mail Server) with ESMTP id e15d9008; Sat, 6 Oct 2018 02:58:18 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=zx2c4.com; h=from:to:cc :subject:date:message-id:in-reply-to:references:mime-version :content-transfer-encoding; s=mail; bh=tJVF0LR38wlJUhuWGK8T9G4So zg=; b=U+jLJM32OIZh9A0/o4cGNEqC5wByfzUlLpWZrEw4gyZT+ZTQW+pTSL5Zv sUYLIM8ybDbL2egpiaTqjifVr0Pp3c82i7dOZL118Gh3k6iSjc/ApuKkU0BKRP/f 2BWcYNsgSthCELWlY6D5+yjll/u5AjW9Gf31Oi0zaBPsLNQGaVYI68TppOrGPBvk TRhNr1pGVljQA9z2uUIn9XKLIxziflV9aTGhjUMLpcmx6Q20ptWIyTB6DGMLZmQ5 BsCJNfqDVwcj0SNu9tkZB2x3ADgjBmcxwL/N8KXQ2+v/OXv/OEWUN9+ucJ6njOGh AH1n27j/4wlywbxrVtz4UCLKc/QAA== Received: by frisell.zx2c4.com (ZX2C4 Mail Server) with ESMTPSA id b7f89c17 (TLSv1.2:ECDHE-RSA-AES256-GCM-SHA384:256:NO); Sat, 6 Oct 2018 02:58:17 +0000 (UTC) From: "Jason A. Donenfeld" To: linux-kernel@vger.kernel.org, netdev@vger.kernel.org, davem@davemloft.net, gregkh@linuxfoundation.org Cc: "Jason A. Donenfeld" , Samuel Neves , Andy Lutomirski , linux-crypto@vger.kernel.org Subject: [PATCH net-next v7 26/28] crypto: port ChaCha20 to Zinc Date: Sat, 6 Oct 2018 04:57:07 +0200 Message-Id: <20181006025709.4019-27-Jason@zx2c4.com> In-Reply-To: <20181006025709.4019-1-Jason@zx2c4.com> References: <20181006025709.4019-1-Jason@zx2c4.com> MIME-Version: 1.0 Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Now that ChaCha20 is in Zinc, we can have the crypto API code simply call into it. The crypto API expects to have a stored key per instance and independent nonces, so we follow suite and store the key and initialize the nonce independently. Signed-off-by: Jason A. 
Donenfeld Cc: Samuel Neves Cc: Andy Lutomirski Cc: Greg KH Cc: linux-crypto@vger.kernel.org --- arch/arm/configs/exynos_defconfig | 1 - arch/arm/configs/multi_v7_defconfig | 1 - arch/arm/configs/omap2plus_defconfig | 1 - arch/arm/crypto/Kconfig | 6 - arch/arm/crypto/Makefile | 2 - arch/arm/crypto/chacha20-neon-core.S | 521 -------------------- arch/arm/crypto/chacha20-neon-glue.c | 127 ----- arch/arm64/configs/defconfig | 1 - arch/arm64/crypto/Kconfig | 6 - arch/arm64/crypto/Makefile | 3 - arch/arm64/crypto/chacha20-neon-core.S | 450 ----------------- arch/arm64/crypto/chacha20-neon-glue.c | 133 ----- arch/x86/crypto/Makefile | 3 - arch/x86/crypto/chacha20-avx2-x86_64.S | 448 ----------------- arch/x86/crypto/chacha20-ssse3-x86_64.S | 630 ------------------------ arch/x86/crypto/chacha20_glue.c | 146 ------ crypto/Kconfig | 17 +- crypto/Makefile | 2 +- crypto/chacha20_generic.c | 136 ----- crypto/chacha20_zinc.c | 90 ++++ crypto/chacha20poly1305.c | 8 +- include/crypto/chacha20.h | 12 - 22 files changed, 96 insertions(+), 2648 deletions(-) delete mode 100644 arch/arm/crypto/chacha20-neon-core.S delete mode 100644 arch/arm/crypto/chacha20-neon-glue.c delete mode 100644 arch/arm64/crypto/chacha20-neon-core.S delete mode 100644 arch/arm64/crypto/chacha20-neon-glue.c delete mode 100644 arch/x86/crypto/chacha20-avx2-x86_64.S delete mode 100644 arch/x86/crypto/chacha20-ssse3-x86_64.S delete mode 100644 arch/x86/crypto/chacha20_glue.c delete mode 100644 crypto/chacha20_generic.c create mode 100644 crypto/chacha20_zinc.c -- 2.19.0 diff --git a/arch/arm/configs/exynos_defconfig b/arch/arm/configs/exynos_defconfig index 27ea6dfcf2f2..95929b5e7b10 100644 --- a/arch/arm/configs/exynos_defconfig +++ b/arch/arm/configs/exynos_defconfig @@ -350,7 +350,6 @@ CONFIG_CRYPTO_SHA1_ARM_NEON=m CONFIG_CRYPTO_SHA256_ARM=m CONFIG_CRYPTO_SHA512_ARM=m CONFIG_CRYPTO_AES_ARM_BS=m -CONFIG_CRYPTO_CHACHA20_NEON=m CONFIG_CRC_CCITT=y CONFIG_FONTS=y CONFIG_FONT_7x14=y diff --git a/arch/arm/configs/multi_v7_defconfig b/arch/arm/configs/multi_v7_defconfig index fc33444e94f0..63be07724db3 100644 --- a/arch/arm/configs/multi_v7_defconfig +++ b/arch/arm/configs/multi_v7_defconfig @@ -1000,4 +1000,3 @@ CONFIG_CRYPTO_AES_ARM_BS=m CONFIG_CRYPTO_AES_ARM_CE=m CONFIG_CRYPTO_GHASH_ARM_CE=m CONFIG_CRYPTO_CRC32_ARM_CE=m -CONFIG_CRYPTO_CHACHA20_NEON=m diff --git a/arch/arm/configs/omap2plus_defconfig b/arch/arm/configs/omap2plus_defconfig index 6491419b1dad..f585a8ecc336 100644 --- a/arch/arm/configs/omap2plus_defconfig +++ b/arch/arm/configs/omap2plus_defconfig @@ -547,7 +547,6 @@ CONFIG_CRYPTO_SHA512_ARM=m CONFIG_CRYPTO_AES_ARM=m CONFIG_CRYPTO_AES_ARM_BS=m CONFIG_CRYPTO_GHASH_ARM_CE=m -CONFIG_CRYPTO_CHACHA20_NEON=m CONFIG_CRC_CCITT=y CONFIG_CRC_T10DIF=y CONFIG_CRC_ITU_T=y diff --git a/arch/arm/crypto/Kconfig b/arch/arm/crypto/Kconfig index 925d1364727a..fb80fd89f0e7 100644 --- a/arch/arm/crypto/Kconfig +++ b/arch/arm/crypto/Kconfig @@ -115,12 +115,6 @@ config CRYPTO_CRC32_ARM_CE depends on KERNEL_MODE_NEON && CRC32 select CRYPTO_HASH -config CRYPTO_CHACHA20_NEON - tristate "NEON accelerated ChaCha20 symmetric cipher" - depends on KERNEL_MODE_NEON - select CRYPTO_BLKCIPHER - select CRYPTO_CHACHA20 - config CRYPTO_SPECK_NEON tristate "NEON accelerated Speck cipher algorithms" depends on KERNEL_MODE_NEON diff --git a/arch/arm/crypto/Makefile b/arch/arm/crypto/Makefile index 8de542c48ade..bbfa98447063 100644 --- a/arch/arm/crypto/Makefile +++ b/arch/arm/crypto/Makefile @@ -9,7 +9,6 @@ obj-$(CONFIG_CRYPTO_SHA1_ARM) += sha1-arm.o 
obj-$(CONFIG_CRYPTO_SHA1_ARM_NEON) += sha1-arm-neon.o obj-$(CONFIG_CRYPTO_SHA256_ARM) += sha256-arm.o obj-$(CONFIG_CRYPTO_SHA512_ARM) += sha512-arm.o -obj-$(CONFIG_CRYPTO_CHACHA20_NEON) += chacha20-neon.o obj-$(CONFIG_CRYPTO_SPECK_NEON) += speck-neon.o ce-obj-$(CONFIG_CRYPTO_AES_ARM_CE) += aes-arm-ce.o @@ -53,7 +52,6 @@ aes-arm-ce-y := aes-ce-core.o aes-ce-glue.o ghash-arm-ce-y := ghash-ce-core.o ghash-ce-glue.o crct10dif-arm-ce-y := crct10dif-ce-core.o crct10dif-ce-glue.o crc32-arm-ce-y:= crc32-ce-core.o crc32-ce-glue.o -chacha20-neon-y := chacha20-neon-core.o chacha20-neon-glue.o speck-neon-y := speck-neon-core.o speck-neon-glue.o ifdef REGENERATE_ARM_CRYPTO diff --git a/arch/arm/crypto/chacha20-neon-core.S b/arch/arm/crypto/chacha20-neon-core.S deleted file mode 100644 index 451a849ad518..000000000000 --- a/arch/arm/crypto/chacha20-neon-core.S +++ /dev/null @@ -1,521 +0,0 @@ -/* - * ChaCha20 256-bit cipher algorithm, RFC7539, ARM NEON functions - * - * Copyright (C) 2016 Linaro, Ltd. - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License version 2 as - * published by the Free Software Foundation. - * - * Based on: - * ChaCha20 256-bit cipher algorithm, RFC7539, x64 SSE3 functions - * - * Copyright (C) 2015 Martin Willi - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - */ - -#include - - .text - .fpu neon - .align 5 - -ENTRY(chacha20_block_xor_neon) - // r0: Input state matrix, s - // r1: 1 data block output, o - // r2: 1 data block input, i - - // - // This function encrypts one ChaCha20 block by loading the state matrix - // in four NEON registers. It performs matrix operation on four words in - // parallel, but requires shuffling to rearrange the words after each - // round.
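As a reading aid for the NEON code that follows, here is a hedged plain-C model of what the commented steps compute, i.e. the standard ChaCha20 double round (illustrative only, not the kernel's code):

#include <stdint.h>

static uint32_t rotl32(uint32_t v, int n)
{
        return (v << n) | (v >> (32 - n));
}

/* One quarter round, matching the "x0 += x1, x3 = rotl32(x3 ^ x0, 16)"
 * style comments in the assembly. */
static void quarter_round(uint32_t *a, uint32_t *b, uint32_t *c, uint32_t *d)
{
        *a += *b; *d = rotl32(*d ^ *a, 16);
        *c += *d; *b = rotl32(*b ^ *c, 12);
        *a += *b; *d = rotl32(*d ^ *a, 8);
        *c += *d; *b = rotl32(*b ^ *c, 7);
}

/* A double round: a column round, then a diagonal round. The vext.8
 * shuffles in the NEON code rotate the rows between the two halves so
 * the diagonal round can reuse the column-wise instruction sequence. */
static void double_round(uint32_t x[16])
{
        int i;

        for (i = 0; i < 4; i++)         /* columns */
                quarter_round(&x[i], &x[4 + i], &x[8 + i], &x[12 + i]);
        quarter_round(&x[0], &x[5], &x[10], &x[15]);    /* diagonals */
        quarter_round(&x[1], &x[6], &x[11], &x[12]);
        quarter_round(&x[2], &x[7], &x[8], &x[13]);
        quarter_round(&x[3], &x[4], &x[9], &x[14]);
}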
- // - - // x0..3 = s0..3 - add ip, r0, #0x20 - vld1.32 {q0-q1}, [r0] - vld1.32 {q2-q3}, [ip] - - vmov q8, q0 - vmov q9, q1 - vmov q10, q2 - vmov q11, q3 - - mov r3, #10 - -.Ldoubleround: - // x0 += x1, x3 = rotl32(x3 ^ x0, 16) - vadd.i32 q0, q0, q1 - veor q3, q3, q0 - vrev32.16 q3, q3 - - // x2 += x3, x1 = rotl32(x1 ^ x2, 12) - vadd.i32 q2, q2, q3 - veor q4, q1, q2 - vshl.u32 q1, q4, #12 - vsri.u32 q1, q4, #20 - - // x0 += x1, x3 = rotl32(x3 ^ x0, 8) - vadd.i32 q0, q0, q1 - veor q4, q3, q0 - vshl.u32 q3, q4, #8 - vsri.u32 q3, q4, #24 - - // x2 += x3, x1 = rotl32(x1 ^ x2, 7) - vadd.i32 q2, q2, q3 - veor q4, q1, q2 - vshl.u32 q1, q4, #7 - vsri.u32 q1, q4, #25 - - // x1 = shuffle32(x1, MASK(0, 3, 2, 1)) - vext.8 q1, q1, q1, #4 - // x2 = shuffle32(x2, MASK(1, 0, 3, 2)) - vext.8 q2, q2, q2, #8 - // x3 = shuffle32(x3, MASK(2, 1, 0, 3)) - vext.8 q3, q3, q3, #12 - - // x0 += x1, x3 = rotl32(x3 ^ x0, 16) - vadd.i32 q0, q0, q1 - veor q3, q3, q0 - vrev32.16 q3, q3 - - // x2 += x3, x1 = rotl32(x1 ^ x2, 12) - vadd.i32 q2, q2, q3 - veor q4, q1, q2 - vshl.u32 q1, q4, #12 - vsri.u32 q1, q4, #20 - - // x0 += x1, x3 = rotl32(x3 ^ x0, 8) - vadd.i32 q0, q0, q1 - veor q4, q3, q0 - vshl.u32 q3, q4, #8 - vsri.u32 q3, q4, #24 - - // x2 += x3, x1 = rotl32(x1 ^ x2, 7) - vadd.i32 q2, q2, q3 - veor q4, q1, q2 - vshl.u32 q1, q4, #7 - vsri.u32 q1, q4, #25 - - // x1 = shuffle32(x1, MASK(2, 1, 0, 3)) - vext.8 q1, q1, q1, #12 - // x2 = shuffle32(x2, MASK(1, 0, 3, 2)) - vext.8 q2, q2, q2, #8 - // x3 = shuffle32(x3, MASK(0, 3, 2, 1)) - vext.8 q3, q3, q3, #4 - - subs r3, r3, #1 - bne .Ldoubleround - - add ip, r2, #0x20 - vld1.8 {q4-q5}, [r2] - vld1.8 {q6-q7}, [ip] - - // o0 = i0 ^ (x0 + s0) - vadd.i32 q0, q0, q8 - veor q0, q0, q4 - - // o1 = i1 ^ (x1 + s1) - vadd.i32 q1, q1, q9 - veor q1, q1, q5 - - // o2 = i2 ^ (x2 + s2) - vadd.i32 q2, q2, q10 - veor q2, q2, q6 - - // o3 = i3 ^ (x3 + s3) - vadd.i32 q3, q3, q11 - veor q3, q3, q7 - - add ip, r1, #0x20 - vst1.8 {q0-q1}, [r1] - vst1.8 {q2-q3}, [ip] - - bx lr -ENDPROC(chacha20_block_xor_neon) - - .align 5 -ENTRY(chacha20_4block_xor_neon) - push {r4-r6, lr} - mov ip, sp // preserve the stack pointer - sub r3, sp, #0x20 // allocate a 32 byte buffer - bic r3, r3, #0x1f // aligned to 32 bytes - mov sp, r3 - - // r0: Input state matrix, s - // r1: 4 data blocks output, o - // r2: 4 data blocks input, i - - // - // This function encrypts four consecutive ChaCha20 blocks by loading - // the state matrix in NEON registers four times. The algorithm performs - // each operation on the corresponding word of each state matrix, hence - // requires no word shuffling. For final XORing step we transpose the - // matrix by interleaving 32- and then 64-bit words, which allows us to - // do XOR in NEON registers. 
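Put differently, the four-block path keeps each of the sixteen state words as a four-lane vector, one lane per block, so every instruction advances all four blocks at once; only word 12, the block counter, differs between lanes. A hedged scalar model of that setup (illustrative C, not the kernel's code):

#include <stdint.h>

/* Four-way layout: x[word][lane], where lane is the block index. The
 * broadcast loads in the assembly build this, and CTRINC supplies the
 * per-lane counter offsets 0,1,2,3. */
static void chacha20_4block_layout(const uint32_t state[16], uint32_t x[16][4])
{
        int word, lane;

        for (word = 0; word < 16; word++)
                for (lane = 0; lane < 4; lane++)
                        x[word][lane] = state[word];
        for (lane = 0; lane < 4; lane++)
                x[12][lane] += lane;    /* x12 += counter values 0-3 */
        /* ...the 20 rounds then run word-wise, with no shuffling... */
}

The vzip/vswp sequence at the end of the function is the inverse transpose: it turns this word-per-vector layout back into four contiguous 64-byte blocks so the keystream can be XORed against the input in NEON registers.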
- // - - // x0..15[0-3] = s0..3[0..3] - add r3, r0, #0x20 - vld1.32 {q0-q1}, [r0] - vld1.32 {q2-q3}, [r3] - - adr r3, CTRINC - vdup.32 q15, d7[1] - vdup.32 q14, d7[0] - vld1.32 {q11}, [r3, :128] - vdup.32 q13, d6[1] - vdup.32 q12, d6[0] - vadd.i32 q12, q12, q11 // x12 += counter values 0-3 - vdup.32 q11, d5[1] - vdup.32 q10, d5[0] - vdup.32 q9, d4[1] - vdup.32 q8, d4[0] - vdup.32 q7, d3[1] - vdup.32 q6, d3[0] - vdup.32 q5, d2[1] - vdup.32 q4, d2[0] - vdup.32 q3, d1[1] - vdup.32 q2, d1[0] - vdup.32 q1, d0[1] - vdup.32 q0, d0[0] - - mov r3, #10 - -.Ldoubleround4: - // x0 += x4, x12 = rotl32(x12 ^ x0, 16) - // x1 += x5, x13 = rotl32(x13 ^ x1, 16) - // x2 += x6, x14 = rotl32(x14 ^ x2, 16) - // x3 += x7, x15 = rotl32(x15 ^ x3, 16) - vadd.i32 q0, q0, q4 - vadd.i32 q1, q1, q5 - vadd.i32 q2, q2, q6 - vadd.i32 q3, q3, q7 - - veor q12, q12, q0 - veor q13, q13, q1 - veor q14, q14, q2 - veor q15, q15, q3 - - vrev32.16 q12, q12 - vrev32.16 q13, q13 - vrev32.16 q14, q14 - vrev32.16 q15, q15 - - // x8 += x12, x4 = rotl32(x4 ^ x8, 12) - // x9 += x13, x5 = rotl32(x5 ^ x9, 12) - // x10 += x14, x6 = rotl32(x6 ^ x10, 12) - // x11 += x15, x7 = rotl32(x7 ^ x11, 12) - vadd.i32 q8, q8, q12 - vadd.i32 q9, q9, q13 - vadd.i32 q10, q10, q14 - vadd.i32 q11, q11, q15 - - vst1.32 {q8-q9}, [sp, :256] - - veor q8, q4, q8 - veor q9, q5, q9 - vshl.u32 q4, q8, #12 - vshl.u32 q5, q9, #12 - vsri.u32 q4, q8, #20 - vsri.u32 q5, q9, #20 - - veor q8, q6, q10 - veor q9, q7, q11 - vshl.u32 q6, q8, #12 - vshl.u32 q7, q9, #12 - vsri.u32 q6, q8, #20 - vsri.u32 q7, q9, #20 - - // x0 += x4, x12 = rotl32(x12 ^ x0, 8) - // x1 += x5, x13 = rotl32(x13 ^ x1, 8) - // x2 += x6, x14 = rotl32(x14 ^ x2, 8) - // x3 += x7, x15 = rotl32(x15 ^ x3, 8) - vadd.i32 q0, q0, q4 - vadd.i32 q1, q1, q5 - vadd.i32 q2, q2, q6 - vadd.i32 q3, q3, q7 - - veor q8, q12, q0 - veor q9, q13, q1 - vshl.u32 q12, q8, #8 - vshl.u32 q13, q9, #8 - vsri.u32 q12, q8, #24 - vsri.u32 q13, q9, #24 - - veor q8, q14, q2 - veor q9, q15, q3 - vshl.u32 q14, q8, #8 - vshl.u32 q15, q9, #8 - vsri.u32 q14, q8, #24 - vsri.u32 q15, q9, #24 - - vld1.32 {q8-q9}, [sp, :256] - - // x8 += x12, x4 = rotl32(x4 ^ x8, 7) - // x9 += x13, x5 = rotl32(x5 ^ x9, 7) - // x10 += x14, x6 = rotl32(x6 ^ x10, 7) - // x11 += x15, x7 = rotl32(x7 ^ x11, 7) - vadd.i32 q8, q8, q12 - vadd.i32 q9, q9, q13 - vadd.i32 q10, q10, q14 - vadd.i32 q11, q11, q15 - - vst1.32 {q8-q9}, [sp, :256] - - veor q8, q4, q8 - veor q9, q5, q9 - vshl.u32 q4, q8, #7 - vshl.u32 q5, q9, #7 - vsri.u32 q4, q8, #25 - vsri.u32 q5, q9, #25 - - veor q8, q6, q10 - veor q9, q7, q11 - vshl.u32 q6, q8, #7 - vshl.u32 q7, q9, #7 - vsri.u32 q6, q8, #25 - vsri.u32 q7, q9, #25 - - vld1.32 {q8-q9}, [sp, :256] - - // x0 += x5, x15 = rotl32(x15 ^ x0, 16) - // x1 += x6, x12 = rotl32(x12 ^ x1, 16) - // x2 += x7, x13 = rotl32(x13 ^ x2, 16) - // x3 += x4, x14 = rotl32(x14 ^ x3, 16) - vadd.i32 q0, q0, q5 - vadd.i32 q1, q1, q6 - vadd.i32 q2, q2, q7 - vadd.i32 q3, q3, q4 - - veor q15, q15, q0 - veor q12, q12, q1 - veor q13, q13, q2 - veor q14, q14, q3 - - vrev32.16 q15, q15 - vrev32.16 q12, q12 - vrev32.16 q13, q13 - vrev32.16 q14, q14 - - // x10 += x15, x5 = rotl32(x5 ^ x10, 12) - // x11 += x12, x6 = rotl32(x6 ^ x11, 12) - // x8 += x13, x7 = rotl32(x7 ^ x8, 12) - // x9 += x14, x4 = rotl32(x4 ^ x9, 12) - vadd.i32 q10, q10, q15 - vadd.i32 q11, q11, q12 - vadd.i32 q8, q8, q13 - vadd.i32 q9, q9, q14 - - vst1.32 {q8-q9}, [sp, :256] - - veor q8, q7, q8 - veor q9, q4, q9 - vshl.u32 q7, q8, #12 - vshl.u32 q4, q9, #12 - vsri.u32 q7, q8, #20 - vsri.u32 q4, q9, #20 - - veor 
q8, q5, q10 - veor q9, q6, q11 - vshl.u32 q5, q8, #12 - vshl.u32 q6, q9, #12 - vsri.u32 q5, q8, #20 - vsri.u32 q6, q9, #20 - - // x0 += x5, x15 = rotl32(x15 ^ x0, 8) - // x1 += x6, x12 = rotl32(x12 ^ x1, 8) - // x2 += x7, x13 = rotl32(x13 ^ x2, 8) - // x3 += x4, x14 = rotl32(x14 ^ x3, 8) - vadd.i32 q0, q0, q5 - vadd.i32 q1, q1, q6 - vadd.i32 q2, q2, q7 - vadd.i32 q3, q3, q4 - - veor q8, q15, q0 - veor q9, q12, q1 - vshl.u32 q15, q8, #8 - vshl.u32 q12, q9, #8 - vsri.u32 q15, q8, #24 - vsri.u32 q12, q9, #24 - - veor q8, q13, q2 - veor q9, q14, q3 - vshl.u32 q13, q8, #8 - vshl.u32 q14, q9, #8 - vsri.u32 q13, q8, #24 - vsri.u32 q14, q9, #24 - - vld1.32 {q8-q9}, [sp, :256] - - // x10 += x15, x5 = rotl32(x5 ^ x10, 7) - // x11 += x12, x6 = rotl32(x6 ^ x11, 7) - // x8 += x13, x7 = rotl32(x7 ^ x8, 7) - // x9 += x14, x4 = rotl32(x4 ^ x9, 7) - vadd.i32 q10, q10, q15 - vadd.i32 q11, q11, q12 - vadd.i32 q8, q8, q13 - vadd.i32 q9, q9, q14 - - vst1.32 {q8-q9}, [sp, :256] - - veor q8, q7, q8 - veor q9, q4, q9 - vshl.u32 q7, q8, #7 - vshl.u32 q4, q9, #7 - vsri.u32 q7, q8, #25 - vsri.u32 q4, q9, #25 - - veor q8, q5, q10 - veor q9, q6, q11 - vshl.u32 q5, q8, #7 - vshl.u32 q6, q9, #7 - vsri.u32 q5, q8, #25 - vsri.u32 q6, q9, #25 - - subs r3, r3, #1 - beq 0f - - vld1.32 {q8-q9}, [sp, :256] - b .Ldoubleround4 - - // x0[0-3] += s0[0] - // x1[0-3] += s0[1] - // x2[0-3] += s0[2] - // x3[0-3] += s0[3] -0: ldmia r0!, {r3-r6} - vdup.32 q8, r3 - vdup.32 q9, r4 - vadd.i32 q0, q0, q8 - vadd.i32 q1, q1, q9 - vdup.32 q8, r5 - vdup.32 q9, r6 - vadd.i32 q2, q2, q8 - vadd.i32 q3, q3, q9 - - // x4[0-3] += s1[0] - // x5[0-3] += s1[1] - // x6[0-3] += s1[2] - // x7[0-3] += s1[3] - ldmia r0!, {r3-r6} - vdup.32 q8, r3 - vdup.32 q9, r4 - vadd.i32 q4, q4, q8 - vadd.i32 q5, q5, q9 - vdup.32 q8, r5 - vdup.32 q9, r6 - vadd.i32 q6, q6, q8 - vadd.i32 q7, q7, q9 - - // interleave 32-bit words in state n, n+1 - vzip.32 q0, q1 - vzip.32 q2, q3 - vzip.32 q4, q5 - vzip.32 q6, q7 - - // interleave 64-bit words in state n, n+2 - vswp d1, d4 - vswp d3, d6 - vswp d9, d12 - vswp d11, d14 - - // xor with corresponding input, write to output - vld1.8 {q8-q9}, [r2]! - veor q8, q8, q0 - veor q9, q9, q4 - vst1.8 {q8-q9}, [r1]! - - vld1.32 {q8-q9}, [sp, :256] - - // x8[0-3] += s2[0] - // x9[0-3] += s2[1] - // x10[0-3] += s2[2] - // x11[0-3] += s2[3] - ldmia r0!, {r3-r6} - vdup.32 q0, r3 - vdup.32 q4, r4 - vadd.i32 q8, q8, q0 - vadd.i32 q9, q9, q4 - vdup.32 q0, r5 - vdup.32 q4, r6 - vadd.i32 q10, q10, q0 - vadd.i32 q11, q11, q4 - - // x12[0-3] += s3[0] - // x13[0-3] += s3[1] - // x14[0-3] += s3[2] - // x15[0-3] += s3[3] - ldmia r0!, {r3-r6} - vdup.32 q0, r3 - vdup.32 q4, r4 - adr r3, CTRINC - vadd.i32 q12, q12, q0 - vld1.32 {q0}, [r3, :128] - vadd.i32 q13, q13, q4 - vadd.i32 q12, q12, q0 // x12 += counter values 0-3 - - vdup.32 q0, r5 - vdup.32 q4, r6 - vadd.i32 q14, q14, q0 - vadd.i32 q15, q15, q4 - - // interleave 32-bit words in state n, n+1 - vzip.32 q8, q9 - vzip.32 q10, q11 - vzip.32 q12, q13 - vzip.32 q14, q15 - - // interleave 64-bit words in state n, n+2 - vswp d17, d20 - vswp d19, d22 - vswp d25, d28 - vswp d27, d30 - - vmov q4, q1 - - vld1.8 {q0-q1}, [r2]! - veor q0, q0, q8 - veor q1, q1, q12 - vst1.8 {q0-q1}, [r1]! - - vld1.8 {q0-q1}, [r2]! - veor q0, q0, q2 - veor q1, q1, q6 - vst1.8 {q0-q1}, [r1]! - - vld1.8 {q0-q1}, [r2]! - veor q0, q0, q10 - veor q1, q1, q14 - vst1.8 {q0-q1}, [r1]! - - vld1.8 {q0-q1}, [r2]! - veor q0, q0, q4 - veor q1, q1, q5 - vst1.8 {q0-q1}, [r1]! - - vld1.8 {q0-q1}, [r2]! 
- veor q0, q0, q9 - veor q1, q1, q13 - vst1.8 {q0-q1}, [r1]! - - vld1.8 {q0-q1}, [r2]! - veor q0, q0, q3 - veor q1, q1, q7 - vst1.8 {q0-q1}, [r1]! - - vld1.8 {q0-q1}, [r2] - veor q0, q0, q11 - veor q1, q1, q15 - vst1.8 {q0-q1}, [r1] - - mov sp, ip - pop {r4-r6, pc} -ENDPROC(chacha20_4block_xor_neon) - - .align 4 -CTRINC: .word 0, 1, 2, 3 diff --git a/arch/arm/crypto/chacha20-neon-glue.c b/arch/arm/crypto/chacha20-neon-glue.c deleted file mode 100644 index 59a7be08e80c..000000000000 --- a/arch/arm/crypto/chacha20-neon-glue.c +++ /dev/null @@ -1,127 +0,0 @@ -/* - * ChaCha20 256-bit cipher algorithm, RFC7539, ARM NEON functions - * - * Copyright (C) 2016 Linaro, Ltd. - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License version 2 as - * published by the Free Software Foundation. - * - * Based on: - * ChaCha20 256-bit cipher algorithm, RFC7539, SIMD glue code - * - * Copyright (C) 2015 Martin Willi - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - */ - -#include -#include -#include -#include -#include - -#include -#include -#include - -asmlinkage void chacha20_block_xor_neon(u32 *state, u8 *dst, const u8 *src); -asmlinkage void chacha20_4block_xor_neon(u32 *state, u8 *dst, const u8 *src); - -static void chacha20_doneon(u32 *state, u8 *dst, const u8 *src, - unsigned int bytes) -{ - u8 buf[CHACHA20_BLOCK_SIZE]; - - while (bytes >= CHACHA20_BLOCK_SIZE * 4) { - chacha20_4block_xor_neon(state, dst, src); - bytes -= CHACHA20_BLOCK_SIZE * 4; - src += CHACHA20_BLOCK_SIZE * 4; - dst += CHACHA20_BLOCK_SIZE * 4; - state[12] += 4; - } - while (bytes >= CHACHA20_BLOCK_SIZE) { - chacha20_block_xor_neon(state, dst, src); - bytes -= CHACHA20_BLOCK_SIZE; - src += CHACHA20_BLOCK_SIZE; - dst += CHACHA20_BLOCK_SIZE; - state[12]++; - } - if (bytes) { - memcpy(buf, src, bytes); - chacha20_block_xor_neon(state, buf, buf); - memcpy(dst, buf, bytes); - } -} - -static int chacha20_neon(struct skcipher_request *req) -{ - struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req); - struct chacha20_ctx *ctx = crypto_skcipher_ctx(tfm); - struct skcipher_walk walk; - u32 state[16]; - int err; - - if (req->cryptlen <= CHACHA20_BLOCK_SIZE || !may_use_simd()) - return crypto_chacha20_crypt(req); - - err = skcipher_walk_virt(&walk, req, true); - - crypto_chacha20_init(state, ctx, walk.iv); - - kernel_neon_begin(); - while (walk.nbytes > 0) { - unsigned int nbytes = walk.nbytes; - - if (nbytes < walk.total) - nbytes = round_down(nbytes, walk.stride); - - chacha20_doneon(state, walk.dst.virt.addr, walk.src.virt.addr, - nbytes); - err = skcipher_walk_done(&walk, walk.nbytes - nbytes); - } - kernel_neon_end(); - - return err; -} - -static struct skcipher_alg alg = { - .base.cra_name = "chacha20", - .base.cra_driver_name = "chacha20-neon", - .base.cra_priority = 300, - .base.cra_blocksize = 1, - .base.cra_ctxsize = sizeof(struct chacha20_ctx), - .base.cra_module = THIS_MODULE, - - .min_keysize = CHACHA20_KEY_SIZE, - .max_keysize = CHACHA20_KEY_SIZE, - .ivsize = CHACHA20_IV_SIZE, - .chunksize = CHACHA20_BLOCK_SIZE, - .walksize = 4 * CHACHA20_BLOCK_SIZE, - .setkey = crypto_chacha20_setkey, - .encrypt = chacha20_neon, - .decrypt = chacha20_neon, -}; - -static int __init chacha20_simd_mod_init(void) -{ - if (!(elf_hwcap & HWCAP_NEON)) - return 
-ENODEV; - - return crypto_register_skcipher(&alg); -} - -static void __exit chacha20_simd_mod_fini(void) -{ - crypto_unregister_skcipher(&alg); -} - -module_init(chacha20_simd_mod_init); -module_exit(chacha20_simd_mod_fini); - -MODULE_AUTHOR("Ard Biesheuvel "); -MODULE_LICENSE("GPL v2"); -MODULE_ALIAS_CRYPTO("chacha20"); diff --git a/arch/arm64/configs/defconfig b/arch/arm64/configs/defconfig index db8d364f8476..6cc3c8a0ad88 100644 --- a/arch/arm64/configs/defconfig +++ b/arch/arm64/configs/defconfig @@ -709,5 +709,4 @@ CONFIG_CRYPTO_CRCT10DIF_ARM64_CE=m CONFIG_CRYPTO_CRC32_ARM64_CE=m CONFIG_CRYPTO_AES_ARM64_CE_CCM=y CONFIG_CRYPTO_AES_ARM64_CE_BLK=y -CONFIG_CRYPTO_CHACHA20_NEON=m CONFIG_CRYPTO_AES_ARM64_BS=m diff --git a/arch/arm64/crypto/Kconfig b/arch/arm64/crypto/Kconfig index e3fdb0fd6f70..9db6d775a880 100644 --- a/arch/arm64/crypto/Kconfig +++ b/arch/arm64/crypto/Kconfig @@ -105,12 +105,6 @@ config CRYPTO_AES_ARM64_NEON_BLK select CRYPTO_AES select CRYPTO_SIMD -config CRYPTO_CHACHA20_NEON - tristate "NEON accelerated ChaCha20 symmetric cipher" - depends on KERNEL_MODE_NEON - select CRYPTO_BLKCIPHER - select CRYPTO_CHACHA20 - config CRYPTO_AES_ARM64_BS tristate "AES in ECB/CBC/CTR/XTS modes using bit-sliced NEON algorithm" depends on KERNEL_MODE_NEON diff --git a/arch/arm64/crypto/Makefile b/arch/arm64/crypto/Makefile index bcafd016618e..507c4bfb86e3 100644 --- a/arch/arm64/crypto/Makefile +++ b/arch/arm64/crypto/Makefile @@ -53,9 +53,6 @@ sha256-arm64-y := sha256-glue.o sha256-core.o obj-$(CONFIG_CRYPTO_SHA512_ARM64) += sha512-arm64.o sha512-arm64-y := sha512-glue.o sha512-core.o -obj-$(CONFIG_CRYPTO_CHACHA20_NEON) += chacha20-neon.o -chacha20-neon-y := chacha20-neon-core.o chacha20-neon-glue.o - obj-$(CONFIG_CRYPTO_SPECK_NEON) += speck-neon.o speck-neon-y := speck-neon-core.o speck-neon-glue.o diff --git a/arch/arm64/crypto/chacha20-neon-core.S b/arch/arm64/crypto/chacha20-neon-core.S deleted file mode 100644 index 13c85e272c2a..000000000000 --- a/arch/arm64/crypto/chacha20-neon-core.S +++ /dev/null @@ -1,450 +0,0 @@ -/* - * ChaCha20 256-bit cipher algorithm, RFC7539, arm64 NEON functions - * - * Copyright (C) 2016 Linaro, Ltd. - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License version 2 as - * published by the Free Software Foundation. - * - * Based on: - * ChaCha20 256-bit cipher algorithm, RFC7539, x64 SSSE3 functions - * - * Copyright (C) 2015 Martin Willi - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - */ - -#include - - .text - .align 6 - -ENTRY(chacha20_block_xor_neon) - // x0: Input state matrix, s - // x1: 1 data block output, o - // x2: 1 data block input, i - - // - // This function encrypts one ChaCha20 block by loading the state matrix - // in four NEON registers. It performs matrix operation on four words in - // parallel, but requires shuffling to rearrange the words after each - // round. 
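One arm64-specific trick worth noting before reading on: a rotate left by 8 moves whole bytes, so instead of the shl/sri pair used for the other rotations it is done with a single tbl byte permutation through the ROT8 index vector. A hedged scalar equivalent (plain C, little-endian assumed, as in the NEON data path):

#include <stdint.h>

/* rotl32(v, 8) as the byte permutation {3, 0, 1, 2} within each 32-bit
 * word -- this is what tbl computes with the ROT8 index vector. */
static uint32_t rotl32_by8(uint32_t v)
{
        union { uint32_t w; uint8_t b[4]; } in, out;

        in.w = v;
        out.b[0] = in.b[3];
        out.b[1] = in.b[0];
        out.b[2] = in.b[1];
        out.b[3] = in.b[2];
        return out.w;
}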
- // - - // x0..3 = s0..3 - adr x3, ROT8 - ld1 {v0.4s-v3.4s}, [x0] - ld1 {v8.4s-v11.4s}, [x0] - ld1 {v12.4s}, [x3] - - mov x3, #10 - -.Ldoubleround: - // x0 += x1, x3 = rotl32(x3 ^ x0, 16) - add v0.4s, v0.4s, v1.4s - eor v3.16b, v3.16b, v0.16b - rev32 v3.8h, v3.8h - - // x2 += x3, x1 = rotl32(x1 ^ x2, 12) - add v2.4s, v2.4s, v3.4s - eor v4.16b, v1.16b, v2.16b - shl v1.4s, v4.4s, #12 - sri v1.4s, v4.4s, #20 - - // x0 += x1, x3 = rotl32(x3 ^ x0, 8) - add v0.4s, v0.4s, v1.4s - eor v3.16b, v3.16b, v0.16b - tbl v3.16b, {v3.16b}, v12.16b - - // x2 += x3, x1 = rotl32(x1 ^ x2, 7) - add v2.4s, v2.4s, v3.4s - eor v4.16b, v1.16b, v2.16b - shl v1.4s, v4.4s, #7 - sri v1.4s, v4.4s, #25 - - // x1 = shuffle32(x1, MASK(0, 3, 2, 1)) - ext v1.16b, v1.16b, v1.16b, #4 - // x2 = shuffle32(x2, MASK(1, 0, 3, 2)) - ext v2.16b, v2.16b, v2.16b, #8 - // x3 = shuffle32(x3, MASK(2, 1, 0, 3)) - ext v3.16b, v3.16b, v3.16b, #12 - - // x0 += x1, x3 = rotl32(x3 ^ x0, 16) - add v0.4s, v0.4s, v1.4s - eor v3.16b, v3.16b, v0.16b - rev32 v3.8h, v3.8h - - // x2 += x3, x1 = rotl32(x1 ^ x2, 12) - add v2.4s, v2.4s, v3.4s - eor v4.16b, v1.16b, v2.16b - shl v1.4s, v4.4s, #12 - sri v1.4s, v4.4s, #20 - - // x0 += x1, x3 = rotl32(x3 ^ x0, 8) - add v0.4s, v0.4s, v1.4s - eor v3.16b, v3.16b, v0.16b - tbl v3.16b, {v3.16b}, v12.16b - - // x2 += x3, x1 = rotl32(x1 ^ x2, 7) - add v2.4s, v2.4s, v3.4s - eor v4.16b, v1.16b, v2.16b - shl v1.4s, v4.4s, #7 - sri v1.4s, v4.4s, #25 - - // x1 = shuffle32(x1, MASK(2, 1, 0, 3)) - ext v1.16b, v1.16b, v1.16b, #12 - // x2 = shuffle32(x2, MASK(1, 0, 3, 2)) - ext v2.16b, v2.16b, v2.16b, #8 - // x3 = shuffle32(x3, MASK(0, 3, 2, 1)) - ext v3.16b, v3.16b, v3.16b, #4 - - subs x3, x3, #1 - b.ne .Ldoubleround - - ld1 {v4.16b-v7.16b}, [x2] - - // o0 = i0 ^ (x0 + s0) - add v0.4s, v0.4s, v8.4s - eor v0.16b, v0.16b, v4.16b - - // o1 = i1 ^ (x1 + s1) - add v1.4s, v1.4s, v9.4s - eor v1.16b, v1.16b, v5.16b - - // o2 = i2 ^ (x2 + s2) - add v2.4s, v2.4s, v10.4s - eor v2.16b, v2.16b, v6.16b - - // o3 = i3 ^ (x3 + s3) - add v3.4s, v3.4s, v11.4s - eor v3.16b, v3.16b, v7.16b - - st1 {v0.16b-v3.16b}, [x1] - - ret -ENDPROC(chacha20_block_xor_neon) - - .align 6 -ENTRY(chacha20_4block_xor_neon) - // x0: Input state matrix, s - // x1: 4 data blocks output, o - // x2: 4 data blocks input, i - - // - // This function encrypts four consecutive ChaCha20 blocks by loading - // the state matrix in NEON registers four times. The algorithm performs - // each operation on the corresponding word of each state matrix, hence - // requires no word shuffling. For final XORing step we transpose the - // matrix by interleaving 32- and then 64-bit words, which allows us to - // do XOR in NEON registers. - // - adr x3, CTRINC // ... 
and ROT8 - ld1 {v30.4s-v31.4s}, [x3] - - // x0..15[0-3] = s0..3[0..3] - mov x4, x0 - ld4r { v0.4s- v3.4s}, [x4], #16 - ld4r { v4.4s- v7.4s}, [x4], #16 - ld4r { v8.4s-v11.4s}, [x4], #16 - ld4r {v12.4s-v15.4s}, [x4] - - // x12 += counter values 0-3 - add v12.4s, v12.4s, v30.4s - - mov x3, #10 - -.Ldoubleround4: - // x0 += x4, x12 = rotl32(x12 ^ x0, 16) - // x1 += x5, x13 = rotl32(x13 ^ x1, 16) - // x2 += x6, x14 = rotl32(x14 ^ x2, 16) - // x3 += x7, x15 = rotl32(x15 ^ x3, 16) - add v0.4s, v0.4s, v4.4s - add v1.4s, v1.4s, v5.4s - add v2.4s, v2.4s, v6.4s - add v3.4s, v3.4s, v7.4s - - eor v12.16b, v12.16b, v0.16b - eor v13.16b, v13.16b, v1.16b - eor v14.16b, v14.16b, v2.16b - eor v15.16b, v15.16b, v3.16b - - rev32 v12.8h, v12.8h - rev32 v13.8h, v13.8h - rev32 v14.8h, v14.8h - rev32 v15.8h, v15.8h - - // x8 += x12, x4 = rotl32(x4 ^ x8, 12) - // x9 += x13, x5 = rotl32(x5 ^ x9, 12) - // x10 += x14, x6 = rotl32(x6 ^ x10, 12) - // x11 += x15, x7 = rotl32(x7 ^ x11, 12) - add v8.4s, v8.4s, v12.4s - add v9.4s, v9.4s, v13.4s - add v10.4s, v10.4s, v14.4s - add v11.4s, v11.4s, v15.4s - - eor v16.16b, v4.16b, v8.16b - eor v17.16b, v5.16b, v9.16b - eor v18.16b, v6.16b, v10.16b - eor v19.16b, v7.16b, v11.16b - - shl v4.4s, v16.4s, #12 - shl v5.4s, v17.4s, #12 - shl v6.4s, v18.4s, #12 - shl v7.4s, v19.4s, #12 - - sri v4.4s, v16.4s, #20 - sri v5.4s, v17.4s, #20 - sri v6.4s, v18.4s, #20 - sri v7.4s, v19.4s, #20 - - // x0 += x4, x12 = rotl32(x12 ^ x0, 8) - // x1 += x5, x13 = rotl32(x13 ^ x1, 8) - // x2 += x6, x14 = rotl32(x14 ^ x2, 8) - // x3 += x7, x15 = rotl32(x15 ^ x3, 8) - add v0.4s, v0.4s, v4.4s - add v1.4s, v1.4s, v5.4s - add v2.4s, v2.4s, v6.4s - add v3.4s, v3.4s, v7.4s - - eor v12.16b, v12.16b, v0.16b - eor v13.16b, v13.16b, v1.16b - eor v14.16b, v14.16b, v2.16b - eor v15.16b, v15.16b, v3.16b - - tbl v12.16b, {v12.16b}, v31.16b - tbl v13.16b, {v13.16b}, v31.16b - tbl v14.16b, {v14.16b}, v31.16b - tbl v15.16b, {v15.16b}, v31.16b - - // x8 += x12, x4 = rotl32(x4 ^ x8, 7) - // x9 += x13, x5 = rotl32(x5 ^ x9, 7) - // x10 += x14, x6 = rotl32(x6 ^ x10, 7) - // x11 += x15, x7 = rotl32(x7 ^ x11, 7) - add v8.4s, v8.4s, v12.4s - add v9.4s, v9.4s, v13.4s - add v10.4s, v10.4s, v14.4s - add v11.4s, v11.4s, v15.4s - - eor v16.16b, v4.16b, v8.16b - eor v17.16b, v5.16b, v9.16b - eor v18.16b, v6.16b, v10.16b - eor v19.16b, v7.16b, v11.16b - - shl v4.4s, v16.4s, #7 - shl v5.4s, v17.4s, #7 - shl v6.4s, v18.4s, #7 - shl v7.4s, v19.4s, #7 - - sri v4.4s, v16.4s, #25 - sri v5.4s, v17.4s, #25 - sri v6.4s, v18.4s, #25 - sri v7.4s, v19.4s, #25 - - // x0 += x5, x15 = rotl32(x15 ^ x0, 16) - // x1 += x6, x12 = rotl32(x12 ^ x1, 16) - // x2 += x7, x13 = rotl32(x13 ^ x2, 16) - // x3 += x4, x14 = rotl32(x14 ^ x3, 16) - add v0.4s, v0.4s, v5.4s - add v1.4s, v1.4s, v6.4s - add v2.4s, v2.4s, v7.4s - add v3.4s, v3.4s, v4.4s - - eor v15.16b, v15.16b, v0.16b - eor v12.16b, v12.16b, v1.16b - eor v13.16b, v13.16b, v2.16b - eor v14.16b, v14.16b, v3.16b - - rev32 v15.8h, v15.8h - rev32 v12.8h, v12.8h - rev32 v13.8h, v13.8h - rev32 v14.8h, v14.8h - - // x10 += x15, x5 = rotl32(x5 ^ x10, 12) - // x11 += x12, x6 = rotl32(x6 ^ x11, 12) - // x8 += x13, x7 = rotl32(x7 ^ x8, 12) - // x9 += x14, x4 = rotl32(x4 ^ x9, 12) - add v10.4s, v10.4s, v15.4s - add v11.4s, v11.4s, v12.4s - add v8.4s, v8.4s, v13.4s - add v9.4s, v9.4s, v14.4s - - eor v16.16b, v5.16b, v10.16b - eor v17.16b, v6.16b, v11.16b - eor v18.16b, v7.16b, v8.16b - eor v19.16b, v4.16b, v9.16b - - shl v5.4s, v16.4s, #12 - shl v6.4s, v17.4s, #12 - shl v7.4s, v18.4s, #12 - shl v4.4s, v19.4s, #12 - - 
sri v5.4s, v16.4s, #20 - sri v6.4s, v17.4s, #20 - sri v7.4s, v18.4s, #20 - sri v4.4s, v19.4s, #20 - - // x0 += x5, x15 = rotl32(x15 ^ x0, 8) - // x1 += x6, x12 = rotl32(x12 ^ x1, 8) - // x2 += x7, x13 = rotl32(x13 ^ x2, 8) - // x3 += x4, x14 = rotl32(x14 ^ x3, 8) - add v0.4s, v0.4s, v5.4s - add v1.4s, v1.4s, v6.4s - add v2.4s, v2.4s, v7.4s - add v3.4s, v3.4s, v4.4s - - eor v15.16b, v15.16b, v0.16b - eor v12.16b, v12.16b, v1.16b - eor v13.16b, v13.16b, v2.16b - eor v14.16b, v14.16b, v3.16b - - tbl v15.16b, {v15.16b}, v31.16b - tbl v12.16b, {v12.16b}, v31.16b - tbl v13.16b, {v13.16b}, v31.16b - tbl v14.16b, {v14.16b}, v31.16b - - // x10 += x15, x5 = rotl32(x5 ^ x10, 7) - // x11 += x12, x6 = rotl32(x6 ^ x11, 7) - // x8 += x13, x7 = rotl32(x7 ^ x8, 7) - // x9 += x14, x4 = rotl32(x4 ^ x9, 7) - add v10.4s, v10.4s, v15.4s - add v11.4s, v11.4s, v12.4s - add v8.4s, v8.4s, v13.4s - add v9.4s, v9.4s, v14.4s - - eor v16.16b, v5.16b, v10.16b - eor v17.16b, v6.16b, v11.16b - eor v18.16b, v7.16b, v8.16b - eor v19.16b, v4.16b, v9.16b - - shl v5.4s, v16.4s, #7 - shl v6.4s, v17.4s, #7 - shl v7.4s, v18.4s, #7 - shl v4.4s, v19.4s, #7 - - sri v5.4s, v16.4s, #25 - sri v6.4s, v17.4s, #25 - sri v7.4s, v18.4s, #25 - sri v4.4s, v19.4s, #25 - - subs x3, x3, #1 - b.ne .Ldoubleround4 - - ld4r {v16.4s-v19.4s}, [x0], #16 - ld4r {v20.4s-v23.4s}, [x0], #16 - - // x12 += counter values 0-3 - add v12.4s, v12.4s, v30.4s - - // x0[0-3] += s0[0] - // x1[0-3] += s0[1] - // x2[0-3] += s0[2] - // x3[0-3] += s0[3] - add v0.4s, v0.4s, v16.4s - add v1.4s, v1.4s, v17.4s - add v2.4s, v2.4s, v18.4s - add v3.4s, v3.4s, v19.4s - - ld4r {v24.4s-v27.4s}, [x0], #16 - ld4r {v28.4s-v31.4s}, [x0] - - // x4[0-3] += s1[0] - // x5[0-3] += s1[1] - // x6[0-3] += s1[2] - // x7[0-3] += s1[3] - add v4.4s, v4.4s, v20.4s - add v5.4s, v5.4s, v21.4s - add v6.4s, v6.4s, v22.4s - add v7.4s, v7.4s, v23.4s - - // x8[0-3] += s2[0] - // x9[0-3] += s2[1] - // x10[0-3] += s2[2] - // x11[0-3] += s2[3] - add v8.4s, v8.4s, v24.4s - add v9.4s, v9.4s, v25.4s - add v10.4s, v10.4s, v26.4s - add v11.4s, v11.4s, v27.4s - - // x12[0-3] += s3[0] - // x13[0-3] += s3[1] - // x14[0-3] += s3[2] - // x15[0-3] += s3[3] - add v12.4s, v12.4s, v28.4s - add v13.4s, v13.4s, v29.4s - add v14.4s, v14.4s, v30.4s - add v15.4s, v15.4s, v31.4s - - // interleave 32-bit words in state n, n+1 - zip1 v16.4s, v0.4s, v1.4s - zip2 v17.4s, v0.4s, v1.4s - zip1 v18.4s, v2.4s, v3.4s - zip2 v19.4s, v2.4s, v3.4s - zip1 v20.4s, v4.4s, v5.4s - zip2 v21.4s, v4.4s, v5.4s - zip1 v22.4s, v6.4s, v7.4s - zip2 v23.4s, v6.4s, v7.4s - zip1 v24.4s, v8.4s, v9.4s - zip2 v25.4s, v8.4s, v9.4s - zip1 v26.4s, v10.4s, v11.4s - zip2 v27.4s, v10.4s, v11.4s - zip1 v28.4s, v12.4s, v13.4s - zip2 v29.4s, v12.4s, v13.4s - zip1 v30.4s, v14.4s, v15.4s - zip2 v31.4s, v14.4s, v15.4s - - // interleave 64-bit words in state n, n+2 - zip1 v0.2d, v16.2d, v18.2d - zip2 v4.2d, v16.2d, v18.2d - zip1 v8.2d, v17.2d, v19.2d - zip2 v12.2d, v17.2d, v19.2d - ld1 {v16.16b-v19.16b}, [x2], #64 - - zip1 v1.2d, v20.2d, v22.2d - zip2 v5.2d, v20.2d, v22.2d - zip1 v9.2d, v21.2d, v23.2d - zip2 v13.2d, v21.2d, v23.2d - ld1 {v20.16b-v23.16b}, [x2], #64 - - zip1 v2.2d, v24.2d, v26.2d - zip2 v6.2d, v24.2d, v26.2d - zip1 v10.2d, v25.2d, v27.2d - zip2 v14.2d, v25.2d, v27.2d - ld1 {v24.16b-v27.16b}, [x2], #64 - - zip1 v3.2d, v28.2d, v30.2d - zip2 v7.2d, v28.2d, v30.2d - zip1 v11.2d, v29.2d, v31.2d - zip2 v15.2d, v29.2d, v31.2d - ld1 {v28.16b-v31.16b}, [x2] - - // xor with corresponding input, write to output - eor v16.16b, v16.16b, v0.16b - eor v17.16b, 
v17.16b, v1.16b - eor v18.16b, v18.16b, v2.16b - eor v19.16b, v19.16b, v3.16b - eor v20.16b, v20.16b, v4.16b - eor v21.16b, v21.16b, v5.16b - st1 {v16.16b-v19.16b}, [x1], #64 - eor v22.16b, v22.16b, v6.16b - eor v23.16b, v23.16b, v7.16b - eor v24.16b, v24.16b, v8.16b - eor v25.16b, v25.16b, v9.16b - st1 {v20.16b-v23.16b}, [x1], #64 - eor v26.16b, v26.16b, v10.16b - eor v27.16b, v27.16b, v11.16b - eor v28.16b, v28.16b, v12.16b - st1 {v24.16b-v27.16b}, [x1], #64 - eor v29.16b, v29.16b, v13.16b - eor v30.16b, v30.16b, v14.16b - eor v31.16b, v31.16b, v15.16b - st1 {v28.16b-v31.16b}, [x1] - - ret -ENDPROC(chacha20_4block_xor_neon) - -CTRINC: .word 0, 1, 2, 3 -ROT8: .word 0x02010003, 0x06050407, 0x0a09080b, 0x0e0d0c0f diff --git a/arch/arm64/crypto/chacha20-neon-glue.c b/arch/arm64/crypto/chacha20-neon-glue.c deleted file mode 100644 index 727579c93ded..000000000000 --- a/arch/arm64/crypto/chacha20-neon-glue.c +++ /dev/null @@ -1,133 +0,0 @@ -/* - * ChaCha20 256-bit cipher algorithm, RFC7539, arm64 NEON functions - * - * Copyright (C) 2016 - 2017 Linaro, Ltd. - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License version 2 as - * published by the Free Software Foundation. - * - * Based on: - * ChaCha20 256-bit cipher algorithm, RFC7539, SIMD glue code - * - * Copyright (C) 2015 Martin Willi - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - */ - -#include -#include -#include -#include -#include - -#include -#include -#include - -asmlinkage void chacha20_block_xor_neon(u32 *state, u8 *dst, const u8 *src); -asmlinkage void chacha20_4block_xor_neon(u32 *state, u8 *dst, const u8 *src); - -static void chacha20_doneon(u32 *state, u8 *dst, const u8 *src, - unsigned int bytes) -{ - u8 buf[CHACHA20_BLOCK_SIZE]; - - while (bytes >= CHACHA20_BLOCK_SIZE * 4) { - kernel_neon_begin(); - chacha20_4block_xor_neon(state, dst, src); - kernel_neon_end(); - bytes -= CHACHA20_BLOCK_SIZE * 4; - src += CHACHA20_BLOCK_SIZE * 4; - dst += CHACHA20_BLOCK_SIZE * 4; - state[12] += 4; - } - - if (!bytes) - return; - - kernel_neon_begin(); - while (bytes >= CHACHA20_BLOCK_SIZE) { - chacha20_block_xor_neon(state, dst, src); - bytes -= CHACHA20_BLOCK_SIZE; - src += CHACHA20_BLOCK_SIZE; - dst += CHACHA20_BLOCK_SIZE; - state[12]++; - } - if (bytes) { - memcpy(buf, src, bytes); - chacha20_block_xor_neon(state, buf, buf); - memcpy(dst, buf, bytes); - } - kernel_neon_end(); -} - -static int chacha20_neon(struct skcipher_request *req) -{ - struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req); - struct chacha20_ctx *ctx = crypto_skcipher_ctx(tfm); - struct skcipher_walk walk; - u32 state[16]; - int err; - - if (!may_use_simd() || req->cryptlen <= CHACHA20_BLOCK_SIZE) - return crypto_chacha20_crypt(req); - - err = skcipher_walk_virt(&walk, req, false); - - crypto_chacha20_init(state, ctx, walk.iv); - - while (walk.nbytes > 0) { - unsigned int nbytes = walk.nbytes; - - if (nbytes < walk.total) - nbytes = round_down(nbytes, walk.stride); - - chacha20_doneon(state, walk.dst.virt.addr, walk.src.virt.addr, - nbytes); - err = skcipher_walk_done(&walk, walk.nbytes - nbytes); - } - - return err; -} - -static struct skcipher_alg alg = { - .base.cra_name = "chacha20", - .base.cra_driver_name = "chacha20-neon", - .base.cra_priority = 300, - 
.base.cra_blocksize = 1, - .base.cra_ctxsize = sizeof(struct chacha20_ctx), - .base.cra_module = THIS_MODULE, - - .min_keysize = CHACHA20_KEY_SIZE, - .max_keysize = CHACHA20_KEY_SIZE, - .ivsize = CHACHA20_IV_SIZE, - .chunksize = CHACHA20_BLOCK_SIZE, - .walksize = 4 * CHACHA20_BLOCK_SIZE, - .setkey = crypto_chacha20_setkey, - .encrypt = chacha20_neon, - .decrypt = chacha20_neon, -}; - -static int __init chacha20_simd_mod_init(void) -{ - if (!(elf_hwcap & HWCAP_ASIMD)) - return -ENODEV; - - return crypto_register_skcipher(&alg); -} - -static void __exit chacha20_simd_mod_fini(void) -{ - crypto_unregister_skcipher(&alg); -} - -module_init(chacha20_simd_mod_init); -module_exit(chacha20_simd_mod_fini); - -MODULE_AUTHOR("Ard Biesheuvel "); -MODULE_LICENSE("GPL v2"); -MODULE_ALIAS_CRYPTO("chacha20"); diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile index cf830219846b..419212c31246 100644 --- a/arch/x86/crypto/Makefile +++ b/arch/x86/crypto/Makefile @@ -23,7 +23,6 @@ obj-$(CONFIG_CRYPTO_CAMELLIA_X86_64) += camellia-x86_64.o obj-$(CONFIG_CRYPTO_BLOWFISH_X86_64) += blowfish-x86_64.o obj-$(CONFIG_CRYPTO_TWOFISH_X86_64) += twofish-x86_64.o obj-$(CONFIG_CRYPTO_TWOFISH_X86_64_3WAY) += twofish-x86_64-3way.o -obj-$(CONFIG_CRYPTO_CHACHA20_X86_64) += chacha20-x86_64.o obj-$(CONFIG_CRYPTO_SERPENT_SSE2_X86_64) += serpent-sse2-x86_64.o obj-$(CONFIG_CRYPTO_AES_NI_INTEL) += aesni-intel.o obj-$(CONFIG_CRYPTO_GHASH_CLMUL_NI_INTEL) += ghash-clmulni-intel.o @@ -76,7 +75,6 @@ camellia-x86_64-y := camellia-x86_64-asm_64.o camellia_glue.o blowfish-x86_64-y := blowfish-x86_64-asm_64.o blowfish_glue.o twofish-x86_64-y := twofish-x86_64-asm_64.o twofish_glue.o twofish-x86_64-3way-y := twofish-x86_64-asm_64-3way.o twofish_glue_3way.o -chacha20-x86_64-y := chacha20-ssse3-x86_64.o chacha20_glue.o serpent-sse2-x86_64-y := serpent-sse2-x86_64-asm_64.o serpent_sse2_glue.o aegis128-aesni-y := aegis128-aesni-asm.o aegis128-aesni-glue.o @@ -99,7 +97,6 @@ endif ifeq ($(avx2_supported),yes) camellia-aesni-avx2-y := camellia-aesni-avx2-asm_64.o camellia_aesni_avx2_glue.o - chacha20-x86_64-y += chacha20-avx2-x86_64.o serpent-avx2-y := serpent-avx2-asm_64.o serpent_avx2_glue.o morus1280-avx2-y := morus1280-avx2-asm.o morus1280-avx2-glue.o diff --git a/arch/x86/crypto/chacha20-avx2-x86_64.S b/arch/x86/crypto/chacha20-avx2-x86_64.S deleted file mode 100644 index f3cd26f48332..000000000000 --- a/arch/x86/crypto/chacha20-avx2-x86_64.S +++ /dev/null @@ -1,448 +0,0 @@ -/* - * ChaCha20 256-bit cipher algorithm, RFC7539, x64 AVX2 functions - * - * Copyright (C) 2015 Martin Willi - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. 
- */ - -#include - -.section .rodata.cst32.ROT8, "aM", @progbits, 32 -.align 32 -ROT8: .octa 0x0e0d0c0f0a09080b0605040702010003 - .octa 0x0e0d0c0f0a09080b0605040702010003 - -.section .rodata.cst32.ROT16, "aM", @progbits, 32 -.align 32 -ROT16: .octa 0x0d0c0f0e09080b0a0504070601000302 - .octa 0x0d0c0f0e09080b0a0504070601000302 - -.section .rodata.cst32.CTRINC, "aM", @progbits, 32 -.align 32 -CTRINC: .octa 0x00000003000000020000000100000000 - .octa 0x00000007000000060000000500000004 - -.text - -ENTRY(chacha20_8block_xor_avx2) - # %rdi: Input state matrix, s - # %rsi: 8 data blocks output, o - # %rdx: 8 data blocks input, i - - # This function encrypts eight consecutive ChaCha20 blocks by loading - # the state matrix in AVX registers eight times. As we need some - # scratch registers, we save the first four registers on the stack. The - # algorithm performs each operation on the corresponding word of each - # state matrix, hence requires no word shuffling. For final XORing step - # we transpose the matrix by interleaving 32-, 64- and then 128-bit - # words, which allows us to do XOR in AVX registers. 8/16-bit word - # rotation is done with the slightly better performing byte shuffling, - # 7/12-bit word rotation uses traditional shift+OR. - - vzeroupper - # 4 * 32 byte stack, 32-byte aligned - lea 8(%rsp),%r10 - and $~31, %rsp - sub $0x80, %rsp - - # x0..15[0-7] = s[0..15] - vpbroadcastd 0x00(%rdi),%ymm0 - vpbroadcastd 0x04(%rdi),%ymm1 - vpbroadcastd 0x08(%rdi),%ymm2 - vpbroadcastd 0x0c(%rdi),%ymm3 - vpbroadcastd 0x10(%rdi),%ymm4 - vpbroadcastd 0x14(%rdi),%ymm5 - vpbroadcastd 0x18(%rdi),%ymm6 - vpbroadcastd 0x1c(%rdi),%ymm7 - vpbroadcastd 0x20(%rdi),%ymm8 - vpbroadcastd 0x24(%rdi),%ymm9 - vpbroadcastd 0x28(%rdi),%ymm10 - vpbroadcastd 0x2c(%rdi),%ymm11 - vpbroadcastd 0x30(%rdi),%ymm12 - vpbroadcastd 0x34(%rdi),%ymm13 - vpbroadcastd 0x38(%rdi),%ymm14 - vpbroadcastd 0x3c(%rdi),%ymm15 - # x0..3 on stack - vmovdqa %ymm0,0x00(%rsp) - vmovdqa %ymm1,0x20(%rsp) - vmovdqa %ymm2,0x40(%rsp) - vmovdqa %ymm3,0x60(%rsp) - - vmovdqa CTRINC(%rip),%ymm1 - vmovdqa ROT8(%rip),%ymm2 - vmovdqa ROT16(%rip),%ymm3 - - # x12 += counter values 0-3 - vpaddd %ymm1,%ymm12,%ymm12 - - mov $10,%ecx - -.Ldoubleround8: - # x0 += x4, x12 = rotl32(x12 ^ x0, 16) - vpaddd 0x00(%rsp),%ymm4,%ymm0 - vmovdqa %ymm0,0x00(%rsp) - vpxor %ymm0,%ymm12,%ymm12 - vpshufb %ymm3,%ymm12,%ymm12 - # x1 += x5, x13 = rotl32(x13 ^ x1, 16) - vpaddd 0x20(%rsp),%ymm5,%ymm0 - vmovdqa %ymm0,0x20(%rsp) - vpxor %ymm0,%ymm13,%ymm13 - vpshufb %ymm3,%ymm13,%ymm13 - # x2 += x6, x14 = rotl32(x14 ^ x2, 16) - vpaddd 0x40(%rsp),%ymm6,%ymm0 - vmovdqa %ymm0,0x40(%rsp) - vpxor %ymm0,%ymm14,%ymm14 - vpshufb %ymm3,%ymm14,%ymm14 - # x3 += x7, x15 = rotl32(x15 ^ x3, 16) - vpaddd 0x60(%rsp),%ymm7,%ymm0 - vmovdqa %ymm0,0x60(%rsp) - vpxor %ymm0,%ymm15,%ymm15 - vpshufb %ymm3,%ymm15,%ymm15 - - # x8 += x12, x4 = rotl32(x4 ^ x8, 12) - vpaddd %ymm12,%ymm8,%ymm8 - vpxor %ymm8,%ymm4,%ymm4 - vpslld $12,%ymm4,%ymm0 - vpsrld $20,%ymm4,%ymm4 - vpor %ymm0,%ymm4,%ymm4 - # x9 += x13, x5 = rotl32(x5 ^ x9, 12) - vpaddd %ymm13,%ymm9,%ymm9 - vpxor %ymm9,%ymm5,%ymm5 - vpslld $12,%ymm5,%ymm0 - vpsrld $20,%ymm5,%ymm5 - vpor %ymm0,%ymm5,%ymm5 - # x10 += x14, x6 = rotl32(x6 ^ x10, 12) - vpaddd %ymm14,%ymm10,%ymm10 - vpxor %ymm10,%ymm6,%ymm6 - vpslld $12,%ymm6,%ymm0 - vpsrld $20,%ymm6,%ymm6 - vpor %ymm0,%ymm6,%ymm6 - # x11 += x15, x7 = rotl32(x7 ^ x11, 12) - vpaddd %ymm15,%ymm11,%ymm11 - vpxor %ymm11,%ymm7,%ymm7 - vpslld $12,%ymm7,%ymm0 - vpsrld $20,%ymm7,%ymm7 - vpor %ymm0,%ymm7,%ymm7 - - # x0 += 
x4, x12 = rotl32(x12 ^ x0, 8) - vpaddd 0x00(%rsp),%ymm4,%ymm0 - vmovdqa %ymm0,0x00(%rsp) - vpxor %ymm0,%ymm12,%ymm12 - vpshufb %ymm2,%ymm12,%ymm12 - # x1 += x5, x13 = rotl32(x13 ^ x1, 8) - vpaddd 0x20(%rsp),%ymm5,%ymm0 - vmovdqa %ymm0,0x20(%rsp) - vpxor %ymm0,%ymm13,%ymm13 - vpshufb %ymm2,%ymm13,%ymm13 - # x2 += x6, x14 = rotl32(x14 ^ x2, 8) - vpaddd 0x40(%rsp),%ymm6,%ymm0 - vmovdqa %ymm0,0x40(%rsp) - vpxor %ymm0,%ymm14,%ymm14 - vpshufb %ymm2,%ymm14,%ymm14 - # x3 += x7, x15 = rotl32(x15 ^ x3, 8) - vpaddd 0x60(%rsp),%ymm7,%ymm0 - vmovdqa %ymm0,0x60(%rsp) - vpxor %ymm0,%ymm15,%ymm15 - vpshufb %ymm2,%ymm15,%ymm15 - - # x8 += x12, x4 = rotl32(x4 ^ x8, 7) - vpaddd %ymm12,%ymm8,%ymm8 - vpxor %ymm8,%ymm4,%ymm4 - vpslld $7,%ymm4,%ymm0 - vpsrld $25,%ymm4,%ymm4 - vpor %ymm0,%ymm4,%ymm4 - # x9 += x13, x5 = rotl32(x5 ^ x9, 7) - vpaddd %ymm13,%ymm9,%ymm9 - vpxor %ymm9,%ymm5,%ymm5 - vpslld $7,%ymm5,%ymm0 - vpsrld $25,%ymm5,%ymm5 - vpor %ymm0,%ymm5,%ymm5 - # x10 += x14, x6 = rotl32(x6 ^ x10, 7) - vpaddd %ymm14,%ymm10,%ymm10 - vpxor %ymm10,%ymm6,%ymm6 - vpslld $7,%ymm6,%ymm0 - vpsrld $25,%ymm6,%ymm6 - vpor %ymm0,%ymm6,%ymm6 - # x11 += x15, x7 = rotl32(x7 ^ x11, 7) - vpaddd %ymm15,%ymm11,%ymm11 - vpxor %ymm11,%ymm7,%ymm7 - vpslld $7,%ymm7,%ymm0 - vpsrld $25,%ymm7,%ymm7 - vpor %ymm0,%ymm7,%ymm7 - - # x0 += x5, x15 = rotl32(x15 ^ x0, 16) - vpaddd 0x00(%rsp),%ymm5,%ymm0 - vmovdqa %ymm0,0x00(%rsp) - vpxor %ymm0,%ymm15,%ymm15 - vpshufb %ymm3,%ymm15,%ymm15 - # x1 += x6, x12 = rotl32(x12 ^ x1, 16)%ymm0 - vpaddd 0x20(%rsp),%ymm6,%ymm0 - vmovdqa %ymm0,0x20(%rsp) - vpxor %ymm0,%ymm12,%ymm12 - vpshufb %ymm3,%ymm12,%ymm12 - # x2 += x7, x13 = rotl32(x13 ^ x2, 16) - vpaddd 0x40(%rsp),%ymm7,%ymm0 - vmovdqa %ymm0,0x40(%rsp) - vpxor %ymm0,%ymm13,%ymm13 - vpshufb %ymm3,%ymm13,%ymm13 - # x3 += x4, x14 = rotl32(x14 ^ x3, 16) - vpaddd 0x60(%rsp),%ymm4,%ymm0 - vmovdqa %ymm0,0x60(%rsp) - vpxor %ymm0,%ymm14,%ymm14 - vpshufb %ymm3,%ymm14,%ymm14 - - # x10 += x15, x5 = rotl32(x5 ^ x10, 12) - vpaddd %ymm15,%ymm10,%ymm10 - vpxor %ymm10,%ymm5,%ymm5 - vpslld $12,%ymm5,%ymm0 - vpsrld $20,%ymm5,%ymm5 - vpor %ymm0,%ymm5,%ymm5 - # x11 += x12, x6 = rotl32(x6 ^ x11, 12) - vpaddd %ymm12,%ymm11,%ymm11 - vpxor %ymm11,%ymm6,%ymm6 - vpslld $12,%ymm6,%ymm0 - vpsrld $20,%ymm6,%ymm6 - vpor %ymm0,%ymm6,%ymm6 - # x8 += x13, x7 = rotl32(x7 ^ x8, 12) - vpaddd %ymm13,%ymm8,%ymm8 - vpxor %ymm8,%ymm7,%ymm7 - vpslld $12,%ymm7,%ymm0 - vpsrld $20,%ymm7,%ymm7 - vpor %ymm0,%ymm7,%ymm7 - # x9 += x14, x4 = rotl32(x4 ^ x9, 12) - vpaddd %ymm14,%ymm9,%ymm9 - vpxor %ymm9,%ymm4,%ymm4 - vpslld $12,%ymm4,%ymm0 - vpsrld $20,%ymm4,%ymm4 - vpor %ymm0,%ymm4,%ymm4 - - # x0 += x5, x15 = rotl32(x15 ^ x0, 8) - vpaddd 0x00(%rsp),%ymm5,%ymm0 - vmovdqa %ymm0,0x00(%rsp) - vpxor %ymm0,%ymm15,%ymm15 - vpshufb %ymm2,%ymm15,%ymm15 - # x1 += x6, x12 = rotl32(x12 ^ x1, 8) - vpaddd 0x20(%rsp),%ymm6,%ymm0 - vmovdqa %ymm0,0x20(%rsp) - vpxor %ymm0,%ymm12,%ymm12 - vpshufb %ymm2,%ymm12,%ymm12 - # x2 += x7, x13 = rotl32(x13 ^ x2, 8) - vpaddd 0x40(%rsp),%ymm7,%ymm0 - vmovdqa %ymm0,0x40(%rsp) - vpxor %ymm0,%ymm13,%ymm13 - vpshufb %ymm2,%ymm13,%ymm13 - # x3 += x4, x14 = rotl32(x14 ^ x3, 8) - vpaddd 0x60(%rsp),%ymm4,%ymm0 - vmovdqa %ymm0,0x60(%rsp) - vpxor %ymm0,%ymm14,%ymm14 - vpshufb %ymm2,%ymm14,%ymm14 - - # x10 += x15, x5 = rotl32(x5 ^ x10, 7) - vpaddd %ymm15,%ymm10,%ymm10 - vpxor %ymm10,%ymm5,%ymm5 - vpslld $7,%ymm5,%ymm0 - vpsrld $25,%ymm5,%ymm5 - vpor %ymm0,%ymm5,%ymm5 - # x11 += x12, x6 = rotl32(x6 ^ x11, 7) - vpaddd %ymm12,%ymm11,%ymm11 - vpxor %ymm11,%ymm6,%ymm6 - vpslld $7,%ymm6,%ymm0 - 
vpsrld $25,%ymm6,%ymm6 - vpor %ymm0,%ymm6,%ymm6 - # x8 += x13, x7 = rotl32(x7 ^ x8, 7) - vpaddd %ymm13,%ymm8,%ymm8 - vpxor %ymm8,%ymm7,%ymm7 - vpslld $7,%ymm7,%ymm0 - vpsrld $25,%ymm7,%ymm7 - vpor %ymm0,%ymm7,%ymm7 - # x9 += x14, x4 = rotl32(x4 ^ x9, 7) - vpaddd %ymm14,%ymm9,%ymm9 - vpxor %ymm9,%ymm4,%ymm4 - vpslld $7,%ymm4,%ymm0 - vpsrld $25,%ymm4,%ymm4 - vpor %ymm0,%ymm4,%ymm4 - - dec %ecx - jnz .Ldoubleround8 - - # x0..15[0-3] += s[0..15] - vpbroadcastd 0x00(%rdi),%ymm0 - vpaddd 0x00(%rsp),%ymm0,%ymm0 - vmovdqa %ymm0,0x00(%rsp) - vpbroadcastd 0x04(%rdi),%ymm0 - vpaddd 0x20(%rsp),%ymm0,%ymm0 - vmovdqa %ymm0,0x20(%rsp) - vpbroadcastd 0x08(%rdi),%ymm0 - vpaddd 0x40(%rsp),%ymm0,%ymm0 - vmovdqa %ymm0,0x40(%rsp) - vpbroadcastd 0x0c(%rdi),%ymm0 - vpaddd 0x60(%rsp),%ymm0,%ymm0 - vmovdqa %ymm0,0x60(%rsp) - vpbroadcastd 0x10(%rdi),%ymm0 - vpaddd %ymm0,%ymm4,%ymm4 - vpbroadcastd 0x14(%rdi),%ymm0 - vpaddd %ymm0,%ymm5,%ymm5 - vpbroadcastd 0x18(%rdi),%ymm0 - vpaddd %ymm0,%ymm6,%ymm6 - vpbroadcastd 0x1c(%rdi),%ymm0 - vpaddd %ymm0,%ymm7,%ymm7 - vpbroadcastd 0x20(%rdi),%ymm0 - vpaddd %ymm0,%ymm8,%ymm8 - vpbroadcastd 0x24(%rdi),%ymm0 - vpaddd %ymm0,%ymm9,%ymm9 - vpbroadcastd 0x28(%rdi),%ymm0 - vpaddd %ymm0,%ymm10,%ymm10 - vpbroadcastd 0x2c(%rdi),%ymm0 - vpaddd %ymm0,%ymm11,%ymm11 - vpbroadcastd 0x30(%rdi),%ymm0 - vpaddd %ymm0,%ymm12,%ymm12 - vpbroadcastd 0x34(%rdi),%ymm0 - vpaddd %ymm0,%ymm13,%ymm13 - vpbroadcastd 0x38(%rdi),%ymm0 - vpaddd %ymm0,%ymm14,%ymm14 - vpbroadcastd 0x3c(%rdi),%ymm0 - vpaddd %ymm0,%ymm15,%ymm15 - - # x12 += counter values 0-3 - vpaddd %ymm1,%ymm12,%ymm12 - - # interleave 32-bit words in state n, n+1 - vmovdqa 0x00(%rsp),%ymm0 - vmovdqa 0x20(%rsp),%ymm1 - vpunpckldq %ymm1,%ymm0,%ymm2 - vpunpckhdq %ymm1,%ymm0,%ymm1 - vmovdqa %ymm2,0x00(%rsp) - vmovdqa %ymm1,0x20(%rsp) - vmovdqa 0x40(%rsp),%ymm0 - vmovdqa 0x60(%rsp),%ymm1 - vpunpckldq %ymm1,%ymm0,%ymm2 - vpunpckhdq %ymm1,%ymm0,%ymm1 - vmovdqa %ymm2,0x40(%rsp) - vmovdqa %ymm1,0x60(%rsp) - vmovdqa %ymm4,%ymm0 - vpunpckldq %ymm5,%ymm0,%ymm4 - vpunpckhdq %ymm5,%ymm0,%ymm5 - vmovdqa %ymm6,%ymm0 - vpunpckldq %ymm7,%ymm0,%ymm6 - vpunpckhdq %ymm7,%ymm0,%ymm7 - vmovdqa %ymm8,%ymm0 - vpunpckldq %ymm9,%ymm0,%ymm8 - vpunpckhdq %ymm9,%ymm0,%ymm9 - vmovdqa %ymm10,%ymm0 - vpunpckldq %ymm11,%ymm0,%ymm10 - vpunpckhdq %ymm11,%ymm0,%ymm11 - vmovdqa %ymm12,%ymm0 - vpunpckldq %ymm13,%ymm0,%ymm12 - vpunpckhdq %ymm13,%ymm0,%ymm13 - vmovdqa %ymm14,%ymm0 - vpunpckldq %ymm15,%ymm0,%ymm14 - vpunpckhdq %ymm15,%ymm0,%ymm15 - - # interleave 64-bit words in state n, n+2 - vmovdqa 0x00(%rsp),%ymm0 - vmovdqa 0x40(%rsp),%ymm2 - vpunpcklqdq %ymm2,%ymm0,%ymm1 - vpunpckhqdq %ymm2,%ymm0,%ymm2 - vmovdqa %ymm1,0x00(%rsp) - vmovdqa %ymm2,0x40(%rsp) - vmovdqa 0x20(%rsp),%ymm0 - vmovdqa 0x60(%rsp),%ymm2 - vpunpcklqdq %ymm2,%ymm0,%ymm1 - vpunpckhqdq %ymm2,%ymm0,%ymm2 - vmovdqa %ymm1,0x20(%rsp) - vmovdqa %ymm2,0x60(%rsp) - vmovdqa %ymm4,%ymm0 - vpunpcklqdq %ymm6,%ymm0,%ymm4 - vpunpckhqdq %ymm6,%ymm0,%ymm6 - vmovdqa %ymm5,%ymm0 - vpunpcklqdq %ymm7,%ymm0,%ymm5 - vpunpckhqdq %ymm7,%ymm0,%ymm7 - vmovdqa %ymm8,%ymm0 - vpunpcklqdq %ymm10,%ymm0,%ymm8 - vpunpckhqdq %ymm10,%ymm0,%ymm10 - vmovdqa %ymm9,%ymm0 - vpunpcklqdq %ymm11,%ymm0,%ymm9 - vpunpckhqdq %ymm11,%ymm0,%ymm11 - vmovdqa %ymm12,%ymm0 - vpunpcklqdq %ymm14,%ymm0,%ymm12 - vpunpckhqdq %ymm14,%ymm0,%ymm14 - vmovdqa %ymm13,%ymm0 - vpunpcklqdq %ymm15,%ymm0,%ymm13 - vpunpckhqdq %ymm15,%ymm0,%ymm15 - - # interleave 128-bit words in state n, n+4 - vmovdqa 0x00(%rsp),%ymm0 - vperm2i128 $0x20,%ymm4,%ymm0,%ymm1 - vperm2i128 
$0x31,%ymm4,%ymm0,%ymm4 - vmovdqa %ymm1,0x00(%rsp) - vmovdqa 0x20(%rsp),%ymm0 - vperm2i128 $0x20,%ymm5,%ymm0,%ymm1 - vperm2i128 $0x31,%ymm5,%ymm0,%ymm5 - vmovdqa %ymm1,0x20(%rsp) - vmovdqa 0x40(%rsp),%ymm0 - vperm2i128 $0x20,%ymm6,%ymm0,%ymm1 - vperm2i128 $0x31,%ymm6,%ymm0,%ymm6 - vmovdqa %ymm1,0x40(%rsp) - vmovdqa 0x60(%rsp),%ymm0 - vperm2i128 $0x20,%ymm7,%ymm0,%ymm1 - vperm2i128 $0x31,%ymm7,%ymm0,%ymm7 - vmovdqa %ymm1,0x60(%rsp) - vperm2i128 $0x20,%ymm12,%ymm8,%ymm0 - vperm2i128 $0x31,%ymm12,%ymm8,%ymm12 - vmovdqa %ymm0,%ymm8 - vperm2i128 $0x20,%ymm13,%ymm9,%ymm0 - vperm2i128 $0x31,%ymm13,%ymm9,%ymm13 - vmovdqa %ymm0,%ymm9 - vperm2i128 $0x20,%ymm14,%ymm10,%ymm0 - vperm2i128 $0x31,%ymm14,%ymm10,%ymm14 - vmovdqa %ymm0,%ymm10 - vperm2i128 $0x20,%ymm15,%ymm11,%ymm0 - vperm2i128 $0x31,%ymm15,%ymm11,%ymm15 - vmovdqa %ymm0,%ymm11 - - # xor with corresponding input, write to output - vmovdqa 0x00(%rsp),%ymm0 - vpxor 0x0000(%rdx),%ymm0,%ymm0 - vmovdqu %ymm0,0x0000(%rsi) - vmovdqa 0x20(%rsp),%ymm0 - vpxor 0x0080(%rdx),%ymm0,%ymm0 - vmovdqu %ymm0,0x0080(%rsi) - vmovdqa 0x40(%rsp),%ymm0 - vpxor 0x0040(%rdx),%ymm0,%ymm0 - vmovdqu %ymm0,0x0040(%rsi) - vmovdqa 0x60(%rsp),%ymm0 - vpxor 0x00c0(%rdx),%ymm0,%ymm0 - vmovdqu %ymm0,0x00c0(%rsi) - vpxor 0x0100(%rdx),%ymm4,%ymm4 - vmovdqu %ymm4,0x0100(%rsi) - vpxor 0x0180(%rdx),%ymm5,%ymm5 - vmovdqu %ymm5,0x00180(%rsi) - vpxor 0x0140(%rdx),%ymm6,%ymm6 - vmovdqu %ymm6,0x0140(%rsi) - vpxor 0x01c0(%rdx),%ymm7,%ymm7 - vmovdqu %ymm7,0x01c0(%rsi) - vpxor 0x0020(%rdx),%ymm8,%ymm8 - vmovdqu %ymm8,0x0020(%rsi) - vpxor 0x00a0(%rdx),%ymm9,%ymm9 - vmovdqu %ymm9,0x00a0(%rsi) - vpxor 0x0060(%rdx),%ymm10,%ymm10 - vmovdqu %ymm10,0x0060(%rsi) - vpxor 0x00e0(%rdx),%ymm11,%ymm11 - vmovdqu %ymm11,0x00e0(%rsi) - vpxor 0x0120(%rdx),%ymm12,%ymm12 - vmovdqu %ymm12,0x0120(%rsi) - vpxor 0x01a0(%rdx),%ymm13,%ymm13 - vmovdqu %ymm13,0x01a0(%rsi) - vpxor 0x0160(%rdx),%ymm14,%ymm14 - vmovdqu %ymm14,0x0160(%rsi) - vpxor 0x01e0(%rdx),%ymm15,%ymm15 - vmovdqu %ymm15,0x01e0(%rsi) - - vzeroupper - lea -8(%r10),%rsp - ret -ENDPROC(chacha20_8block_xor_avx2) diff --git a/arch/x86/crypto/chacha20-ssse3-x86_64.S b/arch/x86/crypto/chacha20-ssse3-x86_64.S deleted file mode 100644 index 512a2b500fd1..000000000000 --- a/arch/x86/crypto/chacha20-ssse3-x86_64.S +++ /dev/null @@ -1,630 +0,0 @@ -/* - * ChaCha20 256-bit cipher algorithm, RFC7539, x64 SSSE3 functions - * - * Copyright (C) 2015 Martin Willi - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - */ - -#include - -.section .rodata.cst16.ROT8, "aM", @progbits, 16 -.align 16 -ROT8: .octa 0x0e0d0c0f0a09080b0605040702010003 -.section .rodata.cst16.ROT16, "aM", @progbits, 16 -.align 16 -ROT16: .octa 0x0d0c0f0e09080b0a0504070601000302 -.section .rodata.cst16.CTRINC, "aM", @progbits, 16 -.align 16 -CTRINC: .octa 0x00000003000000020000000100000000 - -.text - -ENTRY(chacha20_block_xor_ssse3) - # %rdi: Input state matrix, s - # %rsi: 1 data block output, o - # %rdx: 1 data block input, i - - # This function encrypts one ChaCha20 block by loading the state matrix - # in four SSE registers. It performs matrix operation on four words in - # parallel, but requires shuffling to rearrange the words after each - # round. 
8/16-bit word rotation is done with the slightly better - # performing SSSE3 byte shuffling, 7/12-bit word rotation uses - # traditional shift+OR. - - # x0..3 = s0..3 - movdqa 0x00(%rdi),%xmm0 - movdqa 0x10(%rdi),%xmm1 - movdqa 0x20(%rdi),%xmm2 - movdqa 0x30(%rdi),%xmm3 - movdqa %xmm0,%xmm8 - movdqa %xmm1,%xmm9 - movdqa %xmm2,%xmm10 - movdqa %xmm3,%xmm11 - - movdqa ROT8(%rip),%xmm4 - movdqa ROT16(%rip),%xmm5 - - mov $10,%ecx - -.Ldoubleround: - - # x0 += x1, x3 = rotl32(x3 ^ x0, 16) - paddd %xmm1,%xmm0 - pxor %xmm0,%xmm3 - pshufb %xmm5,%xmm3 - - # x2 += x3, x1 = rotl32(x1 ^ x2, 12) - paddd %xmm3,%xmm2 - pxor %xmm2,%xmm1 - movdqa %xmm1,%xmm6 - pslld $12,%xmm6 - psrld $20,%xmm1 - por %xmm6,%xmm1 - - # x0 += x1, x3 = rotl32(x3 ^ x0, 8) - paddd %xmm1,%xmm0 - pxor %xmm0,%xmm3 - pshufb %xmm4,%xmm3 - - # x2 += x3, x1 = rotl32(x1 ^ x2, 7) - paddd %xmm3,%xmm2 - pxor %xmm2,%xmm1 - movdqa %xmm1,%xmm7 - pslld $7,%xmm7 - psrld $25,%xmm1 - por %xmm7,%xmm1 - - # x1 = shuffle32(x1, MASK(0, 3, 2, 1)) - pshufd $0x39,%xmm1,%xmm1 - # x2 = shuffle32(x2, MASK(1, 0, 3, 2)) - pshufd $0x4e,%xmm2,%xmm2 - # x3 = shuffle32(x3, MASK(2, 1, 0, 3)) - pshufd $0x93,%xmm3,%xmm3 - - # x0 += x1, x3 = rotl32(x3 ^ x0, 16) - paddd %xmm1,%xmm0 - pxor %xmm0,%xmm3 - pshufb %xmm5,%xmm3 - - # x2 += x3, x1 = rotl32(x1 ^ x2, 12) - paddd %xmm3,%xmm2 - pxor %xmm2,%xmm1 - movdqa %xmm1,%xmm6 - pslld $12,%xmm6 - psrld $20,%xmm1 - por %xmm6,%xmm1 - - # x0 += x1, x3 = rotl32(x3 ^ x0, 8) - paddd %xmm1,%xmm0 - pxor %xmm0,%xmm3 - pshufb %xmm4,%xmm3 - - # x2 += x3, x1 = rotl32(x1 ^ x2, 7) - paddd %xmm3,%xmm2 - pxor %xmm2,%xmm1 - movdqa %xmm1,%xmm7 - pslld $7,%xmm7 - psrld $25,%xmm1 - por %xmm7,%xmm1 - - # x1 = shuffle32(x1, MASK(2, 1, 0, 3)) - pshufd $0x93,%xmm1,%xmm1 - # x2 = shuffle32(x2, MASK(1, 0, 3, 2)) - pshufd $0x4e,%xmm2,%xmm2 - # x3 = shuffle32(x3, MASK(0, 3, 2, 1)) - pshufd $0x39,%xmm3,%xmm3 - - dec %ecx - jnz .Ldoubleround - - # o0 = i0 ^ (x0 + s0) - movdqu 0x00(%rdx),%xmm4 - paddd %xmm8,%xmm0 - pxor %xmm4,%xmm0 - movdqu %xmm0,0x00(%rsi) - # o1 = i1 ^ (x1 + s1) - movdqu 0x10(%rdx),%xmm5 - paddd %xmm9,%xmm1 - pxor %xmm5,%xmm1 - movdqu %xmm1,0x10(%rsi) - # o2 = i2 ^ (x2 + s2) - movdqu 0x20(%rdx),%xmm6 - paddd %xmm10,%xmm2 - pxor %xmm6,%xmm2 - movdqu %xmm2,0x20(%rsi) - # o3 = i3 ^ (x3 + s3) - movdqu 0x30(%rdx),%xmm7 - paddd %xmm11,%xmm3 - pxor %xmm7,%xmm3 - movdqu %xmm3,0x30(%rsi) - - ret -ENDPROC(chacha20_block_xor_ssse3) - -ENTRY(chacha20_4block_xor_ssse3) - # %rdi: Input state matrix, s - # %rsi: 4 data blocks output, o - # %rdx: 4 data blocks input, i - - # This function encrypts four consecutive ChaCha20 blocks by loading - # the state matrix in SSE registers four times. As we need some scratch - # registers, we save the first four registers on the stack. The - # algorithm performs each operation on the corresponding word of each - # state matrix, hence requires no word shuffling. For final XORing step - # we transpose the matrix by interleaving 32- and then 64-bit words, - # which allows us to do XOR in SSE registers. 8/16-bit word rotation is - # done with the slightly better performing SSSE3 byte shuffling, - # 7/12-bit word rotation uses traditional shift+OR. 
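
All of the SIMD implementations removed in this patch (NEON, SSSE3, AVX2) vectorize the same scalar ChaCha20 double round that the comments above describe, running one copy of it per 32-bit lane. A minimal portable C sketch of that double round, for reference only; the helper names here are illustrative and are not taken from the kernel sources:

#include <linux/types.h>

#define ROTL32(v, n) (((v) << (n)) | ((v) >> (32 - (n))))

#define QUARTERROUND(a, b, c, d) do {			\
	(a) += (b); (d) = ROTL32((d) ^ (a), 16);	\
	(c) += (d); (b) = ROTL32((b) ^ (c), 12);	\
	(a) += (b); (d) = ROTL32((d) ^ (a),  8);	\
	(c) += (d); (b) = ROTL32((b) ^ (c),  7);	\
} while (0)

/* One double round: a column round followed by a diagonal round.
 * The block functions above iterate this ten times, giving
 * ChaCha20's twenty rounds. */
static void chacha20_double_round(u32 x[16])
{
	QUARTERROUND(x[0], x[4], x[8],  x[12]);
	QUARTERROUND(x[1], x[5], x[9],  x[13]);
	QUARTERROUND(x[2], x[6], x[10], x[14]);
	QUARTERROUND(x[3], x[7], x[11], x[15]);
	QUARTERROUND(x[0], x[5], x[10], x[15]);
	QUARTERROUND(x[1], x[6], x[11], x[12]);
	QUARTERROUND(x[2], x[7], x[8],  x[13]);
	QUARTERROUND(x[3], x[4], x[9],  x[14]);
}

The 16- and 8-bit rotations fall on byte boundaries, which is why the assembly implements them with byte shuffles (pshufb against the ROT16/ROT8 tables on x86, rev32/tbl on NEON), while the 12- and 7-bit rotations need the two-shift-plus-OR sequences (pslld/psrld/por, or shl/sri).
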
- - lea 8(%rsp),%r10 - sub $0x80,%rsp - and $~63,%rsp - - # x0..15[0-3] = s0..3[0..3] - movq 0x00(%rdi),%xmm1 - pshufd $0x00,%xmm1,%xmm0 - pshufd $0x55,%xmm1,%xmm1 - movq 0x08(%rdi),%xmm3 - pshufd $0x00,%xmm3,%xmm2 - pshufd $0x55,%xmm3,%xmm3 - movq 0x10(%rdi),%xmm5 - pshufd $0x00,%xmm5,%xmm4 - pshufd $0x55,%xmm5,%xmm5 - movq 0x18(%rdi),%xmm7 - pshufd $0x00,%xmm7,%xmm6 - pshufd $0x55,%xmm7,%xmm7 - movq 0x20(%rdi),%xmm9 - pshufd $0x00,%xmm9,%xmm8 - pshufd $0x55,%xmm9,%xmm9 - movq 0x28(%rdi),%xmm11 - pshufd $0x00,%xmm11,%xmm10 - pshufd $0x55,%xmm11,%xmm11 - movq 0x30(%rdi),%xmm13 - pshufd $0x00,%xmm13,%xmm12 - pshufd $0x55,%xmm13,%xmm13 - movq 0x38(%rdi),%xmm15 - pshufd $0x00,%xmm15,%xmm14 - pshufd $0x55,%xmm15,%xmm15 - # x0..3 on stack - movdqa %xmm0,0x00(%rsp) - movdqa %xmm1,0x10(%rsp) - movdqa %xmm2,0x20(%rsp) - movdqa %xmm3,0x30(%rsp) - - movdqa CTRINC(%rip),%xmm1 - movdqa ROT8(%rip),%xmm2 - movdqa ROT16(%rip),%xmm3 - - # x12 += counter values 0-3 - paddd %xmm1,%xmm12 - - mov $10,%ecx - -.Ldoubleround4: - # x0 += x4, x12 = rotl32(x12 ^ x0, 16) - movdqa 0x00(%rsp),%xmm0 - paddd %xmm4,%xmm0 - movdqa %xmm0,0x00(%rsp) - pxor %xmm0,%xmm12 - pshufb %xmm3,%xmm12 - # x1 += x5, x13 = rotl32(x13 ^ x1, 16) - movdqa 0x10(%rsp),%xmm0 - paddd %xmm5,%xmm0 - movdqa %xmm0,0x10(%rsp) - pxor %xmm0,%xmm13 - pshufb %xmm3,%xmm13 - # x2 += x6, x14 = rotl32(x14 ^ x2, 16) - movdqa 0x20(%rsp),%xmm0 - paddd %xmm6,%xmm0 - movdqa %xmm0,0x20(%rsp) - pxor %xmm0,%xmm14 - pshufb %xmm3,%xmm14 - # x3 += x7, x15 = rotl32(x15 ^ x3, 16) - movdqa 0x30(%rsp),%xmm0 - paddd %xmm7,%xmm0 - movdqa %xmm0,0x30(%rsp) - pxor %xmm0,%xmm15 - pshufb %xmm3,%xmm15 - - # x8 += x12, x4 = rotl32(x4 ^ x8, 12) - paddd %xmm12,%xmm8 - pxor %xmm8,%xmm4 - movdqa %xmm4,%xmm0 - pslld $12,%xmm0 - psrld $20,%xmm4 - por %xmm0,%xmm4 - # x9 += x13, x5 = rotl32(x5 ^ x9, 12) - paddd %xmm13,%xmm9 - pxor %xmm9,%xmm5 - movdqa %xmm5,%xmm0 - pslld $12,%xmm0 - psrld $20,%xmm5 - por %xmm0,%xmm5 - # x10 += x14, x6 = rotl32(x6 ^ x10, 12) - paddd %xmm14,%xmm10 - pxor %xmm10,%xmm6 - movdqa %xmm6,%xmm0 - pslld $12,%xmm0 - psrld $20,%xmm6 - por %xmm0,%xmm6 - # x11 += x15, x7 = rotl32(x7 ^ x11, 12) - paddd %xmm15,%xmm11 - pxor %xmm11,%xmm7 - movdqa %xmm7,%xmm0 - pslld $12,%xmm0 - psrld $20,%xmm7 - por %xmm0,%xmm7 - - # x0 += x4, x12 = rotl32(x12 ^ x0, 8) - movdqa 0x00(%rsp),%xmm0 - paddd %xmm4,%xmm0 - movdqa %xmm0,0x00(%rsp) - pxor %xmm0,%xmm12 - pshufb %xmm2,%xmm12 - # x1 += x5, x13 = rotl32(x13 ^ x1, 8) - movdqa 0x10(%rsp),%xmm0 - paddd %xmm5,%xmm0 - movdqa %xmm0,0x10(%rsp) - pxor %xmm0,%xmm13 - pshufb %xmm2,%xmm13 - # x2 += x6, x14 = rotl32(x14 ^ x2, 8) - movdqa 0x20(%rsp),%xmm0 - paddd %xmm6,%xmm0 - movdqa %xmm0,0x20(%rsp) - pxor %xmm0,%xmm14 - pshufb %xmm2,%xmm14 - # x3 += x7, x15 = rotl32(x15 ^ x3, 8) - movdqa 0x30(%rsp),%xmm0 - paddd %xmm7,%xmm0 - movdqa %xmm0,0x30(%rsp) - pxor %xmm0,%xmm15 - pshufb %xmm2,%xmm15 - - # x8 += x12, x4 = rotl32(x4 ^ x8, 7) - paddd %xmm12,%xmm8 - pxor %xmm8,%xmm4 - movdqa %xmm4,%xmm0 - pslld $7,%xmm0 - psrld $25,%xmm4 - por %xmm0,%xmm4 - # x9 += x13, x5 = rotl32(x5 ^ x9, 7) - paddd %xmm13,%xmm9 - pxor %xmm9,%xmm5 - movdqa %xmm5,%xmm0 - pslld $7,%xmm0 - psrld $25,%xmm5 - por %xmm0,%xmm5 - # x10 += x14, x6 = rotl32(x6 ^ x10, 7) - paddd %xmm14,%xmm10 - pxor %xmm10,%xmm6 - movdqa %xmm6,%xmm0 - pslld $7,%xmm0 - psrld $25,%xmm6 - por %xmm0,%xmm6 - # x11 += x15, x7 = rotl32(x7 ^ x11, 7) - paddd %xmm15,%xmm11 - pxor %xmm11,%xmm7 - movdqa %xmm7,%xmm0 - pslld $7,%xmm0 - psrld $25,%xmm7 - por %xmm0,%xmm7 - - # x0 += x5, x15 = rotl32(x15 ^ x0, 16) - 
movdqa 0x00(%rsp),%xmm0 - paddd %xmm5,%xmm0 - movdqa %xmm0,0x00(%rsp) - pxor %xmm0,%xmm15 - pshufb %xmm3,%xmm15 - # x1 += x6, x12 = rotl32(x12 ^ x1, 16) - movdqa 0x10(%rsp),%xmm0 - paddd %xmm6,%xmm0 - movdqa %xmm0,0x10(%rsp) - pxor %xmm0,%xmm12 - pshufb %xmm3,%xmm12 - # x2 += x7, x13 = rotl32(x13 ^ x2, 16) - movdqa 0x20(%rsp),%xmm0 - paddd %xmm7,%xmm0 - movdqa %xmm0,0x20(%rsp) - pxor %xmm0,%xmm13 - pshufb %xmm3,%xmm13 - # x3 += x4, x14 = rotl32(x14 ^ x3, 16) - movdqa 0x30(%rsp),%xmm0 - paddd %xmm4,%xmm0 - movdqa %xmm0,0x30(%rsp) - pxor %xmm0,%xmm14 - pshufb %xmm3,%xmm14 - - # x10 += x15, x5 = rotl32(x5 ^ x10, 12) - paddd %xmm15,%xmm10 - pxor %xmm10,%xmm5 - movdqa %xmm5,%xmm0 - pslld $12,%xmm0 - psrld $20,%xmm5 - por %xmm0,%xmm5 - # x11 += x12, x6 = rotl32(x6 ^ x11, 12) - paddd %xmm12,%xmm11 - pxor %xmm11,%xmm6 - movdqa %xmm6,%xmm0 - pslld $12,%xmm0 - psrld $20,%xmm6 - por %xmm0,%xmm6 - # x8 += x13, x7 = rotl32(x7 ^ x8, 12) - paddd %xmm13,%xmm8 - pxor %xmm8,%xmm7 - movdqa %xmm7,%xmm0 - pslld $12,%xmm0 - psrld $20,%xmm7 - por %xmm0,%xmm7 - # x9 += x14, x4 = rotl32(x4 ^ x9, 12) - paddd %xmm14,%xmm9 - pxor %xmm9,%xmm4 - movdqa %xmm4,%xmm0 - pslld $12,%xmm0 - psrld $20,%xmm4 - por %xmm0,%xmm4 - - # x0 += x5, x15 = rotl32(x15 ^ x0, 8) - movdqa 0x00(%rsp),%xmm0 - paddd %xmm5,%xmm0 - movdqa %xmm0,0x00(%rsp) - pxor %xmm0,%xmm15 - pshufb %xmm2,%xmm15 - # x1 += x6, x12 = rotl32(x12 ^ x1, 8) - movdqa 0x10(%rsp),%xmm0 - paddd %xmm6,%xmm0 - movdqa %xmm0,0x10(%rsp) - pxor %xmm0,%xmm12 - pshufb %xmm2,%xmm12 - # x2 += x7, x13 = rotl32(x13 ^ x2, 8) - movdqa 0x20(%rsp),%xmm0 - paddd %xmm7,%xmm0 - movdqa %xmm0,0x20(%rsp) - pxor %xmm0,%xmm13 - pshufb %xmm2,%xmm13 - # x3 += x4, x14 = rotl32(x14 ^ x3, 8) - movdqa 0x30(%rsp),%xmm0 - paddd %xmm4,%xmm0 - movdqa %xmm0,0x30(%rsp) - pxor %xmm0,%xmm14 - pshufb %xmm2,%xmm14 - - # x10 += x15, x5 = rotl32(x5 ^ x10, 7) - paddd %xmm15,%xmm10 - pxor %xmm10,%xmm5 - movdqa %xmm5,%xmm0 - pslld $7,%xmm0 - psrld $25,%xmm5 - por %xmm0,%xmm5 - # x11 += x12, x6 = rotl32(x6 ^ x11, 7) - paddd %xmm12,%xmm11 - pxor %xmm11,%xmm6 - movdqa %xmm6,%xmm0 - pslld $7,%xmm0 - psrld $25,%xmm6 - por %xmm0,%xmm6 - # x8 += x13, x7 = rotl32(x7 ^ x8, 7) - paddd %xmm13,%xmm8 - pxor %xmm8,%xmm7 - movdqa %xmm7,%xmm0 - pslld $7,%xmm0 - psrld $25,%xmm7 - por %xmm0,%xmm7 - # x9 += x14, x4 = rotl32(x4 ^ x9, 7) - paddd %xmm14,%xmm9 - pxor %xmm9,%xmm4 - movdqa %xmm4,%xmm0 - pslld $7,%xmm0 - psrld $25,%xmm4 - por %xmm0,%xmm4 - - dec %ecx - jnz .Ldoubleround4 - - # x0[0-3] += s0[0] - # x1[0-3] += s0[1] - movq 0x00(%rdi),%xmm3 - pshufd $0x00,%xmm3,%xmm2 - pshufd $0x55,%xmm3,%xmm3 - paddd 0x00(%rsp),%xmm2 - movdqa %xmm2,0x00(%rsp) - paddd 0x10(%rsp),%xmm3 - movdqa %xmm3,0x10(%rsp) - # x2[0-3] += s0[2] - # x3[0-3] += s0[3] - movq 0x08(%rdi),%xmm3 - pshufd $0x00,%xmm3,%xmm2 - pshufd $0x55,%xmm3,%xmm3 - paddd 0x20(%rsp),%xmm2 - movdqa %xmm2,0x20(%rsp) - paddd 0x30(%rsp),%xmm3 - movdqa %xmm3,0x30(%rsp) - - # x4[0-3] += s1[0] - # x5[0-3] += s1[1] - movq 0x10(%rdi),%xmm3 - pshufd $0x00,%xmm3,%xmm2 - pshufd $0x55,%xmm3,%xmm3 - paddd %xmm2,%xmm4 - paddd %xmm3,%xmm5 - # x6[0-3] += s1[2] - # x7[0-3] += s1[3] - movq 0x18(%rdi),%xmm3 - pshufd $0x00,%xmm3,%xmm2 - pshufd $0x55,%xmm3,%xmm3 - paddd %xmm2,%xmm6 - paddd %xmm3,%xmm7 - - # x8[0-3] += s2[0] - # x9[0-3] += s2[1] - movq 0x20(%rdi),%xmm3 - pshufd $0x00,%xmm3,%xmm2 - pshufd $0x55,%xmm3,%xmm3 - paddd %xmm2,%xmm8 - paddd %xmm3,%xmm9 - # x10[0-3] += s2[2] - # x11[0-3] += s2[3] - movq 0x28(%rdi),%xmm3 - pshufd $0x00,%xmm3,%xmm2 - pshufd $0x55,%xmm3,%xmm3 - paddd %xmm2,%xmm10 
- paddd %xmm3,%xmm11 - - # x12[0-3] += s3[0] - # x13[0-3] += s3[1] - movq 0x30(%rdi),%xmm3 - pshufd $0x00,%xmm3,%xmm2 - pshufd $0x55,%xmm3,%xmm3 - paddd %xmm2,%xmm12 - paddd %xmm3,%xmm13 - # x14[0-3] += s3[2] - # x15[0-3] += s3[3] - movq 0x38(%rdi),%xmm3 - pshufd $0x00,%xmm3,%xmm2 - pshufd $0x55,%xmm3,%xmm3 - paddd %xmm2,%xmm14 - paddd %xmm3,%xmm15 - - # x12 += counter values 0-3 - paddd %xmm1,%xmm12 - - # interleave 32-bit words in state n, n+1 - movdqa 0x00(%rsp),%xmm0 - movdqa 0x10(%rsp),%xmm1 - movdqa %xmm0,%xmm2 - punpckldq %xmm1,%xmm2 - punpckhdq %xmm1,%xmm0 - movdqa %xmm2,0x00(%rsp) - movdqa %xmm0,0x10(%rsp) - movdqa 0x20(%rsp),%xmm0 - movdqa 0x30(%rsp),%xmm1 - movdqa %xmm0,%xmm2 - punpckldq %xmm1,%xmm2 - punpckhdq %xmm1,%xmm0 - movdqa %xmm2,0x20(%rsp) - movdqa %xmm0,0x30(%rsp) - movdqa %xmm4,%xmm0 - punpckldq %xmm5,%xmm4 - punpckhdq %xmm5,%xmm0 - movdqa %xmm0,%xmm5 - movdqa %xmm6,%xmm0 - punpckldq %xmm7,%xmm6 - punpckhdq %xmm7,%xmm0 - movdqa %xmm0,%xmm7 - movdqa %xmm8,%xmm0 - punpckldq %xmm9,%xmm8 - punpckhdq %xmm9,%xmm0 - movdqa %xmm0,%xmm9 - movdqa %xmm10,%xmm0 - punpckldq %xmm11,%xmm10 - punpckhdq %xmm11,%xmm0 - movdqa %xmm0,%xmm11 - movdqa %xmm12,%xmm0 - punpckldq %xmm13,%xmm12 - punpckhdq %xmm13,%xmm0 - movdqa %xmm0,%xmm13 - movdqa %xmm14,%xmm0 - punpckldq %xmm15,%xmm14 - punpckhdq %xmm15,%xmm0 - movdqa %xmm0,%xmm15 - - # interleave 64-bit words in state n, n+2 - movdqa 0x00(%rsp),%xmm0 - movdqa 0x20(%rsp),%xmm1 - movdqa %xmm0,%xmm2 - punpcklqdq %xmm1,%xmm2 - punpckhqdq %xmm1,%xmm0 - movdqa %xmm2,0x00(%rsp) - movdqa %xmm0,0x20(%rsp) - movdqa 0x10(%rsp),%xmm0 - movdqa 0x30(%rsp),%xmm1 - movdqa %xmm0,%xmm2 - punpcklqdq %xmm1,%xmm2 - punpckhqdq %xmm1,%xmm0 - movdqa %xmm2,0x10(%rsp) - movdqa %xmm0,0x30(%rsp) - movdqa %xmm4,%xmm0 - punpcklqdq %xmm6,%xmm4 - punpckhqdq %xmm6,%xmm0 - movdqa %xmm0,%xmm6 - movdqa %xmm5,%xmm0 - punpcklqdq %xmm7,%xmm5 - punpckhqdq %xmm7,%xmm0 - movdqa %xmm0,%xmm7 - movdqa %xmm8,%xmm0 - punpcklqdq %xmm10,%xmm8 - punpckhqdq %xmm10,%xmm0 - movdqa %xmm0,%xmm10 - movdqa %xmm9,%xmm0 - punpcklqdq %xmm11,%xmm9 - punpckhqdq %xmm11,%xmm0 - movdqa %xmm0,%xmm11 - movdqa %xmm12,%xmm0 - punpcklqdq %xmm14,%xmm12 - punpckhqdq %xmm14,%xmm0 - movdqa %xmm0,%xmm14 - movdqa %xmm13,%xmm0 - punpcklqdq %xmm15,%xmm13 - punpckhqdq %xmm15,%xmm0 - movdqa %xmm0,%xmm15 - - # xor with corresponding input, write to output - movdqa 0x00(%rsp),%xmm0 - movdqu 0x00(%rdx),%xmm1 - pxor %xmm1,%xmm0 - movdqu %xmm0,0x00(%rsi) - movdqa 0x10(%rsp),%xmm0 - movdqu 0x80(%rdx),%xmm1 - pxor %xmm1,%xmm0 - movdqu %xmm0,0x80(%rsi) - movdqa 0x20(%rsp),%xmm0 - movdqu 0x40(%rdx),%xmm1 - pxor %xmm1,%xmm0 - movdqu %xmm0,0x40(%rsi) - movdqa 0x30(%rsp),%xmm0 - movdqu 0xc0(%rdx),%xmm1 - pxor %xmm1,%xmm0 - movdqu %xmm0,0xc0(%rsi) - movdqu 0x10(%rdx),%xmm1 - pxor %xmm1,%xmm4 - movdqu %xmm4,0x10(%rsi) - movdqu 0x90(%rdx),%xmm1 - pxor %xmm1,%xmm5 - movdqu %xmm5,0x90(%rsi) - movdqu 0x50(%rdx),%xmm1 - pxor %xmm1,%xmm6 - movdqu %xmm6,0x50(%rsi) - movdqu 0xd0(%rdx),%xmm1 - pxor %xmm1,%xmm7 - movdqu %xmm7,0xd0(%rsi) - movdqu 0x20(%rdx),%xmm1 - pxor %xmm1,%xmm8 - movdqu %xmm8,0x20(%rsi) - movdqu 0xa0(%rdx),%xmm1 - pxor %xmm1,%xmm9 - movdqu %xmm9,0xa0(%rsi) - movdqu 0x60(%rdx),%xmm1 - pxor %xmm1,%xmm10 - movdqu %xmm10,0x60(%rsi) - movdqu 0xe0(%rdx),%xmm1 - pxor %xmm1,%xmm11 - movdqu %xmm11,0xe0(%rsi) - movdqu 0x30(%rdx),%xmm1 - pxor %xmm1,%xmm12 - movdqu %xmm12,0x30(%rsi) - movdqu 0xb0(%rdx),%xmm1 - pxor %xmm1,%xmm13 - movdqu %xmm13,0xb0(%rsi) - movdqu 0x70(%rdx),%xmm1 - pxor %xmm1,%xmm14 - movdqu %xmm14,0x70(%rsi) - movdqu 
0xf0(%rdx),%xmm1 - pxor %xmm1,%xmm15 - movdqu %xmm15,0xf0(%rsi) - - lea -8(%r10),%rsp - ret -ENDPROC(chacha20_4block_xor_ssse3) diff --git a/arch/x86/crypto/chacha20_glue.c b/arch/x86/crypto/chacha20_glue.c deleted file mode 100644 index dce7c5d39c2f..000000000000 --- a/arch/x86/crypto/chacha20_glue.c +++ /dev/null @@ -1,146 +0,0 @@ -/* - * ChaCha20 256-bit cipher algorithm, RFC7539, SIMD glue code - * - * Copyright (C) 2015 Martin Willi - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - */ - -#include -#include -#include -#include -#include -#include -#include - -#define CHACHA20_STATE_ALIGN 16 - -asmlinkage void chacha20_block_xor_ssse3(u32 *state, u8 *dst, const u8 *src); -asmlinkage void chacha20_4block_xor_ssse3(u32 *state, u8 *dst, const u8 *src); -#ifdef CONFIG_AS_AVX2 -asmlinkage void chacha20_8block_xor_avx2(u32 *state, u8 *dst, const u8 *src); -static bool chacha20_use_avx2; -#endif - -static void chacha20_dosimd(u32 *state, u8 *dst, const u8 *src, - unsigned int bytes) -{ - u8 buf[CHACHA20_BLOCK_SIZE]; - -#ifdef CONFIG_AS_AVX2 - if (chacha20_use_avx2) { - while (bytes >= CHACHA20_BLOCK_SIZE * 8) { - chacha20_8block_xor_avx2(state, dst, src); - bytes -= CHACHA20_BLOCK_SIZE * 8; - src += CHACHA20_BLOCK_SIZE * 8; - dst += CHACHA20_BLOCK_SIZE * 8; - state[12] += 8; - } - } -#endif - while (bytes >= CHACHA20_BLOCK_SIZE * 4) { - chacha20_4block_xor_ssse3(state, dst, src); - bytes -= CHACHA20_BLOCK_SIZE * 4; - src += CHACHA20_BLOCK_SIZE * 4; - dst += CHACHA20_BLOCK_SIZE * 4; - state[12] += 4; - } - while (bytes >= CHACHA20_BLOCK_SIZE) { - chacha20_block_xor_ssse3(state, dst, src); - bytes -= CHACHA20_BLOCK_SIZE; - src += CHACHA20_BLOCK_SIZE; - dst += CHACHA20_BLOCK_SIZE; - state[12]++; - } - if (bytes) { - memcpy(buf, src, bytes); - chacha20_block_xor_ssse3(state, buf, buf); - memcpy(dst, buf, bytes); - } -} - -static int chacha20_simd(struct skcipher_request *req) -{ - struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req); - struct chacha20_ctx *ctx = crypto_skcipher_ctx(tfm); - u32 *state, state_buf[16 + 2] __aligned(8); - struct skcipher_walk walk; - int err; - - BUILD_BUG_ON(CHACHA20_STATE_ALIGN != 16); - state = PTR_ALIGN(state_buf + 0, CHACHA20_STATE_ALIGN); - - if (req->cryptlen <= CHACHA20_BLOCK_SIZE || !may_use_simd()) - return crypto_chacha20_crypt(req); - - err = skcipher_walk_virt(&walk, req, true); - - crypto_chacha20_init(state, ctx, walk.iv); - - kernel_fpu_begin(); - - while (walk.nbytes >= CHACHA20_BLOCK_SIZE) { - chacha20_dosimd(state, walk.dst.virt.addr, walk.src.virt.addr, - rounddown(walk.nbytes, CHACHA20_BLOCK_SIZE)); - err = skcipher_walk_done(&walk, - walk.nbytes % CHACHA20_BLOCK_SIZE); - } - - if (walk.nbytes) { - chacha20_dosimd(state, walk.dst.virt.addr, walk.src.virt.addr, - walk.nbytes); - err = skcipher_walk_done(&walk, 0); - } - - kernel_fpu_end(); - - return err; -} - -static struct skcipher_alg alg = { - .base.cra_name = "chacha20", - .base.cra_driver_name = "chacha20-simd", - .base.cra_priority = 300, - .base.cra_blocksize = 1, - .base.cra_ctxsize = sizeof(struct chacha20_ctx), - .base.cra_module = THIS_MODULE, - - .min_keysize = CHACHA20_KEY_SIZE, - .max_keysize = CHACHA20_KEY_SIZE, - .ivsize = CHACHA20_IV_SIZE, - .chunksize = CHACHA20_BLOCK_SIZE, - .setkey = crypto_chacha20_setkey, - .encrypt = chacha20_simd, - .decrypt = chacha20_simd, 
-}; - -static int __init chacha20_simd_mod_init(void) -{ - if (!boot_cpu_has(X86_FEATURE_SSSE3)) - return -ENODEV; - -#ifdef CONFIG_AS_AVX2 - chacha20_use_avx2 = boot_cpu_has(X86_FEATURE_AVX) && - boot_cpu_has(X86_FEATURE_AVX2) && - cpu_has_xfeatures(XFEATURE_MASK_SSE | XFEATURE_MASK_YMM, NULL); -#endif - return crypto_register_skcipher(&alg); -} - -static void __exit chacha20_simd_mod_fini(void) -{ - crypto_unregister_skcipher(&alg); -} - -module_init(chacha20_simd_mod_init); -module_exit(chacha20_simd_mod_fini); - -MODULE_LICENSE("GPL"); -MODULE_AUTHOR("Martin Willi "); -MODULE_DESCRIPTION("chacha20 cipher algorithm, SIMD accelerated"); -MODULE_ALIAS_CRYPTO("chacha20"); -MODULE_ALIAS_CRYPTO("chacha20-simd"); diff --git a/crypto/Kconfig b/crypto/Kconfig index 47859a0f8052..42dc48aa9b81 100644 --- a/crypto/Kconfig +++ b/crypto/Kconfig @@ -1428,27 +1428,12 @@ config CRYPTO_SALSA20 config CRYPTO_CHACHA20 tristate "ChaCha20 cipher algorithm" select CRYPTO_BLKCIPHER + select ZINC_CHACHA20 help ChaCha20 cipher algorithm, RFC7539. ChaCha20 is a 256-bit high-speed stream cipher designed by Daniel J. Bernstein and further specified in RFC7539 for use in IETF protocols. - This is the portable C implementation of ChaCha20. - - See also: - - -config CRYPTO_CHACHA20_X86_64 - tristate "ChaCha20 cipher algorithm (x86_64/SSSE3/AVX2)" - depends on X86 && 64BIT - select CRYPTO_BLKCIPHER - select CRYPTO_CHACHA20 - help - ChaCha20 cipher algorithm, RFC7539. - - ChaCha20 is a 256-bit high-speed stream cipher designed by Daniel J. - Bernstein and further specified in RFC7539 for use in IETF protocols. - This is the x86_64 assembler implementation using SIMD instructions. See also: diff --git a/crypto/Makefile b/crypto/Makefile index 5e60348d02e2..587103b87890 100644 --- a/crypto/Makefile +++ b/crypto/Makefile @@ -117,7 +117,7 @@ obj-$(CONFIG_CRYPTO_ANUBIS) += anubis.o obj-$(CONFIG_CRYPTO_SEED) += seed.o obj-$(CONFIG_CRYPTO_SPECK) += speck.o obj-$(CONFIG_CRYPTO_SALSA20) += salsa20_generic.o -obj-$(CONFIG_CRYPTO_CHACHA20) += chacha20_generic.o +obj-$(CONFIG_CRYPTO_CHACHA20) += chacha20_zinc.o obj-$(CONFIG_CRYPTO_POLY1305) += poly1305_zinc.o obj-$(CONFIG_CRYPTO_DEFLATE) += deflate.o obj-$(CONFIG_CRYPTO_MICHAEL_MIC) += michael_mic.o diff --git a/crypto/chacha20_generic.c b/crypto/chacha20_generic.c deleted file mode 100644 index e451c3cb6a56..000000000000 --- a/crypto/chacha20_generic.c +++ /dev/null @@ -1,136 +0,0 @@ -/* - * ChaCha20 256-bit cipher algorithm, RFC7539 - * - * Copyright (C) 2015 Martin Willi - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. 
- */ - -#include -#include -#include -#include -#include - -static void chacha20_docrypt(u32 *state, u8 *dst, const u8 *src, - unsigned int bytes) -{ - u32 stream[CHACHA20_BLOCK_WORDS]; - - if (dst != src) - memcpy(dst, src, bytes); - - while (bytes >= CHACHA20_BLOCK_SIZE) { - chacha20_block(state, stream); - crypto_xor(dst, (const u8 *)stream, CHACHA20_BLOCK_SIZE); - bytes -= CHACHA20_BLOCK_SIZE; - dst += CHACHA20_BLOCK_SIZE; - } - if (bytes) { - chacha20_block(state, stream); - crypto_xor(dst, (const u8 *)stream, bytes); - } -} - -void crypto_chacha20_init(u32 *state, struct chacha20_ctx *ctx, u8 *iv) -{ - state[0] = 0x61707865; /* "expa" */ - state[1] = 0x3320646e; /* "nd 3" */ - state[2] = 0x79622d32; /* "2-by" */ - state[3] = 0x6b206574; /* "te k" */ - state[4] = ctx->key[0]; - state[5] = ctx->key[1]; - state[6] = ctx->key[2]; - state[7] = ctx->key[3]; - state[8] = ctx->key[4]; - state[9] = ctx->key[5]; - state[10] = ctx->key[6]; - state[11] = ctx->key[7]; - state[12] = get_unaligned_le32(iv + 0); - state[13] = get_unaligned_le32(iv + 4); - state[14] = get_unaligned_le32(iv + 8); - state[15] = get_unaligned_le32(iv + 12); -} -EXPORT_SYMBOL_GPL(crypto_chacha20_init); - -int crypto_chacha20_setkey(struct crypto_skcipher *tfm, const u8 *key, - unsigned int keysize) -{ - struct chacha20_ctx *ctx = crypto_skcipher_ctx(tfm); - int i; - - if (keysize != CHACHA20_KEY_SIZE) - return -EINVAL; - - for (i = 0; i < ARRAY_SIZE(ctx->key); i++) - ctx->key[i] = get_unaligned_le32(key + i * sizeof(u32)); - - return 0; -} -EXPORT_SYMBOL_GPL(crypto_chacha20_setkey); - -int crypto_chacha20_crypt(struct skcipher_request *req) -{ - struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req); - struct chacha20_ctx *ctx = crypto_skcipher_ctx(tfm); - struct skcipher_walk walk; - u32 state[16]; - int err; - - err = skcipher_walk_virt(&walk, req, true); - - crypto_chacha20_init(state, ctx, walk.iv); - - while (walk.nbytes > 0) { - unsigned int nbytes = walk.nbytes; - - if (nbytes < walk.total) - nbytes = round_down(nbytes, walk.stride); - - chacha20_docrypt(state, walk.dst.virt.addr, walk.src.virt.addr, - nbytes); - err = skcipher_walk_done(&walk, walk.nbytes - nbytes); - } - - return err; -} -EXPORT_SYMBOL_GPL(crypto_chacha20_crypt); - -static struct skcipher_alg alg = { - .base.cra_name = "chacha20", - .base.cra_driver_name = "chacha20-generic", - .base.cra_priority = 100, - .base.cra_blocksize = 1, - .base.cra_ctxsize = sizeof(struct chacha20_ctx), - .base.cra_module = THIS_MODULE, - - .min_keysize = CHACHA20_KEY_SIZE, - .max_keysize = CHACHA20_KEY_SIZE, - .ivsize = CHACHA20_IV_SIZE, - .chunksize = CHACHA20_BLOCK_SIZE, - .setkey = crypto_chacha20_setkey, - .encrypt = crypto_chacha20_crypt, - .decrypt = crypto_chacha20_crypt, -}; - -static int __init chacha20_generic_mod_init(void) -{ - return crypto_register_skcipher(&alg); -} - -static void __exit chacha20_generic_mod_fini(void) -{ - crypto_unregister_skcipher(&alg); -} - -module_init(chacha20_generic_mod_init); -module_exit(chacha20_generic_mod_fini); - -MODULE_LICENSE("GPL"); -MODULE_AUTHOR("Martin Willi "); -MODULE_DESCRIPTION("chacha20 cipher algorithm"); -MODULE_ALIAS_CRYPTO("chacha20"); -MODULE_ALIAS_CRYPTO("chacha20-generic"); diff --git a/crypto/chacha20_zinc.c b/crypto/chacha20_zinc.c new file mode 100644 index 000000000000..55e4585de08c --- /dev/null +++ b/crypto/chacha20_zinc.c @@ -0,0 +1,90 @@ +/* SPDX-License-Identifier: GPL-2.0 + * + * Copyright (C) 2018 Jason A. Donenfeld . All Rights Reserved. 
+ */ + +#include +#include +#include +#include +#include + +static int crypto_chacha20_setkey(struct crypto_skcipher *tfm, const u8 *key, + unsigned int keysize) +{ + struct chacha20_ctx *ctx = crypto_skcipher_ctx(tfm); + + if (keysize != CHACHA20_KEY_SIZE) + return -EINVAL; + chacha20_init(ctx, key, 0); + return 0; +} + +static int crypto_chacha20_crypt(struct skcipher_request *req) +{ + struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req); + struct chacha20_ctx ctx = *(struct chacha20_ctx *)crypto_skcipher_ctx(tfm); + struct skcipher_walk walk; + simd_context_t simd_context; + int err, i; + + err = skcipher_walk_virt(&walk, req, true); + if (unlikely(err)) + return err; + + for (i = 0; i < ARRAY_SIZE(ctx.counter); ++i) + ctx.counter[i] = get_unaligned_le32(walk.iv + i * sizeof(u32)); + + simd_get(&simd_context); + while (walk.nbytes > 0) { + unsigned int nbytes = walk.nbytes; + + if (nbytes < walk.total) + nbytes = round_down(nbytes, walk.stride); + + chacha20(&ctx, walk.dst.virt.addr, walk.src.virt.addr, nbytes, + &simd_context); + + err = skcipher_walk_done(&walk, walk.nbytes - nbytes); + simd_relax(&simd_context); + } + simd_put(&simd_context); + + return err; +} + +static struct skcipher_alg alg = { + .base.cra_name = "chacha20", + .base.cra_driver_name = "chacha20-software", + .base.cra_priority = 100, + .base.cra_blocksize = 1, + .base.cra_ctxsize = sizeof(struct chacha20_ctx), + .base.cra_module = THIS_MODULE, + + .min_keysize = CHACHA20_KEY_SIZE, + .max_keysize = CHACHA20_KEY_SIZE, + .ivsize = CHACHA20_NONCE_SIZE, + .chunksize = CHACHA20_BLOCK_SIZE, + .setkey = crypto_chacha20_setkey, + .encrypt = crypto_chacha20_crypt, + .decrypt = crypto_chacha20_crypt, +}; + +static int __init chacha20_mod_init(void) +{ + return crypto_register_skcipher(&alg); +} + +static void __exit chacha20_mod_exit(void) +{ + crypto_unregister_skcipher(&alg); +} + +module_init(chacha20_mod_init); +module_exit(chacha20_mod_exit); + +MODULE_LICENSE("GPL"); +MODULE_AUTHOR("Jason A. Donenfeld "); +MODULE_DESCRIPTION("ChaCha20 stream cipher"); +MODULE_ALIAS_CRYPTO("chacha20"); +MODULE_ALIAS_CRYPTO("chacha20-software"); diff --git a/crypto/chacha20poly1305.c b/crypto/chacha20poly1305.c index bf523797bef3..585c7ef4f543 100644 --- a/crypto/chacha20poly1305.c +++ b/crypto/chacha20poly1305.c @@ -13,7 +13,7 @@ #include #include #include -#include +#include #include #include #include @@ -51,7 +51,7 @@ struct poly_req { }; struct chacha_req { - u8 iv[CHACHA20_IV_SIZE]; + u8 iv[CHACHA20_NONCE_SIZE]; struct scatterlist src[1]; struct skcipher_request req; /* must be last member */ }; @@ -91,7 +91,7 @@ static void chacha_iv(u8 *iv, struct aead_request *req, u32 icb) memcpy(iv, &leicb, sizeof(leicb)); memcpy(iv + sizeof(leicb), ctx->salt, ctx->saltlen); memcpy(iv + sizeof(leicb) + ctx->saltlen, req->iv, - CHACHA20_IV_SIZE - sizeof(leicb) - ctx->saltlen); + CHACHA20_NONCE_SIZE - sizeof(leicb) - ctx->saltlen); } static int poly_verify_tag(struct aead_request *req) @@ -639,7 +639,7 @@ static int chachapoly_create(struct crypto_template *tmpl, struct rtattr **tb, err = -EINVAL; /* Need 16-byte IV size, including Initial Block Counter value */ - if (crypto_skcipher_alg_ivsize(chacha) != CHACHA20_IV_SIZE) + if (crypto_skcipher_alg_ivsize(chacha) != CHACHA20_NONCE_SIZE) goto out_drop_chacha; /* Not a stream cipher? 
*/ if (chacha->base.cra_blocksize != 1)
diff --git a/include/crypto/chacha20.h b/include/crypto/chacha20.h index b83d66073db0..3b92f58f3891 100644 --- a/include/crypto/chacha20.h +++ b/include/crypto/chacha20.h @@ -6,23 +6,11 @@ #ifndef _CRYPTO_CHACHA20_H #define _CRYPTO_CHACHA20_H -#include -#include -#include - #define CHACHA20_IV_SIZE 16 #define CHACHA20_KEY_SIZE 32 #define CHACHA20_BLOCK_SIZE 64 #define CHACHA20_BLOCK_WORDS (CHACHA20_BLOCK_SIZE / sizeof(u32)) -struct chacha20_ctx { - u32 key[8]; -}; - void chacha20_block(u32 *state, u32 *stream); -void crypto_chacha20_init(u32 *state, struct chacha20_ctx *ctx, u8 *iv); -int crypto_chacha20_setkey(struct crypto_skcipher *tfm, const u8 *key, - unsigned int keysize); -int crypto_chacha20_crypt(struct skcipher_request *req); #endif

From patchwork Sat Oct 6 02:57:08 2018
X-Patchwork-Submitter: "Jason A. Donenfeld"
X-Patchwork-Id: 148321
From: "Jason A. Donenfeld"
To: linux-kernel@vger.kernel.org, netdev@vger.kernel.org, davem@davemloft.net, gregkh@linuxfoundation.org
Cc: "Jason A. Donenfeld" , David Howells , Samuel Neves , Jean-Philippe Aumasson , Andy Lutomirski , Andrew Morton , Linus Torvalds , kernel-hardening@lists.openwall.com
Subject: [PATCH net-next v7 27/28] security/keys: rewrite big_key crypto to use Zinc
Date: Sat, 6 Oct 2018 04:57:08 +0200
Message-Id: <20181006025709.4019-28-Jason@zx2c4.com>
In-Reply-To: <20181006025709.4019-1-Jason@zx2c4.com>
References: <20181006025709.4019-1-Jason@zx2c4.com>

A while back, I noticed that the crypto and crypto API usage in big_key was entirely broken in multiple ways, so I rewrote it. Now, I'm rewriting it again, but this time using Zinc's ChaCha20Poly1305 function. This makes the file considerably simpler; the diffstat alone should justify this commit. It should also be faster, since it no longer requires a mutex around the "aead api object" (nor allocations), allowing us to encrypt multiple items in parallel. We also benefit from being able to pass any type of pointer to Zinc, so we can get rid of the ridiculously complex custom page allocator that big_key really doesn't need.

Signed-off-by: Jason A.
Donenfeld Cc: David Howells Cc: Samuel Neves Cc: Jean-Philippe Aumasson Cc: Andy Lutomirski Cc: Greg KH Cc: Andrew Morton Cc: Linus Torvalds Cc: kernel-hardening@lists.openwall.com --- security/keys/Kconfig | 4 +- security/keys/big_key.c | 230 +++++----------------------------------- 2 files changed, 28 insertions(+), 206 deletions(-) -- 2.19.0 diff --git a/security/keys/Kconfig b/security/keys/Kconfig index 6462e6654ccf..66ff26298fb3 100644 --- a/security/keys/Kconfig +++ b/security/keys/Kconfig @@ -45,9 +45,7 @@ config BIG_KEYS bool "Large payload keys" depends on KEYS depends on TMPFS - select CRYPTO - select CRYPTO_AES - select CRYPTO_GCM + select ZINC_CHACHA20POLY1305 help This option provides support for holding large keys within the kernel (for example Kerberos ticket caches). The data may be stored out to diff --git a/security/keys/big_key.c b/security/keys/big_key.c index 2806e70d7f8f..bb15360221d8 100644 --- a/security/keys/big_key.c +++ b/security/keys/big_key.c @@ -1,6 +1,6 @@ /* Large capacity key type * - * Copyright (C) 2017 Jason A. Donenfeld . All Rights Reserved. + * Copyright (C) 2017-2018 Jason A. Donenfeld . All Rights Reserved. * Copyright (C) 2013 Red Hat, Inc. All Rights Reserved. * Written by David Howells (dhowells@redhat.com) * @@ -16,20 +16,10 @@ #include #include #include -#include #include -#include #include #include -#include -#include - -struct big_key_buf { - unsigned int nr_pages; - void *virt; - struct scatterlist *sg; - struct page *pages[]; -}; +#include /* * Layout of key payload words. @@ -41,14 +31,6 @@ enum { big_key_len, }; -/* - * Crypto operation with big_key data - */ -enum big_key_op { - BIG_KEY_ENC, - BIG_KEY_DEC, -}; - /* * If the data is under this limit, there's no point creating a shm file to * hold it as the permanently resident metadata for the shmem fs will be at @@ -56,16 +38,6 @@ enum big_key_op { */ #define BIG_KEY_FILE_THRESHOLD (sizeof(struct inode) + sizeof(struct dentry)) -/* - * Key size for big_key data encryption - */ -#define ENC_KEY_SIZE 32 - -/* - * Authentication tag length - */ -#define ENC_AUTHTAG_SIZE 16 - /* * big_key defined keys take an arbitrary string as the description and an * arbitrary blob of data as the payload @@ -79,136 +51,20 @@ struct key_type key_type_big_key = { .destroy = big_key_destroy, .describe = big_key_describe, .read = big_key_read, - /* no ->update(); don't add it without changing big_key_crypt() nonce */ + /* no ->update(); don't add it without changing chacha20poly1305's nonce */ }; -/* - * Crypto names for big_key data authenticated encryption - */ -static const char big_key_alg_name[] = "gcm(aes)"; -#define BIG_KEY_IV_SIZE GCM_AES_IV_SIZE - -/* - * Crypto algorithms for big_key data authenticated encryption - */ -static struct crypto_aead *big_key_aead; - -/* - * Since changing the key affects the entire object, we need a mutex. - */ -static DEFINE_MUTEX(big_key_aead_lock); - -/* - * Encrypt/decrypt big_key data - */ -static int big_key_crypt(enum big_key_op op, struct big_key_buf *buf, size_t datalen, u8 *key) -{ - int ret; - struct aead_request *aead_req; - /* We always use a zero nonce. The reason we can get away with this is - * because we're using a different randomly generated key for every - * different encryption. Notably, too, key_type_big_key doesn't define - * an .update function, so there's no chance we'll wind up reusing the - * key to encrypt updated data. Simply put: one key, one encryption. 
- */ - u8 zero_nonce[BIG_KEY_IV_SIZE]; - - aead_req = aead_request_alloc(big_key_aead, GFP_KERNEL); - if (!aead_req) - return -ENOMEM; - - memset(zero_nonce, 0, sizeof(zero_nonce)); - aead_request_set_crypt(aead_req, buf->sg, buf->sg, datalen, zero_nonce); - aead_request_set_callback(aead_req, CRYPTO_TFM_REQ_MAY_SLEEP, NULL, NULL); - aead_request_set_ad(aead_req, 0); - - mutex_lock(&big_key_aead_lock); - if (crypto_aead_setkey(big_key_aead, key, ENC_KEY_SIZE)) { - ret = -EAGAIN; - goto error; - } - if (op == BIG_KEY_ENC) - ret = crypto_aead_encrypt(aead_req); - else - ret = crypto_aead_decrypt(aead_req); -error: - mutex_unlock(&big_key_aead_lock); - aead_request_free(aead_req); - return ret; -} - -/* - * Free up the buffer. - */ -static void big_key_free_buffer(struct big_key_buf *buf) -{ - unsigned int i; - - if (buf->virt) { - memset(buf->virt, 0, buf->nr_pages * PAGE_SIZE); - vunmap(buf->virt); - } - - for (i = 0; i < buf->nr_pages; i++) - if (buf->pages[i]) - __free_page(buf->pages[i]); - - kfree(buf); -} - -/* - * Allocate a buffer consisting of a set of pages with a virtual mapping - * applied over them. - */ -static void *big_key_alloc_buffer(size_t len) -{ - struct big_key_buf *buf; - unsigned int npg = (len + PAGE_SIZE - 1) >> PAGE_SHIFT; - unsigned int i, l; - - buf = kzalloc(sizeof(struct big_key_buf) + - sizeof(struct page) * npg + - sizeof(struct scatterlist) * npg, - GFP_KERNEL); - if (!buf) - return NULL; - - buf->nr_pages = npg; - buf->sg = (void *)(buf->pages + npg); - sg_init_table(buf->sg, npg); - - for (i = 0; i < buf->nr_pages; i++) { - buf->pages[i] = alloc_page(GFP_KERNEL); - if (!buf->pages[i]) - goto nomem; - - l = min_t(size_t, len, PAGE_SIZE); - sg_set_page(&buf->sg[i], buf->pages[i], l, 0); - len -= l; - } - - buf->virt = vmap(buf->pages, buf->nr_pages, VM_MAP, PAGE_KERNEL); - if (!buf->virt) - goto nomem; - - return buf; - -nomem: - big_key_free_buffer(buf); - return NULL; -} - /* * Preparse a big key */ int big_key_preparse(struct key_preparsed_payload *prep) { - struct big_key_buf *buf; struct path *path = (struct path *)&prep->payload.data[big_key_path]; struct file *file; - u8 *enckey; + u8 *buf, *enckey; ssize_t written; - size_t datalen = prep->datalen, enclen = datalen + ENC_AUTHTAG_SIZE; + size_t datalen = prep->datalen; + size_t enclen = datalen + CHACHA20POLY1305_AUTHTAG_SIZE; int ret; if (datalen <= 0 || datalen > 1024 * 1024 || !prep->data) @@ -224,28 +80,28 @@ int big_key_preparse(struct key_preparsed_payload *prep) * to be swapped out if needed. * * File content is stored encrypted with randomly generated key. + * Since the key is random for each file, we can set the nonce + * to zero, provided we never define a ->update() call. 
*/ loff_t pos = 0; - buf = big_key_alloc_buffer(enclen); + buf = kvmalloc(enclen, GFP_KERNEL); if (!buf) return -ENOMEM; - memcpy(buf->virt, prep->data, datalen); /* generate random key */ - enckey = kmalloc(ENC_KEY_SIZE, GFP_KERNEL); + enckey = kmalloc(CHACHA20POLY1305_KEY_SIZE, GFP_KERNEL); if (!enckey) { ret = -ENOMEM; goto error; } - ret = get_random_bytes_wait(enckey, ENC_KEY_SIZE); + ret = get_random_bytes_wait(enckey, CHACHA20POLY1305_KEY_SIZE); if (unlikely(ret)) goto err_enckey; - /* encrypt aligned data */ - ret = big_key_crypt(BIG_KEY_ENC, buf, datalen, enckey); - if (ret) - goto err_enckey; + /* encrypt data */ + chacha20poly1305_encrypt(buf, prep->data, datalen, NULL, 0, + 0, enckey); /* save aligned data to file */ file = shmem_kernel_file_setup("", enclen, 0); @@ -254,7 +110,7 @@ int big_key_preparse(struct key_preparsed_payload *prep) goto err_enckey; } - written = kernel_write(file, buf->virt, enclen, &pos); + written = kernel_write(file, buf, enclen, &pos); if (written != enclen) { ret = written; if (written >= 0) @@ -269,7 +125,7 @@ int big_key_preparse(struct key_preparsed_payload *prep) *path = file->f_path; path_get(path); fput(file); - big_key_free_buffer(buf); + kvfree(buf); } else { /* Just store the data in a buffer */ void *data = kmalloc(datalen, GFP_KERNEL); @@ -287,7 +143,7 @@ int big_key_preparse(struct key_preparsed_payload *prep) err_enckey: kzfree(enckey); error: - big_key_free_buffer(buf); + kvfree(buf); return ret; } @@ -365,14 +221,13 @@ long big_key_read(const struct key *key, char __user *buffer, size_t buflen) return datalen; if (datalen > BIG_KEY_FILE_THRESHOLD) { - struct big_key_buf *buf; struct path *path = (struct path *)&key->payload.data[big_key_path]; struct file *file; - u8 *enckey = (u8 *)key->payload.data[big_key_data]; - size_t enclen = datalen + ENC_AUTHTAG_SIZE; + u8 *buf, *enckey = (u8 *)key->payload.data[big_key_data]; + size_t enclen = datalen + CHACHA20POLY1305_AUTHTAG_SIZE; loff_t pos = 0; - buf = big_key_alloc_buffer(enclen); + buf = kvmalloc(enclen, GFP_KERNEL); if (!buf) return -ENOMEM; @@ -383,26 +238,27 @@ long big_key_read(const struct key *key, char __user *buffer, size_t buflen) } /* read file to kernel and decrypt */ - ret = kernel_read(file, buf->virt, enclen, &pos); + ret = kernel_read(file, buf, enclen, &pos); if (ret >= 0 && ret != enclen) { ret = -EIO; goto err_fput; } - ret = big_key_crypt(BIG_KEY_DEC, buf, enclen, enckey); - if (ret) + ret = chacha20poly1305_decrypt(buf, buf, enclen, NULL, 0, 0, + enckey) ? 
0 : -EINVAL; + if (unlikely(ret)) goto err_fput; ret = datalen; /* copy decrypted data to user */ - if (copy_to_user(buffer, buf->virt, datalen) != 0) + if (copy_to_user(buffer, buf, datalen) != 0) ret = -EFAULT; err_fput: fput(file); error: - big_key_free_buffer(buf); + kvfree(buf); } else { ret = datalen; if (copy_to_user(buffer, key->payload.data[big_key_data], @@ -418,39 +274,7 @@ long big_key_read(const struct key *key, char __user *buffer, size_t buflen) */ static int __init big_key_init(void) { - int ret; - - /* init block cipher */ - big_key_aead = crypto_alloc_aead(big_key_alg_name, 0, CRYPTO_ALG_ASYNC); - if (IS_ERR(big_key_aead)) { - ret = PTR_ERR(big_key_aead); - pr_err("Can't alloc crypto: %d\n", ret); - return ret; - } - - if (unlikely(crypto_aead_ivsize(big_key_aead) != BIG_KEY_IV_SIZE)) { - WARN(1, "big key algorithm changed?"); - ret = -EINVAL; - goto free_aead; - } - - ret = crypto_aead_setauthsize(big_key_aead, ENC_AUTHTAG_SIZE); - if (ret < 0) { - pr_err("Can't set crypto auth tag len: %d\n", ret); - goto free_aead; - } - - ret = register_key_type(&key_type_big_key); - if (ret < 0) { - pr_err("Can't register type: %d\n", ret); - goto free_aead; - } - - return 0; - -free_aead: - crypto_free_aead(big_key_aead); - return ret; + return register_key_type(&key_type_big_key); } late_initcall(big_key_init);

From patchwork Sat Oct 6 02:57:09 2018
X-Patchwork-Submitter: "Jason A. Donenfeld"
X-Patchwork-Id: 148322
From: "Jason A. Donenfeld"
To: linux-kernel@vger.kernel.org, netdev@vger.kernel.org, davem@davemloft.net, gregkh@linuxfoundation.org
Cc: "Jason A. Donenfeld"
Subject: [PATCH net-next v7 28/28] net: WireGuard secure network tunnel
Date: Sat, 6 Oct 2018 04:57:09 +0200
Message-Id: <20181006025709.4019-29-Jason@zx2c4.com>
In-Reply-To: <20181006025709.4019-1-Jason@zx2c4.com>
References: <20181006025709.4019-1-Jason@zx2c4.com>

WireGuard is a layer 3 secure networking tunnel made specifically for the kernel, which aims to be much simpler and easier to audit than IPsec. Extensive documentation and description of the protocol and considerations, along with formal proofs of the cryptography, are available at:

  * https://www.wireguard.com/
  * https://www.wireguard.com/papers/wireguard.pdf

This commit implements WireGuard as a simple network device driver, accessible in the usual RTNL way used by virtual network drivers. It makes use of the udp_tunnel APIs, GRO, GSO, NAPI, and the usual set of networking subsystem APIs. It has a somewhat novel multicore queueing system designed for maximum throughput and minimal latency of encryption operations, but it is implemented modestly using workqueues and NAPI. Configuration is done via generic Netlink, and following a review from the Netlink maintainer a year ago, several high-profile userspace projects have already implemented the API.
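For orientation, a minimal configuration session might look like the following rough sketch; the interface name, key file, peer key, addresses, and endpoint are all placeholders, and wg(8) is the userspace tool described in the file layout below:

  # Create the virtual interface via RTNL, like any other virtual driver:
  ip link add dev wg0 type wireguard
  # Generate a private key and configure it, plus one peer, through the
  # generic Netlink API by way of the wg(8) tool:
  wg genkey > private.key
  wg set wg0 private-key ./private.key
  wg set wg0 peer <peer-public-key> allowed-ips 192.168.4.0/24 \
     endpoint 203.0.113.1:51820
  # Address and activate it like a normal interface:
  ip address add 192.168.4.2/24 dev wg0
  ip link set wg0 up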
This commit also comes with several different tests, both in-kernel tests and out-of-kernel tests based on network namespaces, taking advantage of the fact that sockets used by WireGuard intentionally stay in the namespace in which the WireGuard interface was originally created, exactly like the semantics of userspace tun devices; a short sketch of this behavior follows the file layout below. See wireguard.com/netns/ for pictures and examples.

The source code is fairly short, but rather than combining everything into a single file, WireGuard is developed as cleanly separable files, making auditing and comprehension easier. Things are laid out as follows:

  * noise.[ch], cookie.[ch], messages.h: These implement the bulk of the cryptographic aspects of the protocol, and are mostly data-only in nature, taking in buffers of bytes and spitting out buffers of bytes. They also handle reference counting for their various shared pieces of data, like keys and key lists.

  * ratelimiter.[ch]: Used as an integral part of cookie.[ch] for ratelimiting certain types of cryptographic operations in accordance with particular WireGuard semantics.

  * allowedips.[ch], hashtables.[ch]: The main lookup structures of WireGuard, the former being trie-like with particular semantics, an integral part of the design of the protocol, and the latter just being nice helper functions around the specific hashtables we use.

  * device.[ch]: Implementation of functions for the netdevice and for rtnl, responsible for maintaining the life of a given interface and wiring it up to the rest of WireGuard.

  * peer.[ch]: Each interface has a list of peers, with helper functions available here for creation, destruction, and reference counting.

  * socket.[ch]: Implementation of functions related to udp_socket and the general set of kernel socket APIs, for sending and receiving ciphertext UDP packets, and taking care of WireGuard-specific sticky socket routing semantics for the automatic roaming.

  * netlink.[ch]: Userspace API entry point for configuring WireGuard peers and devices. The API has been implemented by several userspace tools and network management utilities, and the WireGuard project distributes the basic wg(8) tool.

  * queueing.[ch]: Shared functions on the rx and tx paths for handling the various queues used in the multicore algorithms.

  * send.c: Handles encrypting outgoing packets in parallel on multiple cores, before sending them in order on a single core, via workqueues and ring buffers. Also handles sending handshake and cookie messages as part of the protocol, in parallel.

  * receive.c: Handles decrypting incoming packets in parallel on multiple cores, before passing them off in order to be ingested via the rest of the networking subsystem with GRO via the typical NAPI poll function. Also handles receiving handshake and cookie messages as part of the protocol, in parallel.

  * timers.[ch]: Uses the timer wheel to implement the protocol's particular event timeouts, and gives a set of very simple event-driven entry point functions for callers.

  * main.c, version.h: Initialization and deinitialization of the module.

  * selftest/*.h: Runtime unit tests for some of the most security-sensitive functions.

  * tools/testing/selftests/wireguard/netns.sh: Aforementioned testing script using network namespaces.

This commit aims to be as self-contained as possible, implementing WireGuard as a standalone module not needing much special handling or coordination from the network subsystem.
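To make those namespace semantics concrete, here is a rough sketch in the spirit of wireguard.com/netns/; names, keys, and addresses are again placeholders:

  # Create wg0 in the initial namespace, so its UDP socket lives there:
  ip link add dev wg0 type wireguard
  wg set wg0 private-key ./private.key \
     peer <peer-public-key> allowed-ips 0.0.0.0/0 endpoint 203.0.113.1:51820
  # Move only the interface into a fresh namespace; encrypted packets still
  # enter and leave through the namespace where the socket was created:
  ip netns add container
  ip link set wg0 netns container
  ip netns exec container ip address add 192.168.4.2/32 dev wg0
  ip netns exec container ip link set wg0 up
  ip netns exec container ip route add default dev wg0

The netns.sh script mentioned above automates variations of this arrangement.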
I expect future optimizations to the network stack to positively improve WireGuard, and vice-versa, but for the time being, this exists as intentionally standalone. We introduce a menu option for CONFIG_WIREGUARD, as well as a verbose debug log and self-tests via CONFIG_WIREGUARD_DEBUG.

Signed-off-by: Jason A. Donenfeld
Cc: David Miller
Cc: Greg KH
---
MAINTAINERS | 8 + drivers/net/Kconfig | 30 + drivers/net/Makefile | 1 + drivers/net/wireguard/Makefile | 18 + drivers/net/wireguard/allowedips.c | 405 ++++++++++ drivers/net/wireguard/allowedips.h | 56 ++ drivers/net/wireguard/cookie.c | 235 ++++++ drivers/net/wireguard/cookie.h | 59 ++ drivers/net/wireguard/device.c | 439 +++++++++++ drivers/net/wireguard/device.h | 65 ++ drivers/net/wireguard/hashtables.c | 209 +++++ drivers/net/wireguard/hashtables.h | 64 ++ drivers/net/wireguard/main.c | 65 ++ drivers/net/wireguard/messages.h | 128 +++ drivers/net/wireguard/netlink.c | 606 ++++++++++++++ drivers/net/wireguard/netlink.h | 12 + drivers/net/wireguard/noise.c | 786 +++++++++++++++++++ drivers/net/wireguard/noise.h | 130 +++ drivers/net/wireguard/peer.c | 194 +++++ drivers/net/wireguard/peer.h | 87 ++ drivers/net/wireguard/queueing.c | 52 ++ drivers/net/wireguard/queueing.h | 193 +++++ drivers/net/wireguard/ratelimiter.c | 220 ++++++ drivers/net/wireguard/ratelimiter.h | 19 + drivers/net/wireguard/receive.c | 596 ++++++++++++++ drivers/net/wireguard/selftest/allowedips.c | 679 ++++++++++++++++ drivers/net/wireguard/selftest/counter.c | 103 +++ drivers/net/wireguard/selftest/ratelimiter.c | 178 +++++ drivers/net/wireguard/send.c | 421 ++++++++++ drivers/net/wireguard/socket.c | 432 ++++++++++ drivers/net/wireguard/socket.h | 44 ++ drivers/net/wireguard/timers.c | 256 ++++++ drivers/net/wireguard/timers.h | 31 + drivers/net/wireguard/version.h | 1 + include/uapi/linux/wireguard.h | 190 +++++ tools/testing/selftests/wireguard/netns.sh | 499 ++++++++++++ 36 files changed, 7511 insertions(+) create mode 100644 drivers/net/wireguard/Makefile create mode 100644 drivers/net/wireguard/allowedips.c create mode 100644 drivers/net/wireguard/allowedips.h create mode 100644 drivers/net/wireguard/cookie.c create mode 100644 drivers/net/wireguard/cookie.h create mode 100644 drivers/net/wireguard/device.c create mode 100644 drivers/net/wireguard/device.h create mode 100644 drivers/net/wireguard/hashtables.c create mode 100644 drivers/net/wireguard/hashtables.h create mode 100644 drivers/net/wireguard/main.c create mode 100644 drivers/net/wireguard/messages.h create mode 100644 drivers/net/wireguard/netlink.c create mode 100644 drivers/net/wireguard/netlink.h create mode 100644 drivers/net/wireguard/noise.c create mode 100644 drivers/net/wireguard/noise.h create mode 100644 drivers/net/wireguard/peer.c create mode 100644 drivers/net/wireguard/peer.h create mode 100644 drivers/net/wireguard/queueing.c create mode 100644 drivers/net/wireguard/queueing.h create mode 100644 drivers/net/wireguard/ratelimiter.c create mode 100644 drivers/net/wireguard/ratelimiter.h create mode 100644 drivers/net/wireguard/receive.c create mode 100644 drivers/net/wireguard/selftest/allowedips.c create mode 100644 drivers/net/wireguard/selftest/counter.c create mode 100644 drivers/net/wireguard/selftest/ratelimiter.c create mode 100644 drivers/net/wireguard/send.c create mode 100644 drivers/net/wireguard/socket.c create mode 100644 drivers/net/wireguard/socket.h create mode 100644 drivers/net/wireguard/timers.c create mode 100644 drivers/net/wireguard/timers.h create
mode 100644 drivers/net/wireguard/version.h create mode 100644 include/uapi/linux/wireguard.h create mode 100755 tools/testing/selftests/wireguard/netns.sh -- 2.19.0 diff --git a/MAINTAINERS b/MAINTAINERS index ab349f7e8d53..ec54ebf361d7 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -15845,6 +15845,14 @@ L: linux-gpio@vger.kernel.org S: Maintained F: drivers/gpio/gpio-ws16c48.c +WIREGUARD SECURE NETWORK TUNNEL +M: Jason A. Donenfeld +S: Maintained +F: drivers/net/wireguard/ +F: tools/testing/selftests/wireguard/ +L: wireguard@lists.zx2c4.com +L: netdev@vger.kernel.org + WISTRON LAPTOP BUTTON DRIVER M: Miloslav Trmac S: Maintained diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig index d03775100f7d..d4b2568f86f9 100644 --- a/drivers/net/Kconfig +++ b/drivers/net/Kconfig @@ -70,6 +70,36 @@ config DUMMY To compile this driver as a module, choose M here: the module will be called dummy. +config WIREGUARD + tristate "WireGuard secure network tunnel" + depends on NET && INET + depends on IPV6 || !IPV6 + select NET_UDP_TUNNEL + select DST_CACHE + select ZINC_CHACHA20POLY1305 + select ZINC_BLAKE2S + select ZINC_CURVE25519 + help + WireGuard is a secure, fast, and easy to use replacement for IPSec + that uses modern cryptography and clever networking tricks. It's + designed to be fairly general purpose and abstract enough to fit most + use cases, while at the same time remaining extremely simple to + configure. See www.wireguard.com for more info. + + It's safe to say Y or M here, as the driver is very lightweight and + is only in use when an administrator chooses to add an interface. + +config WIREGUARD_DEBUG + bool "Debugging checks and verbose messages" + depends on WIREGUARD + help + This will write log messages for handshake and other events + that occur for a WireGuard interface. It will also perform some + extra validation checks and unit tests at various points. This is + only useful for debugging. + + Say N here unless you know what you're doing. + config EQUALIZER tristate "EQL (serial line load balancing) support" ---help--- diff --git a/drivers/net/Makefile b/drivers/net/Makefile index 21cde7e78621..f0acd11a143d 100644 --- a/drivers/net/Makefile +++ b/drivers/net/Makefile @@ -24,6 +24,7 @@ obj-$(CONFIG_RIONET) += rionet.o obj-$(CONFIG_NET_TEAM) += team/ obj-$(CONFIG_TUN) += tun.o obj-$(CONFIG_TAP) += tap.o +obj-$(CONFIG_WIREGUARD) += wireguard/ obj-$(CONFIG_VETH) += veth.o obj-$(CONFIG_VIRTIO_NET) += virtio_net.o obj-$(CONFIG_VXLAN) += vxlan.o diff --git a/drivers/net/wireguard/Makefile b/drivers/net/wireguard/Makefile new file mode 100644 index 000000000000..d8856255bc9d --- /dev/null +++ b/drivers/net/wireguard/Makefile @@ -0,0 +1,18 @@ +ccflags-y := -O3 +ccflags-y += -D'pr_fmt(fmt)=KBUILD_MODNAME ": " fmt' +ccflags-$(CONFIG_WIREGUARD_DEBUG) += -DDEBUG +wireguard-y := main.o +wireguard-y += noise.o +wireguard-y += device.o +wireguard-y += peer.o +wireguard-y += timers.o +wireguard-y += queueing.o +wireguard-y += send.o +wireguard-y += receive.o +wireguard-y += socket.o +wireguard-y += hashtables.o +wireguard-y += allowedips.o +wireguard-y += ratelimiter.o +wireguard-y += cookie.o +wireguard-y += netlink.o +obj-$(CONFIG_WIREGUARD) := wireguard.o diff --git a/drivers/net/wireguard/allowedips.c b/drivers/net/wireguard/allowedips.c new file mode 100644 index 000000000000..60c7723ef5be --- /dev/null +++ b/drivers/net/wireguard/allowedips.c @@ -0,0 +1,405 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (C) 2015-2018 Jason A. Donenfeld . All Rights Reserved. 
+ */ + +#include "allowedips.h" +#include "peer.h" + +struct allowedips_node { + struct wireguard_peer __rcu *peer; + struct rcu_head rcu; + struct allowedips_node __rcu *bit[2]; + /* While it may seem scandalous that we waste space for v4, + * we're alloc'ing to the nearest power of 2 anyway, so this + * doesn't actually make a difference. + */ + u8 bits[16] __aligned(__alignof(u64)); + u8 cidr, bit_at_a, bit_at_b; +}; + +static __always_inline void swap_endian(u8 *dst, const u8 *src, u8 bits) +{ + if (bits == 32) + *(u32 *)dst = be32_to_cpu(*(const __be32 *)src); + else if (bits == 128) { + ((u64 *)dst)[0] = be64_to_cpu(((const __be64 *)src)[0]); + ((u64 *)dst)[1] = be64_to_cpu(((const __be64 *)src)[1]); + } +} + +static void copy_and_assign_cidr(struct allowedips_node *node, const u8 *src, + u8 cidr, u8 bits) +{ + node->cidr = cidr; + node->bit_at_a = cidr / 8U; +#ifdef __LITTLE_ENDIAN + node->bit_at_a ^= (bits / 8U - 1U) % 8U; +#endif + node->bit_at_b = 7U - (cidr % 8U); + memcpy(node->bits, src, bits / 8U); +} + +#define choose_node(parent, key) \ + parent->bit[(key[parent->bit_at_a] >> parent->bit_at_b) & 1] + +static void node_free_rcu(struct rcu_head *rcu) +{ + kfree(container_of(rcu, struct allowedips_node, rcu)); +} + +#define push_rcu(stack, p, len) ({ \ + if (rcu_access_pointer(p)) { \ + WARN_ON(IS_ENABLED(DEBUG) && len >= 128); \ + stack[len++] = rcu_dereference_raw(p); \ + } \ + true; \ + }) +static void root_free_rcu(struct rcu_head *rcu) +{ + struct allowedips_node *node, *stack[128] = { + container_of(rcu, struct allowedips_node, rcu) }; + unsigned int len = 1; + + while (len > 0 && (node = stack[--len]) && + push_rcu(stack, node->bit[0], len) && + push_rcu(stack, node->bit[1], len)) + kfree(node); +} + +static int +walk_by_peer(struct allowedips_node __rcu *top, u8 bits, + struct allowedips_cursor *cursor, struct wireguard_peer *peer, + int (*func)(void *ctx, const u8 *ip, u8 cidr, int family), + void *ctx, struct mutex *lock) +{ + const int address_family = bits == 32 ? 
AF_INET : AF_INET6; + u8 ip[16] __aligned(__alignof(u64)); + struct allowedips_node *node; + int ret; + + if (!rcu_access_pointer(top)) + return 0; + + if (!cursor->len) + push_rcu(cursor->stack, top, cursor->len); + + for (; cursor->len > 0 && (node = cursor->stack[cursor->len - 1]); + --cursor->len, push_rcu(cursor->stack, node->bit[0], cursor->len), + push_rcu(cursor->stack, node->bit[1], cursor->len)) { + const unsigned int cidr_bytes = DIV_ROUND_UP(node->cidr, 8U); + + if (rcu_dereference_protected(node->peer, + lockdep_is_held(lock)) != peer) + continue; + + swap_endian(ip, node->bits, bits); + memset(ip + cidr_bytes, 0, bits / 8U - cidr_bytes); + if (node->cidr) + ip[cidr_bytes - 1U] &= ~0U << (-node->cidr % 8U); + + ret = func(ctx, ip, node->cidr, address_family); + if (ret) + return ret; + } + return 0; +} +#undef push_rcu + +#define ref(p) rcu_access_pointer(p) +#define deref(p) rcu_dereference_protected(*p, lockdep_is_held(lock)) +#define push(p) ({ \ + WARN_ON(IS_ENABLED(DEBUG) && len >= 128); \ + stack[len++] = p; \ + }) +static void walk_remove_by_peer(struct allowedips_node __rcu **top, + struct wireguard_peer *peer, struct mutex *lock) +{ + struct allowedips_node __rcu **stack[128], **nptr; + struct allowedips_node *node, *prev; + unsigned int len; + + if (unlikely(!peer || !ref(*top))) + return; + + for (prev = NULL, len = 0, push(top); len > 0; prev = node) { + nptr = stack[len - 1]; + node = deref(nptr); + if (!node) { + --len; + continue; + } + if (!prev || ref(prev->bit[0]) == node || + ref(prev->bit[1]) == node) { + if (ref(node->bit[0])) + push(&node->bit[0]); + else if (ref(node->bit[1])) + push(&node->bit[1]); + } else if (ref(node->bit[0]) == prev) { + if (ref(node->bit[1])) + push(&node->bit[1]); + } else { + if (rcu_dereference_protected(node->peer, + lockdep_is_held(lock)) == peer) { + RCU_INIT_POINTER(node->peer, NULL); + if (!node->bit[0] || !node->bit[1]) { + rcu_assign_pointer(*nptr, + deref(&node->bit[!ref(node->bit[0])])); + call_rcu_bh(&node->rcu, node_free_rcu); + node = deref(nptr); + } + } + --len; + } + } +} +#undef ref +#undef deref +#undef push + +static __always_inline unsigned int fls128(u64 a, u64 b) +{ + return a ? fls64(a) + 64U : fls64(b); +} + +static __always_inline u8 common_bits(const struct allowedips_node *node, + const u8 *key, u8 bits) +{ + if (bits == 32) + return 32U - fls(*(const u32 *)node->bits ^ *(const u32 *)key); + else if (bits == 128) + return 128U - fls128( + *(const u64 *)&node->bits[0] ^ *(const u64 *)&key[0], + *(const u64 *)&node->bits[8] ^ *(const u64 *)&key[8]); + return 0; +} + +/* This could be much faster if it actually just compared the common bits + * properly, by precomputing a mask bswap(~0 << (32 - cidr)), and the rest, but + * it turns out that common_bits is already super fast on modern processors, + * even taking into account the unfortunate bswap. So, we just inline it like + * this instead. 
+ */ +#define prefix_matches(node, key, bits) \ + (common_bits(node, key, bits) >= node->cidr) + +static __always_inline struct allowedips_node * +find_node(struct allowedips_node *trie, u8 bits, const u8 *key) +{ + struct allowedips_node *node = trie, *found = NULL; + + while (node && prefix_matches(node, key, bits)) { + if (rcu_access_pointer(node->peer)) + found = node; + if (node->cidr == bits) + break; + node = rcu_dereference_bh(choose_node(node, key)); + } + return found; +} + +/* Returns a strong reference to a peer */ +static __always_inline struct wireguard_peer * +lookup(struct allowedips_node __rcu *root, u8 bits, const void *be_ip) +{ + u8 ip[16] __aligned(__alignof(u64)); + struct wireguard_peer *peer = NULL; + struct allowedips_node *node; + + swap_endian(ip, be_ip, bits); + + rcu_read_lock_bh(); +retry: + node = find_node(rcu_dereference_bh(root), bits, ip); + if (node) { + peer = wg_peer_get_maybe_zero(rcu_dereference_bh(node->peer)); + if (!peer) + goto retry; + } + rcu_read_unlock_bh(); + return peer; +} + +__attribute__((nonnull(1))) static bool +node_placement(struct allowedips_node __rcu *trie, const u8 *key, u8 cidr, + u8 bits, struct allowedips_node **rnode, struct mutex *lock) +{ + struct allowedips_node *node = rcu_dereference_protected(trie, + lockdep_is_held(lock)); + struct allowedips_node *parent = NULL; + bool exact = false; + + while (node && node->cidr <= cidr && prefix_matches(node, key, bits)) { + parent = node; + if (parent->cidr == cidr) { + exact = true; + break; + } + node = rcu_dereference_protected(choose_node(parent, key), + lockdep_is_held(lock)); + } + *rnode = parent; + return exact; +} + +static int add(struct allowedips_node __rcu **trie, u8 bits, const u8 *be_key, + u8 cidr, struct wireguard_peer *peer, struct mutex *lock) +{ + struct allowedips_node *node, *parent, *down, *newnode; + u8 key[16] __aligned(__alignof(u64)); + + if (unlikely(cidr > bits || !peer)) + return -EINVAL; + + swap_endian(key, be_key, bits); + + if (!rcu_access_pointer(*trie)) { + node = kzalloc(sizeof(*node), GFP_KERNEL); + if (unlikely(!node)) + return -ENOMEM; + RCU_INIT_POINTER(node->peer, peer); + copy_and_assign_cidr(node, key, cidr, bits); + rcu_assign_pointer(*trie, node); + return 0; + } + if (node_placement(*trie, key, cidr, bits, &node, lock)) { + rcu_assign_pointer(node->peer, peer); + return 0; + } + + newnode = kzalloc(sizeof(*newnode), GFP_KERNEL); + if (unlikely(!newnode)) + return -ENOMEM; + RCU_INIT_POINTER(newnode->peer, peer); + copy_and_assign_cidr(newnode, key, cidr, bits); + + if (!node) + down = rcu_dereference_protected(*trie, lockdep_is_held(lock)); + else { + down = rcu_dereference_protected(choose_node(node, key), + lockdep_is_held(lock)); + if (!down) { + rcu_assign_pointer(choose_node(node, key), newnode); + return 0; + } + } + cidr = min(cidr, common_bits(down, key, bits)); + parent = node; + + if (newnode->cidr == cidr) { + rcu_assign_pointer(choose_node(newnode, down->bits), down); + if (!parent) + rcu_assign_pointer(*trie, newnode); + else + rcu_assign_pointer(choose_node(parent, newnode->bits), + newnode); + } else { + node = kzalloc(sizeof(*node), GFP_KERNEL); + if (unlikely(!node)) { + kfree(newnode); + return -ENOMEM; + } + copy_and_assign_cidr(node, newnode->bits, cidr, bits); + + rcu_assign_pointer(choose_node(node, down->bits), down); + rcu_assign_pointer(choose_node(node, newnode->bits), newnode); + if (!parent) + rcu_assign_pointer(*trie, node); + else + rcu_assign_pointer(choose_node(parent, node->bits), + node); + } + return 
0; +} + +void wg_allowedips_init(struct allowedips *table) +{ + table->root4 = table->root6 = NULL; + table->seq = 1; +} + +void wg_allowedips_free(struct allowedips *table, struct mutex *lock) +{ + struct allowedips_node __rcu *old4 = table->root4, *old6 = table->root6; + ++table->seq; + RCU_INIT_POINTER(table->root4, NULL); + RCU_INIT_POINTER(table->root6, NULL); + if (rcu_access_pointer(old4)) + call_rcu_bh(&rcu_dereference_protected(old4, + lockdep_is_held(lock))->rcu, root_free_rcu); + if (rcu_access_pointer(old6)) + call_rcu_bh(&rcu_dereference_protected(old6, + lockdep_is_held(lock))->rcu, root_free_rcu); +} + +int wg_allowedips_insert_v4(struct allowedips *table, const struct in_addr *ip, + u8 cidr, struct wireguard_peer *peer, + struct mutex *lock) +{ + ++table->seq; + return add(&table->root4, 32, (const u8 *)ip, cidr, peer, lock); +} + +int wg_allowedips_insert_v6(struct allowedips *table, const struct in6_addr *ip, + u8 cidr, struct wireguard_peer *peer, + struct mutex *lock) +{ + ++table->seq; + return add(&table->root6, 128, (const u8 *)ip, cidr, peer, lock); +} + +void wg_allowedips_remove_by_peer(struct allowedips *table, + struct wireguard_peer *peer, + struct mutex *lock) +{ + ++table->seq; + walk_remove_by_peer(&table->root4, peer, lock); + walk_remove_by_peer(&table->root6, peer, lock); +} + +int wg_allowedips_walk_by_peer(struct allowedips *table, + struct allowedips_cursor *cursor, + struct wireguard_peer *peer, + int (*func)(void *ctx, const u8 *ip, u8 cidr, int family), + void *ctx, struct mutex *lock) +{ + int ret; + + if (!cursor->seq) + cursor->seq = table->seq; + else if (cursor->seq != table->seq) + return 0; + + if (!cursor->second_half) { + ret = walk_by_peer(table->root4, 32, cursor, peer, func, ctx, lock); + if (ret) + return ret; + cursor->len = 0; + cursor->second_half = true; + } + return walk_by_peer(table->root6, 128, cursor, peer, func, ctx, lock); +} + +/* Returns a strong reference to a peer */ +struct wireguard_peer *wg_allowedips_lookup_dst(struct allowedips *table, + struct sk_buff *skb) +{ + if (skb->protocol == htons(ETH_P_IP)) + return lookup(table->root4, 32, &ip_hdr(skb)->daddr); + else if (skb->protocol == htons(ETH_P_IPV6)) + return lookup(table->root6, 128, &ipv6_hdr(skb)->daddr); + return NULL; +} + +/* Returns a strong reference to a peer */ +struct wireguard_peer *wg_allowedips_lookup_src(struct allowedips *table, + struct sk_buff *skb) +{ + if (skb->protocol == htons(ETH_P_IP)) + return lookup(table->root4, 32, &ip_hdr(skb)->saddr); + else if (skb->protocol == htons(ETH_P_IPV6)) + return lookup(table->root6, 128, &ipv6_hdr(skb)->saddr); + return NULL; +} + +#include "selftest/allowedips.c" diff --git a/drivers/net/wireguard/allowedips.h b/drivers/net/wireguard/allowedips.h new file mode 100644 index 000000000000..c34e2165fc65 --- /dev/null +++ b/drivers/net/wireguard/allowedips.h @@ -0,0 +1,56 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (C) 2015-2018 Jason A. Donenfeld . All Rights Reserved. 
+ */ + +#ifndef _WG_ALLOWEDIPS_H +#define _WG_ALLOWEDIPS_H + +#include +#include +#include + +struct wireguard_peer; +struct allowedips_node; + +struct allowedips { + struct allowedips_node __rcu *root4; + struct allowedips_node __rcu *root6; + u64 seq; +}; + +struct allowedips_cursor { + u64 seq; + struct allowedips_node *stack[128]; + unsigned int len; + bool second_half; +}; + +void wg_allowedips_init(struct allowedips *table); +void wg_allowedips_free(struct allowedips *table, struct mutex *mutex); +int wg_allowedips_insert_v4(struct allowedips *table, const struct in_addr *ip, + u8 cidr, struct wireguard_peer *peer, + struct mutex *lock); +int wg_allowedips_insert_v6(struct allowedips *table, const struct in6_addr *ip, + u8 cidr, struct wireguard_peer *peer, + struct mutex *lock); +void wg_allowedips_remove_by_peer(struct allowedips *table, + struct wireguard_peer *peer, + struct mutex *lock); +int wg_allowedips_walk_by_peer(struct allowedips *table, + struct allowedips_cursor *cursor, + struct wireguard_peer *peer, + int (*func)(void *ctx, const u8 *ip, u8 cidr, int family), + void *ctx, struct mutex *lock); + +/* These return a strong reference to a peer: */ +struct wireguard_peer *wg_allowedips_lookup_dst(struct allowedips *table, + struct sk_buff *skb); +struct wireguard_peer *wg_allowedips_lookup_src(struct allowedips *table, + struct sk_buff *skb); + +#ifdef DEBUG +bool wg_allowedips_selftest(void); +#endif + +#endif /* _WG_ALLOWEDIPS_H */ diff --git a/drivers/net/wireguard/cookie.c b/drivers/net/wireguard/cookie.c new file mode 100644 index 000000000000..3ac05e60d085 --- /dev/null +++ b/drivers/net/wireguard/cookie.c @@ -0,0 +1,235 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (C) 2015-2018 Jason A. Donenfeld . All Rights Reserved. 
+ */ + +#include "cookie.h" +#include "peer.h" +#include "device.h" +#include "messages.h" +#include "ratelimiter.h" +#include "timers.h" + +#include +#include + +#include +#include + +void wg_cookie_checker_init(struct cookie_checker *checker, + struct wireguard_device *wg) +{ + init_rwsem(&checker->secret_lock); + checker->secret_birthdate = ktime_get_boot_fast_ns(); + get_random_bytes(checker->secret, NOISE_HASH_LEN); + checker->device = wg; +} + +enum { COOKIE_KEY_LABEL_LEN = 8 }; +static const u8 mac1_key_label[COOKIE_KEY_LABEL_LEN] = "mac1----"; +static const u8 cookie_key_label[COOKIE_KEY_LABEL_LEN] = "cookie--"; + +static void precompute_key(u8 key[NOISE_SYMMETRIC_KEY_LEN], + const u8 pubkey[NOISE_PUBLIC_KEY_LEN], + const u8 label[COOKIE_KEY_LABEL_LEN]) +{ + struct blake2s_state blake; + + blake2s_init(&blake, NOISE_SYMMETRIC_KEY_LEN); + blake2s_update(&blake, label, COOKIE_KEY_LABEL_LEN); + blake2s_update(&blake, pubkey, NOISE_PUBLIC_KEY_LEN); + blake2s_final(&blake, key, NOISE_SYMMETRIC_KEY_LEN); +} + +/* Must hold peer->handshake.static_identity->lock */ +void wg_cookie_checker_precompute_device_keys(struct cookie_checker *checker) +{ + if (likely(checker->device->static_identity.has_identity)) { + precompute_key(checker->cookie_encryption_key, + checker->device->static_identity.static_public, + cookie_key_label); + precompute_key(checker->message_mac1_key, + checker->device->static_identity.static_public, + mac1_key_label); + } else { + memset(checker->cookie_encryption_key, 0, + NOISE_SYMMETRIC_KEY_LEN); + memset(checker->message_mac1_key, 0, NOISE_SYMMETRIC_KEY_LEN); + } +} + +void wg_cookie_checker_precompute_peer_keys(struct wireguard_peer *peer) +{ + precompute_key(peer->latest_cookie.cookie_decryption_key, + peer->handshake.remote_static, cookie_key_label); + precompute_key(peer->latest_cookie.message_mac1_key, + peer->handshake.remote_static, mac1_key_label); +} + +void wg_cookie_init(struct cookie *cookie) +{ + memset(cookie, 0, sizeof(*cookie)); + init_rwsem(&cookie->lock); +} + +static void compute_mac1(u8 mac1[COOKIE_LEN], const void *message, size_t len, + const u8 key[NOISE_SYMMETRIC_KEY_LEN]) +{ + len = len - sizeof(struct message_macs) + + offsetof(struct message_macs, mac1); + blake2s(mac1, message, key, COOKIE_LEN, len, NOISE_SYMMETRIC_KEY_LEN); +} + +static void compute_mac2(u8 mac2[COOKIE_LEN], const void *message, size_t len, + const u8 cookie[COOKIE_LEN]) +{ + len = len - sizeof(struct message_macs) + + offsetof(struct message_macs, mac2); + blake2s(mac2, message, cookie, COOKIE_LEN, len, COOKIE_LEN); +} + +static void make_cookie(u8 cookie[COOKIE_LEN], struct sk_buff *skb, + struct cookie_checker *checker) +{ + struct blake2s_state state; + + if (wg_birthdate_has_expired(checker->secret_birthdate, + COOKIE_SECRET_MAX_AGE)) { + down_write(&checker->secret_lock); + checker->secret_birthdate = ktime_get_boot_fast_ns(); + get_random_bytes(checker->secret, NOISE_HASH_LEN); + up_write(&checker->secret_lock); + } + + down_read(&checker->secret_lock); + + blake2s_init_key(&state, COOKIE_LEN, checker->secret, NOISE_HASH_LEN); + if (skb->protocol == htons(ETH_P_IP)) + blake2s_update(&state, (u8 *)&ip_hdr(skb)->saddr, + sizeof(struct in_addr)); + else if (skb->protocol == htons(ETH_P_IPV6)) + blake2s_update(&state, (u8 *)&ipv6_hdr(skb)->saddr, + sizeof(struct in6_addr)); + blake2s_update(&state, (u8 *)&udp_hdr(skb)->source, sizeof(__be16)); + blake2s_final(&state, cookie, COOKIE_LEN); + + up_read(&checker->secret_lock); +} + +enum cookie_mac_state 
wg_cookie_validate_packet(struct cookie_checker *checker, + struct sk_buff *skb, + bool check_cookie) +{ + struct message_macs *macs = (struct message_macs *) + (skb->data + skb->len - sizeof(*macs)); + enum cookie_mac_state ret; + u8 computed_mac[COOKIE_LEN]; + u8 cookie[COOKIE_LEN]; + + ret = INVALID_MAC; + compute_mac1(computed_mac, skb->data, skb->len, + checker->message_mac1_key); + if (crypto_memneq(computed_mac, macs->mac1, COOKIE_LEN)) + goto out; + + ret = VALID_MAC_BUT_NO_COOKIE; + + if (!check_cookie) + goto out; + + make_cookie(cookie, skb, checker); + + compute_mac2(computed_mac, skb->data, skb->len, cookie); + if (crypto_memneq(computed_mac, macs->mac2, COOKIE_LEN)) + goto out; + + ret = VALID_MAC_WITH_COOKIE_BUT_RATELIMITED; + if (!wg_ratelimiter_allow(skb, dev_net(checker->device->dev))) + goto out; + + ret = VALID_MAC_WITH_COOKIE; + +out: + return ret; +} + +void wg_cookie_add_mac_to_packet(void *message, size_t len, + struct wireguard_peer *peer) +{ + struct message_macs *macs = (struct message_macs *) + ((u8 *)message + len - sizeof(*macs)); + + down_write(&peer->latest_cookie.lock); + compute_mac1(macs->mac1, message, len, + peer->latest_cookie.message_mac1_key); + memcpy(peer->latest_cookie.last_mac1_sent, macs->mac1, COOKIE_LEN); + peer->latest_cookie.have_sent_mac1 = true; + up_write(&peer->latest_cookie.lock); + + down_read(&peer->latest_cookie.lock); + if (peer->latest_cookie.is_valid && + !wg_birthdate_has_expired(peer->latest_cookie.birthdate, + COOKIE_SECRET_MAX_AGE - COOKIE_SECRET_LATENCY)) + compute_mac2(macs->mac2, message, len, + peer->latest_cookie.cookie); + else + memset(macs->mac2, 0, COOKIE_LEN); + up_read(&peer->latest_cookie.lock); +} + +void wg_cookie_message_create(struct message_handshake_cookie *dst, + struct sk_buff *skb, __le32 index, + struct cookie_checker *checker) +{ + struct message_macs *macs = (struct message_macs *) + ((u8 *)skb->data + skb->len - sizeof(*macs)); + u8 cookie[COOKIE_LEN]; + + dst->header.type = cpu_to_le32(MESSAGE_HANDSHAKE_COOKIE); + dst->receiver_index = index; + get_random_bytes_wait(dst->nonce, COOKIE_NONCE_LEN); + + make_cookie(cookie, skb, checker); + xchacha20poly1305_encrypt(dst->encrypted_cookie, cookie, COOKIE_LEN, + macs->mac1, COOKIE_LEN, dst->nonce, + checker->cookie_encryption_key); +} + +void wg_cookie_message_consume(struct message_handshake_cookie *src, + struct wireguard_device *wg) +{ + struct wireguard_peer *peer = NULL; + u8 cookie[COOKIE_LEN]; + bool ret; + + if (unlikely(!wg_index_hashtable_lookup(&wg->index_hashtable, + INDEX_HASHTABLE_HANDSHAKE | + INDEX_HASHTABLE_KEYPAIR, + src->receiver_index, &peer))) + return; + + down_read(&peer->latest_cookie.lock); + if (unlikely(!peer->latest_cookie.have_sent_mac1)) { + up_read(&peer->latest_cookie.lock); + goto out; + } + ret = xchacha20poly1305_decrypt( + cookie, src->encrypted_cookie, sizeof(src->encrypted_cookie), + peer->latest_cookie.last_mac1_sent, COOKIE_LEN, src->nonce, + peer->latest_cookie.cookie_decryption_key); + up_read(&peer->latest_cookie.lock); + + if (ret) { + down_write(&peer->latest_cookie.lock); + memcpy(peer->latest_cookie.cookie, cookie, COOKIE_LEN); + peer->latest_cookie.birthdate = ktime_get_boot_fast_ns(); + peer->latest_cookie.is_valid = true; + peer->latest_cookie.have_sent_mac1 = false; + up_write(&peer->latest_cookie.lock); + } else + net_dbg_ratelimited("%s: Could not decrypt invalid cookie response\n", + wg->dev->name); + +out: + wg_peer_put(peer); +} diff --git a/drivers/net/wireguard/cookie.h 
b/drivers/net/wireguard/cookie.h new file mode 100644 index 000000000000..409093ff2522 --- /dev/null +++ b/drivers/net/wireguard/cookie.h @@ -0,0 +1,59 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (C) 2015-2018 Jason A. Donenfeld . All Rights Reserved. + */ + +#ifndef _WG_COOKIE_H +#define _WG_COOKIE_H + +#include "messages.h" +#include + +struct wireguard_peer; + +struct cookie_checker { + u8 secret[NOISE_HASH_LEN]; + u8 cookie_encryption_key[NOISE_SYMMETRIC_KEY_LEN]; + u8 message_mac1_key[NOISE_SYMMETRIC_KEY_LEN]; + u64 secret_birthdate; + struct rw_semaphore secret_lock; + struct wireguard_device *device; +}; + +struct cookie { + u64 birthdate; + bool is_valid; + u8 cookie[COOKIE_LEN]; + bool have_sent_mac1; + u8 last_mac1_sent[COOKIE_LEN]; + u8 cookie_decryption_key[NOISE_SYMMETRIC_KEY_LEN]; + u8 message_mac1_key[NOISE_SYMMETRIC_KEY_LEN]; + struct rw_semaphore lock; +}; + +enum cookie_mac_state { + INVALID_MAC, + VALID_MAC_BUT_NO_COOKIE, + VALID_MAC_WITH_COOKIE_BUT_RATELIMITED, + VALID_MAC_WITH_COOKIE +}; + +void wg_cookie_checker_init(struct cookie_checker *checker, + struct wireguard_device *wg); +void wg_cookie_checker_precompute_device_keys(struct cookie_checker *checker); +void wg_cookie_checker_precompute_peer_keys(struct wireguard_peer *peer); +void wg_cookie_init(struct cookie *cookie); + +enum cookie_mac_state wg_cookie_validate_packet(struct cookie_checker *checker, + struct sk_buff *skb, + bool check_cookie); +void wg_cookie_add_mac_to_packet(void *message, size_t len, + struct wireguard_peer *peer); + +void wg_cookie_message_create(struct message_handshake_cookie *src, + struct sk_buff *skb, __le32 index, + struct cookie_checker *checker); +void wg_cookie_message_consume(struct message_handshake_cookie *src, + struct wireguard_device *wg); + +#endif /* _WG_COOKIE_H */ diff --git a/drivers/net/wireguard/device.c b/drivers/net/wireguard/device.c new file mode 100644 index 000000000000..e1175c18b3c3 --- /dev/null +++ b/drivers/net/wireguard/device.c @@ -0,0 +1,439 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (C) 2015-2018 Jason A. Donenfeld . All Rights Reserved. + */ + +#include "queueing.h" +#include "socket.h" +#include "timers.h" +#include "device.h" +#include "ratelimiter.h" +#include "peer.h" +#include "messages.h" + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +static LIST_HEAD(device_list); + +static int open(struct net_device *dev) +{ + struct in_device *dev_v4 = __in_dev_get_rtnl(dev); + struct wireguard_device *wg = netdev_priv(dev); + struct inet6_dev *dev_v6 = __in6_dev_get(dev); + struct wireguard_peer *peer; + int ret; + + if (dev_v4) { + /* At some point we might put this check near the ip_rt_send_ + * redirect call of ip_forward in net/ipv4/ip_forward.c, similar + * to the current secpath check. 
+ */ + IN_DEV_CONF_SET(dev_v4, SEND_REDIRECTS, false); + IPV4_DEVCONF_ALL(dev_net(dev), SEND_REDIRECTS) = false; + } + if (dev_v6) + dev_v6->cnf.addr_gen_mode = IN6_ADDR_GEN_MODE_NONE; + + ret = wg_socket_init(wg, wg->incoming_port); + if (ret < 0) + return ret; + mutex_lock(&wg->device_update_lock); + list_for_each_entry (peer, &wg->peer_list, peer_list) { + wg_packet_send_staged_packets(peer); + if (peer->persistent_keepalive_interval) + wg_packet_send_keepalive(peer); + } + mutex_unlock(&wg->device_update_lock); + return 0; +} + +#if defined(CONFIG_PM_SLEEP) && !defined(CONFIG_ANDROID) +static int pm_notification(struct notifier_block *nb, unsigned long action, + void *data) +{ + struct wireguard_device *wg; + struct wireguard_peer *peer; + + if (action != PM_HIBERNATION_PREPARE && action != PM_SUSPEND_PREPARE) + return 0; + + rtnl_lock(); + list_for_each_entry (wg, &device_list, device_list) { + mutex_lock(&wg->device_update_lock); + list_for_each_entry (peer, &wg->peer_list, peer_list) { + wg_noise_handshake_clear(&peer->handshake); + wg_noise_keypairs_clear(&peer->keypairs); + if (peer->timers_enabled) + del_timer(&peer->timer_zero_key_material); + } + mutex_unlock(&wg->device_update_lock); + } + rtnl_unlock(); + rcu_barrier_bh(); + return 0; +} +static struct notifier_block pm_notifier = { .notifier_call = pm_notification }; +#endif + +static int stop(struct net_device *dev) +{ + struct wireguard_device *wg = netdev_priv(dev); + struct wireguard_peer *peer; + + mutex_lock(&wg->device_update_lock); + list_for_each_entry (peer, &wg->peer_list, peer_list) { + skb_queue_purge(&peer->staged_packet_queue); + wg_timers_stop(peer); + wg_noise_handshake_clear(&peer->handshake); + wg_noise_keypairs_clear(&peer->keypairs); + atomic64_set(&peer->last_sent_handshake, + ktime_get_boot_fast_ns() - + (u64)(REKEY_TIMEOUT + 1) * NSEC_PER_SEC); + } + mutex_unlock(&wg->device_update_lock); + skb_queue_purge(&wg->incoming_handshakes); + wg_socket_reinit(wg, NULL, NULL); + return 0; +} + +static netdev_tx_t xmit(struct sk_buff *skb, struct net_device *dev) +{ + struct wireguard_device *wg = netdev_priv(dev); + struct wireguard_peer *peer; + struct sk_buff *next; + struct sk_buff_head packets; + sa_family_t family; + u32 mtu; + int ret; + + if (unlikely(wg_skb_examine_untrusted_ip_hdr(skb) != skb->protocol)) { + ret = -EPROTONOSUPPORT; + net_dbg_ratelimited("%s: Invalid IP packet\n", dev->name); + goto err; + } + + peer = wg_allowedips_lookup_dst(&wg->peer_allowedips, skb); + if (unlikely(!peer)) { + ret = -ENOKEY; + if (skb->protocol == htons(ETH_P_IP)) + net_dbg_ratelimited("%s: No peer has allowed IPs matching %pI4\n", + dev->name, &ip_hdr(skb)->daddr); + else if (skb->protocol == htons(ETH_P_IPV6)) + net_dbg_ratelimited("%s: No peer has allowed IPs matching %pI6\n", + dev->name, &ipv6_hdr(skb)->daddr); + goto err; + } + + family = READ_ONCE(peer->endpoint.addr.sa_family); + if (unlikely(family != AF_INET && family != AF_INET6)) { + ret = -EDESTADDRREQ; + net_dbg_ratelimited("%s: No valid endpoint has been configured or discovered for peer %llu\n", + dev->name, peer->internal_id); + goto err_peer; + } + + mtu = skb_dst(skb) ? 
dst_mtu(skb_dst(skb)) : dev->mtu; + + __skb_queue_head_init(&packets); + if (!skb_is_gso(skb)) + skb->next = NULL; + else { + struct sk_buff *segs = skb_gso_segment(skb, 0); + + if (unlikely(IS_ERR(segs))) { + ret = PTR_ERR(segs); + goto err_peer; + } + dev_kfree_skb(skb); + skb = segs; + } + do { + next = skb->next; + skb->next = skb->prev = NULL; + + skb = skb_share_check(skb, GFP_ATOMIC); + if (unlikely(!skb)) + continue; + + /* We only need to keep the original dst around for icmp, + * so at this point we're in a position to drop it. + */ + skb_dst_drop(skb); + + PACKET_CB(skb)->mtu = mtu; + + __skb_queue_tail(&packets, skb); + } while ((skb = next) != NULL); + + spin_lock_bh(&peer->staged_packet_queue.lock); + /* If the queue is getting too big, we start removing the oldest packets + * until it's small again. We do this before adding the new packet, so + * we don't remove GSO segments that are in excess. + */ + while (skb_queue_len(&peer->staged_packet_queue) > MAX_STAGED_PACKETS) + dev_kfree_skb(__skb_dequeue(&peer->staged_packet_queue)); + skb_queue_splice_tail(&packets, &peer->staged_packet_queue); + spin_unlock_bh(&peer->staged_packet_queue.lock); + + wg_packet_send_staged_packets(peer); + + wg_peer_put(peer); + return NETDEV_TX_OK; + +err_peer: + wg_peer_put(peer); +err: + ++dev->stats.tx_errors; + if (skb->protocol == htons(ETH_P_IP)) + icmp_send(skb, ICMP_DEST_UNREACH, ICMP_HOST_UNREACH, 0); + else if (skb->protocol == htons(ETH_P_IPV6)) + icmpv6_send(skb, ICMPV6_DEST_UNREACH, ICMPV6_ADDR_UNREACH, 0); + kfree_skb(skb); + return ret; +} + +static const struct net_device_ops netdev_ops = { + .ndo_open = open, + .ndo_stop = stop, + .ndo_start_xmit = xmit, + .ndo_get_stats64 = ip_tunnel_get_stats64 +}; + +static void destruct(struct net_device *dev) +{ + struct wireguard_device *wg = netdev_priv(dev); + + rtnl_lock(); + list_del(&wg->device_list); + rtnl_unlock(); + mutex_lock(&wg->device_update_lock); + wg->incoming_port = 0; + wg_socket_reinit(wg, NULL, NULL); + wg_allowedips_free(&wg->peer_allowedips, &wg->device_update_lock); + /* The final references are cleared in the below calls to destroy_workqueue. */ + wg_peer_remove_all(wg); + destroy_workqueue(wg->handshake_receive_wq); + destroy_workqueue(wg->handshake_send_wq); + destroy_workqueue(wg->packet_crypt_wq); + wg_packet_queue_free(&wg->decrypt_queue, true); + wg_packet_queue_free(&wg->encrypt_queue, true); + rcu_barrier_bh(); /* Wait for all the peers to be actually freed. 
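+	 * (rcu_barrier_bh() blocks until every pending call_rcu_bh
+	 * callback has run, so the RCU-deferred peer and keypair frees
+	 * queued by wg_peer_remove_all above have finished before the
+	 * teardown continues.)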
*/ + wg_ratelimiter_uninit(); + memzero_explicit(&wg->static_identity, sizeof(wg->static_identity)); + skb_queue_purge(&wg->incoming_handshakes); + free_percpu(dev->tstats); + free_percpu(wg->incoming_handshakes_worker); + if (wg->have_creating_net_ref) + put_net(wg->creating_net); + mutex_unlock(&wg->device_update_lock); + + pr_debug("%s: Interface deleted\n", dev->name); + free_netdev(dev); +} + +static const struct device_type device_type = { .name = KBUILD_MODNAME }; + +static void setup(struct net_device *dev) +{ + struct wireguard_device *wg = netdev_priv(dev); + enum { WG_NETDEV_FEATURES = NETIF_F_HW_CSUM | NETIF_F_RXCSUM | + NETIF_F_SG | NETIF_F_GSO | + NETIF_F_GSO_SOFTWARE | NETIF_F_HIGHDMA }; + + dev->netdev_ops = &netdev_ops; + dev->hard_header_len = 0; + dev->addr_len = 0; + dev->needed_headroom = DATA_PACKET_HEAD_ROOM; + dev->needed_tailroom = noise_encrypted_len(MESSAGE_PADDING_MULTIPLE); + dev->type = ARPHRD_NONE; + dev->flags = IFF_POINTOPOINT | IFF_NOARP; + dev->priv_flags |= IFF_NO_QUEUE; + dev->features |= NETIF_F_LLTX; + dev->features |= WG_NETDEV_FEATURES; + dev->hw_features |= WG_NETDEV_FEATURES; + dev->hw_enc_features |= WG_NETDEV_FEATURES; + dev->mtu = ETH_DATA_LEN - MESSAGE_MINIMUM_LENGTH - + sizeof(struct udphdr) - + max(sizeof(struct ipv6hdr), sizeof(struct iphdr)); + + SET_NETDEV_DEVTYPE(dev, &device_type); + + /* We need to keep the dst around in case of icmp replies. */ + netif_keep_dst(dev); + + memset(wg, 0, sizeof(*wg)); + wg->dev = dev; +} + +static int newlink(struct net *src_net, struct net_device *dev, + struct nlattr *tb[], struct nlattr *data[], + struct netlink_ext_ack *extack) +{ + int ret = -ENOMEM; + struct wireguard_device *wg = netdev_priv(dev); + + wg->creating_net = src_net; + init_rwsem(&wg->static_identity.lock); + mutex_init(&wg->socket_update_lock); + mutex_init(&wg->device_update_lock); + skb_queue_head_init(&wg->incoming_handshakes); + wg_pubkey_hashtable_init(&wg->peer_hashtable); + wg_index_hashtable_init(&wg->index_hashtable); + wg_allowedips_init(&wg->peer_allowedips); + wg_cookie_checker_init(&wg->cookie_checker, wg); + INIT_LIST_HEAD(&wg->peer_list); + wg->device_update_gen = 1; + + dev->tstats = netdev_alloc_pcpu_stats(struct pcpu_sw_netstats); + if (!dev->tstats) + goto error_1; + + wg->incoming_handshakes_worker = + wg_packet_alloc_percpu_multicore_worker( + wg_packet_handshake_receive_worker, wg); + if (!wg->incoming_handshakes_worker) + goto error_2; + + wg->handshake_receive_wq = alloc_workqueue("wg-kex-%s", + WQ_CPU_INTENSIVE | WQ_FREEZABLE, 0, dev->name); + if (!wg->handshake_receive_wq) + goto error_3; + + wg->handshake_send_wq = alloc_workqueue("wg-kex-%s", + WQ_UNBOUND | WQ_FREEZABLE, 0, dev->name); + if (!wg->handshake_send_wq) + goto error_4; + + wg->packet_crypt_wq = alloc_workqueue("wg-crypt-%s", + WQ_CPU_INTENSIVE | WQ_MEM_RECLAIM, 0, dev->name); + if (!wg->packet_crypt_wq) + goto error_5; + + if (wg_packet_queue_init(&wg->encrypt_queue, wg_packet_encrypt_worker, + true, MAX_QUEUED_PACKETS) < 0) + goto error_6; + + if (wg_packet_queue_init(&wg->decrypt_queue, wg_packet_decrypt_worker, + true, MAX_QUEUED_PACKETS) < 0) + goto error_7; + + ret = wg_ratelimiter_init(); + if (ret < 0) + goto error_8; + + ret = register_netdevice(dev); + if (ret < 0) + goto error_9; + + list_add(&wg->device_list, &device_list); + + /* We wait until the end to assign priv_destructor, so that + * register_netdevice doesn't call it for us if it fails. 
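+	 * Until this assignment, the error_N labels below unwind by hand,
+	 * in reverse order, everything allocated above.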
+ */ + dev->priv_destructor = destruct; + + pr_debug("%s: Interface created\n", dev->name); + return ret; + +error_9: + wg_ratelimiter_uninit(); +error_8: + wg_packet_queue_free(&wg->decrypt_queue, true); +error_7: + wg_packet_queue_free(&wg->encrypt_queue, true); +error_6: + destroy_workqueue(wg->packet_crypt_wq); +error_5: + destroy_workqueue(wg->handshake_send_wq); +error_4: + destroy_workqueue(wg->handshake_receive_wq); +error_3: + free_percpu(wg->incoming_handshakes_worker); +error_2: + free_percpu(dev->tstats); +error_1: + return ret; +} + +static struct rtnl_link_ops link_ops __read_mostly = { + .kind = KBUILD_MODNAME, + .priv_size = sizeof(struct wireguard_device), + .setup = setup, + .newlink = newlink, +}; + +static int netdevice_notification(struct notifier_block *nb, + unsigned long action, void *data) +{ + struct net_device *dev = ((struct netdev_notifier_info *)data)->dev; + struct wireguard_device *wg = netdev_priv(dev); + + ASSERT_RTNL(); + + if (action != NETDEV_REGISTER || dev->netdev_ops != &netdev_ops) + return 0; + + if (dev_net(dev) == wg->creating_net && wg->have_creating_net_ref) { + put_net(wg->creating_net); + wg->have_creating_net_ref = false; + } else if (dev_net(dev) != wg->creating_net && + !wg->have_creating_net_ref) { + wg->have_creating_net_ref = true; + get_net(wg->creating_net); + } + return 0; +} + +static struct notifier_block netdevice_notifier = { + .notifier_call = netdevice_notification +}; + +int __init wg_device_init(void) +{ + int ret; + +#if defined(CONFIG_PM_SLEEP) && !defined(CONFIG_ANDROID) + ret = register_pm_notifier(&pm_notifier); + if (ret) + return ret; +#endif + + ret = register_netdevice_notifier(&netdevice_notifier); + if (ret) + goto error_pm; + + ret = rtnl_link_register(&link_ops); + if (ret) + goto error_netdevice; + + return 0; + +error_netdevice: + unregister_netdevice_notifier(&netdevice_notifier); +error_pm: +#if defined(CONFIG_PM_SLEEP) && !defined(CONFIG_ANDROID) + unregister_pm_notifier(&pm_notifier); +#endif + return ret; +} + +void wg_device_uninit(void) +{ + rtnl_link_unregister(&link_ops); + unregister_netdevice_notifier(&netdevice_notifier); +#if defined(CONFIG_PM_SLEEP) && !defined(CONFIG_ANDROID) + unregister_pm_notifier(&pm_notifier); +#endif + rcu_barrier_bh(); +} diff --git a/drivers/net/wireguard/device.h b/drivers/net/wireguard/device.h new file mode 100644 index 000000000000..2bd1429b5831 --- /dev/null +++ b/drivers/net/wireguard/device.h @@ -0,0 +1,65 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (C) 2015-2018 Jason A. Donenfeld . All Rights Reserved. 
+ */ + +#ifndef _WG_DEVICE_H +#define _WG_DEVICE_H + +#include "noise.h" +#include "allowedips.h" +#include "hashtables.h" +#include "cookie.h" + +#include +#include +#include +#include +#include +#include + +struct wireguard_device; + +struct multicore_worker { + void *ptr; + struct work_struct work; +}; + +struct crypt_queue { + struct ptr_ring ring; + union { + struct { + struct multicore_worker __percpu *worker; + int last_cpu; + }; + struct work_struct work; + }; +}; + +struct wireguard_device { + struct net_device *dev; + struct crypt_queue encrypt_queue, decrypt_queue; + struct sock __rcu *sock4, *sock6; + struct net *creating_net; + struct noise_static_identity static_identity; + struct workqueue_struct *handshake_receive_wq, *handshake_send_wq; + struct workqueue_struct *packet_crypt_wq; + struct sk_buff_head incoming_handshakes; + int incoming_handshake_cpu; + struct multicore_worker __percpu *incoming_handshakes_worker; + struct cookie_checker cookie_checker; + struct pubkey_hashtable peer_hashtable; + struct index_hashtable index_hashtable; + struct allowedips peer_allowedips; + struct mutex device_update_lock, socket_update_lock; + struct list_head device_list, peer_list; + unsigned int num_peers, device_update_gen; + u32 fwmark; + u16 incoming_port; + bool have_creating_net_ref; +}; + +int wg_device_init(void); +void wg_device_uninit(void); + +#endif /* _WG_DEVICE_H */ diff --git a/drivers/net/wireguard/hashtables.c b/drivers/net/wireguard/hashtables.c new file mode 100644 index 000000000000..6e5518bfbd23 --- /dev/null +++ b/drivers/net/wireguard/hashtables.c @@ -0,0 +1,209 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (C) 2015-2018 Jason A. Donenfeld . All Rights Reserved. + */ + +#include "hashtables.h" +#include "peer.h" +#include "noise.h" + +static struct hlist_head *pubkey_bucket(struct pubkey_hashtable *table, + const u8 pubkey[NOISE_PUBLIC_KEY_LEN]) +{ + /* siphash gives us a secure 64bit number based on a random key. Since + * the bits are uniformly distributed, we can then mask off to get the + * bits we need. 
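+	 *
+	 * For example: hashtables.h declares this table with 2^11 buckets,
+	 * so HASH_SIZE(table->hashtable) - 1 is 0x7ff, and the return
+	 * statement below reduces to:
+	 *
+	 *   u64 h = siphash(pubkey, NOISE_PUBLIC_KEY_LEN, &table->key);
+	 *   return &table->hashtable[h & 0x7ff];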
+ */
+ return &table->hashtable[
+ siphash(pubkey, NOISE_PUBLIC_KEY_LEN, &table->key) &
+ (HASH_SIZE(table->hashtable) - 1)];
+}
+
+void wg_pubkey_hashtable_init(struct pubkey_hashtable *table)
+{
+ get_random_bytes(&table->key, sizeof(table->key));
+ hash_init(table->hashtable);
+ mutex_init(&table->lock);
+}
+
+void wg_pubkey_hashtable_add(struct pubkey_hashtable *table,
+ struct wireguard_peer *peer)
+{
+ mutex_lock(&table->lock);
+ hlist_add_head_rcu(&peer->pubkey_hash,
+ pubkey_bucket(table, peer->handshake.remote_static));
+ mutex_unlock(&table->lock);
+}
+
+void wg_pubkey_hashtable_remove(struct pubkey_hashtable *table,
+ struct wireguard_peer *peer)
+{
+ mutex_lock(&table->lock);
+ hlist_del_init_rcu(&peer->pubkey_hash);
+ mutex_unlock(&table->lock);
+}
+
+/* Returns a strong reference to a peer */
+struct wireguard_peer *
+wg_pubkey_hashtable_lookup(struct pubkey_hashtable *table,
+ const u8 pubkey[NOISE_PUBLIC_KEY_LEN])
+{
+ struct wireguard_peer *iter_peer, *peer = NULL;
+
+ rcu_read_lock_bh();
+ hlist_for_each_entry_rcu_bh (iter_peer, pubkey_bucket(table, pubkey),
+ pubkey_hash) {
+ if (!memcmp(pubkey, iter_peer->handshake.remote_static,
+ NOISE_PUBLIC_KEY_LEN)) {
+ peer = iter_peer;
+ break;
+ }
+ }
+ peer = wg_peer_get_maybe_zero(peer);
+ rcu_read_unlock_bh();
+ return peer;
+}
+
+static struct hlist_head *index_bucket(struct index_hashtable *table,
+ const __le32 index)
+{
+ /* Since the indices are random and thus all bits are uniformly
+ * distributed, we can find an index's bucket simply by masking.
+ */
+ return &table->hashtable[(__force u32)index &
+ (HASH_SIZE(table->hashtable) - 1)];
+}
+
+void wg_index_hashtable_init(struct index_hashtable *table)
+{
+ hash_init(table->hashtable);
+ spin_lock_init(&table->lock);
+}
+
+/* At the moment, we limit ourselves to 2^20 total peers, which generally might
+ * amount to 2^20*3 items in this hashtable. The algorithm below works by
+ * picking a random number and testing it. We can see that these limits mean we
+ * usually succeed pretty quickly:
+ *
+ * >>> def calculation(tries, size):
+ * ... return (size / 2**32)**(tries - 1) * (1 - (size / 2**32))
+ * ...
+ * >>> calculation(1, 2**20 * 3)
+ * 0.999267578125
+ * >>> calculation(2, 2**20 * 3)
+ * 0.0007318854331970215
+ * >>> calculation(3, 2**20 * 3)
+ * 5.360489012673497e-07
+ * >>> calculation(4, 2**20 * 3)
+ * 3.9261394135792216e-10
+ *
+ * At the moment, we don't do any masking, so this algorithm isn't exactly
+ * constant time in either the random guessing or in the hash list lookup. We
+ * could require a minimum of 3 tries, which would successfully mask the
+ * guessing. This would not, however, help with the growing hash lengths, which
+ * is another thing to consider moving forward.
+ */
+
+__le32 wg_index_hashtable_insert(struct index_hashtable *table,
+ struct index_hashtable_entry *entry)
+{
+ struct index_hashtable_entry *existing_entry;
+
+ spin_lock_bh(&table->lock);
+ hlist_del_init_rcu(&entry->index_hash);
+ spin_unlock_bh(&table->lock);
+
+ rcu_read_lock_bh();
+
+search_unused_slot:
+ /* First we try to find an unused slot, randomly, while unlocked. */
+ entry->index = (__force __le32)get_random_u32();
+ hlist_for_each_entry_rcu_bh (existing_entry,
+ index_bucket(table, entry->index),
+ index_hash) {
+ if (existing_entry->index == entry->index)
+ /* If it's already in use, we continue searching. */
+ goto search_unused_slot;
+ }
+
+ /* Once we've found an unused slot, we lock it, and then double-check
+ * that nobody else stole it from us.
+ */
+ spin_lock_bh(&table->lock);
+ hlist_for_each_entry_rcu_bh (existing_entry,
+ index_bucket(table, entry->index),
+ index_hash) {
+ if (existing_entry->index == entry->index) {
+ spin_unlock_bh(&table->lock);
+ /* If it was stolen, we start over. */
+ goto search_unused_slot;
+ }
+ }
+ /* Otherwise, we know we have it exclusively (since we're locked),
+ * so we insert.
+ */
+ hlist_add_head_rcu(&entry->index_hash,
+ index_bucket(table, entry->index));
+ spin_unlock_bh(&table->lock);
+
+ rcu_read_unlock_bh();
+
+ return entry->index;
+}
+
+bool wg_index_hashtable_replace(struct index_hashtable *table,
+ struct index_hashtable_entry *old,
+ struct index_hashtable_entry *new)
+{
+ if (unlikely(hlist_unhashed(&old->index_hash)))
+ return false;
+ spin_lock_bh(&table->lock);
+ new->index = old->index;
+ hlist_replace_rcu(&old->index_hash, &new->index_hash);
+
+ /* Calling init here NULLs out index_hash, and in fact after this
+ * function returns, it's theoretically possible for this to get
+ * reinserted elsewhere. That means the RCU lookup below might either
+ * terminate early or jump between buckets, in which case the packet
+ * simply gets dropped, which isn't terrible.
+ */
+ INIT_HLIST_NODE(&old->index_hash);
+ spin_unlock_bh(&table->lock);
+ return true;
+}
+
+void wg_index_hashtable_remove(struct index_hashtable *table,
+ struct index_hashtable_entry *entry)
+{
+ spin_lock_bh(&table->lock);
+ hlist_del_init_rcu(&entry->index_hash);
+ spin_unlock_bh(&table->lock);
+}
+
+/* Returns a strong reference to an entry->peer */
+struct index_hashtable_entry *
+wg_index_hashtable_lookup(struct index_hashtable *table,
+ const enum index_hashtable_type type_mask,
+ const __le32 index, struct wireguard_peer **peer)
+{
+ struct index_hashtable_entry *iter_entry, *entry = NULL;
+
+ rcu_read_lock_bh();
+ hlist_for_each_entry_rcu_bh (iter_entry, index_bucket(table, index),
+ index_hash) {
+ if (iter_entry->index == index) {
+ if (likely(iter_entry->type & type_mask))
+ entry = iter_entry;
+ break;
+ }
+ }
+ if (likely(entry)) {
+ entry->peer = wg_peer_get_maybe_zero(entry->peer);
+ if (likely(entry->peer))
+ *peer = entry->peer;
+ else
+ entry = NULL;
+ }
+ rcu_read_unlock_bh();
+ return entry;
+}
diff --git a/drivers/net/wireguard/hashtables.h b/drivers/net/wireguard/hashtables.h
new file mode 100644
index 000000000000..8b855d79ec70
--- /dev/null
+++ b/drivers/net/wireguard/hashtables.h
@@ -0,0 +1,64 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2015-2018 Jason A. Donenfeld . All Rights Reserved.
+ */ + +#ifndef _WG_HASHTABLES_H +#define _WG_HASHTABLES_H + +#include "messages.h" + +#include +#include +#include + +struct wireguard_peer; + +struct pubkey_hashtable { + /* TODO: move to rhashtable */ + DECLARE_HASHTABLE(hashtable, 11); + siphash_key_t key; + struct mutex lock; +}; + +void wg_pubkey_hashtable_init(struct pubkey_hashtable *table); +void wg_pubkey_hashtable_add(struct pubkey_hashtable *table, + struct wireguard_peer *peer); +void wg_pubkey_hashtable_remove(struct pubkey_hashtable *table, + struct wireguard_peer *peer); +struct wireguard_peer * +wg_pubkey_hashtable_lookup(struct pubkey_hashtable *table, + const u8 pubkey[NOISE_PUBLIC_KEY_LEN]); + +struct index_hashtable { + /* TODO: move to rhashtable */ + DECLARE_HASHTABLE(hashtable, 13); + spinlock_t lock; +}; + +enum index_hashtable_type { + INDEX_HASHTABLE_HANDSHAKE = 1U << 0, + INDEX_HASHTABLE_KEYPAIR = 1U << 1 +}; + +struct index_hashtable_entry { + struct wireguard_peer *peer; + struct hlist_node index_hash; + enum index_hashtable_type type; + __le32 index; +}; + +void wg_index_hashtable_init(struct index_hashtable *table); +__le32 wg_index_hashtable_insert(struct index_hashtable *table, + struct index_hashtable_entry *entry); +bool wg_index_hashtable_replace(struct index_hashtable *table, + struct index_hashtable_entry *old, + struct index_hashtable_entry *new); +void wg_index_hashtable_remove(struct index_hashtable *table, + struct index_hashtable_entry *entry); +struct index_hashtable_entry * +wg_index_hashtable_lookup(struct index_hashtable *table, + const enum index_hashtable_type type_mask, + const __le32 index, struct wireguard_peer **peer); + +#endif /* _WG_HASHTABLES_H */ diff --git a/drivers/net/wireguard/main.c b/drivers/net/wireguard/main.c new file mode 100644 index 000000000000..dbde274f235f --- /dev/null +++ b/drivers/net/wireguard/main.c @@ -0,0 +1,65 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (C) 2015-2018 Jason A. Donenfeld . All Rights Reserved. + */ + +#include "version.h" +#include "device.h" +#include "noise.h" +#include "queueing.h" +#include "ratelimiter.h" +#include "netlink.h" + +#include + +#include +#include +#include +#include +#include + +static int __init mod_init(void) +{ + int ret; + +#ifdef DEBUG + if (!wg_allowedips_selftest() || !wg_packet_counter_selftest() || + !wg_ratelimiter_selftest()) + return -ENOTRECOVERABLE; +#endif + wg_noise_init(); + + ret = wg_device_init(); + if (ret < 0) + goto err_device; + + ret = wg_genetlink_init(); + if (ret < 0) + goto err_netlink; + + pr_info("WireGuard " WIREGUARD_VERSION " loaded. See www.wireguard.com for information.\n"); + pr_info("Copyright (C) 2015-2018 Jason A. Donenfeld . All Rights Reserved.\n"); + + return 0; + +err_netlink: + wg_device_uninit(); +err_device: + return ret; +} + +static void __exit mod_exit(void) +{ + wg_genetlink_uninit(); + wg_device_uninit(); + pr_debug("WireGuard unloaded\n"); +} + +module_init(mod_init); +module_exit(mod_exit); +MODULE_LICENSE("GPL v2"); +MODULE_DESCRIPTION("Fast, modern, and secure VPN tunnel"); +MODULE_AUTHOR("Jason A. Donenfeld "); +MODULE_VERSION(WIREGUARD_VERSION); +MODULE_ALIAS_RTNL_LINK(KBUILD_MODNAME); +MODULE_ALIAS_GENL_FAMILY(WG_GENL_NAME); diff --git a/drivers/net/wireguard/messages.h b/drivers/net/wireguard/messages.h new file mode 100644 index 000000000000..090e6f09c7df --- /dev/null +++ b/drivers/net/wireguard/messages.h @@ -0,0 +1,128 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (C) 2015-2018 Jason A. Donenfeld . All Rights Reserved. 
+ */ + +#ifndef _WG_MESSAGES_H +#define _WG_MESSAGES_H + +#include +#include +#include + +#include +#include +#include + +enum noise_lengths { + NOISE_PUBLIC_KEY_LEN = CURVE25519_KEY_SIZE, + NOISE_SYMMETRIC_KEY_LEN = CHACHA20POLY1305_KEY_SIZE, + NOISE_TIMESTAMP_LEN = sizeof(u64) + sizeof(u32), + NOISE_AUTHTAG_LEN = CHACHA20POLY1305_AUTHTAG_SIZE, + NOISE_HASH_LEN = BLAKE2S_HASH_SIZE +}; + +#define noise_encrypted_len(plain_len) (plain_len + NOISE_AUTHTAG_LEN) + +enum cookie_values { + COOKIE_SECRET_MAX_AGE = 2 * 60, + COOKIE_SECRET_LATENCY = 5, + COOKIE_NONCE_LEN = XCHACHA20POLY1305_NONCE_SIZE, + COOKIE_LEN = 16 +}; + +enum counter_values { + COUNTER_BITS_TOTAL = 2048, + COUNTER_REDUNDANT_BITS = BITS_PER_LONG, + COUNTER_WINDOW_SIZE = COUNTER_BITS_TOTAL - COUNTER_REDUNDANT_BITS +}; + +enum limits { + REKEY_AFTER_MESSAGES = U64_MAX - 0xffff, + REJECT_AFTER_MESSAGES = U64_MAX - COUNTER_WINDOW_SIZE - 1, + REKEY_TIMEOUT = 5, + REKEY_TIMEOUT_JITTER_MAX_JIFFIES = HZ / 3, + REKEY_AFTER_TIME = 120, + REJECT_AFTER_TIME = 180, + INITIATIONS_PER_SECOND = 50, + MAX_PEERS_PER_DEVICE = 1U << 20, + KEEPALIVE_TIMEOUT = 10, + MAX_TIMER_HANDSHAKES = 90 / REKEY_TIMEOUT, + MAX_QUEUED_INCOMING_HANDSHAKES = 4096, /* TODO: replace this with DQL */ + MAX_STAGED_PACKETS = 128, + MAX_QUEUED_PACKETS = 1024 /* TODO: replace this with DQL */ +}; + +enum message_type { + MESSAGE_INVALID = 0, + MESSAGE_HANDSHAKE_INITIATION = 1, + MESSAGE_HANDSHAKE_RESPONSE = 2, + MESSAGE_HANDSHAKE_COOKIE = 3, + MESSAGE_DATA = 4 +}; + +struct message_header { + /* The actual layout of this that we want is: + * u8 type + * u8 reserved_zero[3] + * + * But it turns out that by encoding this as little endian, + * we achieve the same thing, and it makes checking faster. + */ + __le32 type; +}; + +struct message_macs { + u8 mac1[COOKIE_LEN]; + u8 mac2[COOKIE_LEN]; +}; + +struct message_handshake_initiation { + struct message_header header; + __le32 sender_index; + u8 unencrypted_ephemeral[NOISE_PUBLIC_KEY_LEN]; + u8 encrypted_static[noise_encrypted_len(NOISE_PUBLIC_KEY_LEN)]; + u8 encrypted_timestamp[noise_encrypted_len(NOISE_TIMESTAMP_LEN)]; + struct message_macs macs; +}; + +struct message_handshake_response { + struct message_header header; + __le32 sender_index; + __le32 receiver_index; + u8 unencrypted_ephemeral[NOISE_PUBLIC_KEY_LEN]; + u8 encrypted_nothing[noise_encrypted_len(0)]; + struct message_macs macs; +}; + +struct message_handshake_cookie { + struct message_header header; + __le32 receiver_index; + u8 nonce[COOKIE_NONCE_LEN]; + u8 encrypted_cookie[noise_encrypted_len(COOKIE_LEN)]; +}; + +struct message_data { + struct message_header header; + __le32 key_idx; + __le64 counter; + u8 encrypted_data[]; +}; + +#define message_data_len(plain_len) \ + (noise_encrypted_len(plain_len) + sizeof(struct message_data)) + +enum message_alignments { + MESSAGE_PADDING_MULTIPLE = 16, + MESSAGE_MINIMUM_LENGTH = message_data_len(0) +}; + +#define SKB_HEADER_LEN \ + (max(sizeof(struct iphdr), sizeof(struct ipv6hdr)) + \ + sizeof(struct udphdr) + NET_SKB_PAD) +#define DATA_PACKET_HEAD_ROOM \ + ALIGN(sizeof(struct message_data) + SKB_HEADER_LEN, 4) + +enum { HANDSHAKE_DSCP = 0x88 /* AF41, plus 00 ECN */ }; + +#endif /* _WG_MESSAGES_H */ diff --git a/drivers/net/wireguard/netlink.c b/drivers/net/wireguard/netlink.c new file mode 100644 index 000000000000..45827d8132eb --- /dev/null +++ b/drivers/net/wireguard/netlink.c @@ -0,0 +1,606 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (C) 2015-2018 Jason A. Donenfeld . All Rights Reserved. 
+ */ + +#include "netlink.h" +#include "device.h" +#include "peer.h" +#include "socket.h" +#include "queueing.h" +#include "messages.h" + +#include + +#include +#include +#include + +static struct genl_family genl_family; + +static const struct nla_policy device_policy[WGDEVICE_A_MAX + 1] = { + [WGDEVICE_A_IFINDEX] = { .type = NLA_U32 }, + [WGDEVICE_A_IFNAME] = { .type = NLA_NUL_STRING, .len = IFNAMSIZ - 1 }, + [WGDEVICE_A_PRIVATE_KEY] = { .len = NOISE_PUBLIC_KEY_LEN }, + [WGDEVICE_A_PUBLIC_KEY] = { .len = NOISE_PUBLIC_KEY_LEN }, + [WGDEVICE_A_FLAGS] = { .type = NLA_U32 }, + [WGDEVICE_A_LISTEN_PORT] = { .type = NLA_U16 }, + [WGDEVICE_A_FWMARK] = { .type = NLA_U32 }, + [WGDEVICE_A_PEERS] = { .type = NLA_NESTED } +}; + +static const struct nla_policy peer_policy[WGPEER_A_MAX + 1] = { + [WGPEER_A_PUBLIC_KEY] = { .len = NOISE_PUBLIC_KEY_LEN }, + [WGPEER_A_PRESHARED_KEY] = { .len = NOISE_SYMMETRIC_KEY_LEN }, + [WGPEER_A_FLAGS] = { .type = NLA_U32 }, + [WGPEER_A_ENDPOINT] = { .len = sizeof(struct sockaddr) }, + [WGPEER_A_PERSISTENT_KEEPALIVE_INTERVAL] = { .type = NLA_U16 }, + [WGPEER_A_LAST_HANDSHAKE_TIME] = { .len = sizeof(struct timespec) }, + [WGPEER_A_RX_BYTES] = { .type = NLA_U64 }, + [WGPEER_A_TX_BYTES] = { .type = NLA_U64 }, + [WGPEER_A_ALLOWEDIPS] = { .type = NLA_NESTED }, + [WGPEER_A_PROTOCOL_VERSION] = { .type = NLA_U32 } +}; + +static const struct nla_policy allowedip_policy[WGALLOWEDIP_A_MAX + 1] = { + [WGALLOWEDIP_A_FAMILY] = { .type = NLA_U16 }, + [WGALLOWEDIP_A_IPADDR] = { .len = sizeof(struct in_addr) }, + [WGALLOWEDIP_A_CIDR_MASK] = { .type = NLA_U8 } +}; + +static struct wireguard_device *lookup_interface(struct nlattr **attrs, + struct sk_buff *skb) +{ + struct net_device *dev = NULL; + + if (!attrs[WGDEVICE_A_IFINDEX] == !attrs[WGDEVICE_A_IFNAME]) + return ERR_PTR(-EBADR); + if (attrs[WGDEVICE_A_IFINDEX]) + dev = dev_get_by_index(sock_net(skb->sk), + nla_get_u32(attrs[WGDEVICE_A_IFINDEX])); + else if (attrs[WGDEVICE_A_IFNAME]) + dev = dev_get_by_name(sock_net(skb->sk), + nla_data(attrs[WGDEVICE_A_IFNAME])); + if (!dev) + return ERR_PTR(-ENODEV); + if (!dev->rtnl_link_ops || !dev->rtnl_link_ops->kind || + strcmp(dev->rtnl_link_ops->kind, KBUILD_MODNAME)) { + dev_put(dev); + return ERR_PTR(-EOPNOTSUPP); + } + return netdev_priv(dev); +} + +struct allowedips_ctx { + struct sk_buff *skb; + unsigned int i; +}; + +static int get_allowedips(void *ctx, const u8 *ip, u8 cidr, int family) +{ + struct allowedips_ctx *actx = ctx; + struct nlattr *allowedip_nest; + + allowedip_nest = nla_nest_start(actx->skb, actx->i++); + if (!allowedip_nest) + return -EMSGSIZE; + + if (nla_put_u8(actx->skb, WGALLOWEDIP_A_CIDR_MASK, cidr) || + nla_put_u16(actx->skb, WGALLOWEDIP_A_FAMILY, family) || + nla_put(actx->skb, WGALLOWEDIP_A_IPADDR, family == AF_INET6 ? 
+ sizeof(struct in6_addr) : sizeof(struct in_addr), ip)) { + nla_nest_cancel(actx->skb, allowedip_nest); + return -EMSGSIZE; + } + + nla_nest_end(actx->skb, allowedip_nest); + return 0; +} + +static int get_peer(struct wireguard_peer *peer, unsigned int index, + struct allowedips_cursor *rt_cursor, struct sk_buff *skb) +{ + struct nlattr *allowedips_nest, *peer_nest = nla_nest_start(skb, index); + struct allowedips_ctx ctx = { .skb = skb }; + bool fail; + + if (!peer_nest) + return -EMSGSIZE; + + down_read(&peer->handshake.lock); + fail = nla_put(skb, WGPEER_A_PUBLIC_KEY, NOISE_PUBLIC_KEY_LEN, + peer->handshake.remote_static); + up_read(&peer->handshake.lock); + if (fail) + goto err; + + if (!rt_cursor->seq) { + down_read(&peer->handshake.lock); + fail = nla_put(skb, WGPEER_A_PRESHARED_KEY, + NOISE_SYMMETRIC_KEY_LEN, + peer->handshake.preshared_key); + up_read(&peer->handshake.lock); + if (fail) + goto err; + + if (nla_put(skb, WGPEER_A_LAST_HANDSHAKE_TIME, + sizeof(peer->walltime_last_handshake), + &peer->walltime_last_handshake) || + nla_put_u16(skb, WGPEER_A_PERSISTENT_KEEPALIVE_INTERVAL, + peer->persistent_keepalive_interval) || + nla_put_u64_64bit(skb, WGPEER_A_TX_BYTES, peer->tx_bytes, + WGPEER_A_UNSPEC) || + nla_put_u64_64bit(skb, WGPEER_A_RX_BYTES, peer->rx_bytes, + WGPEER_A_UNSPEC) || + nla_put_u32(skb, WGPEER_A_PROTOCOL_VERSION, 1)) + goto err; + + read_lock_bh(&peer->endpoint_lock); + if (peer->endpoint.addr.sa_family == AF_INET) + fail = nla_put(skb, WGPEER_A_ENDPOINT, + sizeof(peer->endpoint.addr4), + &peer->endpoint.addr4); + else if (peer->endpoint.addr.sa_family == AF_INET6) + fail = nla_put(skb, WGPEER_A_ENDPOINT, + sizeof(peer->endpoint.addr6), + &peer->endpoint.addr6); + read_unlock_bh(&peer->endpoint_lock); + if (fail) + goto err; + } + + allowedips_nest = nla_nest_start(skb, WGPEER_A_ALLOWEDIPS); + if (!allowedips_nest) + goto err; + if (wg_allowedips_walk_by_peer(&peer->device->peer_allowedips, + rt_cursor, peer, get_allowedips, &ctx, + &peer->device->device_update_lock)) { + nla_nest_end(skb, allowedips_nest); + nla_nest_end(skb, peer_nest); + return -EMSGSIZE; + } + memset(rt_cursor, 0, sizeof(*rt_cursor)); + nla_nest_end(skb, allowedips_nest); + nla_nest_end(skb, peer_nest); + return 0; +err: + nla_nest_cancel(skb, peer_nest); + return -EMSGSIZE; +} + +static int get_device_start(struct netlink_callback *cb) +{ + struct nlattr **attrs = genl_family_attrbuf(&genl_family); + struct wireguard_device *wg; + int ret; + + ret = nlmsg_parse(cb->nlh, GENL_HDRLEN + genl_family.hdrsize, attrs, + genl_family.maxattr, device_policy, NULL); + if (ret < 0) + return ret; + cb->args[2] = (long)kzalloc(sizeof(struct allowedips_cursor), + GFP_KERNEL); + if (unlikely(!cb->args[2])) + return -ENOMEM; + wg = lookup_interface(attrs, cb->skb); + if (IS_ERR(wg)) { + kfree((void *)cb->args[2]); + cb->args[2] = 0; + return PTR_ERR(wg); + } + cb->args[0] = (long)wg; + return 0; +} + +static int get_device_dump(struct sk_buff *skb, struct netlink_callback *cb) +{ + struct wireguard_peer *peer, *next_peer_cursor, *last_peer_cursor; + struct allowedips_cursor *rt_cursor; + struct wireguard_device *wg; + unsigned int peer_idx = 0; + struct nlattr *peers_nest; + int ret = -EMSGSIZE; + bool done = true; + void *hdr; + + wg = (struct wireguard_device *)cb->args[0]; + next_peer_cursor = (struct wireguard_peer *)cb->args[1]; + last_peer_cursor = (struct wireguard_peer *)cb->args[1]; + rt_cursor = (struct allowedips_cursor *)cb->args[2]; + + rtnl_lock(); + mutex_lock(&wg->device_update_lock); + 
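+	/* Every configuration change bumps device_update_gen (see
+	 * set_device below), and genl_dump_check_consistent() uses the seq
+	 * assigned here to flag any dump that interleaved with such a
+	 * change, so userspace knows to retry.
+	 */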
cb->seq = wg->device_update_gen; + + hdr = genlmsg_put(skb, NETLINK_CB(cb->skb).portid, cb->nlh->nlmsg_seq, + &genl_family, NLM_F_MULTI, WG_CMD_GET_DEVICE); + if (!hdr) + goto out; + genl_dump_check_consistent(cb, hdr); + + if (!last_peer_cursor) { + if (nla_put_u16(skb, WGDEVICE_A_LISTEN_PORT, + wg->incoming_port) || + nla_put_u32(skb, WGDEVICE_A_FWMARK, wg->fwmark) || + nla_put_u32(skb, WGDEVICE_A_IFINDEX, wg->dev->ifindex) || + nla_put_string(skb, WGDEVICE_A_IFNAME, wg->dev->name)) + goto out; + + down_read(&wg->static_identity.lock); + if (wg->static_identity.has_identity) { + if (nla_put(skb, WGDEVICE_A_PRIVATE_KEY, + NOISE_PUBLIC_KEY_LEN, + wg->static_identity.static_private) || + nla_put(skb, WGDEVICE_A_PUBLIC_KEY, + NOISE_PUBLIC_KEY_LEN, + wg->static_identity.static_public)) { + up_read(&wg->static_identity.lock); + goto out; + } + } + up_read(&wg->static_identity.lock); + } + + peers_nest = nla_nest_start(skb, WGDEVICE_A_PEERS); + if (!peers_nest) + goto out; + ret = 0; + /* If the last cursor was removed via list_del_init in peer_remove, then + * we just treat this the same as there being no more peers left. The + * reason is that seq_nr should indicate to userspace that this isn't a + * coherent dump anyway, so they'll try again. + */ + if (list_empty(&wg->peer_list) || + (last_peer_cursor && list_empty(&last_peer_cursor->peer_list))) { + nla_nest_cancel(skb, peers_nest); + goto out; + } + lockdep_assert_held(&wg->device_update_lock); + peer = list_prepare_entry(last_peer_cursor, &wg->peer_list, peer_list); + list_for_each_entry_continue (peer, &wg->peer_list, peer_list) { + if (get_peer(peer, peer_idx++, rt_cursor, skb)) { + done = false; + break; + } + next_peer_cursor = peer; + } + nla_nest_end(skb, peers_nest); + +out: + if (!ret && !done && next_peer_cursor) + wg_peer_get(next_peer_cursor); + wg_peer_put(last_peer_cursor); + mutex_unlock(&wg->device_update_lock); + rtnl_unlock(); + + if (ret) { + genlmsg_cancel(skb, hdr); + return ret; + } + genlmsg_end(skb, hdr); + if (done) { + cb->args[1] = 0; + return 0; + } + cb->args[1] = (long)next_peer_cursor; + return skb->len; + + /* At this point, we can't really deal ourselves with safely zeroing out + * the private key material after usage. This will need an additional API + * in the kernel for marking skbs as zero_on_free. 
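+	 * Until such an API exists, the private key copied into this skb
+	 * simply lingers in memory until the skb is freed through the
+	 * normal, non-zeroing path.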
+ */ +} + +static int get_device_done(struct netlink_callback *cb) +{ + struct wireguard_device *wg = (struct wireguard_device *)cb->args[0]; + struct wireguard_peer *peer = (struct wireguard_peer *)cb->args[1]; + struct allowedips_cursor *rt_cursor = + (struct allowedips_cursor *)cb->args[2]; + + if (wg) + dev_put(wg->dev); + kfree(rt_cursor); + wg_peer_put(peer); + return 0; +} + +static int set_port(struct wireguard_device *wg, u16 port) +{ + struct wireguard_peer *peer; + + if (wg->incoming_port == port) + return 0; + list_for_each_entry (peer, &wg->peer_list, peer_list) + wg_socket_clear_peer_endpoint_src(peer); + if (!netif_running(wg->dev)) { + wg->incoming_port = port; + return 0; + } + return wg_socket_init(wg, port); +} + +static int set_allowedip(struct wireguard_peer *peer, struct nlattr **attrs) +{ + int ret = -EINVAL; + u16 family; + u8 cidr; + + if (!attrs[WGALLOWEDIP_A_FAMILY] || !attrs[WGALLOWEDIP_A_IPADDR] || + !attrs[WGALLOWEDIP_A_CIDR_MASK]) + return ret; + family = nla_get_u16(attrs[WGALLOWEDIP_A_FAMILY]); + cidr = nla_get_u8(attrs[WGALLOWEDIP_A_CIDR_MASK]); + + if (family == AF_INET && cidr <= 32 && + nla_len(attrs[WGALLOWEDIP_A_IPADDR]) == sizeof(struct in_addr)) + ret = wg_allowedips_insert_v4( + &peer->device->peer_allowedips, + nla_data(attrs[WGALLOWEDIP_A_IPADDR]), cidr, peer, + &peer->device->device_update_lock); + else if (family == AF_INET6 && cidr <= 128 && + nla_len(attrs[WGALLOWEDIP_A_IPADDR]) == sizeof(struct in6_addr)) + ret = wg_allowedips_insert_v6( + &peer->device->peer_allowedips, + nla_data(attrs[WGALLOWEDIP_A_IPADDR]), cidr, peer, + &peer->device->device_update_lock); + + return ret; +} + +static int set_peer(struct wireguard_device *wg, struct nlattr **attrs) +{ + u8 *public_key = NULL, *preshared_key = NULL; + struct wireguard_peer *peer = NULL; + u32 flags = 0; + int ret; + + ret = -EINVAL; + if (attrs[WGPEER_A_PUBLIC_KEY] && + nla_len(attrs[WGPEER_A_PUBLIC_KEY]) == NOISE_PUBLIC_KEY_LEN) + public_key = nla_data(attrs[WGPEER_A_PUBLIC_KEY]); + else + goto out; + if (attrs[WGPEER_A_PRESHARED_KEY] && + nla_len(attrs[WGPEER_A_PRESHARED_KEY]) == NOISE_SYMMETRIC_KEY_LEN) + preshared_key = nla_data(attrs[WGPEER_A_PRESHARED_KEY]); + if (attrs[WGPEER_A_FLAGS]) + flags = nla_get_u32(attrs[WGPEER_A_FLAGS]); + + ret = -EPFNOSUPPORT; + if (attrs[WGPEER_A_PROTOCOL_VERSION]) { + if (nla_get_u32(attrs[WGPEER_A_PROTOCOL_VERSION]) != 1) + goto out; + } + + peer = wg_pubkey_hashtable_lookup(&wg->peer_hashtable, + nla_data(attrs[WGPEER_A_PUBLIC_KEY])); + if (!peer) { /* Peer doesn't exist yet. Add a new one. */ + ret = -ENODEV; + if (flags & WGPEER_F_REMOVE_ME) + goto out; /* Tried to remove a non-existing peer. */ + + down_read(&wg->static_identity.lock); + if (wg->static_identity.has_identity && + !memcmp(nla_data(attrs[WGPEER_A_PUBLIC_KEY]), + wg->static_identity.static_public, + NOISE_PUBLIC_KEY_LEN)) { + /* We silently ignore peers that have the same public + * key as the device. The reason we do it silently is + * that we'd like for people to be able to reuse the + * same set of API calls across peers. + */ + up_read(&wg->static_identity.lock); + ret = 0; + goto out; + } + up_read(&wg->static_identity.lock); + + ret = -ENOMEM; + peer = wg_peer_create(wg, public_key, preshared_key); + if (!peer) + goto out; + /* Take additional reference, as though we've just been + * looked up. 
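+		 * The common exit path below ends in wg_peer_put(peer); this
+		 * extra reference keeps that put from consuming the reference
+		 * that holds the freshly created peer on the device's peer
+		 * list.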
+ */ + wg_peer_get(peer); + } + + ret = 0; + if (flags & WGPEER_F_REMOVE_ME) { + wg_peer_remove(peer); + goto out; + } + + if (preshared_key) { + down_write(&peer->handshake.lock); + memcpy(&peer->handshake.preshared_key, preshared_key, + NOISE_SYMMETRIC_KEY_LEN); + up_write(&peer->handshake.lock); + } + + if (attrs[WGPEER_A_ENDPOINT]) { + struct sockaddr *addr = nla_data(attrs[WGPEER_A_ENDPOINT]); + size_t len = nla_len(attrs[WGPEER_A_ENDPOINT]); + + if ((len == sizeof(struct sockaddr_in) && + addr->sa_family == AF_INET) || + (len == sizeof(struct sockaddr_in6) && + addr->sa_family == AF_INET6)) { + struct endpoint endpoint = { { { 0 } } }; + + memcpy(&endpoint.addr, addr, len); + wg_socket_set_peer_endpoint(peer, &endpoint); + } + } + + if (flags & WGPEER_F_REPLACE_ALLOWEDIPS) + wg_allowedips_remove_by_peer(&wg->peer_allowedips, peer, + &wg->device_update_lock); + + if (attrs[WGPEER_A_ALLOWEDIPS]) { + struct nlattr *attr, *allowedip[WGALLOWEDIP_A_MAX + 1]; + int rem; + + nla_for_each_nested (attr, attrs[WGPEER_A_ALLOWEDIPS], rem) { + ret = nla_parse_nested(allowedip, WGALLOWEDIP_A_MAX, + attr, allowedip_policy, NULL); + if (ret < 0) + goto out; + ret = set_allowedip(peer, allowedip); + if (ret < 0) + goto out; + } + } + + if (attrs[WGPEER_A_PERSISTENT_KEEPALIVE_INTERVAL]) { + const u16 persistent_keepalive_interval = nla_get_u16( + attrs[WGPEER_A_PERSISTENT_KEEPALIVE_INTERVAL]); + const bool send_keepalive = + !peer->persistent_keepalive_interval && + persistent_keepalive_interval && + netif_running(wg->dev); + + peer->persistent_keepalive_interval = persistent_keepalive_interval; + if (send_keepalive) + wg_packet_send_keepalive(peer); + } + + if (netif_running(wg->dev)) + wg_packet_send_staged_packets(peer); + +out: + wg_peer_put(peer); + if (attrs[WGPEER_A_PRESHARED_KEY]) + memzero_explicit(nla_data(attrs[WGPEER_A_PRESHARED_KEY]), + nla_len(attrs[WGPEER_A_PRESHARED_KEY])); + return ret; +} + +static int set_device(struct sk_buff *skb, struct genl_info *info) +{ + struct wireguard_device *wg = lookup_interface(info->attrs, skb); + int ret; + + if (IS_ERR(wg)) { + ret = PTR_ERR(wg); + goto out_nodev; + } + + rtnl_lock(); + mutex_lock(&wg->device_update_lock); + ++wg->device_update_gen; + + if (info->attrs[WGDEVICE_A_FWMARK]) { + struct wireguard_peer *peer; + + wg->fwmark = nla_get_u32(info->attrs[WGDEVICE_A_FWMARK]); + list_for_each_entry (peer, &wg->peer_list, peer_list) + wg_socket_clear_peer_endpoint_src(peer); + } + + if (info->attrs[WGDEVICE_A_LISTEN_PORT]) { + ret = set_port(wg, + nla_get_u16(info->attrs[WGDEVICE_A_LISTEN_PORT])); + if (ret) + goto out; + } + + if (info->attrs[WGDEVICE_A_FLAGS] && + nla_get_u32(info->attrs[WGDEVICE_A_FLAGS]) & + WGDEVICE_F_REPLACE_PEERS) + wg_peer_remove_all(wg); + + if (info->attrs[WGDEVICE_A_PRIVATE_KEY] && + nla_len(info->attrs[WGDEVICE_A_PRIVATE_KEY]) == + NOISE_PUBLIC_KEY_LEN) { + u8 *private_key = nla_data(info->attrs[WGDEVICE_A_PRIVATE_KEY]); + u8 public_key[NOISE_PUBLIC_KEY_LEN]; + struct wireguard_peer *peer, *temp; + + /* We remove before setting, to prevent race, which means doing + * two 25519-genpub ops. 
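+		 * The first generation, just below, derives the public key
+		 * only to evict any existing peer that matches it; the second
+		 * happens inside wg_noise_set_static_identity_private_key().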
+ */ + if (curve25519_generate_public(public_key, private_key)) { + peer = wg_pubkey_hashtable_lookup(&wg->peer_hashtable, + public_key); + if (peer) { + wg_peer_put(peer); + wg_peer_remove(peer); + } + } + + down_write(&wg->static_identity.lock); + wg_noise_set_static_identity_private_key(&wg->static_identity, + private_key); + list_for_each_entry_safe (peer, temp, &wg->peer_list, + peer_list) { + if (!wg_noise_precompute_static_static(peer)) + wg_peer_remove(peer); + } + wg_cookie_checker_precompute_device_keys(&wg->cookie_checker); + up_write(&wg->static_identity.lock); + } + + if (info->attrs[WGDEVICE_A_PEERS]) { + struct nlattr *attr, *peer[WGPEER_A_MAX + 1]; + int rem; + + nla_for_each_nested (attr, info->attrs[WGDEVICE_A_PEERS], rem) { + ret = nla_parse_nested(peer, WGPEER_A_MAX, attr, + peer_policy, NULL); + if (ret < 0) + goto out; + ret = set_peer(wg, peer); + if (ret < 0) + goto out; + } + } + ret = 0; + +out: + mutex_unlock(&wg->device_update_lock); + rtnl_unlock(); + dev_put(wg->dev); +out_nodev: + if (info->attrs[WGDEVICE_A_PRIVATE_KEY]) + memzero_explicit(nla_data(info->attrs[WGDEVICE_A_PRIVATE_KEY]), + nla_len(info->attrs[WGDEVICE_A_PRIVATE_KEY])); + return ret; +} + +static const struct genl_ops genl_ops[] = { + { + .cmd = WG_CMD_GET_DEVICE, + .start = get_device_start, + .dumpit = get_device_dump, + .done = get_device_done, + .policy = device_policy, + .flags = GENL_UNS_ADMIN_PERM + }, { + .cmd = WG_CMD_SET_DEVICE, + .doit = set_device, + .policy = device_policy, + .flags = GENL_UNS_ADMIN_PERM + } +}; + +static struct genl_family genl_family __ro_after_init = { + .ops = genl_ops, + .n_ops = ARRAY_SIZE(genl_ops), + .name = WG_GENL_NAME, + .version = WG_GENL_VERSION, + .maxattr = WGDEVICE_A_MAX, + .module = THIS_MODULE, + .netnsok = true +}; + +int __init wg_genetlink_init(void) +{ + return genl_register_family(&genl_family); +} + +void __exit wg_genetlink_uninit(void) +{ + genl_unregister_family(&genl_family); +} diff --git a/drivers/net/wireguard/netlink.h b/drivers/net/wireguard/netlink.h new file mode 100644 index 000000000000..1dc6a67bc719 --- /dev/null +++ b/drivers/net/wireguard/netlink.h @@ -0,0 +1,12 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (C) 2015-2018 Jason A. Donenfeld . All Rights Reserved. + */ + +#ifndef _WG_NETLINK_H +#define _WG_NETLINK_H + +int wg_genetlink_init(void); +void wg_genetlink_uninit(void); + +#endif /* _WG_NETLINK_H */ diff --git a/drivers/net/wireguard/noise.c b/drivers/net/wireguard/noise.c new file mode 100644 index 000000000000..830858cb7e76 --- /dev/null +++ b/drivers/net/wireguard/noise.c @@ -0,0 +1,786 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (C) 2015-2018 Jason A. Donenfeld . All Rights Reserved. 
+ */ + +#include "noise.h" +#include "device.h" +#include "peer.h" +#include "messages.h" +#include "queueing.h" +#include "hashtables.h" + +#include +#include +#include +#include +#include +#include + +/* This implements Noise_IKpsk2: + * + * <- s + * ****** + * -> e, es, s, ss, {t} + * <- e, ee, se, psk, {} + */ + +static const u8 handshake_name[37] = "Noise_IKpsk2_25519_ChaChaPoly_BLAKE2s"; +static const u8 identifier_name[34] = "WireGuard v1 zx2c4 Jason@zx2c4.com"; +static u8 handshake_init_hash[NOISE_HASH_LEN] __ro_after_init; +static u8 handshake_init_chaining_key[NOISE_HASH_LEN] __ro_after_init; +static atomic64_t keypair_counter = ATOMIC64_INIT(0); + +void __init wg_noise_init(void) +{ + struct blake2s_state blake; + + blake2s(handshake_init_chaining_key, handshake_name, NULL, + NOISE_HASH_LEN, sizeof(handshake_name), 0); + blake2s_init(&blake, NOISE_HASH_LEN); + blake2s_update(&blake, handshake_init_chaining_key, NOISE_HASH_LEN); + blake2s_update(&blake, identifier_name, sizeof(identifier_name)); + blake2s_final(&blake, handshake_init_hash, NOISE_HASH_LEN); +} + +/* Must hold peer->handshake.static_identity->lock */ +bool wg_noise_precompute_static_static(struct wireguard_peer *peer) +{ + bool ret = true; + + down_write(&peer->handshake.lock); + if (peer->handshake.static_identity->has_identity) + ret = curve25519( + peer->handshake.precomputed_static_static, + peer->handshake.static_identity->static_private, + peer->handshake.remote_static); + else + memset(peer->handshake.precomputed_static_static, 0, + NOISE_PUBLIC_KEY_LEN); + up_write(&peer->handshake.lock); + return ret; +} + +bool wg_noise_handshake_init(struct noise_handshake *handshake, + struct noise_static_identity *static_identity, + const u8 peer_public_key[NOISE_PUBLIC_KEY_LEN], + const u8 peer_preshared_key[NOISE_SYMMETRIC_KEY_LEN], + struct wireguard_peer *peer) +{ + memset(handshake, 0, sizeof(*handshake)); + init_rwsem(&handshake->lock); + handshake->entry.type = INDEX_HASHTABLE_HANDSHAKE; + handshake->entry.peer = peer; + memcpy(handshake->remote_static, peer_public_key, NOISE_PUBLIC_KEY_LEN); + if (peer_preshared_key) + memcpy(handshake->preshared_key, peer_preshared_key, + NOISE_SYMMETRIC_KEY_LEN); + handshake->static_identity = static_identity; + handshake->state = HANDSHAKE_ZEROED; + return wg_noise_precompute_static_static(peer); +} + +static void handshake_zero(struct noise_handshake *handshake) +{ + memset(&handshake->ephemeral_private, 0, NOISE_PUBLIC_KEY_LEN); + memset(&handshake->remote_ephemeral, 0, NOISE_PUBLIC_KEY_LEN); + memset(&handshake->hash, 0, NOISE_HASH_LEN); + memset(&handshake->chaining_key, 0, NOISE_HASH_LEN); + handshake->remote_index = 0; + handshake->state = HANDSHAKE_ZEROED; +} + +void wg_noise_handshake_clear(struct noise_handshake *handshake) +{ + wg_index_hashtable_remove( + &handshake->entry.peer->device->index_hashtable, + &handshake->entry); + down_write(&handshake->lock); + handshake_zero(handshake); + up_write(&handshake->lock); + wg_index_hashtable_remove( + &handshake->entry.peer->device->index_hashtable, + &handshake->entry); +} + +static struct noise_keypair *keypair_create(struct wireguard_peer *peer) +{ + struct noise_keypair *keypair = kzalloc(sizeof(*keypair), GFP_KERNEL); + + if (unlikely(!keypair)) + return NULL; + keypair->internal_id = atomic64_inc_return(&keypair_counter); + keypair->entry.type = INDEX_HASHTABLE_KEYPAIR; + keypair->entry.peer = peer; + kref_init(&keypair->refcount); + return keypair; +} + +static void keypair_free_rcu(struct rcu_head *rcu) +{ + 
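+	/* kzfree() zeroes the keypair, and with it the session keys it
+	 * holds, before handing the memory back to the allocator.
+	 */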
kzfree(container_of(rcu, struct noise_keypair, rcu)); +} + +static void keypair_free_kref(struct kref *kref) +{ + struct noise_keypair *keypair = + container_of(kref, struct noise_keypair, refcount); + net_dbg_ratelimited("%s: Keypair %llu destroyed for peer %llu\n", + keypair->entry.peer->device->dev->name, + keypair->internal_id, + keypair->entry.peer->internal_id); + wg_index_hashtable_remove(&keypair->entry.peer->device->index_hashtable, + &keypair->entry); + call_rcu_bh(&keypair->rcu, keypair_free_rcu); +} + +void wg_noise_keypair_put(struct noise_keypair *keypair, bool unreference_now) +{ + if (unlikely(!keypair)) + return; + if (unlikely(unreference_now)) + wg_index_hashtable_remove( + &keypair->entry.peer->device->index_hashtable, + &keypair->entry); + kref_put(&keypair->refcount, keypair_free_kref); +} + +struct noise_keypair *wg_noise_keypair_get(struct noise_keypair *keypair) +{ + RCU_LOCKDEP_WARN(!rcu_read_lock_bh_held(), + "Taking noise keypair reference without holding the RCU BH read lock"); + if (unlikely(!keypair || !kref_get_unless_zero(&keypair->refcount))) + return NULL; + return keypair; +} + +void wg_noise_keypairs_clear(struct noise_keypairs *keypairs) +{ + struct noise_keypair *old; + + spin_lock_bh(&keypairs->keypair_update_lock); + old = rcu_dereference_protected(keypairs->previous_keypair, + lockdep_is_held(&keypairs->keypair_update_lock)); + RCU_INIT_POINTER(keypairs->previous_keypair, NULL); + wg_noise_keypair_put(old, true); + old = rcu_dereference_protected(keypairs->next_keypair, + lockdep_is_held(&keypairs->keypair_update_lock)); + RCU_INIT_POINTER(keypairs->next_keypair, NULL); + wg_noise_keypair_put(old, true); + old = rcu_dereference_protected(keypairs->current_keypair, + lockdep_is_held(&keypairs->keypair_update_lock)); + RCU_INIT_POINTER(keypairs->current_keypair, NULL); + wg_noise_keypair_put(old, true); + spin_unlock_bh(&keypairs->keypair_update_lock); +} + +static void add_new_keypair(struct noise_keypairs *keypairs, + struct noise_keypair *new_keypair) +{ + struct noise_keypair *previous_keypair, *next_keypair, *current_keypair; + + spin_lock_bh(&keypairs->keypair_update_lock); + previous_keypair = rcu_dereference_protected(keypairs->previous_keypair, + lockdep_is_held(&keypairs->keypair_update_lock)); + next_keypair = rcu_dereference_protected(keypairs->next_keypair, + lockdep_is_held(&keypairs->keypair_update_lock)); + current_keypair = rcu_dereference_protected(keypairs->current_keypair, + lockdep_is_held(&keypairs->keypair_update_lock)); + if (new_keypair->i_am_the_initiator) { + /* If we're the initiator, it means we've sent a handshake, and + * received a confirmation response, which means this new + * keypair can now be used. + */ + if (next_keypair) { + /* If there already was a next keypair pending, we + * demote it to be the previous keypair, and free the + * existing current. Note that this means KCI can result + * in this transition. It would perhaps be more sound to + * always just get rid of the unused next keypair + * instead of putting it in the previous slot, but this + * might be a bit less robust. Something to think about + * for the future. + */ + RCU_INIT_POINTER(keypairs->next_keypair, NULL); + rcu_assign_pointer(keypairs->previous_keypair, + next_keypair); + wg_noise_keypair_put(current_keypair, true); + } else /* If there wasn't an existing next keypair, we replace + * the previous with the current one. 
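+		 * In other words, (previous, current, next) moves from
+		 * (P, C, NULL) to (C, new, NULL), with P's reference dropped
+		 * just below.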
+ */ + rcu_assign_pointer(keypairs->previous_keypair, + current_keypair); + /* At this point we can get rid of the old previous keypair, and + * set up the new keypair. + */ + wg_noise_keypair_put(previous_keypair, true); + rcu_assign_pointer(keypairs->current_keypair, new_keypair); + } else { + /* If we're the responder, it means we can't use the new keypair + * until we receive confirmation via the first data packet, so + * we get rid of the existing previous one, the possibly + * existing next one, and slide in the new next one. + */ + rcu_assign_pointer(keypairs->next_keypair, new_keypair); + wg_noise_keypair_put(next_keypair, true); + RCU_INIT_POINTER(keypairs->previous_keypair, NULL); + wg_noise_keypair_put(previous_keypair, true); + } + spin_unlock_bh(&keypairs->keypair_update_lock); +} + +bool wg_noise_received_with_keypair(struct noise_keypairs *keypairs, + struct noise_keypair *received_keypair) +{ + struct noise_keypair *old_keypair; + bool key_is_new; + + /* We first check without taking the spinlock. */ + key_is_new = received_keypair == + rcu_access_pointer(keypairs->next_keypair); + if (likely(!key_is_new)) + return false; + + spin_lock_bh(&keypairs->keypair_update_lock); + /* After locking, we double check that things didn't change from + * beneath us. + */ + if (unlikely(received_keypair != + rcu_dereference_protected(keypairs->next_keypair, + lockdep_is_held(&keypairs->keypair_update_lock)))) { + spin_unlock_bh(&keypairs->keypair_update_lock); + return false; + } + + /* When we've finally received the confirmation, we slide the next + * into the current, the current into the previous, and get rid of + * the old previous. + */ + old_keypair = rcu_dereference_protected(keypairs->previous_keypair, + lockdep_is_held(&keypairs->keypair_update_lock)); + rcu_assign_pointer(keypairs->previous_keypair, + rcu_dereference_protected(keypairs->current_keypair, + lockdep_is_held(&keypairs->keypair_update_lock))); + wg_noise_keypair_put(old_keypair, true); + rcu_assign_pointer(keypairs->current_keypair, received_keypair); + RCU_INIT_POINTER(keypairs->next_keypair, NULL); + + spin_unlock_bh(&keypairs->keypair_update_lock); + return true; +} + +/* Must hold static_identity->lock */ +void wg_noise_set_static_identity_private_key( + struct noise_static_identity *static_identity, + const u8 private_key[NOISE_PUBLIC_KEY_LEN]) +{ + memcpy(static_identity->static_private, private_key, + NOISE_PUBLIC_KEY_LEN); + static_identity->has_identity = curve25519_generate_public( + static_identity->static_public, private_key); +} + +/* This is Hugo Krawczyk's HKDF: + * - https://eprint.iacr.org/2010/264.pdf + * - https://tools.ietf.org/html/rfc5869 + */ +static void kdf(u8 *first_dst, u8 *second_dst, u8 *third_dst, const u8 *data, + size_t first_len, size_t second_len, size_t third_len, + size_t data_len, const u8 chaining_key[NOISE_HASH_LEN]) +{ + u8 output[BLAKE2S_HASH_SIZE + 1]; + u8 secret[BLAKE2S_HASH_SIZE]; + + WARN_ON(IS_ENABLED(DEBUG) && + (first_len > BLAKE2S_HASH_SIZE || second_len > BLAKE2S_HASH_SIZE || + third_len > BLAKE2S_HASH_SIZE || + ((second_len || second_dst || third_len || third_dst) && + (!first_len || !first_dst)) || + ((third_len || third_dst) && (!second_len || !second_dst)))); + + /* Extract entropy from data into secret */ + blake2s_hmac(secret, data, chaining_key, BLAKE2S_HASH_SIZE, data_len, + NOISE_HASH_LEN); + + if (!first_dst || !first_len) + goto out; + + /* Expand first key: key = secret, data = 0x1 */ + output[0] = 1; + blake2s_hmac(output, output, secret, 
BLAKE2S_HASH_SIZE, 1, + BLAKE2S_HASH_SIZE); + memcpy(first_dst, output, first_len); + + if (!second_dst || !second_len) + goto out; + + /* Expand second key: key = secret, data = first-key || 0x2 */ + output[BLAKE2S_HASH_SIZE] = 2; + blake2s_hmac(output, output, secret, BLAKE2S_HASH_SIZE, + BLAKE2S_HASH_SIZE + 1, BLAKE2S_HASH_SIZE); + memcpy(second_dst, output, second_len); + + if (!third_dst || !third_len) + goto out; + + /* Expand third key: key = secret, data = second-key || 0x3 */ + output[BLAKE2S_HASH_SIZE] = 3; + blake2s_hmac(output, output, secret, BLAKE2S_HASH_SIZE, + BLAKE2S_HASH_SIZE + 1, BLAKE2S_HASH_SIZE); + memcpy(third_dst, output, third_len); + +out: + /* Clear sensitive data from stack */ + memzero_explicit(secret, BLAKE2S_HASH_SIZE); + memzero_explicit(output, BLAKE2S_HASH_SIZE + 1); +} + +static void symmetric_key_init(struct noise_symmetric_key *key) +{ + spin_lock_init(&key->counter.receive.lock); + atomic64_set(&key->counter.counter, 0); + memset(key->counter.receive.backtrack, 0, + sizeof(key->counter.receive.backtrack)); + key->birthdate = ktime_get_boot_fast_ns(); + key->is_valid = true; +} + +static void derive_keys(struct noise_symmetric_key *first_dst, + struct noise_symmetric_key *second_dst, + const u8 chaining_key[NOISE_HASH_LEN]) +{ + kdf(first_dst->key, second_dst->key, NULL, NULL, + NOISE_SYMMETRIC_KEY_LEN, NOISE_SYMMETRIC_KEY_LEN, 0, 0, + chaining_key); + symmetric_key_init(first_dst); + symmetric_key_init(second_dst); +} + +static bool __must_check mix_dh(u8 chaining_key[NOISE_HASH_LEN], + u8 key[NOISE_SYMMETRIC_KEY_LEN], + const u8 private[NOISE_PUBLIC_KEY_LEN], + const u8 public[NOISE_PUBLIC_KEY_LEN]) +{ + u8 dh_calculation[NOISE_PUBLIC_KEY_LEN]; + + if (unlikely(!curve25519(dh_calculation, private, public))) + return false; + kdf(chaining_key, key, NULL, dh_calculation, NOISE_HASH_LEN, + NOISE_SYMMETRIC_KEY_LEN, 0, NOISE_PUBLIC_KEY_LEN, chaining_key); + memzero_explicit(dh_calculation, NOISE_PUBLIC_KEY_LEN); + return true; +} + +static void mix_hash(u8 hash[NOISE_HASH_LEN], const u8 *src, size_t src_len) +{ + struct blake2s_state blake; + + blake2s_init(&blake, NOISE_HASH_LEN); + blake2s_update(&blake, hash, NOISE_HASH_LEN); + blake2s_update(&blake, src, src_len); + blake2s_final(&blake, hash, NOISE_HASH_LEN); +} + +static void mix_psk(u8 chaining_key[NOISE_HASH_LEN], u8 hash[NOISE_HASH_LEN], + u8 key[NOISE_SYMMETRIC_KEY_LEN], + const u8 psk[NOISE_SYMMETRIC_KEY_LEN]) +{ + u8 temp_hash[NOISE_HASH_LEN]; + + kdf(chaining_key, temp_hash, key, psk, NOISE_HASH_LEN, NOISE_HASH_LEN, + NOISE_SYMMETRIC_KEY_LEN, NOISE_SYMMETRIC_KEY_LEN, chaining_key); + mix_hash(hash, temp_hash, NOISE_HASH_LEN); + memzero_explicit(temp_hash, NOISE_HASH_LEN); +} + +static void handshake_init(u8 chaining_key[NOISE_HASH_LEN], + u8 hash[NOISE_HASH_LEN], + const u8 remote_static[NOISE_PUBLIC_KEY_LEN]) +{ + memcpy(hash, handshake_init_hash, NOISE_HASH_LEN); + memcpy(chaining_key, handshake_init_chaining_key, NOISE_HASH_LEN); + mix_hash(hash, remote_static, NOISE_PUBLIC_KEY_LEN); +} + +static void message_encrypt(u8 *dst_ciphertext, const u8 *src_plaintext, + size_t src_len, u8 key[NOISE_SYMMETRIC_KEY_LEN], + u8 hash[NOISE_HASH_LEN]) +{ + chacha20poly1305_encrypt(dst_ciphertext, src_plaintext, src_len, hash, + NOISE_HASH_LEN, + 0 /* Always zero for Noise_IK */, key); + mix_hash(hash, dst_ciphertext, noise_encrypted_len(src_len)); +} + +static bool message_decrypt(u8 *dst_plaintext, const u8 *src_ciphertext, + size_t src_len, u8 key[NOISE_SYMMETRIC_KEY_LEN], + u8 hash[NOISE_HASH_LEN]) +{ 
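+	/* Mirror image of message_encrypt() above: the transcript hash is
+	 * the AEAD's associated data, and the nonce is always zero, since a
+	 * given handshake key encrypts only a single message. The hash is
+	 * only mixed forward when authentication succeeds.
+	 */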
+ if (!chacha20poly1305_decrypt(dst_plaintext, src_ciphertext, src_len, + hash, NOISE_HASH_LEN, + 0 /* Always zero for Noise_IK */, key)) + return false; + mix_hash(hash, src_ciphertext, src_len); + return true; +} + +static void message_ephemeral(u8 ephemeral_dst[NOISE_PUBLIC_KEY_LEN], + const u8 ephemeral_src[NOISE_PUBLIC_KEY_LEN], + u8 chaining_key[NOISE_HASH_LEN], + u8 hash[NOISE_HASH_LEN]) +{ + if (ephemeral_dst != ephemeral_src) + memcpy(ephemeral_dst, ephemeral_src, NOISE_PUBLIC_KEY_LEN); + mix_hash(hash, ephemeral_src, NOISE_PUBLIC_KEY_LEN); + kdf(chaining_key, NULL, NULL, ephemeral_src, NOISE_HASH_LEN, 0, 0, + NOISE_PUBLIC_KEY_LEN, chaining_key); +} + +static void tai64n_now(u8 output[NOISE_TIMESTAMP_LEN]) +{ + struct timespec64 now; + + getnstimeofday64(&now); + /* https://cr.yp.to/libtai/tai64.html */ + *(__be64 *)output = cpu_to_be64(0x400000000000000aULL + now.tv_sec); + *(__be32 *)(output + sizeof(__be64)) = cpu_to_be32(now.tv_nsec); +} + +bool +wg_noise_handshake_create_initiation(struct message_handshake_initiation *dst, + struct noise_handshake *handshake) +{ + u8 timestamp[NOISE_TIMESTAMP_LEN]; + u8 key[NOISE_SYMMETRIC_KEY_LEN]; + bool ret = false; + + /* We need to wait for crng _before_ taking any locks, since + * curve25519_generate_secret uses get_random_bytes_wait. + */ + wait_for_random_bytes(); + + down_read(&handshake->static_identity->lock); + down_write(&handshake->lock); + + if (unlikely(!handshake->static_identity->has_identity)) + goto out; + + dst->header.type = cpu_to_le32(MESSAGE_HANDSHAKE_INITIATION); + + handshake_init(handshake->chaining_key, handshake->hash, + handshake->remote_static); + + /* e */ + curve25519_generate_secret(handshake->ephemeral_private); + if (!curve25519_generate_public(dst->unencrypted_ephemeral, + handshake->ephemeral_private)) + goto out; + message_ephemeral(dst->unencrypted_ephemeral, + dst->unencrypted_ephemeral, handshake->chaining_key, + handshake->hash); + + /* es */ + if (!mix_dh(handshake->chaining_key, key, handshake->ephemeral_private, + handshake->remote_static)) + goto out; + + /* s */ + message_encrypt(dst->encrypted_static, + handshake->static_identity->static_public, + NOISE_PUBLIC_KEY_LEN, key, handshake->hash); + + /* ss */ + kdf(handshake->chaining_key, key, NULL, + handshake->precomputed_static_static, NOISE_HASH_LEN, + NOISE_SYMMETRIC_KEY_LEN, 0, NOISE_PUBLIC_KEY_LEN, + handshake->chaining_key); + + /* {t} */ + tai64n_now(timestamp); + message_encrypt(dst->encrypted_timestamp, timestamp, + NOISE_TIMESTAMP_LEN, key, handshake->hash); + + dst->sender_index = wg_index_hashtable_insert( + &handshake->entry.peer->device->index_hashtable, + &handshake->entry); + + handshake->state = HANDSHAKE_CREATED_INITIATION; + ret = true; + +out: + up_write(&handshake->lock); + up_read(&handshake->static_identity->lock); + memzero_explicit(key, NOISE_SYMMETRIC_KEY_LEN); + return ret; +} + +struct wireguard_peer * +wg_noise_handshake_consume_initiation(struct message_handshake_initiation *src, + struct wireguard_device *wg) +{ + struct wireguard_peer *peer = NULL, *ret_peer = NULL; + struct noise_handshake *handshake; + bool replay_attack, flood_attack; + u8 key[NOISE_SYMMETRIC_KEY_LEN]; + u8 chaining_key[NOISE_HASH_LEN]; + u8 hash[NOISE_HASH_LEN]; + u8 s[NOISE_PUBLIC_KEY_LEN]; + u8 e[NOISE_PUBLIC_KEY_LEN]; + u8 t[NOISE_TIMESTAMP_LEN]; + + down_read(&wg->static_identity.lock); + if (unlikely(!wg->static_identity.has_identity)) + goto out; + + handshake_init(chaining_key, hash, wg->static_identity.static_public); + + /* e */ + 
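+	/* message_ephemeral() both mixes the initiator's public ephemeral
+	 * into the transcript hash and ratchets it into the chaining key.
+	 */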
message_ephemeral(e, src->unencrypted_ephemeral, chaining_key, hash); + + /* es */ + if (!mix_dh(chaining_key, key, wg->static_identity.static_private, e)) + goto out; + + /* s */ + if (!message_decrypt(s, src->encrypted_static, + sizeof(src->encrypted_static), key, hash)) + goto out; + + /* Lookup which peer we're actually talking to */ + peer = wg_pubkey_hashtable_lookup(&wg->peer_hashtable, s); + if (!peer) + goto out; + handshake = &peer->handshake; + + /* ss */ + kdf(chaining_key, key, NULL, handshake->precomputed_static_static, + NOISE_HASH_LEN, NOISE_SYMMETRIC_KEY_LEN, 0, NOISE_PUBLIC_KEY_LEN, + chaining_key); + + /* {t} */ + if (!message_decrypt(t, src->encrypted_timestamp, + sizeof(src->encrypted_timestamp), key, hash)) + goto out; + + down_read(&handshake->lock); + replay_attack = memcmp(t, handshake->latest_timestamp, + NOISE_TIMESTAMP_LEN) <= 0; + flood_attack = handshake->last_initiation_consumption + + NSEC_PER_SEC / INITIATIONS_PER_SECOND > + ktime_get_boot_fast_ns(); + up_read(&handshake->lock); + if (replay_attack || flood_attack) + goto out; + + /* Success! Copy everything to peer */ + down_write(&handshake->lock); + memcpy(handshake->remote_ephemeral, e, NOISE_PUBLIC_KEY_LEN); + memcpy(handshake->latest_timestamp, t, NOISE_TIMESTAMP_LEN); + memcpy(handshake->hash, hash, NOISE_HASH_LEN); + memcpy(handshake->chaining_key, chaining_key, NOISE_HASH_LEN); + handshake->remote_index = src->sender_index; + handshake->last_initiation_consumption = ktime_get_boot_fast_ns(); + handshake->state = HANDSHAKE_CONSUMED_INITIATION; + up_write(&handshake->lock); + ret_peer = peer; + +out: + memzero_explicit(key, NOISE_SYMMETRIC_KEY_LEN); + memzero_explicit(hash, NOISE_HASH_LEN); + memzero_explicit(chaining_key, NOISE_HASH_LEN); + up_read(&wg->static_identity.lock); + if (!ret_peer) + wg_peer_put(peer); + return ret_peer; +} + +bool wg_noise_handshake_create_response(struct message_handshake_response *dst, + struct noise_handshake *handshake) +{ + bool ret = false; + u8 key[NOISE_SYMMETRIC_KEY_LEN]; + + /* We need to wait for crng _before_ taking any locks, since + * curve25519_generate_secret uses get_random_bytes_wait. 
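+ * Blocking for the RNG here, rather than under the semaphores taken + * below, keeps an unseeded RNG from stalling every other handshake + * on this device while we sleep.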
+ */ + wait_for_random_bytes(); + + down_read(&handshake->static_identity->lock); + down_write(&handshake->lock); + + if (handshake->state != HANDSHAKE_CONSUMED_INITIATION) + goto out; + + dst->header.type = cpu_to_le32(MESSAGE_HANDSHAKE_RESPONSE); + dst->receiver_index = handshake->remote_index; + + /* e */ + curve25519_generate_secret(handshake->ephemeral_private); + if (!curve25519_generate_public(dst->unencrypted_ephemeral, + handshake->ephemeral_private)) + goto out; + message_ephemeral(dst->unencrypted_ephemeral, + dst->unencrypted_ephemeral, handshake->chaining_key, + handshake->hash); + + /* ee */ + if (!mix_dh(handshake->chaining_key, NULL, handshake->ephemeral_private, + handshake->remote_ephemeral)) + goto out; + + /* se */ + if (!mix_dh(handshake->chaining_key, NULL, handshake->ephemeral_private, + handshake->remote_static)) + goto out; + + /* psk */ + mix_psk(handshake->chaining_key, handshake->hash, key, + handshake->preshared_key); + + /* {} */ + message_encrypt(dst->encrypted_nothing, NULL, 0, key, handshake->hash); + + dst->sender_index = wg_index_hashtable_insert( + &handshake->entry.peer->device->index_hashtable, + &handshake->entry); + + handshake->state = HANDSHAKE_CREATED_RESPONSE; + ret = true; + +out: + up_write(&handshake->lock); + up_read(&handshake->static_identity->lock); + memzero_explicit(key, NOISE_SYMMETRIC_KEY_LEN); + return ret; +} + +struct wireguard_peer * +wg_noise_handshake_consume_response(struct message_handshake_response *src, + struct wireguard_device *wg) +{ + struct noise_handshake *handshake; + struct wireguard_peer *peer = NULL, *ret_peer = NULL; + u8 key[NOISE_SYMMETRIC_KEY_LEN]; + u8 hash[NOISE_HASH_LEN]; + u8 chaining_key[NOISE_HASH_LEN]; + u8 e[NOISE_PUBLIC_KEY_LEN]; + u8 ephemeral_private[NOISE_PUBLIC_KEY_LEN]; + u8 static_private[NOISE_PUBLIC_KEY_LEN]; + enum noise_handshake_state state = HANDSHAKE_ZEROED; + + down_read(&wg->static_identity.lock); + + if (unlikely(!wg->static_identity.has_identity)) + goto out; + + handshake = (struct noise_handshake *)wg_index_hashtable_lookup( + &wg->index_hashtable, INDEX_HASHTABLE_HANDSHAKE, + src->receiver_index, &peer); + if (unlikely(!handshake)) + goto out; + + down_read(&handshake->lock); + state = handshake->state; + memcpy(hash, handshake->hash, NOISE_HASH_LEN); + memcpy(chaining_key, handshake->chaining_key, NOISE_HASH_LEN); + memcpy(ephemeral_private, handshake->ephemeral_private, + NOISE_PUBLIC_KEY_LEN); + up_read(&handshake->lock); + + if (state != HANDSHAKE_CREATED_INITIATION) + goto fail; + + /* e */ + message_ephemeral(e, src->unencrypted_ephemeral, chaining_key, hash); + + /* ee */ + if (!mix_dh(chaining_key, NULL, ephemeral_private, e)) + goto fail; + + /* se */ + if (!mix_dh(chaining_key, NULL, wg->static_identity.static_private, e)) + goto fail; + + /* psk */ + mix_psk(chaining_key, hash, key, handshake->preshared_key); + + /* {} */ + if (!message_decrypt(NULL, src->encrypted_nothing, + sizeof(src->encrypted_nothing), key, hash)) + goto fail; + + /* Success! Copy everything to peer */ + down_write(&handshake->lock); + /* It's important to check that the state is still the same, while we + * have an exclusive lock. 
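+ * The state and transcript were copied out under a read lock, and the + * DH computations above ran with no lock held at all, so another CPU + * may have consumed or cleared this handshake in the meantime; if the + * state moved, the transcript we computed no longer applies.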
+ */ + if (handshake->state != state) { + up_write(&handshake->lock); + goto fail; + } + memcpy(handshake->remote_ephemeral, e, NOISE_PUBLIC_KEY_LEN); + memcpy(handshake->hash, hash, NOISE_HASH_LEN); + memcpy(handshake->chaining_key, chaining_key, NOISE_HASH_LEN); + handshake->remote_index = src->sender_index; + handshake->state = HANDSHAKE_CONSUMED_RESPONSE; + up_write(&handshake->lock); + ret_peer = peer; + goto out; + +fail: + wg_peer_put(peer); +out: + memzero_explicit(key, NOISE_SYMMETRIC_KEY_LEN); + memzero_explicit(hash, NOISE_HASH_LEN); + memzero_explicit(chaining_key, NOISE_HASH_LEN); + memzero_explicit(ephemeral_private, NOISE_PUBLIC_KEY_LEN); + memzero_explicit(static_private, NOISE_PUBLIC_KEY_LEN); + up_read(&wg->static_identity.lock); + return ret_peer; +} + +bool wg_noise_handshake_begin_session(struct noise_handshake *handshake, + struct noise_keypairs *keypairs) +{ + struct noise_keypair *new_keypair; + bool ret = false; + + down_write(&handshake->lock); + if (handshake->state != HANDSHAKE_CREATED_RESPONSE && + handshake->state != HANDSHAKE_CONSUMED_RESPONSE) + goto out; + + new_keypair = keypair_create(handshake->entry.peer); + if (!new_keypair) + goto out; + new_keypair->i_am_the_initiator = handshake->state == + HANDSHAKE_CONSUMED_RESPONSE; + new_keypair->remote_index = handshake->remote_index; + + if (new_keypair->i_am_the_initiator) + derive_keys(&new_keypair->sending, &new_keypair->receiving, + handshake->chaining_key); + else + derive_keys(&new_keypair->receiving, &new_keypair->sending, + handshake->chaining_key); + + handshake_zero(handshake); + rcu_read_lock_bh(); + if (likely(!container_of(handshake, struct wireguard_peer, + handshake)->is_dead)) { + add_new_keypair(keypairs, new_keypair); + net_dbg_ratelimited("%s: Keypair %llu created for peer %llu\n", + handshake->entry.peer->device->dev->name, + new_keypair->internal_id, + handshake->entry.peer->internal_id); + ret = wg_index_hashtable_replace( + &handshake->entry.peer->device->index_hashtable, + &handshake->entry, &new_keypair->entry); + } else + kzfree(new_keypair); + rcu_read_unlock_bh(); + +out: + up_write(&handshake->lock); + return ret; +} diff --git a/drivers/net/wireguard/noise.h b/drivers/net/wireguard/noise.h new file mode 100644 index 000000000000..7fe2c62f070a --- /dev/null +++ b/drivers/net/wireguard/noise.h @@ -0,0 +1,130 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (C) 2015-2018 Jason A. Donenfeld . All Rights Reserved. 
+ */ +#ifndef _WG_NOISE_H +#define _WG_NOISE_H + +#include "messages.h" +#include "hashtables.h" + +#include +#include +#include +#include +#include +#include +#include + +union noise_counter { + struct { + u64 counter; + unsigned long backtrack[COUNTER_BITS_TOTAL / BITS_PER_LONG]; + spinlock_t lock; + } receive; + atomic64_t counter; +}; + +struct noise_symmetric_key { + u8 key[NOISE_SYMMETRIC_KEY_LEN]; + union noise_counter counter; + u64 birthdate; + bool is_valid; +}; + +struct noise_keypair { + struct index_hashtable_entry entry; + struct noise_symmetric_key sending; + struct noise_symmetric_key receiving; + __le32 remote_index; + bool i_am_the_initiator; + struct kref refcount; + struct rcu_head rcu; + u64 internal_id; +}; + +struct noise_keypairs { + struct noise_keypair __rcu *current_keypair; + struct noise_keypair __rcu *previous_keypair; + struct noise_keypair __rcu *next_keypair; + spinlock_t keypair_update_lock; +}; + +struct noise_static_identity { + u8 static_public[NOISE_PUBLIC_KEY_LEN]; + u8 static_private[NOISE_PUBLIC_KEY_LEN]; + struct rw_semaphore lock; + bool has_identity; +}; + +enum noise_handshake_state { + HANDSHAKE_ZEROED, + HANDSHAKE_CREATED_INITIATION, + HANDSHAKE_CONSUMED_INITIATION, + HANDSHAKE_CREATED_RESPONSE, + HANDSHAKE_CONSUMED_RESPONSE +}; + +struct noise_handshake { + struct index_hashtable_entry entry; + + enum noise_handshake_state state; + u64 last_initiation_consumption; + + struct noise_static_identity *static_identity; + + u8 ephemeral_private[NOISE_PUBLIC_KEY_LEN]; + u8 remote_static[NOISE_PUBLIC_KEY_LEN]; + u8 remote_ephemeral[NOISE_PUBLIC_KEY_LEN]; + u8 precomputed_static_static[NOISE_PUBLIC_KEY_LEN]; + + u8 preshared_key[NOISE_SYMMETRIC_KEY_LEN]; + + u8 hash[NOISE_HASH_LEN]; + u8 chaining_key[NOISE_HASH_LEN]; + + u8 latest_timestamp[NOISE_TIMESTAMP_LEN]; + __le32 remote_index; + + /* Protects all members except the immutable (after noise_handshake_ + * init): remote_static, precomputed_static_static, static_identity. 
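+ * Readers that only compare or copy these fields take the semaphore + * for reading; anything that advances state or rewrites keys takes it + * for writing.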
*/ + struct rw_semaphore lock; +}; + +struct wireguard_device; + +void wg_noise_init(void); +bool wg_noise_handshake_init(struct noise_handshake *handshake, + struct noise_static_identity *static_identity, + const u8 peer_public_key[NOISE_PUBLIC_KEY_LEN], + const u8 peer_preshared_key[NOISE_SYMMETRIC_KEY_LEN], + struct wireguard_peer *peer); +void wg_noise_handshake_clear(struct noise_handshake *handshake); +void wg_noise_keypair_put(struct noise_keypair *keypair, bool unreference_now); +struct noise_keypair *wg_noise_keypair_get(struct noise_keypair *keypair); +void wg_noise_keypairs_clear(struct noise_keypairs *keypairs); +bool wg_noise_received_with_keypair(struct noise_keypairs *keypairs, + struct noise_keypair *received_keypair); + +void wg_noise_set_static_identity_private_key( + struct noise_static_identity *static_identity, + const u8 private_key[NOISE_PUBLIC_KEY_LEN]); +bool wg_noise_precompute_static_static(struct wireguard_peer *peer); + +bool +wg_noise_handshake_create_initiation(struct message_handshake_initiation *dst, + struct noise_handshake *handshake); +struct wireguard_peer * +wg_noise_handshake_consume_initiation(struct message_handshake_initiation *src, + struct wireguard_device *wg); + +bool wg_noise_handshake_create_response(struct message_handshake_response *dst, + struct noise_handshake *handshake); +struct wireguard_peer * +wg_noise_handshake_consume_response(struct message_handshake_response *src, + struct wireguard_device *wg); + +bool wg_noise_handshake_begin_session(struct noise_handshake *handshake, + struct noise_keypairs *keypairs); + +#endif /* _WG_NOISE_H */ diff --git a/drivers/net/wireguard/peer.c b/drivers/net/wireguard/peer.c new file mode 100644 index 000000000000..c4737ae2ff53 --- /dev/null +++ b/drivers/net/wireguard/peer.c @@ -0,0 +1,194 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (C) 2015-2018 Jason A. Donenfeld . All Rights Reserved. 
+ */ + +#include "peer.h" +#include "device.h" +#include "queueing.h" +#include "timers.h" +#include "hashtables.h" +#include "noise.h" + +#include +#include +#include +#include + +static atomic64_t peer_counter = ATOMIC64_INIT(0); + +struct wireguard_peer * +wg_peer_create(struct wireguard_device *wg, + const u8 public_key[NOISE_PUBLIC_KEY_LEN], + const u8 preshared_key[NOISE_SYMMETRIC_KEY_LEN]) +{ + struct wireguard_peer *peer; + + lockdep_assert_held(&wg->device_update_lock); + + if (wg->num_peers >= MAX_PEERS_PER_DEVICE) + return NULL; + + peer = kzalloc(sizeof(*peer), GFP_KERNEL); + if (unlikely(!peer)) + return NULL; + peer->device = wg; + + if (!wg_noise_handshake_init(&peer->handshake, &wg->static_identity, + public_key, preshared_key, peer)) + goto err_1; + if (dst_cache_init(&peer->endpoint_cache, GFP_KERNEL)) + goto err_1; + if (wg_packet_queue_init(&peer->tx_queue, wg_packet_tx_worker, false, + MAX_QUEUED_PACKETS)) + goto err_2; + if (wg_packet_queue_init(&peer->rx_queue, NULL, false, + MAX_QUEUED_PACKETS)) + goto err_3; + + peer->internal_id = atomic64_inc_return(&peer_counter); + peer->serial_work_cpu = nr_cpumask_bits; + wg_cookie_init(&peer->latest_cookie); + wg_timers_init(peer); + wg_cookie_checker_precompute_peer_keys(peer); + spin_lock_init(&peer->keypairs.keypair_update_lock); + INIT_WORK(&peer->transmit_handshake_work, + wg_packet_handshake_send_worker); + rwlock_init(&peer->endpoint_lock); + kref_init(&peer->refcount); + skb_queue_head_init(&peer->staged_packet_queue); + atomic64_set(&peer->last_sent_handshake, + ktime_get_boot_fast_ns() - + (u64)(REKEY_TIMEOUT + 1) * NSEC_PER_SEC); + set_bit(NAPI_STATE_NO_BUSY_POLL, &peer->napi.state); + netif_napi_add(wg->dev, &peer->napi, wg_packet_rx_poll, + NAPI_POLL_WEIGHT); + napi_enable(&peer->napi); + list_add_tail(&peer->peer_list, &wg->peer_list); + wg_pubkey_hashtable_add(&wg->peer_hashtable, peer); + ++wg->num_peers; + pr_debug("%s: Peer %llu created\n", wg->dev->name, peer->internal_id); + return peer; + +err_3: + wg_packet_queue_free(&peer->tx_queue, false); +err_2: + dst_cache_destroy(&peer->endpoint_cache); +err_1: + kfree(peer); + return NULL; +} + +struct wireguard_peer *wg_peer_get_maybe_zero(struct wireguard_peer *peer) +{ + RCU_LOCKDEP_WARN(!rcu_read_lock_bh_held(), + "Taking peer reference without holding the RCU read lock"); + if (unlikely(!peer || !kref_get_unless_zero(&peer->refcount))) + return NULL; + return peer; +} + +/* We have a separate "remove" function to get rid of the final reference + * because peer_list, clearing handshakes, and flushing all require mutexes + * which requires sleeping, which must only be done from certain contexts. + */ +void wg_peer_remove(struct wireguard_peer *peer) +{ + if (unlikely(!peer)) + return; + lockdep_assert_held(&peer->device->device_update_lock); + + /* Remove from configuration-time lookup structures so new packets + * can't enter. + */ + list_del_init(&peer->peer_list); + wg_allowedips_remove_by_peer(&peer->device->peer_allowedips, peer, + &peer->device->device_update_lock); + wg_pubkey_hashtable_remove(&peer->device->peer_hashtable, peer); + + /* Mark as dead, so that we don't allow jumping contexts after. */ + WRITE_ONCE(peer->is_dead, true); + synchronize_rcu_bh(); + + /* Now that no more keypairs can be created for this peer, we destroy + * existing ones. + */ + wg_noise_keypairs_clear(&peer->keypairs); + + /* Destroy all ongoing timers that were in-flight at the beginning of + * this function. 
+ */ + wg_timers_stop(peer); + + /* The transition between packet encryption/decryption queues isn't + * guarded by is_dead, but each reference's life is strictly bounded by + * two generations: once for parallel crypto and once for serial + * ingestion, so we can simply flush twice, and be sure that we no + * longer have references inside these queues. + */ + + /* a) For encrypt/decrypt. */ + flush_workqueue(peer->device->packet_crypt_wq); + /* b.1) For send (but not receive, since that's napi). */ + flush_workqueue(peer->device->packet_crypt_wq); + /* b.2.1) For receive (but not send, since that's wq). */ + napi_disable(&peer->napi); + /* b.2.2) It's now safe to remove the napi struct, which must be done + * here from process context. + */ + netif_napi_del(&peer->napi); + + /* Ensure any workstructs we own (like transmit_handshake_work or + * clear_peer_work) are no longer in use. + */ + flush_workqueue(peer->device->handshake_send_wq); + + --peer->device->num_peers; + wg_peer_put(peer); +} + +static void rcu_release(struct rcu_head *rcu) +{ + struct wireguard_peer *peer = + container_of(rcu, struct wireguard_peer, rcu); + dst_cache_destroy(&peer->endpoint_cache); + wg_packet_queue_free(&peer->rx_queue, false); + wg_packet_queue_free(&peer->tx_queue, false); + kzfree(peer); +} + +static void kref_release(struct kref *refcount) +{ + struct wireguard_peer *peer = + container_of(refcount, struct wireguard_peer, refcount); + pr_debug("%s: Peer %llu (%pISpfsc) destroyed\n", + peer->device->dev->name, peer->internal_id, + &peer->endpoint.addr); + /* Remove ourselves from dynamic runtime lookup structures, now that the + * last reference is gone. + */ + wg_index_hashtable_remove(&peer->device->index_hashtable, + &peer->handshake.entry); + /* Remove any lingering packets that didn't have a chance to be + * transmitted. + */ + skb_queue_purge(&peer->staged_packet_queue); + /* Free the memory used. */ + call_rcu_bh(&peer->rcu, rcu_release); +} + +void wg_peer_put(struct wireguard_peer *peer) +{ + if (unlikely(!peer)) + return; + kref_put(&peer->refcount, kref_release); +} + +void wg_peer_remove_all(struct wireguard_device *wg) +{ + struct wireguard_peer *peer, *temp; + + lockdep_assert_held(&wg->device_update_lock); + list_for_each_entry_safe (peer, temp, &wg->peer_list, peer_list) + wg_peer_remove(peer); +} diff --git a/drivers/net/wireguard/peer.h b/drivers/net/wireguard/peer.h new file mode 100644 index 000000000000..2811b615475d --- /dev/null +++ b/drivers/net/wireguard/peer.h @@ -0,0 +1,87 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (C) 2015-2018 Jason A. Donenfeld . All Rights Reserved.
+ */ + +#ifndef _WG_PEER_H +#define _WG_PEER_H + +#include "device.h" +#include "noise.h" +#include "cookie.h" + +#include +#include +#include +#include +#include + +struct wireguard_device; + +struct endpoint { + union { + struct sockaddr addr; + struct sockaddr_in addr4; + struct sockaddr_in6 addr6; + }; + union { + struct { + struct in_addr src4; + /* Essentially the same as addr6->scope_id */ + int src_if4; + }; + struct in6_addr src6; + }; +}; + +struct wireguard_peer { + struct wireguard_device *device; + struct crypt_queue tx_queue, rx_queue; + struct sk_buff_head staged_packet_queue; + int serial_work_cpu; + struct noise_keypairs keypairs; + struct endpoint endpoint; + struct dst_cache endpoint_cache; + rwlock_t endpoint_lock; + struct noise_handshake handshake; + atomic64_t last_sent_handshake; + struct work_struct transmit_handshake_work, clear_peer_work; + struct cookie latest_cookie; + struct hlist_node pubkey_hash; + u64 rx_bytes, tx_bytes; + struct timer_list timer_retransmit_handshake, timer_send_keepalive; + struct timer_list timer_new_handshake, timer_zero_key_material; + struct timer_list timer_persistent_keepalive; + unsigned int timer_handshake_attempts; + u16 persistent_keepalive_interval; + bool timers_enabled, timer_need_another_keepalive; + bool sent_lastminute_handshake; + struct timespec walltime_last_handshake; + struct kref refcount; + struct rcu_head rcu; + struct list_head peer_list; + u64 internal_id; + struct napi_struct napi; + bool is_dead; +}; + +struct wireguard_peer * +wg_peer_create(struct wireguard_device *wg, + const u8 public_key[NOISE_PUBLIC_KEY_LEN], + const u8 preshared_key[NOISE_SYMMETRIC_KEY_LEN]); + +struct wireguard_peer *__must_check +wg_peer_get_maybe_zero(struct wireguard_peer *peer); +static inline struct wireguard_peer *wg_peer_get(struct wireguard_peer *peer) +{ + kref_get(&peer->refcount); + return peer; +} +void wg_peer_put(struct wireguard_peer *peer); +void wg_peer_remove(struct wireguard_peer *peer); +void wg_peer_remove_all(struct wireguard_device *wg); + +struct wireguard_peer *wg_peer_lookup_by_index(struct wireguard_device *wg, + u32 index); + +#endif /* _WG_PEER_H */ diff --git a/drivers/net/wireguard/queueing.c b/drivers/net/wireguard/queueing.c new file mode 100644 index 000000000000..939aac997d22 --- /dev/null +++ b/drivers/net/wireguard/queueing.c @@ -0,0 +1,52 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (C) 2015-2018 Jason A. Donenfeld . All Rights Reserved. 
+ */ + +#include "queueing.h" + +struct multicore_worker __percpu * +wg_packet_alloc_percpu_multicore_worker(work_func_t function, void *ptr) +{ + int cpu; + struct multicore_worker __percpu *worker = + alloc_percpu(struct multicore_worker); + + if (!worker) + return NULL; + + for_each_possible_cpu (cpu) { + per_cpu_ptr(worker, cpu)->ptr = ptr; + INIT_WORK(&per_cpu_ptr(worker, cpu)->work, function); + } + return worker; +} + +int wg_packet_queue_init(struct crypt_queue *queue, work_func_t function, + bool multicore, unsigned int len) +{ + int ret; + + memset(queue, 0, sizeof(*queue)); + ret = ptr_ring_init(&queue->ring, len, GFP_KERNEL); + if (ret) + return ret; + if (function) { + if (multicore) { + queue->worker = wg_packet_alloc_percpu_multicore_worker( + function, queue); + if (!queue->worker) + return -ENOMEM; + } else + INIT_WORK(&queue->work, function); + } + return 0; +} + +void wg_packet_queue_free(struct crypt_queue *queue, bool multicore) +{ + if (multicore) + free_percpu(queue->worker); + WARN_ON(!__ptr_ring_empty(&queue->ring)); + ptr_ring_cleanup(&queue->ring, NULL); +} diff --git a/drivers/net/wireguard/queueing.h b/drivers/net/wireguard/queueing.h new file mode 100644 index 000000000000..9a089caced09 --- /dev/null +++ b/drivers/net/wireguard/queueing.h @@ -0,0 +1,193 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (C) 2015-2018 Jason A. Donenfeld . All Rights Reserved. + */ + +#ifndef _WG_QUEUEING_H +#define _WG_QUEUEING_H + +#include "peer.h" +#include +#include +#include +#include + +struct wireguard_device; +struct wireguard_peer; +struct multicore_worker; +struct crypt_queue; +struct sk_buff; + +/* queueing.c APIs: */ +int wg_packet_queue_init(struct crypt_queue *queue, work_func_t function, + bool multicore, unsigned int len); +void wg_packet_queue_free(struct crypt_queue *queue, bool multicore); +struct multicore_worker __percpu * +wg_packet_alloc_percpu_multicore_worker(work_func_t function, void *ptr); + +/* receive.c APIs: */ +void wg_packet_receive(struct wireguard_device *wg, struct sk_buff *skb); +void wg_packet_handshake_receive_worker(struct work_struct *work); +/* NAPI poll function: */ +int wg_packet_rx_poll(struct napi_struct *napi, int budget); +/* Workqueue worker: */ +void wg_packet_decrypt_worker(struct work_struct *work); + +/* send.c APIs: */ +void wg_packet_send_queued_handshake_initiation(struct wireguard_peer *peer, + bool is_retry); +void wg_packet_send_handshake_response(struct wireguard_peer *peer); +void wg_packet_send_handshake_cookie(struct wireguard_device *wg, + struct sk_buff *initiating_skb, + __le32 sender_index); +void wg_packet_send_keepalive(struct wireguard_peer *peer); +void wg_packet_send_staged_packets(struct wireguard_peer *peer); +/* Workqueue workers: */ +void wg_packet_handshake_send_worker(struct work_struct *work); +void wg_packet_tx_worker(struct work_struct *work); +void wg_packet_encrypt_worker(struct work_struct *work); + +enum packet_state { + PACKET_STATE_UNCRYPTED, + PACKET_STATE_CRYPTED, + PACKET_STATE_DEAD +}; + +struct packet_cb { + u64 nonce; + struct noise_keypair *keypair; + atomic_t state; + u32 mtu; + u8 ds; +}; + +#define PACKET_PEER(skb) (((struct packet_cb *)skb->cb)->keypair->entry.peer) +#define PACKET_CB(skb) ((struct packet_cb *)skb->cb) + +/* Returns either the correct skb->protocol value, or 0 if invalid. 
*/ +static inline __be16 wg_skb_examine_untrusted_ip_hdr(struct sk_buff *skb) +{ + if (skb_network_header(skb) >= skb->head && + (skb_network_header(skb) + sizeof(struct iphdr)) <= + skb_tail_pointer(skb) && + ip_hdr(skb)->version == 4) + return htons(ETH_P_IP); + if (skb_network_header(skb) >= skb->head && + (skb_network_header(skb) + sizeof(struct ipv6hdr)) <= + skb_tail_pointer(skb) && + ipv6_hdr(skb)->version == 6) + return htons(ETH_P_IPV6); + return 0; +} + +static inline void wg_reset_packet(struct sk_buff *skb) +{ + const int pfmemalloc = skb->pfmemalloc; + skb_scrub_packet(skb, true); + memset(&skb->headers_start, 0, + offsetof(struct sk_buff, headers_end) - + offsetof(struct sk_buff, headers_start)); + skb->pfmemalloc = pfmemalloc; + skb->queue_mapping = 0; + skb->nohdr = 0; + skb->peeked = 0; + skb->mac_len = 0; + skb->dev = NULL; +#ifdef CONFIG_NET_SCHED + skb->tc_index = 0; + skb_reset_tc(skb); +#endif + skb->hdr_len = skb_headroom(skb); + skb_reset_mac_header(skb); + skb_reset_network_header(skb); + skb_probe_transport_header(skb, 0); + skb_reset_inner_headers(skb); +} + +static inline int wg_cpumask_choose_online(int *stored_cpu, unsigned int id) +{ + unsigned int cpu = *stored_cpu, cpu_index, i; + + if (unlikely(cpu == nr_cpumask_bits || + !cpumask_test_cpu(cpu, cpu_online_mask))) { + cpu_index = id % cpumask_weight(cpu_online_mask); + cpu = cpumask_first(cpu_online_mask); + for (i = 0; i < cpu_index; ++i) + cpu = cpumask_next(cpu, cpu_online_mask); + *stored_cpu = cpu; + } + return cpu; +} + +/* This function is racy, in the sense that next is unlocked, so it could return + * the same CPU twice. A race-free version of this would be to instead store an + * atomic sequence number, do an increment-and-return, and then iterate through + * every possible CPU until we get to that index -- choose_cpu. However that's + * a bit slower, and it doesn't seem like this potential race actually + * introduces any performance loss, so we live with it. + */ +static inline int wg_cpumask_next_online(int *next) +{ + int cpu = *next; + + while (unlikely(!cpumask_test_cpu(cpu, cpu_online_mask))) + cpu = cpumask_next(cpu, cpu_online_mask) % nr_cpumask_bits; + *next = cpumask_next(cpu, cpu_online_mask) % nr_cpumask_bits; + return cpu; +} + +static inline int wg_queue_enqueue_per_device_and_peer( + struct crypt_queue *device_queue, struct crypt_queue *peer_queue, + struct sk_buff *skb, struct workqueue_struct *wq, int *next_cpu) +{ + int cpu; + + atomic_set_release(&PACKET_CB(skb)->state, PACKET_STATE_UNCRYPTED); + /* We first queue this up for the peer ingestion, but the consumer + * will wait for the state to change to CRYPTED or DEAD before. + */ + if (unlikely(ptr_ring_produce_bh(&peer_queue->ring, skb))) + return -ENOSPC; + /* Then we queue it up in the device queue, which consumes the + * packet as soon as it can. + */ + cpu = wg_cpumask_next_online(next_cpu); + if (unlikely(ptr_ring_produce_bh(&device_queue->ring, skb))) + return -EPIPE; + queue_work_on(cpu, wq, &per_cpu_ptr(device_queue->worker, cpu)->work); + return 0; +} + +static inline void wg_queue_enqueue_per_peer(struct crypt_queue *queue, + struct sk_buff *skb, + enum packet_state state) +{ + /* We take a reference, because as soon as we call atomic_set, the + * peer can be freed from below us. 
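+ * The atomic_set_release below pairs with the atomic_read_acquire in + * the consumer (see wg_packet_rx_poll), which treats any state other + * than UNCRYPTED as permission to dequeue and free the skb.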
+ */ + struct wireguard_peer *peer = wg_peer_get(PACKET_PEER(skb)); + atomic_set_release(&PACKET_CB(skb)->state, state); + queue_work_on(wg_cpumask_choose_online(&peer->serial_work_cpu, + peer->internal_id), + peer->device->packet_crypt_wq, &queue->work); + wg_peer_put(peer); +} + +static inline void wg_queue_enqueue_per_peer_napi(struct crypt_queue *queue, + struct sk_buff *skb, + enum packet_state state) +{ + /* We take a reference, because as soon as we call atomic_set, the + * peer can be freed from below us. + */ + struct wireguard_peer *peer = wg_peer_get(PACKET_PEER(skb)); + atomic_set_release(&PACKET_CB(skb)->state, state); + napi_schedule(&peer->napi); + wg_peer_put(peer); +} + +#ifdef DEBUG +bool wg_packet_counter_selftest(void); +#endif + +#endif /* _WG_QUEUEING_H */ diff --git a/drivers/net/wireguard/ratelimiter.c b/drivers/net/wireguard/ratelimiter.c new file mode 100644 index 000000000000..9c9bd8d34682 --- /dev/null +++ b/drivers/net/wireguard/ratelimiter.c @@ -0,0 +1,220 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (C) 2015-2018 Jason A. Donenfeld . All Rights Reserved. + */ + +#include "ratelimiter.h" +#include +#include +#include +#include + +static struct kmem_cache *entry_cache; +static hsiphash_key_t key; +static spinlock_t table_lock = __SPIN_LOCK_UNLOCKED("ratelimiter_table_lock"); +static DEFINE_MUTEX(init_lock); +static atomic64_t refcnt = ATOMIC64_INIT(0); +static atomic_t total_entries = ATOMIC_INIT(0); +static unsigned int max_entries, table_size; +static void gc_entries(struct work_struct *); +static DECLARE_DEFERRABLE_WORK(gc_work, gc_entries); +static struct hlist_head *table_v4; +#if IS_ENABLED(CONFIG_IPV6) +static struct hlist_head *table_v6; +#endif + +struct ratelimiter_entry { + u64 last_time_ns, tokens; + __be64 ip; + void *net; + spinlock_t lock; + struct hlist_node hash; + struct rcu_head rcu; +}; + +enum { + PACKETS_PER_SECOND = 20, + PACKETS_BURSTABLE = 5, + PACKET_COST = NSEC_PER_SEC / PACKETS_PER_SECOND, + TOKEN_MAX = PACKET_COST * PACKETS_BURSTABLE +}; + +static void entry_free(struct rcu_head *rcu) +{ + kmem_cache_free(entry_cache, + container_of(rcu, struct ratelimiter_entry, rcu)); + atomic_dec(&total_entries); +} + +static void entry_uninit(struct ratelimiter_entry *entry) +{ + hlist_del_rcu(&entry->hash); + call_rcu(&entry->rcu, entry_free); +} + +/* Calling this function with a NULL work uninits all entries. 
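+ * In the periodic case, entries idle for more than one second are + * reaped and the work re-queues itself; wg_ratelimiter_uninit passes + * NULL to force every entry out regardless of age.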
*/ +static void gc_entries(struct work_struct *work) +{ + const u64 now = ktime_get_boot_fast_ns(); + struct ratelimiter_entry *entry; + struct hlist_node *temp; + unsigned int i; + + for (i = 0; i < table_size; ++i) { + spin_lock(&table_lock); + hlist_for_each_entry_safe (entry, temp, &table_v4[i], hash) { + if (unlikely(!work) || + now - entry->last_time_ns > NSEC_PER_SEC) + entry_uninit(entry); + } +#if IS_ENABLED(CONFIG_IPV6) + hlist_for_each_entry_safe (entry, temp, &table_v6[i], hash) { + if (unlikely(!work) || + now - entry->last_time_ns > NSEC_PER_SEC) + entry_uninit(entry); + } +#endif + spin_unlock(&table_lock); + if (likely(work)) + cond_resched(); + } + if (likely(work)) + queue_delayed_work(system_power_efficient_wq, &gc_work, HZ); +} + +bool wg_ratelimiter_allow(struct sk_buff *skb, struct net *net) +{ + struct { __be64 ip; u32 net; } data = { + .net = (unsigned long)net & 0xffffffff }; + struct ratelimiter_entry *entry; + struct hlist_head *bucket; + + if (skb->protocol == htons(ETH_P_IP)) { + data.ip = (__force __be64)ip_hdr(skb)->saddr; + bucket = &table_v4[hsiphash(&data, sizeof(u32) * 3, &key) & + (table_size - 1)]; + } +#if IS_ENABLED(CONFIG_IPV6) + else if (skb->protocol == htons(ETH_P_IPV6)) { + memcpy(&data.ip, &ipv6_hdr(skb)->saddr, + sizeof(__be64)); /* Only 64 bits */ + bucket = &table_v6[hsiphash(&data, sizeof(u32) * 3, &key) & + (table_size - 1)]; + } +#endif + else + return false; + rcu_read_lock(); + hlist_for_each_entry_rcu (entry, bucket, hash) { + if (entry->net == net && entry->ip == data.ip) { + u64 now, tokens; + bool ret; + /* Quasi-inspired by nft_limit.c, but this is actually a + * slightly different algorithm. Namely, we incorporate + * the burst as part of the maximum tokens, rather than + * as part of the rate. + */ + spin_lock(&entry->lock); + now = ktime_get_boot_fast_ns(); + tokens = min_t(u64, TOKEN_MAX, + entry->tokens + now - + entry->last_time_ns); + entry->last_time_ns = now; + ret = tokens >= PACKET_COST; + entry->tokens = ret ? tokens - PACKET_COST : tokens; + spin_unlock(&entry->lock); + rcu_read_unlock(); + return ret; + } + } + rcu_read_unlock(); + + if (atomic_inc_return(&total_entries) > max_entries) + goto err_oom; + + entry = kmem_cache_alloc(entry_cache, GFP_KERNEL); + if (unlikely(!entry)) + goto err_oom; + + entry->net = net; + entry->ip = data.ip; + INIT_HLIST_NODE(&entry->hash); + spin_lock_init(&entry->lock); + entry->last_time_ns = ktime_get_boot_fast_ns(); + entry->tokens = TOKEN_MAX - PACKET_COST; + spin_lock(&table_lock); + hlist_add_head_rcu(&entry->hash, bucket); + spin_unlock(&table_lock); + return true; + +err_oom: + atomic_dec(&total_entries); + return false; +} + +int wg_ratelimiter_init(void) +{ + mutex_lock(&init_lock); + if (atomic64_inc_return(&refcnt) != 1) + goto out; + + entry_cache = KMEM_CACHE(ratelimiter_entry, 0); + if (!entry_cache) + goto err; + + /* xt_hashlimit.c uses a slightly different algorithm for ratelimiting, + * but what it shares in common is that it uses a massive hashtable. So, + * we borrow their wisdom about good table sizes on different systems + * dependent on RAM. This calculation here comes from there. + */ + table_size = (totalram_pages > (1U << 30) / PAGE_SIZE) ? 
8192 : + max_t(unsigned long, 16, roundup_pow_of_two( + (totalram_pages << PAGE_SHIFT) / + (1U << 14) / sizeof(struct hlist_head))); + max_entries = table_size * 8; + + table_v4 = kvzalloc(table_size * sizeof(*table_v4), GFP_KERNEL); + if (unlikely(!table_v4)) + goto err_kmemcache; + +#if IS_ENABLED(CONFIG_IPV6) + table_v6 = kvzalloc(table_size * sizeof(*table_v6), GFP_KERNEL); + if (unlikely(!table_v6)) { + kvfree(table_v4); + goto err_kmemcache; + } +#endif + + queue_delayed_work(system_power_efficient_wq, &gc_work, HZ); + get_random_bytes(&key, sizeof(key)); +out: + mutex_unlock(&init_lock); + return 0; + +err_kmemcache: + kmem_cache_destroy(entry_cache); +err: + atomic64_dec(&refcnt); + mutex_unlock(&init_lock); + return -ENOMEM; +} + +void wg_ratelimiter_uninit(void) +{ + mutex_lock(&init_lock); + if (atomic64_dec_if_positive(&refcnt)) + goto out; + + cancel_delayed_work_sync(&gc_work); + gc_entries(NULL); + rcu_barrier(); + kvfree(table_v4); +#if IS_ENABLED(CONFIG_IPV6) + kvfree(table_v6); +#endif + kmem_cache_destroy(entry_cache); +out: + mutex_unlock(&init_lock); +} + +#include "selftest/ratelimiter.c" diff --git a/drivers/net/wireguard/ratelimiter.h b/drivers/net/wireguard/ratelimiter.h new file mode 100644 index 000000000000..0325d10b1f76 --- /dev/null +++ b/drivers/net/wireguard/ratelimiter.h @@ -0,0 +1,19 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (C) 2015-2018 Jason A. Donenfeld . All Rights Reserved. + */ + +#ifndef _WG_RATELIMITER_H +#define _WG_RATELIMITER_H + +#include + +int wg_ratelimiter_init(void); +void wg_ratelimiter_uninit(void); +bool wg_ratelimiter_allow(struct sk_buff *skb, struct net *net); + +#ifdef DEBUG +bool wg_ratelimiter_selftest(void); +#endif + +#endif /* _WG_RATELIMITER_H */ diff --git a/drivers/net/wireguard/receive.c b/drivers/net/wireguard/receive.c new file mode 100644 index 000000000000..c38a2745dead --- /dev/null +++ b/drivers/net/wireguard/receive.c @@ -0,0 +1,596 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (C) 2015-2018 Jason A. Donenfeld . All Rights Reserved. + */ + +#include "queueing.h" +#include "device.h" +#include "peer.h" +#include "timers.h" +#include "messages.h" +#include "cookie.h" +#include "socket.h" + +#include +#include +#include +#include +#include + +/* Must be called with bh disabled. 
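+ * The same per-CPU tstats are also updated from NAPI (softirq) + * context, so an update here must not be interrupted mid-sequence by + * that path.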
*/ +static void rx_stats(struct wireguard_peer *peer, size_t len) +{ + struct pcpu_sw_netstats *tstats = + get_cpu_ptr(peer->device->dev->tstats); + + u64_stats_update_begin(&tstats->syncp); + ++tstats->rx_packets; + tstats->rx_bytes += len; + peer->rx_bytes += len; + u64_stats_update_end(&tstats->syncp); + put_cpu_ptr(tstats); +} + +#define SKB_TYPE_LE32(skb) (((struct message_header *)(skb)->data)->type) + +static size_t validate_header_len(struct sk_buff *skb) +{ + if (unlikely(skb->len < sizeof(struct message_header))) + return 0; + if (SKB_TYPE_LE32(skb) == cpu_to_le32(MESSAGE_DATA) && + skb->len >= MESSAGE_MINIMUM_LENGTH) + return sizeof(struct message_data); + if (SKB_TYPE_LE32(skb) == cpu_to_le32(MESSAGE_HANDSHAKE_INITIATION) && + skb->len == sizeof(struct message_handshake_initiation)) + return sizeof(struct message_handshake_initiation); + if (SKB_TYPE_LE32(skb) == cpu_to_le32(MESSAGE_HANDSHAKE_RESPONSE) && + skb->len == sizeof(struct message_handshake_response)) + return sizeof(struct message_handshake_response); + if (SKB_TYPE_LE32(skb) == cpu_to_le32(MESSAGE_HANDSHAKE_COOKIE) && + skb->len == sizeof(struct message_handshake_cookie)) + return sizeof(struct message_handshake_cookie); + return 0; +} + +static int skb_prepare_header(struct sk_buff *skb, struct wireguard_device *wg) +{ + size_t data_offset, data_len, header_len; + struct udphdr *udp; + + if (unlikely(wg_skb_examine_untrusted_ip_hdr(skb) != skb->protocol || + skb_transport_header(skb) < skb->head || + (skb_transport_header(skb) + sizeof(struct udphdr)) > + skb_tail_pointer(skb))) + return -EINVAL; /* Bogus IP header */ + udp = udp_hdr(skb); + data_offset = (u8 *)udp - skb->data; + if (unlikely(data_offset > U16_MAX || + data_offset + sizeof(struct udphdr) > skb->len)) + /* Packet has offset at impossible location or isn't big enough + * to have UDP fields. + */ + return -EINVAL; + data_len = ntohs(udp->len); + if (unlikely(data_len < sizeof(struct udphdr) || + data_len > skb->len - data_offset)) + /* UDP packet is reporting too small of a size or lying about + * its size. + */ + return -EINVAL; + data_len -= sizeof(struct udphdr); + data_offset = (u8 *)udp + sizeof(struct udphdr) - skb->data; + if (unlikely(!pskb_may_pull(skb, + data_offset + sizeof(struct message_header)) || + pskb_trim(skb, data_len + data_offset) < 0)) + return -EINVAL; + skb_pull(skb, data_offset); + if (unlikely(skb->len != data_len)) + /* Final len does not agree with calculated len */ + return -EINVAL; + header_len = validate_header_len(skb); + if (unlikely(!header_len)) + return -EINVAL; + __skb_push(skb, data_offset); + if (unlikely(!pskb_may_pull(skb, data_offset + header_len))) + return -EINVAL; + __skb_pull(skb, data_offset); + return 0; +} + +static void receive_handshake_packet(struct wireguard_device *wg, + struct sk_buff *skb) +{ + struct wireguard_peer *peer = NULL; + enum cookie_mac_state mac_state; + /* This is global, so that our load calculation applies to + * the whole system. 
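+ * Note the hysteresis below: once under load, we keep demanding + * cookies for a full second after the queue drains, so a flood cannot + * cheaply toggle us in and out of the stricter path.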
+ */ + static u64 last_under_load; + bool packet_needs_cookie; + bool under_load; + + if (SKB_TYPE_LE32(skb) == cpu_to_le32(MESSAGE_HANDSHAKE_COOKIE)) { + net_dbg_skb_ratelimited("%s: Receiving cookie response from %pISpfsc\n", + wg->dev->name, skb); + wg_cookie_message_consume( + (struct message_handshake_cookie *)skb->data, wg); + return; + } + + under_load = skb_queue_len(&wg->incoming_handshakes) >= + MAX_QUEUED_INCOMING_HANDSHAKES / 8; + if (under_load) + last_under_load = ktime_get_boot_fast_ns(); + else if (last_under_load) + under_load = !wg_birthdate_has_expired(last_under_load, 1); + mac_state = wg_cookie_validate_packet(&wg->cookie_checker, skb, + under_load); + if ((under_load && mac_state == VALID_MAC_WITH_COOKIE) || + (!under_load && mac_state == VALID_MAC_BUT_NO_COOKIE)) + packet_needs_cookie = false; + else if (under_load && mac_state == VALID_MAC_BUT_NO_COOKIE) + packet_needs_cookie = true; + else { + net_dbg_skb_ratelimited("%s: Invalid MAC of handshake, dropping packet from %pISpfsc\n", + wg->dev->name, skb); + return; + } + + switch (SKB_TYPE_LE32(skb)) { + case cpu_to_le32(MESSAGE_HANDSHAKE_INITIATION): { + struct message_handshake_initiation *message = + (struct message_handshake_initiation *)skb->data; + + if (packet_needs_cookie) { + wg_packet_send_handshake_cookie(wg, skb, + message->sender_index); + return; + } + peer = wg_noise_handshake_consume_initiation(message, wg); + if (unlikely(!peer)) { + net_dbg_skb_ratelimited("%s: Invalid handshake initiation from %pISpfsc\n", + wg->dev->name, skb); + return; + } + wg_socket_set_peer_endpoint_from_skb(peer, skb); + net_dbg_ratelimited("%s: Receiving handshake initiation from peer %llu (%pISpfsc)\n", + wg->dev->name, peer->internal_id, + &peer->endpoint.addr); + wg_packet_send_handshake_response(peer); + break; + } + case cpu_to_le32(MESSAGE_HANDSHAKE_RESPONSE): { + struct message_handshake_response *message = + (struct message_handshake_response *)skb->data; + + if (packet_needs_cookie) { + wg_packet_send_handshake_cookie(wg, skb, + message->sender_index); + return; + } + peer = wg_noise_handshake_consume_response(message, wg); + if (unlikely(!peer)) { + net_dbg_skb_ratelimited("%s: Invalid handshake response from %pISpfsc\n", + wg->dev->name, skb); + return; + } + wg_socket_set_peer_endpoint_from_skb(peer, skb); + net_dbg_ratelimited("%s: Receiving handshake response from peer %llu (%pISpfsc)\n", + wg->dev->name, peer->internal_id, + &peer->endpoint.addr); + if (wg_noise_handshake_begin_session(&peer->handshake, + &peer->keypairs)) { + wg_timers_session_derived(peer); + wg_timers_handshake_complete(peer); + /* Calling this function will either send any existing + * packets in the queue and not send a keepalive, which + * is the best case; or, if there's nothing in the + * queue, it will send a keepalive, in order to give + * immediate confirmation of the session.
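+ * Either way, the first authenticated packet the peer receives under + * the new keys is what confirms the session on its side, so sending + * something immediately shortens that window.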
+ */ + wg_packet_send_keepalive(peer); + } + break; + } + } + + if (unlikely(!peer)) { + WARN(1, "Somehow a wrong type of packet wound up in the handshake queue!\n"); + return; + } + + local_bh_disable(); + rx_stats(peer, skb->len); + local_bh_enable(); + + wg_timers_any_authenticated_packet_received(peer); + wg_timers_any_authenticated_packet_traversal(peer); + wg_peer_put(peer); +} + +void wg_packet_handshake_receive_worker(struct work_struct *work) +{ + struct wireguard_device *wg = + container_of(work, struct multicore_worker, work)->ptr; + struct sk_buff *skb; + + while ((skb = skb_dequeue(&wg->incoming_handshakes)) != NULL) { + receive_handshake_packet(wg, skb); + dev_kfree_skb(skb); + cond_resched(); + } +} + +static void keep_key_fresh(struct wireguard_peer *peer) +{ + struct noise_keypair *keypair; + bool send = false; + + if (peer->sent_lastminute_handshake) + return; + + rcu_read_lock_bh(); + keypair = rcu_dereference_bh(peer->keypairs.current_keypair); + if (likely(keypair && keypair->sending.is_valid) && + keypair->i_am_the_initiator && + unlikely(wg_birthdate_has_expired(keypair->sending.birthdate, + REJECT_AFTER_TIME - KEEPALIVE_TIMEOUT - REKEY_TIMEOUT))) + send = true; + rcu_read_unlock_bh(); + + if (send) { + peer->sent_lastminute_handshake = true; + wg_packet_send_queued_handshake_initiation(peer, false); + } +} + +static bool decrypt_packet(struct sk_buff *skb, struct noise_symmetric_key *key, + simd_context_t *simd_context) +{ + struct scatterlist sg[MAX_SKB_FRAGS + 8]; + struct sk_buff *trailer; + unsigned int offset; + int num_frags; + + if (unlikely(!key)) + return false; + + if (unlikely(!key->is_valid || + wg_birthdate_has_expired(key->birthdate, REJECT_AFTER_TIME) || + key->counter.receive.counter >= REJECT_AFTER_MESSAGES)) { + key->is_valid = false; + return false; + } + + PACKET_CB(skb)->nonce = + le64_to_cpu(((struct message_data *)skb->data)->counter); + + /* We ensure that the network header is part of the packet before we + * call skb_cow_data, so that there's no chance that data is removed + * from the skb, so that later we can extract the original endpoint. + */ + offset = skb->data - skb_network_header(skb); + skb_push(skb, offset); + num_frags = skb_cow_data(skb, 0, &trailer); + offset += sizeof(struct message_data); + skb_pull(skb, offset); + if (unlikely(num_frags < 0 || num_frags > ARRAY_SIZE(sg))) + return false; + + sg_init_table(sg, num_frags); + if (skb_to_sgvec(skb, sg, 0, skb->len) <= 0) + return false; + + if (!chacha20poly1305_decrypt_sg(sg, sg, skb->len, NULL, 0, + PACKET_CB(skb)->nonce, key->key, + simd_context)) + return false; + + /* Another ugly situation of pushing and pulling the header so as to + * keep endpoint information intact. 
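+ * The offsets must survive trimming of the trailing authentication + * tag, because wg_packet_rx_poll later recovers the original endpoint + * from the network header via wg_socket_endpoint_from_skb.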
+ */ + skb_push(skb, offset); + if (pskb_trim(skb, skb->len - noise_encrypted_len(0))) + return false; + skb_pull(skb, offset); + + return true; +} + +/* This is RFC6479, a replay detection bitmap algorithm that avoids bitshifts */ +static bool counter_validate(union noise_counter *counter, u64 their_counter) +{ + unsigned long index, index_current, top, i; + bool ret = false; + + spin_lock_bh(&counter->receive.lock); + + if (unlikely(counter->receive.counter >= REJECT_AFTER_MESSAGES + 1 || + their_counter >= REJECT_AFTER_MESSAGES)) + goto out; + + ++their_counter; + + if (unlikely((COUNTER_WINDOW_SIZE + their_counter) < + counter->receive.counter)) + goto out; + + index = their_counter >> ilog2(BITS_PER_LONG); + + if (likely(their_counter > counter->receive.counter)) { + index_current = counter->receive.counter >> ilog2(BITS_PER_LONG); + top = min_t(unsigned long, index - index_current, + COUNTER_BITS_TOTAL / BITS_PER_LONG); + for (i = 1; i <= top; ++i) + counter->receive.backtrack[(i + index_current) & + ((COUNTER_BITS_TOTAL / BITS_PER_LONG) - 1)] = 0; + counter->receive.counter = their_counter; + } + + index &= (COUNTER_BITS_TOTAL / BITS_PER_LONG) - 1; + ret = !test_and_set_bit(their_counter & (BITS_PER_LONG - 1), + &counter->receive.backtrack[index]); + +out: + spin_unlock_bh(&counter->receive.lock); + return ret; +} +#include "selftest/counter.c" + +static void packet_consume_data_done(struct wireguard_peer *peer, + struct sk_buff *skb, + struct endpoint *endpoint) +{ + struct net_device *dev = peer->device->dev; + struct wireguard_peer *routed_peer; + unsigned int len, len_before_trim; + + wg_socket_set_peer_endpoint(peer, endpoint); + + if (unlikely(wg_noise_received_with_keypair(&peer->keypairs, + PACKET_CB(skb)->keypair))) { + wg_timers_handshake_complete(peer); + wg_packet_send_staged_packets(peer); + } + + keep_key_fresh(peer); + + wg_timers_any_authenticated_packet_received(peer); + wg_timers_any_authenticated_packet_traversal(peer); + + /* A packet with length 0 is a keepalive packet */ + if (unlikely(!skb->len)) { + rx_stats(peer, message_data_len(0)); + net_dbg_ratelimited("%s: Receiving keepalive packet from peer %llu (%pISpfsc)\n", + dev->name, peer->internal_id, + &peer->endpoint.addr); + goto packet_processed; + } + + wg_timers_data_received(peer); + + if (unlikely(skb_network_header(skb) < skb->head)) + goto dishonest_packet_size; + if (unlikely(!(pskb_network_may_pull(skb, sizeof(struct iphdr)) && + (ip_hdr(skb)->version == 4 || + (ip_hdr(skb)->version == 6 && + pskb_network_may_pull(skb, sizeof(struct ipv6hdr))))))) + goto dishonest_packet_type; + + skb->dev = dev; + skb->ip_summed = CHECKSUM_UNNECESSARY; + skb->protocol = wg_skb_examine_untrusted_ip_hdr(skb); + if (skb->protocol == htons(ETH_P_IP)) { + len = ntohs(ip_hdr(skb)->tot_len); + if (unlikely(len < sizeof(struct iphdr))) + goto dishonest_packet_size; + if (INET_ECN_is_ce(PACKET_CB(skb)->ds)) + IP_ECN_set_ce(ip_hdr(skb)); + } else if (skb->protocol == htons(ETH_P_IPV6)) { + len = ntohs(ipv6_hdr(skb)->payload_len) + + sizeof(struct ipv6hdr); + if (INET_ECN_is_ce(PACKET_CB(skb)->ds)) + IP6_ECN_set_ce(skb, ipv6_hdr(skb)); + } else + goto dishonest_packet_type; + + if (unlikely(len > skb->len)) + goto dishonest_packet_size; + len_before_trim = skb->len; + if (unlikely(pskb_trim(skb, len))) + goto packet_processed; + + routed_peer = wg_allowedips_lookup_src(&peer->device->peer_allowedips, + skb); + wg_peer_put(routed_peer); /* We don't need the extra reference. 
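+ * Dropping it before the comparison is safe: we still hold our own + * reference on peer, and routed_peer is only compared by address, + * never dereferenced.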
*/ + + if (unlikely(routed_peer != peer)) + goto dishonest_packet_peer; + + if (unlikely(napi_gro_receive(&peer->napi, skb) == GRO_DROP)) { + ++dev->stats.rx_dropped; + net_dbg_ratelimited("%s: Failed to give packet to userspace from peer %llu (%pISpfsc)\n", + dev->name, peer->internal_id, + &peer->endpoint.addr); + } else + rx_stats(peer, message_data_len(len_before_trim)); + return; + +dishonest_packet_peer: + net_dbg_skb_ratelimited("%s: Packet has unallowed src IP (%pISc) from peer %llu (%pISpfsc)\n", + dev->name, skb, peer->internal_id, + &peer->endpoint.addr); + ++dev->stats.rx_errors; + ++dev->stats.rx_frame_errors; + goto packet_processed; +dishonest_packet_type: + net_dbg_ratelimited("%s: Packet is neither ipv4 nor ipv6 from peer %llu (%pISpfsc)\n", + dev->name, peer->internal_id, &peer->endpoint.addr); + ++dev->stats.rx_errors; + ++dev->stats.rx_frame_errors; + goto packet_processed; +dishonest_packet_size: + net_dbg_ratelimited("%s: Packet has incorrect size from peer %llu (%pISpfsc)\n", + dev->name, peer->internal_id, &peer->endpoint.addr); + ++dev->stats.rx_errors; + ++dev->stats.rx_length_errors; + goto packet_processed; +packet_processed: + dev_kfree_skb(skb); +} + +int wg_packet_rx_poll(struct napi_struct *napi, int budget) +{ + struct wireguard_peer *peer = + container_of(napi, struct wireguard_peer, napi); + struct crypt_queue *queue = &peer->rx_queue; + struct noise_keypair *keypair; + struct endpoint endpoint; + enum packet_state state; + struct sk_buff *skb; + int work_done = 0; + bool free; + + if (unlikely(budget <= 0)) + return 0; + + while ((skb = __ptr_ring_peek(&queue->ring)) != NULL && + (state = atomic_read_acquire(&PACKET_CB(skb)->state)) != + PACKET_STATE_UNCRYPTED) { + __ptr_ring_discard_one(&queue->ring); + peer = PACKET_PEER(skb); + keypair = PACKET_CB(skb)->keypair; + free = true; + + if (unlikely(state != PACKET_STATE_CRYPTED)) + goto next; + + if (unlikely(!counter_validate(&keypair->receiving.counter, + PACKET_CB(skb)->nonce))) { + net_dbg_ratelimited("%s: Packet has invalid nonce %llu (max %llu)\n", + peer->device->dev->name, + PACKET_CB(skb)->nonce, + keypair->receiving.counter.receive.counter); + goto next; + } + + if (unlikely(wg_socket_endpoint_from_skb(&endpoint, skb))) + goto next; + + wg_reset_packet(skb); + packet_consume_data_done(peer, skb, &endpoint); + free = false; + + next: + wg_noise_keypair_put(keypair, false); + wg_peer_put(peer); + if (unlikely(free)) + dev_kfree_skb(skb); + + if (++work_done >= budget) + break; + } + + if (work_done < budget) + napi_complete_done(napi, work_done); + + return work_done; +} + +void wg_packet_decrypt_worker(struct work_struct *work) +{ + struct crypt_queue *queue = + container_of(work, struct multicore_worker, work)->ptr; + simd_context_t simd_context; + struct sk_buff *skb; + + simd_get(&simd_context); + while ((skb = ptr_ring_consume_bh(&queue->ring)) != NULL) { + enum packet_state state = likely(decrypt_packet(skb, + &PACKET_CB(skb)->keypair->receiving, + &simd_context)) ? 
+ PACKET_STATE_CRYPTED : PACKET_STATE_DEAD; + wg_queue_enqueue_per_peer_napi(&PACKET_PEER(skb)->rx_queue, skb, + state); + simd_relax(&simd_context); + } + + simd_put(&simd_context); +} + +static void wg_packet_consume_data(struct wireguard_device *wg, + struct sk_buff *skb) +{ + __le32 idx = ((struct message_data *)skb->data)->key_idx; + struct wireguard_peer *peer = NULL; + int ret; + + rcu_read_lock_bh(); + PACKET_CB(skb)->keypair = + (struct noise_keypair *)wg_index_hashtable_lookup( + &wg->index_hashtable, INDEX_HASHTABLE_KEYPAIR, idx, + &peer); + if (unlikely(!wg_noise_keypair_get(PACKET_CB(skb)->keypair))) + goto err_keypair; + + if (unlikely(peer->is_dead)) + goto err; + + ret = wg_queue_enqueue_per_device_and_peer(&wg->decrypt_queue, + &peer->rx_queue, skb, + wg->packet_crypt_wq, + &wg->decrypt_queue.last_cpu); + if (unlikely(ret == -EPIPE)) + wg_queue_enqueue_per_peer(&peer->rx_queue, skb, PACKET_STATE_DEAD); + if (likely(!ret || ret == -EPIPE)) { + rcu_read_unlock_bh(); + return; + } +err: + wg_noise_keypair_put(PACKET_CB(skb)->keypair, false); +err_keypair: + rcu_read_unlock_bh(); + wg_peer_put(peer); + dev_kfree_skb(skb); +} + +void wg_packet_receive(struct wireguard_device *wg, struct sk_buff *skb) +{ + if (unlikely(skb_prepare_header(skb, wg) < 0)) + goto err; + switch (SKB_TYPE_LE32(skb)) { + case cpu_to_le32(MESSAGE_HANDSHAKE_INITIATION): + case cpu_to_le32(MESSAGE_HANDSHAKE_RESPONSE): + case cpu_to_le32(MESSAGE_HANDSHAKE_COOKIE): { + int cpu; + + if (skb_queue_len(&wg->incoming_handshakes) > + MAX_QUEUED_INCOMING_HANDSHAKES || + unlikely(!rng_is_initialized())) { + net_dbg_skb_ratelimited("%s: Dropping handshake packet from %pISpfsc\n", + wg->dev->name, skb); + goto err; + } + skb_queue_tail(&wg->incoming_handshakes, skb); + /* Queues up a call to wg_packet_handshake_receive_worker(). */ + cpu = wg_cpumask_next_online(&wg->incoming_handshake_cpu); + queue_work_on(cpu, wg->handshake_receive_wq, + &per_cpu_ptr(wg->incoming_handshakes_worker, cpu)->work); + break; + } + case cpu_to_le32(MESSAGE_DATA): + PACKET_CB(skb)->ds = ip_tunnel_get_dsfield(ip_hdr(skb), skb); + wg_packet_consume_data(wg, skb); + break; + default: + net_dbg_skb_ratelimited("%s: Invalid packet from %pISpfsc\n", + wg->dev->name, skb); + goto err; + } + return; + +err: + dev_kfree_skb(skb); +} diff --git a/drivers/net/wireguard/selftest/allowedips.c b/drivers/net/wireguard/selftest/allowedips.c new file mode 100644 index 000000000000..a9d2c3ad857b --- /dev/null +++ b/drivers/net/wireguard/selftest/allowedips.c @@ -0,0 +1,679 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (C) 2015-2018 Jason A. Donenfeld . All Rights Reserved.
+ */ + +#ifdef DEBUG + +#include + +static __init void swap_endian_and_apply_cidr(u8 *dst, const u8 *src, u8 bits, + u8 cidr) +{ + swap_endian(dst, src, bits); + memset(dst + (cidr + 7) / 8, 0, bits / 8 - (cidr + 7) / 8); + if (cidr) + dst[(cidr + 7) / 8 - 1] &= ~0U << ((8 - (cidr % 8)) % 8); +} + +static __init void print_node(struct allowedips_node *node, u8 bits) +{ + char *fmt_connection = KERN_DEBUG "\t\"%p/%d\" -> \"%p/%d\";\n"; + char *fmt_declaration = KERN_DEBUG + "\t\"%p/%d\"[style=%s, color=\"#%06x\"];\n"; + char *style = "dotted"; + u8 ip1[16], ip2[16]; + u32 color = 0; + + if (bits == 32) { + fmt_connection = KERN_DEBUG "\t\"%pI4/%d\" -> \"%pI4/%d\";\n"; + fmt_declaration = KERN_DEBUG + "\t\"%pI4/%d\"[style=%s, color=\"#%06x\"];\n"; + } else if (bits == 128) { + fmt_connection = KERN_DEBUG "\t\"%pI6/%d\" -> \"%pI6/%d\";\n"; + fmt_declaration = KERN_DEBUG + "\t\"%pI6/%d\"[style=%s, color=\"#%06x\"];\n"; + } + if (node->peer) { + hsiphash_key_t key = { 0 }; + memcpy(&key, &node->peer, sizeof(node->peer)); + color = hsiphash_1u32(0xdeadbeef, &key) % 200 << 16 | + hsiphash_1u32(0xbabecafe, &key) % 200 << 8 | + hsiphash_1u32(0xabad1dea, &key) % 200; + style = "bold"; + } + swap_endian_and_apply_cidr(ip1, node->bits, bits, node->cidr); + printk(fmt_declaration, ip1, node->cidr, style, color); + if (node->bit[0]) { + swap_endian_and_apply_cidr(ip2, node->bit[0]->bits, bits, + node->cidr); + printk(fmt_connection, ip1, node->cidr, ip2, + node->bit[0]->cidr); + print_node(node->bit[0], bits); + } + if (node->bit[1]) { + swap_endian_and_apply_cidr(ip2, node->bit[1]->bits, bits, + node->cidr); + printk(fmt_connection, ip1, node->cidr, ip2, + node->bit[1]->cidr); + print_node(node->bit[1], bits); + } +} +static __init void print_tree(struct allowedips_node *top, u8 bits) +{ + printk(KERN_DEBUG "digraph trie {\n"); + print_node(top, bits); + printk(KERN_DEBUG "}\n"); +} + +enum { + NUM_PEERS = 2000, + NUM_RAND_ROUTES = 400, + NUM_MUTATED_ROUTES = 100, + NUM_QUERIES = NUM_RAND_ROUTES * NUM_MUTATED_ROUTES * 30 +}; + +struct horrible_allowedips { + struct hlist_head head; +}; + +struct horrible_allowedips_node { + struct hlist_node table; + union nf_inet_addr ip; + union nf_inet_addr mask; + uint8_t ip_version; + void *value; +}; + +static __init void horrible_allowedips_init(struct horrible_allowedips *table) +{ + INIT_HLIST_HEAD(&table->head); +} + +static __init void horrible_allowedips_free(struct horrible_allowedips *table) +{ + struct horrible_allowedips_node *node; + struct hlist_node *h; + + hlist_for_each_entry_safe (node, h, &table->head, table) { + hlist_del(&node->table); + kfree(node); + } +} + +static __init inline union nf_inet_addr horrible_cidr_to_mask(uint8_t cidr) +{ + union nf_inet_addr mask; + + memset(&mask, 0x00, 128 / 8); + memset(&mask, 0xff, cidr / 8); + if (cidr % 32) + mask.all[cidr / 32] = htonl( + (0xFFFFFFFFUL << (32 - (cidr % 32))) & 0xFFFFFFFFUL); + return mask; +} + +static __init inline uint8_t horrible_mask_to_cidr(union nf_inet_addr subnet) +{ + return hweight32(subnet.all[0]) + hweight32(subnet.all[1]) + + hweight32(subnet.all[2]) + hweight32(subnet.all[3]); +} + +static __init inline void +horrible_mask_self(struct horrible_allowedips_node *node) +{ + if (node->ip_version == 4) + node->ip.ip &= node->mask.ip; + else if (node->ip_version == 6) { + node->ip.ip6[0] &= node->mask.ip6[0]; + node->ip.ip6[1] &= node->mask.ip6[1]; + node->ip.ip6[2] &= node->mask.ip6[2]; + node->ip.ip6[3] &= node->mask.ip6[3]; + } +} + +static __init inline bool 
+horrible_match_v4(const struct horrible_allowedips_node *node, + struct in_addr *ip) +{ + return (ip->s_addr & node->mask.ip) == node->ip.ip; +} + +static __init inline bool +horrible_match_v6(const struct horrible_allowedips_node *node, + struct in6_addr *ip) +{ + return (ip->in6_u.u6_addr32[0] & node->mask.ip6[0]) == + node->ip.ip6[0] && + (ip->in6_u.u6_addr32[1] & node->mask.ip6[1]) == + node->ip.ip6[1] && + (ip->in6_u.u6_addr32[2] & node->mask.ip6[2]) == + node->ip.ip6[2] && + (ip->in6_u.u6_addr32[3] & node->mask.ip6[3]) == node->ip.ip6[3]; +} + +static __init void +horrible_insert_ordered(struct horrible_allowedips *table, + struct horrible_allowedips_node *node) +{ + struct horrible_allowedips_node *other = NULL, *where = NULL; + uint8_t my_cidr = horrible_mask_to_cidr(node->mask); + + hlist_for_each_entry (other, &table->head, table) { + if (!memcmp(&other->mask, &node->mask, + sizeof(union nf_inet_addr)) && + !memcmp(&other->ip, &node->ip, + sizeof(union nf_inet_addr)) && + other->ip_version == node->ip_version) { + other->value = node->value; + kfree(node); + return; + } + where = other; + if (horrible_mask_to_cidr(other->mask) <= my_cidr) + break; + } + if (!other && !where) + hlist_add_head(&node->table, &table->head); + else if (!other) + hlist_add_behind(&node->table, &where->table); + else + hlist_add_before(&node->table, &where->table); +} + +static __init int +horrible_allowedips_insert_v4(struct horrible_allowedips *table, + struct in_addr *ip, uint8_t cidr, void *value) +{ + struct horrible_allowedips_node *node = kzalloc(sizeof(*node), + GFP_KERNEL); + + if (unlikely(!node)) + return -ENOMEM; + node->ip.in = *ip; + node->mask = horrible_cidr_to_mask(cidr); + node->ip_version = 4; + node->value = value; + horrible_mask_self(node); + horrible_insert_ordered(table, node); + return 0; +} + +static __init int +horrible_allowedips_insert_v6(struct horrible_allowedips *table, + struct in6_addr *ip, uint8_t cidr, void *value) +{ + struct horrible_allowedips_node *node = kzalloc(sizeof(*node), + GFP_KERNEL); + + if (unlikely(!node)) + return -ENOMEM; + node->ip.in6 = *ip; + node->mask = horrible_cidr_to_mask(cidr); + node->ip_version = 6; + node->value = value; + horrible_mask_self(node); + horrible_insert_ordered(table, node); + return 0; +} + +static __init void * +horrible_allowedips_lookup_v4(struct horrible_allowedips *table, + struct in_addr *ip) +{ + struct horrible_allowedips_node *node; + void *ret = NULL; + + hlist_for_each_entry (node, &table->head, table) { + if (node->ip_version != 4) + continue; + if (horrible_match_v4(node, ip)) { + ret = node->value; + break; + } + } + return ret; +} + +static __init void * +horrible_allowedips_lookup_v6(struct horrible_allowedips *table, + struct in6_addr *ip) +{ + struct horrible_allowedips_node *node; + void *ret = NULL; + + hlist_for_each_entry (node, &table->head, table) { + if (node->ip_version != 6) + continue; + if (horrible_match_v6(node, ip)) { + ret = node->value; + break; + } + } + return ret; +} + +static __init bool randomized_test(void) +{ + unsigned int i, j, k, mutate_amount, cidr; + u8 ip[16], mutate_mask[16], mutated[16]; + struct wireguard_peer **peers, *peer; + struct horrible_allowedips h; + DEFINE_MUTEX(mutex); + struct allowedips t; + bool ret = false; + + mutex_init(&mutex); + + wg_allowedips_init(&t); + horrible_allowedips_init(&h); + + peers = kcalloc(NUM_PEERS, sizeof(*peers), GFP_KERNEL); + if (unlikely(!peers)) { + pr_info("allowedips random self-test: out of memory\n"); + goto free; + } + for (i = 
0; i < NUM_PEERS; ++i) {
+        peers[i] = kzalloc(sizeof(*peers[i]), GFP_KERNEL);
+        if (unlikely(!peers[i])) {
+            pr_info("allowedips random self-test: out of memory\n");
+            goto free;
+        }
+        kref_init(&peers[i]->refcount);
+    }
+
+    mutex_lock(&mutex);
+
+    for (i = 0; i < NUM_RAND_ROUTES; ++i) {
+        prandom_bytes(ip, 4);
+        cidr = prandom_u32_max(32) + 1;
+        peer = peers[prandom_u32_max(NUM_PEERS)];
+        if (wg_allowedips_insert_v4(&t, (struct in_addr *)ip, cidr,
+                                    peer, &mutex) < 0) {
+            pr_info("allowedips random self-test: out of memory\n");
+            goto free;
+        }
+        if (horrible_allowedips_insert_v4(&h, (struct in_addr *)ip,
+                                          cidr, peer) < 0) {
+            pr_info("allowedips random self-test: out of memory\n");
+            goto free;
+        }
+        for (j = 0; j < NUM_MUTATED_ROUTES; ++j) {
+            memcpy(mutated, ip, 4);
+            prandom_bytes(mutate_mask, 4);
+            mutate_amount = prandom_u32_max(32);
+            for (k = 0; k < mutate_amount / 8; ++k)
+                mutate_mask[k] = 0xff;
+            mutate_mask[k] = 0xff
+                             << ((8 - (mutate_amount % 8)) % 8);
+            for (; k < 4; ++k)
+                mutate_mask[k] = 0;
+            for (k = 0; k < 4; ++k)
+                mutated[k] = (mutated[k] & mutate_mask[k]) |
+                             (~mutate_mask[k] &
+                              prandom_u32_max(256));
+            cidr = prandom_u32_max(32) + 1;
+            peer = peers[prandom_u32_max(NUM_PEERS)];
+            if (wg_allowedips_insert_v4(&t,
+                                        (struct in_addr *)mutated,
+                                        cidr, peer, &mutex) < 0) {
+                pr_info("allowedips random self-test: out of memory\n");
+                goto free;
+            }
+            if (horrible_allowedips_insert_v4(&h,
+                (struct in_addr *)mutated, cidr, peer)) {
+                pr_info("allowedips random self-test: out of memory\n");
+                goto free;
+            }
+        }
+    }
+
+    for (i = 0; i < NUM_RAND_ROUTES; ++i) {
+        prandom_bytes(ip, 16);
+        cidr = prandom_u32_max(128) + 1;
+        peer = peers[prandom_u32_max(NUM_PEERS)];
+        if (wg_allowedips_insert_v6(&t, (struct in6_addr *)ip, cidr,
+                                    peer, &mutex) < 0) {
+            pr_info("allowedips random self-test: out of memory\n");
+            goto free;
+        }
+        if (horrible_allowedips_insert_v6(&h, (struct in6_addr *)ip,
+                                          cidr, peer) < 0) {
+            pr_info("allowedips random self-test: out of memory\n");
+            goto free;
+        }
+        for (j = 0; j < NUM_MUTATED_ROUTES; ++j) {
+            memcpy(mutated, ip, 16);
+            prandom_bytes(mutate_mask, 16);
+            mutate_amount = prandom_u32_max(128);
+            for (k = 0; k < mutate_amount / 8; ++k)
+                mutate_mask[k] = 0xff;
+            mutate_mask[k] = 0xff
+                             << ((8 - (mutate_amount % 8)) % 8);
+            for (; k < 16; ++k)
+                mutate_mask[k] = 0;
+            for (k = 0; k < 16; ++k)
+                mutated[k] = (mutated[k] & mutate_mask[k]) |
+                             (~mutate_mask[k] &
+                              prandom_u32_max(256));
+            cidr = prandom_u32_max(128) + 1;
+            peer = peers[prandom_u32_max(NUM_PEERS)];
+            if (wg_allowedips_insert_v6(&t,
+                                        (struct in6_addr *)mutated,
+                                        cidr, peer, &mutex) < 0) {
+                pr_info("allowedips random self-test: out of memory\n");
+                goto free;
+            }
+            if (horrible_allowedips_insert_v6(
+                    &h, (struct in6_addr *)mutated, cidr,
+                    peer)) {
+                pr_info("allowedips random self-test: out of memory\n");
+                goto free;
+            }
+        }
+    }
+
+    mutex_unlock(&mutex);
+
+    if (IS_ENABLED(DEBUG_PRINT_TRIE_GRAPHVIZ)) {
+        print_tree(t.root4, 32);
+        print_tree(t.root6, 128);
+    }
+
+    for (i = 0; i < NUM_QUERIES; ++i) {
+        prandom_bytes(ip, 4);
+        if (lookup(t.root4, 32, ip) !=
+            horrible_allowedips_lookup_v4(&h, (struct in_addr *)ip)) {
+            pr_info("allowedips random self-test: FAIL\n");
+            goto free;
+        }
+    }
+
+    for (i = 0; i < NUM_QUERIES; ++i) {
+        prandom_bytes(ip, 16);
+        if (lookup(t.root6, 128, ip) !=
+            horrible_allowedips_lookup_v6(&h, (struct in6_addr *)ip)) {
+            pr_info("allowedips random self-test: FAIL\n");
+            goto free;
+        }
+    }
+    ret = true;
+
+free:
+    mutex_lock(&mutex);
+    wg_allowedips_free(&t, 
&mutex); + mutex_unlock(&mutex); + horrible_allowedips_free(&h); + if (peers) { + for (i = 0; i < NUM_PEERS; ++i) + kfree(peers[i]); + } + kfree(peers); + return ret; +} + +static __init inline struct in_addr *ip4(u8 a, u8 b, u8 c, u8 d) +{ + static struct in_addr ip; + u8 *split = (u8 *)&ip; + split[0] = a; + split[1] = b; + split[2] = c; + split[3] = d; + return &ip; +} + +static __init inline struct in6_addr *ip6(u32 a, u32 b, u32 c, u32 d) +{ + static struct in6_addr ip; + __be32 *split = (__be32 *)&ip; + split[0] = cpu_to_be32(a); + split[1] = cpu_to_be32(b); + split[2] = cpu_to_be32(c); + split[3] = cpu_to_be32(d); + return &ip; +} + +struct walk_ctx { + int count; + bool found_a, found_b, found_c, found_d, found_e; + bool found_other; +}; + +static __init int walk_callback(void *ctx, const u8 *ip, u8 cidr, int family) +{ + struct walk_ctx *wctx = ctx; + + wctx->count++; + + if (cidr == 27 && + !memcmp(ip, ip4(192, 95, 5, 64), sizeof(struct in_addr))) + wctx->found_a = true; + else if (cidr == 128 && + !memcmp(ip, ip6(0x26075300, 0x60006b00, 0, 0xc05f0543), + sizeof(struct in6_addr))) + wctx->found_b = true; + else if (cidr == 29 && + !memcmp(ip, ip4(10, 1, 0, 16), sizeof(struct in_addr))) + wctx->found_c = true; + else if (cidr == 83 && + !memcmp(ip, ip6(0x26075300, 0x6d8a6bf8, 0xdab1e000, 0), + sizeof(struct in6_addr))) + wctx->found_d = true; + else if (cidr == 21 && + !memcmp(ip, ip6(0x26075000, 0, 0, 0), sizeof(struct in6_addr))) + wctx->found_e = true; + else + wctx->found_other = true; + + return 0; +} + +#define init_peer(name) do { \ + name = kzalloc(sizeof(*name), GFP_KERNEL); \ + if (unlikely(!name)) { \ + pr_info("allowedips self-test: out of memory\n"); \ + goto free; \ + } \ + kref_init(&name->refcount); \ + } while (0) + +#define insert(version, mem, ipa, ipb, ipc, ipd, cidr) \ + wg_allowedips_insert_v##version(&t, ip##version(ipa, ipb, ipc, ipd), \ + cidr, mem, &mutex) + +#define maybe_fail() do { \ + ++i; \ + if (!_s) { \ + pr_info("allowedips self-test %zu: FAIL\n", i); \ + success = false; \ + } \ + } while (0) + +#define test(version, mem, ipa, ipb, ipc, ipd) do { \ + bool _s = lookup(t.root##version, version == 4 ? 32 : 128, \ + ip##version(ipa, ipb, ipc, ipd)) == mem; \ + maybe_fail(); \ + } while (0) + +#define test_negative(version, mem, ipa, ipb, ipc, ipd) do { \ + bool _s = lookup(t.root##version, version == 4 ? 
32 : 128, \ + ip##version(ipa, ipb, ipc, ipd)) != mem; \ + maybe_fail(); \ + } while (0) + +#define test_boolean(cond) do { \ + bool _s = (cond); \ + maybe_fail(); \ + } while (0) + +bool __init wg_allowedips_selftest(void) +{ + struct wireguard_peer *a = NULL, *b = NULL, *c = NULL, *d = NULL, + *e = NULL, *f = NULL, *g = NULL, *h = NULL; + struct allowedips_cursor *cursor; + struct walk_ctx wctx = { 0 }; + bool success = false; + struct allowedips t; + DEFINE_MUTEX(mutex); + struct in6_addr ip; + size_t i = 0; + __be64 part; + + cursor = kzalloc(sizeof(*cursor), GFP_KERNEL); + if (!cursor) { + pr_info("allowedips self-test malloc: FAIL\n"); + return false; + } + + mutex_init(&mutex); + mutex_lock(&mutex); + + wg_allowedips_init(&t); + init_peer(a); + init_peer(b); + init_peer(c); + init_peer(d); + init_peer(e); + init_peer(f); + init_peer(g); + init_peer(h); + + insert(4, a, 192, 168, 4, 0, 24); + insert(4, b, 192, 168, 4, 4, 32); + insert(4, c, 192, 168, 0, 0, 16); + insert(4, d, 192, 95, 5, 64, 27); + /* replaces previous entry, and maskself is required */ + insert(4, c, 192, 95, 5, 65, 27); + insert(6, d, 0x26075300, 0x60006b00, 0, 0xc05f0543, 128); + insert(6, c, 0x26075300, 0x60006b00, 0, 0, 64); + insert(4, e, 0, 0, 0, 0, 0); + insert(6, e, 0, 0, 0, 0, 0); + /* replaces previous entry */ + insert(6, f, 0, 0, 0, 0, 0); + insert(6, g, 0x24046800, 0, 0, 0, 32); + /* maskself is required */ + insert(6, h, 0x24046800, 0x40040800, 0xdeadbeef, 0xdeadbeef, 64); + insert(6, a, 0x24046800, 0x40040800, 0xdeadbeef, 0xdeadbeef, 128); + insert(6, c, 0x24446800, 0x40e40800, 0xdeaebeef, 0xdefbeef, 128); + insert(6, b, 0x24446800, 0xf0e40800, 0xeeaebeef, 0, 98); + insert(4, g, 64, 15, 112, 0, 20); + /* maskself is required */ + insert(4, h, 64, 15, 123, 211, 25); + insert(4, a, 10, 0, 0, 0, 25); + insert(4, b, 10, 0, 0, 128, 25); + insert(4, a, 10, 1, 0, 0, 30); + insert(4, b, 10, 1, 0, 4, 30); + insert(4, c, 10, 1, 0, 8, 29); + insert(4, d, 10, 1, 0, 16, 29); + + if (IS_ENABLED(DEBUG_PRINT_TRIE_GRAPHVIZ)) { + print_tree(t.root4, 32); + print_tree(t.root6, 128); + } + + success = true; + + test(4, a, 192, 168, 4, 20); + test(4, a, 192, 168, 4, 0); + test(4, b, 192, 168, 4, 4); + test(4, c, 192, 168, 200, 182); + test(4, c, 192, 95, 5, 68); + test(4, e, 192, 95, 5, 96); + test(6, d, 0x26075300, 0x60006b00, 0, 0xc05f0543); + test(6, c, 0x26075300, 0x60006b00, 0, 0xc02e01ee); + test(6, f, 0x26075300, 0x60006b01, 0, 0); + test(6, g, 0x24046800, 0x40040806, 0, 0x1006); + test(6, g, 0x24046800, 0x40040806, 0x1234, 0x5678); + test(6, f, 0x240467ff, 0x40040806, 0x1234, 0x5678); + test(6, f, 0x24046801, 0x40040806, 0x1234, 0x5678); + test(6, h, 0x24046800, 0x40040800, 0x1234, 0x5678); + test(6, h, 0x24046800, 0x40040800, 0, 0); + test(6, h, 0x24046800, 0x40040800, 0x10101010, 0x10101010); + test(6, a, 0x24046800, 0x40040800, 0xdeadbeef, 0xdeadbeef); + test(4, g, 64, 15, 116, 26); + test(4, g, 64, 15, 127, 3); + test(4, g, 64, 15, 123, 1); + test(4, h, 64, 15, 123, 128); + test(4, h, 64, 15, 123, 129); + test(4, a, 10, 0, 0, 52); + test(4, b, 10, 0, 0, 220); + test(4, a, 10, 1, 0, 2); + test(4, b, 10, 1, 0, 6); + test(4, c, 10, 1, 0, 10); + test(4, d, 10, 1, 0, 20); + + insert(4, a, 1, 0, 0, 0, 32); + insert(4, a, 64, 0, 0, 0, 32); + insert(4, a, 128, 0, 0, 0, 32); + insert(4, a, 192, 0, 0, 0, 32); + insert(4, a, 255, 0, 0, 0, 32); + wg_allowedips_remove_by_peer(&t, a, &mutex); + test_negative(4, a, 1, 0, 0, 0); + test_negative(4, a, 64, 0, 0, 0); + test_negative(4, a, 128, 0, 0, 0); + test_negative(4, a, 
192, 0, 0, 0); + test_negative(4, a, 255, 0, 0, 0); + + wg_allowedips_free(&t, &mutex); + wg_allowedips_init(&t); + insert(4, a, 192, 168, 0, 0, 16); + insert(4, a, 192, 168, 0, 0, 24); + wg_allowedips_remove_by_peer(&t, a, &mutex); + test_negative(4, a, 192, 168, 0, 1); + + /* These will hit the WARN_ON(len >= 128) in free_node if something + * goes wrong. + */ + for (i = 0; i < 128; ++i) { + part = cpu_to_be64(~(1LLU << (i % 64))); + memset(&ip, 0xff, 16); + memcpy((u8 *)&ip + (i < 64) * 8, &part, 8); + wg_allowedips_insert_v6(&t, &ip, 128, a, &mutex); + } + + wg_allowedips_free(&t, &mutex); + + wg_allowedips_init(&t); + insert(4, a, 192, 95, 5, 93, 27); + insert(6, a, 0x26075300, 0x60006b00, 0, 0xc05f0543, 128); + insert(4, a, 10, 1, 0, 20, 29); + insert(6, a, 0x26075300, 0x6d8a6bf8, 0xdab1f1df, 0xc05f1523, 83); + insert(6, a, 0x26075300, 0x6d8a6bf8, 0xdab1f1df, 0xc05f1523, 21); + wg_allowedips_walk_by_peer(&t, cursor, a, walk_callback, &wctx, &mutex); + test_boolean(wctx.count == 5); + test_boolean(wctx.found_a); + test_boolean(wctx.found_b); + test_boolean(wctx.found_c); + test_boolean(wctx.found_d); + test_boolean(wctx.found_e); + test_boolean(!wctx.found_other); + + if (IS_ENABLED(DEBUG_RANDOM_TRIE) && success) + success = randomized_test(); + + if (success) + pr_info("allowedips self-tests: pass\n"); + +free: + wg_allowedips_free(&t, &mutex); + kfree(a); + kfree(b); + kfree(c); + kfree(d); + kfree(e); + kfree(f); + kfree(g); + kfree(h); + mutex_unlock(&mutex); + kfree(cursor); + + return success; +} +#undef test_negative +#undef test +#undef remove +#undef insert +#undef init_peer + +#endif diff --git a/drivers/net/wireguard/selftest/counter.c b/drivers/net/wireguard/selftest/counter.c new file mode 100644 index 000000000000..5cb41d1db6ec --- /dev/null +++ b/drivers/net/wireguard/selftest/counter.c @@ -0,0 +1,103 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (C) 2015-2018 Jason A. Donenfeld . All Rights Reserved. 
+ */ + +#ifdef DEBUG +bool __init wg_packet_counter_selftest(void) +{ + unsigned int test_num = 0, i; + union noise_counter counter; + bool success = true; + +#define T_INIT do { \ + memset(&counter, 0, sizeof(union noise_counter)); \ + spin_lock_init(&counter.receive.lock); \ + } while (0) +#define T_LIM (COUNTER_WINDOW_SIZE + 1) +#define T(n, v) do { \ + ++test_num; \ + if (counter_validate(&counter, n) != v) { \ + pr_info("nonce counter self-test %u: FAIL\n", \ + test_num); \ + success = false; \ + } \ + } while (0) + + T_INIT; + /* 1 */ T(0, true); + /* 2 */ T(1, true); + /* 3 */ T(1, false); + /* 4 */ T(9, true); + /* 5 */ T(8, true); + /* 6 */ T(7, true); + /* 7 */ T(7, false); + /* 8 */ T(T_LIM, true); + /* 9 */ T(T_LIM - 1, true); + /* 10 */ T(T_LIM - 1, false); + /* 11 */ T(T_LIM - 2, true); + /* 12 */ T(2, true); + /* 13 */ T(2, false); + /* 14 */ T(T_LIM + 16, true); + /* 15 */ T(3, false); + /* 16 */ T(T_LIM + 16, false); + /* 17 */ T(T_LIM * 4, true); + /* 18 */ T(T_LIM * 4 - (T_LIM - 1), true); + /* 19 */ T(10, false); + /* 20 */ T(T_LIM * 4 - T_LIM, false); + /* 21 */ T(T_LIM * 4 - (T_LIM + 1), false); + /* 22 */ T(T_LIM * 4 - (T_LIM - 2), true); + /* 23 */ T(T_LIM * 4 + 1 - T_LIM, false); + /* 24 */ T(0, false); + /* 25 */ T(REJECT_AFTER_MESSAGES, false); + /* 26 */ T(REJECT_AFTER_MESSAGES - 1, true); + /* 27 */ T(REJECT_AFTER_MESSAGES, false); + /* 28 */ T(REJECT_AFTER_MESSAGES - 1, false); + /* 29 */ T(REJECT_AFTER_MESSAGES - 2, true); + /* 30 */ T(REJECT_AFTER_MESSAGES + 1, false); + /* 31 */ T(REJECT_AFTER_MESSAGES + 2, false); + /* 32 */ T(REJECT_AFTER_MESSAGES - 2, false); + /* 33 */ T(REJECT_AFTER_MESSAGES - 3, true); + /* 34 */ T(0, false); + + T_INIT; + for (i = 1; i <= COUNTER_WINDOW_SIZE; ++i) + T(i, true); + T(0, true); + T(0, false); + + T_INIT; + for (i = 2; i <= COUNTER_WINDOW_SIZE + 1; ++i) + T(i, true); + T(1, true); + T(0, false); + + T_INIT; + for (i = COUNTER_WINDOW_SIZE + 1; i-- > 0;) + T(i, true); + + T_INIT; + for (i = COUNTER_WINDOW_SIZE + 2; i-- > 1;) + T(i, true); + T(0, false); + + T_INIT; + for (i = COUNTER_WINDOW_SIZE + 1; i-- > 1;) + T(i, true); + T(COUNTER_WINDOW_SIZE + 1, true); + T(0, false); + + T_INIT; + for (i = COUNTER_WINDOW_SIZE + 1; i-- > 1;) + T(i, true); + T(0, true); + T(COUNTER_WINDOW_SIZE + 1, true); +#undef T +#undef T_LIM +#undef T_INIT + + if (success) + pr_info("nonce counter self-tests: pass\n"); + return success; +} +#endif diff --git a/drivers/net/wireguard/selftest/ratelimiter.c b/drivers/net/wireguard/selftest/ratelimiter.c new file mode 100644 index 000000000000..2ea7489b3700 --- /dev/null +++ b/drivers/net/wireguard/selftest/ratelimiter.c @@ -0,0 +1,178 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (C) 2015-2018 Jason A. Donenfeld . All Rights Reserved. + */ + +#ifdef DEBUG + +#include + +static const struct { + bool result; + unsigned int msec_to_sleep_before; +} expected_results[] __initconst = { + [0 ... 
PACKETS_BURSTABLE - 1] = { true, 0 }, + [PACKETS_BURSTABLE] = { false, 0 }, + [PACKETS_BURSTABLE + 1] = { true, MSEC_PER_SEC / PACKETS_PER_SECOND }, + [PACKETS_BURSTABLE + 2] = { false, 0 }, + [PACKETS_BURSTABLE + 3] = { true, (MSEC_PER_SEC / PACKETS_PER_SECOND) * 2 }, + [PACKETS_BURSTABLE + 4] = { true, 0 }, + [PACKETS_BURSTABLE + 5] = { false, 0 } +}; + +static __init unsigned int maximum_jiffies_at_index(int index) +{ + unsigned int total_msecs = 2 * MSEC_PER_SEC / PACKETS_PER_SECOND / 3; + int i; + + for (i = 0; i <= index; ++i) + total_msecs += expected_results[i].msec_to_sleep_before; + return msecs_to_jiffies(total_msecs); +} + +bool __init wg_ratelimiter_selftest(void) +{ + int i, test = 0, tries = 0, ret = false; + unsigned long loop_start_time; +#if IS_ENABLED(CONFIG_IPV6) + struct sk_buff *skb6; + struct ipv6hdr *hdr6; +#endif + struct sk_buff *skb4; + struct iphdr *hdr4; + + if (IS_ENABLED(CONFIG_KASAN) || IS_ENABLED(CONFIG_UBSAN)) + return true; + + BUILD_BUG_ON(MSEC_PER_SEC % PACKETS_PER_SECOND != 0); + + if (wg_ratelimiter_init()) + goto out; + ++test; + if (wg_ratelimiter_init()) { + wg_ratelimiter_uninit(); + goto out; + } + ++test; + if (wg_ratelimiter_init()) { + wg_ratelimiter_uninit(); + wg_ratelimiter_uninit(); + goto out; + } + ++test; + + skb4 = alloc_skb(sizeof(struct iphdr), GFP_KERNEL); + if (unlikely(!skb4)) + goto err_nofree; + skb4->protocol = htons(ETH_P_IP); + hdr4 = (struct iphdr *)skb_put(skb4, sizeof(*hdr4)); + hdr4->saddr = htonl(8182); + skb_reset_network_header(skb4); + ++test; + +#if IS_ENABLED(CONFIG_IPV6) + skb6 = alloc_skb(sizeof(struct ipv6hdr), GFP_KERNEL); + if (unlikely(!skb6)) { + kfree_skb(skb4); + goto err_nofree; + } + skb6->protocol = htons(ETH_P_IPV6); + hdr6 = (struct ipv6hdr *)skb_put(skb6, sizeof(*hdr6)); + hdr6->saddr.in6_u.u6_addr32[0] = htonl(1212); + hdr6->saddr.in6_u.u6_addr32[1] = htonl(289188); + skb_reset_network_header(skb6); + ++test; +#endif + +restart: + loop_start_time = jiffies; + for (i = 0; i < ARRAY_SIZE(expected_results); ++i) { +#define ensure_time do { \ + if (time_is_before_jiffies(loop_start_time + \ + maximum_jiffies_at_index(i))) { \ + if (++tries >= 5000) \ + goto err; \ + gc_entries(NULL); \ + rcu_barrier(); \ + msleep(500); \ + goto restart; \ + } \ + } while (0) + + if (expected_results[i].msec_to_sleep_before) + msleep(expected_results[i].msec_to_sleep_before); + + ensure_time; + if (wg_ratelimiter_allow(skb4, &init_net) != + expected_results[i].result) + goto err; + ++test; + hdr4->saddr = htonl(ntohl(hdr4->saddr) + i + 1); + ensure_time; + if (!wg_ratelimiter_allow(skb4, &init_net)) + goto err; + ++test; + hdr4->saddr = htonl(ntohl(hdr4->saddr) - i - 1); + +#if IS_ENABLED(CONFIG_IPV6) + hdr6->saddr.in6_u.u6_addr32[2] = + hdr6->saddr.in6_u.u6_addr32[3] = htonl(i); + ensure_time; + if (wg_ratelimiter_allow(skb6, &init_net) != + expected_results[i].result) + goto err; + ++test; + hdr6->saddr.in6_u.u6_addr32[0] = + htonl(ntohl(hdr6->saddr.in6_u.u6_addr32[0]) + i + 1); + ensure_time; + if (!wg_ratelimiter_allow(skb6, &init_net)) + goto err; + ++test; + hdr6->saddr.in6_u.u6_addr32[0] = + htonl(ntohl(hdr6->saddr.in6_u.u6_addr32[0]) - i - 1); + ensure_time; +#endif + } + + tries = 0; +restart2: + gc_entries(NULL); + rcu_barrier(); + + if (atomic_read(&total_entries)) + goto err; + ++test; + + for (i = 0; i <= max_entries; ++i) { + hdr4->saddr = htonl(i); + if (wg_ratelimiter_allow(skb4, &init_net) != + (i != max_entries)) { + if (++tries < 5000) + goto restart2; + goto err; + } + ++test; + } + + ret = true; + 
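+    /* Success: fall through to the shared cleanup path below. */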
+err: + kfree_skb(skb4); +#if IS_ENABLED(CONFIG_IPV6) + kfree_skb(skb6); +#endif +err_nofree: + wg_ratelimiter_uninit(); + wg_ratelimiter_uninit(); + wg_ratelimiter_uninit(); + /* Uninit one extra time to check underflow detection. */ + wg_ratelimiter_uninit(); +out: + if (ret) + pr_info("ratelimiter self-tests: pass\n"); + else + pr_info("ratelimiter self-test %d: fail\n", test); + + return ret; +} +#endif diff --git a/drivers/net/wireguard/send.c b/drivers/net/wireguard/send.c new file mode 100644 index 000000000000..d4bf1a0e0856 --- /dev/null +++ b/drivers/net/wireguard/send.c @@ -0,0 +1,421 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (C) 2015-2018 Jason A. Donenfeld . All Rights Reserved. + */ + +#include "queueing.h" +#include "timers.h" +#include "device.h" +#include "peer.h" +#include "socket.h" +#include "messages.h" +#include "cookie.h" + +#include +#include +#include +#include +#include +#include +#include + +static void packet_send_handshake_initiation(struct wireguard_peer *peer) +{ + struct message_handshake_initiation packet; + + if (!wg_birthdate_has_expired(atomic64_read(&peer->last_sent_handshake), + REKEY_TIMEOUT)) + return; /* This function is rate limited. */ + + atomic64_set(&peer->last_sent_handshake, ktime_get_boot_fast_ns()); + net_dbg_ratelimited("%s: Sending handshake initiation to peer %llu (%pISpfsc)\n", + peer->device->dev->name, peer->internal_id, + &peer->endpoint.addr); + + if (wg_noise_handshake_create_initiation(&packet, &peer->handshake)) { + wg_cookie_add_mac_to_packet(&packet, sizeof(packet), peer); + wg_timers_any_authenticated_packet_traversal(peer); + wg_timers_any_authenticated_packet_sent(peer); + atomic64_set(&peer->last_sent_handshake, + ktime_get_boot_fast_ns()); + wg_socket_send_buffer_to_peer(peer, &packet, sizeof(packet), + HANDSHAKE_DSCP); + wg_timers_handshake_initiated(peer); + } +} + +void wg_packet_handshake_send_worker(struct work_struct *work) +{ + struct wireguard_peer *peer = container_of(work, struct wireguard_peer, + transmit_handshake_work); + + packet_send_handshake_initiation(peer); + wg_peer_put(peer); +} + +void wg_packet_send_queued_handshake_initiation(struct wireguard_peer *peer, + bool is_retry) +{ + if (!is_retry) + peer->timer_handshake_attempts = 0; + + rcu_read_lock_bh(); + /* We check last_sent_handshake here in addition to the actual function + * we're queueing up, so that we don't queue things if not strictly + * necessary: + */ + if (!wg_birthdate_has_expired(atomic64_read(&peer->last_sent_handshake), + REKEY_TIMEOUT) || unlikely(peer->is_dead)) + goto out; + + wg_peer_get(peer); + /* Queues up calling packet_send_queued_handshakes(peer), where we do a + * peer_put(peer) after: + */ + if (!queue_work(peer->device->handshake_send_wq, + &peer->transmit_handshake_work)) + /* If the work was already queued, we want to drop the + * extra reference: + */ + wg_peer_put(peer); +out: + rcu_read_unlock_bh(); +} + +void wg_packet_send_handshake_response(struct wireguard_peer *peer) +{ + struct message_handshake_response packet; + + atomic64_set(&peer->last_sent_handshake, ktime_get_boot_fast_ns()); + net_dbg_ratelimited("%s: Sending handshake response to peer %llu (%pISpfsc)\n", + peer->device->dev->name, peer->internal_id, + &peer->endpoint.addr); + + if (wg_noise_handshake_create_response(&packet, &peer->handshake)) { + wg_cookie_add_mac_to_packet(&packet, sizeof(packet), peer); + if (wg_noise_handshake_begin_session(&peer->handshake, + &peer->keypairs)) { + wg_timers_session_derived(peer); + 
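+            /* The response is itself authenticated traffic, so
+             * update the traversal/sent timers before transmit.
+             */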
wg_timers_any_authenticated_packet_traversal(peer); + wg_timers_any_authenticated_packet_sent(peer); + atomic64_set(&peer->last_sent_handshake, + ktime_get_boot_fast_ns()); + wg_socket_send_buffer_to_peer(peer, &packet, + sizeof(packet), + HANDSHAKE_DSCP); + } + } +} + +void wg_packet_send_handshake_cookie(struct wireguard_device *wg, + struct sk_buff *initiating_skb, + __le32 sender_index) +{ + struct message_handshake_cookie packet; + + net_dbg_skb_ratelimited("%s: Sending cookie response for denied handshake message for %pISpfsc\n", + wg->dev->name, initiating_skb); + wg_cookie_message_create(&packet, initiating_skb, sender_index, + &wg->cookie_checker); + wg_socket_send_buffer_as_reply_to_skb(wg, initiating_skb, &packet, + sizeof(packet)); +} + +static void keep_key_fresh(struct wireguard_peer *peer) +{ + struct noise_keypair *keypair; + bool send = false; + + rcu_read_lock_bh(); + keypair = rcu_dereference_bh(peer->keypairs.current_keypair); + if (likely(keypair && keypair->sending.is_valid) && + (unlikely(atomic64_read(&keypair->sending.counter.counter) > + REKEY_AFTER_MESSAGES) || + (keypair->i_am_the_initiator && + unlikely(wg_birthdate_has_expired(keypair->sending.birthdate, + REKEY_AFTER_TIME))))) + send = true; + rcu_read_unlock_bh(); + + if (send) + wg_packet_send_queued_handshake_initiation(peer, false); +} + +static unsigned int skb_padding(struct sk_buff *skb) +{ + /* We do this modulo business with the MTU, just in case the networking + * layer gives us a packet that's bigger than the MTU. In that case, we + * wouldn't want the final subtraction to overflow in the case of the + * padded_size being clamped. + */ + unsigned int last_unit = skb->len % PACKET_CB(skb)->mtu; + unsigned int padded_size = ALIGN(last_unit, MESSAGE_PADDING_MULTIPLE); + + if (padded_size > PACKET_CB(skb)->mtu) + padded_size = PACKET_CB(skb)->mtu; + return padded_size - last_unit; +} + +static bool encrypt_packet(struct sk_buff *skb, struct noise_keypair *keypair, + simd_context_t *simd_context) +{ + unsigned int padding_len, plaintext_len, trailer_len; + struct scatterlist sg[MAX_SKB_FRAGS + 8]; + struct message_data *header; + struct sk_buff *trailer; + int num_frags; + + /* Calculate lengths. */ + padding_len = skb_padding(skb); + trailer_len = padding_len + noise_encrypted_len(0); + plaintext_len = skb->len + padding_len; + + /* Expand data section to have room for padding and auth tag. */ + num_frags = skb_cow_data(skb, trailer_len, &trailer); + if (unlikely(num_frags < 0 || num_frags > ARRAY_SIZE(sg))) + return false; + + /* Set the padding to zeros, and make sure it and the auth tag are part + * of the skb. + */ + memset(skb_tail_pointer(trailer), 0, padding_len); + + /* Expand head section to have room for our header and the network + * stack's headers. + */ + if (unlikely(skb_cow_head(skb, DATA_PACKET_HEAD_ROOM) < 0)) + return false; + + /* We have to remember to add the checksum to the innerpacket, in case + * the receiver forwards it. + */ + if (likely(!skb_checksum_setup(skb, true))) + skb_checksum_help(skb); + + /* Only after checksumming can we safely add on the padding at the end + * and the header. 
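+     * (The padding we add becomes part of the encrypted payload, so the
+     * inner checksum must be final before it is appended.)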
+ */ + skb_set_inner_network_header(skb, 0); + header = (struct message_data *)skb_push(skb, sizeof(*header)); + header->header.type = cpu_to_le32(MESSAGE_DATA); + header->key_idx = keypair->remote_index; + header->counter = cpu_to_le64(PACKET_CB(skb)->nonce); + pskb_put(skb, trailer, trailer_len); + + /* Now we can encrypt the scattergather segments */ + sg_init_table(sg, num_frags); + if (skb_to_sgvec(skb, sg, sizeof(struct message_data), + noise_encrypted_len(plaintext_len)) <= 0) + return false; + return chacha20poly1305_encrypt_sg(sg, sg, plaintext_len, NULL, 0, + PACKET_CB(skb)->nonce, + keypair->sending.key, simd_context); +} + +void wg_packet_send_keepalive(struct wireguard_peer *peer) +{ + struct sk_buff *skb; + + if (skb_queue_empty(&peer->staged_packet_queue)) { + skb = alloc_skb(DATA_PACKET_HEAD_ROOM + MESSAGE_MINIMUM_LENGTH, + GFP_ATOMIC); + if (unlikely(!skb)) + return; + skb_reserve(skb, DATA_PACKET_HEAD_ROOM); + skb->dev = peer->device->dev; + PACKET_CB(skb)->mtu = skb->dev->mtu; + skb_queue_tail(&peer->staged_packet_queue, skb); + net_dbg_ratelimited("%s: Sending keepalive packet to peer %llu (%pISpfsc)\n", + peer->device->dev->name, peer->internal_id, + &peer->endpoint.addr); + } + + wg_packet_send_staged_packets(peer); +} + +#define skb_walk_null_queue_safe(first, skb, next) \ + for (skb = first, next = skb->next; skb; \ + skb = next, next = skb ? skb->next : NULL) +static void skb_free_null_queue(struct sk_buff *first) +{ + struct sk_buff *skb, *next; + + skb_walk_null_queue_safe (first, skb, next) + dev_kfree_skb(skb); +} + +static void packet_create_data_done(struct sk_buff *first, + struct wireguard_peer *peer) +{ + struct sk_buff *skb, *next; + bool is_keepalive, data_sent = false; + + wg_timers_any_authenticated_packet_traversal(peer); + wg_timers_any_authenticated_packet_sent(peer); + skb_walk_null_queue_safe (first, skb, next) { + is_keepalive = skb->len == message_data_len(0); + if (likely(!wg_socket_send_skb_to_peer(peer, skb, + PACKET_CB(skb)->ds) && !is_keepalive)) + data_sent = true; + } + + if (likely(data_sent)) + wg_timers_data_sent(peer); + + keep_key_fresh(peer); +} + +void wg_packet_tx_worker(struct work_struct *work) +{ + struct crypt_queue *queue = + container_of(work, struct crypt_queue, work); + struct wireguard_peer *peer; + struct noise_keypair *keypair; + struct sk_buff *first; + enum packet_state state; + + while ((first = __ptr_ring_peek(&queue->ring)) != NULL && + (state = atomic_read_acquire(&PACKET_CB(first)->state)) != + PACKET_STATE_UNCRYPTED) { + __ptr_ring_discard_one(&queue->ring); + peer = PACKET_PEER(first); + keypair = PACKET_CB(first)->keypair; + + if (likely(state == PACKET_STATE_CRYPTED)) + packet_create_data_done(first, peer); + else + skb_free_null_queue(first); + + wg_noise_keypair_put(keypair, false); + wg_peer_put(peer); + } +} + +void wg_packet_encrypt_worker(struct work_struct *work) +{ + struct crypt_queue *queue = + container_of(work, struct multicore_worker, work)->ptr; + struct sk_buff *first, *skb, *next; + simd_context_t simd_context; + + simd_get(&simd_context); + while ((first = ptr_ring_consume_bh(&queue->ring)) != NULL) { + enum packet_state state = PACKET_STATE_CRYPTED; + + skb_walk_null_queue_safe (first, skb, next) { + if (likely(encrypt_packet(skb, PACKET_CB(first)->keypair, + &simd_context))) + wg_reset_packet(skb); + else { + state = PACKET_STATE_DEAD; + break; + } + } + wg_queue_enqueue_per_peer(&PACKET_PEER(first)->tx_queue, first, + state); + + simd_relax(&simd_context); + } + simd_put(&simd_context); +} 
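+
+/* The functions above and below form the tx pipeline: packet_create_data()
+ * feeds a bundle of skbs into the parallel encryption queue,
+ * wg_packet_encrypt_worker() encrypts bundles on whichever CPU picks them
+ * up, and wg_packet_tx_worker() drains the per-peer ring strictly in order,
+ * so packets leave in the order they were staged regardless of which
+ * encryption finishes first.
+ */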
+ +static void packet_create_data(struct sk_buff *first) +{ + struct wireguard_peer *peer = PACKET_PEER(first); + struct wireguard_device *wg = peer->device; + int ret = -EINVAL; + + rcu_read_lock_bh(); + if (unlikely(peer->is_dead)) + goto err; + + ret = wg_queue_enqueue_per_device_and_peer(&wg->encrypt_queue, + &peer->tx_queue, first, + wg->packet_crypt_wq, + &wg->encrypt_queue.last_cpu); + if (unlikely(ret == -EPIPE)) + wg_queue_enqueue_per_peer(&peer->tx_queue, first, + PACKET_STATE_DEAD); +err: + rcu_read_unlock_bh(); + if (likely(!ret || ret == -EPIPE)) + return; + wg_noise_keypair_put(PACKET_CB(first)->keypair, false); + wg_peer_put(peer); + skb_free_null_queue(first); +} + +void wg_packet_send_staged_packets(struct wireguard_peer *peer) +{ + struct noise_symmetric_key *key; + struct noise_keypair *keypair; + struct sk_buff_head packets; + struct sk_buff *skb; + + /* Steal the current queue into our local one. */ + __skb_queue_head_init(&packets); + spin_lock_bh(&peer->staged_packet_queue.lock); + skb_queue_splice_init(&peer->staged_packet_queue, &packets); + spin_unlock_bh(&peer->staged_packet_queue.lock); + if (unlikely(skb_queue_empty(&packets))) + return; + + /* First we make sure we have a valid reference to a valid key. */ + rcu_read_lock_bh(); + keypair = wg_noise_keypair_get( + rcu_dereference_bh(peer->keypairs.current_keypair)); + rcu_read_unlock_bh(); + if (unlikely(!keypair)) + goto out_nokey; + key = &keypair->sending; + if (unlikely(!key->is_valid)) + goto out_nokey; + if (unlikely(wg_birthdate_has_expired( + key->birthdate, REJECT_AFTER_TIME))) + goto out_invalid; + + /* After we know we have a somewhat valid key, we now try to assign + * nonces to all of the packets in the queue. If we can't assign nonces + * for all of them, we just consider it a failure and wait for the next + * handshake. + */ + skb_queue_walk (&packets, skb) { + /* 0 for no outer TOS: no leak. TODO: should we use flowi->tos + * as outer? */ + PACKET_CB(skb)->ds = ip_tunnel_ecn_encap(0, ip_hdr(skb), skb); + PACKET_CB(skb)->nonce = + atomic64_inc_return(&key->counter.counter) - 1; + if (unlikely(PACKET_CB(skb)->nonce >= REJECT_AFTER_MESSAGES)) + goto out_invalid; + } + + packets.prev->next = NULL; + wg_peer_get(keypair->entry.peer); + PACKET_CB(packets.next)->keypair = keypair; + packet_create_data(packets.next); + return; + +out_invalid: + key->is_valid = false; +out_nokey: + wg_noise_keypair_put(keypair, false); + + /* We orphan the packets if we're waiting on a handshake, so that they + * don't block a socket's pool. + */ + skb_queue_walk (&packets, skb) + skb_orphan(skb); + /* Then we put them back on the top of the queue. We're not too + * concerned about accidentally getting things a little out of order if + * packets are being added really fast, because this queue is for before + * packets can even be sent and it's small anyway. + */ + spin_lock_bh(&peer->staged_packet_queue.lock); + skb_queue_splice(&packets, &peer->staged_packet_queue); + spin_unlock_bh(&peer->staged_packet_queue.lock); + + /* If we're exiting because there's something wrong with the key, it + * means we should initiate a new handshake. + */ + wg_packet_send_queued_handshake_initiation(peer, false); +} diff --git a/drivers/net/wireguard/socket.c b/drivers/net/wireguard/socket.c new file mode 100644 index 000000000000..8e9adfd67b35 --- /dev/null +++ b/drivers/net/wireguard/socket.c @@ -0,0 +1,432 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (C) 2015-2018 Jason A. Donenfeld . All Rights Reserved. 
+ */ + +#include "device.h" +#include "peer.h" +#include "socket.h" +#include "queueing.h" +#include "messages.h" + +#include +#include +#include +#include +#include +#include +#include + +static int send4(struct wireguard_device *wg, struct sk_buff *skb, + struct endpoint *endpoint, u8 ds, struct dst_cache *cache) +{ + struct flowi4 fl = { + .saddr = endpoint->src4.s_addr, + .daddr = endpoint->addr4.sin_addr.s_addr, + .fl4_dport = endpoint->addr4.sin_port, + .flowi4_mark = wg->fwmark, + .flowi4_proto = IPPROTO_UDP + }; + struct rtable *rt = NULL; + struct sock *sock; + int ret = 0; + + skb->next = skb->prev = NULL; + skb->dev = wg->dev; + skb->mark = wg->fwmark; + + rcu_read_lock_bh(); + sock = rcu_dereference_bh(wg->sock4); + + if (unlikely(!sock)) { + ret = -ENONET; + goto err; + } + + fl.fl4_sport = inet_sk(sock)->inet_sport; + + if (cache) + rt = dst_cache_get_ip4(cache, &fl.saddr); + + if (!rt) { + security_sk_classify_flow(sock, flowi4_to_flowi(&fl)); + if (unlikely(!inet_confirm_addr(sock_net(sock), NULL, 0, + fl.saddr, RT_SCOPE_HOST))) { + endpoint->src4.s_addr = 0; + *(__force __be32 *)&endpoint->src_if4 = 0; + fl.saddr = 0; + if (cache) + dst_cache_reset(cache); + } + rt = ip_route_output_flow(sock_net(sock), &fl, sock); + if (unlikely(endpoint->src_if4 && ((IS_ERR(rt) && + PTR_ERR(rt) == -EINVAL) || (!IS_ERR(rt) && + rt->dst.dev->ifindex != endpoint->src_if4)))) { + endpoint->src4.s_addr = 0; + *(__force __be32 *)&endpoint->src_if4 = 0; + fl.saddr = 0; + if (cache) + dst_cache_reset(cache); + if (!IS_ERR(rt)) + ip_rt_put(rt); + rt = ip_route_output_flow(sock_net(sock), &fl, sock); + } + if (unlikely(IS_ERR(rt))) { + ret = PTR_ERR(rt); + net_dbg_ratelimited("%s: No route to %pISpfsc, error %d\n", + wg->dev->name, &endpoint->addr, ret); + goto err; + } else if (unlikely(rt->dst.dev == skb->dev)) { + ip_rt_put(rt); + ret = -ELOOP; + net_dbg_ratelimited("%s: Avoiding routing loop to %pISpfsc\n", + wg->dev->name, &endpoint->addr); + goto err; + } + if (cache) + dst_cache_set_ip4(cache, &rt->dst, fl.saddr); + } + udp_tunnel_xmit_skb(rt, sock, skb, fl.saddr, fl.daddr, ds, + ip4_dst_hoplimit(&rt->dst), 0, fl.fl4_sport, + fl.fl4_dport, false, false); + goto out; + +err: + kfree_skb(skb); +out: + rcu_read_unlock_bh(); + return ret; +} + +static int send6(struct wireguard_device *wg, struct sk_buff *skb, + struct endpoint *endpoint, u8 ds, struct dst_cache *cache) +{ +#if IS_ENABLED(CONFIG_IPV6) + struct flowi6 fl = { + .saddr = endpoint->src6, + .daddr = endpoint->addr6.sin6_addr, + .fl6_dport = endpoint->addr6.sin6_port, + .flowi6_mark = wg->fwmark, + .flowi6_oif = endpoint->addr6.sin6_scope_id, + .flowi6_proto = IPPROTO_UDP + /* TODO: addr->sin6_flowinfo */ + }; + struct dst_entry *dst = NULL; + struct sock *sock; + int ret = 0; + + skb->next = skb->prev = NULL; + skb->dev = wg->dev; + skb->mark = wg->fwmark; + + rcu_read_lock_bh(); + sock = rcu_dereference_bh(wg->sock6); + + if (unlikely(!sock)) { + ret = -ENONET; + goto err; + } + + fl.fl6_sport = inet_sk(sock)->inet_sport; + + if (cache) + dst = dst_cache_get_ip6(cache, &fl.saddr); + + if (!dst) { + security_sk_classify_flow(sock, flowi6_to_flowi(&fl)); + if (unlikely(!ipv6_addr_any(&fl.saddr) && + !ipv6_chk_addr(sock_net(sock), &fl.saddr, NULL, 0))) { + endpoint->src6 = fl.saddr = in6addr_any; + if (cache) + dst_cache_reset(cache); + } + ret = ipv6_stub->ipv6_dst_lookup(sock_net(sock), sock, &dst, + &fl); + if (unlikely(ret)) { + net_dbg_ratelimited("%s: No route to %pISpfsc, error %d\n", + wg->dev->name, &endpoint->addr, ret); + 
goto err; + } else if (unlikely(dst->dev == skb->dev)) { + dst_release(dst); + ret = -ELOOP; + net_dbg_ratelimited("%s: Avoiding routing loop to %pISpfsc\n", + wg->dev->name, &endpoint->addr); + goto err; + } + if (cache) + dst_cache_set_ip6(cache, dst, &fl.saddr); + } + + udp_tunnel6_xmit_skb(dst, sock, skb, skb->dev, &fl.saddr, &fl.daddr, ds, + ip6_dst_hoplimit(dst), 0, fl.fl6_sport, + fl.fl6_dport, false); + goto out; + +err: + kfree_skb(skb); +out: + rcu_read_unlock_bh(); + return ret; +#else + return -EAFNOSUPPORT; +#endif +} + +int wg_socket_send_skb_to_peer(struct wireguard_peer *peer, struct sk_buff *skb, + u8 ds) +{ + size_t skb_len = skb->len; + int ret = -EAFNOSUPPORT; + + read_lock_bh(&peer->endpoint_lock); + if (peer->endpoint.addr.sa_family == AF_INET) + ret = send4(peer->device, skb, &peer->endpoint, ds, + &peer->endpoint_cache); + else if (peer->endpoint.addr.sa_family == AF_INET6) + ret = send6(peer->device, skb, &peer->endpoint, ds, + &peer->endpoint_cache); + else + dev_kfree_skb(skb); + if (likely(!ret)) + peer->tx_bytes += skb_len; + read_unlock_bh(&peer->endpoint_lock); + + return ret; +} + +int wg_socket_send_buffer_to_peer(struct wireguard_peer *peer, void *buffer, + size_t len, u8 ds) +{ + struct sk_buff *skb = alloc_skb(len + SKB_HEADER_LEN, GFP_ATOMIC); + + if (unlikely(!skb)) + return -ENOMEM; + + skb_reserve(skb, SKB_HEADER_LEN); + skb_set_inner_network_header(skb, 0); + skb_put_data(skb, buffer, len); + return wg_socket_send_skb_to_peer(peer, skb, ds); +} + +int wg_socket_send_buffer_as_reply_to_skb(struct wireguard_device *wg, + struct sk_buff *in_skb, void *buffer, + size_t len) +{ + int ret = 0; + struct sk_buff *skb; + struct endpoint endpoint; + + if (unlikely(!in_skb)) + return -EINVAL; + ret = wg_socket_endpoint_from_skb(&endpoint, in_skb); + if (unlikely(ret < 0)) + return ret; + + skb = alloc_skb(len + SKB_HEADER_LEN, GFP_ATOMIC); + if (unlikely(!skb)) + return -ENOMEM; + skb_reserve(skb, SKB_HEADER_LEN); + skb_set_inner_network_header(skb, 0); + skb_put_data(skb, buffer, len); + + if (endpoint.addr.sa_family == AF_INET) + ret = send4(wg, skb, &endpoint, 0, NULL); + else if (endpoint.addr.sa_family == AF_INET6) + ret = send6(wg, skb, &endpoint, 0, NULL); + /* No other possibilities if the endpoint is valid, which it is, + * as we checked above. 
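+     * (wg_socket_endpoint_from_skb() only ever yields AF_INET or
+     * AF_INET6, and returns -EINVAL for anything else.)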
+ */ + + return ret; +} + +int wg_socket_endpoint_from_skb(struct endpoint *endpoint, + const struct sk_buff *skb) +{ + memset(endpoint, 0, sizeof(*endpoint)); + if (skb->protocol == htons(ETH_P_IP)) { + endpoint->addr4.sin_family = AF_INET; + endpoint->addr4.sin_port = udp_hdr(skb)->source; + endpoint->addr4.sin_addr.s_addr = ip_hdr(skb)->saddr; + endpoint->src4.s_addr = ip_hdr(skb)->daddr; + endpoint->src_if4 = skb->skb_iif; + } else if (skb->protocol == htons(ETH_P_IPV6)) { + endpoint->addr6.sin6_family = AF_INET6; + endpoint->addr6.sin6_port = udp_hdr(skb)->source; + endpoint->addr6.sin6_addr = ipv6_hdr(skb)->saddr; + endpoint->addr6.sin6_scope_id = ipv6_iface_scope_id( + &ipv6_hdr(skb)->saddr, skb->skb_iif); + endpoint->src6 = ipv6_hdr(skb)->daddr; + } else + return -EINVAL; + return 0; +} + +static bool endpoint_eq(const struct endpoint *a, const struct endpoint *b) +{ + return (a->addr.sa_family == AF_INET && b->addr.sa_family == AF_INET && + a->addr4.sin_port == b->addr4.sin_port && + a->addr4.sin_addr.s_addr == b->addr4.sin_addr.s_addr && + a->src4.s_addr == b->src4.s_addr && a->src_if4 == b->src_if4) || + (a->addr.sa_family == AF_INET6 && + b->addr.sa_family == AF_INET6 && + a->addr6.sin6_port == b->addr6.sin6_port && + ipv6_addr_equal(&a->addr6.sin6_addr, &b->addr6.sin6_addr) && + a->addr6.sin6_scope_id == b->addr6.sin6_scope_id && + ipv6_addr_equal(&a->src6, &b->src6)) || + unlikely(!a->addr.sa_family && !b->addr.sa_family); +} + +void wg_socket_set_peer_endpoint(struct wireguard_peer *peer, + const struct endpoint *endpoint) +{ + /* First we check unlocked, in order to optimize, since it's pretty rare + * that an endpoint will change. If we happen to be mid-write, and two + * CPUs wind up writing the same thing or something slightly different, + * it doesn't really matter much either. 
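+     * At worst we grab the write lock for a redundant update, or miss
+     * one that the very next packet will perform anyway.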
+ */ + if (endpoint_eq(endpoint, &peer->endpoint)) + return; + write_lock_bh(&peer->endpoint_lock); + if (endpoint->addr.sa_family == AF_INET) { + peer->endpoint.addr4 = endpoint->addr4; + peer->endpoint.src4 = endpoint->src4; + peer->endpoint.src_if4 = endpoint->src_if4; + } else if (endpoint->addr.sa_family == AF_INET6) { + peer->endpoint.addr6 = endpoint->addr6; + peer->endpoint.src6 = endpoint->src6; + } else + goto out; + dst_cache_reset(&peer->endpoint_cache); +out: + write_unlock_bh(&peer->endpoint_lock); +} + +void wg_socket_set_peer_endpoint_from_skb(struct wireguard_peer *peer, + const struct sk_buff *skb) +{ + struct endpoint endpoint; + + if (!wg_socket_endpoint_from_skb(&endpoint, skb)) + wg_socket_set_peer_endpoint(peer, &endpoint); +} + +void wg_socket_clear_peer_endpoint_src(struct wireguard_peer *peer) +{ + write_lock_bh(&peer->endpoint_lock); + memset(&peer->endpoint.src6, 0, sizeof(peer->endpoint.src6)); + dst_cache_reset(&peer->endpoint_cache); + write_unlock_bh(&peer->endpoint_lock); +} + +static int receive(struct sock *sk, struct sk_buff *skb) +{ + struct wireguard_device *wg; + + if (unlikely(!sk)) + goto err; + wg = sk->sk_user_data; + if (unlikely(!wg)) + goto err; + wg_packet_receive(wg, skb); + return 0; + +err: + kfree_skb(skb); + return 0; +} + +static void sock_free(struct sock *sock) +{ + if (unlikely(!sock)) + return; + sk_clear_memalloc(sock); + udp_tunnel_sock_release(sock->sk_socket); +} + +static void set_sock_opts(struct socket *sock) +{ + sock->sk->sk_allocation = GFP_ATOMIC; + sock->sk->sk_sndbuf = INT_MAX; + sk_set_memalloc(sock->sk); +} + +int wg_socket_init(struct wireguard_device *wg, u16 port) +{ + int ret; + struct udp_tunnel_sock_cfg cfg = { + .sk_user_data = wg, + .encap_type = 1, + .encap_rcv = receive + }; + struct socket *new4 = NULL, *new6 = NULL; + struct udp_port_cfg port4 = { + .family = AF_INET, + .local_ip.s_addr = htonl(INADDR_ANY), + .local_udp_port = htons(port), + .use_udp_checksums = true + }; +#if IS_ENABLED(CONFIG_IPV6) + int retries = 0; + struct udp_port_cfg port6 = { + .family = AF_INET6, + .local_ip6 = IN6ADDR_ANY_INIT, + .use_udp6_tx_checksums = true, + .use_udp6_rx_checksums = true, + .ipv6_v6only = true + }; +#endif + +#if IS_ENABLED(CONFIG_IPV6) +retry: +#endif + + ret = udp_sock_create(wg->creating_net, &port4, &new4); + if (ret < 0) { + pr_err("%s: Could not create IPv4 socket\n", wg->dev->name); + return ret; + } + set_sock_opts(new4); + setup_udp_tunnel_sock(wg->creating_net, new4, &cfg); + +#if IS_ENABLED(CONFIG_IPV6) + if (ipv6_mod_enabled()) { + port6.local_udp_port = inet_sk(new4->sk)->inet_sport; + ret = udp_sock_create(wg->creating_net, &port6, &new6); + if (ret < 0) { + udp_tunnel_sock_release(new4); + if (ret == -EADDRINUSE && !port && retries++ < 100) + goto retry; + pr_err("%s: Could not create IPv6 socket\n", + wg->dev->name); + return ret; + } + set_sock_opts(new6); + setup_udp_tunnel_sock(wg->creating_net, new6, &cfg); + } +#endif + + wg_socket_reinit(wg, new4 ? new4->sk : NULL, new6 ? 
new6->sk : NULL); + return 0; +} + +void wg_socket_reinit(struct wireguard_device *wg, struct sock *new4, + struct sock *new6) +{ + struct sock *old4, *old6; + + mutex_lock(&wg->socket_update_lock); + old4 = rcu_dereference_protected(wg->sock4, + lockdep_is_held(&wg->socket_update_lock)); + old6 = rcu_dereference_protected(wg->sock6, + lockdep_is_held(&wg->socket_update_lock)); + rcu_assign_pointer(wg->sock4, new4); + rcu_assign_pointer(wg->sock6, new6); + if (new4) + wg->incoming_port = ntohs(inet_sk(new4)->inet_sport); + mutex_unlock(&wg->socket_update_lock); + synchronize_rcu_bh(); + synchronize_net(); + sock_free(old4); + sock_free(old6); +} diff --git a/drivers/net/wireguard/socket.h b/drivers/net/wireguard/socket.h new file mode 100644 index 000000000000..ee5eb157c073 --- /dev/null +++ b/drivers/net/wireguard/socket.h @@ -0,0 +1,44 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (C) 2015-2018 Jason A. Donenfeld . All Rights Reserved. + */ + +#ifndef _WG_SOCKET_H +#define _WG_SOCKET_H + +#include +#include +#include +#include + +int wg_socket_init(struct wireguard_device *wg, u16 port); +void wg_socket_reinit(struct wireguard_device *wg, struct sock *new4, + struct sock *new6); +int wg_socket_send_buffer_to_peer(struct wireguard_peer *peer, void *data, + size_t len, u8 ds); +int wg_socket_send_skb_to_peer(struct wireguard_peer *peer, struct sk_buff *skb, + u8 ds); +int wg_socket_send_buffer_as_reply_to_skb(struct wireguard_device *wg, + struct sk_buff *in_skb, + void *out_buffer, size_t len); + +int wg_socket_endpoint_from_skb(struct endpoint *endpoint, + const struct sk_buff *skb); +void wg_socket_set_peer_endpoint(struct wireguard_peer *peer, + const struct endpoint *endpoint); +void wg_socket_set_peer_endpoint_from_skb(struct wireguard_peer *peer, + const struct sk_buff *skb); +void wg_socket_clear_peer_endpoint_src(struct wireguard_peer *peer); + +#if defined(CONFIG_DYNAMIC_DEBUG) || defined(DEBUG) +#define net_dbg_skb_ratelimited(fmt, dev, skb, ...) do { \ + struct endpoint __endpoint; \ + wg_socket_endpoint_from_skb(&__endpoint, skb); \ + net_dbg_ratelimited(fmt, dev, &__endpoint.addr, \ + ##__VA_ARGS__); \ + } while (0) +#else +#define net_dbg_skb_ratelimited(fmt, skb, ...) +#endif + +#endif /* _WG_SOCKET_H */ diff --git a/drivers/net/wireguard/timers.c b/drivers/net/wireguard/timers.c new file mode 100644 index 000000000000..f8cd5c5519b8 --- /dev/null +++ b/drivers/net/wireguard/timers.c @@ -0,0 +1,256 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (C) 2015-2018 Jason A. Donenfeld . All Rights Reserved. + */ + +#include "timers.h" +#include "device.h" +#include "peer.h" +#include "queueing.h" +#include "socket.h" + +/* + * - Timer for retransmitting the handshake if we don't hear back after + * `REKEY_TIMEOUT + jitter` ms. + * + * - Timer for sending empty packet if we have received a packet but after have + * not sent one for `KEEPALIVE_TIMEOUT` ms. + * + * - Timer for initiating new handshake if we have sent a packet but after have + * not received one (even empty) for `(KEEPALIVE_TIMEOUT + REKEY_TIMEOUT)` ms. + * + * - Timer for zeroing out all ephemeral keys after `(REJECT_AFTER_TIME * 3)` ms + * if no new keys have been received. + * + * - Timer for, if enabled, sending an empty authenticated packet every user- + * specified seconds. 
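+ *
+ * Every timer callback below takes its own reference on the peer via
+ * peer_get_from_timer() and is responsible for putting it again.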
+ */ + +#define peer_get_from_timer(timer_name) \ + struct wireguard_peer *peer; \ + rcu_read_lock_bh(); \ + peer = wg_peer_get_maybe_zero(from_timer(peer, timer, timer_name)); \ + rcu_read_unlock_bh(); \ + if (unlikely(!peer)) \ + return; + +static inline void mod_peer_timer(struct wireguard_peer *peer, + struct timer_list *timer, + unsigned long expires) +{ + rcu_read_lock_bh(); + if (likely(netif_running(peer->device->dev) && !peer->is_dead)) + mod_timer(timer, expires); + rcu_read_unlock_bh(); +} + +static inline void del_peer_timer(struct wireguard_peer *peer, + struct timer_list *timer) +{ + rcu_read_lock_bh(); + if (likely(netif_running(peer->device->dev) && !peer->is_dead)) + del_timer(timer); + rcu_read_unlock_bh(); +} + +static void expired_retransmit_handshake(struct timer_list *timer) +{ + peer_get_from_timer(timer_retransmit_handshake); + + if (peer->timer_handshake_attempts > MAX_TIMER_HANDSHAKES) { + pr_debug("%s: Handshake for peer %llu (%pISpfsc) did not complete after %d attempts, giving up\n", + peer->device->dev->name, peer->internal_id, + &peer->endpoint.addr, MAX_TIMER_HANDSHAKES + 2); + + del_peer_timer(peer, &peer->timer_send_keepalive); + /* We drop all packets without a keypair and don't try again, + * if we try unsuccessfully for too long to make a handshake. + */ + skb_queue_purge(&peer->staged_packet_queue); + + /* We set a timer for destroying any residue that might be left + * of a partial exchange. + */ + if (!timer_pending(&peer->timer_zero_key_material)) + mod_peer_timer(peer, &peer->timer_zero_key_material, + jiffies + REJECT_AFTER_TIME * 3 * HZ); + } else { + ++peer->timer_handshake_attempts; + pr_debug("%s: Handshake for peer %llu (%pISpfsc) did not complete after %d seconds, retrying (try %d)\n", + peer->device->dev->name, peer->internal_id, + &peer->endpoint.addr, REKEY_TIMEOUT, + peer->timer_handshake_attempts + 1); + + /* We clear the endpoint address src address, in case this is + * the cause of trouble. + */ + wg_socket_clear_peer_endpoint_src(peer); + + wg_packet_send_queued_handshake_initiation(peer, true); + } + wg_peer_put(peer); +} + +static void expired_send_keepalive(struct timer_list *timer) +{ + peer_get_from_timer(timer_send_keepalive); + + wg_packet_send_keepalive(peer); + if (peer->timer_need_another_keepalive) { + peer->timer_need_another_keepalive = false; + mod_peer_timer(peer, &peer->timer_send_keepalive, + jiffies + KEEPALIVE_TIMEOUT * HZ); + } + wg_peer_put(peer); +} + +static void expired_new_handshake(struct timer_list *timer) +{ + peer_get_from_timer(timer_new_handshake); + + pr_debug("%s: Retrying handshake with peer %llu (%pISpfsc) because we stopped hearing back after %d seconds\n", + peer->device->dev->name, peer->internal_id, + &peer->endpoint.addr, KEEPALIVE_TIMEOUT + REKEY_TIMEOUT); + /* We clear the endpoint address src address, in case this is the cause + * of trouble. + */ + wg_socket_clear_peer_endpoint_src(peer); + wg_packet_send_queued_handshake_initiation(peer, false); + wg_peer_put(peer); +} + +static void expired_zero_key_material(struct timer_list *timer) +{ + peer_get_from_timer(timer_zero_key_material); + + rcu_read_lock_bh(); + if (!peer->is_dead) { + /* Should take our reference. 
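+         * queued_expired_zero_key_material() puts it once the work
+         * has run.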
*/
+        if (!queue_work(peer->device->handshake_send_wq,
+                &peer->clear_peer_work))
+            /* If the work was already on the queue, we want to drop the extra reference */
+            wg_peer_put(peer);
+    }
+    rcu_read_unlock_bh();
+}
+static void queued_expired_zero_key_material(struct work_struct *work)
+{
+    struct wireguard_peer *peer =
+        container_of(work, struct wireguard_peer, clear_peer_work);
+
+    pr_debug("%s: Zeroing out all keys for peer %llu (%pISpfsc), since we haven't received a new one in %d seconds\n",
+         peer->device->dev->name, peer->internal_id,
+         &peer->endpoint.addr, REJECT_AFTER_TIME * 3);
+    wg_noise_handshake_clear(&peer->handshake);
+    wg_noise_keypairs_clear(&peer->keypairs);
+    wg_peer_put(peer);
+}
+
+static void expired_send_persistent_keepalive(struct timer_list *timer)
+{
+    peer_get_from_timer(timer_persistent_keepalive);
+
+    if (likely(peer->persistent_keepalive_interval))
+        wg_packet_send_keepalive(peer);
+    wg_peer_put(peer);
+}
+
+/* Should be called after an authenticated data packet is sent. */
+void wg_timers_data_sent(struct wireguard_peer *peer)
+{
+    if (!timer_pending(&peer->timer_new_handshake))
+        mod_peer_timer(peer, &peer->timer_new_handshake,
+                   jiffies + (KEEPALIVE_TIMEOUT + REKEY_TIMEOUT) * HZ);
+}
+
+/* Should be called after an authenticated data packet is received. */
+void wg_timers_data_received(struct wireguard_peer *peer)
+{
+    if (likely(netif_running(peer->device->dev))) {
+        if (!timer_pending(&peer->timer_send_keepalive))
+            mod_peer_timer(peer, &peer->timer_send_keepalive,
+                       jiffies + KEEPALIVE_TIMEOUT * HZ);
+        else
+            peer->timer_need_another_keepalive = true;
+    }
+}
+
+/* Should be called after any type of authenticated packet is sent, whether
+ * keepalive, data, or handshake.
+ */
+void wg_timers_any_authenticated_packet_sent(struct wireguard_peer *peer)
+{
+    del_peer_timer(peer, &peer->timer_send_keepalive);
+}
+
+/* Should be called after any type of authenticated packet is received, whether
+ * keepalive, data, or handshake.
+ */
+void wg_timers_any_authenticated_packet_received(struct wireguard_peer *peer)
+{
+    del_peer_timer(peer, &peer->timer_new_handshake);
+}
+
+/* Should be called after a handshake initiation message is sent. */
+void wg_timers_handshake_initiated(struct wireguard_peer *peer)
+{
+    mod_peer_timer(
+        peer, &peer->timer_retransmit_handshake,
+        jiffies + REKEY_TIMEOUT * HZ +
+        prandom_u32_max(REKEY_TIMEOUT_JITTER_MAX_JIFFIES));
+}
+
+/* Should be called after a handshake response message is received and processed
+ * or when getting key confirmation via the first data message.
+ */
+void wg_timers_handshake_complete(struct wireguard_peer *peer)
+{
+    del_peer_timer(peer, &peer->timer_retransmit_handshake);
+    peer->timer_handshake_attempts = 0;
+    peer->sent_lastminute_handshake = false;
+    getnstimeofday(&peer->walltime_last_handshake);
+}
+
+/* Should be called after an ephemeral key is created, which is before sending a
+ * handshake response or after receiving a handshake response.
+ */
+void wg_timers_session_derived(struct wireguard_peer *peer)
+{
+    mod_peer_timer(peer, &peer->timer_zero_key_material,
+               jiffies + REJECT_AFTER_TIME * 3 * HZ);
+}
+
+/* Should be called before a packet with authentication, whether
+ * keepalive, data, or handshake, is sent, or after one is received.
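+ * It (re)arms the persistent keepalive timer, when one is configured.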
+ */ +void wg_timers_any_authenticated_packet_traversal(struct wireguard_peer *peer) +{ + if (peer->persistent_keepalive_interval) + mod_peer_timer(peer, &peer->timer_persistent_keepalive, + jiffies + peer->persistent_keepalive_interval * HZ); +} + +void wg_timers_init(struct wireguard_peer *peer) +{ + timer_setup(&peer->timer_retransmit_handshake, + expired_retransmit_handshake, 0); + timer_setup(&peer->timer_send_keepalive, expired_send_keepalive, 0); + timer_setup(&peer->timer_new_handshake, expired_new_handshake, 0); + timer_setup(&peer->timer_zero_key_material, expired_zero_key_material, 0); + timer_setup(&peer->timer_persistent_keepalive, + expired_send_persistent_keepalive, 0); + INIT_WORK(&peer->clear_peer_work, queued_expired_zero_key_material); + peer->timer_handshake_attempts = 0; + peer->sent_lastminute_handshake = false; + peer->timer_need_another_keepalive = false; +} + +void wg_timers_stop(struct wireguard_peer *peer) +{ + del_timer_sync(&peer->timer_retransmit_handshake); + del_timer_sync(&peer->timer_send_keepalive); + del_timer_sync(&peer->timer_new_handshake); + del_timer_sync(&peer->timer_zero_key_material); + del_timer_sync(&peer->timer_persistent_keepalive); + flush_work(&peer->clear_peer_work); +} diff --git a/drivers/net/wireguard/timers.h b/drivers/net/wireguard/timers.h new file mode 100644 index 000000000000..eef4248d759c --- /dev/null +++ b/drivers/net/wireguard/timers.h @@ -0,0 +1,31 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (C) 2015-2018 Jason A. Donenfeld . All Rights Reserved. + */ + +#ifndef _WG_TIMERS_H +#define _WG_TIMERS_H + +#include + +struct wireguard_peer; + +void wg_timers_init(struct wireguard_peer *peer); +void wg_timers_stop(struct wireguard_peer *peer); +void wg_timers_data_sent(struct wireguard_peer *peer); +void wg_timers_data_received(struct wireguard_peer *peer); +void wg_timers_any_authenticated_packet_sent(struct wireguard_peer *peer); +void wg_timers_any_authenticated_packet_received(struct wireguard_peer *peer); +void wg_timers_handshake_initiated(struct wireguard_peer *peer); +void wg_timers_handshake_complete(struct wireguard_peer *peer); +void wg_timers_session_derived(struct wireguard_peer *peer); +void wg_timers_any_authenticated_packet_traversal(struct wireguard_peer *peer); + +static inline bool wg_birthdate_has_expired(u64 birthday_nanoseconds, + u64 expiration_seconds) +{ + return (s64)(birthday_nanoseconds + expiration_seconds * NSEC_PER_SEC) + <= (s64)ktime_get_boot_fast_ns(); +} + +#endif /* _WG_TIMERS_H */ diff --git a/drivers/net/wireguard/version.h b/drivers/net/wireguard/version.h new file mode 100644 index 000000000000..327a8d811c1a --- /dev/null +++ b/drivers/net/wireguard/version.h @@ -0,0 +1 @@ +#define WIREGUARD_VERSION "0.0.20181006" diff --git a/include/uapi/linux/wireguard.h b/include/uapi/linux/wireguard.h new file mode 100644 index 000000000000..ab72766a07ff --- /dev/null +++ b/include/uapi/linux/wireguard.h @@ -0,0 +1,190 @@ +/* SPDX-License-Identifier: (GPL-2.0 WITH Linux-syscall-note) OR MIT */ +/* + * Copyright (C) 2015-2018 Jason A. Donenfeld . All Rights Reserved. + * + * Documentation + * ============= + * + * The below enums and macros are for interfacing with WireGuard, using generic + * netlink, with family WG_GENL_NAME and version WG_GENL_VERSION. It defines two + * methods: get and set. Note that while they share many common attributes, + * these two functions actually accept a slightly different set of inputs and + * outputs. 
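+ *
+ * As an illustrative sketch (not part of this header), a client using
+ * libnl-genl might resolve the family before issuing either command:
+ *
+ *     struct nl_sock *sk = nl_socket_alloc();
+ *     genl_connect(sk);
+ *     int family_id = genl_ctrl_resolve(sk, WG_GENL_NAME);
+ *
+ * and then build WG_CMD_GET_DEVICE / WG_CMD_SET_DEVICE messages against
+ * that family id.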
+ *
+ * WG_CMD_GET_DEVICE
+ * -----------------
+ *
+ * May only be called via NLM_F_REQUEST | NLM_F_DUMP. The command should contain
+ * one but not both of:
+ *
+ *    WGDEVICE_A_IFINDEX: NLA_U32
+ *    WGDEVICE_A_IFNAME: NLA_NUL_STRING, maxlen IFNAMSIZ - 1
+ *
+ * The kernel will then return several messages (NLM_F_MULTI) containing the
+ * following tree of nested items:
+ *
+ *    WGDEVICE_A_IFINDEX: NLA_U32
+ *    WGDEVICE_A_IFNAME: NLA_NUL_STRING, maxlen IFNAMSIZ - 1
+ *    WGDEVICE_A_PRIVATE_KEY: len WG_KEY_LEN
+ *    WGDEVICE_A_PUBLIC_KEY: len WG_KEY_LEN
+ *    WGDEVICE_A_LISTEN_PORT: NLA_U16
+ *    WGDEVICE_A_FWMARK: NLA_U32
+ *    WGDEVICE_A_PEERS: NLA_NESTED
+ *        0: NLA_NESTED
+ *            WGPEER_A_PUBLIC_KEY: len WG_KEY_LEN
+ *            WGPEER_A_PRESHARED_KEY: len WG_KEY_LEN
+ *            WGPEER_A_ENDPOINT: struct sockaddr_in or struct sockaddr_in6
+ *            WGPEER_A_PERSISTENT_KEEPALIVE_INTERVAL: NLA_U16
+ *            WGPEER_A_LAST_HANDSHAKE_TIME: struct timespec
+ *            WGPEER_A_RX_BYTES: NLA_U64
+ *            WGPEER_A_TX_BYTES: NLA_U64
+ *            WGPEER_A_ALLOWEDIPS: NLA_NESTED
+ *                0: NLA_NESTED
+ *                    WGALLOWEDIP_A_FAMILY: NLA_U16
+ *                    WGALLOWEDIP_A_IPADDR: struct in_addr or struct in6_addr
+ *                    WGALLOWEDIP_A_CIDR_MASK: NLA_U8
+ *                1: NLA_NESTED
+ *                    ...
+ *                2: NLA_NESTED
+ *                    ...
+ *                ...
+ *            WGPEER_A_PROTOCOL_VERSION: NLA_U32
+ *        1: NLA_NESTED
+ *            ...
+ *        ...
+ *
+ * It is possible that all of the allowed IPs of a single peer will not
+ * fit within a single netlink message. In that case, the same peer will
+ * be written in the following message, except it will only contain
+ * WGPEER_A_PUBLIC_KEY and WGPEER_A_ALLOWEDIPS. This may occur several
+ * times in a row for the same peer. It is then up to the receiver to
+ * coalesce adjacent peers. Likewise, it is possible that all peers will
+ * not fit within a single message. So, subsequent peers will be sent
+ * in following messages, except those will only contain WGDEVICE_A_IFNAME
+ * and WGDEVICE_A_PEERS. It is then up to the receiver to coalesce these
+ * messages to form the complete list of peers.
+ *
+ * Since this is an NLM_F_DUMP command, the final message will always be
+ * NLMSG_DONE, even if an error occurs. However, this NLMSG_DONE message
+ * contains an integer error code. It is either zero or a negative error
+ * code corresponding to the errno.
+ *
+ * WG_CMD_SET_DEVICE
+ * -----------------
+ *
+ * May only be called via NLM_F_REQUEST. The command should contain the
+ * following tree of nested items, containing one but not both of
+ * WGDEVICE_A_IFINDEX and WGDEVICE_A_IFNAME:
+ *
+ *    WGDEVICE_A_IFINDEX: NLA_U32
+ *    WGDEVICE_A_IFNAME: NLA_NUL_STRING, maxlen IFNAMSIZ - 1
+ *    WGDEVICE_A_FLAGS: NLA_U32, 0 or WGDEVICE_F_REPLACE_PEERS if all current
+ *                      peers should be removed prior to adding the list below.
+ *    WGDEVICE_A_PRIVATE_KEY: len WG_KEY_LEN, all zeros to remove
+ *    WGDEVICE_A_LISTEN_PORT: NLA_U16, 0 to choose randomly
+ *    WGDEVICE_A_FWMARK: NLA_U32, 0 to disable
+ *    WGDEVICE_A_PEERS: NLA_NESTED
+ *        0: NLA_NESTED
+ *            WGPEER_A_PUBLIC_KEY: len WG_KEY_LEN
+ *            WGPEER_A_FLAGS: NLA_U32, 0 and/or WGPEER_F_REMOVE_ME if the
+ *                            specified peer should be removed rather than
+ *                            added/updated and/or WGPEER_F_REPLACE_ALLOWEDIPS
+ *                            if all current allowed IPs of this peer should be
+ *                            removed prior to adding the list below.
+ *            WGPEER_A_PRESHARED_KEY: len WG_KEY_LEN, all zeros to remove
+ *            WGPEER_A_ENDPOINT: struct sockaddr_in or struct sockaddr_in6
+ *            WGPEER_A_PERSISTENT_KEEPALIVE_INTERVAL: NLA_U16, 0 to disable
+ *            WGPEER_A_ALLOWEDIPS: NLA_NESTED
+ *                0: NLA_NESTED
+ *                    WGALLOWEDIP_A_FAMILY: NLA_U16
+ *                    WGALLOWEDIP_A_IPADDR: struct in_addr or struct in6_addr
+ *                    WGALLOWEDIP_A_CIDR_MASK: NLA_U8
+ *                1: NLA_NESTED
+ *                    ...
+ *                2: NLA_NESTED
+ *                    ...
+ *                ...
+ *            WGPEER_A_PROTOCOL_VERSION: NLA_U32, should not be set or used at
+ *                                       all by most users of this API, as the
+ *                                       most recent protocol will be used when
+ *                                       this is unset. Otherwise, must be set
+ *                                       to 1.
+ *        1: NLA_NESTED
+ *            ...
+ *        ...
+ *
+ * It is possible that the amount of configuration data exceeds that of
+ * the maximum message length accepted by the kernel. In that case, several
+ * messages should be sent one after another, with each successive one
+ * filling in information not contained in the prior. Note that if
+ * WGDEVICE_F_REPLACE_PEERS is specified in the first message, it probably
+ * should not be specified in fragments that come after, so that the list
+ * of peers is only cleared the first time but appended to afterwards.
+ * Likewise for peers, if WGPEER_F_REPLACE_ALLOWEDIPS is specified in the
+ * first message of a peer, it likely should not be specified in subsequent
+ * fragments.
+ *
+ * If an error occurs, the kernel will reply with an NLMSG_ERROR message
+ * containing an errno.
+ */
+
+#ifndef _WG_UAPI_WIREGUARD_H
+#define _WG_UAPI_WIREGUARD_H
+
+#define WG_GENL_NAME "wireguard"
+#define WG_GENL_VERSION 1
+
+#define WG_KEY_LEN 32
+
+enum wg_cmd {
+	WG_CMD_GET_DEVICE,
+	WG_CMD_SET_DEVICE,
+	__WG_CMD_MAX
+};
+#define WG_CMD_MAX (__WG_CMD_MAX - 1)
+
+enum wgdevice_flag {
+	WGDEVICE_F_REPLACE_PEERS = 1U << 0
+};
+enum wgdevice_attribute {
+	WGDEVICE_A_UNSPEC,
+	WGDEVICE_A_IFINDEX,
+	WGDEVICE_A_IFNAME,
+	WGDEVICE_A_PRIVATE_KEY,
+	WGDEVICE_A_PUBLIC_KEY,
+	WGDEVICE_A_FLAGS,
+	WGDEVICE_A_LISTEN_PORT,
+	WGDEVICE_A_FWMARK,
+	WGDEVICE_A_PEERS,
+	__WGDEVICE_A_LAST
+};
+#define WGDEVICE_A_MAX (__WGDEVICE_A_LAST - 1)
+
+enum wgpeer_flag {
+	WGPEER_F_REMOVE_ME = 1U << 0,
+	WGPEER_F_REPLACE_ALLOWEDIPS = 1U << 1
+};
+enum wgpeer_attribute {
+	WGPEER_A_UNSPEC,
+	WGPEER_A_PUBLIC_KEY,
+	WGPEER_A_PRESHARED_KEY,
+	WGPEER_A_FLAGS,
+	WGPEER_A_ENDPOINT,
+	WGPEER_A_PERSISTENT_KEEPALIVE_INTERVAL,
+	WGPEER_A_LAST_HANDSHAKE_TIME,
+	WGPEER_A_RX_BYTES,
+	WGPEER_A_TX_BYTES,
+	WGPEER_A_ALLOWEDIPS,
+	WGPEER_A_PROTOCOL_VERSION,
+	__WGPEER_A_LAST
+};
+#define WGPEER_A_MAX (__WGPEER_A_LAST - 1)
+
+enum wgallowedip_attribute {
+	WGALLOWEDIP_A_UNSPEC,
+	WGALLOWEDIP_A_FAMILY,
+	WGALLOWEDIP_A_IPADDR,
+	WGALLOWEDIP_A_CIDR_MASK,
+	__WGALLOWEDIP_A_LAST
+};
+#define WGALLOWEDIP_A_MAX (__WGALLOWEDIP_A_LAST - 1)
+
+#endif /* _WG_UAPI_WIREGUARD_H */
diff --git a/tools/testing/selftests/wireguard/netns.sh b/tools/testing/selftests/wireguard/netns.sh
new file mode 100755
index 000000000000..568612c45acc
--- /dev/null
+++ b/tools/testing/selftests/wireguard/netns.sh
@@ -0,0 +1,499 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Copyright (C) 2015-2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+#
+# This script tests the below topology:
+#
+# ┌─────────────────────┐   ┌──────────────────────────────────┐   ┌─────────────────────┐
+# │   $ns1 namespace    │   │           $ns0 namespace         │   │   $ns2 namespace    │
+# │                     │   │                                  │   │                     │
+# │┌────────┐           │   │            ┌────────┐            │   │           ┌────────┐│
+# ││  wg0   │───────────┼───┼────────────│   lo   │────────────┼───┼───────────│  wg0   ││
+# │├────────┴──────────┐│   │    ┌───────┴────────┴────────┐   │   │┌──────────┴────────┤│
+# ││192.168.241.1/24   ││   │    │(ns1)         (ns2)      │   │   ││192.168.241.2/24   ││
+# ││fd00::1/24         ││   │    │127.0.0.1:1   127.0.0.1:2│   │   ││fd00::2/24         ││
+# │└───────────────────┘│   │    │[::]:1        [::]:2     │   │   │└───────────────────┘│
+# └─────────────────────┘   │    └─────────────────────────┘   │   └─────────────────────┘
+#                           └──────────────────────────────────┘
+#
+# After the topology is prepared we run a series of TCP/UDP iperf3 tests between the
+# WireGuard peers in $ns1 and $ns2. Note that $ns0 is the endpoint for the wg0
+# interfaces in $ns1 and $ns2. See https://www.wireguard.com/netns/ for further
+# details on how this is accomplished.
+set -e
+
+exec 3>&1
+export WG_HIDE_KEYS=never
+netns0="wg-test-$$-0"
+netns1="wg-test-$$-1"
+netns2="wg-test-$$-2"
+pretty() { echo -e "\x1b[32m\x1b[1m[+] ${1:+NS$1: }${2}\x1b[0m" >&3; }
+pp() { pretty "" "$*"; "$@"; }
+maybe_exec() { if [[ $BASHPID -eq $$ ]]; then "$@"; else exec "$@"; fi; }
+n0() { pretty 0 "$*"; maybe_exec ip netns exec $netns0 "$@"; }
+n1() { pretty 1 "$*"; maybe_exec ip netns exec $netns1 "$@"; }
+n2() { pretty 2 "$*"; maybe_exec ip netns exec $netns2 "$@"; }
+ip0() { pretty 0 "ip $*"; ip -n $netns0 "$@"; }
+ip1() { pretty 1 "ip $*"; ip -n $netns1 "$@"; }
+ip2() { pretty 2 "ip $*"; ip -n $netns2 "$@"; }
+sleep() { read -t "$1" -N 0 || true; }
+waitiperf() { pretty "${1//*-}" "wait for iperf:5201"; while [[ $(ss -N "$1" -tlp 'sport = 5201') != *iperf3* ]]; do sleep 0.1; done; }
+waitncatudp() { pretty "${1//*-}" "wait for udp:1111"; while [[ $(ss -N "$1" -ulp 'sport = 1111') != *ncat* ]]; do sleep 0.1; done; }
+waitncattcp() { pretty "${1//*-}" "wait for tcp:1111"; while [[ $(ss -N "$1" -tlp 'sport = 1111') != *ncat* ]]; do sleep 0.1; done; }
+waitiface() { pretty "${1//*-}" "wait for $2 to come up"; ip netns exec "$1" bash -c "while [[ \$(< \"/sys/class/net/$2/operstate\") != up ]]; do read -t .1 -N 0 || true; done;"; }
+
+cleanup() {
+	set +e
+	exec 2>/dev/null
+	printf "$orig_message_cost" > /proc/sys/net/core/message_cost
+	ip0 link del dev wg0
+	ip1 link del dev wg0
+	ip2 link del dev wg0
+	local to_kill="$(ip netns pids $netns0) $(ip netns pids $netns1) $(ip netns pids $netns2)"
+	[[ -n $to_kill ]] && kill $to_kill
+	pp ip netns del $netns1
+	pp ip netns del $netns2
+	pp ip netns del $netns0
+	exit
+}
+
+orig_message_cost="$(< /proc/sys/net/core/message_cost)"
+trap cleanup EXIT
+printf 0 > /proc/sys/net/core/message_cost
+
+ip netns del $netns0 2>/dev/null || true
+ip netns del $netns1 2>/dev/null || true
+ip netns del $netns2 2>/dev/null || true
+pp ip netns add $netns0
+pp ip netns add $netns1
+pp ip netns add $netns2
+ip0 link set up dev lo
+
+ip0 link add dev wg0 type wireguard
+ip0 link set wg0 netns $netns1
+ip0 link add dev wg0 type wireguard
+ip0 link set wg0 netns $netns2
+key1="$(pp wg genkey)"
+key2="$(pp wg genkey)"
+pub1="$(pp wg pubkey <<<"$key1")"
+pub2="$(pp wg pubkey <<<"$key2")"
+psk="$(pp wg genpsk)"
+[[ -n $key1 && -n $key2 && -n $psk ]]
+
+configure_peers() {
+	ip1 addr add 192.168.241.1/24 dev wg0
+	ip1 addr add fd00::1/24 dev wg0
+
+	ip2 addr add 192.168.241.2/24 dev wg0
+	ip2 addr add fd00::2/24 dev wg0
+
+	n1 wg set wg0 \
+		private-key <(echo "$key1") \
+		listen-port 1 \
+		peer "$pub2" \
+			preshared-key <(echo "$psk") \
+			allowed-ips 192.168.241.2/32,fd00::2/128
+	n2 wg set wg0 \
+		private-key <(echo "$key2") \
+		listen-port 2 \
+		peer "$pub1" \
+			preshared-key <(echo "$psk") \
+			allowed-ips 192.168.241.1/32,fd00::1/128
+
+	ip1 link set up dev wg0
+	ip2 link set up dev wg0
+}
+configure_peers
+
+tests() {
+	# Ping over IPv4
+	n2 ping -c 10 -f -W 1 192.168.241.1
+	n1 ping -c 10 -f -W 1 192.168.241.2
+
+	# Ping over IPv6
+	n2 ping6 -c 10 -f -W 1 fd00::1
+	n1 ping6 -c 10 -f -W 1 fd00::2
+
+	# TCP over IPv4
+	n2 iperf3 -s -1 -B 192.168.241.2 &
+	waitiperf $netns2
+	n1 iperf3 -Z -t 3 -c 192.168.241.2
+
+	# TCP over IPv6
+	n1 iperf3 -s -1 -B fd00::1 &
+	waitiperf $netns1
+	n2 iperf3 -Z -t 3 -c fd00::1
+
+	# UDP over IPv4
+	n1 iperf3 -s -1 -B 192.168.241.1 &
+	waitiperf $netns1
+	n2 iperf3 -Z -t 3 -b 0 -u -c 192.168.241.1
+
+	# UDP over IPv6
+	n2 iperf3 -s -1 -B fd00::2 &
+	waitiperf $netns2
+	n1 iperf3 -Z -t 3 -b 0 -u -c fd00::2
+}
+
+[[ $(ip1 link show dev wg0) =~ mtu\ ([0-9]+) ]] && orig_mtu="${BASH_REMATCH[1]}"
+big_mtu=$(( 34816 - 1500 + $orig_mtu ))
+
+# Test using IPv4 as outer transport
+n1 wg set wg0 peer "$pub2" endpoint 127.0.0.1:2
+n2 wg set wg0 peer "$pub1" endpoint 127.0.0.1:1
+# Before calling tests, we first make sure that the stats counters are working
+n2 ping -c 10 -f -W 1 192.168.241.1
+{ read _; read _; read _; read rx_bytes _; read _; read tx_bytes _; } < <(ip2 -stats link show dev wg0)
+(( rx_bytes == 1372 && (tx_bytes == 1428 || tx_bytes == 1460) ))
+{ read _; read _; read _; read rx_bytes _; read _; read tx_bytes _; } < <(ip1 -stats link show dev wg0)
+(( tx_bytes == 1372 && (rx_bytes == 1428 || rx_bytes == 1460) ))
+read _ rx_bytes tx_bytes < <(n2 wg show wg0 transfer)
+(( rx_bytes == 1372 && (tx_bytes == 1428 || tx_bytes == 1460) ))
+read _ rx_bytes tx_bytes < <(n1 wg show wg0 transfer)
+(( tx_bytes == 1372 && (rx_bytes == 1428 || rx_bytes == 1460) ))
+
+tests
+ip1 link set wg0 mtu $big_mtu
+ip2 link set wg0 mtu $big_mtu
+tests
+
+ip1 link set wg0 mtu $orig_mtu
+ip2 link set wg0 mtu $orig_mtu
+
+# Test using IPv6 as outer transport
+n1 wg set wg0 peer "$pub2" endpoint [::1]:2
+n2 wg set wg0 peer "$pub1" endpoint [::1]:1
+tests
+ip1 link set wg0 mtu $big_mtu
+ip2 link set wg0 mtu $big_mtu
+tests
+
+# Test that route MTUs work with the padding
+ip1 link set wg0 mtu 1300
+ip2 link set wg0 mtu 1300
+n1 wg set wg0 peer "$pub2" endpoint 127.0.0.1:2
+n2 wg set wg0 peer "$pub1" endpoint 127.0.0.1:1
+n0 iptables -A INPUT -m length --length 1360 -j DROP
+n1 ip route add 192.168.241.2/32 dev wg0 mtu 1299
+n2 ip route add 192.168.241.1/32 dev wg0 mtu 1299
+n2 ping -c 1 -W 1 -s 1269 192.168.241.1
+n2 ip route delete 192.168.241.1/32 dev wg0 mtu 1299
+n1 ip route delete 192.168.241.2/32 dev wg0 mtu 1299
+n0 iptables -F INPUT
+
+ip1 link set wg0 mtu $orig_mtu
+ip2 link set wg0 mtu $orig_mtu
+
+# Test using IPv4 that roaming works
+ip0 -4 addr del 127.0.0.1/8 dev lo
+ip0 -4 addr add 127.212.121.99/8 dev lo
+n1 wg set wg0 listen-port 9999
+n1 wg set wg0 peer "$pub2" endpoint 127.0.0.1:2
+n1 ping6 -W 1 -c 1 fd00::2
+[[ $(n2 wg show wg0 endpoints) == "$pub1 127.212.121.99:9999" ]]
+
+# Test using IPv6 that roaming works
+n1 wg set wg0 listen-port 9998
+n1 wg set wg0 peer "$pub2" endpoint [::1]:2
+n1 ping -W 1 -c 1 192.168.241.2
+[[ $(n2 wg show wg0 endpoints) == "$pub1 [::1]:9998" ]]
+
+# Test that the crypto-RP filter (the cryptokey-routing analog of rp_filter) works
+n1 wg set wg0 peer "$pub2" allowed-ips 192.168.241.0/24
+exec 4< <(n1 ncat -l -u -p 1111)
+nmap_pid=$!
+waitncatudp $netns1
+n2 ncat -u 192.168.241.1 1111 <<<"X"
+read -r -N 1 -t 1 out <&4 && [[ $out == "X" ]]
+kill $nmap_pid
+more_specific_key="$(pp wg genkey | pp wg pubkey)"
+n1 wg set wg0 peer "$more_specific_key" allowed-ips 192.168.241.2/32
+n2 wg set wg0 listen-port 9997
+exec 4< <(n1 ncat -l -u -p 1111)
+nmap_pid=$!
+waitncatudp $netns1
+n2 ncat -u 192.168.241.1 1111 <<<"X"
+! read -r -N 1 -t 1 out <&4 || false
+kill $nmap_pid
+n1 wg set wg0 peer "$more_specific_key" remove
+[[ $(n1 wg show wg0 endpoints) == "$pub2 [::1]:9997" ]]
+
+ip1 link del wg0
+ip2 link del wg0
+
+# Test using NAT. We now change the topology to this:
+# ┌───────────────────────────────────────┐    ┌─────────────────────────────────────────────┐     ┌────────────────────────────────────────┐
+# │            $ns1 namespace             │    │               $ns0 namespace                │     │             $ns2 namespace             │
+# │                                       │    │                                             │     │                                        │
+# │ ┌─────┐             ┌─────┐           │    │    ┌──────┐           ┌──────┐              │     │  ┌─────┐            ┌─────┐            │
+# │ │ wg0 │─────────────│vethc│───────────┼────┼────│vethrc│           │vethrs│──────────────┼─────┼──│veths│────────────│ wg0 │            │
+# │ ├─────┴──────────┐  ├─────┴──────────┐│    │    ├──────┴─────────┐ ├──────┴────────────┐ │     │  ├─────┴──────────┐ ├─────┴──────────┐ │
+# │ │192.168.241.1/24│  │192.168.1.100/24││    │    │192.168.1.100/24│ │10.0.0.1/24        │ │     │  │10.0.0.100/24   │ │192.168.241.2/24│ │
+# │ │fd00::1/24      │  │                ││    │    │                │ │SNAT:192.168.1.0/24│ │     │  │                │ │fd00::2/24      │ │
+# │ └────────────────┘  └────────────────┘│    │    └────────────────┘ └───────────────────┘ │     │  └────────────────┘ └────────────────┘ │
+# └───────────────────────────────────────┘    └─────────────────────────────────────────────┘     └────────────────────────────────────────┘
+
+ip1 link add dev wg0 type wireguard
+ip2 link add dev wg0 type wireguard
+configure_peers
+
+ip0 link add vethrc type veth peer name vethc
+ip0 link add vethrs type veth peer name veths
+ip0 link set vethc netns $netns1
+ip0 link set veths netns $netns2
+ip0 link set vethrc up
+ip0 link set vethrs up
+ip0 addr add 192.168.1.1/24 dev vethrc
+ip0 addr add 10.0.0.1/24 dev vethrs
+ip1 addr add 192.168.1.100/24 dev vethc
+ip1 link set vethc up
+ip1 route add default via 192.168.1.1
+ip2 addr add 10.0.0.100/24 dev veths
+ip2 link set veths up
+waitiface $netns0 vethrc
+waitiface $netns0 vethrs
+waitiface $netns1 vethc
+waitiface $netns2 veths
+
+n0 bash -c 'printf 1 > /proc/sys/net/ipv4/ip_forward'
+n0 bash -c 'printf 2 > /proc/sys/net/netfilter/nf_conntrack_udp_timeout'
+n0 bash -c 'printf 2 > /proc/sys/net/netfilter/nf_conntrack_udp_timeout_stream'
+n0 iptables -t nat -A POSTROUTING -s 192.168.1.0/24 -d 10.0.0.0/24 -j SNAT --to 10.0.0.1
+
+n1 wg set wg0 peer "$pub2" endpoint 10.0.0.100:2 persistent-keepalive 1
+n1 ping -W 1 -c 1 192.168.241.2
+n2 ping -W 1 -c 1 192.168.241.1
+[[ $(n2 wg show wg0 endpoints) == "$pub1 10.0.0.1:1" ]]
+# Demonstrate n2 can still send packets to n1, since persistent-keepalive will prevent the connection tracking entry from expiring (to see entries: `n0 conntrack -L`).
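+# The conntrack UDP timeouts were set to 2 seconds above, so if the 1-second
+# persistent keepalives were not refreshing the NAT mapping, sleeping for 3
+# seconds here would be more than enough for the entry to expire.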
+pp sleep 3
+n2 ping -W 1 -c 1 192.168.241.1
+
+n0 iptables -t nat -F
+ip0 link del vethrc
+ip0 link del vethrs
+ip1 link del wg0
+ip2 link del wg0
+
+# Test that saddr routing is sticky but not too sticky, changing to this topology:
+# ┌───────────────────────────────────────┐    ┌────────────────────────────────────────┐
+# │            $ns1 namespace             │    │             $ns2 namespace             │
+# │                                       │    │                                        │
+# │ ┌─────┐             ┌─────┐           │    │  ┌─────┐            ┌─────┐            │
+# │ │ wg0 │─────────────│veth1│───────────┼────┼──│veth2│────────────│ wg0 │            │
+# │ ├─────┴──────────┐  ├─────┴──────────┐│    │  ├─────┴──────────┐ ├─────┴──────────┐ │
+# │ │192.168.241.1/24│  │10.0.0.1/24     ││    │  │10.0.0.2/24     │ │192.168.241.2/24│ │
+# │ │fd00::1/24      │  │fd00:aa::1/96   ││    │  │fd00:aa::2/96   │ │fd00::2/24      │ │
+# │ └────────────────┘  └────────────────┘│    │  └────────────────┘ └────────────────┘ │
+# └───────────────────────────────────────┘    └────────────────────────────────────────┘
+
+ip1 link add dev wg0 type wireguard
+ip2 link add dev wg0 type wireguard
+configure_peers
+ip1 link add veth1 type veth peer name veth2
+ip1 link set veth2 netns $netns2
+n1 bash -c 'printf 0 > /proc/sys/net/ipv6/conf/all/accept_dad'
+n2 bash -c 'printf 0 > /proc/sys/net/ipv6/conf/all/accept_dad'
+n1 bash -c 'printf 0 > /proc/sys/net/ipv6/conf/veth1/accept_dad'
+n2 bash -c 'printf 0 > /proc/sys/net/ipv6/conf/veth2/accept_dad'
+n1 bash -c 'printf 1 > /proc/sys/net/ipv4/conf/veth1/promote_secondaries'
+
+# First we check that we aren't overly sticky and can fall over to new IPs when old ones are removed
+ip1 addr add 10.0.0.1/24 dev veth1
+ip1 addr add fd00:aa::1/96 dev veth1
+ip2 addr add 10.0.0.2/24 dev veth2
+ip2 addr add fd00:aa::2/96 dev veth2
+ip1 link set veth1 up
+ip2 link set veth2 up
+waitiface $netns1 veth1
+waitiface $netns2 veth2
+n1 wg set wg0 peer "$pub2" endpoint 10.0.0.2:2
+n1 ping -W 1 -c 1 192.168.241.2
+ip1 addr add 10.0.0.10/24 dev veth1
+ip1 addr del 10.0.0.1/24 dev veth1
+n1 ping -W 1 -c 1 192.168.241.2
+n1 wg set wg0 peer "$pub2" endpoint [fd00:aa::2]:2
+n1 ping -W 1 -c 1 192.168.241.2
+ip1 addr add fd00:aa::10/96 dev veth1
+ip1 addr del fd00:aa::1/96 dev veth1
+n1 ping -W 1 -c 1 192.168.241.2
+
+# Now we show that we can successfully do reply-to-sender routing
+ip1 link set veth1 down
+ip2 link set veth2 down
+ip1 addr flush dev veth1
+ip2 addr flush dev veth2
+ip1 addr add 10.0.0.1/24 dev veth1
+ip1 addr add 10.0.0.2/24 dev veth1
+ip1 addr add fd00:aa::1/96 dev veth1
+ip1 addr add fd00:aa::2/96 dev veth1
+ip2 addr add 10.0.0.3/24 dev veth2
+ip2 addr add fd00:aa::3/96 dev veth2
+ip1 link set veth1 up
+ip2 link set veth2 up
+waitiface $netns1 veth1
+waitiface $netns2 veth2
+n2 wg set wg0 peer "$pub1" endpoint 10.0.0.1:1
+n2 ping -W 1 -c 1 192.168.241.1
+[[ $(n2 wg show wg0 endpoints) == "$pub1 10.0.0.1:1" ]]
+n2 wg set wg0 peer "$pub1" endpoint [fd00:aa::1]:1
+n2 ping -W 1 -c 1 192.168.241.1
+[[ $(n2 wg show wg0 endpoints) == "$pub1 [fd00:aa::1]:1" ]]
+n2 wg set wg0 peer "$pub1" endpoint 10.0.0.2:1
+n2 ping -W 1 -c 1 192.168.241.1
+[[ $(n2 wg show wg0 endpoints) == "$pub1 10.0.0.2:1" ]]
+n2 wg set wg0 peer "$pub1" endpoint [fd00:aa::2]:1
+n2 ping -W 1 -c 1 192.168.241.1
+[[ $(n2 wg show wg0 endpoints) == "$pub1 [fd00:aa::2]:1" ]]
+
+# What happens if the inbound destination address belongs to a different interface than the default route?
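+# The expected behavior is that replies still originate from the address the
+# packet was sent to (10.50.0.1 on dummy0 below), rather than from whatever
+# address the default route would pick, which is what the endpoint check at
+# the end of this block asserts.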
+ip1 link add dummy0 type dummy
+ip1 addr add 10.50.0.1/24 dev dummy0
+ip1 link set dummy0 up
+ip2 route add 10.50.0.0/24 dev veth2
+n2 wg set wg0 peer "$pub1" endpoint 10.50.0.1:1
+n2 ping -W 1 -c 1 192.168.241.1
+[[ $(n2 wg show wg0 endpoints) == "$pub1 10.50.0.1:1" ]]
+
+ip1 link del dummy0
+ip1 addr flush dev veth1
+ip2 addr flush dev veth2
+ip1 route flush dev veth1
+ip2 route flush dev veth2
+
+# Now we see what happens if another interface route takes precedence over an ongoing one
+ip1 link add veth3 type veth peer name veth4
+ip1 link set veth4 netns $netns2
+ip1 addr add 10.0.0.1/24 dev veth1
+ip2 addr add 10.0.0.2/24 dev veth2
+ip1 addr add 10.0.0.3/24 dev veth3
+ip1 link set veth1 up
+ip2 link set veth2 up
+ip1 link set veth3 up
+ip2 link set veth4 up
+waitiface $netns1 veth1
+waitiface $netns2 veth2
+waitiface $netns1 veth3
+waitiface $netns2 veth4
+ip1 route flush dev veth1
+ip1 route flush dev veth3
+ip1 route add 10.0.0.0/24 dev veth1 src 10.0.0.1 metric 2
+n1 wg set wg0 peer "$pub2" endpoint 10.0.0.2:2
+n1 ping -W 1 -c 1 192.168.241.2
+[[ $(n2 wg show wg0 endpoints) == "$pub1 10.0.0.1:1" ]]
+ip1 route add 10.0.0.0/24 dev veth3 src 10.0.0.3 metric 1
+n1 bash -c 'printf 0 > /proc/sys/net/ipv4/conf/veth1/rp_filter'
+n2 bash -c 'printf 0 > /proc/sys/net/ipv4/conf/veth4/rp_filter'
+n1 bash -c 'printf 0 > /proc/sys/net/ipv4/conf/all/rp_filter'
+n2 bash -c 'printf 0 > /proc/sys/net/ipv4/conf/all/rp_filter'
+n1 ping -W 1 -c 1 192.168.241.2
+[[ $(n2 wg show wg0 endpoints) == "$pub1 10.0.0.3:1" ]]
+
+ip1 link del veth1
+ip1 link del veth3
+ip1 link del wg0
+ip2 link del wg0
+
+# We test that Netlink/IPC is working properly by doing things that usually cause split responses
+ip0 link add dev wg0 type wireguard
+config=( "[Interface]" "PrivateKey=$(wg genkey)" "[Peer]" "PublicKey=$(wg genkey)" )
+for a in {1..255}; do
+	for b in {0..255}; do
+		config+=( "AllowedIPs=$a.$b.0.0/16,$a::$b/128" )
+	done
+done
+n0 wg setconf wg0 <(printf '%s\n' "${config[@]}")
+i=0
+for ip in $(n0 wg show wg0 allowed-ips); do
+	((++i))
+done
+((i == 255*256*2+1))
+ip0 link del wg0
+ip0 link add dev wg0 type wireguard
+config=( "[Interface]" "PrivateKey=$(wg genkey)" )
+for a in {1..40}; do
+	config+=( "[Peer]" "PublicKey=$(wg genkey)" )
+	for b in {1..52}; do
+		config+=( "AllowedIPs=$a.$b.0.0/16" )
+	done
+done
+n0 wg setconf wg0 <(printf '%s\n' "${config[@]}")
+i=0
+while read -r line; do
+	j=0
+	for ip in $line; do
+		((++j))
+	done
+	((j == 53))
+	((++i))
+done < <(n0 wg show wg0 allowed-ips)
+((i == 40))
+ip0 link del wg0
+ip0 link add wg0 type wireguard
+config=( )
+for i in {1..29}; do
+	config+=( "[Peer]" "PublicKey=$(wg genkey)" )
+done
+config+=( "[Peer]" "PublicKey=$(wg genkey)" "AllowedIPs=255.2.3.4/32,abcd::255/128" )
+n0 wg setconf wg0 <(printf '%s\n' "${config[@]}")
+n0 wg showconf wg0 > /dev/null
+ip0 link del wg0
+
+allowedips=( )
+for i in {1..197}; do
+	allowedips+=( abcd::$i )
+done
+saved_ifs="$IFS"
+IFS=,
+allowedips="${allowedips[*]}"
+IFS="$saved_ifs"
+ip0 link add wg0 type wireguard
+n0 wg set wg0 peer "$pub1"
+n0 wg set wg0 peer "$pub2" allowed-ips "$allowedips"
+{
+	read -r pub allowedips
+	[[ $pub == "$pub1" && $allowedips == "(none)" ]]
+	read -r pub allowedips
+	[[ $pub == "$pub2" ]]
+	i=0
+	for _ in $allowedips; do
+		((++i))
+	done
+	((i == 197))
+} < <(n0 wg show wg0 allowed-ips)
+ip0 link del wg0
+
+! n0 wg show doesnotexist || false
+
+ip0 link add wg0 type wireguard
+n0 wg set wg0 private-key <(echo "$key1") peer "$pub2" preshared-key <(echo "$psk")
+[[ $(n0 wg show wg0 private-key) == "$key1" ]]
+[[ $(n0 wg show wg0 preshared-keys) == "$pub2 $psk" ]]
+n0 wg set wg0 private-key /dev/null peer "$pub2" preshared-key /dev/null
+[[ $(n0 wg show wg0 private-key) == "(none)" ]]
+[[ $(n0 wg show wg0 preshared-keys) == "$pub2 (none)" ]]
+n0 wg set wg0 peer "$pub2"
+n0 wg set wg0 private-key <(echo "$key2")
+[[ $(n0 wg show wg0 public-key) == "$pub2" ]]
+[[ -z $(n0 wg show wg0 peers) ]]
+n0 wg set wg0 peer "$pub2"
+[[ -z $(n0 wg show wg0 peers) ]]
+n0 wg set wg0 private-key <(echo "$key1")
+n0 wg set wg0 peer "$pub2"
+[[ $(n0 wg show wg0 peers) == "$pub2" ]]
+ip0 link del wg0
+
+declare -A objects
+while read -t 0.1 -r line 2>/dev/null || [[ $? -ne 142 ]]; do
+	[[ $line =~ .*(wg[0-9]+:\ [A-Z][a-z]+\ [0-9]+)\ .*(created|destroyed).* ]] || continue
+	objects["${BASH_REMATCH[1]}"]+="${BASH_REMATCH[2]}"
+done < /dev/kmsg
+alldeleted=1
+for object in "${!objects[@]}"; do
+	if [[ ${objects["$object"]} != *createddestroyed ]]; then
+		echo "Error: $object: merely ${objects["$object"]}" >&3
+		alldeleted=0
+	fi
+done
+[[ $alldeleted -eq 1 ]]
+pretty "" "Objects that were created were also destroyed."