[net-next,v6,00/23] WireGuard: Secure Network Tunnel

Message ID	20180925145622.29959-1-Jason@zx2c4.com
Headers	show Delivered-To: patch@linaro.org Received-SPF: pass (google.com: best guess record for domain of linux-crypto-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; From: "Jason A. Donenfeld" <Jason@zx2c4.com> To: linux-kernel@vger.kernel.org, netdev@vger.kernel.org, linux-crypto@vger.kernel.org, davem@davemloft.net, gregkh@linuxfoundation.org Cc: "Jason A. Donenfeld" <Jason@zx2c4.com> Subject: [PATCH net-next v6 00/23] WireGuard: Secure Network Tunnel Date: Tue, 25 Sep 2018 16:55:59 +0200 Message-Id: <20180925145622.29959-1-Jason@zx2c4.com> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: linux-crypto-owner@vger.kernel.org Precedence: bulk
Series	WireGuard: Secure Network Tunnel \| expand [net-next,v6,00/23] WireGuard: Secure Network Tunnel [net-next,v6,03/23] zinc: ChaCha20 generic C implementation and selftest [net-next,v6,04/23] zinc: ChaCha20 x86_64 implementation [net-next,v6,05/23] zinc: import Andy Polyakov's ChaCha20 ARM and ARM64 implementations [net-next,v6,10/23] zinc: Poly1305 x86_64 implementation [net-next,v6,13/23] zinc: Poly1305 MIPS32r2 and MIPS64 implementations [net-next,v6,14/23] zinc: ChaCha20Poly1305 construction and selftest [net-next,v6,15/23] zinc: BLAKE2s generic C implementation and selftest [net-next,v6,20/23] crypto: port Poly1305 to Zinc

Jason A. Donenfeld Sept. 25, 2018, 2:55 p.m. UTC

Changes v5->v6, along with who suggested it.
--------------------------------------------
  - [Eric Biggers] Cleaner and more efficient blake2s_final function.
  - [Eric Biggers] Remove modulo trick that was working around a gcc 8.1 bug.
  - [Eric Biggers] Use crypto_xor_cpy to avoid a memmove in chacha20.
  - In the unlikely condition that we transition from SIMD back to scalar
    instructions for Poly1305, it's important that we also convert back
    from base 2^26 to 2^64. This is super rare and weird, and it's hard
    to imagine somebody actually doing this for whatever bad reason, but
    it is a possibility, so we shouldn't calculate things wrong in that
    instance.
  - [Andrew Lunn] Change BUG_ON to WARN_ON in choice places.
  - [Thomas Gleixner] Explicitly dual license files under X where X≠GPL2
    to be under X && GPL2.
  - [Andy Lutomirski] Use separate symbols for selftests that are of
    variable length, so that we neither waste space on disk nor waste space
    in memory (via __initconst).
  - [Thomas Gleixner] Make SPDX headers a standalone comment, and use the
    // commenting style in .h files, per the kernel style guide.
  - [Paul Burton] Allow assembler to fill branch delay slots on MIPS32r2,
    so that the implementation is more future-proof.
  - [Eric Biggers] Use Eric's scalar implementation for ChaCha20 ARM, which
    performs very well for Cortex-A7,5 and ARMv6, while using Andy
    Polyakov's NEON implementation for other ARMv7.
  - [Ard Biesheuvel] Reduce optimization from -O3 to -O2.
  - [Arn?d Bi?e(rgmann|shuvel)] Make sure stack frames stay inside 1024
    bytes on 32-bit and 1280 bytes on 64-bit.
  - [Ard Biesheuvel] CRYPTOGAMS-related commit messages now have the
    corresponding OpenSSL commit written in them.
  - [Eric Biggers] Keep HChaCha20 key in native endian, since in some
    cases it can trivially be passed into the ChaCha20 block later.
  - Make constants follow a consistent naming scheme.
  - Workaround gcc 8 stack mis-optimization on m68k.
  - [Arnd Bergmann] Workaround gcc 8 stack mis-optimization with KASAN.

Many of the above ARM changes are a result of discussions between Ard,
Eric, AndyL, AndyP, Samuel, and me. These discussions are ongoing, and so
it's possible we'll do another revision with further ARM tweaks. But perhaps
this one will be sufficient for merging now, and we can continue to refine
later in the cycle.

-----------------------------------------------------------

This patchset is available on git.kernel.org in this branch, where it may be
pulled directly for inclusion into net-next:

  * https://git.kernel.org/pub/scm/linux/kernel/git/zx2c4/linux.git/log/?h=jd/wireguard

-----------------------------------------------------------

WireGuard is a secure network tunnel written especially for Linux, which
has faced around three years of serious development, deployment, and
scrutiny. It delivers excellent performance and is extremely easy to
use and configure. It has been designed with the primary goal of being
both easy to audit by virtue of being small and highly secure from a
cryptography and systems security perspective. WireGuard is used by some
massive companies pushing enormous amounts of traffic, and likely
already today you've consumed bytes that at some point transited through
a WireGuard tunnel. Even as an out-of-tree module, WireGuard has been
integrated into various userspace tools, Linux distributions, mobile
phones, and data centers. There are ports in several languages to
several operating systems, and even commercial hardware and services
sold integrating WireGuard. It is time, therefore, for WireGuard to be
properly integrated into Linux.

Ample information, including documentation, installation instructions,
and project details, is available at:

  * https://www.wireguard.com/
  * https://www.wireguard.com/papers/wireguard.pdf

As it is currently an out-of-tree module, it lives in its own git repo
and has its own mailing list, and every commit for the module is tested
against every stable kernel since 3.10 on a variety of architectures
using an extensive test suite:

  * https://git.zx2c4.com/WireGuard
    https://git.kernel.org/pub/scm/linux/kernel/git/zx2c4/WireGuard.git/
  * https://lists.zx2c4.com/mailman/listinfo/wireguard
  * https://www.wireguard.com/build-status/

The project has been broadly discussed at conferences, and was presented
to the Netdev developers in Seoul last November, where a paper was
released detailing some interesting aspects of the project. Dave asked
me after the talk if I would consider sending in a v1 "sooner rather
than later", hence this patchset. A decision is still waiting from the
Linux Plumbers Conference, but an update on these topics may be presented
in Vancouver in a few months. Prior presentations:

  * https://www.wireguard.com/presentations/
  * https://www.wireguard.com/papers/wireguard-netdev22.pdf

The cryptography in the protocol itself has been formally verified by
several independent academic teams with positive results, and I know of
two additional efforts on their way to further corroborate those
findings. The version 1 protocol is "complete", and so the purpose of
this review is to assess the implementation of the protocol. However, it
still may be of interest to know that the thing you're reviewing uses a
protocol with various nice security properties:

  * https://www.wireguard.com/formal-verification/

This patchset is divided into four segments. The first introduces a very
simple helper for working with the FPU state for the purposes of amortizing
SIMD operations. The second segment is a small collection of cryptographic
primitives, split up into several commits by primitive and by hardware. The
third shows usage of Zinc within the existing crypto API and as a replacement
to the existing crypto API. The last is WireGuard itself, presented as an
unintrusive and self-contained virtual network driver.

It is intended that this entire patch series enter the kernel through
DaveM's net-next tree. Subsequently, WireGuard patches will go through
DaveM's net-next tree, while Zinc patches will go through Greg KH's tree.

Enjoy,
Jason

Jason A. Donenfeld Sept. 25, 2018, 7:43 p.m. UTC | #1

On Tue, Sep 25, 2018 at 8:33 PM Joe Perches <joe@perches.com> wrote:
> I think the -Dpr_fmt is especially odd and not

> really acceptable as it not used anywhere else

> in the kernel.

There are about 2000 cases in the kernel of the same '#define
pr_fmt...' being pasted into the top of the file, which seems a bit
cumbersome. Rather than having to paste that into each and every file
that I pr_err from, why can't I just do this from the makefile, since
I want that same pr_fmt to copy the whole directory?

Andy Lutomirski Sept. 25, 2018, 8 p.m. UTC | #2

> On Sep 25, 2018, at 12:43 PM, Jason A. Donenfeld <Jason@zx2c4.com> wrote:

> 

>> On Tue, Sep 25, 2018 at 8:33 PM Joe Perches <joe@perches.com> wrote:

>> I think the -Dpr_fmt is especially odd and not

>> really acceptable as it not used anywhere else

>> in the kernel.

> 

> There are about 2000 cases in the kernel of the same '#define

> pr_fmt...' being pasted into the top of the file, which seems a bit

> cumbersome. Rather than having to paste that into each and every file

> that I pr_err from, why can't I just do this from the makefile, since

> I want that same pr_fmt to copy the whole directory?


Because people like to be able to just read the C file to figure out what it does. Or we could adopt the Makefile approach kernel-wide, since this use of it seems reasonable.

Jason A. Donenfeld Sept. 25, 2018, 8:02 p.m. UTC | #3

On Tue, Sep 25, 2018 at 10:00 PM Andy Lutomirski <luto@amacapital.net> wrote:
> > On Sep 25, 2018, at 12:43 PM, Jason A. Donenfeld <Jason@zx2c4.com> wrote:

> >

> >> On Tue, Sep 25, 2018 at 8:33 PM Joe Perches <joe@perches.com> wrote:

> >> I think the -Dpr_fmt is especially odd and not

> >> really acceptable as it not used anywhere else

> >> in the kernel.

> >

> > There are about 2000 cases in the kernel of the same '#define

> > pr_fmt...' being pasted into the top of the file, which seems a bit

> > cumbersome. Rather than having to paste that into each and every file

> > that I pr_err from, why can't I just do this from the makefile, since

> > I want that same pr_fmt to copy the whole directory?

>

> Because people like to be able to just read the C file to figure out what it does. Or we could adopt the Makefile approach kernel-wide, since this use of it seems reasonable.


It would indeed be nice to see this done tree-wide. I can recall
various random bugs over the year where some dmesg messages are
missing a prefix because the author forgot to copy and paste the thing
to /yet another file/ in the same directory.

Jason A. Donenfeld Sept. 25, 2018, 8:54 p.m. UTC | #4

Hi Joe,

On Tue, Sep 25, 2018 at 10:21 PM Joe Perches <joe@perches.com> wrote:
> I'd still prefer as all these are effectively

> debugging output that you convert the pr_info

> uses pr_debug and so avoid using pr_fmt altogether.

I started to write back, "sure no problem, this will be in v7," but
then I gave it a closer look, and I think actually I'll be changing
these into pr_err/pr_info and not naming this "DEBUG" but rather
"SELFTEST", since I think these are actually very good to have beyond
mere debugging.

Jason

Joe Perches Sept. 25, 2018, 9:02 p.m. UTC | #5

On Tue, 2018-09-25 at 22:54 +0200, Jason A. Donenfeld wrote:
> Hi Joe,

> 

> On Tue, Sep 25, 2018 at 10:21 PM Joe Perches <joe@perches.com> wrote:

> > I'd still prefer as all these are effectively

> > debugging output that you convert the pr_info

> > uses pr_debug and so avoid using pr_fmt altogether.

> 

> I started to write back, "sure no problem, this will be in v7," but

> then I gave it a closer look, and I think actually I'll be changing

> these into pr_err/pr_info and not naming this "DEBUG" but rather

> "SELFTEST", since I think these are actually very good to have beyond

> mere debugging.


Then you should change the CONFIG_ZINC_DEBUG option too.

Dunno what you should do with the

#ifdef DEBUG
	BUG_ON(...);
#endif

uses though.

Ard Biesheuvel Sept. 26, 2018, 8:59 a.m. UTC | #6

On Tue, 25 Sep 2018 at 17:00, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>

> These wire Andy Polyakov's implementations up to the kernel for ARMv7,8

> NEON, and introduce Eric Biggers' ultra-fast scalar implementation for

> CPUs without NEON or for CPUs with slow NEON (Cortex-A5,7).

>

> This commit does the following:

>   - Adds the glue code for the assembly implementations.

>   - Renames the ARMv8 code into place, since it can at this point be

>     used wholesale.

>   - Merges Andy Polyakov's ARMv7 NEON code with Eric Biggers' <=ARMv7

>     scalar code.

>

> Commit note: Eric Biggers' scalar code is brand new, and quite possibly

> prematurely added to this commit, and so it may require a bit of revision.

>

> This commit delivers approximately the same or much better performance than

> the existing crypto API's code and has been measured to do as such on:

>

>   - ARM1176JZF-S [ARMv6]

>   - Cortex-A7    [ARMv7]

>   - Cortex-A8    [ARMv7]

>   - Cortex-A9    [ARMv7]

>   - Cortex-A17   [ARMv7]

>   - Cortex-A53   [ARMv8]

>   - Cortex-A55   [ARMv8]

>   - Cortex-A73   [ARMv8]

>   - Cortex-A75   [ARMv8]

>

> Interestingly, Andy Polyakov's scalar code is slower than Eric Biggers',

> but is also significantly shorter. This has the advantage that it does

> not evict other code from L1 cache -- particularly on ARM11 chips -- and

> so in certain circumstances it can actually be faster. However, it wasn't

> found that this had an affect on any code existing in the kernel today.

>

> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>

> Co-authored-by: Eric Biggers <ebiggers@google.com>

> Cc: Samuel Neves <sneves@dei.uc.pt>

> Cc: Andy Lutomirski <luto@kernel.org>

> Cc: Greg KH <gregkh@linuxfoundation.org>

> Cc: Jean-Philippe Aumasson <jeanphilippe.aumasson@gmail.com>

> Cc: Russell King <linux@armlinux.org.uk>

> Cc: linux-arm-kernel@lists.infradead.org

> ---

>  lib/zinc/Makefile                             |   2 +

>  lib/zinc/chacha20/chacha20-arm-glue.h         |  88 +++

>  ...acha20-arm-cryptogams.S => chacha20-arm.S} | 502 ++++++++++++++++--

>  ...20-arm64-cryptogams.S => chacha20-arm64.S} |   0

>  lib/zinc/chacha20/chacha20.c                  |   2 +

>  5 files changed, 556 insertions(+), 38 deletions(-)

>  create mode 100644 lib/zinc/chacha20/chacha20-arm-glue.h

>  rename lib/zinc/chacha20/{chacha20-arm-cryptogams.S => chacha20-arm.S} (71%)

>  rename lib/zinc/chacha20/{chacha20-arm64-cryptogams.S => chacha20-arm64.S} (100%)

>

> diff --git a/lib/zinc/Makefile b/lib/zinc/Makefile

> index 223a0816c918..e47f64e12bbd 100644

> --- a/lib/zinc/Makefile

> +++ b/lib/zinc/Makefile

> @@ -4,4 +4,6 @@ ccflags-$(CONFIG_ZINC_DEBUG) += -DDEBUG

>

>  zinc_chacha20-y := chacha20/chacha20.o

>  zinc_chacha20-$(CONFIG_ZINC_ARCH_X86_64) += chacha20/chacha20-x86_64.o

> +zinc_chacha20-$(CONFIG_ZINC_ARCH_ARM) += chacha20/chacha20-arm.o

> +zinc_chacha20-$(CONFIG_ZINC_ARCH_ARM64) += chacha20/chacha20-arm64.o

>  obj-$(CONFIG_ZINC_CHACHA20) += zinc_chacha20.o

> diff --git a/lib/zinc/chacha20/chacha20-arm-glue.h b/lib/zinc/chacha20/chacha20-arm-glue.h

> new file mode 100644

> index 000000000000..86cce851ed02

> --- /dev/null

> +++ b/lib/zinc/chacha20/chacha20-arm-glue.h

> @@ -0,0 +1,88 @@

> +/* SPDX-License-Identifier: GPL-2.0 OR MIT */

> +/*

> + * Copyright (C) 2015-2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.

> + */

> +

> +#include <asm/hwcap.h>

> +#include <asm/neon.h>

> +#if defined(CONFIG_ARM)

> +#include <asm/system_info.h>

> +#include <asm/cputype.h>

> +#endif

> +

> +asmlinkage void chacha20_arm(u8 *out, const u8 *in, const size_t len,

> +                            const u32 key[8], const u32 counter[4]);

> +#if defined(CONFIG_ARM)

> +asmlinkage void hchacha20_arm(const u32 state[16], u32 out[8]);

> +#endif

> +#if defined(CONFIG_KERNEL_MODE_NEON)

> +asmlinkage void chacha20_neon(u8 *out, const u8 *in, const size_t len,

> +                             const u32 key[8], const u32 counter[4]);

> +#endif

> +

> +static bool chacha20_use_neon __ro_after_init;

> +

> +static void __init chacha20_fpu_init(void)

> +{

> +#if defined(CONFIG_ARM64)

> +       chacha20_use_neon = elf_hwcap & HWCAP_ASIMD;

> +#elif defined(CONFIG_ARM)

> +       switch (read_cpuid_part()) {

> +       case ARM_CPU_PART_CORTEX_A7:

> +       case ARM_CPU_PART_CORTEX_A5:

> +               /* The Cortex-A7 and Cortex-A5 do not perform well with the NEON

> +                * implementation but do incredibly with the scalar one and use

> +                * less power.

> +                */

> +               break;

> +       default:

> +               chacha20_use_neon = elf_hwcap & HWCAP_NEON;

> +       }

> +#endif

> +}

> +

> +static inline bool chacha20_arch(struct chacha20_ctx *state, u8 *dst,

> +                                const u8 *src, size_t len,

> +                                simd_context_t *simd_context)

> +{

> +#if defined(CONFIG_KERNEL_MODE_NEON)

> +       if (chacha20_use_neon && len >= CHACHA20_BLOCK_SIZE * 3 &&

> +           simd_use(simd_context))

> +               chacha20_neon(dst, src, len, state->key, state->counter);

> +       else

> +#endif


Better to use IS_ENABLED() here:

> +       if (IS_ENABLED(CONFIG_KERNEL_MODE_NEON)) &&

> +           chacha20_use_neon && len >= CHACHA20_BLOCK_SIZE * 3 &&

> +           simd_use(simd_context))


Also, this still has unbounded worst case scheduling latency, given
that the outer library function passes its entire input straight into
the NEON routine.

> +               chacha20_arm(dst, src, len, state->key, state->counter);

> +

> +       state->counter[0] += (len + 63) / 64;

> +       return true;

> +}

> +

> +static inline bool hchacha20_arch(u32 derived_key[CHACHA20_KEY_WORDS],

> +                                 const u8 nonce[HCHACHA20_NONCE_SIZE],

> +                                 const u8 key[HCHACHA20_KEY_SIZE],

> +                                 simd_context_t *simd_context)

> +{

> +#if defined(CONFIG_ARM)

> +       u32 x[] = { CHACHA20_CONSTANT_EXPA,

> +                   CHACHA20_CONSTANT_ND_3,

> +                   CHACHA20_CONSTANT_2_BY,

> +                   CHACHA20_CONSTANT_TE_K,

> +                   get_unaligned_le32(key + 0),

> +                   get_unaligned_le32(key + 4),

> +                   get_unaligned_le32(key + 8),

> +                   get_unaligned_le32(key + 12),

> +                   get_unaligned_le32(key + 16),

> +                   get_unaligned_le32(key + 20),

> +                   get_unaligned_le32(key + 24),

> +                   get_unaligned_le32(key + 28),

> +                   get_unaligned_le32(nonce + 0),

> +                   get_unaligned_le32(nonce + 4),

> +                   get_unaligned_le32(nonce + 8),

> +                   get_unaligned_le32(nonce + 12)

> +       };

> +       hchacha20_arm(x, derived_key);

> +       return true;

> +#else

> +       return false;

> +#endif

> +}

> diff --git a/lib/zinc/chacha20/chacha20-arm-cryptogams.S b/lib/zinc/chacha20/chacha20-arm.S

> similarity index 71%

> rename from lib/zinc/chacha20/chacha20-arm-cryptogams.S

> rename to lib/zinc/chacha20/chacha20-arm.S

> index 770bab469171..5abedafcf129 100644

> --- a/lib/zinc/chacha20/chacha20-arm-cryptogams.S

> +++ b/lib/zinc/chacha20/chacha20-arm.S

> @@ -1,13 +1,475 @@

>  /* SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause */

>  /*

> + * Copyright (C) 2018 Google, Inc.

>   * Copyright (C) 2015-2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.

>   * Copyright (C) 2006-2017 CRYPTOGAMS by <appro@openssl.org>. All Rights Reserved.

> - *

> - * This is based in part on Andy Polyakov's implementation from CRYPTOGAMS.

>   */

>

>  #include <linux/linkage.h>

>

> +/*

> + * The following scalar routine was written by Eric Biggers.

> + *

> + * Design notes:

> + *

> + * 16 registers would be needed to hold the state matrix, but only 14 are

> + * available because 'sp' and 'pc' cannot be used.  So we spill the elements

> + * (x8, x9) to the stack and swap them out with (x10, x11).  This adds one

> + * 'ldrd' and one 'strd' instruction per round.

> + *

> + * All rotates are performed using the implicit rotate operand accepted by the

> + * 'add' and 'eor' instructions.  This is faster than using explicit rotate

> + * instructions.  To make this work, we allow the values in the second and last

> + * rows of the ChaCha state matrix (rows 'b' and 'd') to temporarily have the

> + * wrong rotation amount.  The rotation amount is then fixed up just in time

> + * when the values are used.  'brot' is the number of bits the values in row 'b'

> + * need to be rotated right to arrive at the correct values, and 'drot'

> + * similarly for row 'd'.  (brot, drot) start out as (0, 0) but we make it such

> + * that they end up as (25, 24) after every round.

> + */

> +

> +       // ChaCha state registers

> +       X0      .req    r0

> +       X1      .req    r1

> +       X2      .req    r2

> +       X3      .req    r3

> +       X4      .req    r4

> +       X5      .req    r5

> +       X6      .req    r6

> +       X7      .req    r7

> +       X8_X10  .req    r8      // shared by x8 and x10

> +       X9_X11  .req    r9      // shared by x9 and x11

> +       X12     .req    r10

> +       X13     .req    r11

> +       X14     .req    r12

> +       X15     .req    r14

> +

> +.Lexpand_32byte_k:

> +       // "expand 32-byte k"

> +       .word   0x61707865, 0x3320646e, 0x79622d32, 0x6b206574

> +

> +#ifdef __thumb2__

> +#  define adrl adr

> +#endif

> +

> +.macro __rev           out, in,  t0, t1, t2

> +.if __LINUX_ARM_ARCH__ >= 6

> +       rev             \out, \in

> +.else

> +       lsl             \t0, \in, #24

> +       and             \t1, \in, #0xff00

> +       and             \t2, \in, #0xff0000

> +       orr             \out, \t0, \in, lsr #24

> +       orr             \out, \out, \t1, lsl #8

> +       orr             \out, \out, \t2, lsr #8

> +.endif

> +.endm

> +

> +.macro _le32_bswap     x,  t0, t1, t2

> +#ifdef __ARMEB__

> +       __rev           \x, \x,  \t0, \t1, \t2

> +#endif

> +.endm

> +

> +.macro _le32_bswap_4x  a, b, c, d,  t0, t1, t2

> +       _le32_bswap     \a,  \t0, \t1, \t2

> +       _le32_bswap     \b,  \t0, \t1, \t2

> +       _le32_bswap     \c,  \t0, \t1, \t2

> +       _le32_bswap     \d,  \t0, \t1, \t2

> +.endm

> +

> +.macro __ldrd          a, b, src, offset

> +#if __LINUX_ARM_ARCH__ >= 6

> +       ldrd            \a, \b, [\src, #\offset]

> +#else

> +       ldr             \a, [\src, #\offset]

> +       ldr             \b, [\src, #\offset + 4]

> +#endif

> +.endm

> +

> +.macro __strd          a, b, dst, offset

> +#if __LINUX_ARM_ARCH__ >= 6

> +       strd            \a, \b, [\dst, #\offset]

> +#else

> +       str             \a, [\dst, #\offset]

> +       str             \b, [\dst, #\offset + 4]

> +#endif

> +.endm

> +

> +.macro _halfround      a1, b1, c1, d1,  a2, b2, c2, d2

> +

> +       // a += b; d ^= a; d = rol(d, 16);

> +       add             \a1, \a1, \b1, ror #brot

> +       add             \a2, \a2, \b2, ror #brot

> +       eor             \d1, \a1, \d1, ror #drot

> +       eor             \d2, \a2, \d2, ror #drot

> +       // drot == 32 - 16 == 16

> +

> +       // c += d; b ^= c; b = rol(b, 12);

> +       add             \c1, \c1, \d1, ror #16

> +       add             \c2, \c2, \d2, ror #16

> +       eor             \b1, \c1, \b1, ror #brot

> +       eor             \b2, \c2, \b2, ror #brot

> +       // brot == 32 - 12 == 20

> +

> +       // a += b; d ^= a; d = rol(d, 8);

> +       add             \a1, \a1, \b1, ror #20

> +       add             \a2, \a2, \b2, ror #20

> +       eor             \d1, \a1, \d1, ror #16

> +       eor             \d2, \a2, \d2, ror #16

> +       // drot == 32 - 8 == 24

> +

> +       // c += d; b ^= c; b = rol(b, 7);

> +       add             \c1, \c1, \d1, ror #24

> +       add             \c2, \c2, \d2, ror #24

> +       eor             \b1, \c1, \b1, ror #20

> +       eor             \b2, \c2, \b2, ror #20

> +       // brot == 32 - 7 == 25

> +.endm

> +

> +.macro _doubleround

> +

> +       // column round

> +

> +       // quarterrounds: (x0, x4, x8, x12) and (x1, x5, x9, x13)

> +       _halfround      X0, X4, X8_X10, X12,  X1, X5, X9_X11, X13

> +

> +       // save (x8, x9); restore (x10, x11)

> +       __strd          X8_X10, X9_X11, sp, 0

> +       __ldrd          X8_X10, X9_X11, sp, 8

> +

> +       // quarterrounds: (x2, x6, x10, x14) and (x3, x7, x11, x15)

> +       _halfround      X2, X6, X8_X10, X14,  X3, X7, X9_X11, X15

> +

> +       .set brot, 25

> +       .set drot, 24

> +

> +       // diagonal round

> +

> +       // quarterrounds: (x0, x5, x10, x15) and (x1, x6, x11, x12)

> +       _halfround      X0, X5, X8_X10, X15,  X1, X6, X9_X11, X12

> +

> +       // save (x10, x11); restore (x8, x9)

> +       __strd          X8_X10, X9_X11, sp, 8

> +       __ldrd          X8_X10, X9_X11, sp, 0

> +

> +       // quarterrounds: (x2, x7, x8, x13) and (x3, x4, x9, x14)

> +       _halfround      X2, X7, X8_X10, X13,  X3, X4, X9_X11, X14

> +.endm

> +

> +.macro _chacha_permute nrounds

> +       .set brot, 0

> +       .set drot, 0

> +       .rept \nrounds / 2

> +        _doubleround

> +       .endr

> +.endm

> +

> +.macro _chacha         nrounds

> +

> +.Lnext_block\@:

> +       // Stack: unused0-unused1 x10-x11 x0-x15 OUT IN LEN

> +       // Registers contain x0-x9,x12-x15.

> +

> +       // Do the core ChaCha permutation to update x0-x15.

> +       _chacha_permute \nrounds

> +

> +       add             sp, #8

> +       // Stack: x10-x11 orig_x0-orig_x15 OUT IN LEN

> +       // Registers contain x0-x9,x12-x15.

> +       // x4-x7 are rotated by 'brot'; x12-x15 are rotated by 'drot'.

> +

> +       // Free up some registers (r8-r12,r14) by pushing (x8-x9,x12-x15).

> +       push            {X8_X10, X9_X11, X12, X13, X14, X15}

> +

> +       // Load (OUT, IN, LEN).

> +       ldr             r14, [sp, #96]

> +       ldr             r12, [sp, #100]

> +       ldr             r11, [sp, #104]

> +

> +       orr             r10, r14, r12

> +

> +       // Use slow path if fewer than 64 bytes remain.

> +       cmp             r11, #64

> +       blt             .Lxor_slowpath\@

> +

> +       // Use slow path if IN and/or OUT isn't 4-byte aligned.  Needed even on

> +       // ARMv6+, since ldmia and stmia (used below) still require alignment.

> +       tst             r10, #3

> +       bne             .Lxor_slowpath\@

> +

> +       // Fast path: XOR 64 bytes of aligned data.

> +

> +       // Stack: x8-x9 x12-x15 x10-x11 orig_x0-orig_x15 OUT IN LEN

> +       // Registers: r0-r7 are x0-x7; r8-r11 are free; r12 is IN; r14 is OUT.

> +       // x4-x7 are rotated by 'brot'; x12-x15 are rotated by 'drot'.

> +

> +       // x0-x3

> +       __ldrd          r8, r9, sp, 32

> +       __ldrd          r10, r11, sp, 40

> +       add             X0, X0, r8

> +       add             X1, X1, r9

> +       add             X2, X2, r10

> +       add             X3, X3, r11

> +       _le32_bswap_4x  X0, X1, X2, X3,  r8, r9, r10

> +       ldmia           r12!, {r8-r11}

> +       eor             X0, X0, r8

> +       eor             X1, X1, r9

> +       eor             X2, X2, r10

> +       eor             X3, X3, r11

> +       stmia           r14!, {X0-X3}

> +

> +       // x4-x7

> +       __ldrd          r8, r9, sp, 48

> +       __ldrd          r10, r11, sp, 56

> +       add             X4, r8, X4, ror #brot

> +       add             X5, r9, X5, ror #brot

> +       ldmia           r12!, {X0-X3}

> +       add             X6, r10, X6, ror #brot

> +       add             X7, r11, X7, ror #brot

> +       _le32_bswap_4x  X4, X5, X6, X7,  r8, r9, r10

> +       eor             X4, X4, X0

> +       eor             X5, X5, X1

> +       eor             X6, X6, X2

> +       eor             X7, X7, X3

> +       stmia           r14!, {X4-X7}

> +

> +       // x8-x15

> +       pop             {r0-r7}                 // (x8-x9,x12-x15,x10-x11)

> +       __ldrd          r8, r9, sp, 32

> +       __ldrd          r10, r11, sp, 40

> +       add             r0, r0, r8              // x8

> +       add             r1, r1, r9              // x9

> +       add             r6, r6, r10             // x10

> +       add             r7, r7, r11             // x11

> +       _le32_bswap_4x  r0, r1, r6, r7,  r8, r9, r10

> +       ldmia           r12!, {r8-r11}

> +       eor             r0, r0, r8              // x8

> +       eor             r1, r1, r9              // x9

> +       eor             r6, r6, r10             // x10

> +       eor             r7, r7, r11             // x11

> +       stmia           r14!, {r0,r1,r6,r7}

> +       ldmia           r12!, {r0,r1,r6,r7}

> +       __ldrd          r8, r9, sp, 48

> +       __ldrd          r10, r11, sp, 56

> +       add             r2, r8, r2, ror #drot   // x12

> +       add             r3, r9, r3, ror #drot   // x13

> +       add             r4, r10, r4, ror #drot  // x14

> +       add             r5, r11, r5, ror #drot  // x15

> +       _le32_bswap_4x  r2, r3, r4, r5,  r9, r10, r11

> +         ldr           r9, [sp, #72]           // load LEN

> +       eor             r2, r2, r0              // x12

> +       eor             r3, r3, r1              // x13

> +       eor             r4, r4, r6              // x14

> +       eor             r5, r5, r7              // x15

> +         subs          r9, #64                 // decrement and check LEN

> +       stmia           r14!, {r2-r5}

> +

> +       beq             .Ldone\@

> +

> +.Lprepare_for_next_block\@:

> +

> +       // Stack: x0-x15 OUT IN LEN

> +

> +       // Increment block counter (x12)

> +       add             r8, #1

> +

> +       // Store updated (OUT, IN, LEN)

> +       str             r14, [sp, #64]

> +       str             r12, [sp, #68]

> +       str             r9, [sp, #72]

> +

> +         mov           r14, sp

> +

> +       // Store updated block counter (x12)

> +       str             r8, [sp, #48]

> +

> +         sub           sp, #16

> +

> +       // Reload state and do next block

> +       ldmia           r14!, {r0-r11}          // load x0-x11

> +       __strd          r10, r11, sp, 8         // store x10-x11 before state

> +       ldmia           r14, {r10-r12,r14}      // load x12-x15

> +       b               .Lnext_block\@

> +

> +.Lxor_slowpath\@:

> +       // Slow path: < 64 bytes remaining, or unaligned input or output buffer.

> +       // We handle it by storing the 64 bytes of keystream to the stack, then

> +       // XOR-ing the needed portion with the data.

> +

> +       // Allocate keystream buffer

> +       sub             sp, #64

> +       mov             r14, sp

> +

> +       // Stack: ks0-ks15 x8-x9 x12-x15 x10-x11 orig_x0-orig_x15 OUT IN LEN

> +       // Registers: r0-r7 are x0-x7; r8-r11 are free; r12 is IN; r14 is &ks0.

> +       // x4-x7 are rotated by 'brot'; x12-x15 are rotated by 'drot'.

> +

> +       // Save keystream for x0-x3

> +       __ldrd          r8, r9, sp, 96

> +       __ldrd          r10, r11, sp, 104

> +       add             X0, X0, r8

> +       add             X1, X1, r9

> +       add             X2, X2, r10

> +       add             X3, X3, r11

> +       _le32_bswap_4x  X0, X1, X2, X3,  r8, r9, r10

> +       stmia           r14!, {X0-X3}

> +

> +       // Save keystream for x4-x7

> +       __ldrd          r8, r9, sp, 112

> +       __ldrd          r10, r11, sp, 120

> +       add             X4, r8, X4, ror #brot

> +       add             X5, r9, X5, ror #brot

> +       add             X6, r10, X6, ror #brot

> +       add             X7, r11, X7, ror #brot

> +       _le32_bswap_4x  X4, X5, X6, X7,  r8, r9, r10

> +         add           r8, sp, #64

> +       stmia           r14!, {X4-X7}

> +

> +       // Save keystream for x8-x15

> +       ldm             r8, {r0-r7}             // (x8-x9,x12-x15,x10-x11)

> +       __ldrd          r8, r9, sp, 128

> +       __ldrd          r10, r11, sp, 136

> +       add             r0, r0, r8              // x8

> +       add             r1, r1, r9              // x9

> +       add             r6, r6, r10             // x10

> +       add             r7, r7, r11             // x11

> +       _le32_bswap_4x  r0, r1, r6, r7,  r8, r9, r10

> +       stmia           r14!, {r0,r1,r6,r7}

> +       __ldrd          r8, r9, sp, 144

> +       __ldrd          r10, r11, sp, 152

> +       add             r2, r8, r2, ror #drot   // x12

> +       add             r3, r9, r3, ror #drot   // x13

> +       add             r4, r10, r4, ror #drot  // x14

> +       add             r5, r11, r5, ror #drot  // x15

> +       _le32_bswap_4x  r2, r3, r4, r5,  r9, r10, r11

> +       stmia           r14, {r2-r5}

> +

> +       // Stack: ks0-ks15 unused0-unused7 x0-x15 OUT IN LEN

> +       // Registers: r8 is block counter, r12 is IN.

> +

> +       ldr             r9, [sp, #168]          // LEN

> +       ldr             r14, [sp, #160]         // OUT

> +       cmp             r9, #64

> +         mov           r0, sp

> +       movle           r1, r9

> +       movgt           r1, #64

> +       // r1 is number of bytes to XOR, in range [1, 64]

> +

> +.if __LINUX_ARM_ARCH__ < 6

> +       orr             r2, r12, r14

> +       tst             r2, #3                  // IN or OUT misaligned?

> +       bne             .Lxor_next_byte\@

> +.endif

> +

> +       // XOR a word at a time

> +.rept 16

> +       subs            r1, #4

> +       blt             .Lxor_words_done\@

> +       ldr             r2, [r12], #4

> +       ldr             r3, [r0], #4

> +       eor             r2, r2, r3

> +       str             r2, [r14], #4

> +.endr

> +       b               .Lxor_slowpath_done\@

> +.Lxor_words_done\@:

> +       ands            r1, r1, #3

> +       beq             .Lxor_slowpath_done\@

> +

> +       // XOR a byte at a time

> +.Lxor_next_byte\@:

> +       ldrb            r2, [r12], #1

> +       ldrb            r3, [r0], #1

> +       eor             r2, r2, r3

> +       strb            r2, [r14], #1

> +       subs            r1, #1

> +       bne             .Lxor_next_byte\@

> +

> +.Lxor_slowpath_done\@:

> +       subs            r9, #64

> +       add             sp, #96

> +       bgt             .Lprepare_for_next_block\@

> +

> +.Ldone\@:

> +.endm  // _chacha

> +

> +/*

> + * void chacha20_arm(u8 *out, const u8 *in, size_t len, const u32 key[8],

> + *                  const u32 iv[4]);

> + */

> +ENTRY(chacha20_arm)

> +       cmp             r2, #0                  // len == 0?

> +       bxeq            lr

> +

> +       push            {r0-r2,r4-r11,lr}

> +

> +       // Push state x0-x15 onto stack.

> +       // Also store an extra copy of x10-x11 just before the state.

> +

> +       ldr             r4, [sp, #48]           // iv

> +       mov             r0, sp

> +       sub             sp, #80

> +

> +       // iv: x12-x15

> +       ldm             r4, {X12,X13,X14,X15}

> +       stmdb           r0!, {X12,X13,X14,X15}

> +

> +       // key: x4-x11

> +       __ldrd          X8_X10, X9_X11, r3, 24

> +       __strd          X8_X10, X9_X11, sp, 8

> +       stmdb           r0!, {X8_X10, X9_X11}

> +       ldm             r3, {X4-X9_X11}

> +       stmdb           r0!, {X4-X9_X11}

> +

> +       // constants: x0-x3

> +       adrl            X3, .Lexpand_32byte_k

> +       ldm             X3, {X0-X3}

> +       __strd          X0, X1, sp, 16

> +       __strd          X2, X3, sp, 24

> +

> +       _chacha         20

> +

> +       add             sp, #76

> +       pop             {r4-r11, pc}

> +ENDPROC(chacha20_arm)

> +

> +/*

> + * void hchacha20_arm(const u32 state[16], u32 out[8]);

> + */

> +ENTRY(hchacha20_arm)

> +       push            {r1,r4-r11,lr}

> +

> +       mov             r14, r0

> +       ldmia           r14!, {r0-r11}          // load x0-x11

> +       push            {r10-r11}               // store x10-x11 to stack

> +       ldm             r14, {r10-r12,r14}      // load x12-x15

> +       sub             sp, #8

> +

> +       _chacha_permute 20

> +

> +       // Skip over (unused0-unused1, x10-x11)

> +       add             sp, #16

> +

> +       // Fix up rotations of x12-x15

> +       ror             X12, X12, #drot

> +       ror             X13, X13, #drot

> +         pop           {r4}                    // load 'out'

> +       ror             X14, X14, #drot

> +       ror             X15, X15, #drot

> +

> +       // Store (x0-x3,x12-x15) to 'out'

> +       stm             r4, {X0,X1,X2,X3,X12,X13,X14,X15}

> +

> +       pop             {r4-r11,pc}

> +ENDPROC(hchacha20_arm)

> +

> +#ifdef CONFIG_KERNEL_MODE_NEON

> +/*

> + * This following NEON routine was ported from Andy Polyakov's implementation

> + * from CRYPTOGAMS. It begins with parts of the CRYPTOGAMS scalar routine,

> + * since certain NEON code paths actually branch to it.

> + */

> +

>  .text

>  #if defined(__thumb2__) || defined(__clang__)

>  .syntax        unified

> @@ -22,39 +484,6 @@

>  #define ldrhsb ldrbhs

>  #endif

>

> -.align 5

> -.Lsigma:

> -.long  0x61707865,0x3320646e,0x79622d32,0x6b206574     @ endian-neutral

> -.Lone:

> -.long  1,0,0,0

> -.word  -1

> -

> -.align 5

> -ENTRY(chacha20_arm)

> -       ldr     r12,[sp,#0]             @ pull pointer to counter and nonce

> -       stmdb   sp!,{r0-r2,r4-r11,lr}

> -       cmp     r2,#0                   @ len==0?

> -#ifdef __thumb2__

> -       itt     eq

> -#endif

> -       addeq   sp,sp,#4*3

> -       beq     .Lno_data_arm

> -       ldmia   r12,{r4-r7}             @ load counter and nonce

> -       sub     sp,sp,#4*(16)           @ off-load area

> -#if __LINUX_ARM_ARCH__ < 7 && !defined(__thumb2__)

> -       sub     r14,pc,#100             @ .Lsigma

> -#else

> -       adr     r14,.Lsigma             @ .Lsigma

> -#endif

> -       stmdb   sp!,{r4-r7}             @ copy counter and nonce

> -       ldmia   r3,{r4-r11}             @ load key

> -       ldmia   r14,{r0-r3}             @ load sigma

> -       stmdb   sp!,{r4-r11}            @ copy key

> -       stmdb   sp!,{r0-r3}             @ copy sigma

> -       str     r10,[sp,#4*(16+10)]     @ off-load "rx"

> -       str     r11,[sp,#4*(16+11)]     @ off-load "rx"

> -       b       .Loop_outer_enter

> -

>  .align 4

>  .Loop_outer:

>         ldmia   sp,{r0-r9}              @ load key material

> @@ -748,11 +1177,8 @@ ENTRY(chacha20_arm)

>

>  .Ldone:

>         add     sp,sp,#4*(32+3)

> -.Lno_data_arm:

>         ldmia   sp!,{r4-r11,pc}

> -ENDPROC(chacha20_arm)

>

> -#ifdef CONFIG_KERNEL_MODE_NEON

>  .align 5

>  .Lsigma2:

>  .long  0x61707865,0x3320646e,0x79622d32,0x6b206574     @ endian-neutral

> diff --git a/lib/zinc/chacha20/chacha20-arm64-cryptogams.S b/lib/zinc/chacha20/chacha20-arm64.S

> similarity index 100%

> rename from lib/zinc/chacha20/chacha20-arm64-cryptogams.S

> rename to lib/zinc/chacha20/chacha20-arm64.S

> diff --git a/lib/zinc/chacha20/chacha20.c b/lib/zinc/chacha20/chacha20.c

> index 4354b874a6a5..fc4f74fca653 100644

> --- a/lib/zinc/chacha20/chacha20.c

> +++ b/lib/zinc/chacha20/chacha20.c

> @@ -16,6 +16,8 @@

>

>  #if defined(CONFIG_ZINC_ARCH_X86_64)

>  #include "chacha20-x86_64-glue.h"

> +#elif defined(CONFIG_ZINC_ARCH_ARM) || defined(CONFIG_ZINC_ARCH_ARM64)

> +#include "chacha20-arm-glue.h"

>  #else

>  void __init chacha20_fpu_init(void)

>  {

> --

> 2.19.0

>

Jason A. Donenfeld Sept. 26, 2018, 1:32 p.m. UTC | #7

Hi Ard,

On Wed, Sep 26, 2018 at 10:59 AM Ard Biesheuvel
<ard.biesheuvel@linaro.org> wrote:
> > +static inline bool chacha20_arch(struct chacha20_ctx *state, u8 *dst,

> > +                                const u8 *src, size_t len,

> > +                                simd_context_t *simd_context)

> > +{

> > +#if defined(CONFIG_KERNEL_MODE_NEON)

> > +       if (chacha20_use_neon && len >= CHACHA20_BLOCK_SIZE * 3 &&

> > +           simd_use(simd_context))

> > +               chacha20_neon(dst, src, len, state->key, state->counter);

> > +       else

> > +#endif

>

> Better to use IS_ENABLED() here:

>

> > +       if (IS_ENABLED(CONFIG_KERNEL_MODE_NEON)) &&

> > +           chacha20_use_neon && len >= CHACHA20_BLOCK_SIZE * 3 &&

> > +           simd_use(simd_context))

Good idea. I'll fix that up.

>

> Also, this still has unbounded worst case scheduling latency, given

> that the outer library function passes its entire input straight into

> the NEON routine.

The vast majority of crypto routines in arch/*/crypto/ follow this
same exact pattern, actually. I realize a few don't -- probably the
ones you had a hand in :) -- but I think this is up to the caller to
handle. I made a change so that in chacha20poly1305.c, it calls
simd_relax after handling each scatter-gather element, so a
"construction" will handle this gracefully. But I believe it's up to
the caller to decide on what sizes of information it wants to pass to
primitives. Put differently, this also hasn't ever been an issue
before -- the existing state of the tree indicates this -- and so I
don't anticipate this will be a real issue now. And if it becomes one,
this is something we can address *later*, but certainly there's no use
of adding additional complexity to the initial patchset to do this
now.

Jason

Ard Biesheuvel Sept. 26, 2018, 2:02 p.m. UTC | #8

(+ Herbert, Thomas)

On Wed, 26 Sep 2018 at 15:33, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>

> Hi Ard,

>

> On Wed, Sep 26, 2018 at 10:59 AM Ard Biesheuvel

> <ard.biesheuvel@linaro.org> wrote:

> > > +static inline bool chacha20_arch(struct chacha20_ctx *state, u8 *dst,

> > > +                                const u8 *src, size_t len,

> > > +                                simd_context_t *simd_context)

> > > +{

> > > +#if defined(CONFIG_KERNEL_MODE_NEON)

> > > +       if (chacha20_use_neon && len >= CHACHA20_BLOCK_SIZE * 3 &&

> > > +           simd_use(simd_context))

> > > +               chacha20_neon(dst, src, len, state->key, state->counter);

> > > +       else

> > > +#endif

> >

> > Better to use IS_ENABLED() here:

> >

> > > +       if (IS_ENABLED(CONFIG_KERNEL_MODE_NEON)) &&

> > > +           chacha20_use_neon && len >= CHACHA20_BLOCK_SIZE * 3 &&

> > > +           simd_use(simd_context))

>

> Good idea. I'll fix that up.

>

> >

> > Also, this still has unbounded worst case scheduling latency, given

> > that the outer library function passes its entire input straight into

> > the NEON routine.

>

> The vast majority of crypto routines in arch/*/crypto/ follow this

> same exact pattern, actually. I realize a few don't -- probably the

> ones you had a hand in :) -- but I think this is up to the caller to

> handle.


Anything that uses the scatterwalk API (AEADs and skciphers) will
handle at most a page at a time. Hashes are different, which is why
some of them have to handle it explicitly.

> I made a change so that in chacha20poly1305.c, it calls

> simd_relax after handling each scatter-gather element, so a

> "construction" will handle this gracefully. But I believe it's up to

> the caller to decide on what sizes of information it wants to pass to

> primitives. Put differently, this also hasn't ever been an issue

> before -- the existing state of the tree indicates this -- and so I

> don't anticipate this will be a real issue now.


The state of the tree does not capture all relevant context or
history. The scheduling latency issue was brought up very recently by
the -rt folks on the mailing lists.

> And if it becomes one,

> this is something we can address *later*, but certainly there's no use

> of adding additional complexity to the initial patchset to do this

> now.

>


You are introducing a very useful SIMD abstraction, but it lets code
run with preemption disabled for unbounded amounts of time, and so now
is the time to ensure we get it right.

Part of the [justified] criticism on the current state of the crypto
API is on its complexity, and so I don't think it makes sense to keep
it simple now and add the complexity later (and the same concern
applies to async support btw).

Andrew Lunn Sept. 26, 2018, 2:36 p.m. UTC | #9

> > Also, this still has unbounded worst case scheduling latency, given

> > that the outer library function passes its entire input straight into

> > the NEON routine.

> 

> The vast majority of crypto routines in arch/*/crypto/ follow this

> same exact pattern, actually. I realize a few don't -- probably the

> ones you had a hand in :) -- but I think this is up to the caller to

> handle. I made a change so that in chacha20poly1305.c, it calls

> simd_relax after handling each scatter-gather element, so a

> "construction" will handle this gracefully. But I believe it's up to

> the caller to decide on what sizes of information it wants to pass to

> primitives. Put differently, this also hasn't ever been an issue

> before -- the existing state of the tree indicates this -- and so I

> don't anticipate this will be a real issue now. And if it becomes one,

> this is something we can address *later*, but certainly there's no use

> of adding additional complexity to the initial patchset to do this

> now.


Hi Jason

This is not my area of expertise, so you should verify what i'm say
here...

My guess is, IPSEC will mostly ask the crypto code to work on 1500
byte full MTU packets and 64 byte TCP ACK packets. Disk encryption i
guess works on 4K blocks. So these requests are all quite small,
keeping the latency reasonably bounded.

The wireguard interface claims it is GSO capable. This means the
network stack will pass it big chunks of data and leave it to the
network interface to perform the segmentation into 1500 byte MTU
frames on the wire. I've not looked at how wireguard actually handles
these big chunks. But to get maximum performance, it should try to
keep them whole, just add a header and/or trailer. Will wireguard pass
these big chunks of data to the crypto code? Do we now have 64K blocks
being worked on? Does the latency jump from 4K to 64K? That might be
new, so the existing state of the tree does not help you here.

   Andrew

Jason A. Donenfeld Sept. 26, 2018, 3:25 p.m. UTC | #10

On Wed, Sep 26, 2018 at 4:36 PM Andrew Lunn <andrew@lunn.ch> wrote:
> The wireguard interface claims it is GSO capable. This means the

> network stack will pass it big chunks of data and leave it to the

> network interface to perform the segmentation into 1500 byte MTU

> frames on the wire. I've not looked at how wireguard actually handles

> these big chunks. But to get maximum performance, it should try to

> keep them whole, just add a header and/or trailer. Will wireguard pass

> these big chunks of data to the crypto code? Do we now have 64K blocks

> being worked on? Does the latency jump from 4K to 64K? That might be

> new, so the existing state of the tree does not help you here.


No, it only requests GSO superpackets so that it can group the pieces
and encrypt them on the same core. But they're each encrypted
separately (broken up immediately after ndo_start_xmit), and so they
wind up being ~1420 bytes each to encrypt. I spoke about this at
netdev2.2 if you're interested in the architecture; there's a paper:

https://www.wireguard.com/papers/wireguard-netdev22.pdf
https://www.youtube.com/watch?v=54orFwtQ1XY
https://www.wireguard.com/talks/netdev2017-slides.pdf

Jason A. Donenfeld Sept. 26, 2018, 3:41 p.m. UTC | #11

On Wed, Sep 26, 2018 at 4:02 PM Ard Biesheuvel
<ard.biesheuvel@linaro.org> wrote:
> I don't think it makes sense to keep

> it simple now and add the complexity later (and the same concern

> applies to async support btw).

Ugh, no. I don't want to add needless complexity, period. Zinc is
synchronous, not asynchronous. It provides software implementations.
That's what it does. While many of your reviews have been useful, many
of your comments indicate some desire to change and mold the purpose
and focus of Zinc away from Zinc's intents. Stop that. It's not going
to become a bloated mess of "things Ard wanted and quipped about on
LKML." Things like these only serve to filibuster the patchset
indefinitely. But maybe that's what you'd like all along? Hard to
tell, honestly. So, no, sorry, Zinc isn't gaining an async interface
right now.

Ard Biesheuvel Sept. 26, 2018, 3:41 p.m. UTC | #12

On Wed, 26 Sep 2018 at 16:02, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
>

> (+ Herbert, Thomas)

>

> On Wed, 26 Sep 2018 at 15:33, Jason A. Donenfeld <Jason@zx2c4.com> wrote:

> >

> > Hi Ard,

> >

> > On Wed, Sep 26, 2018 at 10:59 AM Ard Biesheuvel

> > <ard.biesheuvel@linaro.org> wrote:

> > > > +static inline bool chacha20_arch(struct chacha20_ctx *state, u8 *dst,

> > > > +                                const u8 *src, size_t len,

> > > > +                                simd_context_t *simd_context)

> > > > +{

> > > > +#if defined(CONFIG_KERNEL_MODE_NEON)

> > > > +       if (chacha20_use_neon && len >= CHACHA20_BLOCK_SIZE * 3 &&

> > > > +           simd_use(simd_context))

> > > > +               chacha20_neon(dst, src, len, state->key, state->counter);

> > > > +       else

> > > > +#endif

> > >

> > > Better to use IS_ENABLED() here:

> > >

> > > > +       if (IS_ENABLED(CONFIG_KERNEL_MODE_NEON)) &&

> > > > +           chacha20_use_neon && len >= CHACHA20_BLOCK_SIZE * 3 &&

> > > > +           simd_use(simd_context))

> >

> > Good idea. I'll fix that up.

> >

> > >

> > > Also, this still has unbounded worst case scheduling latency, given

> > > that the outer library function passes its entire input straight into

> > > the NEON routine.

> >

> > The vast majority of crypto routines in arch/*/crypto/ follow this

> > same exact pattern, actually. I realize a few don't -- probably the

> > ones you had a hand in :) -- but I think this is up to the caller to

> > handle.

>

> Anything that uses the scatterwalk API (AEADs and skciphers) will

> handle at most a page at a time. Hashes are different, which is why

> some of them have to handle it explicitly.

>

> > I made a change so that in chacha20poly1305.c, it calls

> > simd_relax after handling each scatter-gather element, so a

> > "construction" will handle this gracefully. But I believe it's up to

> > the caller to decide on what sizes of information it wants to pass to

> > primitives. Put differently, this also hasn't ever been an issue

> > before -- the existing state of the tree indicates this -- and so I

> > don't anticipate this will be a real issue now.

>

> The state of the tree does not capture all relevant context or

> history. The scheduling latency issue was brought up very recently by

> the -rt folks on the mailing lists.

>

> > And if it becomes one,

> > this is something we can address *later*, but certainly there's no use

> > of adding additional complexity to the initial patchset to do this

> > now.

> >

>

> You are introducing a very useful SIMD abstraction, but it lets code

> run with preemption disabled for unbounded amounts of time, and so now

> is the time to ensure we get it right.

>


Actually, looking at the code again, the abstraction does appear to be
fine, it is just the chacha20 code that does not make use of it.

Jason A. Donenfeld Sept. 26, 2018, 3:45 p.m. UTC | #13

On Wed, Sep 26, 2018 at 5:42 PM Ard Biesheuvel
<ard.biesheuvel@linaro.org> wrote:
>

> On Wed, 26 Sep 2018 at 16:02, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:

> Actually, looking at the code again, the abstraction does appear to be

> fine, it is just the chacha20 code that does not make use of it.


So what you have in mind is something like calling simd_relax() every
4096 bytes or so?

Jason A. Donenfeld Sept. 26, 2018, 3:49 p.m. UTC | #14

On Wed, Sep 26, 2018 at 5:45 PM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> So what you have in mind is something like calling simd_relax() every

> 4096 bytes or so?


That was actually pretty easy, putting together both of your suggestions:

static inline bool chacha20_arch(struct chacha20_ctx *state, u8 *dst,
                 u8 *src, size_t len,
                 simd_context_t *simd_context)
{
    while (len > PAGE_SIZE) {
        chacha20_arch(state, dst, src, PAGE_SIZE, simd_context);
        len -= PAGE_SIZE;
        src += PAGE_SIZE;
        dst += PAGE_SIZE;
        simd_relax(simd_context);
    }
    if (IS_ENABLED(CONFIG_KERNEL_MODE_NEON) && chacha20_use_neon &&
        len >= CHACHA20_BLOCK_SIZE * 3 && simd_use(simd_context))
        chacha20_neon(dst, src, len, state->key, state->counter);
    else
        chacha20_arm(dst, src, len, state->key, state->counter);

    state->counter[0] += (len + 63) / 64;
    return true;
}

Jason A. Donenfeld Sept. 26, 2018, 3:58 p.m. UTC | #15

On Wed, Sep 26, 2018 at 5:52 PM Ard Biesheuvel
<ard.biesheuvel@linaro.org> wrote:
> Nice one :-)

>

> This works for me (but perhaps add a comment as well)


Sure. Just a prototype; it'll be clean for v7.

Jason A. Donenfeld Sept. 26, 2018, 4:04 p.m. UTC | #16

Hi Ivan,

On Wed, Sep 26, 2018 at 6:00 PM Ivan Labáth <labokml@labo.rs> wrote:
>

> On 25.09.2018 16:56, Jason A. Donenfeld wrote:

> > Extensive documentation and description of the protocol and

> > considerations, along with formal proofs of the cryptography, are> available at:

> >

> >   * https://www.wireguard.com/

> >   * https://www.wireguard.com/papers/wireguard.pdf

> []

> > +enum { HANDSHAKE_DSCP = 0x88 /* AF41, plus 00 ECN */ };

> []

> > +     if (skb->protocol == htons(ETH_P_IP)) {

> > +             len = ntohs(ip_hdr(skb)->tot_len);

> > +             if (unlikely(len < sizeof(struct iphdr)))

> > +                     goto dishonest_packet_size;

> > +             if (INET_ECN_is_ce(PACKET_CB(skb)->ds))

> > +                     IP_ECN_set_ce(ip_hdr(skb));

> > +     } else if (skb->protocol == htons(ETH_P_IPV6)) {

> > +             len = ntohs(ipv6_hdr(skb)->payload_len) +

> > +                   sizeof(struct ipv6hdr);

> > +             if (INET_ECN_is_ce(PACKET_CB(skb)->ds))

> > +                     IP6_ECN_set_ce(skb, ipv6_hdr(skb));

> > +     } else

> []

> > +     skb_queue_walk (&packets, skb) {

> > +             /* 0 for no outer TOS: no leak. TODO: should we use flowi->tos

> > +              * as outer? */

> > +             PACKET_CB(skb)->ds = ip_tunnel_ecn_encap(0, ip_hdr(skb), skb);

> > +             PACKET_CB(skb)->nonce =

> > +                             atomic64_inc_return(&key->counter.counter) - 1;

> > +             if (unlikely(PACKET_CB(skb)->nonce >= REJECT_AFTER_MESSAGES))

> > +                     goto out_invalid;

> > +     }

> Hi,

>

> is there documentation and/or rationale for ecn handling?

> Quick search for ecn and dscp didn't reveal any.


ECN support was developed with Dave Taht so that it does the right
thing with CAKE and such. He's CC'd, so that he can fill in details,
and sure, we can write these up. As well, I can add the rationale for
the handshake-packet-specific DSCP value to the paper in the next few
days; thanks for pointing out these documentation oversights.

Jason

Andy Lutomirski Sept. 26, 2018, 4:21 p.m. UTC | #17

> On Sep 26, 2018, at 7:02 AM, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:

> 

> (+ Herbert, Thomas)

> 

>> On Wed, 26 Sep 2018 at 15:33, Jason A. Donenfeld <Jason@zx2c4.com> wrote:

>> 

>> Hi Ard,

>> .

> 

>> And if it becomes one,

>> this is something we can address *later*, but certainly there's no use

>> of adding additional complexity to the initial patchset to do this

>> now.

>> 

> 

> You are introducing a very useful SIMD abstraction, but it lets code

> run with preemption disabled for unbounded amounts of time, and so now

> is the time to ensure we get it right.

> 

> Part of the [justified] criticism on the current state of the crypto

> API is on its complexity, and so I don't think it makes sense to keep

> it simple now and add the complexity later (and the same concern

> applies to async support btw).

Are, is what you’re saying that the Zinc chacha20 functions should call simd_relax() every n bytes automatically for some reasonable value of n?  If so, seems sensible, except that some care might be needed to make sure they interact with preemption correctly.

What I mean is: the public Zinc entry points should either be callable in an atomic context or they should not be.  I think this should be checked at runtime in an appropriate place with an __might_sleep or similar.  Or simd_relax should learn to *not* schedule if the result of preempt_enable() leaves it atomic. (And the latter needs to be done in a way that works even on non-preempt kernels, and I don’t remember whether that’s possible.). And this should happen regardless of how many bytes are processed. IOW, calling into Zinc should be equally not atomic-safe for 100 bytes and for 10 MB.

As for async, ISTM a really good WireGuard accelerator would expose a different interface than crypto API supports, and it probably makes sense to wait for such hardware to show up before figuring out how to use it.  And no matter what form it takes, I don’t think it should complicate the basic Zinc crypto entry points.

Jason A. Donenfeld Sept. 26, 2018, 5:03 p.m. UTC | #18

On Wed, Sep 26, 2018 at 6:21 PM Andy Lutomirski <luto@amacapital.net> wrote:
> Are, is what you’re saying that the Zinc chacha20 functions should call simd_relax() every n bytes automatically for some reasonable value of n?  If so, seems sensible, except that some care might be needed to make sure they interact with preemption correctly.

>

> What I mean is: the public Zinc entry points should either be callable in an atomic context or they should not be.  I think this should be checked at runtime in an appropriate place with an __might_sleep or similar.  Or simd_relax should learn to *not* schedule if the result of preempt_enable() leaves it atomic. (And the latter needs to be done in a way that works even on non-preempt kernels, and I don’t remember whether that’s possible.). And this should happen regardless of how many bytes are processed. IOW, calling into Zinc should be equally not atomic-safe for 100 bytes and for 10 MB.


I'm not sure this is actually a problem. Namely:

preempt_disable();
kernel_fpu_begin();
kernel_fpu_end();
schedule(); <--- bug!

Calling kernel_fpu_end() disables preemption, but AFAIK, preemption
enabling/disabling is recursive, so kernel_fpu_end's use of
preempt_disable won't actually do anything until the outer preempt
enable is called:

preempt_disable();
kernel_fpu_begin();
kernel_fpu_end();
preempt_enable();
schedule(); <--- works!

Or am I missing some more subtle point?

Jason

Ard Biesheuvel Sept. 26, 2018, 5:08 p.m. UTC | #19

On Wed, 26 Sep 2018 at 19:03, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>

> On Wed, Sep 26, 2018 at 6:21 PM Andy Lutomirski <luto@amacapital.net> wrote:

> > Are, is what you’re saying that the Zinc chacha20 functions should call simd_relax() every n bytes automatically for some reasonable value of n?  If so, seems sensible, except that some care might be needed to make sure they interact with preemption correctly.

> >

> > What I mean is: the public Zinc entry points should either be callable in an atomic context or they should not be.  I think this should be checked at runtime in an appropriate place with an __might_sleep or similar.  Or simd_relax should learn to *not* schedule if the result of preempt_enable() leaves it atomic. (And the latter needs to be done in a way that works even on non-preempt kernels, and I don’t remember whether that’s possible.). And this should happen regardless of how many bytes are processed. IOW, calling into Zinc should be equally not atomic-safe for 100 bytes and for 10 MB.

>

> I'm not sure this is actually a problem. Namely:

>

> preempt_disable();

> kernel_fpu_begin();

> kernel_fpu_end();

> schedule(); <--- bug!

>

> Calling kernel_fpu_end() disables preemption, but AFAIK, preemption

> enabling/disabling is recursive, so kernel_fpu_end's use of

> preempt_disable won't actually do anything until the outer preempt

> enable is called:

>

> preempt_disable();

> kernel_fpu_begin();

> kernel_fpu_end();

> preempt_enable();

> schedule(); <--- works!

>

> Or am I missing some more subtle point?

>


No that seems accurate to me.

Andy Lutomirski Sept. 26, 2018, 5:23 p.m. UTC | #20

> On Sep 26, 2018, at 10:03 AM, Jason A. Donenfeld <Jason@zx2c4.com> wrote:

> 

>> On Wed, Sep 26, 2018 at 6:21 PM Andy Lutomirski <luto@amacapital.net> wrote:

>> Are, is what you’re saying that the Zinc chacha20 functions should call simd_relax() every n bytes automatically for some reasonable value of n?  If so, seems sensible, except that some care might be needed to make sure they interact with preemption correctly.

>> 

>> What I mean is: the public Zinc entry points should either be callable in an atomic context or they should not be.  I think this should be checked at runtime in an appropriate place with an __might_sleep or similar.  Or simd_relax should learn to *not* schedule if the result of preempt_enable() leaves it atomic. (And the latter needs to be done in a way that works even on non-preempt kernels, and I don’t remember whether that’s possible.). And this should happen regardless of how many bytes are processed. IOW, calling into Zinc should be equally not atomic-safe for 100 bytes and for 10 MB.

> 

> I'm not sure this is actually a problem. Namely:

> 

> preempt_disable();

> kernel_fpu_begin();

> kernel_fpu_end();

> schedule(); <--- bug!

> 

> Calling kernel_fpu_end() disables preemption, but AFAIK, preemption

> enabling/disabling is recursive, so kernel_fpu_end's use of

> preempt_disable won't actually do anything until the outer preempt

> enable is called:

> 

> preempt_disable();

> kernel_fpu_begin();

> kernel_fpu_end();

> preempt_enable();

> schedule(); <--- works!

> 

> Or am I missing some more subtle point?

> 


No, I think you’re right. I was mid-remembering precisely how simd_relax() worked.

Jason A. Donenfeld Sept. 26, 2018, 5:46 p.m. UTC | #21

On Wed, Sep 26, 2018 at 7:37 PM Eric Biggers <ebiggers@kernel.org> wrote:
> Can you please stop accusing Ard of "filibustering" your patchset?  Spending too

> long in non-preemptible code is a real problem even on non-RT systems.

> syzkaller has been reporting bugs where the kernel spins too long without any

> preemption points, both in crypto-related code and elsewhere in the kernel.  So

> we've had to add explicit preemption points to address those, as otherwise users

> can lock up all CPUs for tens of seconds.  The issue being discussed here is

> basically the same except here preemption is being explicitly disabled via

> kernel_neon_begin(), so it becomes a problem even on non-CONFIG_PREEMPT kernels.


The async distraction (re:filibustering) and the preempt concern are
two totally different things. I've already posted some code elsewhere
in this thread that addresses the preempt issue that looked good to
Ard. This will be part of v7.

Andrew Lunn Sept. 27, 2018, 1:15 a.m. UTC | #22

Hi Jason

I know you have been concentrating on the crypto code, so i'm not
expecting too many changes at the moment in the network code.

It would be good to further reduce the number of checkpatch warnings.
The biggest problem i have at the moment is that all the trivial to
fix warnings are hiding a few more important warnings. There are a few
like these which are not obvious to see:

WARNING: Avoid crashing the kernel - try using WARN_ON & recovery code rather than BUG() or BUG_ON()
#2984: FILE: drivers/net/wireguard/noise.c:293:
+       BUG_ON(first_len > BLAKE2S_HASH_SIZE || second_len > BLAKE2S_HASH_SIZE ||

WARNING: Macros with flow control statements should be avoided
#5471: FILE: drivers/net/wireguard/selftest/allowedips.h:456:
+#define init_peer(name) do {                                               \
+               name = kzalloc(sizeof(*name), GFP_KERNEL);                 \
+               if (unlikely(!name)) {                                     \
+                       pr_info("allowedips self-test: out of memory\n");  \
+                       goto free;                                         \
+               }                                                          \
+               kref_init(&name->refcount);                                \
+       } while (0)


The namespace pollution also needs to be addresses. You have some
pretty generic named global symbols. I picked out a few examples from
objdump

00002a94 g     F .text	    00000060 peer_put
00003484 g     F .text	    0000004c timers_stop
00003520 g     F .text	    00000114 packet_queue_init
00002640 g     F .text	    00000034 device_uninit
000026bc g     F .text	    00000288 peer_create
000090d4 g     F .text	    000001bc ratelimiter_init

Please make use of a prefix for global symbols, e.g. wg_.

       Andrew

Jason A. Donenfeld Sept. 27, 2018, 1:26 p.m. UTC | #23

Hi Thomas,

I'm trying to optimize this for crypto performance while still taking
into account preemption concerns. I'm having a bit of trouble figuring
out a way to determine numerically what the upper bounds for this
stuff looks like. I'm sure I could pick a pretty sane number that's
arguably okay -- and way under the limit -- but I still am interested
in determining what that limit actually is. I was hoping there'd be a
debugging option called, "warn if preemption is disabled for too
long", or something, but I couldn't find anything like that. I'm also
not quite sure what the latency limits are, to just compute this with
a formula. Essentially what I'm trying to determine is:

preempt_disable();
asm volatile(".fill N, 1, 0x90;");
preempt_enable();

What is the maximum value of N for which the above is okay? What
technique would you generally use in measuring this?

Thanks,
Jason

Jason A. Donenfeld Sept. 27, 2018, 3:19 p.m. UTC | #24

Hey again Thomas,

On Thu, Sep 27, 2018 at 3:26 PM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>

> Hi Thomas,

>

> I'm trying to optimize this for crypto performance while still taking

> into account preemption concerns. I'm having a bit of trouble figuring

> out a way to determine numerically what the upper bounds for this

> stuff looks like. I'm sure I could pick a pretty sane number that's

> arguably okay -- and way under the limit -- but I still am interested

> in determining what that limit actually is. I was hoping there'd be a

> debugging option called, "warn if preemption is disabled for too

> long", or something, but I couldn't find anything like that. I'm also

> not quite sure what the latency limits are, to just compute this with

> a formula. Essentially what I'm trying to determine is:

>

> preempt_disable();

> asm volatile(".fill N, 1, 0x90;");

> preempt_enable();

>

> What is the maximum value of N for which the above is okay? What

> technique would you generally use in measuring this?

>

> Thanks,

> Jason


From talking to Peter (now CC'd) on IRC, it sounds like what you're
mostly interested in is clocktime latency on reasonable hardware, with
a goal of around ~20µs as a maximum upper bound? I don't expect to get
anywhere near this value at all, but if you can confirm that's a
decent ballpark, it would make for some interesting calculations.

Regards,
Jason

Eric Biggers Sept. 27, 2018, 6:29 p.m. UTC | #25

On Tue, Sep 25, 2018 at 04:55:59PM +0200, Jason A. Donenfeld wrote:
> 

> It is intended that this entire patch series enter the kernel through

> DaveM's net-next tree. Subsequently, WireGuard patches will go through

> DaveM's net-next tree, while Zinc patches will go through Greg KH's tree.

> 

Why is Herbert Xu's existing crypto tree being circumvented, especially for
future patches (the initial merge isn't quite as important as that's a one-time
event)?  I like being able to check out cryptodev to test upcoming crypto
patches.  And currently, changes to APIs, algorithms, tests, and implementations
all go through cryptodev, which is convenient for crypto developers.

Apparently, you're proposing that someone adding a new algorithm will now have
to submit the API portion to one maintainer (Herbert Xu) and the implementation
portion to another maintainer (you), and they'll go through separate git trees.
That's inconvenient for developers, and it seems that in practice you and
Herbert will be stepping on each other's toes a lot.

Can you please reach some kind of sane agreement with Herbert so that the
development process isn't fractured into two?  Perhaps you could review patches,
but Herbert could still apply them?

I'm also wondering about the criteria for making additions and changes to
"Zinc".  You mentioned before that one of the "advantages" of Zinc is that it
doesn't include "cipher modes from 90s cryptographers" -- what does that mean
exactly?  You've also indicated before that you don't want people modifying the
Poly1305 implementations as they are too error-prone.  Yet apparently it's fine
to change them yourself, e.g. when you added the part that converts the
accumulator from base 26 to base 32.  I worry there may be double standards
here, and useful contributions could be blocked or discouraged in the future.
Can you please elaborate on your criteria for contributions to Zinc?

Also, will you allow algorithms that aren't up to modern security standards but
are needed for compatibility reasons, e.g. MD5, SHA-1, and DES?  There are
existing standards, APIs, and data formats that use these "legacy" algorithms;
so implementations of them are often still needed, whether we like it or not.

And does it matter who designed the algorithms, e.g. do algorithms from Daniel
Bernstein get effectively a free pass, while algorithms from certain countries,
governments, or organizations are not allowed?  E.g. wireless driver developers
may need the SM4 block cipher (which is now supported by the crypto API) as it's
specified in a Chinese wireless standard.  Will you allow SM4 in Zinc?  Or will
people have to submit some algorithms to Herbert and some to you due to
disagreements about what algorithms should be included?

- Eric

Jason A. Donenfeld Sept. 27, 2018, 10:37 p.m. UTC | #26

Hi Andrew,

Thanks for following up with this.

On Thu, Sep 27, 2018 at 3:15 AM Andrew Lunn <andrew@lunn.ch> wrote:
> I know you have been concentrating on the crypto code, so i'm not

> expecting too many changes at the moment in the network code.

I should be addressing things in parallel, actually, so I'm happy to
work on this.

> WARNING: Avoid crashing the kernel - try using WARN_ON & recovery code rather than BUG() or BUG_ON()

> #2984: FILE: drivers/net/wireguard/noise.c:293:

> +       BUG_ON(first_len > BLAKE2S_HASH_SIZE || second_len > BLAKE2S_HASH_SIZE ||

I was actually going to ask you about this, because it applies
similarly in another context too that I'm trying to refine. The above
function you quote has the following properties:

- Only ever accepts fixed length parameters, so the compiler can
constant fold invocations of it fantastically. Those parameters are
fixed length in the sense that they're enum/macro constants. They
never come from the user or from a packet or something.
- Never produces an incorrect result. For said constants, all inputs
are valid, and so it succeeds in producing an output every time.
- Is a "pure" function, just knocking bytes around, without needing to
interact with fancy kernel-y things; could be implemented on some sort
of WWII-era rotor machine provided you had the patience.

Because of the above, there's never any error to return to the user of
the function. Also, because it only ever takes constant sized inputs,
in theory I should be able to change that BUG_ON() to BUILD_BUG_ON(),
but in practice the optimizer/inliner isn't actually that aggressive.

But what I would like is some way of signaling to the developer using
this function that they've passed it an illegal value, and their code
should not ship until that's fixed, under any circumstances at all  --
that their usage of the function is completely unacceptable and wrong.
Bla bla strong statements.

For this, I figured the notion would come across with the aberrant
behavior of "crash the developer's [in this case, my] QEMU instance"
when "#ifdef DEBUG is true". This is the same kind of place where I'd
have an "assert()" statement in userland. It sounds like what you're
saying is that a WARN_ON is equally as effective instead? Or given the
above, do you think the BUG_ON is actually sensible? Or neither and I
should do something else?

> WARNING: Macros with flow control statements should be avoided

> #5471: FILE: drivers/net/wireguard/selftest/allowedips.h:456:

> +#define init_peer(name) do {                                               \

> +               name = kzalloc(sizeof(*name), GFP_KERNEL);                 \

> +               if (unlikely(!name)) {                                     \

> +                       pr_info("allowedips self-test: out of memory\n");  \

> +                       goto free;                                         \

> +               }                                                          \

> +               kref_init(&name->refcount);                                \

> +       } while (0)

This is part of a debugging selftest, where I'm initing a bunch of
peers one after another, and this macro helps keep the test clear
while offloading the actual irrelevant coding part to this macro. The
test itself then just has code like:

    init_peer(a);
    init_peer(b);
    init_peer(c);
    init_peer(d);
    init_peer(e);
    init_peer(f);
    init_peer(g);
    init_peer(h);

    insert(4, a, 192, 168, 4, 0, 24);
    insert(4, b, 192, 168, 4, 4, 32);
    insert(4, c, 192, 168, 0, 0, 16);
    insert(4, d, 192, 95, 5, 64, 27);
    /* replaces previous entry, and maskself is required */
    insert(4, c, 192, 95, 5, 65, 27);
    insert(6, d, 0x26075300, 0x60006b00, 0, 0xc05f0543, 128);
    insert(6, c, 0x26075300, 0x60006b00, 0, 0, 64);
    ...

And so forth. I can probably figure out a different way to code this
if you really want, but I thought this would be clear.

> The namespace pollution also needs to be addresses. You have some

> pretty generic named global symbols. I picked out a few examples from

> objdump

>

> 00002a94 g     F .text      00000060 peer_put

> 00003484 g     F .text      0000004c timers_stop

> 00003520 g     F .text      00000114 packet_queue_init

> 00002640 g     F .text      00000034 device_uninit

> 000026bc g     F .text      00000288 peer_create

> 000090d4 g     F .text      000001bc ratelimiter_init

>

> Please make use of a prefix for global symbols, e.g. wg_.

Will do. v7 will include the wg_ prefix.

On a slightly related note, out of curiosity, any idea what's up with
the future of LTO in the kernel? It sounds like that'd be nice to have
on a module-by-module basis. IOW, I'd love to LTO all of my .c files
in wireguard together, and then only ever expose mod_init/exit and
whatever I explicitly EXPORT_SYMBOL, and then have the compiler and
linker treat the rest of everything as essentially in one .c file and
optimize the heck out of it, and then strip all the symbols. It would,
incidentally, remove the need for said symbol prefixes, since I
wouldn't be making anything global. (Probably perf/ftrace/etc would
become harder to use though.)

Jason

Jason A. Donenfeld Sept. 28, 2018, 1:09 a.m. UTC | #27

On Fri, Sep 28, 2018 at 12:37 AM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> Will do. v7 will include the wg_ prefix.


$ nm *.o | while read a b c; do [[ $b == T ]] && echo $c; done | grep -v ^wg_
cleanup_module
init_module

Success.

Jason A. Donenfeld Sept. 28, 2018, 2:35 a.m. UTC | #28

Hi Eric,

On Thu, Sep 27, 2018 at 06:17:27PM -0700, Eric Biggers wrote:
> So, Zinc will simultaneously replace the current crypto implementations, *and*

> be "orthogonal" and "separate" from all the crypto code currently maintained by

> Herbert?  You can't have your cake and eat it too...

The plan is for it to replace many uses of the crypto API where it makes
sense, but not replace uses where it doesn't make sense. Perhaps in the
long run, over time, its usage will grow to cover those cases too, or,
perhaps instead, Zinc will form a simple basis of software crypto
algorithms in whatever future API designs crop up. In other words, like
most changes in kernel development, things happen gradually, starting
with a few good cases, and gradually growing as the need and design
arise.

> I'm still concerned you're splitting the community in two.  It will be unclear

> where new algorithms and implementations should go.  Some people will choose

> Herbert and the current crypto API and conventions, and some people will choose

> you and Zinc...  I still don't see clear guidelines for what will go where.

I can try to work out some explicit guidelines and write these up for
Documentation/, if that'd make a difference for you. I don't think this
is a matter of "community splitting". On the contrary, I think Zinc will
bring communities together, inviting the larger cryptography community
to take an interest in improving the state of crypto in Linux. Either
way, the litmus test for where code should go remains pretty similar to
how it's been working so far. Are you tempted to stick it in lib/
because that fits your programming paradigm and because you think it's
generally useful? If so, submit it to lib/zinc/. Conversely, is it only
something used in the large array of options provided by dmcrypt, ipsec,
afalg, etc? Submit it to the crypto API.

If you think this criteria is lacking, I'm amenable to adjusting that
and changing it, especially as situations and designs change and morph
over time. But that seems like a fairly decent starting point.

> Please reach out to Herbert to find a sane solution

> crypto development without choosing "sides".

Please, don't politicize this. This has nothing to do with "sides". This
has to do with which paradigm makes sense for implementing a particular
algorithm. And everything that goes in Zinc gets to be used seamlessly
by the crypto API anyway, through use of the trivial stub glue code,
like what I've shown in the two commits in this series. Again, if it's
something that will work well as a direct function call, then it seems
like Zinc makes sense as a home for it.

With that said, I've reached out to Herbert, and we'll of course discuss
and reach a good conclusion together.

> Note that usage can change over time; a user that requires a

> single cipher could later need multiple, and vice versa.

I think this depends on the design of the driver and the style it's
implemented in. For example, I could imagine something like this:

   encrypt_stuff_with_morus(obj, key);

evolving over time to:

  if (obj->type == MORUS_TYPE)
    encrypt_stuff_with_morus(obj, key);
  else if (obj->type == AEGIS_TYPE)
    encrypt_stuff_with_aegis(obj, key);

On the other hand, if the developer has good reason to use the crypto
API's dynamic dispatch and async API and so forth, then perhaps it just
changes from:

  static const char *cipher_name = "morus";

to

  static const char *cipher_name_type_1 = "morus";
  static const char *cipher_name_type_2 = "aegis";

I can imagine both programming styles and evolutions being desirable for
different reasons.

> >   - It matches exactly what Andy Polyakov's code is doing for the exact

> >     same reason, so this isn't something that's actually "new". (There

> >     are paths inside his implementation that branch from the vector code

> >     to the scalar code.)

> 

> Matches Andy's code, where?  The reason you had to add the radix conversion is

> because his code does *not* handle it...

For example, check out the avx blocks function. The radix conversion
happens in a few different places throughout. The reason we need it
separately here is because, unlike userspace, it's possible the kernel
code will transition from 2^26 back to 2^64 as a result of the FPU
context changing.

As well, AndyP seems to like the idea of including this logic in the
assembly instead of in C, if I understood our discussions correctly, so
there's a decent chance this will migrate out of the glue code and into
the assembly properly, which is probably a better place for it.

> >   - It has been discussed at length with Andy, including what kinds of

> >     proofs we'll need if we want to chop it down further (to remove that

> >     final reduction), and why we both don't want to do that yet, and so

> >     we go with the full carrying for the avoidance of risk.

> 

> Sorry, other people don't know about your private discussions.  For the rest of

> us, why not add a comment to the code explaining what's going on?

That's a good idea. I can include some discussion about this as well in
the commit message that introduces the glue code, too, I guess? I've
been hesitant to fill these commit messages up even more, given there
are already so many walls of text and whatnot, but if you think that'd
be useful, I'll do that for v7, and also add comments.

> >   - We've proved its correctness with Z3, actually using an even looser

> >     constraint on digit size than what Andy mentioned to us, thus resulting

> >     in a stronger proof result. So we're certain this isn't rubbish.

> AFAICS actually it *is* rubbish, because your C code stores the accumulator as

> 64-bit integers whereas the asm code (at least, the 32-bit version) reads it as

> 32-bit integers.  That won't work correctly on big endian ARM.

> > There's no doubt about it, we've done our due-diligence here. 

> Apparently not, given that it's broken on big endian ARM.

> Of course, having bugs in code which you insist was proven correct

> + fuzzed doesn't exactly inspire trust.

What's with the snark? It's not rubbish. I'm not sure if you noticed it in
the development trees (both the WireGuard module tree and my kernel.org
integration tree for this patch), but the big endian ARM support was fixed
pretty shortly after I jumped the gun posting v6. Like, super soon after.
That, and other big endian fixes (on aarch64 as well) are already queued up
for v7. And now build.wireguard.com has more big endian running in CI.

> The details of the correctness proofs and fuzzing you claim to have done aren't

> explained, even in the cover letter; so for now we just have to trust you on

> that point.

"Claim to have done", "trust you on that point" -- I think there's no
reason to doubt the integrity of my "claims", and I don't appreciate the
phrasing that appears to call that into question.

Regardless, sure, we can expand the "wall-of-text" commit messages even
further, if you want, and include the verbatim Z3 scripts for reproduction.

> I understand that your standards are still as high or even higher than

> Herbert's, which is good; crypto code should be held to high standards!  But

> based on the evidence, I do worry there's a double standard going on where you

> get away with things yourself which you won't allow from others in Zinc.  It's

> just not honest, and it will make people not want to contribute to Zinc.

> Maintainers are supposed to be unbiased and hold all contributions to the same

> standard.

This is complete and utter garbage, and I find its insinuations insulting
and ridiculous. There is absolutely no lack of honesty and no double
standard being applied whatsoever. Your attempt to cast doubt about the
quality of standards applied and the integrity of the process is wholly
inappropriate. When I tell you that high standards were applied and that
due-diligence was done in developing a particular patch, I mean what I
say.

> We need "Zinc" to be Linux's crypto library, not "Jason's crypto library".

This very much is a project directed toward the benefit of the kernel in
a general sense. It's been this way from the start, and there's nothing
in its goals or plans to the contrary of that. Please leave this vague
and unproductive rhetoric aside.

Jason

Eric Biggers Sept. 28, 2018, 4:55 a.m. UTC | #29

Hi Jason,

On Fri, Sep 28, 2018 at 04:35:48AM +0200, Jason A. Donenfeld wrote:
> Hi Eric,

> 

> On Thu, Sep 27, 2018 at 06:17:27PM -0700, Eric Biggers wrote:

> > So, Zinc will simultaneously replace the current crypto implementations, *and*

> > be "orthogonal" and "separate" from all the crypto code currently maintained by

> > Herbert?  You can't have your cake and eat it too...

> 

> The plan is for it to replace many uses of the crypto API where it makes

> sense, but not replace uses where it doesn't make sense. Perhaps in the

> long run, over time, its usage will grow to cover those cases too, or,

> perhaps instead, Zinc will form a simple basis of software crypto

> algorithms in whatever future API designs crop up. In other words, like

> most changes in kernel development, things happen gradually, starting

> with a few good cases, and gradually growing as the need and design

> arise.


Please re-read what I wrote.  Here I'm talking about the crypto code itself, not
its users.

And you still haven't answered my question about adding a new algorithm that is
useful to have in both APIs.  You're proposing that in most cases the crypto API
part will need to go through Herbert while the implementation will need to go
through you/Greg, right?  Or will you/Greg be taking both?  Or will Herbert take
both?

> 

> > I'm still concerned you're splitting the community in two.  It will be unclear

> > where new algorithms and implementations should go.  Some people will choose

> > Herbert and the current crypto API and conventions, and some people will choose

> > you and Zinc...  I still don't see clear guidelines for what will go where.

> 

> I can try to work out some explicit guidelines and write these up for

> Documentation/, if that'd make a difference for you. I don't think this

> is a matter of "community splitting". On the contrary, I think Zinc will

> bring communities together, inviting the larger cryptography community

> to take an interest in improving the state of crypto in Linux. Either

> way, the litmus test for where code should go remains pretty similar to

> how it's been working so far. Are you tempted to stick it in lib/

> because that fits your programming paradigm and because you think it's

> generally useful? If so, submit it to lib/zinc/. Conversely, is it only

> something used in the large array of options provided by dmcrypt, ipsec,

> afalg, etc? Submit it to the crypto API.

> 

> If you think this criteria is lacking, I'm amenable to adjusting that

> and changing it, especially as situations and designs change and morph

> over time. But that seems like a fairly decent starting point.


A documentation file for Zinc is a good idea.  Some of the information in your
commit messages should be moved there too.

> 

> > Please reach out to Herbert to find a sane solution

> > crypto development without choosing "sides".

> 

> Please, don't politicize this. This has nothing to do with "sides". This

> has to do with which paradigm makes sense for implementing a particular

> algorithm. 


I'm not trying to "politicize" this, but rather determine your criteria for
including algorithms in Zinc, which you haven't made clear.  Since for the
second time you've avoided answering my question about whether you'd allow the
SM4 cipher in Zinc, and you designed WireGuard to be "cryptographically
opinionated", and you were one of the loudest voices calling for the Speck
cipher to be removed, and your justification for Zinc complains about cipher
modes from "90s cryptographers", I think it's reasonable for people to wonder
whether as the Zinc (i.e. Linux crypto library) maintainer you will restrict the
inclusion of crypto algorithms to the ones you prefer, rather than the ones that
users need.  Note that the kernel is used by people all over the world and needs
to support various standards, protocols, and APIs that use different crypto
algorithms, not always the ones we'd like; and different users have different
preferences.  People need to know whether the Linux kernel's crypto library will
meet (or be allowed to meet) their crypto needs.

> And everything that goes in Zinc gets to be used seamlessly

> by the crypto API anyway, through use of the trivial stub glue code,

> like what I've shown in the two commits in this series. Again, if it's

> something that will work well as a direct function call, then it seems

> like Zinc makes sense as a home for it.

> 

> With that said, I've reached out to Herbert, and we'll of course discuss

> and reach a good conclusion together.

> 

> > Note that usage can change over time; a user that requires a

> > single cipher could later need multiple, and vice versa.

> 

> I think this depends on the design of the driver and the style it's

> implemented in. For example, I could imagine something like this:

> 

>    encrypt_stuff_with_morus(obj, key);

> 

> evolving over time to:

> 

>   if (obj->type == MORUS_TYPE)

>     encrypt_stuff_with_morus(obj, key);

>   else if (obj->type == AEGIS_TYPE)

>     encrypt_stuff_with_aegis(obj, key);

> 

> On the other hand, if the developer has good reason to use the crypto

> API's dynamic dispatch and async API and so forth, then perhaps it just

> changes from:

> 

>   static const char *cipher_name = "morus";

> 

> to

> 

>   static const char *cipher_name_type_1 = "morus";

>   static const char *cipher_name_type_2 = "aegis";

> 

> I can imagine both programming styles and evolutions being desirable for

> different reasons.

> 

> > >   - It matches exactly what Andy Polyakov's code is doing for the exact

> > >     same reason, so this isn't something that's actually "new". (There

> > >     are paths inside his implementation that branch from the vector code

> > >     to the scalar code.)

> > 

> > Matches Andy's code, where?  The reason you had to add the radix conversion is

> > because his code does *not* handle it...

> 

> For example, check out the avx blocks function. The radix conversion

> happens in a few different places throughout. The reason we need it

> separately here is because, unlike userspace, it's possible the kernel

> code will transition from 2^26 back to 2^64 as a result of the FPU

> context changing.


IOW, you had to rewrite the x86 assembly algorithm in C and make it use a
different Poly1305 context format.  That's a major change, where bugs can be
introduced -- and at least one was introduced, in fact.  I'd appreciate it if
you were more accurate in describing your modifications and the potential risks
involved.

And yes I know why converting radixes is needed, as I had pointed it out...

> 

> As well, AndyP seems to like the idea of including this logic in the

> assembly instead of in C, if I understood our discussions correctly, so

> there's a decent chance this will migrate out of the glue code and into

> the assembly properly, which is probably a better place for it.

> 

> > >   - It has been discussed at length with Andy, including what kinds of

> > >     proofs we'll need if we want to chop it down further (to remove that

> > >     final reduction), and why we both don't want to do that yet, and so

> > >     we go with the full carrying for the avoidance of risk.

> > 

> > Sorry, other people don't know about your private discussions.  For the rest of

> > us, why not add a comment to the code explaining what's going on?

> 

> That's a good idea. I can include some discussion about this as well in

> the commit message that introduces the glue code, too, I guess? I've

> been hesitant to fill these commit messages up even more, given there

> are already so many walls of text and whatnot, but if you think that'd

> be useful, I'll do that for v7, and also add comments.


Please prefer to put information in the code or documentation rather than in
commit messages, when appropriate.

> 

> > >   - We've proved its correctness with Z3, actually using an even looser

> > >     constraint on digit size than what Andy mentioned to us, thus resulting

> > >     in a stronger proof result. So we're certain this isn't rubbish.

> > AFAICS actually it *is* rubbish, because your C code stores the accumulator as

> > 64-bit integers whereas the asm code (at least, the 32-bit version) reads it as

> > 32-bit integers.  That won't work correctly on big endian ARM.

> > > There's no doubt about it, we've done our due-diligence here. 

> > Apparently not, given that it's broken on big endian ARM.

> > Of course, having bugs in code which you insist was proven correct

> > + fuzzed doesn't exactly inspire trust.

> 

> What's with the snark? It's not rubbish. I'm not sure if you noticed it in

> the development trees (both the WireGuard module tree and my kernel.org

> integration tree for this patch), but the big endian ARM support was fixed

> pretty shortly after I jumped the gun posting v6. Like, super soon after.

> That, and other big endian fixes (on aarch64 as well) are already queued up

> for v7. And now build.wireguard.com has more big endian running in CI.

> 

> > The details of the correctness proofs and fuzzing you claim to have done aren't

> > explained, even in the cover letter; so for now we just have to trust you on

> > that point.

> 

> "Claim to have done", "trust you on that point" -- I think there's no

> reason to doubt the integrity of my "claims", and I don't appreciate the

> phrasing that appears to call that into question.

> 

> Regardless, sure, we can expand the "wall-of-text" commit messages even

> further, if you want, and include the verbatim Z3 scripts for reproduction.

> 

> > I understand that your standards are still as high or even higher than

> > Herbert's, which is good; crypto code should be held to high standards!  But

> > based on the evidence, I do worry there's a double standard going on where you

> > get away with things yourself which you won't allow from others in Zinc.  It's

> > just not honest, and it will make people not want to contribute to Zinc.

> > Maintainers are supposed to be unbiased and hold all contributions to the same

> > standard.

> 

> This is complete and utter garbage, and I find its insinuations insulting

> and ridiculous. There is absolutely no lack of honesty and no double

> standard being applied whatsoever. Your attempt to cast doubt about the

> quality of standards applied and the integrity of the process is wholly

> inappropriate. When I tell you that high standards were applied and that

> due-diligence was done in developing a particular patch, I mean what I

> say.


No, I was not aware of the fixed version.  I found the bug myself reading the
latest patchset you've mailed out, v6.  It's great that you found and fixed this
bug on your own, and of course that raises my level of confidence in your work.
Still, the v6 patchset still includes your claim that "All implementations have
been extensively tested and fuzzed"; so that claim was objectively wrong.  So I
disagree that my concerns are "complete and utter garbage".  And I think that
the fact that you prefer to respond by attacking me, rather than committing to
be more accurate in your claims and to treat all contributions fairly, is
problematic.

Also, if you have extra tests that you're running, it would be great if you
could include them as part of the submission so that others can run them too.

> 

> > We need "Zinc" to be Linux's crypto library, not "Jason's crypto library".

> 

> This very much is a project directed toward the benefit of the kernel in

> a general sense. It's been this way from the start, and there's nothing

> in its goals or plans to the contrary of that. Please leave this vague

> and unproductive rhetoric aside.

> 


Well, it doesn't help that you yourself claim that Zinc stands for
"Zx2c4's INsane Cryptolib"...

- Eric

Jason A. Donenfeld Sept. 28, 2018, 5:46 a.m. UTC | #30

Hi Eric,

On Fri, Sep 28, 2018 at 6:55 AM Eric Biggers <ebiggers@kernel.org> wrote:
> And you still haven't answered my question about adding a new algorithm that is

> useful to have in both APIs.  You're proposing that in most cases the crypto API

> part will need to go through Herbert while the implementation will need to go

> through you/Greg, right?  Or will you/Greg be taking both?  Or will Herbert take

> both?

If an implementation enters Zinc, it will go through my tree. If it
enters the crypto API, it will go through Herbert's tree. If there
wind up being messy tree dependency and merge timing issues, I can
work this out in the usual way with Herbert. I'll be sure to discuss
these edge cases with him when we discuss. I think there's a way to
handle that with minimum friction.

> A documentation file for Zinc is a good idea.  Some of the information in your

> commit messages should be moved there too.

Will do.

> I'm not trying to "politicize" this, but rather determine your criteria for

> including algorithms in Zinc, which you haven't made clear.  Since for the

> second time you've avoided answering my question about whether you'd allow the

> SM4 cipher in Zinc, and you designed WireGuard to be "cryptographically

> opinionated", and you were one of the loudest voices calling for the Speck

> cipher to be removed, and your justification for Zinc complains about cipher

> modes from "90s cryptographers", I think it's reasonable for people to wonder

> whether as the Zinc (i.e. Linux crypto library) maintainer you will restrict the

> inclusion of crypto algorithms to the ones you prefer, rather than the ones that

> users need.  Note that the kernel is used by people all over the world and needs

> to support various standards, protocols, and APIs that use different crypto

> algorithms, not always the ones we'd like; and different users have different

> preferences.  People need to know whether the Linux kernel's crypto library will

> meet (or be allowed to meet) their crypto needs.

WireGuard is indeed quite opinionated in its primitive choices, but I
don't think it'd be wise to apply the same design to Zinc. There are
APIs where the goal is to have a limited set of high-level functions,
and have those implemented in only a preferred set of primitives. NaCl
is a good example of this -- functions like "crypto_secretbox" that
are actually xsalsapoly under the hood. Zinc doesn't intend to become
an API like those, but rather to provide the actual primitives for use
cases where that specific primitive is used. Sometimes the kernel is
in the business of being able to pick its own crypto -- random.c, tcp
sequence numbers, big_key.c, etc -- but most of the time the kernel is
in the business of implementing other people's crypto, for specific
devices/protocols/diskformats. And for those use cases we need not
some high-level API like NaCl, but rather direct access to the
primitives that are required to implement those drivers. WireGuard is
one such example, but so would be Bluetooth, CIFS/SMB, WiFi, and so
on. We're in the business of writing drivers, after all. So, no, I
don't think I'd knock down the addition of a primitive because of a
simple preference for a different primitive, if it was clearly the
case that the driver requiring it really benefited from having
accessible via the plain Zinc function calls. Sorry if I hadn't made
this clear earlier -- I thought Ard had asked more or less the same
thing about DES and I answered accordingly, but maybe that wasn't made
clear enough there.

> > For example, check out the avx blocks function. The radix conversion

> > happens in a few different places throughout. The reason we need it

> > separately here is because, unlike userspace, it's possible the kernel

> > code will transition from 2^26 back to 2^64 as a result of the FPU

> > context changing.

>

> IOW, you had to rewrite the x86 assembly algorithm in C and make it use a

> different Poly1305 context format.  That's a major change, where bugs can be

> introduced -- and at least one was introduced, in fact.  I'd appreciate it if

> you were more accurate in describing your modifications and the potential risks

> involved.

A different Poly1305 context format? Not at all - it's using the exact
same context struct as the assembly. And it's making the same
conversion that the assembly is. There's not something "different"
happening; that's the whole point.

Also, this is not some process of frightfully transcribing assembly to
C and hoping it all works out. This is a careful process of reasoning
about the limbs, proving that the carries are correct, and that the
arithmetic done in C actually corresponds to what we want. And then
finally we check that what we've implemented is indeed the same as
what the assembly implemented. Finally, as I mentioned, hopefully Andy
will be folding this back into his implementations sometime soon
anyway.

> > That's a good idea. I can include some discussion about this as well in

> > the commit message that introduces the glue code, too, I guess? I've

> > been hesitant to fill these commit messages up even more, given there

> > are already so many walls of text and whatnot, but if you think that'd

> > be useful, I'll do that for v7, and also add comments.

>

> Please prefer to put information in the code or documentation rather than in

> commit messages, when appropriate.

Okay, no problem.

> > This is complete and utter garbage, and I find its insinuations insulting

> > and ridiculous. There is absolutely no lack of honesty and no double

> > standard being applied whatsoever. Your attempt to cast doubt about the

> > quality of standards applied and the integrity of the process is wholly

> > inappropriate. When I tell you that high standards were applied and that

> > due-diligence was done in developing a particular patch, I mean what I

> > say.

>

> I

> disagree that my concerns are "complete and utter garbage".  And I think that

> the fact that you prefer to respond by attacking me, rather than committing to

> be more accurate in your claims and to treat all contributions fairly, is

> problematic.

I believe you have the sequence of events wrong. I've stated and made
that there isn't a double standard, that the standards intend to be
evenly rigorous, and I solicited your feedback on how I could best
communicate changes in pre-merged patch series; I did the things
you've said one should do. But your rhetoric then moved to questioning
my integrity, and I believe that was uncalled for, inappropriate, and
not a useful contribution to this mostly productive discussion --
hence garbage. In other words, I provided an acceptable defense, not
an attack. But can we move past all this schoolyard nonsense? Cut the
rhetoric, please; it's really quite overwhelming.

> It's great that you found and fixed this

> bug on your own, and of course that raises my level of confidence in your work.

Good.

> Still, the v6 patchset still includes your claim that "All implementations have

> been extensively tested and fuzzed"; so that claim was objectively wrong.

I don't think that claim is wrong. The fuzzing and testing that's been
done has been extensive, and the fuzzing and testing that continues to
occur is extensive. As mentioned, the bug was fixed pretty much right
after git-send-email. If you'd like I can try to space out the patch
submissions by a little bit longer -- that would probably have various
benefits -- but given that the netdev code is yet to receive a
thorough review, I think we've got a bit of time before this is
merged. The faster-paced patch cycles might inadvertently introduce
things that get fixed immediately after sending then, unfortunately,
but I don't think this will be the case with the final series that's
merged. Though I'm incorporating tons and tons of feedback right now,
I do look forward to the structure of the series settling down a
little bit and the pace of suggestions decreasing, so that I can focus
on auditing and verifying again. But given the dynamism and
interesting insights of the reviews so far, I think the fast pace has
actually been useful for elucidating the various expectations and
suggestions of reviewers. It's most certainly improved this patchset
in terrific ways.

> Well, it doesn't help that you yourself claim that Zinc stands for

> "Zx2c4's INsane Cryptolib"...

This expansion of the acronym was intended as a totally ridiculous
joke. I guess it wasn't received well. I'll remove it from the next
series. Sorry.

Jason

Ard Biesheuvel Sept. 28, 2018, 8:49 a.m. UTC | #31

(+ Joe)

On 28 September 2018 at 10:28, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
> On 25 September 2018 at 16:56, Jason A. Donenfeld <Jason@zx2c4.com> wrote:

>> Sometimes it's useful to amortize calls to XSAVE/XRSTOR and the related

>> FPU/SIMD functions over a number of calls, because FPU restoration is

>> quite expensive. This adds a simple header for carrying out this pattern:

>>

>>     simd_context_t simd_context;

>>

>>     simd_get(&simd_context);

>>     while ((item = get_item_from_queue()) != NULL) {

>>         encrypt_item(item, simd_context);

>>         simd_relax(&simd_context);

>>     }

>>     simd_put(&simd_context);

>>

>> The relaxation step ensures that we don't trample over preemption, and

>> the get/put API should be a familiar paradigm in the kernel.

>>

>> On the other end, code that actually wants to use SIMD instructions can

>> accept this as a parameter and check it via:

>>

>>    void encrypt_item(struct item *item, simd_context_t *simd_context)

>>    {

>>        if (item->len > LARGE_FOR_SIMD && simd_use(simd_context))

>>            wild_simd_code(item);

>>        else

>>            boring_scalar_code(item);

>>    }

>>

>> The actual XSAVE happens during simd_use (and only on the first time),

>> so that if the context is never actually used, no performance penalty is

>> hit.

>>

>> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>

>> Cc: Samuel Neves <sneves@dei.uc.pt>

>> Cc: Andy Lutomirski <luto@kernel.org>

>> Cc: Thomas Gleixner <tglx@linutronix.de>

>> Cc: Greg KH <gregkh@linuxfoundation.org>

>> Cc: linux-arch@vger.kernel.org

>> ---

>>  arch/alpha/include/asm/Kbuild      |  5 ++-

>>  arch/arc/include/asm/Kbuild        |  1 +

>>  arch/arm/include/asm/simd.h        | 63 ++++++++++++++++++++++++++++++

>>  arch/arm64/include/asm/simd.h      | 51 +++++++++++++++++++++---

>>  arch/c6x/include/asm/Kbuild        |  3 +-

>>  arch/h8300/include/asm/Kbuild      |  3 +-

>>  arch/hexagon/include/asm/Kbuild    |  1 +

>>  arch/ia64/include/asm/Kbuild       |  1 +

>>  arch/m68k/include/asm/Kbuild       |  1 +

>>  arch/microblaze/include/asm/Kbuild |  1 +

>>  arch/mips/include/asm/Kbuild       |  1 +

>>  arch/nds32/include/asm/Kbuild      |  7 ++--

>>  arch/nios2/include/asm/Kbuild      |  1 +

>>  arch/openrisc/include/asm/Kbuild   |  7 ++--

>>  arch/parisc/include/asm/Kbuild     |  1 +

>>  arch/powerpc/include/asm/Kbuild    |  3 +-

>>  arch/riscv/include/asm/Kbuild      |  3 +-

>>  arch/s390/include/asm/Kbuild       |  3 +-

>>  arch/sh/include/asm/Kbuild         |  1 +

>>  arch/sparc/include/asm/Kbuild      |  1 +

>>  arch/um/include/asm/Kbuild         |  3 +-

>>  arch/unicore32/include/asm/Kbuild  |  1 +

>>  arch/x86/include/asm/simd.h        | 44 ++++++++++++++++++++-

>>  arch/xtensa/include/asm/Kbuild     |  1 +

>>  include/asm-generic/simd.h         | 20 ++++++++++

>>  include/linux/simd.h               | 28 +++++++++++++

>>  26 files changed, 234 insertions(+), 21 deletions(-)

>>  create mode 100644 arch/arm/include/asm/simd.h

>>  create mode 100644 include/linux/simd.h

>>

>> diff --git a/arch/alpha/include/asm/Kbuild b/arch/alpha/include/asm/Kbuild

>> index 0580cb8c84b2..07b2c1025d34 100644

>> --- a/arch/alpha/include/asm/Kbuild

>> +++ b/arch/alpha/include/asm/Kbuild

>> @@ -2,14 +2,15 @@

>>

>>

>>  generic-y += compat.h

>> +generic-y += current.h

>>  generic-y += exec.h

>>  generic-y += export.h

>>  generic-y += fb.h

>>  generic-y += irq_work.h

>> +generic-y += kprobes.h

>>  generic-y += mcs_spinlock.h

>>  generic-y += mm-arch-hooks.h

>>  generic-y += preempt.h

>>  generic-y += sections.h

>> +generic-y += simd.h

>>  generic-y += trace_clock.h

>> -generic-y += current.h

>> -generic-y += kprobes.h

>

> Given that this patch applies to all architectures at once, it is

> probably better to drop the unrelated reordering hunks to avoid

> conflicts.

>

>> diff --git a/arch/arc/include/asm/Kbuild b/arch/arc/include/asm/Kbuild

>> index feed50ce89fa..a7f4255f1649 100644

>> --- a/arch/arc/include/asm/Kbuild

>> +++ b/arch/arc/include/asm/Kbuild

>> @@ -22,6 +22,7 @@ generic-y += parport.h

>>  generic-y += pci.h

>>  generic-y += percpu.h

>>  generic-y += preempt.h

>> +generic-y += simd.h

>>  generic-y += topology.h

>>  generic-y += trace_clock.h

>>  generic-y += user.h

>> diff --git a/arch/arm/include/asm/simd.h b/arch/arm/include/asm/simd.h

>> new file mode 100644

>> index 000000000000..263950dd69cb

>> --- /dev/null

>> +++ b/arch/arm/include/asm/simd.h

>> @@ -0,0 +1,63 @@

>> +/* SPDX-License-Identifier: GPL-2.0

>> + *

>> + * Copyright (C) 2015-2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.

>> + */

>> +

>> +#include <linux/simd.h>

>> +#ifndef _ASM_SIMD_H

>> +#define _ASM_SIMD_H

>> +

>> +#ifdef CONFIG_KERNEL_MODE_NEON

>> +#include <asm/neon.h>

>> +

>> +static __must_check inline bool may_use_simd(void)

>> +{

>> +       return !in_interrupt();

>> +}

>> +

>

> Remember this guy?

>

> https://marc.info/?l=linux-arch&m=149631094625176&w=2

>

> That was never merged, so let's get it right this time.

>

...
>> diff --git a/include/linux/simd.h b/include/linux/simd.h

>> new file mode 100644

>> index 000000000000..33bba21012ff

>> --- /dev/null

>> +++ b/include/linux/simd.h

>> @@ -0,0 +1,28 @@

>> +/* SPDX-License-Identifier: GPL-2.0

>> + *

>> + * Copyright (C) 2015-2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.

>> + */

>> +

>> +#ifndef _SIMD_H

>> +#define _SIMD_H

>> +

>> +typedef enum {

>> +       HAVE_NO_SIMD = 1 << 0,

>> +       HAVE_FULL_SIMD = 1 << 1,

>> +       HAVE_SIMD_IN_USE = 1 << 31

>> +} simd_context_t;

>> +


Oh, and another thing (and I'm surprised checkpatch.pl didn't complain
about it): the use of typedef in new code is strongly discouraged.
This policy predates my involvement, so perhaps Joe can elaborate on
the rationale?


>> +#include <linux/sched.h>

>> +#include <asm/simd.h>

>> +

>> +static inline void simd_relax(simd_context_t *ctx)

>> +{

>> +#ifdef CONFIG_PREEMPT

>> +       if ((*ctx & HAVE_SIMD_IN_USE) && need_resched()) {

>> +               simd_put(ctx);

>> +               simd_get(ctx);

>> +       }

>> +#endif

>

> Could we return a bool here indicating whether we rescheduled or not?

> In some cases, we could pass that into the asm code as a 'reload'

> param, allowing repeated loads of key schedules, round constant tables

> or S-boxes to be elided.

>

>> +}

>> +

>> +#endif /* _SIMD_H */

>> --

>> 2.19.0

>>

Jason A. Donenfeld Sept. 28, 2018, 1:40 p.m. UTC | #32

On Fri, Sep 28, 2018 at 9:52 AM Ard Biesheuvel
<ard.biesheuvel@linaro.org> wrote:
> As I understood from someone who was at your Kernel Recipes talk, you

> mentioned that it actually stands for 'zinc is not crypto/' (note the

> slash)

I mentioned this was in v1 but it wasn't taken as lightly as planned
and was removed, so if it needs a recursive acronym it can be "zinc is
nice crypto", "zinc is neat crypto", or as the commit message
mentions, "zinc as in crypto." Alternatively, it can just not stand
for anything at all. Zinc -> Zinc. That's what the thing is called.

Jason A. Donenfeld Sept. 28, 2018, 1:45 p.m. UTC | #33

On Fri, Sep 28, 2018 at 10:28 AM Ard Biesheuvel
<ard.biesheuvel@linaro.org> wrote:
> Given that this patch applies to all architectures at once, it is

> probably better to drop the unrelated reordering hunks to avoid

> conflicts.


Ack. Will retain order for v7.

> > +static __must_check inline bool may_use_simd(void)

> > +{

> > +       return !in_interrupt();

> > +}

> > +

>

> Remember this guy?

>

> https://marc.info/?l=linux-arch&m=149631094625176&w=2

>

> That was never merged, so let's get it right this time.


Wow, nice memory. I had forgotten all about that. Queued for v7.

> > +static inline void simd_relax(simd_context_t *ctx)

> > +{

> > +#ifdef CONFIG_PREEMPT

> > +       if ((*ctx & HAVE_SIMD_IN_USE) && need_resched()) {

> > +               simd_put(ctx);

> > +               simd_get(ctx);

> > +       }

> > +#endif

>

> Could we return a bool here indicating whether we rescheduled or not?

> In some cases, we could pass that into the asm code as a 'reload'

> param, allowing repeated loads of key schedules, round constant tables

> or S-boxes to be elided.


Sure, sounds easy enough.

Jason A. Donenfeld Sept. 28, 2018, 1:59 p.m. UTC | #34

On Fri, Sep 28, 2018 at 3:58 PM Ard Biesheuvel
<ard.biesheuvel@linaro.org> wrote:
>

> On 28 September 2018 at 15:47, Jason A. Donenfeld <Jason@zx2c4.com> wrote:

> > On Fri, Sep 28, 2018 at 10:49 AM Ard Biesheuvel

> > <ard.biesheuvel@linaro.org> wrote:

> >> >> +typedef enum {

> >> >> +       HAVE_NO_SIMD = 1 << 0,

> >> >> +       HAVE_FULL_SIMD = 1 << 1,

> >> >> +       HAVE_SIMD_IN_USE = 1 << 31

> >> >> +} simd_context_t;

> >> >> +

> >>

> >> Oh, and another thing (and I'm surprised checkpatch.pl didn't complain

> >> about it): the use of typedef in new code is strongly discouraged.

> >> This policy predates my involvement, so perhaps Joe can elaborate on

> >> the rationale?

> >

> > In case it matters, the motivation for making this a typedef is I

> > could imagine this at some point turning into a more complicated

> > struct on certain platforms and that would make refactoring easier. I

> > could just make it `struct simd_context` now with 1 member though...

>

> Yes that makes sense


The rationale for it being a typedef or moving to a struct now?

Ard Biesheuvel Sept. 28, 2018, 2 p.m. UTC | #35

On 28 September 2018 at 15:59, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> On Fri, Sep 28, 2018 at 3:58 PM Ard Biesheuvel

> <ard.biesheuvel@linaro.org> wrote:

>>

>> On 28 September 2018 at 15:47, Jason A. Donenfeld <Jason@zx2c4.com> wrote:

>> > On Fri, Sep 28, 2018 at 10:49 AM Ard Biesheuvel

>> > <ard.biesheuvel@linaro.org> wrote:

>> >> >> +typedef enum {

>> >> >> +       HAVE_NO_SIMD = 1 << 0,

>> >> >> +       HAVE_FULL_SIMD = 1 << 1,

>> >> >> +       HAVE_SIMD_IN_USE = 1 << 31

>> >> >> +} simd_context_t;

>> >> >> +

>> >>

>> >> Oh, and another thing (and I'm surprised checkpatch.pl didn't complain

>> >> about it): the use of typedef in new code is strongly discouraged.

>> >> This policy predates my involvement, so perhaps Joe can elaborate on

>> >> the rationale?

>> >

>> > In case it matters, the motivation for making this a typedef is I

>> > could imagine this at some point turning into a more complicated

>> > struct on certain platforms and that would make refactoring easier. I

>> > could just make it `struct simd_context` now with 1 member though...

>>

>> Yes that makes sense

>

> The rationale for it being a typedef or moving to a struct now?


Yes just switch to a struct.

Jason A. Donenfeld Sept. 28, 2018, 2:01 p.m. UTC | #36

On Fri, Sep 28, 2018 at 4:00 PM Ard Biesheuvel
<ard.biesheuvel@linaro.org> wrote:
>

> On 28 September 2018 at 15:59, Jason A. Donenfeld <Jason@zx2c4.com> wrote:

> > On Fri, Sep 28, 2018 at 3:58 PM Ard Biesheuvel

> > <ard.biesheuvel@linaro.org> wrote:

> >>

> >> On 28 September 2018 at 15:47, Jason A. Donenfeld <Jason@zx2c4.com> wrote:

> >> > On Fri, Sep 28, 2018 at 10:49 AM Ard Biesheuvel

> >> > <ard.biesheuvel@linaro.org> wrote:

> >> >> >> +typedef enum {

> >> >> >> +       HAVE_NO_SIMD = 1 << 0,

> >> >> >> +       HAVE_FULL_SIMD = 1 << 1,

> >> >> >> +       HAVE_SIMD_IN_USE = 1 << 31

> >> >> >> +} simd_context_t;

> >> >> >> +

> >> >>

> >> >> Oh, and another thing (and I'm surprised checkpatch.pl didn't complain

> >> >> about it): the use of typedef in new code is strongly discouraged.

> >> >> This policy predates my involvement, so perhaps Joe can elaborate on

> >> >> the rationale?

> >> >

> >> > In case it matters, the motivation for making this a typedef is I

> >> > could imagine this at some point turning into a more complicated

> >> > struct on certain platforms and that would make refactoring easier. I

> >> > could just make it `struct simd_context` now with 1 member though...

> >>

> >> Yes that makes sense

> >

> > The rationale for it being a typedef or moving to a struct now?

>

> Yes just switch to a struct.


Okay. No problem with that, but will wait to hear from Joe first.

Jason A. Donenfeld Sept. 28, 2018, 3:04 p.m. UTC | #37

On Fri, Sep 28, 2018 at 5:01 PM Andrew Lunn <andrew@lunn.ch> wrote:
> The namespace is more than just about the linker. I see an Opps stack

> trace with wg_ symbols, i know i need to talk to Jason. Without any

> prefix, i have to go digging into the code to find out who i need to

> talk to. This is one reason why often every symbol has the prefix, not

> just the global scope ones.


Good point. I'll see what the wg_ prefixing looks like for all the
static functions too.

Jason A. Donenfeld Sept. 29, 2018, 2:20 a.m. UTC | #38

Hi Ard,

On Fri, Sep 28, 2018 at 6:02 PM Ard Biesheuvel
<ard.biesheuvel@linaro.org> wrote:
> Please put comments like this below the ---

git-notes is nice for this indeed.

> Are these CONFIG_ symbols defined anywhere at this point?

Yes, they're introduced in the first zinc commit. There's no git-blame
on git.kernel.org, presumably because it's expensive to compute, but
there is on my personal instance, so this might help:
https://git.zx2c4.com/linux-dev/blame/lib/zinc/Kconfig?h=jd/wireguard

> In any case, I don't think these is a reason for these, at least not

> on ARM/arm64. The 64-bitness is implied in both cases

You mean to say that since these nobs are def_bool y and are
essentially "depends on ARM", then I should just straight up use
CONFIG_ARM? I had thought about this, but figured this would make it
easier to later make these optional or have other options block them
need be, or even if the dependencies and requirements for having them
changes (for example, with UML on x86). I think doing it this way
gives us some flexibility later on. So if that's a compelling enough
reason, I'd like to keep those.

> and the

> dependency on !CPU_32v3 you introduce (looking at the version of

> Kconfig at the end of the series) seems spurious to me. Was that added

> because of some kbuild robot report? (we don't support ARMv3 in the

> kernel but ARCH_RPC is built in v3 mode because of historical reasons

> while the actual core is a v4)

I added the !CPU_32v3 in my development tree after posting v6, so good
to hear you're just looking straight at the updated tree. If you see
things jump out in there prior to me posting v7, don't hesitate to let
me know.

The reason it was added was indeed because of:
https://lists.01.org/pipermail/kbuild-all/2018-September/053114.html
-- exactly what you suspected, ARCH_RPC. Have a better suggestion than
!CPU_32v3? It seems to me like so long as the kernel has CPU_32v3 as a
thing in any form, I should mark Zinc as not supporting it, since
we'll certainly be at least v4 and up. (Do you guys have any old Acorn
ARM610 boxes sitting around for old time's sake at LinaroHQ? ;-)

> > +#endif

> > +

>

> No need to make asmlinkage declarations conditional

Yep, addressed in the IS_ENABLED cleanup.

>

> if (IS_ENABLED())

Sorted.

>

> """

> if (!IS_ENABLED(CONFIG_ARM))

>    return false;

>

> hchacha20_arm(x, derived_key);

> return true;

> """

>

> and drop the #ifdefs

Also sorted.

Regards,
Jason

Jason A. Donenfeld Sept. 29, 2018, 2:40 a.m. UTC | #39

[+Willy]

Hi Ard,

On Fri, Sep 28, 2018 at 7:47 PM Ard Biesheuvel
<ard.biesheuvel@linaro.org> wrote:
>

> On 28 September 2018 at 07:46, Jason A. Donenfeld <Jason@zx2c4.com> wrote:

> > WireGuard is indeed quite opinionated in its primitive choices, but I

> > don't think it'd be wise to apply the same design to Zinc. There are

> > APIs where the goal is to have a limited set of high-level functions,

> > and have those implemented in only a preferred set of primitives. NaCl

> > is a good example of this -- functions like "crypto_secretbox" that

> > are actually xsalsapoly under the hood. Zinc doesn't intend to become

> > an API like those, but rather to provide the actual primitives for use

> > cases where that specific primitive is used. Sometimes the kernel is

> > in the business of being able to pick its own crypto -- random.c, tcp

> > sequence numbers, big_key.c, etc -- but most of the time the kernel is

> > in the business of implementing other people's crypto, for specific

> > devices/protocols/diskformats. And for those use cases we need not

> > some high-level API like NaCl, but rather direct access to the

> > primitives that are required to implement those drivers. WireGuard is

> > one such example, but so would be Bluetooth, CIFS/SMB, WiFi, and so

> > on. We're in the business of writing drivers, after all. So, no, I

> > don't think I'd knock down the addition of a primitive because of a

> > simple preference for a different primitive, if it was clearly the

> > case that the driver requiring it really benefited from having

> > accessible via the plain Zinc function calls. Sorry if I hadn't made

> > this clear earlier -- I thought Ard had asked more or less the same

> > thing about DES and I answered accordingly, but maybe that wasn't made

> > clear enough there.

> >

>

> CRC32 is another case study that I would like to bring up:

> - the current status is a bit of a mess, where we treat crc32() and

> crc32c() differently - the latter is exposed by libcrc32.ko as a

> wrapper around the crypto API, and there is some networking code (SCTP

> iirc) that puts another kludge on top of that to ensure that the code

> is not used before the module is loaded

> - we have C versions, scalar hw instruction based versions and

> carryless multiplication based versions for various architectures

> - on ST platforms, we have a synchronous hw accelerator that is 10x

> faster than C code on the same platform.

>

> AFAICT none of its current users rely on the async interface or

> runtime dispatch of the 'hash'.

I've added Willy to the CC, as we were actually discussing the crc32
case together two days ago. If my understanding was correct, he seemed
to think it'd be pretty useful too.

It seems like unifying the crc32 implementations at some point would
make sense, and since there's no usage of these as part of the crypto
api, providing crc32 via a vanilla function call seems reasonable.
It's not clear that on some strict formal level, crc32 belongs
anywhere near Zinc, since it's not _exactly_ in the same category...
But one could broaden the scope a bit and make the argument that this
general idea of accepting some bytes, doing some "pure" arithmetic
operations from a particular discipline on them in slightly different
ways depending on the architecture, and then returning some other
bytes, is what in essence is happening with these all of these
function calls, no matter their security or intended use; so if crc32
would benefit from being implemented using the exact same design
patterns, then it might as well be grouped in with the rest of Zinc.
On the other hand, this might be a bit of a slippery slope ("are
compression functions next?"), and a libcrc32.ko could just as well
exist as a standalone thing. I can see it both ways and some other
arguments too, but also this might form a good setting for some much
needed cleanup of the crc32. So that's definitely something we can
consider.

Regarding the synchronous ST driver: I'll give it a thorough read
through, but I do wonder off the bat if actually it'd be possible to
incorporate that into the Zinc paradigm in a fairly non-intrusive way.
I'll give that some thought and play around a bit with it.

Jason

Ard Biesheuvel Sept. 29, 2018, 6:16 a.m. UTC | #40

On 29 September 2018 at 04:20, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> Hi Ard,

>

> On Fri, Sep 28, 2018 at 6:02 PM Ard Biesheuvel

> <ard.biesheuvel@linaro.org> wrote:

>> Please put comments like this below the ---

>

> git-notes is nice for this indeed.

>

>> Are these CONFIG_ symbols defined anywhere at this point?

>

> Yes, they're introduced in the first zinc commit. There's no git-blame

> on git.kernel.org, presumably because it's expensive to compute, but

> there is on my personal instance, so this might help:

> https://git.zx2c4.com/linux-dev/blame/lib/zinc/Kconfig?h=jd/wireguard

>

>> In any case, I don't think these is a reason for these, at least not

>> on ARM/arm64. The 64-bitness is implied in both cases

>

> You mean to say that since these nobs are def_bool y and are

> essentially "depends on ARM", then I should just straight up use

> CONFIG_ARM? I had thought about this, but figured this would make it

> easier to later make these optional or have other options block them

> need be, or even if the dependencies and requirements for having them

> changes (for example, with UML on x86). I think doing it this way

> gives us some flexibility later on. So if that's a compelling enough

> reason, I'd like to keep those.

>


Sure. But probably better to be consistent then, and stop using
CONFIG_ARM directly in your code.

>> and the

>> dependency on !CPU_32v3 you introduce (looking at the version of

>> Kconfig at the end of the series) seems spurious to me. Was that added

>> because of some kbuild robot report? (we don't support ARMv3 in the

>> kernel but ARCH_RPC is built in v3 mode because of historical reasons

>> while the actual core is a v4)

>

> I added the !CPU_32v3 in my development tree after posting v6, so good

> to hear you're just looking straight at the updated tree. If you see

> things jump out in there prior to me posting v7, don't hesitate to let

> me know.

>

> The reason it was added was indeed because of:

> https://lists.01.org/pipermail/kbuild-all/2018-September/053114.html

> -- exactly what you suspected, ARCH_RPC. Have a better suggestion than

> !CPU_32v3?


Yes, you could just add

asflags-$(CONFIG_CPU_32v3) += -march=armv4

with a comment stating that we don't actually support ARMv3 but only
use it as a code generation target for reasons unrelated to the ISA

> It seems to me like so long as the kernel has CPU_32v3 as a

> thing in any form, I should mark Zinc as not supporting it, since

> we'll certainly be at least v4 and up. (Do you guys have any old Acorn

> ARM610 boxes sitting around for old time's sake at LinaroHQ? ;-)

>


AFAIK we only support the StrongARM flavor of RiscPC which is ARMv4.
But yes, some people do care deeply about these antiquated platforms,
including Russell, and there is no particular to leave this one
behind.

>> > +#endif

>> > +

>>

>> No need to make asmlinkage declarations conditional

>

> Yep, addressed in the IS_ENABLED cleanup.

>

>>

>> if (IS_ENABLED())

>

> Sorted.

>

>>

>> """

>> if (!IS_ENABLED(CONFIG_ARM))

>>    return false;

>>

>> hchacha20_arm(x, derived_key);

>> return true;

>> """

>>

>> and drop the #ifdefs

>

> Also sorted.

>

> Regards,

> Jason

Joe Perches Sept. 30, 2018, 4:20 a.m. UTC | #41

On Fri, 2018-09-28 at 16:01 +0200, Jason A. Donenfeld wrote:
> On Fri, Sep 28, 2018 at 4:00 PM Ard Biesheuvel

> <ard.biesheuvel@linaro.org> wrote:

> > 

> > On 28 September 2018 at 15:59, Jason A. Donenfeld <Jason@zx2c4.com> wrote:

> > > On Fri, Sep 28, 2018 at 3:58 PM Ard Biesheuvel

> > > <ard.biesheuvel@linaro.org> wrote:

> > > > 

> > > > On 28 September 2018 at 15:47, Jason A. Donenfeld <Jason@zx2c4.com> wrote:

> > > > > On Fri, Sep 28, 2018 at 10:49 AM Ard Biesheuvel

> > > > > <ard.biesheuvel@linaro.org> wrote:

> > > > > > > > +typedef enum {

> > > > > > > > +       HAVE_NO_SIMD = 1 << 0,

> > > > > > > > +       HAVE_FULL_SIMD = 1 << 1,

> > > > > > > > +       HAVE_SIMD_IN_USE = 1 << 31

> > > > > > > > +} simd_context_t;

> > > > > > > > +

> > > > > > 

> > > > > > Oh, and another thing (and I'm surprised checkpatch.pl didn't complain

> > > > > > about it): the use of typedef in new code is strongly discouraged.

> > > > > > This policy predates my involvement, so perhaps Joe can elaborate on

> > > > > > the rationale?

> > > > > 

> > > > > In case it matters, the motivation for making this a typedef is I

> > > > > could imagine this at some point turning into a more complicated

> > > > > struct on certain platforms and that would make refactoring easier. I

> > > > > could just make it `struct simd_context` now with 1 member though...

> > > > 

> > > > Yes that makes sense

> > > 

> > > The rationale for it being a typedef or moving to a struct now?

> > 

> > Yes just switch to a struct.

> 

> Okay. No problem with that, but will wait to hear from Joe first.


Why do you need to hear from me again?

As far as I know, the only info about typedef avoidance are in
Documentation/process/coding-style.rst section 5.

Jason A. Donenfeld Oct. 1, 2018, 1:43 a.m. UTC | #42

On Sun, Sep 30, 2018 at 7:35 AM Andy Lutomirski <luto@amacapital.net> wrote:
> >>>>>>> Oh, and another thing (and I'm surprised checkpatch.pl didn't complain

> >>>>>>> about it): the use of typedef in new code is strongly discouraged.

> >>>>>>> This policy predates my involvement, so perhaps Joe can elaborate on

> >>>>>>> the rationale?

> >>>>>>

> >>>>>> In case it matters, the motivation for making this a typedef is I

> >>>>>> could imagine this at some point turning into a more complicated

> >>>>>> struct on certain platforms and that would make refactoring easier. I

> >>>>>> could just make it `struct simd_context` now with 1 member though...

> >>>>>

> >>>>> Yes that makes sense

> >>>>

> >>>> The rationale for it being a typedef or moving to a struct now?

> >>>

> >>> Yes just switch to a struct.

> >>

> >> Okay. No problem with that, but will wait to hear from Joe first.

> >

> > Why do you need to hear from me again?

> >

> > As far as I know, the only info about typedef avoidance are in

> > Documentation/process/coding-style.rst section 5.

> >

> >

>

> I personally prefer it with the typedef. If this were my code, I’d say the coding style is silly for opaque tiny structs like this.


I'll stick with a typedef. Reading the style guide, this clearly falls
into 5.a, 5.b, and maybe 5.c. For 5.a, at some point this will
possibly contain architecture specific blobs. For 5.b, it is just an
enum (integer) right now.

Ard Biesheuvel Oct. 2, 2018, 4:59 p.m. UTC | #43

Hi Jason,

On 25 September 2018 at 16:56, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> This comes from Dan Bernstein and Peter Schwabe's public domain NEON

> code, and has been modified to be friendly for kernel space, as well as

> removing some qhasm strangeness to be more idiomatic.

>

> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>

> Cc: Samuel Neves <sneves@dei.uc.pt>

> Cc: Andy Lutomirski <luto@kernel.org>

> Cc: Greg KH <gregkh@linuxfoundation.org>

> Cc: Jean-Philippe Aumasson <jeanphilippe.aumasson@gmail.com>

> Cc: Russell King <linux@armlinux.org.uk>

> Cc: linux-arm-kernel@lists.infradead.org

> ---

>  lib/zinc/Makefile                         |    1 +

>  lib/zinc/curve25519/curve25519-arm-glue.h |   42 +

>  lib/zinc/curve25519/curve25519-arm.S      | 2095 +++++++++++++++++++++

>  lib/zinc/curve25519/curve25519.c          |    2 +

>  4 files changed, 2140 insertions(+)

>  create mode 100644 lib/zinc/curve25519/curve25519-arm-glue.h

>  create mode 100644 lib/zinc/curve25519/curve25519-arm.S

>

> diff --git a/lib/zinc/Makefile b/lib/zinc/Makefile

> index 65440438c6e5..be73c342f9ba 100644

> --- a/lib/zinc/Makefile

> +++ b/lib/zinc/Makefile

> @@ -27,4 +27,5 @@ zinc_blake2s-$(CONFIG_ZINC_ARCH_X86_64) += blake2s/blake2s-x86_64.o

>  obj-$(CONFIG_ZINC_BLAKE2S) += zinc_blake2s.o

>

>  zinc_curve25519-y := curve25519/curve25519.o

> +zinc_curve25519-$(CONFIG_ZINC_ARCH_ARM) += curve25519/curve25519-arm.o

>  obj-$(CONFIG_ZINC_CURVE25519) += zinc_curve25519.o

> diff --git a/lib/zinc/curve25519/curve25519-arm-glue.h b/lib/zinc/curve25519/curve25519-arm-glue.h

> new file mode 100644

> index 000000000000..9211bcab5615

> --- /dev/null

> +++ b/lib/zinc/curve25519/curve25519-arm-glue.h

> @@ -0,0 +1,42 @@

> +/* SPDX-License-Identifier: GPL-2.0 OR MIT */

> +/*

> + * Copyright (C) 2015-2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.

> + */

> +

> +#include <asm/hwcap.h>

> +#include <asm/neon.h>

> +#include <asm/simd.h>

> +

> +#if defined(CONFIG_KERNEL_MODE_NEON)

> +asmlinkage void curve25519_neon(u8 mypublic[CURVE25519_KEY_SIZE],

> +                               const u8 secret[CURVE25519_KEY_SIZE],

> +                               const u8 basepoint[CURVE25519_KEY_SIZE]);

> +#endif

> +

> +static bool curve25519_use_neon __ro_after_init;

> +

> +static void __init curve25519_fpu_init(void)

> +{

> +       curve25519_use_neon = elf_hwcap & HWCAP_NEON;

> +}

> +

> +static inline bool curve25519_arch(u8 mypublic[CURVE25519_KEY_SIZE],

> +                                  const u8 secret[CURVE25519_KEY_SIZE],

> +                                  const u8 basepoint[CURVE25519_KEY_SIZE])

> +{

> +#if defined(CONFIG_KERNEL_MODE_NEON)

> +       if (curve25519_use_neon && may_use_simd()) {

> +               kernel_neon_begin();

> +               curve25519_neon(mypublic, secret, basepoint);

> +               kernel_neon_end();

> +               return true;

> +       }

> +#endif

> +       return false;

> +}

> +

> +static inline bool curve25519_base_arch(u8 pub[CURVE25519_KEY_SIZE],

> +                                       const u8 secret[CURVE25519_KEY_SIZE])

> +{

> +       return false;

> +}


Shouldn't this use the new simd abstraction as well?


> diff --git a/lib/zinc/curve25519/curve25519-arm.S b/lib/zinc/curve25519/curve25519-arm.S

> new file mode 100644

> index 000000000000..db6570c20fd1

> --- /dev/null

> +++ b/lib/zinc/curve25519/curve25519-arm.S

> @@ -0,0 +1,2095 @@

> +/* SPDX-License-Identifier: GPL-2.0 OR MIT */

> +/*

> + * Copyright (C) 2015-2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.

> + *

> + * Based on public domain code from Daniel J. Bernstein and Peter Schwabe. This

> + * has been built from SUPERCOP's curve25519/neon2/scalarmult.pq using qhasm,

> + * but has subsequently been manually reworked for use in kernel space.

> + */

> +

> +#ifdef CONFIG_KERNEL_MODE_NEON

> +#include <linux/linkage.h>

> +

> +.text

> +.fpu neon

> +.arch armv7-a

> +.align 4

> +

> +ENTRY(curve25519_neon)

> +       push            {r4-r11, lr}

> +       mov             ip, sp

> +       sub             r3, sp, #704

> +       and             r3, r3, #0xfffffff0

> +       mov             sp, r3

> +       movw            r4, #0

> +       movw            r5, #254

> +       vmov.i32        q0, #1

> +       vshr.u64        q1, q0, #7

> +       vshr.u64        q0, q0, #8

> +       vmov.i32        d4, #19

> +       vmov.i32        d5, #38

> +       add             r6, sp, #480

> +       vst1.8          {d2-d3}, [r6, : 128]

> +       add             r6, sp, #496

> +       vst1.8          {d0-d1}, [r6, : 128]

> +       add             r6, sp, #512

> +       vst1.8          {d4-d5}, [r6, : 128]


I guess qhasm means generated code, right?

Because many of these adds are completely redundant ...

> +       add             r6, r3, #0

> +       vmov.i32        q2, #0

> +       vst1.8          {d4-d5}, [r6, : 128]!

> +       vst1.8          {d4-d5}, [r6, : 128]!

> +       vst1.8          d4, [r6, : 64]

> +       add             r6, r3, #0


> +       movw            r7, #960

> +       sub             r7, r7, #2

> +       neg             r7, r7

> +       sub             r7, r7, r7, LSL #7


This looks odd as well.

Could you elaborate on what qhasm is exactly? And, as with the other
patches, I would prefer it if we could have your changes as a separate
patch (although having the qhasm base would be preferred)



> +       str             r7, [r6]

> +       add             r6, sp, #672

> +       vld1.8          {d4-d5}, [r1]!

> +       vld1.8          {d6-d7}, [r1]

> +       vst1.8          {d4-d5}, [r6, : 128]!

> +       vst1.8          {d6-d7}, [r6, : 128]

> +       sub             r1, r6, #16

> +       ldrb            r6, [r1]

> +       and             r6, r6, #248

> +       strb            r6, [r1]

> +       ldrb            r6, [r1, #31]

> +       and             r6, r6, #127

> +       orr             r6, r6, #64

> +       strb            r6, [r1, #31]

> +       vmov.i64        q2, #0xffffffff

> +       vshr.u64        q3, q2, #7

> +       vshr.u64        q2, q2, #6

> +       vld1.8          {d8}, [r2]

> +       vld1.8          {d10}, [r2]

> +       add             r2, r2, #6

> +       vld1.8          {d12}, [r2]

> +       vld1.8          {d14}, [r2]

> +       add             r2, r2, #6

> +       vld1.8          {d16}, [r2]

> +       add             r2, r2, #4

> +       vld1.8          {d18}, [r2]

> +       vld1.8          {d20}, [r2]

> +       add             r2, r2, #6

> +       vld1.8          {d22}, [r2]

> +       add             r2, r2, #2

> +       vld1.8          {d24}, [r2]

> +       vld1.8          {d26}, [r2]

> +       vshr.u64        q5, q5, #26

> +       vshr.u64        q6, q6, #3

> +       vshr.u64        q7, q7, #29

> +       vshr.u64        q8, q8, #6

> +       vshr.u64        q10, q10, #25

> +       vshr.u64        q11, q11, #3

> +       vshr.u64        q12, q12, #12

> +       vshr.u64        q13, q13, #38

> +       vand            q4, q4, q2

> +       vand            q6, q6, q2

> +       vand            q8, q8, q2

> +       vand            q10, q10, q2

> +       vand            q2, q12, q2

> +       vand            q5, q5, q3

> +       vand            q7, q7, q3

> +       vand            q9, q9, q3

> +       vand            q11, q11, q3

> +       vand            q3, q13, q3

> +       add             r2, r3, #48

> +       vadd.i64        q12, q4, q1

> +       vadd.i64        q13, q10, q1

> +       vshr.s64        q12, q12, #26

> +       vshr.s64        q13, q13, #26

> +       vadd.i64        q5, q5, q12

> +       vshl.i64        q12, q12, #26

> +       vadd.i64        q14, q5, q0

> +       vadd.i64        q11, q11, q13

> +       vshl.i64        q13, q13, #26

> +       vadd.i64        q15, q11, q0

> +       vsub.i64        q4, q4, q12

> +       vshr.s64        q12, q14, #25

> +       vsub.i64        q10, q10, q13

> +       vshr.s64        q13, q15, #25

> +       vadd.i64        q6, q6, q12

> +       vshl.i64        q12, q12, #25

> +       vadd.i64        q14, q6, q1

> +       vadd.i64        q2, q2, q13

> +       vsub.i64        q5, q5, q12

> +       vshr.s64        q12, q14, #26

> +       vshl.i64        q13, q13, #25

> +       vadd.i64        q14, q2, q1

> +       vadd.i64        q7, q7, q12

> +       vshl.i64        q12, q12, #26

> +       vadd.i64        q15, q7, q0

> +       vsub.i64        q11, q11, q13

> +       vshr.s64        q13, q14, #26

> +       vsub.i64        q6, q6, q12

> +       vshr.s64        q12, q15, #25

> +       vadd.i64        q3, q3, q13

> +       vshl.i64        q13, q13, #26

> +       vadd.i64        q14, q3, q0

> +       vadd.i64        q8, q8, q12

> +       vshl.i64        q12, q12, #25

> +       vadd.i64        q15, q8, q1

> +       add             r2, r2, #8

> +       vsub.i64        q2, q2, q13

> +       vshr.s64        q13, q14, #25

> +       vsub.i64        q7, q7, q12

> +       vshr.s64        q12, q15, #26

> +       vadd.i64        q14, q13, q13

> +       vadd.i64        q9, q9, q12

> +       vtrn.32         d12, d14

> +       vshl.i64        q12, q12, #26

> +       vtrn.32         d13, d15

> +       vadd.i64        q0, q9, q0

> +       vadd.i64        q4, q4, q14

> +       vst1.8          d12, [r2, : 64]!

> +       vshl.i64        q6, q13, #4

> +       vsub.i64        q7, q8, q12

> +       vshr.s64        q0, q0, #25

> +       vadd.i64        q4, q4, q6

> +       vadd.i64        q6, q10, q0

> +       vshl.i64        q0, q0, #25

> +       vadd.i64        q8, q6, q1

> +       vadd.i64        q4, q4, q13

> +       vshl.i64        q10, q13, #25

> +       vadd.i64        q1, q4, q1

> +       vsub.i64        q0, q9, q0

> +       vshr.s64        q8, q8, #26

> +       vsub.i64        q3, q3, q10

> +       vtrn.32         d14, d0

> +       vshr.s64        q1, q1, #26

> +       vtrn.32         d15, d1

> +       vadd.i64        q0, q11, q8

> +       vst1.8          d14, [r2, : 64]

> +       vshl.i64        q7, q8, #26

> +       vadd.i64        q5, q5, q1

> +       vtrn.32         d4, d6

> +       vshl.i64        q1, q1, #26

> +       vtrn.32         d5, d7

> +       vsub.i64        q3, q6, q7

> +       add             r2, r2, #16

> +       vsub.i64        q1, q4, q1

> +       vst1.8          d4, [r2, : 64]

> +       vtrn.32         d6, d0

> +       vtrn.32         d7, d1

> +       sub             r2, r2, #8

> +       vtrn.32         d2, d10

> +       vtrn.32         d3, d11

> +       vst1.8          d6, [r2, : 64]

> +       sub             r2, r2, #24

> +       vst1.8          d2, [r2, : 64]

> +       add             r2, r3, #96

> +       vmov.i32        q0, #0

> +       vmov.i64        d2, #0xff

> +       vmov.i64        d3, #0

> +       vshr.u32        q1, q1, #7

> +       vst1.8          {d2-d3}, [r2, : 128]!

> +       vst1.8          {d0-d1}, [r2, : 128]!

> +       vst1.8          d0, [r2, : 64]

> +       add             r2, r3, #144

> +       vmov.i32        q0, #0

> +       vst1.8          {d0-d1}, [r2, : 128]!

> +       vst1.8          {d0-d1}, [r2, : 128]!

> +       vst1.8          d0, [r2, : 64]

> +       add             r2, r3, #240

> +       vmov.i32        q0, #0

> +       vmov.i64        d2, #0xff

> +       vmov.i64        d3, #0

> +       vshr.u32        q1, q1, #7

> +       vst1.8          {d2-d3}, [r2, : 128]!

> +       vst1.8          {d0-d1}, [r2, : 128]!

> +       vst1.8          d0, [r2, : 64]

> +       add             r2, r3, #48

> +       add             r6, r3, #192

> +       vld1.8          {d0-d1}, [r2, : 128]!

> +       vld1.8          {d2-d3}, [r2, : 128]!

> +       vld1.8          {d4}, [r2, : 64]

> +       vst1.8          {d0-d1}, [r6, : 128]!

> +       vst1.8          {d2-d3}, [r6, : 128]!

> +       vst1.8          d4, [r6, : 64]

> +.Lmainloop:

> +       mov             r2, r5, LSR #3

> +       and             r6, r5, #7

> +       ldrb            r2, [r1, r2]

> +       mov             r2, r2, LSR r6

> +       and             r2, r2, #1

> +       str             r5, [sp, #456]

> +       eor             r4, r4, r2

> +       str             r2, [sp, #460]

> +       neg             r2, r4

> +       add             r4, r3, #96

> +       add             r5, r3, #192

> +       add             r6, r3, #144

> +       vld1.8          {d8-d9}, [r4, : 128]!

> +       add             r7, r3, #240

> +       vld1.8          {d10-d11}, [r5, : 128]!

> +       veor            q6, q4, q5

> +       vld1.8          {d14-d15}, [r6, : 128]!

> +       vdup.i32        q8, r2

> +       vld1.8          {d18-d19}, [r7, : 128]!

> +       veor            q10, q7, q9

> +       vld1.8          {d22-d23}, [r4, : 128]!

> +       vand            q6, q6, q8

> +       vld1.8          {d24-d25}, [r5, : 128]!

> +       vand            q10, q10, q8

> +       vld1.8          {d26-d27}, [r6, : 128]!

> +       veor            q4, q4, q6

> +       vld1.8          {d28-d29}, [r7, : 128]!

> +       veor            q5, q5, q6

> +       vld1.8          {d0}, [r4, : 64]

> +       veor            q6, q7, q10

> +       vld1.8          {d2}, [r5, : 64]

> +       veor            q7, q9, q10

> +       vld1.8          {d4}, [r6, : 64]

> +       veor            q9, q11, q12

> +       vld1.8          {d6}, [r7, : 64]

> +       veor            q10, q0, q1

> +       sub             r2, r4, #32

> +       vand            q9, q9, q8

> +       sub             r4, r5, #32

> +       vand            q10, q10, q8

> +       sub             r5, r6, #32

> +       veor            q11, q11, q9

> +       sub             r6, r7, #32

> +       veor            q0, q0, q10

> +       veor            q9, q12, q9

> +       veor            q1, q1, q10

> +       veor            q10, q13, q14

> +       veor            q12, q2, q3

> +       vand            q10, q10, q8

> +       vand            q8, q12, q8

> +       veor            q12, q13, q10

> +       veor            q2, q2, q8

> +       veor            q10, q14, q10

> +       veor            q3, q3, q8

> +       vadd.i32        q8, q4, q6

> +       vsub.i32        q4, q4, q6

> +       vst1.8          {d16-d17}, [r2, : 128]!

> +       vadd.i32        q6, q11, q12

> +       vst1.8          {d8-d9}, [r5, : 128]!

> +       vsub.i32        q4, q11, q12

> +       vst1.8          {d12-d13}, [r2, : 128]!

> +       vadd.i32        q6, q0, q2

> +       vst1.8          {d8-d9}, [r5, : 128]!

> +       vsub.i32        q0, q0, q2

> +       vst1.8          d12, [r2, : 64]

> +       vadd.i32        q2, q5, q7

> +       vst1.8          d0, [r5, : 64]

> +       vsub.i32        q0, q5, q7

> +       vst1.8          {d4-d5}, [r4, : 128]!

> +       vadd.i32        q2, q9, q10

> +       vst1.8          {d0-d1}, [r6, : 128]!

> +       vsub.i32        q0, q9, q10

> +       vst1.8          {d4-d5}, [r4, : 128]!

> +       vadd.i32        q2, q1, q3

> +       vst1.8          {d0-d1}, [r6, : 128]!

> +       vsub.i32        q0, q1, q3

> +       vst1.8          d4, [r4, : 64]

> +       vst1.8          d0, [r6, : 64]

> +       add             r2, sp, #512

> +       add             r4, r3, #96

> +       add             r5, r3, #144

> +       vld1.8          {d0-d1}, [r2, : 128]

> +       vld1.8          {d2-d3}, [r4, : 128]!

> +       vld1.8          {d4-d5}, [r5, : 128]!

> +       vzip.i32        q1, q2

> +       vld1.8          {d6-d7}, [r4, : 128]!

> +       vld1.8          {d8-d9}, [r5, : 128]!

> +       vshl.i32        q5, q1, #1

> +       vzip.i32        q3, q4

> +       vshl.i32        q6, q2, #1

> +       vld1.8          {d14}, [r4, : 64]

> +       vshl.i32        q8, q3, #1

> +       vld1.8          {d15}, [r5, : 64]

> +       vshl.i32        q9, q4, #1

> +       vmul.i32        d21, d7, d1

> +       vtrn.32         d14, d15

> +       vmul.i32        q11, q4, q0

> +       vmul.i32        q0, q7, q0

> +       vmull.s32       q12, d2, d2

> +       vmlal.s32       q12, d11, d1

> +       vmlal.s32       q12, d12, d0

> +       vmlal.s32       q12, d13, d23

> +       vmlal.s32       q12, d16, d22

> +       vmlal.s32       q12, d7, d21

> +       vmull.s32       q10, d2, d11

> +       vmlal.s32       q10, d4, d1

> +       vmlal.s32       q10, d13, d0

> +       vmlal.s32       q10, d6, d23

> +       vmlal.s32       q10, d17, d22

> +       vmull.s32       q13, d10, d4

> +       vmlal.s32       q13, d11, d3

> +       vmlal.s32       q13, d13, d1

> +       vmlal.s32       q13, d16, d0

> +       vmlal.s32       q13, d17, d23

> +       vmlal.s32       q13, d8, d22

> +       vmull.s32       q1, d10, d5

> +       vmlal.s32       q1, d11, d4

> +       vmlal.s32       q1, d6, d1

> +       vmlal.s32       q1, d17, d0

> +       vmlal.s32       q1, d8, d23

> +       vmull.s32       q14, d10, d6

> +       vmlal.s32       q14, d11, d13

> +       vmlal.s32       q14, d4, d4

> +       vmlal.s32       q14, d17, d1

> +       vmlal.s32       q14, d18, d0

> +       vmlal.s32       q14, d9, d23

> +       vmull.s32       q11, d10, d7

> +       vmlal.s32       q11, d11, d6

> +       vmlal.s32       q11, d12, d5

> +       vmlal.s32       q11, d8, d1

> +       vmlal.s32       q11, d19, d0

> +       vmull.s32       q15, d10, d8

> +       vmlal.s32       q15, d11, d17

> +       vmlal.s32       q15, d12, d6

> +       vmlal.s32       q15, d13, d5

> +       vmlal.s32       q15, d19, d1

> +       vmlal.s32       q15, d14, d0

> +       vmull.s32       q2, d10, d9

> +       vmlal.s32       q2, d11, d8

> +       vmlal.s32       q2, d12, d7

> +       vmlal.s32       q2, d13, d6

> +       vmlal.s32       q2, d14, d1

> +       vmull.s32       q0, d15, d1

> +       vmlal.s32       q0, d10, d14

> +       vmlal.s32       q0, d11, d19

> +       vmlal.s32       q0, d12, d8

> +       vmlal.s32       q0, d13, d17

> +       vmlal.s32       q0, d6, d6

> +       add             r2, sp, #480

> +       vld1.8          {d18-d19}, [r2, : 128]


If you append a ! here ...

> +       vmull.s32       q3, d16, d7

> +       vmlal.s32       q3, d10, d15

> +       vmlal.s32       q3, d11, d14

> +       vmlal.s32       q3, d12, d9

> +       vmlal.s32       q3, d13, d8

> +       add             r2, sp, #496


... you can drop this add

> +       vld1.8          {d8-d9}, [r2, : 128]

> +       vadd.i64        q5, q12, q9

> +       vadd.i64        q6, q15, q9

> +       vshr.s64        q5, q5, #26

> +       vshr.s64        q6, q6, #26

> +       vadd.i64        q7, q10, q5

> +       vshl.i64        q5, q5, #26

> +       vadd.i64        q8, q7, q4

> +       vadd.i64        q2, q2, q6

> +       vshl.i64        q6, q6, #26

> +       vadd.i64        q10, q2, q4

> +       vsub.i64        q5, q12, q5

> +       vshr.s64        q8, q8, #25

> +       vsub.i64        q6, q15, q6

> +       vshr.s64        q10, q10, #25

> +       vadd.i64        q12, q13, q8

> +       vshl.i64        q8, q8, #25

> +       vadd.i64        q13, q12, q9

> +       vadd.i64        q0, q0, q10

> +       vsub.i64        q7, q7, q8

> +       vshr.s64        q8, q13, #26

> +       vshl.i64        q10, q10, #25

> +       vadd.i64        q13, q0, q9

> +       vadd.i64        q1, q1, q8

> +       vshl.i64        q8, q8, #26

> +       vadd.i64        q15, q1, q4

> +       vsub.i64        q2, q2, q10

> +       vshr.s64        q10, q13, #26

> +       vsub.i64        q8, q12, q8

> +       vshr.s64        q12, q15, #25

> +       vadd.i64        q3, q3, q10

> +       vshl.i64        q10, q10, #26

> +       vadd.i64        q13, q3, q4

> +       vadd.i64        q14, q14, q12

> +       add             r2, r3, #288

> +       vshl.i64        q12, q12, #25

> +       add             r4, r3, #336

> +       vadd.i64        q15, q14, q9

> +       add             r2, r2, #8

> +       vsub.i64        q0, q0, q10

> +       add             r4, r4, #8

> +       vshr.s64        q10, q13, #25

> +       vsub.i64        q1, q1, q12

> +       vshr.s64        q12, q15, #26

> +       vadd.i64        q13, q10, q10

> +       vadd.i64        q11, q11, q12

> +       vtrn.32         d16, d2

> +       vshl.i64        q12, q12, #26

> +       vtrn.32         d17, d3

> +       vadd.i64        q1, q11, q4

> +       vadd.i64        q4, q5, q13

> +       vst1.8          d16, [r2, : 64]!

> +       vshl.i64        q5, q10, #4

> +       vst1.8          d17, [r4, : 64]!

> +       vsub.i64        q8, q14, q12

> +       vshr.s64        q1, q1, #25

> +       vadd.i64        q4, q4, q5

> +       vadd.i64        q5, q6, q1

> +       vshl.i64        q1, q1, #25

> +       vadd.i64        q6, q5, q9

> +       vadd.i64        q4, q4, q10

> +       vshl.i64        q10, q10, #25

> +       vadd.i64        q9, q4, q9

> +       vsub.i64        q1, q11, q1

> +       vshr.s64        q6, q6, #26

> +       vsub.i64        q3, q3, q10

> +       vtrn.32         d16, d2

> +       vshr.s64        q9, q9, #26

> +       vtrn.32         d17, d3

> +       vadd.i64        q1, q2, q6

> +       vst1.8          d16, [r2, : 64]

> +       vshl.i64        q2, q6, #26

> +       vst1.8          d17, [r4, : 64]

> +       vadd.i64        q6, q7, q9

> +       vtrn.32         d0, d6

> +       vshl.i64        q7, q9, #26

> +       vtrn.32         d1, d7

> +       vsub.i64        q2, q5, q2

> +       add             r2, r2, #16

> +       vsub.i64        q3, q4, q7

> +       vst1.8          d0, [r2, : 64]

> +       add             r4, r4, #16

> +       vst1.8          d1, [r4, : 64]

> +       vtrn.32         d4, d2

> +       vtrn.32         d5, d3

> +       sub             r2, r2, #8

> +       sub             r4, r4, #8

> +       vtrn.32         d6, d12

> +       vtrn.32         d7, d13

> +       vst1.8          d4, [r2, : 64]

> +       vst1.8          d5, [r4, : 64]

> +       sub             r2, r2, #24

> +       sub             r4, r4, #24

> +       vst1.8          d6, [r2, : 64]

> +       vst1.8          d7, [r4, : 64]

> +       add             r2, r3, #240

> +       add             r4, r3, #96

> +       vld1.8          {d0-d1}, [r4, : 128]!

> +       vld1.8          {d2-d3}, [r4, : 128]!

> +       vld1.8          {d4}, [r4, : 64]

> +       add             r4, r3, #144

> +       vld1.8          {d6-d7}, [r4, : 128]!

> +       vtrn.32         q0, q3

> +       vld1.8          {d8-d9}, [r4, : 128]!

> +       vshl.i32        q5, q0, #4

> +       vtrn.32         q1, q4

> +       vshl.i32        q6, q3, #4

> +       vadd.i32        q5, q5, q0

> +       vadd.i32        q6, q6, q3

> +       vshl.i32        q7, q1, #4

> +       vld1.8          {d5}, [r4, : 64]

> +       vshl.i32        q8, q4, #4

> +       vtrn.32         d4, d5

> +       vadd.i32        q7, q7, q1

> +       vadd.i32        q8, q8, q4

> +       vld1.8          {d18-d19}, [r2, : 128]!

> +       vshl.i32        q10, q2, #4

> +       vld1.8          {d22-d23}, [r2, : 128]!

> +       vadd.i32        q10, q10, q2

> +       vld1.8          {d24}, [r2, : 64]

> +       vadd.i32        q5, q5, q0

> +       add             r2, r3, #192

> +       vld1.8          {d26-d27}, [r2, : 128]!

> +       vadd.i32        q6, q6, q3

> +       vld1.8          {d28-d29}, [r2, : 128]!

> +       vadd.i32        q8, q8, q4

> +       vld1.8          {d25}, [r2, : 64]

> +       vadd.i32        q10, q10, q2

> +       vtrn.32         q9, q13

> +       vadd.i32        q7, q7, q1

> +       vadd.i32        q5, q5, q0

> +       vtrn.32         q11, q14

> +       vadd.i32        q6, q6, q3

> +       add             r2, sp, #528

> +       vadd.i32        q10, q10, q2

> +       vtrn.32         d24, d25

> +       vst1.8          {d12-d13}, [r2, : 128]


same here

> +       vshl.i32        q6, q13, #1

> +       add             r2, sp, #544

> +       vst1.8          {d20-d21}, [r2, : 128]


and here

> +       vshl.i32        q10, q14, #1

> +       add             r2, sp, #560

> +       vst1.8          {d12-d13}, [r2, : 128]


and here

> +       vshl.i32        q15, q12, #1

> +       vadd.i32        q8, q8, q4

> +       vext.32         d10, d31, d30, #0

> +       vadd.i32        q7, q7, q1

> +       add             r2, sp, #576

> +       vst1.8          {d16-d17}, [r2, : 128]


and here

> +       vmull.s32       q8, d18, d5

> +       vmlal.s32       q8, d26, d4

> +       vmlal.s32       q8, d19, d9

> +       vmlal.s32       q8, d27, d3

> +       vmlal.s32       q8, d22, d8

> +       vmlal.s32       q8, d28, d2

> +       vmlal.s32       q8, d23, d7

> +       vmlal.s32       q8, d29, d1

> +       vmlal.s32       q8, d24, d6

> +       vmlal.s32       q8, d25, d0

> +       add             r2, sp, #592

> +       vst1.8          {d14-d15}, [r2, : 128]


and here

> +       vmull.s32       q2, d18, d4

> +       vmlal.s32       q2, d12, d9

> +       vmlal.s32       q2, d13, d8

> +       vmlal.s32       q2, d19, d3

> +       vmlal.s32       q2, d22, d2

> +       vmlal.s32       q2, d23, d1

> +       vmlal.s32       q2, d24, d0

> +       add             r2, sp, #608

> +       vst1.8          {d20-d21}, [r2, : 128]


and here


> +       vmull.s32       q7, d18, d9

> +       vmlal.s32       q7, d26, d3

> +       vmlal.s32       q7, d19, d8

> +       vmlal.s32       q7, d27, d2

> +       vmlal.s32       q7, d22, d7

> +       vmlal.s32       q7, d28, d1

> +       vmlal.s32       q7, d23, d6

> +       vmlal.s32       q7, d29, d0

> +       add             r2, sp, #624

> +       vst1.8          {d10-d11}, [r2, : 128]


and here

> +       vmull.s32       q5, d18, d3

> +       vmlal.s32       q5, d19, d2

> +       vmlal.s32       q5, d22, d1

> +       vmlal.s32       q5, d23, d0

> +       vmlal.s32       q5, d12, d8

> +       add             r2, sp, #640

> +       vst1.8          {d16-d17}, [r2, : 128]

> +       vmull.s32       q4, d18, d8

> +       vmlal.s32       q4, d26, d2

> +       vmlal.s32       q4, d19, d7

> +       vmlal.s32       q4, d27, d1

> +       vmlal.s32       q4, d22, d6

> +       vmlal.s32       q4, d28, d0

> +       vmull.s32       q8, d18, d7

> +       vmlal.s32       q8, d26, d1

> +       vmlal.s32       q8, d19, d6

> +       vmlal.s32       q8, d27, d0

> +       add             r2, sp, #544

> +       vld1.8          {d20-d21}, [r2, : 128]

> +       vmlal.s32       q7, d24, d21

> +       vmlal.s32       q7, d25, d20

> +       vmlal.s32       q4, d23, d21

> +       vmlal.s32       q4, d29, d20

> +       vmlal.s32       q8, d22, d21

> +       vmlal.s32       q8, d28, d20

> +       vmlal.s32       q5, d24, d20

> +       add             r2, sp, #544


redundant add


I'll stop here - let me just note that this code does not strike me as
particularly well optimized for in-order cores (such as A7).

For instance, the sequence

vmlal.s32 q2, d18, d7
vmlal.s32 q2, d19, d6
vmlal.s32 q5, d18, d6
vmlal.s32 q5, d19, d21
vmlal.s32 q1, d18, d21
vmlal.s32 q1, d19, d29
vmlal.s32 q0, d18, d28
vmlal.s32 q0, d19, d9
vmlal.s32 q6, d18, d29
vmlal.s32 q6, d19, d28

can be reordered as

vmlal.s32 q2, d18, d7
vmlal.s32 q5, d18, d6
vmlal.s32 q1, d18, d21
vmlal.s32 q0, d18, d28
vmlal.s32 q6, d18, d29

vmlal.s32 q2, d19, d6
vmlal.s32 q5, d19, d21
vmlal.s32 q1, d19, d29
vmlal.s32 q0, d19, d9
vmlal.s32 q6, d19, d28

and not have every other instruction depend on the output of the previous one.

Obviously, the ultimate truth is in the benchmark numbers, but I'd
thought I'd mention it anyway.



> +       vst1.8          {d14-d15}, [r2, : 128]

> +       vmull.s32       q7, d18, d6

> +       vmlal.s32       q7, d26, d0

> +       add             r2, sp, #624

> +       vld1.8          {d30-d31}, [r2, : 128]

> +       vmlal.s32       q2, d30, d21

> +       vmlal.s32       q7, d19, d21

> +       vmlal.s32       q7, d27, d20

> +       add             r2, sp, #592

> +       vld1.8          {d26-d27}, [r2, : 128]

> +       vmlal.s32       q4, d25, d27

> +       vmlal.s32       q8, d29, d27

> +       vmlal.s32       q8, d25, d26

> +       vmlal.s32       q7, d28, d27

> +       vmlal.s32       q7, d29, d26

> +       add             r2, sp, #576

> +       vld1.8          {d28-d29}, [r2, : 128]

> +       vmlal.s32       q4, d24, d29

> +       vmlal.s32       q8, d23, d29

> +       vmlal.s32       q8, d24, d28

> +       vmlal.s32       q7, d22, d29

> +       vmlal.s32       q7, d23, d28

> +       add             r2, sp, #576

> +       vst1.8          {d8-d9}, [r2, : 128]

> +       add             r2, sp, #528

> +       vld1.8          {d8-d9}, [r2, : 128]

> +       vmlal.s32       q7, d24, d9

> +       vmlal.s32       q7, d25, d31

> +       vmull.s32       q1, d18, d2

> +       vmlal.s32       q1, d19, d1

> +       vmlal.s32       q1, d22, d0

> +       vmlal.s32       q1, d24, d27

> +       vmlal.s32       q1, d23, d20

> +       vmlal.s32       q1, d12, d7

> +       vmlal.s32       q1, d13, d6

> +       vmull.s32       q6, d18, d1

> +       vmlal.s32       q6, d19, d0

> +       vmlal.s32       q6, d23, d27

> +       vmlal.s32       q6, d22, d20

> +       vmlal.s32       q6, d24, d26

> +       vmull.s32       q0, d18, d0

> +       vmlal.s32       q0, d22, d27

> +       vmlal.s32       q0, d23, d26

> +       vmlal.s32       q0, d24, d31

> +       vmlal.s32       q0, d19, d20

> +       add             r2, sp, #608

> +       vld1.8          {d18-d19}, [r2, : 128]

> +       vmlal.s32       q2, d18, d7

> +       vmlal.s32       q2, d19, d6

> +       vmlal.s32       q5, d18, d6

> +       vmlal.s32       q5, d19, d21

> +       vmlal.s32       q1, d18, d21

> +       vmlal.s32       q1, d19, d29

> +       vmlal.s32       q0, d18, d28

> +       vmlal.s32       q0, d19, d9

> +       vmlal.s32       q6, d18, d29

> +       vmlal.s32       q6, d19, d28

> +       add             r2, sp, #560

> +       vld1.8          {d18-d19}, [r2, : 128]

> +       add             r2, sp, #480

> +       vld1.8          {d22-d23}, [r2, : 128]

> +       vmlal.s32       q5, d19, d7

> +       vmlal.s32       q0, d18, d21

> +       vmlal.s32       q0, d19, d29

> +       vmlal.s32       q6, d18, d6

> +       add             r2, sp, #496

> +       vld1.8          {d6-d7}, [r2, : 128]

> +       vmlal.s32       q6, d19, d21

> +       add             r2, sp, #544

> +       vld1.8          {d18-d19}, [r2, : 128]

> +       vmlal.s32       q0, d30, d8

> +       add             r2, sp, #640

> +       vld1.8          {d20-d21}, [r2, : 128]

> +       vmlal.s32       q5, d30, d29

> +       add             r2, sp, #576

> +       vld1.8          {d24-d25}, [r2, : 128]

> +       vmlal.s32       q1, d30, d28

> +       vadd.i64        q13, q0, q11

> +       vadd.i64        q14, q5, q11

> +       vmlal.s32       q6, d30, d9

> +       vshr.s64        q4, q13, #26

> +       vshr.s64        q13, q14, #26

> +       vadd.i64        q7, q7, q4

> +       vshl.i64        q4, q4, #26

> +       vadd.i64        q14, q7, q3

> +       vadd.i64        q9, q9, q13

> +       vshl.i64        q13, q13, #26

> +       vadd.i64        q15, q9, q3

> +       vsub.i64        q0, q0, q4

> +       vshr.s64        q4, q14, #25

> +       vsub.i64        q5, q5, q13

> +       vshr.s64        q13, q15, #25

> +       vadd.i64        q6, q6, q4

> +       vshl.i64        q4, q4, #25

> +       vadd.i64        q14, q6, q11

> +       vadd.i64        q2, q2, q13

> +       vsub.i64        q4, q7, q4

> +       vshr.s64        q7, q14, #26

> +       vshl.i64        q13, q13, #25

> +       vadd.i64        q14, q2, q11

> +       vadd.i64        q8, q8, q7

> +       vshl.i64        q7, q7, #26

> +       vadd.i64        q15, q8, q3

> +       vsub.i64        q9, q9, q13

> +       vshr.s64        q13, q14, #26

> +       vsub.i64        q6, q6, q7

> +       vshr.s64        q7, q15, #25

> +       vadd.i64        q10, q10, q13

> +       vshl.i64        q13, q13, #26

> +       vadd.i64        q14, q10, q3

> +       vadd.i64        q1, q1, q7

> +       add             r2, r3, #144

> +       vshl.i64        q7, q7, #25

> +       add             r4, r3, #96

> +       vadd.i64        q15, q1, q11

> +       add             r2, r2, #8

> +       vsub.i64        q2, q2, q13

> +       add             r4, r4, #8

> +       vshr.s64        q13, q14, #25

> +       vsub.i64        q7, q8, q7

> +       vshr.s64        q8, q15, #26

> +       vadd.i64        q14, q13, q13

> +       vadd.i64        q12, q12, q8

> +       vtrn.32         d12, d14

> +       vshl.i64        q8, q8, #26

> +       vtrn.32         d13, d15

> +       vadd.i64        q3, q12, q3

> +       vadd.i64        q0, q0, q14

> +       vst1.8          d12, [r2, : 64]!

> +       vshl.i64        q7, q13, #4

> +       vst1.8          d13, [r4, : 64]!

> +       vsub.i64        q1, q1, q8

> +       vshr.s64        q3, q3, #25

> +       vadd.i64        q0, q0, q7

> +       vadd.i64        q5, q5, q3

> +       vshl.i64        q3, q3, #25

> +       vadd.i64        q6, q5, q11

> +       vadd.i64        q0, q0, q13

> +       vshl.i64        q7, q13, #25

> +       vadd.i64        q8, q0, q11

> +       vsub.i64        q3, q12, q3

> +       vshr.s64        q6, q6, #26

> +       vsub.i64        q7, q10, q7

> +       vtrn.32         d2, d6

> +       vshr.s64        q8, q8, #26

> +       vtrn.32         d3, d7

> +       vadd.i64        q3, q9, q6

> +       vst1.8          d2, [r2, : 64]

> +       vshl.i64        q6, q6, #26

> +       vst1.8          d3, [r4, : 64]

> +       vadd.i64        q1, q4, q8

> +       vtrn.32         d4, d14

> +       vshl.i64        q4, q8, #26

> +       vtrn.32         d5, d15

> +       vsub.i64        q5, q5, q6

> +       add             r2, r2, #16

> +       vsub.i64        q0, q0, q4

> +       vst1.8          d4, [r2, : 64]

> +       add             r4, r4, #16

> +       vst1.8          d5, [r4, : 64]

> +       vtrn.32         d10, d6

> +       vtrn.32         d11, d7

> +       sub             r2, r2, #8

> +       sub             r4, r4, #8

> +       vtrn.32         d0, d2

> +       vtrn.32         d1, d3

> +       vst1.8          d10, [r2, : 64]

> +       vst1.8          d11, [r4, : 64]

> +       sub             r2, r2, #24

> +       sub             r4, r4, #24

> +       vst1.8          d0, [r2, : 64]

> +       vst1.8          d1, [r4, : 64]

> +       add             r2, r3, #288

> +       add             r4, r3, #336

> +       vld1.8          {d0-d1}, [r2, : 128]!

> +       vld1.8          {d2-d3}, [r4, : 128]!

> +       vsub.i32        q0, q0, q1

> +       vld1.8          {d2-d3}, [r2, : 128]!

> +       vld1.8          {d4-d5}, [r4, : 128]!

> +       vsub.i32        q1, q1, q2

> +       add             r5, r3, #240

> +       vld1.8          {d4}, [r2, : 64]

> +       vld1.8          {d6}, [r4, : 64]

> +       vsub.i32        q2, q2, q3

> +       vst1.8          {d0-d1}, [r5, : 128]!

> +       vst1.8          {d2-d3}, [r5, : 128]!

> +       vst1.8          d4, [r5, : 64]

> +       add             r2, r3, #144

> +       add             r4, r3, #96

> +       add             r5, r3, #144

> +       add             r6, r3, #192

> +       vld1.8          {d0-d1}, [r2, : 128]!

> +       vld1.8          {d2-d3}, [r4, : 128]!

> +       vsub.i32        q2, q0, q1

> +       vadd.i32        q0, q0, q1

> +       vld1.8          {d2-d3}, [r2, : 128]!

> +       vld1.8          {d6-d7}, [r4, : 128]!

> +       vsub.i32        q4, q1, q3

> +       vadd.i32        q1, q1, q3

> +       vld1.8          {d6}, [r2, : 64]

> +       vld1.8          {d10}, [r4, : 64]

> +       vsub.i32        q6, q3, q5

> +       vadd.i32        q3, q3, q5

> +       vst1.8          {d4-d5}, [r5, : 128]!

> +       vst1.8          {d0-d1}, [r6, : 128]!

> +       vst1.8          {d8-d9}, [r5, : 128]!

> +       vst1.8          {d2-d3}, [r6, : 128]!

> +       vst1.8          d12, [r5, : 64]

> +       vst1.8          d6, [r6, : 64]

> +       add             r2, r3, #0

> +       add             r4, r3, #240

> +       vld1.8          {d0-d1}, [r4, : 128]!

> +       vld1.8          {d2-d3}, [r4, : 128]!

> +       vld1.8          {d4}, [r4, : 64]

> +       add             r4, r3, #336

> +       vld1.8          {d6-d7}, [r4, : 128]!

> +       vtrn.32         q0, q3

> +       vld1.8          {d8-d9}, [r4, : 128]!

> +       vshl.i32        q5, q0, #4

> +       vtrn.32         q1, q4

> +       vshl.i32        q6, q3, #4

> +       vadd.i32        q5, q5, q0

> +       vadd.i32        q6, q6, q3

> +       vshl.i32        q7, q1, #4

> +       vld1.8          {d5}, [r4, : 64]

> +       vshl.i32        q8, q4, #4

> +       vtrn.32         d4, d5

> +       vadd.i32        q7, q7, q1

> +       vadd.i32        q8, q8, q4

> +       vld1.8          {d18-d19}, [r2, : 128]!

> +       vshl.i32        q10, q2, #4

> +       vld1.8          {d22-d23}, [r2, : 128]!

> +       vadd.i32        q10, q10, q2

> +       vld1.8          {d24}, [r2, : 64]

> +       vadd.i32        q5, q5, q0

> +       add             r2, r3, #288

> +       vld1.8          {d26-d27}, [r2, : 128]!

> +       vadd.i32        q6, q6, q3

> +       vld1.8          {d28-d29}, [r2, : 128]!

> +       vadd.i32        q8, q8, q4

> +       vld1.8          {d25}, [r2, : 64]

> +       vadd.i32        q10, q10, q2

> +       vtrn.32         q9, q13

> +       vadd.i32        q7, q7, q1

> +       vadd.i32        q5, q5, q0

> +       vtrn.32         q11, q14

> +       vadd.i32        q6, q6, q3

> +       add             r2, sp, #528

> +       vadd.i32        q10, q10, q2

> +       vtrn.32         d24, d25

> +       vst1.8          {d12-d13}, [r2, : 128]

> +       vshl.i32        q6, q13, #1

> +       add             r2, sp, #544

> +       vst1.8          {d20-d21}, [r2, : 128]

> +       vshl.i32        q10, q14, #1

> +       add             r2, sp, #560

> +       vst1.8          {d12-d13}, [r2, : 128]

> +       vshl.i32        q15, q12, #1

> +       vadd.i32        q8, q8, q4

> +       vext.32         d10, d31, d30, #0

> +       vadd.i32        q7, q7, q1

> +       add             r2, sp, #576

> +       vst1.8          {d16-d17}, [r2, : 128]

> +       vmull.s32       q8, d18, d5

> +       vmlal.s32       q8, d26, d4

> +       vmlal.s32       q8, d19, d9

> +       vmlal.s32       q8, d27, d3

> +       vmlal.s32       q8, d22, d8

> +       vmlal.s32       q8, d28, d2

> +       vmlal.s32       q8, d23, d7

> +       vmlal.s32       q8, d29, d1

> +       vmlal.s32       q8, d24, d6

> +       vmlal.s32       q8, d25, d0

> +       add             r2, sp, #592

> +       vst1.8          {d14-d15}, [r2, : 128]

> +       vmull.s32       q2, d18, d4

> +       vmlal.s32       q2, d12, d9

> +       vmlal.s32       q2, d13, d8

> +       vmlal.s32       q2, d19, d3

> +       vmlal.s32       q2, d22, d2

> +       vmlal.s32       q2, d23, d1

> +       vmlal.s32       q2, d24, d0

> +       add             r2, sp, #608

> +       vst1.8          {d20-d21}, [r2, : 128]

> +       vmull.s32       q7, d18, d9

> +       vmlal.s32       q7, d26, d3

> +       vmlal.s32       q7, d19, d8

> +       vmlal.s32       q7, d27, d2

> +       vmlal.s32       q7, d22, d7

> +       vmlal.s32       q7, d28, d1

> +       vmlal.s32       q7, d23, d6

> +       vmlal.s32       q7, d29, d0

> +       add             r2, sp, #624

> +       vst1.8          {d10-d11}, [r2, : 128]

> +       vmull.s32       q5, d18, d3

> +       vmlal.s32       q5, d19, d2

> +       vmlal.s32       q5, d22, d1

> +       vmlal.s32       q5, d23, d0

> +       vmlal.s32       q5, d12, d8

> +       add             r2, sp, #640

> +       vst1.8          {d16-d17}, [r2, : 128]

> +       vmull.s32       q4, d18, d8

> +       vmlal.s32       q4, d26, d2

> +       vmlal.s32       q4, d19, d7

> +       vmlal.s32       q4, d27, d1

> +       vmlal.s32       q4, d22, d6

> +       vmlal.s32       q4, d28, d0

> +       vmull.s32       q8, d18, d7

> +       vmlal.s32       q8, d26, d1

> +       vmlal.s32       q8, d19, d6

> +       vmlal.s32       q8, d27, d0

> +       add             r2, sp, #544

> +       vld1.8          {d20-d21}, [r2, : 128]

> +       vmlal.s32       q7, d24, d21

> +       vmlal.s32       q7, d25, d20

> +       vmlal.s32       q4, d23, d21

> +       vmlal.s32       q4, d29, d20

> +       vmlal.s32       q8, d22, d21

> +       vmlal.s32       q8, d28, d20

> +       vmlal.s32       q5, d24, d20

> +       add             r2, sp, #544

> +       vst1.8          {d14-d15}, [r2, : 128]

> +       vmull.s32       q7, d18, d6

> +       vmlal.s32       q7, d26, d0

> +       add             r2, sp, #624

> +       vld1.8          {d30-d31}, [r2, : 128]

> +       vmlal.s32       q2, d30, d21

> +       vmlal.s32       q7, d19, d21

> +       vmlal.s32       q7, d27, d20

> +       add             r2, sp, #592

> +       vld1.8          {d26-d27}, [r2, : 128]

> +       vmlal.s32       q4, d25, d27

> +       vmlal.s32       q8, d29, d27

> +       vmlal.s32       q8, d25, d26

> +       vmlal.s32       q7, d28, d27

> +       vmlal.s32       q7, d29, d26

> +       add             r2, sp, #576

> +       vld1.8          {d28-d29}, [r2, : 128]

> +       vmlal.s32       q4, d24, d29

> +       vmlal.s32       q8, d23, d29

> +       vmlal.s32       q8, d24, d28

> +       vmlal.s32       q7, d22, d29

> +       vmlal.s32       q7, d23, d28

> +       add             r2, sp, #576

> +       vst1.8          {d8-d9}, [r2, : 128]

> +       add             r2, sp, #528

> +       vld1.8          {d8-d9}, [r2, : 128]

> +       vmlal.s32       q7, d24, d9

> +       vmlal.s32       q7, d25, d31

> +       vmull.s32       q1, d18, d2

> +       vmlal.s32       q1, d19, d1

> +       vmlal.s32       q1, d22, d0

> +       vmlal.s32       q1, d24, d27

> +       vmlal.s32       q1, d23, d20

> +       vmlal.s32       q1, d12, d7

> +       vmlal.s32       q1, d13, d6

> +       vmull.s32       q6, d18, d1

> +       vmlal.s32       q6, d19, d0

> +       vmlal.s32       q6, d23, d27

> +       vmlal.s32       q6, d22, d20

> +       vmlal.s32       q6, d24, d26

> +       vmull.s32       q0, d18, d0

> +       vmlal.s32       q0, d22, d27

> +       vmlal.s32       q0, d23, d26

> +       vmlal.s32       q0, d24, d31

> +       vmlal.s32       q0, d19, d20

> +       add             r2, sp, #608

> +       vld1.8          {d18-d19}, [r2, : 128]

> +       vmlal.s32       q2, d18, d7

> +       vmlal.s32       q2, d19, d6

> +       vmlal.s32       q5, d18, d6

> +       vmlal.s32       q5, d19, d21

> +       vmlal.s32       q1, d18, d21

> +       vmlal.s32       q1, d19, d29

> +       vmlal.s32       q0, d18, d28

> +       vmlal.s32       q0, d19, d9

> +       vmlal.s32       q6, d18, d29

> +       vmlal.s32       q6, d19, d28

> +       add             r2, sp, #560

> +       vld1.8          {d18-d19}, [r2, : 128]

> +       add             r2, sp, #480

> +       vld1.8          {d22-d23}, [r2, : 128]

> +       vmlal.s32       q5, d19, d7

> +       vmlal.s32       q0, d18, d21

> +       vmlal.s32       q0, d19, d29

> +       vmlal.s32       q6, d18, d6

> +       add             r2, sp, #496

> +       vld1.8          {d6-d7}, [r2, : 128]

> +       vmlal.s32       q6, d19, d21

> +       add             r2, sp, #544

> +       vld1.8          {d18-d19}, [r2, : 128]

> +       vmlal.s32       q0, d30, d8

> +       add             r2, sp, #640

> +       vld1.8          {d20-d21}, [r2, : 128]

> +       vmlal.s32       q5, d30, d29

> +       add             r2, sp, #576

> +       vld1.8          {d24-d25}, [r2, : 128]

> +       vmlal.s32       q1, d30, d28

> +       vadd.i64        q13, q0, q11

> +       vadd.i64        q14, q5, q11

> +       vmlal.s32       q6, d30, d9

> +       vshr.s64        q4, q13, #26

> +       vshr.s64        q13, q14, #26

> +       vadd.i64        q7, q7, q4

> +       vshl.i64        q4, q4, #26

> +       vadd.i64        q14, q7, q3

> +       vadd.i64        q9, q9, q13

> +       vshl.i64        q13, q13, #26

> +       vadd.i64        q15, q9, q3

> +       vsub.i64        q0, q0, q4

> +       vshr.s64        q4, q14, #25

> +       vsub.i64        q5, q5, q13

> +       vshr.s64        q13, q15, #25

> +       vadd.i64        q6, q6, q4

> +       vshl.i64        q4, q4, #25

> +       vadd.i64        q14, q6, q11

> +       vadd.i64        q2, q2, q13

> +       vsub.i64        q4, q7, q4

> +       vshr.s64        q7, q14, #26

> +       vshl.i64        q13, q13, #25

> +       vadd.i64        q14, q2, q11

> +       vadd.i64        q8, q8, q7

> +       vshl.i64        q7, q7, #26

> +       vadd.i64        q15, q8, q3

> +       vsub.i64        q9, q9, q13

> +       vshr.s64        q13, q14, #26

> +       vsub.i64        q6, q6, q7

> +       vshr.s64        q7, q15, #25

> +       vadd.i64        q10, q10, q13

> +       vshl.i64        q13, q13, #26

> +       vadd.i64        q14, q10, q3

> +       vadd.i64        q1, q1, q7

> +       add             r2, r3, #288

> +       vshl.i64        q7, q7, #25

> +       add             r4, r3, #96

> +       vadd.i64        q15, q1, q11

> +       add             r2, r2, #8

> +       vsub.i64        q2, q2, q13

> +       add             r4, r4, #8

> +       vshr.s64        q13, q14, #25

> +       vsub.i64        q7, q8, q7

> +       vshr.s64        q8, q15, #26

> +       vadd.i64        q14, q13, q13

> +       vadd.i64        q12, q12, q8

> +       vtrn.32         d12, d14

> +       vshl.i64        q8, q8, #26

> +       vtrn.32         d13, d15

> +       vadd.i64        q3, q12, q3

> +       vadd.i64        q0, q0, q14

> +       vst1.8          d12, [r2, : 64]!

> +       vshl.i64        q7, q13, #4

> +       vst1.8          d13, [r4, : 64]!

> +       vsub.i64        q1, q1, q8

> +       vshr.s64        q3, q3, #25

> +       vadd.i64        q0, q0, q7

> +       vadd.i64        q5, q5, q3

> +       vshl.i64        q3, q3, #25

> +       vadd.i64        q6, q5, q11

> +       vadd.i64        q0, q0, q13

> +       vshl.i64        q7, q13, #25

> +       vadd.i64        q8, q0, q11

> +       vsub.i64        q3, q12, q3

> +       vshr.s64        q6, q6, #26

> +       vsub.i64        q7, q10, q7

> +       vtrn.32         d2, d6

> +       vshr.s64        q8, q8, #26

> +       vtrn.32         d3, d7

> +       vadd.i64        q3, q9, q6

> +       vst1.8          d2, [r2, : 64]

> +       vshl.i64        q6, q6, #26

> +       vst1.8          d3, [r4, : 64]

> +       vadd.i64        q1, q4, q8

> +       vtrn.32         d4, d14

> +       vshl.i64        q4, q8, #26

> +       vtrn.32         d5, d15

> +       vsub.i64        q5, q5, q6

> +       add             r2, r2, #16

> +       vsub.i64        q0, q0, q4

> +       vst1.8          d4, [r2, : 64]

> +       add             r4, r4, #16

> +       vst1.8          d5, [r4, : 64]

> +       vtrn.32         d10, d6

> +       vtrn.32         d11, d7

> +       sub             r2, r2, #8

> +       sub             r4, r4, #8

> +       vtrn.32         d0, d2

> +       vtrn.32         d1, d3

> +       vst1.8          d10, [r2, : 64]

> +       vst1.8          d11, [r4, : 64]

> +       sub             r2, r2, #24

> +       sub             r4, r4, #24

> +       vst1.8          d0, [r2, : 64]

> +       vst1.8          d1, [r4, : 64]

> +       add             r2, sp, #512

> +       add             r4, r3, #144

> +       add             r5, r3, #192

> +       vld1.8          {d0-d1}, [r2, : 128]

> +       vld1.8          {d2-d3}, [r4, : 128]!

> +       vld1.8          {d4-d5}, [r5, : 128]!

> +       vzip.i32        q1, q2

> +       vld1.8          {d6-d7}, [r4, : 128]!

> +       vld1.8          {d8-d9}, [r5, : 128]!

> +       vshl.i32        q5, q1, #1

> +       vzip.i32        q3, q4

> +       vshl.i32        q6, q2, #1

> +       vld1.8          {d14}, [r4, : 64]

> +       vshl.i32        q8, q3, #1

> +       vld1.8          {d15}, [r5, : 64]

> +       vshl.i32        q9, q4, #1

> +       vmul.i32        d21, d7, d1

> +       vtrn.32         d14, d15

> +       vmul.i32        q11, q4, q0

> +       vmul.i32        q0, q7, q0

> +       vmull.s32       q12, d2, d2

> +       vmlal.s32       q12, d11, d1

> +       vmlal.s32       q12, d12, d0

> +       vmlal.s32       q12, d13, d23

> +       vmlal.s32       q12, d16, d22

> +       vmlal.s32       q12, d7, d21

> +       vmull.s32       q10, d2, d11

> +       vmlal.s32       q10, d4, d1

> +       vmlal.s32       q10, d13, d0

> +       vmlal.s32       q10, d6, d23

> +       vmlal.s32       q10, d17, d22

> +       vmull.s32       q13, d10, d4

> +       vmlal.s32       q13, d11, d3

> +       vmlal.s32       q13, d13, d1

> +       vmlal.s32       q13, d16, d0

> +       vmlal.s32       q13, d17, d23

> +       vmlal.s32       q13, d8, d22

> +       vmull.s32       q1, d10, d5

> +       vmlal.s32       q1, d11, d4

> +       vmlal.s32       q1, d6, d1

> +       vmlal.s32       q1, d17, d0

> +       vmlal.s32       q1, d8, d23

> +       vmull.s32       q14, d10, d6

> +       vmlal.s32       q14, d11, d13

> +       vmlal.s32       q14, d4, d4

> +       vmlal.s32       q14, d17, d1

> +       vmlal.s32       q14, d18, d0

> +       vmlal.s32       q14, d9, d23

> +       vmull.s32       q11, d10, d7

> +       vmlal.s32       q11, d11, d6

> +       vmlal.s32       q11, d12, d5

> +       vmlal.s32       q11, d8, d1

> +       vmlal.s32       q11, d19, d0

> +       vmull.s32       q15, d10, d8

> +       vmlal.s32       q15, d11, d17

> +       vmlal.s32       q15, d12, d6

> +       vmlal.s32       q15, d13, d5

> +       vmlal.s32       q15, d19, d1

> +       vmlal.s32       q15, d14, d0

> +       vmull.s32       q2, d10, d9

> +       vmlal.s32       q2, d11, d8

> +       vmlal.s32       q2, d12, d7

> +       vmlal.s32       q2, d13, d6

> +       vmlal.s32       q2, d14, d1

> +       vmull.s32       q0, d15, d1

> +       vmlal.s32       q0, d10, d14

> +       vmlal.s32       q0, d11, d19

> +       vmlal.s32       q0, d12, d8

> +       vmlal.s32       q0, d13, d17

> +       vmlal.s32       q0, d6, d6

> +       add             r2, sp, #480

> +       vld1.8          {d18-d19}, [r2, : 128]

> +       vmull.s32       q3, d16, d7

> +       vmlal.s32       q3, d10, d15

> +       vmlal.s32       q3, d11, d14

> +       vmlal.s32       q3, d12, d9

> +       vmlal.s32       q3, d13, d8

> +       add             r2, sp, #496

> +       vld1.8          {d8-d9}, [r2, : 128]

> +       vadd.i64        q5, q12, q9

> +       vadd.i64        q6, q15, q9

> +       vshr.s64        q5, q5, #26

> +       vshr.s64        q6, q6, #26

> +       vadd.i64        q7, q10, q5

> +       vshl.i64        q5, q5, #26

> +       vadd.i64        q8, q7, q4

> +       vadd.i64        q2, q2, q6

> +       vshl.i64        q6, q6, #26

> +       vadd.i64        q10, q2, q4

> +       vsub.i64        q5, q12, q5

> +       vshr.s64        q8, q8, #25

> +       vsub.i64        q6, q15, q6

> +       vshr.s64        q10, q10, #25

> +       vadd.i64        q12, q13, q8

> +       vshl.i64        q8, q8, #25

> +       vadd.i64        q13, q12, q9

> +       vadd.i64        q0, q0, q10

> +       vsub.i64        q7, q7, q8

> +       vshr.s64        q8, q13, #26

> +       vshl.i64        q10, q10, #25

> +       vadd.i64        q13, q0, q9

> +       vadd.i64        q1, q1, q8

> +       vshl.i64        q8, q8, #26

> +       vadd.i64        q15, q1, q4

> +       vsub.i64        q2, q2, q10

> +       vshr.s64        q10, q13, #26

> +       vsub.i64        q8, q12, q8

> +       vshr.s64        q12, q15, #25

> +       vadd.i64        q3, q3, q10

> +       vshl.i64        q10, q10, #26

> +       vadd.i64        q13, q3, q4

> +       vadd.i64        q14, q14, q12

> +       add             r2, r3, #144

> +       vshl.i64        q12, q12, #25

> +       add             r4, r3, #192

> +       vadd.i64        q15, q14, q9

> +       add             r2, r2, #8

> +       vsub.i64        q0, q0, q10

> +       add             r4, r4, #8

> +       vshr.s64        q10, q13, #25

> +       vsub.i64        q1, q1, q12

> +       vshr.s64        q12, q15, #26

> +       vadd.i64        q13, q10, q10

> +       vadd.i64        q11, q11, q12

> +       vtrn.32         d16, d2

> +       vshl.i64        q12, q12, #26

> +       vtrn.32         d17, d3

> +       vadd.i64        q1, q11, q4

> +       vadd.i64        q4, q5, q13

> +       vst1.8          d16, [r2, : 64]!

> +       vshl.i64        q5, q10, #4

> +       vst1.8          d17, [r4, : 64]!

> +       vsub.i64        q8, q14, q12

> +       vshr.s64        q1, q1, #25

> +       vadd.i64        q4, q4, q5

> +       vadd.i64        q5, q6, q1

> +       vshl.i64        q1, q1, #25

> +       vadd.i64        q6, q5, q9

> +       vadd.i64        q4, q4, q10

> +       vshl.i64        q10, q10, #25

> +       vadd.i64        q9, q4, q9

> +       vsub.i64        q1, q11, q1

> +       vshr.s64        q6, q6, #26

> +       vsub.i64        q3, q3, q10

> +       vtrn.32         d16, d2

> +       vshr.s64        q9, q9, #26

> +       vtrn.32         d17, d3

> +       vadd.i64        q1, q2, q6

> +       vst1.8          d16, [r2, : 64]

> +       vshl.i64        q2, q6, #26

> +       vst1.8          d17, [r4, : 64]

> +       vadd.i64        q6, q7, q9

> +       vtrn.32         d0, d6

> +       vshl.i64        q7, q9, #26

> +       vtrn.32         d1, d7

> +       vsub.i64        q2, q5, q2

> +       add             r2, r2, #16

> +       vsub.i64        q3, q4, q7

> +       vst1.8          d0, [r2, : 64]

> +       add             r4, r4, #16

> +       vst1.8          d1, [r4, : 64]

> +       vtrn.32         d4, d2

> +       vtrn.32         d5, d3

> +       sub             r2, r2, #8

> +       sub             r4, r4, #8

> +       vtrn.32         d6, d12

> +       vtrn.32         d7, d13

> +       vst1.8          d4, [r2, : 64]

> +       vst1.8          d5, [r4, : 64]

> +       sub             r2, r2, #24

> +       sub             r4, r4, #24

> +       vst1.8          d6, [r2, : 64]

> +       vst1.8          d7, [r4, : 64]

> +       add             r2, r3, #336

> +       add             r4, r3, #288

> +       vld1.8          {d0-d1}, [r2, : 128]!

> +       vld1.8          {d2-d3}, [r4, : 128]!

> +       vadd.i32        q0, q0, q1

> +       vld1.8          {d2-d3}, [r2, : 128]!

> +       vld1.8          {d4-d5}, [r4, : 128]!

> +       vadd.i32        q1, q1, q2

> +       add             r5, r3, #288

> +       vld1.8          {d4}, [r2, : 64]

> +       vld1.8          {d6}, [r4, : 64]

> +       vadd.i32        q2, q2, q3

> +       vst1.8          {d0-d1}, [r5, : 128]!

> +       vst1.8          {d2-d3}, [r5, : 128]!

> +       vst1.8          d4, [r5, : 64]

> +       add             r2, r3, #48

> +       add             r4, r3, #144

> +       vld1.8          {d0-d1}, [r4, : 128]!

> +       vld1.8          {d2-d3}, [r4, : 128]!

> +       vld1.8          {d4}, [r4, : 64]

> +       add             r4, r3, #288

> +       vld1.8          {d6-d7}, [r4, : 128]!

> +       vtrn.32         q0, q3

> +       vld1.8          {d8-d9}, [r4, : 128]!

> +       vshl.i32        q5, q0, #4

> +       vtrn.32         q1, q4

> +       vshl.i32        q6, q3, #4

> +       vadd.i32        q5, q5, q0

> +       vadd.i32        q6, q6, q3

> +       vshl.i32        q7, q1, #4

> +       vld1.8          {d5}, [r4, : 64]

> +       vshl.i32        q8, q4, #4

> +       vtrn.32         d4, d5

> +       vadd.i32        q7, q7, q1

> +       vadd.i32        q8, q8, q4

> +       vld1.8          {d18-d19}, [r2, : 128]!

> +       vshl.i32        q10, q2, #4

> +       vld1.8          {d22-d23}, [r2, : 128]!

> +       vadd.i32        q10, q10, q2

> +       vld1.8          {d24}, [r2, : 64]

> +       vadd.i32        q5, q5, q0

> +       add             r2, r3, #240

> +       vld1.8          {d26-d27}, [r2, : 128]!

> +       vadd.i32        q6, q6, q3

> +       vld1.8          {d28-d29}, [r2, : 128]!

> +       vadd.i32        q8, q8, q4

> +       vld1.8          {d25}, [r2, : 64]

> +       vadd.i32        q10, q10, q2

> +       vtrn.32         q9, q13

> +       vadd.i32        q7, q7, q1

> +       vadd.i32        q5, q5, q0

> +       vtrn.32         q11, q14

> +       vadd.i32        q6, q6, q3

> +       add             r2, sp, #528

> +       vadd.i32        q10, q10, q2

> +       vtrn.32         d24, d25

> +       vst1.8          {d12-d13}, [r2, : 128]

> +       vshl.i32        q6, q13, #1

> +       add             r2, sp, #544

> +       vst1.8          {d20-d21}, [r2, : 128]

> +       vshl.i32        q10, q14, #1

> +       add             r2, sp, #560

> +       vst1.8          {d12-d13}, [r2, : 128]

> +       vshl.i32        q15, q12, #1

> +       vadd.i32        q8, q8, q4

> +       vext.32         d10, d31, d30, #0

> +       vadd.i32        q7, q7, q1

> +       add             r2, sp, #576

> +       vst1.8          {d16-d17}, [r2, : 128]

> +       vmull.s32       q8, d18, d5

> +       vmlal.s32       q8, d26, d4

> +       vmlal.s32       q8, d19, d9

> +       vmlal.s32       q8, d27, d3

> +       vmlal.s32       q8, d22, d8

> +       vmlal.s32       q8, d28, d2

> +       vmlal.s32       q8, d23, d7

> +       vmlal.s32       q8, d29, d1

> +       vmlal.s32       q8, d24, d6

> +       vmlal.s32       q8, d25, d0

> +       add             r2, sp, #592

> +       vst1.8          {d14-d15}, [r2, : 128]

> +       vmull.s32       q2, d18, d4

> +       vmlal.s32       q2, d12, d9

> +       vmlal.s32       q2, d13, d8

> +       vmlal.s32       q2, d19, d3

> +       vmlal.s32       q2, d22, d2

> +       vmlal.s32       q2, d23, d1

> +       vmlal.s32       q2, d24, d0

> +       add             r2, sp, #608

> +       vst1.8          {d20-d21}, [r2, : 128]

> +       vmull.s32       q7, d18, d9

> +       vmlal.s32       q7, d26, d3

> +       vmlal.s32       q7, d19, d8

> +       vmlal.s32       q7, d27, d2

> +       vmlal.s32       q7, d22, d7

> +       vmlal.s32       q7, d28, d1

> +       vmlal.s32       q7, d23, d6

> +       vmlal.s32       q7, d29, d0

> +       add             r2, sp, #624

> +       vst1.8          {d10-d11}, [r2, : 128]

> +       vmull.s32       q5, d18, d3

> +       vmlal.s32       q5, d19, d2

> +       vmlal.s32       q5, d22, d1

> +       vmlal.s32       q5, d23, d0

> +       vmlal.s32       q5, d12, d8

> +       add             r2, sp, #640

> +       vst1.8          {d16-d17}, [r2, : 128]

> +       vmull.s32       q4, d18, d8

> +       vmlal.s32       q4, d26, d2

> +       vmlal.s32       q4, d19, d7

> +       vmlal.s32       q4, d27, d1

> +       vmlal.s32       q4, d22, d6

> +       vmlal.s32       q4, d28, d0

> +       vmull.s32       q8, d18, d7

> +       vmlal.s32       q8, d26, d1

> +       vmlal.s32       q8, d19, d6

> +       vmlal.s32       q8, d27, d0

> +       add             r2, sp, #544

> +       vld1.8          {d20-d21}, [r2, : 128]

> +       vmlal.s32       q7, d24, d21

> +       vmlal.s32       q7, d25, d20

> +       vmlal.s32       q4, d23, d21

> +       vmlal.s32       q4, d29, d20

> +       vmlal.s32       q8, d22, d21

> +       vmlal.s32       q8, d28, d20

> +       vmlal.s32       q5, d24, d20

> +       add             r2, sp, #544

> +       vst1.8          {d14-d15}, [r2, : 128]

> +       vmull.s32       q7, d18, d6

> +       vmlal.s32       q7, d26, d0

> +       add             r2, sp, #624

> +       vld1.8          {d30-d31}, [r2, : 128]

> +       vmlal.s32       q2, d30, d21

> +       vmlal.s32       q7, d19, d21

> +       vmlal.s32       q7, d27, d20

> +       add             r2, sp, #592

> +       vld1.8          {d26-d27}, [r2, : 128]

> +       vmlal.s32       q4, d25, d27

> +       vmlal.s32       q8, d29, d27

> +       vmlal.s32       q8, d25, d26

> +       vmlal.s32       q7, d28, d27

> +       vmlal.s32       q7, d29, d26

> +       add             r2, sp, #576

> +       vld1.8          {d28-d29}, [r2, : 128]

> +       vmlal.s32       q4, d24, d29

> +       vmlal.s32       q8, d23, d29

> +       vmlal.s32       q8, d24, d28

> +       vmlal.s32       q7, d22, d29

> +       vmlal.s32       q7, d23, d28

> +       add             r2, sp, #576

> +       vst1.8          {d8-d9}, [r2, : 128]

> +       add             r2, sp, #528

> +       vld1.8          {d8-d9}, [r2, : 128]

> +       vmlal.s32       q7, d24, d9

> +       vmlal.s32       q7, d25, d31

> +       vmull.s32       q1, d18, d2

> +       vmlal.s32       q1, d19, d1

> +       vmlal.s32       q1, d22, d0

> +       vmlal.s32       q1, d24, d27

> +       vmlal.s32       q1, d23, d20

> +       vmlal.s32       q1, d12, d7

> +       vmlal.s32       q1, d13, d6

> +       vmull.s32       q6, d18, d1

> +       vmlal.s32       q6, d19, d0

> +       vmlal.s32       q6, d23, d27

> +       vmlal.s32       q6, d22, d20

> +       vmlal.s32       q6, d24, d26

> +       vmull.s32       q0, d18, d0

> +       vmlal.s32       q0, d22, d27

> +       vmlal.s32       q0, d23, d26

> +       vmlal.s32       q0, d24, d31

> +       vmlal.s32       q0, d19, d20

> +       add             r2, sp, #608

> +       vld1.8          {d18-d19}, [r2, : 128]

> +       vmlal.s32       q2, d18, d7

> +       vmlal.s32       q2, d19, d6

> +       vmlal.s32       q5, d18, d6

> +       vmlal.s32       q5, d19, d21

> +       vmlal.s32       q1, d18, d21

> +       vmlal.s32       q1, d19, d29

> +       vmlal.s32       q0, d18, d28

> +       vmlal.s32       q0, d19, d9

> +       vmlal.s32       q6, d18, d29

> +       vmlal.s32       q6, d19, d28

> +       add             r2, sp, #560

> +       vld1.8          {d18-d19}, [r2, : 128]

> +       add             r2, sp, #480

> +       vld1.8          {d22-d23}, [r2, : 128]

> +       vmlal.s32       q5, d19, d7

> +       vmlal.s32       q0, d18, d21

> +       vmlal.s32       q0, d19, d29

> +       vmlal.s32       q6, d18, d6

> +       add             r2, sp, #496

> +       vld1.8          {d6-d7}, [r2, : 128]

> +       vmlal.s32       q6, d19, d21

> +       add             r2, sp, #544

> +       vld1.8          {d18-d19}, [r2, : 128]

> +       vmlal.s32       q0, d30, d8

> +       add             r2, sp, #640

> +       vld1.8          {d20-d21}, [r2, : 128]

> +       vmlal.s32       q5, d30, d29

> +       add             r2, sp, #576

> +       vld1.8          {d24-d25}, [r2, : 128]

> +       vmlal.s32       q1, d30, d28

> +       vadd.i64        q13, q0, q11

> +       vadd.i64        q14, q5, q11

> +       vmlal.s32       q6, d30, d9

> +       vshr.s64        q4, q13, #26

> +       vshr.s64        q13, q14, #26

> +       vadd.i64        q7, q7, q4

> +       vshl.i64        q4, q4, #26

> +       vadd.i64        q14, q7, q3

> +       vadd.i64        q9, q9, q13

> +       vshl.i64        q13, q13, #26

> +       vadd.i64        q15, q9, q3

> +       vsub.i64        q0, q0, q4

> +       vshr.s64        q4, q14, #25

> +       vsub.i64        q5, q5, q13

> +       vshr.s64        q13, q15, #25

> +       vadd.i64        q6, q6, q4

> +       vshl.i64        q4, q4, #25

> +       vadd.i64        q14, q6, q11

> +       vadd.i64        q2, q2, q13

> +       vsub.i64        q4, q7, q4

> +       vshr.s64        q7, q14, #26

> +       vshl.i64        q13, q13, #25

> +       vadd.i64        q14, q2, q11

> +       vadd.i64        q8, q8, q7

> +       vshl.i64        q7, q7, #26

> +       vadd.i64        q15, q8, q3

> +       vsub.i64        q9, q9, q13

> +       vshr.s64        q13, q14, #26

> +       vsub.i64        q6, q6, q7

> +       vshr.s64        q7, q15, #25

> +       vadd.i64        q10, q10, q13

> +       vshl.i64        q13, q13, #26

> +       vadd.i64        q14, q10, q3

> +       vadd.i64        q1, q1, q7

> +       add             r2, r3, #240

> +       vshl.i64        q7, q7, #25

> +       add             r4, r3, #144

> +       vadd.i64        q15, q1, q11

> +       add             r2, r2, #8

> +       vsub.i64        q2, q2, q13

> +       add             r4, r4, #8

> +       vshr.s64        q13, q14, #25

> +       vsub.i64        q7, q8, q7

> +       vshr.s64        q8, q15, #26

> +       vadd.i64        q14, q13, q13

> +       vadd.i64        q12, q12, q8

> +       vtrn.32         d12, d14

> +       vshl.i64        q8, q8, #26

> +       vtrn.32         d13, d15

> +       vadd.i64        q3, q12, q3

> +       vadd.i64        q0, q0, q14

> +       vst1.8          d12, [r2, : 64]!

> +       vshl.i64        q7, q13, #4

> +       vst1.8          d13, [r4, : 64]!

> +       vsub.i64        q1, q1, q8

> +       vshr.s64        q3, q3, #25

> +       vadd.i64        q0, q0, q7

> +       vadd.i64        q5, q5, q3

> +       vshl.i64        q3, q3, #25

> +       vadd.i64        q6, q5, q11

> +       vadd.i64        q0, q0, q13

> +       vshl.i64        q7, q13, #25

> +       vadd.i64        q8, q0, q11

> +       vsub.i64        q3, q12, q3

> +       vshr.s64        q6, q6, #26

> +       vsub.i64        q7, q10, q7

> +       vtrn.32         d2, d6

> +       vshr.s64        q8, q8, #26

> +       vtrn.32         d3, d7

> +       vadd.i64        q3, q9, q6

> +       vst1.8          d2, [r2, : 64]

> +       vshl.i64        q6, q6, #26

> +       vst1.8          d3, [r4, : 64]

> +       vadd.i64        q1, q4, q8

> +       vtrn.32         d4, d14

> +       vshl.i64        q4, q8, #26

> +       vtrn.32         d5, d15

> +       vsub.i64        q5, q5, q6

> +       add             r2, r2, #16

> +       vsub.i64        q0, q0, q4

> +       vst1.8          d4, [r2, : 64]

> +       add             r4, r4, #16

> +       vst1.8          d5, [r4, : 64]

> +       vtrn.32         d10, d6

> +       vtrn.32         d11, d7

> +       sub             r2, r2, #8

> +       sub             r4, r4, #8

> +       vtrn.32         d0, d2

> +       vtrn.32         d1, d3

> +       vst1.8          d10, [r2, : 64]

> +       vst1.8          d11, [r4, : 64]

> +       sub             r2, r2, #24

> +       sub             r4, r4, #24

> +       vst1.8          d0, [r2, : 64]

> +       vst1.8          d1, [r4, : 64]

> +       ldr             r2, [sp, #456]

> +       ldr             r4, [sp, #460]

> +       subs            r5, r2, #1

> +       bge             .Lmainloop

> +       add             r1, r3, #144

> +       add             r2, r3, #336

> +       vld1.8          {d0-d1}, [r1, : 128]!

> +       vld1.8          {d2-d3}, [r1, : 128]!

> +       vld1.8          {d4}, [r1, : 64]

> +       vst1.8          {d0-d1}, [r2, : 128]!

> +       vst1.8          {d2-d3}, [r2, : 128]!

> +       vst1.8          d4, [r2, : 64]

> +       movw            r1, #0

> +.Linvertloop:

> +       add             r2, r3, #144

> +       movw            r4, #0

> +       movw            r5, #2

> +       cmp             r1, #1

> +       moveq           r5, #1

> +       addeq           r2, r3, #336

> +       addeq           r4, r3, #48

> +       cmp             r1, #2

> +       moveq           r5, #1

> +       addeq           r2, r3, #48

> +       cmp             r1, #3

> +       moveq           r5, #5

> +       addeq           r4, r3, #336

> +       cmp             r1, #4

> +       moveq           r5, #10

> +       cmp             r1, #5

> +       moveq           r5, #20

> +       cmp             r1, #6

> +       moveq           r5, #10

> +       addeq           r2, r3, #336

> +       addeq           r4, r3, #336

> +       cmp             r1, #7

> +       moveq           r5, #50

> +       cmp             r1, #8

> +       moveq           r5, #100

> +       cmp             r1, #9

> +       moveq           r5, #50

> +       addeq           r2, r3, #336

> +       cmp             r1, #10

> +       moveq           r5, #5

> +       addeq           r2, r3, #48

> +       cmp             r1, #11

> +       moveq           r5, #0

> +       addeq           r2, r3, #96

> +       add             r6, r3, #144

> +       add             r7, r3, #288

> +       vld1.8          {d0-d1}, [r6, : 128]!

> +       vld1.8          {d2-d3}, [r6, : 128]!

> +       vld1.8          {d4}, [r6, : 64]

> +       vst1.8          {d0-d1}, [r7, : 128]!

> +       vst1.8          {d2-d3}, [r7, : 128]!

> +       vst1.8          d4, [r7, : 64]

> +       cmp             r5, #0

> +       beq             .Lskipsquaringloop

> +.Lsquaringloop:

> +       add             r6, r3, #288

> +       add             r7, r3, #288

> +       add             r8, r3, #288

> +       vmov.i32        q0, #19

> +       vmov.i32        q1, #0

> +       vmov.i32        q2, #1

> +       vzip.i32        q1, q2

> +       vld1.8          {d4-d5}, [r7, : 128]!

> +       vld1.8          {d6-d7}, [r7, : 128]!

> +       vld1.8          {d9}, [r7, : 64]

> +       vld1.8          {d10-d11}, [r6, : 128]!

> +       add             r7, sp, #384

> +       vld1.8          {d12-d13}, [r6, : 128]!

> +       vmul.i32        q7, q2, q0

> +       vld1.8          {d8}, [r6, : 64]

> +       vext.32         d17, d11, d10, #1

> +       vmul.i32        q9, q3, q0

> +       vext.32         d16, d10, d8, #1

> +       vshl.u32        q10, q5, q1

> +       vext.32         d22, d14, d4, #1

> +       vext.32         d24, d18, d6, #1

> +       vshl.u32        q13, q6, q1

> +       vshl.u32        d28, d8, d2

> +       vrev64.i32      d22, d22

> +       vmul.i32        d1, d9, d1

> +       vrev64.i32      d24, d24

> +       vext.32         d29, d8, d13, #1

> +       vext.32         d0, d1, d9, #1

> +       vrev64.i32      d0, d0

> +       vext.32         d2, d9, d1, #1

> +       vext.32         d23, d15, d5, #1

> +       vmull.s32       q4, d20, d4

> +       vrev64.i32      d23, d23

> +       vmlal.s32       q4, d21, d1

> +       vrev64.i32      d2, d2

> +       vmlal.s32       q4, d26, d19

> +       vext.32         d3, d5, d15, #1

> +       vmlal.s32       q4, d27, d18

> +       vrev64.i32      d3, d3

> +       vmlal.s32       q4, d28, d15

> +       vext.32         d14, d12, d11, #1

> +       vmull.s32       q5, d16, d23

> +       vext.32         d15, d13, d12, #1

> +       vmlal.s32       q5, d17, d4

> +       vst1.8          d8, [r7, : 64]!

> +       vmlal.s32       q5, d14, d1

> +       vext.32         d12, d9, d8, #0

> +       vmlal.s32       q5, d15, d19

> +       vmov.i64        d13, #0

> +       vmlal.s32       q5, d29, d18

> +       vext.32         d25, d19, d7, #1

> +       vmlal.s32       q6, d20, d5

> +       vrev64.i32      d25, d25

> +       vmlal.s32       q6, d21, d4

> +       vst1.8          d11, [r7, : 64]!

> +       vmlal.s32       q6, d26, d1

> +       vext.32         d9, d10, d10, #0

> +       vmlal.s32       q6, d27, d19

> +       vmov.i64        d8, #0

> +       vmlal.s32       q6, d28, d18

> +       vmlal.s32       q4, d16, d24

> +       vmlal.s32       q4, d17, d5

> +       vmlal.s32       q4, d14, d4

> +       vst1.8          d12, [r7, : 64]!

> +       vmlal.s32       q4, d15, d1

> +       vext.32         d10, d13, d12, #0

> +       vmlal.s32       q4, d29, d19

> +       vmov.i64        d11, #0

> +       vmlal.s32       q5, d20, d6

> +       vmlal.s32       q5, d21, d5

> +       vmlal.s32       q5, d26, d4

> +       vext.32         d13, d8, d8, #0

> +       vmlal.s32       q5, d27, d1

> +       vmov.i64        d12, #0

> +       vmlal.s32       q5, d28, d19

> +       vst1.8          d9, [r7, : 64]!

> +       vmlal.s32       q6, d16, d25

> +       vmlal.s32       q6, d17, d6

> +       vst1.8          d10, [r7, : 64]

> +       vmlal.s32       q6, d14, d5

> +       vext.32         d8, d11, d10, #0

> +       vmlal.s32       q6, d15, d4

> +       vmov.i64        d9, #0

> +       vmlal.s32       q6, d29, d1

> +       vmlal.s32       q4, d20, d7

> +       vmlal.s32       q4, d21, d6

> +       vmlal.s32       q4, d26, d5

> +       vext.32         d11, d12, d12, #0

> +       vmlal.s32       q4, d27, d4

> +       vmov.i64        d10, #0

> +       vmlal.s32       q4, d28, d1

> +       vmlal.s32       q5, d16, d0

> +       sub             r6, r7, #32

> +       vmlal.s32       q5, d17, d7

> +       vmlal.s32       q5, d14, d6

> +       vext.32         d30, d9, d8, #0

> +       vmlal.s32       q5, d15, d5

> +       vld1.8          {d31}, [r6, : 64]!

> +       vmlal.s32       q5, d29, d4

> +       vmlal.s32       q15, d20, d0

> +       vext.32         d0, d6, d18, #1

> +       vmlal.s32       q15, d21, d25

> +       vrev64.i32      d0, d0

> +       vmlal.s32       q15, d26, d24

> +       vext.32         d1, d7, d19, #1

> +       vext.32         d7, d10, d10, #0

> +       vmlal.s32       q15, d27, d23

> +       vrev64.i32      d1, d1

> +       vld1.8          {d6}, [r6, : 64]

> +       vmlal.s32       q15, d28, d22

> +       vmlal.s32       q3, d16, d4

> +       add             r6, r6, #24

> +       vmlal.s32       q3, d17, d2

> +       vext.32         d4, d31, d30, #0

> +       vmov            d17, d11

> +       vmlal.s32       q3, d14, d1

> +       vext.32         d11, d13, d13, #0

> +       vext.32         d13, d30, d30, #0

> +       vmlal.s32       q3, d15, d0

> +       vext.32         d1, d8, d8, #0

> +       vmlal.s32       q3, d29, d3

> +       vld1.8          {d5}, [r6, : 64]

> +       sub             r6, r6, #16

> +       vext.32         d10, d6, d6, #0

> +       vmov.i32        q1, #0xffffffff

> +       vshl.i64        q4, q1, #25

> +       add             r7, sp, #480

> +       vld1.8          {d14-d15}, [r7, : 128]

> +       vadd.i64        q9, q2, q7

> +       vshl.i64        q1, q1, #26

> +       vshr.s64        q10, q9, #26

> +       vld1.8          {d0}, [r6, : 64]!

> +       vadd.i64        q5, q5, q10

> +       vand            q9, q9, q1

> +       vld1.8          {d16}, [r6, : 64]!

> +       add             r6, sp, #496

> +       vld1.8          {d20-d21}, [r6, : 128]

> +       vadd.i64        q11, q5, q10

> +       vsub.i64        q2, q2, q9

> +       vshr.s64        q9, q11, #25

> +       vext.32         d12, d5, d4, #0

> +       vand            q11, q11, q4

> +       vadd.i64        q0, q0, q9

> +       vmov            d19, d7

> +       vadd.i64        q3, q0, q7

> +       vsub.i64        q5, q5, q11

> +       vshr.s64        q11, q3, #26

> +       vext.32         d18, d11, d10, #0

> +       vand            q3, q3, q1

> +       vadd.i64        q8, q8, q11

> +       vadd.i64        q11, q8, q10

> +       vsub.i64        q0, q0, q3

> +       vshr.s64        q3, q11, #25

> +       vand            q11, q11, q4

> +       vadd.i64        q3, q6, q3

> +       vadd.i64        q6, q3, q7

> +       vsub.i64        q8, q8, q11

> +       vshr.s64        q11, q6, #26

> +       vand            q6, q6, q1

> +       vadd.i64        q9, q9, q11

> +       vadd.i64        d25, d19, d21

> +       vsub.i64        q3, q3, q6

> +       vshr.s64        d23, d25, #25

> +       vand            q4, q12, q4

> +       vadd.i64        d21, d23, d23

> +       vshl.i64        d25, d23, #4

> +       vadd.i64        d21, d21, d23

> +       vadd.i64        d25, d25, d21

> +       vadd.i64        d4, d4, d25

> +       vzip.i32        q0, q8

> +       vadd.i64        d12, d4, d14

> +       add             r6, r8, #8

> +       vst1.8          d0, [r6, : 64]

> +       vsub.i64        d19, d19, d9

> +       add             r6, r6, #16

> +       vst1.8          d16, [r6, : 64]

> +       vshr.s64        d22, d12, #26

> +       vand            q0, q6, q1

> +       vadd.i64        d10, d10, d22

> +       vzip.i32        q3, q9

> +       vsub.i64        d4, d4, d0

> +       sub             r6, r6, #8

> +       vst1.8          d6, [r6, : 64]

> +       add             r6, r6, #16

> +       vst1.8          d18, [r6, : 64]

> +       vzip.i32        q2, q5

> +       sub             r6, r6, #32

> +       vst1.8          d4, [r6, : 64]

> +       subs            r5, r5, #1

> +       bhi             .Lsquaringloop

> +.Lskipsquaringloop:

> +       mov             r2, r2

> +       add             r5, r3, #288

> +       add             r6, r3, #144

> +       vmov.i32        q0, #19

> +       vmov.i32        q1, #0

> +       vmov.i32        q2, #1

> +       vzip.i32        q1, q2

> +       vld1.8          {d4-d5}, [r5, : 128]!

> +       vld1.8          {d6-d7}, [r5, : 128]!

> +       vld1.8          {d9}, [r5, : 64]

> +       vld1.8          {d10-d11}, [r2, : 128]!

> +       add             r5, sp, #384

> +       vld1.8          {d12-d13}, [r2, : 128]!

> +       vmul.i32        q7, q2, q0

> +       vld1.8          {d8}, [r2, : 64]

> +       vext.32         d17, d11, d10, #1

> +       vmul.i32        q9, q3, q0

> +       vext.32         d16, d10, d8, #1

> +       vshl.u32        q10, q5, q1

> +       vext.32         d22, d14, d4, #1

> +       vext.32         d24, d18, d6, #1

> +       vshl.u32        q13, q6, q1

> +       vshl.u32        d28, d8, d2

> +       vrev64.i32      d22, d22

> +       vmul.i32        d1, d9, d1

> +       vrev64.i32      d24, d24

> +       vext.32         d29, d8, d13, #1

> +       vext.32         d0, d1, d9, #1

> +       vrev64.i32      d0, d0

> +       vext.32         d2, d9, d1, #1

> +       vext.32         d23, d15, d5, #1

> +       vmull.s32       q4, d20, d4

> +       vrev64.i32      d23, d23

> +       vmlal.s32       q4, d21, d1

> +       vrev64.i32      d2, d2

> +       vmlal.s32       q4, d26, d19

> +       vext.32         d3, d5, d15, #1

> +       vmlal.s32       q4, d27, d18

> +       vrev64.i32      d3, d3

> +       vmlal.s32       q4, d28, d15

> +       vext.32         d14, d12, d11, #1

> +       vmull.s32       q5, d16, d23

> +       vext.32         d15, d13, d12, #1

> +       vmlal.s32       q5, d17, d4

> +       vst1.8          d8, [r5, : 64]!

> +       vmlal.s32       q5, d14, d1

> +       vext.32         d12, d9, d8, #0

> +       vmlal.s32       q5, d15, d19

> +       vmov.i64        d13, #0

> +       vmlal.s32       q5, d29, d18

> +       vext.32         d25, d19, d7, #1

> +       vmlal.s32       q6, d20, d5

> +       vrev64.i32      d25, d25

> +       vmlal.s32       q6, d21, d4

> +       vst1.8          d11, [r5, : 64]!

> +       vmlal.s32       q6, d26, d1

> +       vext.32         d9, d10, d10, #0

> +       vmlal.s32       q6, d27, d19

> +       vmov.i64        d8, #0

> +       vmlal.s32       q6, d28, d18

> +       vmlal.s32       q4, d16, d24

> +       vmlal.s32       q4, d17, d5

> +       vmlal.s32       q4, d14, d4

> +       vst1.8          d12, [r5, : 64]!

> +       vmlal.s32       q4, d15, d1

> +       vext.32         d10, d13, d12, #0

> +       vmlal.s32       q4, d29, d19

> +       vmov.i64        d11, #0

> +       vmlal.s32       q5, d20, d6

> +       vmlal.s32       q5, d21, d5

> +       vmlal.s32       q5, d26, d4

> +       vext.32         d13, d8, d8, #0

> +       vmlal.s32       q5, d27, d1

> +       vmov.i64        d12, #0

> +       vmlal.s32       q5, d28, d19

> +       vst1.8          d9, [r5, : 64]!

> +       vmlal.s32       q6, d16, d25

> +       vmlal.s32       q6, d17, d6

> +       vst1.8          d10, [r5, : 64]

> +       vmlal.s32       q6, d14, d5

> +       vext.32         d8, d11, d10, #0

> +       vmlal.s32       q6, d15, d4

> +       vmov.i64        d9, #0

> +       vmlal.s32       q6, d29, d1

> +       vmlal.s32       q4, d20, d7

> +       vmlal.s32       q4, d21, d6

> +       vmlal.s32       q4, d26, d5

> +       vext.32         d11, d12, d12, #0

> +       vmlal.s32       q4, d27, d4

> +       vmov.i64        d10, #0

> +       vmlal.s32       q4, d28, d1

> +       vmlal.s32       q5, d16, d0

> +       sub             r2, r5, #32

> +       vmlal.s32       q5, d17, d7

> +       vmlal.s32       q5, d14, d6

> +       vext.32         d30, d9, d8, #0

> +       vmlal.s32       q5, d15, d5

> +       vld1.8          {d31}, [r2, : 64]!

> +       vmlal.s32       q5, d29, d4

> +       vmlal.s32       q15, d20, d0

> +       vext.32         d0, d6, d18, #1

> +       vmlal.s32       q15, d21, d25

> +       vrev64.i32      d0, d0

> +       vmlal.s32       q15, d26, d24

> +       vext.32         d1, d7, d19, #1

> +       vext.32         d7, d10, d10, #0

> +       vmlal.s32       q15, d27, d23

> +       vrev64.i32      d1, d1

> +       vld1.8          {d6}, [r2, : 64]

> +       vmlal.s32       q15, d28, d22

> +       vmlal.s32       q3, d16, d4

> +       add             r2, r2, #24

> +       vmlal.s32       q3, d17, d2

> +       vext.32         d4, d31, d30, #0

> +       vmov            d17, d11

> +       vmlal.s32       q3, d14, d1

> +       vext.32         d11, d13, d13, #0

> +       vext.32         d13, d30, d30, #0

> +       vmlal.s32       q3, d15, d0

> +       vext.32         d1, d8, d8, #0

> +       vmlal.s32       q3, d29, d3

> +       vld1.8          {d5}, [r2, : 64]

> +       sub             r2, r2, #16

> +       vext.32         d10, d6, d6, #0

> +       vmov.i32        q1, #0xffffffff

> +       vshl.i64        q4, q1, #25

> +       add             r5, sp, #480

> +       vld1.8          {d14-d15}, [r5, : 128]

> +       vadd.i64        q9, q2, q7

> +       vshl.i64        q1, q1, #26

> +       vshr.s64        q10, q9, #26

> +       vld1.8          {d0}, [r2, : 64]!

> +       vadd.i64        q5, q5, q10

> +       vand            q9, q9, q1

> +       vld1.8          {d16}, [r2, : 64]!

> +       add             r2, sp, #496

> +       vld1.8          {d20-d21}, [r2, : 128]

> +       vadd.i64        q11, q5, q10

> +       vsub.i64        q2, q2, q9

> +       vshr.s64        q9, q11, #25

> +       vext.32         d12, d5, d4, #0

> +       vand            q11, q11, q4

> +       vadd.i64        q0, q0, q9

> +       vmov            d19, d7

> +       vadd.i64        q3, q0, q7

> +       vsub.i64        q5, q5, q11

> +       vshr.s64        q11, q3, #26

> +       vext.32         d18, d11, d10, #0

> +       vand            q3, q3, q1

> +       vadd.i64        q8, q8, q11

> +       vadd.i64        q11, q8, q10

> +       vsub.i64        q0, q0, q3

> +       vshr.s64        q3, q11, #25

> +       vand            q11, q11, q4

> +       vadd.i64        q3, q6, q3

> +       vadd.i64        q6, q3, q7

> +       vsub.i64        q8, q8, q11

> +       vshr.s64        q11, q6, #26

> +       vand            q6, q6, q1

> +       vadd.i64        q9, q9, q11

> +       vadd.i64        d25, d19, d21

> +       vsub.i64        q3, q3, q6

> +       vshr.s64        d23, d25, #25

> +       vand            q4, q12, q4

> +       vadd.i64        d21, d23, d23

> +       vshl.i64        d25, d23, #4

> +       vadd.i64        d21, d21, d23

> +       vadd.i64        d25, d25, d21

> +       vadd.i64        d4, d4, d25

> +       vzip.i32        q0, q8

> +       vadd.i64        d12, d4, d14

> +       add             r2, r6, #8

> +       vst1.8          d0, [r2, : 64]

> +       vsub.i64        d19, d19, d9

> +       add             r2, r2, #16

> +       vst1.8          d16, [r2, : 64]

> +       vshr.s64        d22, d12, #26

> +       vand            q0, q6, q1

> +       vadd.i64        d10, d10, d22

> +       vzip.i32        q3, q9

> +       vsub.i64        d4, d4, d0

> +       sub             r2, r2, #8

> +       vst1.8          d6, [r2, : 64]

> +       add             r2, r2, #16

> +       vst1.8          d18, [r2, : 64]

> +       vzip.i32        q2, q5

> +       sub             r2, r2, #32

> +       vst1.8          d4, [r2, : 64]

> +       cmp             r4, #0

> +       beq             .Lskippostcopy

> +       add             r2, r3, #144

> +       mov             r4, r4

> +       vld1.8          {d0-d1}, [r2, : 128]!

> +       vld1.8          {d2-d3}, [r2, : 128]!

> +       vld1.8          {d4}, [r2, : 64]

> +       vst1.8          {d0-d1}, [r4, : 128]!

> +       vst1.8          {d2-d3}, [r4, : 128]!

> +       vst1.8          d4, [r4, : 64]

> +.Lskippostcopy:

> +       cmp             r1, #1

> +       bne             .Lskipfinalcopy

> +       add             r2, r3, #288

> +       add             r4, r3, #144

> +       vld1.8          {d0-d1}, [r2, : 128]!

> +       vld1.8          {d2-d3}, [r2, : 128]!

> +       vld1.8          {d4}, [r2, : 64]

> +       vst1.8          {d0-d1}, [r4, : 128]!

> +       vst1.8          {d2-d3}, [r4, : 128]!

> +       vst1.8          d4, [r4, : 64]

> +.Lskipfinalcopy:

> +       add             r1, r1, #1

> +       cmp             r1, #12

> +       blo             .Linvertloop

> +       add             r1, r3, #144

> +       ldr             r2, [r1], #4

> +       ldr             r3, [r1], #4

> +       ldr             r4, [r1], #4

> +       ldr             r5, [r1], #4

> +       ldr             r6, [r1], #4

> +       ldr             r7, [r1], #4

> +       ldr             r8, [r1], #4

> +       ldr             r9, [r1], #4

> +       ldr             r10, [r1], #4

> +       ldr             r1, [r1]

> +       add             r11, r1, r1, LSL #4

> +       add             r11, r11, r1, LSL #1

> +       add             r11, r11, #16777216

> +       mov             r11, r11, ASR #25

> +       add             r11, r11, r2

> +       mov             r11, r11, ASR #26

> +       add             r11, r11, r3

> +       mov             r11, r11, ASR #25

> +       add             r11, r11, r4

> +       mov             r11, r11, ASR #26

> +       add             r11, r11, r5

> +       mov             r11, r11, ASR #25

> +       add             r11, r11, r6

> +       mov             r11, r11, ASR #26

> +       add             r11, r11, r7

> +       mov             r11, r11, ASR #25

> +       add             r11, r11, r8

> +       mov             r11, r11, ASR #26

> +       add             r11, r11, r9

> +       mov             r11, r11, ASR #25

> +       add             r11, r11, r10

> +       mov             r11, r11, ASR #26

> +       add             r11, r11, r1

> +       mov             r11, r11, ASR #25

> +       add             r2, r2, r11

> +       add             r2, r2, r11, LSL #1

> +       add             r2, r2, r11, LSL #4

> +       mov             r11, r2, ASR #26

> +       add             r3, r3, r11

> +       sub             r2, r2, r11, LSL #26

> +       mov             r11, r3, ASR #25

> +       add             r4, r4, r11

> +       sub             r3, r3, r11, LSL #25

> +       mov             r11, r4, ASR #26

> +       add             r5, r5, r11

> +       sub             r4, r4, r11, LSL #26

> +       mov             r11, r5, ASR #25

> +       add             r6, r6, r11

> +       sub             r5, r5, r11, LSL #25

> +       mov             r11, r6, ASR #26

> +       add             r7, r7, r11

> +       sub             r6, r6, r11, LSL #26

> +       mov             r11, r7, ASR #25

> +       add             r8, r8, r11

> +       sub             r7, r7, r11, LSL #25

> +       mov             r11, r8, ASR #26

> +       add             r9, r9, r11

> +       sub             r8, r8, r11, LSL #26

> +       mov             r11, r9, ASR #25

> +       add             r10, r10, r11

> +       sub             r9, r9, r11, LSL #25

> +       mov             r11, r10, ASR #26

> +       add             r1, r1, r11

> +       sub             r10, r10, r11, LSL #26

> +       mov             r11, r1, ASR #25

> +       sub             r1, r1, r11, LSL #25

> +       add             r2, r2, r3, LSL #26

> +       mov             r3, r3, LSR #6

> +       add             r3, r3, r4, LSL #19

> +       mov             r4, r4, LSR #13

> +       add             r4, r4, r5, LSL #13

> +       mov             r5, r5, LSR #19

> +       add             r5, r5, r6, LSL #6

> +       add             r6, r7, r8, LSL #25

> +       mov             r7, r8, LSR #7

> +       add             r7, r7, r9, LSL #19

> +       mov             r8, r9, LSR #13

> +       add             r8, r8, r10, LSL #12

> +       mov             r9, r10, LSR #20

> +       add             r1, r9, r1, LSL #6

> +       str             r2, [r0]

> +       str             r3, [r0, #4]

> +       str             r4, [r0, #8]

> +       str             r5, [r0, #12]

> +       str             r6, [r0, #16]

> +       str             r7, [r0, #20]

> +       str             r8, [r0, #24]

> +       str             r1, [r0, #28]

> +       movw            r0, #0

> +       mov             sp, ip

> +       pop             {r4-r11, pc}

> +ENDPROC(curve25519_neon)

> +#endif

> diff --git a/lib/zinc/curve25519/curve25519.c b/lib/zinc/curve25519/curve25519.c

> index 32536340d39d..0d5ea97762d4 100644

> --- a/lib/zinc/curve25519/curve25519.c

> +++ b/lib/zinc/curve25519/curve25519.c

> @@ -21,6 +21,8 @@

>

>  #if defined(CONFIG_ZINC_ARCH_X86_64)

>  #include "curve25519-x86_64-glue.h"

> +#elif defined(CONFIG_ZINC_ARCH_ARM)

> +#include "curve25519-arm-glue.h"

>  #else

>  void __init curve25519_fpu_init(void)

>  {

> --

> 2.19.0

>

Jason A. Donenfeld Oct. 3, 2018, 1:03 a.m. UTC | #44

(+Dan,Peter in CC. Replying to:
<https://lore.kernel.org/lkml/CAKv+Gu9FLDRLxHReKcveZYHNYerR5Y2pZd9gn-hWrU0jb2KgfA@mail.gmail.com/>
for context.)

Hi Ard,

On Tue, Oct 2, 2018 at 6:59 PM Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
> Shouldn't this use the new simd abstraction as well?

Yes, it probably should, thanks.

> I guess qhasm means generated code, right?

> Because many of these adds are completely redundant ...

> This looks odd as well.

> Could you elaborate on what qhasm is exactly? And, as with the other

> patches, I would prefer it if we could have your changes as a separate

> patch (although having the qhasm base would be preferred)

Indeed qhasm converts this --
<https://github.com/floodyberry/supercop/blob/master/crypto_scalarmult/curve25519/neon2/scalarmult.pq>
-- into this. It's a thing from Dan (CC'd now) --
<http://cr.yp.to/qhasm.html>. As you've requested, I can layer the
patches to show our changes on top.

> ... you can drop this add

> same here

> and here

> and here

> and here

> and here

> and here

> and here

> redundant add

> I'll stop here - let me just note that this code does not strike me as

> particularly well optimized for in-order cores (such as A7).

> For instance, the sequence

> can be reordered as

> and not have every other instruction depend on the output of the previous one.

> Obviously, the ultimate truth is in the benchmark numbers, but I'd

> thought I'd mention it anyway.

Yes indeed the output is suboptimal in a lot of places. We can
gradually clean this up -- slowly and carefully over time -- if you
want. I can also look into producing a new implementation within HACL*
so that it's verified. Assurance-wise, though, I feel pretty good
about this implementation considering its origins, its breadth of use
(in BoringSSL), the fuzzing hours it's incurred, and the actual
implementation itself.

 Either way, performance-wise, it's really worth having.

For example, on a Cortex-A7, we get these results (according to get_cycles()):

neon: 23142 cycles per call
fiat32: 49136 cycles per call
donna32: 71988 cycles per call

And on a Cortex-A9, we get these results (according to get_cycles()):

neon: 5020 cycles per call
fiat32: 17326 cycles per call
donna32: 28076 cycles per call

Jason

Jason A. Donenfeld Oct. 3, 2018, 3:10 a.m. UTC | #45

On Tue, Oct 2, 2018 at 6:59 PM Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
> Could you elaborate on what qhasm is exactly? And, as with the other

> patches, I would prefer it if we could have your changes as a separate

> patch (although having the qhasm base would be preferred)

By the way, as of a few minutes ago, if you look in the development
tree at the commit called "zinc: Curve25519 ARM implementation", that
now shows the diffs to the original, as you requested. I'll probably
obsess over that a little bit more before v7, but if you see anything
gnarly there beforehand, I'd be happy to hear.

Jason

Eric Biggers Oct. 3, 2018, 5:56 a.m. UTC | #46

On Tue, Sep 25, 2018 at 04:56:20PM +0200, Jason A. Donenfeld wrote:
> diff --git a/crypto/chacha20_zinc.c b/crypto/chacha20_zinc.c

> new file mode 100644

> index 000000000000..f7d70b3efc31

> --- /dev/null

> +++ b/crypto/chacha20_zinc.c

> @@ -0,0 +1,90 @@

> +/* SPDX-License-Identifier: GPL-2.0

> + *

> + * Copyright (C) 2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.

> + */

> +

> +#include <asm/unaligned.h>

> +#include <crypto/algapi.h>

> +#include <crypto/internal/skcipher.h>

> +#include <zinc/chacha20.h>

> +#include <linux/module.h>

> +

> +static int crypto_chacha20_setkey(struct crypto_skcipher *tfm, const u8 *key,

> +				  unsigned int keysize)

> +{

> +	struct chacha20_ctx *ctx = crypto_skcipher_ctx(tfm);

> +

> +	if (keysize != CHACHA20_KEY_SIZE)

> +		return -EINVAL;

> +	chacha20_init(ctx, key, 0);

> +	return 0;

> +}

> +

> +static int crypto_chacha20_crypt(struct skcipher_request *req)

> +{

> +	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);

> +	struct chacha20_ctx *ctx = crypto_skcipher_ctx(tfm);

> +	struct skcipher_walk walk;

> +	simd_context_t simd_context;

> +	int err, i;

> +

> +	err = skcipher_walk_virt(&walk, req, true);

> +	if (unlikely(err))

> +		return err;

> +

> +	for (i = 0; i < ARRAY_SIZE(ctx->counter); ++i)

> +		ctx->counter[i] = get_unaligned_le32(walk.iv + i * sizeof(u32));

> +


Multiple threads may use the same tfm concurrently, so the tfm context must not
be used to store per-request information such as the IV.

- Eric

Eric Biggers Oct. 3, 2018, 6:12 a.m. UTC | #47

On Tue, Sep 25, 2018 at 04:56:10PM +0200, Jason A. Donenfeld wrote:
> These NEON and non-NEON implementations come from Andy Polyakov's

> implementation, and are included here in raw form without modification,

> so that subsequent commits that fix these up for the kernel can see how

> it has changed. This awkward commit splitting has been requested for the

> ARM[64] implementations in particular.

> 

> While this is CRYPTOGAMS code, the originating code for this happens to

> be the same as OpenSSL's commit 5bb1cd2292b388263a0cc05392bb99141212aa53

Sorry to ruin the fun, but actually there are no Poly1305 implementations in
CRYPTOGAMS (https://github.com/dot-asm/cryptogams).  Nor are there any ChaCha20
implementations.

Andy P., can you please add your Poly1305 and ChaCha20 implementations to the
CRYPTOGAMS repository, so that they have a clear kernel-compatible license?

It would be great if you'd include a kernel-compatible license directly in the
versions in the OpenSSL tree too...

Thanks!

- Eric

Ard Biesheuvel Oct. 3, 2018, 7:58 a.m. UTC | #48

On 3 October 2018 at 08:12, Eric Biggers <ebiggers@kernel.org> wrote:
> On Tue, Sep 25, 2018 at 04:56:10PM +0200, Jason A. Donenfeld wrote:

>> These NEON and non-NEON implementations come from Andy Polyakov's

>> implementation, and are included here in raw form without modification,

>> so that subsequent commits that fix these up for the kernel can see how

>> it has changed. This awkward commit splitting has been requested for the

>> ARM[64] implementations in particular.

>>


"This awkward commit splitting"

Seriously?!?

So you really think it is fine to import huge chunks of code like this
from other projects without keeping track of the local changes?

>> While this is CRYPTOGAMS code, the originating code for this happens to

>> be the same as OpenSSL's commit 5bb1cd2292b388263a0cc05392bb99141212aa53

>

> Sorry to ruin the fun, but actually there are no Poly1305 implementations in

> CRYPTOGAMS (https://github.com/dot-asm/cryptogams).  Nor are there any ChaCha20

> implementations.

>


So was this code taken directly from the OpenSSL project then? If so,
why do the commit messages claim otherwise?

> Andy P., can you please add your Poly1305 and ChaCha20 implementations to the

> CRYPTOGAMS repository, so that they have a clear kernel-compatible license?

>

> It would be great if you'd include a kernel-compatible license directly in the

> versions in the OpenSSL tree too...

>


Yes please.

Jason A. Donenfeld Oct. 3, 2018, 2:01 p.m. UTC | #49

Hi Eric,

On Wed, Oct 3, 2018 at 7:56 AM Eric Biggers <ebiggers@kernel.org> wrote:
> Multiple threads may use the same tfm concurrently, so the tfm context must not

> be used to store per-request information such as the IV.


Thanks, fixed for v7.

Jason

Jason A. Donenfeld Oct. 3, 2018, 2:08 p.m. UTC | #50

Hi Ard,

On Wed, Oct 3, 2018 at 9:58 AM Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
> >> it has changed. This awkward commit splitting has been requested for the

> >> ARM[64] implementations in particular.

> >>

>

> "This awkward commit splitting"

Awkward in the sense that only those two commits are doing it, whereas
the rest of the series is not. Not awkward in any other sense that you
seemed to have divined based on your oversized punctuation below.
Fortunately, at your suggestion for v7, I've now done the splitting
for all of the other places, and so I the comment in the dev tree a
few days ago, since it's now done consistently across the patchset.

>

> Seriously?!?

>

> So you really think it is fine to import huge chunks of code like this

> from other projects without keeping track of the local changes?

As explained above, that's not what I meant at all.

Jason

Ard Biesheuvel Oct. 3, 2018, 2:25 p.m. UTC | #51

On 3 October 2018 at 16:13, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
> On 3 October 2018 at 16:12, Jason A. Donenfeld <Jason@zx2c4.com> wrote:

>> Hi Ard,

>>

>> On Wed, Oct 3, 2018 at 1:15 PM Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:

>>> > +config WIREGUARD

>>> > +       tristate "WireGuard secure network tunnel"

>>> > +       depends on NET && INET

>>>

>>> I think you need to add IPV6 here

>>

>> Nope. Like much of the net tree, WireGuard can function on ipv6-less

>> kernels. If you do find something in WireGuard isn't working in a

>> v6-less configuration, I consider that to be a bug that needs fixing.

>>

>

> OK. I hit a build error yesterday, and setting CONFIG_IPV6 fixed it.

> Let me see if I can reproduce.


I get

drivers/net/wireguard/socket.o: In function `send6':
socket.c:(.text+0x56c): undefined reference to `ipv6_chk_addr'
drivers/net/wireguard/socket.o: In function `wg_socket_send_skb_to_peer':
socket.c:(.text+0x904): undefined reference to `ipv6_chk_addr'
drivers/net/wireguard/socket.o: In function `wg_socket_init':
socket.c:(.text+0x161c): undefined reference to `ipv6_mod_enabled'

if I build my kernel with WireGuard built in but IPv6 support enabled
as a module.

Jason A. Donenfeld Oct. 3, 2018, 2:28 p.m. UTC | #52

On Wed, Oct 3, 2018 at 4:25 PM Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
> >> On Wed, Oct 3, 2018 at 1:15 PM Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:

> >>> > +config WIREGUARD

> >>> > +       tristate "WireGuard secure network tunnel"

> >>> > +       depends on NET && INET

> >>>

> >>> I think you need to add IPV6 here

> >>

> >> Nope. Like much of the net tree, WireGuard can function on ipv6-less

> >> kernels. If you do find something in WireGuard isn't working in a

> >> v6-less configuration, I consider that to be a bug that needs fixing.

> >>

> >

> > OK. I hit a build error yesterday, and setting CONFIG_IPV6 fixed it.

> > Let me see if I can reproduce.

>

> I get

>

> drivers/net/wireguard/socket.o: In function `send6':

> socket.c:(.text+0x56c): undefined reference to `ipv6_chk_addr'

> drivers/net/wireguard/socket.o: In function `wg_socket_send_skb_to_peer':

> socket.c:(.text+0x904): undefined reference to `ipv6_chk_addr'

> drivers/net/wireguard/socket.o: In function `wg_socket_init':

> socket.c:(.text+0x161c): undefined reference to `ipv6_mod_enabled'

>

> if I build my kernel with WireGuard built in but IPv6 support enabled

> as a module.


Nice catch. Looks like the norm is adding "depends on IPV6 || !IPV6"
or the like. I'll play and make sure this works for v7. Thanks for
pointing it out.

Jason

Jason A. Donenfeld Oct. 3, 2018, 2:45 p.m. UTC | #53

On Wed, Oct 3, 2018 at 4:08 PM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> for all of the other places, and so I the comment in the dev tree a


The missing word between "I" and "the" is "extirpated".

D. J. Bernstein Oct. 5, 2018, 3:05 p.m. UTC | #54

For the in-order ARM Cortex-A8 (the target for this code), adjacent
multiply-add instructions forward summands quickly. A simple in-order
dot-product computation has no latency problems, while interleaving
computations, as suggested in this thread, creates problems. Also, on
this microarchitecture, occasional ARM instructions run in parallel with
NEON, so trying to manually eliminate ARM instructions through global
pointer tracking wouldn't gain speed; it would simply create unnecessary
code-maintenance problems.

See https://cr.yp.to/papers.html#neoncrypto for analysis of the
performance of---and remaining bottlenecks in---this code. Further
speedups should be possible on this microarchitecture, but, for anyone
interested in this, I recommend focusing on building a cycle-accurate
simulator (e.g., fixing inaccuracies in the Sobole simulator) first.

Of course, there are other ARM microarchitectures, and there are many
cases where different microarchitectures prefer different optimizations.
The kernel already has boot-time benchmarks for different optimizations
for raid6, and should do the same for crypto code, so that implementors
can focus on each microarchitecture separately rather than living in the
barbaric world of having to choose which CPUs to favor.

---Dan

Ard Biesheuvel Oct. 5, 2018, 3:16 p.m. UTC | #55

On 5 October 2018 at 17:05, D. J. Bernstein <djb@cr.yp.to> wrote:
> For the in-order ARM Cortex-A8 (the target for this code), adjacent

> multiply-add instructions forward summands quickly. A simple in-order

> dot-product computation has no latency problems, while interleaving

> computations, as suggested in this thread, creates problems. Also, on

> this microarchitecture, occasional ARM instructions run in parallel with

> NEON, so trying to manually eliminate ARM instructions through global

> pointer tracking wouldn't gain speed; it would simply create unnecessary

> code-maintenance problems.

>

> See https://cr.yp.to/papers.html#neoncrypto for analysis of the

> performance of---and remaining bottlenecks in---this code. Further

> speedups should be possible on this microarchitecture, but, for anyone

> interested in this, I recommend focusing on building a cycle-accurate

> simulator (e.g., fixing inaccuracies in the Sobole simulator) first.

>

> Of course, there are other ARM microarchitectures, and there are many

> cases where different microarchitectures prefer different optimizations.

> The kernel already has boot-time benchmarks for different optimizations

> for raid6, and should do the same for crypto code, so that implementors

> can focus on each microarchitecture separately rather than living in the

> barbaric world of having to choose which CPUs to favor.

>

Thanks Dan for the insight.

We have already established in a separate discussion that Cortex-A7,
which is main optimization target for future development, does not
have the microarchitectural peculiarity that you are referring to that
ARM instructions are essentially free when interleaved with NEON code.

But I take your point re benchmarking (as I already indicated in my
reply to Jason): if we optimize towards speed, we should ideally reuse
the existing benchmarking infrastructure we have to select the fastest
implementation at runtime. For instance, it turns out that scalar
ChaCha20 is almost as fast as NEON (or even faster?) on A7, and using
NEON in the kernel has some issues of its own.

Jason A. Donenfeld Oct. 5, 2018, 6:40 p.m. UTC | #56

Hey Dan,

On Fri, Oct 05, 2018 at 03:05:38PM -0000, D. J. Bernstein wrote:
> Of course, there are other ARM microarchitectures, and there are many

> cases where different microarchitectures prefer different optimizations.

> The kernel already has boot-time benchmarks for different optimizations

> for raid6, and should do the same for crypto code, so that implementors

> can focus on each microarchitecture separately rather than living in the

> barbaric world of having to choose which CPUs to favor.

I've been playing a bit with some code to do this sort of thing,
choosing a set of implementations to enable or disable by trying all the
combinations, and then calculating a quick median. I don't know if I'll
submit that for the initial merge of this patchset -- and in fact all
the current implementations I'm proposing are pretty much okay on
microarchitectures -- but down the line this could be useful as a
mechanism.

Jason

Jason A. Donenfeld Nov. 12, 2018, 11:53 p.m. UTC | #57

Hey Ivan,

Sorry for not getting back to you sooner.

On Mon, Nov 5, 2018 at 8:06 AM Ivan Labáth <labokml@labo.rs> wrote:
> Any news on this?

>

> To be clear, question is not about an insignificant documentation

> oversight. It is about copying bits from inner packets to outer packets

The short answer is RFC6040 with DSCP fixed to 0 so as not to leak
anything. I've added a description of this to
<wireguard.com/protocol/>.

Regards,
Jason

Dave Taht Nov. 13, 2018, 12:10 a.m. UTC | #58

On Mon, Nov 12, 2018 at 3:54 PM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>

> Hey Ivan,

>

> Sorry for not getting back to you sooner.

>

> On Mon, Nov 5, 2018 at 8:06 AM Ivan Labáth <labokml@labo.rs> wrote:

> > Any news on this?

> >

> > To be clear, question is not about an insignificant documentation

> > oversight. It is about copying bits from inner packets to outer packets

>

> The short answer is RFC6040 with DSCP fixed to 0 so as not to leak

> anything. I've added a description of this to

> <wireguard.com/protocol/>.

you have a speling error (ECM). :)

side note:

I have to say that wireguard works really well with ecn and non-ecn marked flows
against codel and fq_codel on the bottleneck router.

I'd still rather like it if wireguard focused a bit more on
interleaving multiple flows better
rather than on single stream benchmarks, one day.

In this case, codel is managing things not fq and we could possibly
shave a few ms of induced latency off of it in this particular test series:

http://tun.taht.net/~d/wireguard/rrul_-_comcast_v6.png

vs wireguard (doing it ivp6 over that ipv6)

http://tun.taht.net/~d/wireguard/rrul_-_wireguard.png

That said, I've been deploying wireguard widely in replacement of my
old tinc network particularly on machines that were formerly cpu
bottlenecked
and am insanely pleased with it. what's a few extra ms of latency
between friends?

>

> Regards,

> Jason

-- 

Dave Täht
CTO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-831-205-9740

Jason A. Donenfeld Nov. 13, 2018, 12:13 a.m. UTC | #59

On Mon, Nov 12, 2018 at 7:10 PM Dave Taht <dave.taht@gmail.com> wrote:
> you have a speling error (ECM). :)

Thanks.

>

> side note:

>

> I have to say that wireguard works really well with ecn and non-ecn marked flows

> against codel and fq_codel on the bottleneck router.

Yup!

> I'd still rather like it if wireguard focused a bit more on

> interleaving multiple flows better

> rather than on single stream benchmarks, one day.

We're working on it, actually. Toke has been running some tests on
some beefy 100gbps hardware he recently acquired, and after v1 lands
we'll probably have a few interesting ideas for this.

Jason

[net-next,v6,00/23] WireGuard: Secure Network Tunnel

Message

Comments