[net-next,v5,00/20] WireGuard: Secure Network Tunnel

Message ID: 20180918161646.19105-1-Jason@zx2c4.com

Message

Jason A. Donenfeld Sept. 18, 2018, 4:16 p.m. UTC
v5 has the most comprehensive set of changes yet, and I think it
should finally address all of the relevant issues brought up on the
mailing list. In particular, this feedback has come from:

  - Andy Lutomirski
  - Eric Biggers
  - Ard Biesheuvel
  - Kevin Easton
  - Andrew Lunn
  - Martin Willi

Changes v4->v5:
---------------
  - Use fewer inlines, keeping only those measured to be necessary.
  - Reduce size of scattergather array to fit within stack on
    small systems.
  - Account for larger stack frames with KASAN.
  - The x86_64 implementations are selected according to input length.
  - Avoid using simd for small blocks on x86_64.
  - The simd_get/put API is now pass by reference, so that the user
    can lazily initialize the context only when it's actually needed.
    See the description in the first commit, and the sketch after
    this list.
  - Add cycle counts for different sizes to the x86_64 commit messages.
  - Relax simd during chapoly sg loop.
  - Replace -include with #if defined(...)
  - Saner and simpler Kconfig.
  - Split into separate modules instead of one monolithic zinc.
  - The combination of the last three items means that there are no
    longer any conditionals in our Makefile.
  - Martin showed a performance regression using tcrypt in v4. This
    has been triaged and fixed, and now the Zinc code runs faster
    than the previous code.
  - While I initially wasn't going to do this for the initial
    patchset, it was just so simple to do: now there's a nosimd
    module parameter that can be used to disable simd instructions
    for debugging and testing, or on weird systems.
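
For illustration, a caller of that lazy API looks roughly like this.
This is only a sketch (do_one_packet() is a made-up helper), and the
first commit remains the authoritative description:

static void encrypt_queue(struct sk_buff_head *queue, u8 key[32])
{
	simd_context_t simd_context;
	struct sk_buff *skb;

	simd_get(&simd_context);  /* cheap: no FPU state touched yet */
	skb_queue_walk(queue, skb) {
		/* The arch code saves FPU state lazily, the first time
		 * it actually takes a SIMD path for this context. */
		do_one_packet(skb, key, &simd_context);
		simd_relax(&simd_context); /* preemption point per packet */
	}
	simd_put(&simd_context);  /* restores FPU state only if used */
}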

-----------------------------------------------------------

This patchset is available on git.kernel.org in this branch, where it may be
pulled directly for inclusion into net-next:

  * https://git.kernel.org/pub/scm/linux/kernel/git/zx2c4/linux.git/log/?h=jd/wireguard

-----------------------------------------------------------

WireGuard is a secure network tunnel written especially for Linux, which
has faced around three years of serious development, deployment, and
scrutiny. It delivers excellent performance and is extremely easy to
use and configure. It has been designed with the primary goal of being
both easy to audit by virtue of being small and highly secure from a
cryptography and systems security perspective. WireGuard is used by some
massive companies pushing enormous amounts of traffic, and likely
already today you've consumed bytes that at some point transited through
a WireGuard tunnel. Even as an out-of-tree module, WireGuard has been
integrated into various userspace tools, Linux distributions, mobile
phones, and data centers. There are ports in several languages to
several operating systems, and even commercial hardware and services
sold integrating WireGuard. It is time, therefore, for WireGuard to be
properly integrated into Linux.

Ample information, including documentation, installation instructions,
and project details, is available at:

  * https://www.wireguard.com/
  * https://www.wireguard.com/papers/wireguard.pdf

As it is currently an out-of-tree module, it lives in its own git repo
and has its own mailing list, and every commit for the module is tested
against every stable kernel since 3.10 on a variety of architectures
using an extensive test suite:

  * https://git.zx2c4.com/WireGuard
    https://git.kernel.org/pub/scm/linux/kernel/git/zx2c4/WireGuard.git/
  * https://lists.zx2c4.com/mailman/listinfo/wireguard
  * https://www.wireguard.com/build-status/

The project has been broadly discussed at conferences, and was presented
to the Netdev developers in Seoul last November, where a paper was
released detailing some interesting aspects of the project. Dave asked
me after the talk if I would consider sending in a v1 "sooner rather
than later", hence this patchset. A decision is still waiting from the
Linux Plumbers Conference, but an update on these topics may be presented
in Vancouver in a few months. Prior presentations:

  * https://www.wireguard.com/presentations/
  * https://www.wireguard.com/papers/wireguard-netdev22.pdf

The cryptography in the protocol itself has been formally verified by
several independent academic teams with positive results, and I know of
two additional efforts on their way to further corroborate those
findings. The version 1 protocol is "complete", and so the purpose of
this review is to assess the implementation of the protocol. However, it
still may be of interest to know that the thing you're reviewing uses a
protocol with various nice security properties:

  * https://www.wireguard.com/formal-verification/

This patchset is divided into four segments. The first introduces a very
simple helper for working with the FPU state for the purposes of amortizing
SIMD operations. The second segment is a small collection of cryptographic
primitives, split up into several commits by primitive and by hardware. The
third shows usage of Zinc within the existing crypto API and as a replacement
to the existing crypto API. The last is WireGuard itself, presented as an
unintrusive and self-contained virtual network driver.

It is intended that this entire patch series enter the kernel through
DaveM's net-next tree. Subsequently, WireGuard patches will go through
DaveM's net-next tree, while Zinc patches will go through Greg KH's tree.

Enjoy,
Jason

Comments

Jason A. Donenfeld Sept. 18, 2018, 5:12 p.m. UTC | #1
Hi David,

On Tue, Sep 18, 2018 at 06:01:47PM +0100, David Howells wrote:
> Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
> > A while back, I noticed that the crypto and crypto API usage in big_keys
> > were entirely broken in multiple ways, so I rewrote it. Now, I'm
> > rewriting it again, but this time using Zinc's ChaCha20Poly1305
> > function.
>
> warthog>git grep chacha20poly1305_decrypt net-next/master
> warthog1>
>
> Where do I find this?
>
> David

Previously in the patchset:
https://git.kernel.org/pub/scm/linux/kernel/git/zx2c4/linux.git/log/?h=jd/wireguard
https://lore.kernel.org/lkml/20180918161646.19105-1-Jason@zx2c4.com/

Regards,
Jason
Jason A. Donenfeld Sept. 18, 2018, 9:01 p.m. UTC | #2
Hi Ard,

On Tue, Sep 18, 2018 at 11:28:50AM -0700, Ard Biesheuvel wrote:
> On 18 September 2018 at 09:16, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> >   - While I initially wasn't going to do this for the initial
> >     patchset, it was just so simple to do: now there's a nosimd
> >     module parameter that can be used to disable simd instructions
> >     for debugging and testing, or on weird systems.
> >
>
> I was going to respond in the other thread but it is probably better
> to move the discussion here.
>
> My concern about the monolithic nature of each algo module is not only
> about SIMD, and it has nothing to do with weird systems. It has to do
> with micro-architectural differences which are more common on ARM than
> on other architectures *, I suppose. But generalizing from that, it
> has to do with policy which is currently owned by userland and not by
> the kernel. This will also be important for choosing between the
> time-variant but less safe table-based scalar AES and the time-invariant
> version (which is substantially slower, especially on decryption) once
> we move AES into this library.
>
> So a command line option for the kernel is not the solution here. If
> we can't have separate modules, could we at least have per-module
> options that put the policy decisions back into userland?
>
> * as an example, the SHA256 NEON code I collaborated on with Andy
> Polyakov 2 years ago is significantly faster on some cores and not on
> others

Interesting concern. There are micro-architectural quirks on x86 too
that the current code actually already considers. Notably, we use an
AVX-512VL path for Skylake-X but an AVX-512F path for Knights Landing
and Cannon Lake and others, due to thermal throttling when touching the
zmm registers on Skylake-X. So, in the code, we have it automatically
select the right thing based on the micro-architecture.
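
As a concrete illustration (a simplified sketch, not the actual Zinc
glue code), the selection boils down to something like:

static bool chacha20_use_avx512vl __ro_after_init;
static bool chacha20_use_avx512f __ro_after_init;

static void __init chacha20_fpu_init(void)
{
	/* Parts with AVX-512VL can stay on ymm registers, avoiding the
	 * zmm-induced frequency throttling seen on Skylake-X. */
	chacha20_use_avx512vl = boot_cpu_has(X86_FEATURE_AVX512F) &&
				boot_cpu_has(X86_FEATURE_AVX512VL);
	/* e.g. Knights Landing: AVX-512F without the VL extensions. */
	chacha20_use_avx512f = boot_cpu_has(X86_FEATURE_AVX512F) &&
			       !chacha20_use_avx512vl;
}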

Is the same thing not possible with ARM? Do you not have access to this
information already, such that the module can just always do the right
thing and not require any user intervention?

If so, that would be ideal. If not (and I'm curious to learn why not
exactly), then indeed we could add some runtime knobs in /sys/module/
{algo}/parameters/{knob}, or the like. This would be super easy to do,
should we ever encounter a situation where we're unable to auto-detect
the correct thing.
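
For illustration, such a knob would be a one-liner per module, e.g.
(hypothetical; this isn't in the patchset):

static bool force_generic __read_mostly;
module_param(force_generic, bool, 0644);
MODULE_PARM_DESC(force_generic,
		 "Disable micro-architecture specific implementations");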

Regards,
Jason
Andrew Lunn Sept. 18, 2018, 11:34 p.m. UTC | #3
> +#define push_rcu(stack, p, len) ({                                             \
> +		if (rcu_access_pointer(p)) {                                   \
> +			BUG_ON(len >= 128);                                    \
> +			stack[len++] = rcu_dereference_raw(p);                 \
> +		}                                                              \
> +		true;                                                          \
> +	})
> +static void root_free_rcu(struct rcu_head *rcu)
> +{
> +	struct allowedips_node *node, *stack[128] = {
> +		container_of(rcu, struct allowedips_node, rcu) };
> +	unsigned int len = 1;
> +
> +	while (len > 0 && (node = stack[--len]) &&
> +	       push_rcu(stack, node->bit[0], len) &&
> +	       push_rcu(stack, node->bit[1], len))
> +		kfree(node);
> +}

Hi Jason

I see this BUG_ON() is still here. It really needs to be removed. It
does not look like you need to crash the kernel here. Can you add in a
test of len >= 128, do a WARN and then return? I think you then leak
some memory, but I would much prefer that to a crashed machine.
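
Something like this is what I have in mind, just as a sketch, with
the quoted macro turned into a function for clarity:

static bool push_rcu(struct allowedips_node **stack,
		     struct allowedips_node __rcu *p, unsigned int *len)
{
	if (rcu_access_pointer(p)) {
		if (WARN_ON(*len >= 128))
			return false;	/* stop walking; leaks the subtree */
		stack[(*len)++] = rcu_dereference_raw(p);
	}
	return true;
}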

     Andrew
Jason A. Donenfeld Sept. 19, 2018, 12:17 a.m. UTC | #4
Hi Eric,

On Wed, Sep 19, 2018 at 12:55 AM Eric Biggers <ebiggers@kernel.org> wrote:
> This will compute the wrong digest if called with simd_context=HAVE_FULL_SIMD
> and then later with simd_context=HAVE_NO_SIMD, since poly1305_blocks_neon()
> converts the accumulator from base 32 to base 26, whereas poly1305_blocks_arm()
> assumes it is still in base 32.  Is that intentional?  I'm sure this is a rare
> case, but my understanding is that the existing crypto API doesn't preclude
> calling successive steps in different contexts.  And I'm concerned that it could
> be relevant in some cases, e.g. especially if people are importing a hash state
> that was exported earlier.  Handling it by silently computing the wrong digest
> is not a great idea...

Indeed you're right; Samuel and I were just discussing that recently.
I'd rather handle this correctly even if the contexts change, so I'll
see if I can fix this up properly for that unlikely case in the next
revision.

Jason
Eric Biggers Sept. 19, 2018, 12:41 a.m. UTC | #5
On Tue, Sep 18, 2018 at 06:16:38PM +0200, Jason A. Donenfeld wrote:
> The C implementation was originally based on Samuel Neves' public
> domain reference implementation but has since been heavily modified
> for the kernel. We're able to do compile-time optimizations by moving
> some scaffolding around the final function into the header file.
>
> Information: https://blake2.net/
>
> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
> Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Greg KH <gregkh@linuxfoundation.org>
> Cc: Jean-Philippe Aumasson <jeanphilippe.aumasson@gmail.com>
> ---
>  include/zinc/blake2s.h      |   95 ++
>  lib/zinc/Kconfig            |    3 +
>  lib/zinc/Makefile           |    3 +
>  lib/zinc/blake2s/blake2s.c  |  301 +++++
>  lib/zinc/selftest/blake2s.h | 2095 +++++++++++++++++++++++++++++++++++
>  5 files changed, 2497 insertions(+)
>  create mode 100644 include/zinc/blake2s.h
>  create mode 100644 lib/zinc/blake2s/blake2s.c
>  create mode 100644 lib/zinc/selftest/blake2s.h
>
> diff --git a/include/zinc/blake2s.h b/include/zinc/blake2s.h
> new file mode 100644
> index 000000000000..951281596274
> --- /dev/null
> +++ b/include/zinc/blake2s.h
> @@ -0,0 +1,95 @@
> +/* SPDX-License-Identifier: MIT
> + *
> + * Copyright (C) 2015-2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
> + */
> +
> +#ifndef _ZINC_BLAKE2S_H
> +#define _ZINC_BLAKE2S_H
> +
> +#include <linux/types.h>
> +#include <linux/kernel.h>
> +#include <crypto/algapi.h>
> +
> +enum blake2s_lengths {
> +	BLAKE2S_BLOCKBYTES = 64,
> +	BLAKE2S_OUTBYTES = 32,
> +	BLAKE2S_KEYBYTES = 32
> +};
> +
> +struct blake2s_state {
> +	u32 h[8];
> +	u32 t[2];
> +	u32 f[2];
> +	u8 buf[BLAKE2S_BLOCKBYTES];
> +	size_t buflen;
> +	u8 last_node;
> +};
> +
> +void blake2s_init(struct blake2s_state *state, const size_t outlen);
> +void blake2s_init_key(struct blake2s_state *state, const size_t outlen,
> +		      const void *key, const size_t keylen);
> +void blake2s_update(struct blake2s_state *state, const u8 *in, size_t inlen);
> +void __blake2s_final(struct blake2s_state *state);
> +static inline void blake2s_final(struct blake2s_state *state, u8 *out,
> +				 const size_t outlen)
> +{
> +	int i;
> +
> +#ifdef DEBUG
> +	BUG_ON(!out || !outlen || outlen > BLAKE2S_OUTBYTES);
> +#endif
> +	__blake2s_final(state);
> +
> +	if (__builtin_constant_p(outlen) && !(outlen % sizeof(u32))) {
> +		if (IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) ||
> +		    IS_ALIGNED((unsigned long)out, __alignof__(u32))) {
> +			__le32 *outwords = (__le32 *)out;
> +
> +			for (i = 0; i < outlen / sizeof(u32); ++i)
> +				outwords[i] = cpu_to_le32(state->h[i]);
> +		} else {
> +			__le32 buffer[BLAKE2S_OUTBYTES];

This buffer is 4 times too long.

> +
> +			for (i = 0; i < outlen / sizeof(u32); ++i)
> +				buffer[i] = cpu_to_le32(state->h[i]);
> +			memcpy(out, buffer, outlen);
> +			memzero_explicit(buffer, sizeof(buffer));
> +		}
> +	} else {
> +		u8 buffer[BLAKE2S_OUTBYTES] __aligned(__alignof__(u32));
> +		__le32 *outwords = (__le32 *)buffer;
> +
> +		for (i = 0; i < 8; ++i)
> +			outwords[i] = cpu_to_le32(state->h[i]);
> +		memcpy(out, buffer, outlen);
> +		memzero_explicit(buffer, sizeof(buffer));
> +	}
> +
> +	memzero_explicit(state, sizeof(*state));
> +}

Or how about something much simpler:

static inline void blake2s_final(struct blake2s_state *state, u8 *out,
				 const size_t outlen)
{
#ifdef DEBUG
	BUG_ON(!out || !outlen || outlen > BLAKE2S_OUTBYTES);
#endif
	__blake2s_final(state);

	cpu_to_le32_array(state->h, ARRAY_SIZE(state->h));
	memcpy(out, state->h, outlen);

	memzero_explicit(state, sizeof(*state));
}
Jason A. Donenfeld Sept. 19, 2018, 12:45 a.m. UTC | #6
Hey Eric,

On Wed, Sep 19, 2018 at 2:41 AM Eric Biggers <ebiggers@kernel.org> wrote:
> This buffer is 4 times too long.

Nice catch.

> Or how about something much simpler:
>
> static inline void blake2s_final(struct blake2s_state *state, u8 *out,
>                                  const size_t outlen)
> {
> #ifdef DEBUG
>         BUG_ON(!out || !outlen || outlen > BLAKE2S_OUTBYTES);
> #endif
>         __blake2s_final(state);
>
>         cpu_to_le32_array(state->h, ARRAY_SIZE(state->h));
>         memcpy(out, state->h, outlen);
>
>         memzero_explicit(state, sizeof(*state));
> }

Oh, that's excellent, thanks. Much better than prior. I'll do exactly that.

Jason
Eric Biggers Sept. 19, 2018, 12:50 a.m. UTC | #7
On Tue, Sep 18, 2018 at 06:16:33PM +0200, Jason A. Donenfeld wrote:
> diff --git a/lib/zinc/poly1305/poly1305.c b/lib/zinc/poly1305/poly1305.c
> new file mode 100644
> index 000000000000..dbab82f33aa7
> --- /dev/null
> +++ b/lib/zinc/poly1305/poly1305.c
> @@ -0,0 +1,155 @@
> +/* SPDX-License-Identifier: MIT
> + *
> + * Copyright (C) 2015-2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
> + *
> + * Implementation of the Poly1305 message authenticator.
> + *
> + * Information: https://cr.yp.to/mac.html
> + */
> +
> +#include <zinc/poly1305.h>
> +
> +#include <asm/unaligned.h>
> +#include <linux/kernel.h>
> +#include <linux/string.h>
> +#include <linux/module.h>
> +#include <linux/init.h>
> +
> +#ifndef HAVE_POLY1305_ARCH_IMPLEMENTATION
> +static inline bool poly1305_init_arch(void *ctx,
> +				      const u8 key[POLY1305_KEY_SIZE])
> +{
> +	return false;
> +}
> +static inline bool poly1305_blocks_arch(void *ctx, const u8 *input,
> +					const size_t len, const u32 padbit,
> +					simd_context_t *simd_context)
> +{
> +	return false;
> +}
> +static inline bool poly1305_emit_arch(void *ctx, u8 mac[POLY1305_MAC_SIZE],
> +				      const u32 nonce[4],
> +				      simd_context_t *simd_context)
> +{
> +	return false;
> +}
> +void __init poly1305_fpu_init(void)
> +{
> +}
> +#endif
> +
> +#if defined(CONFIG_ARCH_SUPPORTS_INT128) && defined(__SIZEOF_INT128__)
> +#include "poly1305-donna64.h"
> +#else
> +#include "poly1305-donna32.h"
> +#endif
> +
> +void poly1305_init(struct poly1305_ctx *ctx, const u8 key[POLY1305_KEY_SIZE])
> +{
> +	ctx->nonce[0] = get_unaligned_le32(&key[16]);
> +	ctx->nonce[1] = get_unaligned_le32(&key[20]);
> +	ctx->nonce[2] = get_unaligned_le32(&key[24]);
> +	ctx->nonce[3] = get_unaligned_le32(&key[28]);
> +
> +	if (!poly1305_init_arch(ctx->opaque, key))
> +		poly1305_init_generic(ctx->opaque, key);
> +
> +	ctx->num = 0;
> +}
> +EXPORT_SYMBOL(poly1305_init);
> +
> +static inline void poly1305_blocks(void *ctx, const u8 *input, const size_t len,
> +				   const u32 padbit,
> +				   simd_context_t *simd_context)
> +{
> +	if (!poly1305_blocks_arch(ctx, input, len, padbit, simd_context))
> +		poly1305_blocks_generic(ctx, input, len, padbit);
> +}
> +
> +static inline void poly1305_emit(void *ctx, u8 mac[POLY1305_KEY_SIZE],
> +				 const u32 nonce[4],
> +				 simd_context_t *simd_context)
> +{
> +	if (!poly1305_emit_arch(ctx, mac, nonce, simd_context))
> +		poly1305_emit_generic(ctx, mac, nonce);
> +}
> +
> +void poly1305_update(struct poly1305_ctx *ctx, const u8 *input, size_t len,
> +		     simd_context_t *simd_context)
> +{
> +	const size_t num = ctx->num % POLY1305_BLOCK_SIZE;
> +	size_t rem;

0 <= ctx->num < POLY1305_BLOCK_SIZE, so no need to mod by POLY1305_BLOCK_SIZE.

> +
> +	if (num) {
> +		rem = POLY1305_BLOCK_SIZE - num;
> +		if (len < rem) {
> +			memcpy(ctx->data + num, input, len);
> +			ctx->num = num + len;
> +			return;
> +		}
> +		memcpy(ctx->data + num, input, rem);
> +		poly1305_blocks(ctx->opaque, ctx->data, POLY1305_BLOCK_SIZE, 1,
> +				simd_context);
> +		input += rem;
> +		len -= rem;
> +	}
> +
> +	rem = len % POLY1305_BLOCK_SIZE;
> +	len -= rem;
> +
> +	if (len >= POLY1305_BLOCK_SIZE) {
> +		poly1305_blocks(ctx->opaque, input, len, 1, simd_context);
> +		input += len;
> +	}
> +
> +	if (rem)
> +		memcpy(ctx->data, input, rem);
> +
> +	ctx->num = rem;
> +}
> +EXPORT_SYMBOL(poly1305_update);
> +
> +void poly1305_final(struct poly1305_ctx *ctx, u8 mac[POLY1305_MAC_SIZE],
> +		    simd_context_t *simd_context)
> +{
> +	size_t num = ctx->num % POLY1305_BLOCK_SIZE;

Same here.

> +++ b/lib/zinc/selftest/poly1305.h
> @@ -0,0 +1,875 @@
> +/* SPDX-License-Identifier: MIT
> + *
> + * Copyright (C) 2015-2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
> + */
> +
> +#ifdef DEBUG
> +struct poly1305_testvec {
> +	u8 input[600];
> +	u8 output[POLY1305_MAC_SIZE];
> +	u8 key[POLY1305_KEY_SIZE];
> +	size_t ilen;
> +};
> +
> +static const struct poly1305_testvec poly1305_testvecs[] __initconst = {
> +{ /* RFC7539 */
> +	.input	= { 0x43, 0x72, 0x79, 0x70, 0x74, 0x6f, 0x67, 0x72,
> +		    0x61, 0x70, 0x68, 0x69, 0x63, 0x20, 0x46, 0x6f,
> +		    0x72, 0x75, 0x6d, 0x20, 0x52, 0x65, 0x73, 0x65,
> +		    0x61, 0x72, 0x63, 0x68, 0x20, 0x47, 0x72, 0x6f,
> +		    0x75, 0x70 },
> +	.ilen	= 34,
> +	.output	= { 0xa8, 0x06, 0x1d, 0xc1, 0x30, 0x51, 0x36, 0xc6,
> +		    0xc2, 0x2b, 0x8b, 0xaf, 0x0c, 0x01, 0x27, 0xa9 },
> +	.key	= { 0x85, 0xd6, 0xbe, 0x78, 0x57, 0x55, 0x6d, 0x33,
> +		    0x7f, 0x44, 0x52, 0xfe, 0x42, 0xd5, 0x06, 0xa8,
> +		    0x01, 0x03, 0x80, 0x8a, 0xfb, 0x0d, 0xb2, 0xfd,
> +		    0x4a, 0xbf, 0xf6, 0xaf, 0x41, 0x49, 0xf5, 0x1b },
> +}, { /* "The Poly1305-AES message-authentication code" */

Hardcoding the 'input' array to 600 bytes forces the full amount of space to be
reserved in the kernel image for every test vector.  Also, if anyone adds a
longer test vector they will need to remember to increase the value.

It should be a const pointer instead, like the test vectors in crypto/testmgr.h.

- Eric
Jason A. Donenfeld Sept. 19, 2018, 1:35 a.m. UTC | #8
On Wed, Sep 19, 2018 at 2:50 AM Eric Biggers <ebiggers@kernel.org> wrote:
> Hardcoding the 'input' array to 600 bytes forces the full amount of space to be
> reserved in the kernel image for every test vector.  Also, if anyone adds a
> longer test vector they will need to remember to increase the value.
>
> It should be a const pointer instead, like the test vectors in crypto/testmgr.h.

I know. The agony. This has been really annoying me. I originally did
it the right way, but removed it last week, when I noticed that gcc
failed to put it in the initconst section:

https://git.zx2c4.com/WireGuard/commit/?id=f4698d20f13946afc6ce99e98685ba3f9adc4474

Even changing the (u8[]){ ... } into a (const u8[]){ ... } or even
into a const string literal does not do the trick. It makes it into
the constant data section with const, but it does not make it into the
initconst section. What a bummer.

I went asking about this on the gcc mailing list, to see if there was
just some aspect of C that I had overlooked:
https://gcc.gnu.org/ml/gcc/2018-09/msg00043.html So far, it looks like
we're SOL. I could probably make some macros to do this in a .S file
but... that's pretty unacceptably gross. Or I could define lots and
lots of variables in __initconst, and then connect them all together
at the end, but that's pretty gross too. Or I could have all this data
in one variable and record offsets into it, but that's even more
atrocious.

So I think it comes down to these two non-ugly options:
- We use fixed sized buffers, waste a lot of space, and be happy that
it's cleared from memory immediately after init/insertion anyway, so
it's not actually wasting ram.
- We use const string literals / constant compound literals, save
space on disk, but not benefit from having it cleared from memory
after init/insertion.

Of these, which would you prefer? I can see the argument both ways,
but in the end opted for the first. Or perhaps you have a better third
option?
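
For reference, the const-pointer shape Eric is suggesting, following
crypto/testmgr.h, would be roughly this (illustrative only):

struct poly1305_testvec {
	const u8 *input;	/* only as long as each vector needs */
	u8 output[POLY1305_MAC_SIZE];
	u8 key[POLY1305_KEY_SIZE];
	size_t ilen;
};

It's getting the pointed-to literals themselves into __initconst that
I haven't found a non-ugly way to do.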

Jason
Jason A. Donenfeld Sept. 19, 2018, 1:39 a.m. UTC | #9
> > +     const size_t num = ctx->num % POLY1305_BLOCK_SIZE;
>
> 0 <= ctx->num < POLY1305_BLOCK_SIZE, so no need to mod by POLY1305_BLOCK_SIZE.
>
> > +     size_t num = ctx->num % POLY1305_BLOCK_SIZE;
>
> Same here.

I know, but I was having a hard time convincing gcc-8 of that
invariant, and it was warning me. Perhaps this is something they
fixed between 8.1 and 8.2, though. I'll check back and adjust
accordingly.
Jason A. Donenfeld Sept. 19, 2018, 1:41 a.m. UTC | #10
On Wed, Sep 19, 2018 at 3:39 AM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
> > > +     const size_t num = ctx->num % POLY1305_BLOCK_SIZE;
> > 0 <= ctx->num < POLY1305_BLOCK_SIZE, so no need to mod by POLY1305_BLOCK_SIZE.
> > > +     size_t num = ctx->num % POLY1305_BLOCK_SIZE;
> > Same here.
>
> I know, but I was having a hard time convincing gcc-8 of that
> invariant, and it was warning me. Perhaps this is something they
> fixed between 8.1 and 8.2, though. I'll check back and adjust
> accordingly.

This was changed here:
https://git.zx2c4.com/WireGuard/commit/?id=37f114a73ba37219b00a66f0a51219a696599745

I can't reproduce with 8.2 anymore, so perhaps I should remove it now.
Unless you'd like to avoid a warning on old compilers. Since there's
no difference in speed, probably we should avoid the 8.1 warning and
leave it be?
Jason A. Donenfeld Sept. 19, 2018, 2:02 a.m. UTC | #11
On Wed, Sep 19, 2018 at 3:08 AM Eric Biggers <ebiggers@kernel.org> wrote:
> Does this consistently perform as well as an implementation that organizes the
> operations such that the quarterrounds for all columns/diagonals are
> interleaved?  As-is, there are tight dependencies in QUARTER_ROUND() (as well as
> in the existing chacha20_block() in lib/chacha20.c, for that matter), so we're
> heavily depending on the compiler to do the needed interleaving so as to not get
> potentially disastrous performance.  Making it explicit could be a good idea.

It does perform as well, and the compiler outputs good code, even on
older compilers. Notably that's all a single statement (via the comma
operator).

> > +}
> > +
> > +static void chacha20_generic(u8 *out, const u8 *in, u32 len, const u32 key[8],
> > +                          const u32 counter[4])
> > +{
> > +     __le32 buf[CHACHA20_BLOCK_WORDS];
> > +     u32 x[] = {
> > +             EXPAND_32_BYTE_K,
> > +             key[0], key[1], key[2], key[3],
> > +             key[4], key[5], key[6], key[7],
> > +             counter[0], counter[1], counter[2], counter[3]
> > +     };
> > +
> > +     if (out != in)
> > +             memmove(out, in, len);
> > +
> > +     while (len >= CHACHA20_BLOCK_SIZE) {
> > +             chacha20_block_generic(buf, x);
> > +             crypto_xor(out, (u8 *)buf, CHACHA20_BLOCK_SIZE);
> > +             len -= CHACHA20_BLOCK_SIZE;
> > +             out += CHACHA20_BLOCK_SIZE;
> > +     }
> > +     if (len) {
> > +             chacha20_block_generic(buf, x);
> > +             crypto_xor(out, (u8 *)buf, len);
> > +     }
> > +}
>
> If crypto_xor_cpy() is used instead of crypto_xor(), and 'in' is incremented
> along with 'out', then the memmove() is not needed.

Nice idea, thanks. Implemented.
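
Concretely, the loop becomes something like this (a sketch of the
rework, not necessarily the exact code that will land in v6):

	while (len >= CHACHA20_BLOCK_SIZE) {
		chacha20_block_generic(buf, x);
		crypto_xor_cpy(out, in, (u8 *)buf, CHACHA20_BLOCK_SIZE);
		len -= CHACHA20_BLOCK_SIZE;
		out += CHACHA20_BLOCK_SIZE;
		in += CHACHA20_BLOCK_SIZE;
	}
	if (len) {
		chacha20_block_generic(buf, x);
		crypto_xor_cpy(out, in, (u8 *)buf, len);
	}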

Jason
Jason A. Donenfeld Sept. 19, 2018, 2:04 a.m. UTC | #12
Hi Andrew,

On Wed, Sep 19, 2018 at 1:34 AM Andrew Lunn <andrew@lunn.ch> wrote:
> I see this BUG_ON() is still here. It really needs to be removed. It
> does not look like you need to crash the kernel here. Can you add in a
> test of len >= 128, do a WARN and then return? I think you then leak
> some memory, but I would much prefer that to a crashed machine.

Sure, I'll change it to that.

Regards,
Jason
Jason A. Donenfeld Sept. 19, 2018, 2:14 a.m. UTC | #13
Hi Thomas,

On Wed, Sep 19, 2018 at 12:30 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> I'm a bit confused by this SOB chain. So above you write that it's from
> Andy Polyakov's implementation and Samuel did the changes. But here it seems
> you are the main author. If Samuel just did some modifications then you
> want to use the Co-developed-by tag along with his SOB.

Thanks, I'll use that tag.

> Also I'd recommend to add an Originally-by or Based-on-code-from: Andy
> Polyakov tag. Both are not formal tags but widely in use for attributions.

Great idea.

> > +++ b/lib/zinc/chacha20/chacha20-x86_64-glue.h
> > @@ -0,0 +1,100 @@
> > +/* SPDX-License-Identifier: MIT
>
> Please put that into a separate one liner comment. Also this should be
> 'GPL-2.0[+] or MIT' I think.

I had that originally, but changed it to just MIT, since MIT is a
subset of GPL-2.0. And looking at the tree, it appears this is what
others do too.

Jason
Thomas Gleixner Sept. 19, 2018, 6:13 a.m. UTC | #14
Jason,

On Wed, 19 Sep 2018, Jason A. Donenfeld wrote:
> On Wed, Sep 19, 2018 at 12:30 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> > > +++ b/lib/zinc/chacha20/chacha20-x86_64-glue.h
> > > @@ -0,0 +1,100 @@
> > > +/* SPDX-License-Identifier: MIT
> >
> > Please put that into a separate one liner comment. Also this should be
> > 'GPL-2.0[+] or MIT' I think.
>
> I had that originally, but changed it to just MIT, since MIT is a
> subset of GPL-2.0. And looking at the tree, it appears this is what
> others do too.

Subset? Not really. Both MIT and BSD3-Clause are GPL2.0 compatible
licenses. And if your intention is to have those files MIT/BSD only, yes
then the single license identifier is the right thing. If you want it dual
licensed then it should be expressed there clearly.

Thanks,

	tglx
Jason A. Donenfeld Sept. 19, 2018, 11:33 a.m. UTC | #15
Hi Thomas,

On Wed, Sep 19, 2018 at 8:13 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> Subset? Not really. Both MIT and BSD3-Clause are GPL2.0 compatible
> licenses. And if your intention is to have those files MIT/BSD only, yes
> then the single license identifier is the right thing. If you want it dual
> licensed then it should be expressed there clearly.

I always thought "GPL2 compatible" was the same as "subset" because of
the restrictions clause of GPL2. But IANAL, so for the avoidance of
doubt I'll take your advice and put both.
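
That is, a single line like this (assuming this is the exact
identifier v6 ends up using):

/* SPDX-License-Identifier: GPL-2.0 OR MIT */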

Jason
Jason A. Donenfeld Sept. 19, 2018, 12:26 p.m. UTC | #16
On Wed, Sep 19, 2018 at 1:50 PM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
> On Wed, Sep 19, 2018 at 6:19 AM Andy Lutomirski <luto@amacapital.net> wrote:
> > Can you not uglify the code a bit by using normal (non-compound) literals as described in the response to that email?
>
> I can define a bunch of variables, and then string them all together
> in an array at the end. This is a bit ugly and less maintainable, but
> would be the best of both worlds.

Alright, I wrote a little program to do this for me in a way that
actually doesn't look half bad. Thanks for suggesting.

Jason
Andrew Lunn Sept. 19, 2018, 12:38 p.m. UTC | #17
On Wed, Sep 19, 2018 at 04:04:01AM +0200, Jason A. Donenfeld wrote:
> Hi Andrew,
>
> On Wed, Sep 19, 2018 at 1:34 AM Andrew Lunn <andrew@lunn.ch> wrote:
> > I see this BUG_ON() is still here. It really needs to be removed. It
> > does not look like you need to crash the kernel here. Can you add in a
> > test of len >= 128, do a WARN and then return? I think you then leak
> > some memory, but I would much prefer that to a crashed machine.
>
> Sure, I'll change it to that.

Great, thanks. I noticed there is at least one more BUG()
statement. It would be good to remove them all. BUG() should only be
used when something bad has already happened and we want to minimise
the damage by killing the machine immediately.

    Andrew
Ard Biesheuvel Sept. 20, 2018, 3:41 p.m. UTC | #18
(+ Arnd, Eric)

On 18 September 2018 at 09:16, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
...

> diff --git a/lib/zinc/Makefile b/lib/zinc/Makefile
> new file mode 100644
> index 000000000000..83dfd63988c0
> --- /dev/null
> +++ b/lib/zinc/Makefile
> @@ -0,0 +1,4 @@

Apologies for not spotting these before:

> +ccflags-y := -O3

-O3 optimization has been problematic in the past, at least on x86 but
I think on other architectures as well. Please stick with -O2.

> +ccflags-y += -Wframe-larger-than=$(if (CONFIG_KASAN),16384,8192)

There is no way we can support code in the kernel with that kind of
stack space requirements. I will let Arnd comment on what we typically
allow, since he deals with such issues on a regular basis.

> +ccflags-y += -D'pr_fmt(fmt)="zinc: " fmt'
> +ccflags-$(CONFIG_ZINC_DEBUG) += -DDEBUG
> --
> 2.19.0
Jason A. Donenfeld Sept. 21, 2018, 12:11 a.m. UTC | #19
Hey Arnd,

On Thu, Sep 20, 2018 at 6:02 PM Arnd Bergmann <arnd@arndb.de> wrote:
> Right, if you hit a stack requirement like this, it's usually the compiler
> doing something bad, not just using too much stack but also generating
> rather slow object code in the process. It's better to fix the bug by
> optimizing the code to not spill registers to the stack.
>
> In the long run, I'd like to reduce the stack frame size further, so
> best assume that anything over 1024 bytes (on 32-bit) or 1280 bytes
> (on 64-bit) is a bug in the code, and stay below that.
>
> For prototyping, you can just mark the broken functions individually
> by setting the warning limit for a specific function that is known to
> be misoptimized by the compiler (with a comment about which compiler
> and architectures are affected), but not override the limit for the
> entire file.

Thanks for the explanation. Fortunately in my case, the issues were
trivially fixable to get it under 1024/1280. (By the way, why does
64-bit have a slightly larger stack frame? To account for 32 pointers
taking double the space or something?) That will be rectified in v6.

There is one exception though: sometimes KASAN bloats the frame on
64-bit compiles. How would you feel about me adding
'ccflags-$(CONFIG_KASAN) += -Wframe-larger-than=16384' in my makefile?
I'm not remotely close to reaching that limit (it's just a tiny bit
over 1280), but it does seem like telling gcc to quiet down about
stack frames when KASAN is being used might make sense. Alternatively,
I see the defaults for FRAME_WARN are:

config FRAME_WARN
       int "Warn for stack frames larger than (needs gcc 4.4)"
       range 0 8192
       default 3072 if KASAN_EXTRA
       default 2048 if GCC_PLUGIN_LATENT_ENTROPY
       default 1280 if (!64BIT && PARISC)
       default 1024 if (!64BIT && !PARISC)
       default 2048 if 64BIT

What about changing that KASAN_EXTRA into just KASAN? This seems
cleanest; I'll send a patch for it.

On the other hand, this KASAN behavior is only observable on 64-bit
systems when I force it to be 1280, whereas the default is still 2048,
so probably this isn't a problem *now* for this patchset. But it is
something to think about for down the road when you lower the default
frame.

Regards,
Jason
Jason A. Donenfeld Sept. 21, 2018, 12:17 a.m. UTC | #20
Hi Ard,

On Thu, Sep 20, 2018 at 5:42 PM Ard Biesheuvel
<ard.biesheuvel@linaro.org> wrote:
> Apologies for not spotting these before:

>

> > +ccflags-y := -O3

>

> -O3 optimization has been problematic in the past, at least on x86 but

> I think on other architectures as well. Please stick with -O2.


Fixed up for v6.

>

> > +ccflags-y += -Wframe-larger-than=$(if (CONFIG_KASAN),16384,8192)

>

> There is no way we can support code in the kernel with that kind of

> stack space requirements. I will let Arnd comment on what we typically

> allow, since he deals with such issues on a regular basis.


Also fixed up for v6 by just removing that line entirely.

Jason
Andrew Lunn Sept. 21, 2018, 3:12 a.m. UTC | #21
On Fri, Sep 21, 2018 at 02:11:43AM +0200, Jason A. Donenfeld wrote:
> Hey Arnd,

> 

> On Thu, Sep 20, 2018 at 6:02 PM Arnd Bergmann <arnd@arndb.de> wrote:

> > Right, if you hit a stack requirement like this, it's usually the compiler

> > doing something bad, not just using too much stack but also generating

> > rather slow object code in the process. It's better to fix the bug by

> > optimizing the code to not spill registers to the stack.

> >

> > In the long run, I'd like to reduce the stack frame size further, so

> > best assume that anything over 1024 bytes (on 32-bit) or 1280 bytes

> > (on 64-bit) is a bug in the code, and stay below that.

> >

> > For prototyping, you can just mark the broken functions individually

> > by setting the warning limit for a specific function that is known to

> > be misoptimized by the compiler (with a comment about which compiler

> > and architectures are affected), but not override the limit for the

> > entire file.

> 

> Thanks for the explanation. Fortunately in my case, the issues were

> trivially fixable to get it under 1024/1280. (By the way, why does

> 64-bit have a slightly larger stack frame? To account for 32 pointers

> taking double the space or something?) That will be rectified in v6.


Hi Jason

Do you any stack usage information?

A VPN can be at the bottom of some deep stack calls. Swap on NFS over
the VPN? If you have one frame of 1K, you might be O.K. But if you
have a few of these, i can see there might be issues of overflowing
the stack.

    Andrew
Andy Lutomirski Sept. 21, 2018, 3:23 a.m. UTC | #22
> On Sep 20, 2018, at 8:12 PM, Andrew Lunn <andrew@lunn.ch> wrote:
>
>> On Fri, Sep 21, 2018 at 02:11:43AM +0200, Jason A. Donenfeld wrote:
>> Hey Arnd,
>>
>>> On Thu, Sep 20, 2018 at 6:02 PM Arnd Bergmann <arnd@arndb.de> wrote:
>>> Right, if you hit a stack requirement like this, it's usually the compiler
>>> doing something bad, not just using too much stack but also generating
>>> rather slow object code in the process. It's better to fix the bug by
>>> optimizing the code to not spill registers to the stack.
>>>
>>> In the long run, I'd like to reduce the stack frame size further, so
>>> best assume that anything over 1024 bytes (on 32-bit) or 1280 bytes
>>> (on 64-bit) is a bug in the code, and stay below that.
>>>
>>> For prototyping, you can just mark the broken functions individually
>>> by setting the warning limit for a specific function that is known to
>>> be misoptimized by the compiler (with a comment about which compiler
>>> and architectures are affected), but not override the limit for the
>>> entire file.
>>
>> Thanks for the explanation. Fortunately in my case, the issues were
>> trivially fixable to get it under 1024/1280. (By the way, why does
>> 64-bit have a slightly larger stack frame? To account for 32 pointers
>> taking double the space or something?) That will be rectified in v6.
>
> Hi Jason
>
> Do you have any stack usage information?
>
> A VPN can be at the bottom of some deep stack calls. Swap on NFS over
> the VPN? If you have one frame of 1K, you might be O.K. But if you
> have a few of these, I can see there might be issues of overflowing
> the stack.

At the risk of suggesting something awful: on x86_64, since we turn preemption off for simd, it wouldn’t be *completely* insane to do the crypto on the irq stack. It would look like:

kernel_fpu_call(func, arg);

And this helper would disable preemption, enable FPU, switch to the irq stack, call func(arg), disable FPU, enable preemption, and return. And we can have large IRQ stacks.
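
Spelled out (hypothetically, since no such helper exists today, and
this sketch omits the IRQ-stack switch), it would be something like:

static void kernel_fpu_call(void (*func)(void *), void *arg)
{
	kernel_fpu_begin();	/* disables preemption, saves FPU state */
	func(arg);		/* crypto runs here, ideally on the IRQ stack */
	kernel_fpu_end();	/* restores FPU state, reenables preemption */
}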

I refuse to touch this with a ten-foot pole until the lazy FPU restore patches land.

All that being said, why are these frames so large?  It sounds like something may be spilling that ought not to.
Jason A. Donenfeld Sept. 21, 2018, 4:15 a.m. UTC | #23
Hi Andy,

On Fri, Sep 21, 2018 at 5:23 AM Andy Lutomirski <luto@amacapital.net> wrote:
> At the risk of suggesting something awful: on x86_64, since we turn preemption off for simd, it wouldn’t be *completely* insane to do the crypto on the irq stack. It would look like:
>
> kernel_fpu_call(func, arg);
>
> And this helper would disable preemption, enable FPU, switch to the irq stack, call func(arg), disable FPU, enable preemption, and return. And we can have large IRQ stacks.
>
> I refuse to touch this with a ten-foot pole until the lazy FPU restore patches land.

Haha. That's fun, and maybe we'll do that at some point, but I have
some other reasons too for being on a workqueue now.

>

> All that being said, why are these frames so large?  It sounds like something may be spilling that ought not to.


They're not. Well, they're not anymore. I had a silly thing before
like "u8 buffer[1 << 12]" in some debugging code, which is what
prompted the ccflags-y addition. I cleaned up the mistakes like that
and frames are now reasonable everywhere. Non-issue.

Jason
Jason A. Donenfeld Sept. 21, 2018, 4:32 a.m. UTC | #24
Hi Ard,

On Fri, Sep 21, 2018 at 6:30 AM Ard Biesheuvel
<ard.biesheuvel@linaro.org> wrote:
> Did any of the fuzzing/testing you mention in the cover letter occur
> on the kernel versions of these algorithms?

Yes, of course.

Jason
Andy Lutomirski Sept. 21, 2018, 4:52 a.m. UTC | #25
> On Sep 20, 2018, at 9:30 PM, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
>
>> On 20 September 2018 at 21:15, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>> Hi Andy,
>>
>>> On Fri, Sep 21, 2018 at 5:23 AM Andy Lutomirski <luto@amacapital.net> wrote:
>>> At the risk of suggesting something awful: on x86_64, since we turn preemption off for simd, it wouldn’t be *completely* insane to do the crypto on the irq stack. It would look like:
>>>
>>> kernel_fpu_call(func, arg);
>>>
>>> And this helper would disable preemption, enable FPU, switch to the irq stack, call func(arg), disable FPU, enable preemption, and return. And we can have large IRQ stacks.
>>>
>>> I refuse to touch this with a ten-foot pole until the lazy FPU restore patches land.
>>
>> Haha. That's fun, and maybe we'll do that at some point, but I have
>> some other reasons too for being on a workqueue now.
>>
>
> Kernel mode crypto is callable from any context, and SIMD can be used
> in softirq context on arm64 (and on x86, even from hardirq context
> IIRC if the interrupt is taken from userland), in which case we'd
> already be on the irq stack.

The x86_64 irq stack handles nesting already.
Arnd Bergmann Sept. 25, 2018, 7:18 a.m. UTC | #26
On Sat, Sep 22, 2018 at 6:11 PM Arnd Bergmann <arnd@arndb.de> wrote:
>
> On Thu, Sep 20, 2018 at 5:12 PM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> >
> > Hey Arnd,
> >
> > On Thu, Sep 20, 2018 at 6:02 PM Arnd Bergmann <arnd@arndb.de> wrote:
> > > Right, if you hit a stack requirement like this, it's usually the compiler
> > > doing something bad, not just using too much stack but also generating
> > > rather slow object code in the process. It's better to fix the bug by
> > > optimizing the code to not spill registers to the stack.
> > >
> > > In the long run, I'd like to reduce the stack frame size further, so
> > > best assume that anything over 1024 bytes (on 32-bit) or 1280 bytes
> > > (on 64-bit) is a bug in the code, and stay below that.
> > >
> > > For prototyping, you can just mark the broken functions individually
> > > by setting the warning limit for a specific function that is known to
> > > be misoptimized by the compiler (with a comment about which compiler
> > > and architectures are affected), but not override the limit for the
> > > entire file.
> >
> > Thanks for the explanation. Fortunately in my case, the issues were
> > trivially fixable to get it under 1024/1280
>
> A lot of these bugs are not trivial, but we still need a full analysis of what
> failed and what the possible mitigations are. Can you describe more
> specifically what causes it?

I think I misread your earlier sentence and thought you had said the
exact opposite.

For confirmation, I've downloaded your git tree and built it with my
collection of compilers (gcc-4.6 through 8.1) and tried building it
in various configurations. Nothing alarming stood out, the only
thing that I think would might warrant some investigation is this one:

lib/zinc/curve25519/curve25519-hacl64.h: In function 'curve25519_generic':
lib/zinc/curve25519/curve25519-hacl64.h:785:1: warning: the frame size
of 1536 bytes is larger than 500 bytes [-Wframe-larger-than=]

Without KASAN, this takes 832 bytes, which is still more than it should
use from a look at the source code.

I first suspected some misoptimization around the get/put_unaligned_le64()
calls, but playing around with it some more led me to this patch:

diff --git a/lib/zinc/curve25519/curve25519-hacl64.h b/lib/zinc/curve25519/curve25519-hacl64.h
index c7b2924a68c2..1f6eb5708a0e 100644
--- a/lib/zinc/curve25519/curve25519-hacl64.h
+++ b/lib/zinc/curve25519/curve25519-hacl64.h
@@ -182,8 +182,7 @@ static __always_inline void fmul_mul_shift_reduce_(u128 *output, u64 *input,

 static __always_inline void fmul_fmul(u64 *output, u64 *input, u64 *input21)
 {
-       u64 tmp[5];
-       memcpy(tmp, input, 5 * sizeof(*input));
+       u64 tmp[5] = { input[0], input[1], input[2], input[3], input[4] };
        {
                u128 b4;
                u128 b0;

That change gets it down to

lib/zinc/curve25519/curve25519-hacl64.h: In function 'curve25519_generic':
lib/zinc/curve25519/curve25519-hacl64.h:788:1: warning: the frame size
of 640 bytes is larger than 500 bytes [-Wframe-larger-than=]

with KASAN, or 496 bytes without it. This indicates that there might
be something wrong with either gcc-8 or with our fortified memset()
function that requires more investigation. Maybe you can see
something there that I missed.

        Arnd