
[RFC,18/18] net: wireguard - switch to crypto API for packet encryption

Message ID 20190925161255.1871-19-ard.biesheuvel@linaro.org
State New
Series crypto: wireguard using the existing crypto API

Commit Message

Ard Biesheuvel Sept. 25, 2019, 4:12 p.m. UTC
Replace the chacha20poly1305() library calls with invocations of the
RFC7539 AEAD, as implemented by the generic chacha20poly1305 template.

For now, only synchronous AEADs are supported, but looking at the code,
it does not look terribly complicated to add support for async versions
of rfc7539(chacha20,poly1305) as well, some of which already exist in
the drivers/crypto tree.

The nonce related changes are there to address the mismatch between the
96-bit nonce (aka IV) that the rfc7539() template expects, and the 64-bit
nonce that WireGuard uses.

Note that these changes take advantage of the fact that synchronous
instantiations of the generic rfc7539() template will use a zero reqsize
if possible, removing the need for heap allocations for the request
structures.

This code was tested using the included netns.sh script, and by connecting
to the WireGuard demo server
(https://www.wireguard.com/quickstart/#demo-server)

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>

---
 drivers/net/wireguard/noise.c    | 34 ++++++++++++-
 drivers/net/wireguard/noise.h    |  3 +-
 drivers/net/wireguard/queueing.h |  5 +-
 drivers/net/wireguard/receive.c  | 51 ++++++++++++--------
 drivers/net/wireguard/send.c     | 45 ++++++++++-------
 5 files changed, 97 insertions(+), 41 deletions(-)

-- 
2.20.1

Comments

Linus Torvalds Sept. 25, 2019, 10:15 p.m. UTC | #1
On Wed, Sep 25, 2019 at 9:14 AM Ard Biesheuvel
<ard.biesheuvel@linaro.org> wrote:
>
> Replace the chacha20poly1305() library calls with invocations of the
> RFC7539 AEAD, as implemented by the generic chacha20poly1305 template.

Honestly, the other patches look fine to me from what I've seen (with
the small note I had in a separate email for 11/18), but this one I
consider just nasty, and a prime example of why people hate those
crypto lookup routines.

Some of it is just the fundamental and pointless silly indirection,
that just makes things harder to read, less efficient, and less
straightforward.

That's exemplified by this part of the patch:

>  struct noise_symmetric_key {
> -       u8 key[NOISE_SYMMETRIC_KEY_LEN];
> +       struct crypto_aead *tfm;

which is just one of those "we know what we want and we just want to
use it directly" things, and then the crypto indirection comes along
and makes that simple inline allocation of a small constant size
(afaik it is CHACHA20POLY1305_KEY_SIZE, which is 32) be another
allocation entirely.

And it's some random odd non-typed thing too, so then you have that
silly and stupid dynamic allocation using a name lookup:

   crypto_alloc_aead("rfc7539(chacha20,poly1305)", 0, CRYPTO_ALG_ASYNC);

to create what used to be (and should be) a simple allocation that
has a static type and was just part of the code.

It also ends up doing other bad things, ie that packet-time

+       if (unlikely(crypto_aead_reqsize(key->tfm) > 0)) {
+               req = aead_request_alloc(key->tfm, GFP_ATOMIC);
+               if (!req)
+                       return false;

thing that hopefully _is_ unlikely, but that's just more potential
breakage from that whole async crypto interface.

This is what people do *not* want to do, and why people often don't
like the crypto interfaces.

It's exactly why we then have those "bare" routines as a library where
people just want to access the low-level hashing or whatever directly.

So please don't do things like this for an initial import.

Leave the thing alone, and just use the direct and synchronous static
crypto library interface that you imported in patch 14/18 (but see
below about the incomplete import).

Now, later on, if you can *show* that some async implementation really
really helps, and you have numbers for it, and you can convince people
that the above kind of indirection is _worth_ it, then that's a second
phase. But don't make code uglier without those actual numbers.

Because it's not just uglier and has that silly extra indirection and
potential allocation problems, this part just looks very fragile
indeed:

> The nonce related changes are there to address the mismatch between the
> 96-bit nonce (aka IV) that the rfc7539() template expects, and the 64-bit
> nonce that WireGuard uses.
...
>  struct packet_cb {
> -       u64 nonce;
> -       struct noise_keypair *keypair;
>         atomic_t state;
> +       __le32 ivpad;                   /* pad 64-bit nonce to 96 bits */
> +       __le64 nonce;
> +       struct noise_keypair *keypair;
>         u32 mtu;
>         u8 ds;
>  };

The above is subtle and silently depends on the struct layout.

It really really shouldn't.

Can it be acceptable doing something like that? Yeah, but you really
should be making it very explicit, perhaps using

  struct {
        __le32 ivpad;
        __le64 nonce;
   } __packed;

or something like that.

Because right now you're depending on particular layout of those fields:

> +       aead_request_set_crypt(req, sg, sg, skb->len,
> +                              (u8 *)&PACKET_CB(skb)->ivpad);

but honestly, that's not ok at all.

Somebody makes a slight change to that struct, and it might continue
to work fine on x86-32 (where 64-bit values are only 32-bit aligned)
but subtly break on other architectures.

Also, you changed how the nonce works from being in CPU byte order to
be explicitly LE. That may be ok, and looks like it might be a
cleanup, but honestly I think it should have been done as a separate
patch.

So could you please update that patch 14/18 to also have that
synchronous chacha20poly1305_decrypt_sg() interface, and then just
drop this 18/18 for now?

That would mean that

 (a) you wouldn't need this patch, and you can then do that as a
separate second phase once you have numbers and it can stand on its
own.

 (b) you'd actually have something that *builds* when you import the
main wireguard patch in 15/18

because right now it looks like you're not only forcing this async
interface with the unnecessary indirection, you're also basically
having a tree that doesn't even build or work for a couple of commits.

And I'm still not convinced (a) ever makes sense - the overhead of any
accelerator is just high enough that I doubt you'll have numbers -
performance _or_ power.

But even if you're right that it might be a power advantage on some
platform, that wouldn't make it an advantage on other platforms. Maybe
it could be done as a config option where you can opt in to the async
interface when that makes sense - but not force the indirection and
extra allocations when it doesn't. As a separate patch, something like
that doesn't sound horrendous (and I think that's also an argument for
doing that CPU->LE change as an independent change).

Yes, yes, there's also that 17/18 that switches over to a different
header file location and Kconfig names but that could easily be folded
into 15/18 and then it would all be bisectable.

Alternatively, maybe 15/18 could be done with wireguard disabled in
the Kconfig (just to make the patch identical), and then 17/18 enables
it when it compiles with a big note about how you wanted to keep 15/18
pristine to make the changes obvious.

Hmm?

I don't really have a dog in this fight, but on the whole I really
liked the series. But this 18/18 raised my hackles, and I think I
understand why it might raise the hackles of the wireguard people.

Please?

     Linus
Linus Torvalds Sept. 25, 2019, 10:22 p.m. UTC | #2
On Wed, Sep 25, 2019 at 3:15 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I don't really have a dog in this fight, but on the whole I really
> liked the series. But this 18/18 raised my hackles, and I think I
> understand why it might raise the hackles of the wireguard people.

To be honest, I guess I _do_ have a dog in the fight, namely the thing
that I'd love to see wireguard merged.

And this series otherwise looked non-offensive to me. Maybe not
everybody is hugely happy with all the details, but it looks like a
good sane "let's use as much of the existing models as possible" that
nobody should absolutely hate.

                   Linus
Pascal Van Leeuwen Sept. 26, 2019, 9:40 a.m. UTC | #3
> >
> > Replace the chacha20poly1305() library calls with invocations of the
> > RFC7539 AEAD, as implemented by the generic chacha20poly1305 template.
>
> Honestly, the other patches look fine to me from what I've seen (with
> the small note I had in a separate email for 11/18), but this one I
> consider just nasty, and a prime example of why people hate those
> crypto lookup routines.
>
> Some of it is just the fundamental and pointless silly indirection,
> that just makes things harder to read, less efficient, and less
> straightforward.
>
> That's exemplified by this part of the patch:
>
> >  struct noise_symmetric_key {
> > -       u8 key[NOISE_SYMMETRIC_KEY_LEN];
> > +       struct crypto_aead *tfm;
>
> which is just one of those "we know what we want and we just want to
> use it directly" things, and then the crypto indirection comes along
> and makes that simple inline allocation of a small constant size
> (afaik it is CHACHA20POLY1305_KEY_SIZE, which is 32) be another
> allocation entirely.
>
> And it's some random odd non-typed thing too, so then you have that
> silly and stupid dynamic allocation using a name lookup:
>
>    crypto_alloc_aead("rfc7539(chacha20,poly1305)", 0, CRYPTO_ALG_ASYNC);
>
> to create what used to be (and should be) a simple allocation that
> has a static type and was just part of the code.
>

While I agree with the principle of first merging Wireguard without
hooking it up to the Crypto API and doing the latter in a later,
separate patch, I DON'T agree with your bashing of the Crypto API
or HW crypto acceleration in general.

Yes, I do agree that if you need to do the occasional single crypto
op for a fixed algorithm on a small amount of data then you should
just use a simple direct library call. I'm all for a Zinc type
library for that.
(and I believe Ard is actually actively making such changes already)

However, if you're doing bulk crypto like network packet processing
(as Wireguard does!) or disk/filesystem encryption, then that cipher
allocation only happens once every blue moon and the overhead for
that is totally *irrelevant* as it is amortized over many hours or 
days of runtime.

While I generally dislike this whole hype of storing stuff in
textual formats like XML and JSON and then wasting lots of CPU
cycles on parsing that, I've learned to appreciate the power of
these textual Crypto API templates, as they allow a hardware 
accelerator to advertise complex combined operations as single
atomic calls, amortizing the communication overhead between SW
and HW. It's actually very flexible and powerful!

> It also ends up doing other bad things, ie that packet-time
>
> +       if (unlikely(crypto_aead_reqsize(key->tfm) > 0)) {
> +               req = aead_request_alloc(key->tfm, GFP_ATOMIC);
> +               if (!req)
> +                       return false;
>
> thing that hopefully _is_ unlikely, but that's just more potential
> breakage from that whole async crypto interface.
>
> This is what people do *not* want to do, and why people often don't
> like the crypto interfaces.
>

Life is all about needing to do things you don't like to do ...
If you want the performance, you need to put in the effort. It's that simple.
HW acceleration surely won't work from a naive synchronous interface.
(Same applies to running crypto in a separate thread on the CPU, BTW!)

In any case, Wireguard bulk crypto *should* eventually run on top
of Crypto API such that it can leverage *existing* HW acceleration.
It would be incredibly silly not to do so, given the HW exists!

> And I'm still not convinced (a) ever makes sense - the overhead of any
> accelerator is just high enough that I doubt you'll have numbers -
> performance _or_ power.
>

You shouldn't make such assertions if you obviously don't know what
you're talking about. Yes, there is significant overhead on the CPU
for doing lookaside crypto, but it's (usually) nothing compared to
doing the actual crypto itself on the CPU barring a few exceptions. 
(Notably AES-GCM or AES-CTR on ARM64 or x64 CPU's and *maybe* 
Chacha-Poly on recent Intel CPU's - but there's a *lot* more crypto 
being used out there than just AES-GCM and Chacha-Poly, not to 
mention a lot more less capable (embedded) CPU's running Linux)

For anything but those exceptions, we blow even the fastest Intel
server CPU's out of the water with our crypto accelerators.
(I can bore you with some figures actually measured with the
Crypto API on our HW, once I'm done optimizing the driver and I 
have some time to collect the results)

And in any case, for somewhat larger blocks/packets, the overhead
on the CPU would at least be such that it's less than what the CPU
would need to do the crypto itself - even if it's faster - such that
there is room there to do *other*, presumably more useful, work.

Then there's indeed the power consumption issue, which is complex
because crypto power != total system power so it depends too much on
the actual use case to make generic statements on it. So I'll leave
that with the remark that Intel server CPU's have to seriously
throttle down their clock if you start using AVX512 for crypto, just to
stay within their power budget, while we can do the same performance
(~200 Gbps) in just a few (~2) Watts on a similar technology node.
(excluding the CPU management overhead, but that surely won't consume
excessive power like AVX512)

> But even if you're right that it might be a power advantage on some
> platform, that wouldn't make it an advantage on other platforms. Maybe
> it could be done as a config option where you can opt in to the async
> interface when that makes sense - but not force the indirection and
> extra allocations when it doesn't. As a separate patch, something like
> that doesn't sound horrendous (and I think that's also an argument for
> doing that CPU->LE change as an independent change).
>

Making it a switch sounds good to me though.

Regards,
Pascal van Leeuwen
Silicon IP Architect, Multi-Protocol Engines @ Verimatrix
www.insidesecure.com
Ard Biesheuvel Sept. 26, 2019, 11:06 a.m. UTC | #4
On Thu, 26 Sep 2019 at 00:15, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Wed, Sep 25, 2019 at 9:14 AM Ard Biesheuvel
> <ard.biesheuvel@linaro.org> wrote:
> >
> > Replace the chacha20poly1305() library calls with invocations of the
> > RFC7539 AEAD, as implemented by the generic chacha20poly1305 template.
>
> Honestly, the other patches look fine to me from what I've seen (with
> the small note I had in a separate email for 11/18), but this one I
> consider just nasty, and a prime example of why people hate those
> crypto lookup routines.
>
> Some of it is just the fundamental and pointless silly indirection,
> that just makes things harder to read, less efficient, and less
> straightforward.
>
> That's exemplified by this part of the patch:
>
> >  struct noise_symmetric_key {
> > -       u8 key[NOISE_SYMMETRIC_KEY_LEN];
> > +       struct crypto_aead *tfm;
>
> which is just one of those "we know what we want and we just want to
> use it directly" things, and then the crypto indirection comes along
> and makes that simple inline allocation of a small constant size
> (afaik it is CHACHA20POLY1305_KEY_SIZE, which is 32) be another
> allocation entirely.
>
> And it's some random odd non-typed thing too, so then you have that
> silly and stupid dynamic allocation using a name lookup:
>
>    crypto_alloc_aead("rfc7539(chacha20,poly1305)", 0, CRYPTO_ALG_ASYNC);
>
> to create what used to be (and should be) a simple allocation that
> has a static type and was just part of the code.
>

That crypto_alloc_aead() call does a lot of things under the hood:
- use an existing instantiation of rfc7539(chacha20,poly1305) if available,
- look for modules that implement the whole transformation directly,
- if none are found, instantiate the rfc7539 template, which will
essentially do the above for chacha20 and poly1305, potentially using
per-arch accelerated implementations if available (for either), or
otherwise, fall back to the generic versions.

What *I* see as the issue here is not that we need to do this at all,
but that we have to do it for each value of the key. IMO, it would be
much better to instantiate this thing only once, and have a way of
passing a per-request key into it, permitting us to hide the whole
thing behind the existing library interface.


> It also ends up doing other bad things, ie that packet-time
>
> +       if (unlikely(crypto_aead_reqsize(key->tfm) > 0)) {
> +               req = aead_request_alloc(key->tfm, GFP_ATOMIC);
> +               if (!req)
> +                       return false;
>
> thing that hopefully _is_ unlikely, but that's just more potential
> breakage from that whole async crypto interface.
>
> This is what people do *not* want to do, and why people often don't
> like the crypto interfaces.
>
> It's exactly why we then have those "bare" routines as a library where
> people just want to access the low-level hashing or whatever directly.
>
> So please don't do things like this for an initial import.
>

This is tied to the zero reqsize patch earlier in the series. If your
crypto_alloc_aead() call gets fulfilled by the template we modified
earlier (and which is the only synchronous implementation we currently
have), this is guaranteed to be unlikely/false. For async
implementations, however, we need this allocation to be on the heap
since the stack will go away before the call completes.

> Leave the thing alone, and just use the direct and synchronous static
> crypto library interface that you imported in patch 14/18 (but see
> below about the incomplete import).
>

Patch #14 only imports the C library version, not any of the
accelerated versions that Jason proposed. So we will need a way to
hook all the existing accelerated drivers into that library interface,
add handling for SIMD etc.

> Now, later on, if you can *show* that some async implementation really
> really helps, and you have numbers for it, and you can convince people
> that the above kind of indirection is _worth_ it, then that's a second
> phase. But don't make code uglier without those actual numbers.
>
> Because it's not just uglier and has that silly extra indirection and
> potential allocation problems, this part just looks very fragile
> indeed:
>
> > The nonce related changes are there to address the mismatch between the
> > 96-bit nonce (aka IV) that the rfc7539() template expects, and the 64-bit
> > nonce that WireGuard uses.
> ...
> >  struct packet_cb {
> > -       u64 nonce;
> > -       struct noise_keypair *keypair;
> >         atomic_t state;
> > +       __le32 ivpad;                   /* pad 64-bit nonce to 96 bits */
> > +       __le64 nonce;
> > +       struct noise_keypair *keypair;
> >         u32 mtu;
> >         u8 ds;
> >  };
>
> The above is subtle and silently depends on the struct layout.
>
> It really really shouldn't.
>
> Can it be acceptable doing something like that? Yeah, but you really
> should be making it very explicit, perhaps using
>
>   struct {
>         __le32 ivpad;
>         __le64 nonce;
>    } __packed;
>
> or something like that.
>

This is what I started out with, but the packed struct causes GCC on
architectures that care about alignment to switch to the unaligned
accessors, which I tried to avoid. I'll add a build_bug to ensure that
offset(ivpad) + 4 == offset(nonce).

> Because right now you're depending on particular layout of those fields:
>
> > +       aead_request_set_crypt(req, sg, sg, skb->len,
> > +                              (u8 *)&PACKET_CB(skb)->ivpad);
>
> but honestly, that's not ok at all.
>
> Somebody makes a slight change to that struct, and it might continue
> to work fine on x86-32 (where 64-bit values are only 32-bit aligned)
> but subtly break on other architectures.
>
> Also, you changed how the nonce works from being in CPU byte order to
> be explicitly LE. That may be ok, and looks like it might be a
> cleanup, but honestly I think it should have been done as a separate
> patch.
>

Fair enough.

> So could you please update that patch 14/18 to also have that
> synchronous chacha20poly1305_decrypt_sg() interface, and then just
> drop this 18/18 for now?
>

Hmm, not really, because then performance is going to suck. The way we
organise the code in the crypto API today is to have generic C
libraries in lib/crypto, and use the full API for per-arch accelerated
code (or async accelerators). Patch #14 uses the former.

We'll need a way to refactor the existing accelerated code so it is
exposed via the library interface, and there are a couple of options:
- modify our AEAD code so it can take per-request keys - that way, we
could instantiate a single TFM in the implementation of the library,
and keep the chacha20poly1305_[en|de]crypt_sg() interfaces intact,
- create arch/*/lib versions of all the accelerated ChaCha20 and
Poly1305 routines we have, so that the ordinary library precedence
rules give you the fastest implementation available - we may need some
tweaks to the module loader for weak symbols etc to make this
seamless, but it should be doable
- add an entirely new accelerated crypto library stack on the side (aka Zinc)


> That would mean that
>
>  (a) you wouldn't need this patch, and you can then do that as a
> separate second phase once you have numbers and it can stand on its
> own.
>
>  (b) you'd actually have something that *builds* when you import the
> main wireguard patch in 15/18
>
> because right now it looks like you're not only forcing this async
> interface with the unnecessary indirection, you're also basically
> having a tree that doesn't even build or work for a couple of commits.
>

True, but this was intentional, and not intended for merge as-is.

> And I'm still not convinced (a) ever makes sense - the overhead of any
> accelerator is just high enough that I doubt you'll have numbers -
> performance _or_ power.
>
> But even if you're right that it might be a power advantage on some
> platform, that wouldn't make it an advantage on other platforms. Maybe
> it could be done as a config option where you can opt in to the async
> interface when that makes sense - but not force the indirection and
> extra allocations when it doesn't.

I know the code isn't pretty, but it looks worse than it is. I'll look
into using the new static calls framework to instantiate the library
interface based on whatever the platform provides as synchronous
implementations of ChaCha20 and Poly1305.

> As a separate patch, something like
> that doesn't sound horrendous (and I think that's also an argument for
> doing that CPU->LE change as an independent change).
>
> Yes, yes, there's also that 17/18 that switches over to a different
> header file location and Kconfig names but that could easily be folded
> into 15/18 and then it would all be bisectable.
>
> Alternatively, maybe 15/18 could be done with wireguard disabled in
> the Kconfig (just to make the patch identical), and then 17/18 enables
> it when it compiles with a big note about how you wanted to keep 15/18
> pristine to make the changes obvious.
>
> Hmm?
>
> I don't really have a dog in this fight, but on the whole I really
> liked the series. But this 18/18 raised my hackles, and I think I
> understand why it might raise the hackles of the wireguard people.
>

We're likely to spend some time discussing all this before I get
around to respinning this (if ever). But I am also a fan of WireGuard,
and I am eager to finish this discussion once and for all.
Ard Biesheuvel Sept. 26, 2019, 12:34 p.m. UTC | #5
On Thu, 26 Sep 2019 at 13:06, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
>
> On Thu, 26 Sep 2019 at 00:15, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > On Wed, Sep 25, 2019 at 9:14 AM Ard Biesheuvel
> > <ard.biesheuvel@linaro.org> wrote:
> > >
> > > Replace the chacha20poly1305() library calls with invocations of the
> > > RFC7539 AEAD, as implemented by the generic chacha20poly1305 template.
> >
> > Honestly, the other patches look fine to me from what I've seen (with
> > the small note I had in a separate email for 11/18), but this one I
> > consider just nasty, and a prime example of why people hate those
> > crypto lookup routines.
> >
> > Some of it is just the fundamental and pointless silly indirection,
> > that just makes things harder to read, less efficient, and less
> > straightforward.
> >
> > That's exemplified by this part of the patch:
> >
> > >  struct noise_symmetric_key {
> > > -       u8 key[NOISE_SYMMETRIC_KEY_LEN];
> > > +       struct crypto_aead *tfm;
> >
> > which is just one of those "we know what we want and we just want to
> > use it directly" things, and then the crypto indirection comes along
> > and makes that simple inline allocation of a small constant size
> > (afaik it is CHACHA20POLY1305_KEY_SIZE, which is 32) be another
> > allocation entirely.
> >
> > And it's some random odd non-typed thing too, so then you have that
> > silly and stupid dynamic allocation using a name lookup:
> >
> >    crypto_alloc_aead("rfc7539(chacha20,poly1305)", 0, CRYPTO_ALG_ASYNC);
> >
> > to create what used to be (and should be) a simple allocation that
> > has a static type and was just part of the code.
> >
>
> That crypto_alloc_aead() call does a lot of things under the hood:
> - use an existing instantiation of rfc7539(chacha20,poly1305) if available,
> - look for modules that implement the whole transformation directly,
> - if none are found, instantiate the rfc7539 template, which will
> essentially do the above for chacha20 and poly1305, potentially using
> per-arch accelerated implementations if available (for either), or
> otherwise, fall back to the generic versions.
>
> What *I* see as the issue here is not that we need to do this at all,
> but that we have to do it for each value of the key. IMO, it would be
> much better to instantiate this thing only once, and have a way of
> passing a per-request key into it, permitting us to hide the whole
> thing behind the existing library interface.
>

Note that we don't have to do the whole dance for each new value of
the key: subsequent invocations will all succeed at step #1, and grab
the existing instantiation, but allocate a new TFM structure that
refers to it. It is this step that we should be able to omit as well
if the API is changed to allow per-request keys to be passed in via
the request structure.
Linus Torvalds Sept. 26, 2019, 4:35 p.m. UTC | #6
On Thu, Sep 26, 2019 at 2:40 AM Pascal Van Leeuwen
<pvanleeuwen@verimatrix.com> wrote:
>
> While I agree with the principle of first merging Wireguard without
> hooking it up to the Crypto API and doing the latter in a later,
> separate patch, I DON'T agree with your bashing of the Crypto API
> or HW crypto acceleration in general.

I'm not bashing hardware crypto acceleration.

But I *am* bashing bad interfaces to it.

Honestly, you need to face a few facts, Pascal.

Really.

First off: availability.

 (a) hardware doesn't exist right now outside your lab

This is a fact.

 (b) hardware simply WILL NOT exist in any huge number for years,
possibly decades. If ever.

This is just reality, Pascal. Even if you release your hardware
tomorrow, where do you think it will exists? Laptops? PC's? Phones?
No. No. And no.

Phones are likely the strongest argument for power use, but phones
won't really start using it until it's on the SoC, because while they
care deeply about power, they care even more deeply about a lot of
other things a whole lot more. Form factor, price, and customers that
care.

So phones aren't happening. Not for years, and not until it's a big
deal and standard IP that everybody wants.

Laptops and PC's? No. Look at all the crypto acceleration they have today.

That was sarcasm, btw, just to make it clear. It's simply not a market.

End result: even with hardware, the hardware will be very rare. Maybe
routers that want to sell particular features in the near future.

Again, this isn't theory. This is that pesky little thing called
"reality". It sucks, I know.

But even if you *IGNORE* the fact that hardware doesn't exist right
now, and won't be widely available for years (or longer), there's
another little fact that you are ignoring:

The async crypto interfaces are badly designed. Full stop.

Seriously. This isn't rocket science. This is very very basic Computer
Science 101.

Tell me, what's _the_ most important part about optimizing something?

Yeah, it's "you optimize for the common case". But let's ignore that
part, since I already said that hardware isn't the common case but I
promised that I'd ignore that part.

The _other_ most important part of optimization is that you optimize
for the case that _matters_.

And the async crypto case got that completely wrong, and the wireguard
patch shows that issue very clearly.

The fact is, even you admit that a few CPU cycles don't matter for the
async case where you have hardware acceleration, because the real cost
is going to be in the IO - whether it's DMA, cache coherent
interconnects, or some day some SoC special bus. The CPU cycles just
don't matter, they are entirely in the noise.

What does that mean?  Think about it.

[ Time passes ]

Ok, I'll tell you: it means that the interface shouldn't be designed
for async hw-assisted crypto. The interface should be designed for the
case that _does_ care about CPU cycles, and then the async hw-assisted
crypto should be hidden by a conditional, and its (extra) data
structures should be the ones that are behind some indirect pointers
etc.  Because, as we agreed, the async external hw case really doesn't
care. If it has to traverse a pointer or two, and if it has to have a
*SEPARATE* keystore that has longer lifetimes, then the async code can
set that up on its own, but that should not affect the case that
cares.

Really, this is fundamental, and really, the crypto interfaces got this wrong.

This is in fact _so_ fundamental that the only possible reason you can
disagree is because you don't care about reality or fundamentals, and
you only care about the small particular hardware niche you work in
and nothing else.

You really should think about this a bit.

> However, if you're doing bulk crypto like network packet processing
> (as Wireguard does!) or disk/filesystem encryption, then that cipher
> allocation only happens once every blue moon and the overhead for
> that is totally *irrelevant* as it is amortized over many hours or
> days of runtime.

This is not true. It's not true because the overhead happens ALL THE TIME.

And in 99.9% of all cases there are no upsides from the hw crypto,
because the hardware DOES NOT EXIST.

You think the "common case" is that hardware encryption case, but see
above. It's really not.

And when you _do_ have HW encryption, you could do the indirection.

But that's not an argument for always having the indirection.

What indirection am I talking about?

There's multiple levels of indirection in the existing bad crypto interfaces:

 (a) the data structures themselves. This is things like keys and
state storage, both of which are (often) small enough that they would
be better off on a stack, or embedded in the data structures of the
callers.

 (b) the calling conventions. This is things like indirection -
usually several levels - just to find the function pointer to call to,
and then an indirect call to that function pointer.

and both of those are actively bad things when you don't have those
hardware accelerators.

When you *do* have them, you don't care. Everybody agrees about that.
But you're ignoring the big white elephant in the room, which is that
the hw really is very rare in the end, even if it were to exist at
all.

> While I generally dislike this whole hype of storing stuff in
> textual formats like XML and JSON and then wasting lots of CPU
> cycles on parsing that, I've learned to appreciate the power of
> these textual Crypto API templates, as they allow a hardware
> accelerator to advertise complex combined operations as single
> atomic calls, amortizing the communication overhead between SW
> and HW. It's actually very flexible and powerful!

BUT IT IS FUNDAMENTALLY BADLY DESIGNED!

Really.

You can get the advantages of hw-accelerated crypto with good designs.
The current crypto layer is not that.

The current crypto layer simply has indirection at the wrong level.

Here's how it should have been done:

 - make the interfaces be direct calls to the crypto you know you want.

 - make the basic key and state buffer be explicit and let people
allocate them on their own stacks or whatever

"But what about hw acceleration?"

 - add a single indirect private pointer that the direct calls can use
for their own state IF THEY HAVE REASON TO

 - make the direct crypto calls just have a couple of conditionals
inside of them

Why? Direct calls with a couple of conditionals are really cheap for
the non-accelerated case. MUCH cheaper than the indirection overhead
(both on a state level and on a "crypto description" level) that the
current code has.

Seriously. The hw accelerated crypto won't care about the "call a
static routine" part. The hw accelerated crypto won't care about the
"I need to allocate a copy of the key because I can't have it on
stack, and need to have it in a DMA'able region". The hw accelerated
crypto won't care about the two extra instructions that do "do I have
any extra state at all, or should I just do the synchronous CPU
version" before it gets called through some indirect pointer.

So there is absolutely NO DOWNSIDE for hw accelerated crypto to just
do it right, and use an interface like this:

       if (!chacha20poly1305_decrypt_sg(sg, sg, skb->len, NULL, 0,
                                        PACKET_CB(skb)->nonce, key->key,
                                        simd_context))
               return false;

because for the hw accelerated case the costs are all elsewhere.

But for synchronous encryption code on the CPU? Avoiding the
indirection can be a huge win. Avoiding allocations, extra cachelines,
all that overhead. Avoiding indirect calls entirely, because doing a
few conditional branches (that will predict perfectly) on the state
will be a lot more efficient, both in direct CPU cycles and in things
like I$ etc.

In contrast, forcing people to use this model:

       if (unlikely(crypto_aead_reqsize(key->tfm) > 0)) {
               req = aead_request_alloc(key->tfm, GFP_ATOMIC);
               if (!req)
                       return false;
       } else {
               req = &stackreq;
               aead_request_set_tfm(req, key->tfm);
       }

       aead_request_set_ad(req, 0);
       aead_request_set_callback(req, 0, NULL, NULL);
       aead_request_set_crypt(req, sg, sg, skb->len,
                              (u8 *)&PACKET_CB(skb)->ivpad);
       err = crypto_aead_decrypt(req);
       if (unlikely(req != &stackreq))
               aead_request_free(req);
       if (err)
               return false;

isn't going to help anybody. It sure as hell doesn't help the
CPU-synchronous case, and see above: it doesn't even help the hw
accelerated case. It could have had _all_ that "tfm" work behind a
private pointer that the CPU case never touches except to see "ok,
it's NULL, I don't have any".

See?

The interface *should* be that chacha20poly1305_decrypt_sg() library
interface, just give it a private pointer it can use and update. Then,
*internally* it can do something like

     bool chacha20poly1305_decrypt_sg(...)
     {
             struct cc20p1305_ptr *state = *state_p;
             if (state) {
                     .. do basically the above complex thing ..
                     return ret; .. or fall back to sw if the hw
queues are full..
             }
             .. do the CPU only thing..
     }

and now you'd have no extra overhead for the no-hw-accel case, you'd
have a more pleasant interface to use, and you could still use hw
accel if you wanted to.

THIS is why I say that the crypto interface is bad. It was designed
for the wrong objectives. It was designed purely for a SSL-like model
where you do a complex key and algorithm exchange dance, and you
simply don't know ahead of time what crypto you are even using.

And yes, that "I'm doing the SSL thing" used to be a major use of
encryption. I understand why it happened. It was what people did in
the 90's. People thought it was a good idea back then, and it was also
most of the hw acceleration world.

And yes, in that model of "I don't have a clue of what crypto I'm even
using" the model works fine. But honestly, if you can't admit to
yourself that it's wrong for the case where you _do_ know the
algorithm, you have some serious blinders on.

Just from a user standpoint, do you seriously think users _like_
having to do the above 15+ lines of code, vs the single function call?

The crypto interface really isn't pleasant, and you're wrong to
believe that it really helps. The hw acceleration capability could
have been abstracted away, instead of making that indirection be front
and center.

And again - I do realize the historical reasons for it. But
understanding that doesn't magically make it wonderful.

                 Linus
Pascal Van Leeuwen Sept. 27, 2019, 12:15 a.m. UTC | #7
> -----Original Message-----
> From: Linus Torvalds <torvalds@linux-foundation.org>
> Sent: Thursday, September 26, 2019 6:35 PM
> To: Pascal Van Leeuwen <pvanleeuwen@verimatrix.com>
> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>; Linux Crypto Mailing List <linux-
> crypto@vger.kernel.org>; Linux ARM <linux-arm-kernel@lists.infradead.org>; Herbert Xu
> <herbert@gondor.apana.org.au>; David Miller <davem@davemloft.net>; Greg KH
> <gregkh@linuxfoundation.org>; Jason A . Donenfeld <Jason@zx2c4.com>; Samuel Neves
> <sneves@dei.uc.pt>; Dan Carpenter <dan.carpenter@oracle.com>; Arnd Bergmann <arnd@arndb.de>;
> Eric Biggers <ebiggers@google.com>; Andy Lutomirski <luto@kernel.org>; Will Deacon
> <will@kernel.org>; Marc Zyngier <maz@kernel.org>; Catalin Marinas <catalin.marinas@arm.com>
> Subject: Re: [RFC PATCH 18/18] net: wireguard - switch to crypto API for packet encryption
>
> On Thu, Sep 26, 2019 at 2:40 AM Pascal Van Leeuwen
> <pvanleeuwen@verimatrix.com> wrote:
> >
> > While I agree with the principle of first merging Wireguard without
> > hooking it up to the Crypto API and doing the latter in a later,
> > separate patch, I DON'T agree with your bashing of the Crypto API
> > or HW crypto acceleration in general.
>
> I'm not bashing hardware crypto acceleration.
>
> But I *am* bashing bad interfaces to it.
>
And I'm arguing the interface is not that bad, it is the way it is
for good reasons. I think we all agree it's not suitable for the
occasional crypto operation using a fixed algorithm. For that, by
all means use direct library calls. No one is arguing against that.

However, I dare you to come up with something better that would
provide the same flexibility for doing configurable crypto and
offloading these (combined) operations to crypto acceleration 
hardware, depending on its actual capabilities.

i.e. something that would allow offloading rfc7539esp to
accelerators capable of doing that whole transform, while also being
able to offload separate chacha and/or poly operations to less
capable hardware. While actually being able to keep those deep HW
pipelines filled.

> Honestly, you need to face a few facts, Pascal.
>
> Really.
>
> First off: availability.
>
>  (a) hardware doesn't exist right now outside your lab
>
> This is a fact.
>
Actually, that's _not_ a fact at all. For three reasons:

a) I don't even have real hardware (for this). We're an IP provider, 
   we don't make actual hardware. True, I have an FPGA dev board
   for prototyping but an ASIC guy like me considers that "just SW".
b) The actual hardware is in our customers labs, so definitely 
   outside of "my" lab. I don't even _have_ a lab. Just a full desk :-)
c) NXP Layerscape chips supporting Poly-Chacha acceleration can be bought
   right now (i.e NXP LX2160A, look it up). CAAM support for Poly-Chacha
   has been in mainline since kernel 5.0. So there's your hardware.

And does it matter that it doesn't exist today if it is a known fact
it *will* be there in just a few months? The general idea is that you
make sure the SW support is ready *before* you start selling the HW.

>  (b) hardware simply WILL NOT exist in any huge number for years,
> possibly decades. If ever,
>
That remark is just very stupid. The hardware ALREADY exists, and
more hardware is in the pipeline. Once this stuff is designed in, it
usually stays in for many years to come. And these are chips sold in
_serious_ quantities, to be used in things like wireless routers and
DSL, cable and FTTH modems, 5G base stations, etc. etc.

> This is just reality, Pascal.
>
I guess you must live in a different reality from mine? Because I see
some clear mismatch with _known facts_ in *my* reality. But then again,
I'm in this business so I know exactly what's out there. Unlike you.

> Even if you release your hardware tomorrow,
>
We actually released this hardware a looooong time ago, it just takes
very long for silicon to reach the market. And that time is soon.

> where do you think it will exist? Laptops? PC's? Phones?
>
I already answered this above. Many embedded networking use cases.

> No. No. And no.
>
Shouting "no" many times won't make it go away ;-)

> Phones are likely the strongest argument for power use, but phones
> won't really start using it until it's on the SoC, because while they
> care deeply about power, they care even more deeply about a lot of
> other things a whole lot more. Form factor, price, and customers that
> care.
>
> So phones aren't happening. Not for years, and not until it's a big
> deal and standard IP that everybody wants.
>
It will likely not be OUR HW, as the Qualcomms, Samsungs and Mediateks
of this world are very tough cookies to crack for an IP provider, but 
I would expect them to do their own at some point. I would also NOT
expect them to upstream any driver code for that. It may already exist!

> Laptops and PC's? No. Look at all the crypto acceleration they have today.
>
No argument there. If an Intel or AMD adds crypto acceleration, they will
add it to the CPU core itself, for obvious reasons. If you don't actually
design the CPU, you don't have that choice though. (and from a power
consumption perspective, it's actually not that great of a solution)

> That was sarcasm, btw, just to make it clear. It's simply not a market.
>
Eh ... we've been selling into that market for more than 20 years and
we still exist today? How is that possible if it doesn't exist?

> End result: even with hardware, the hardware will be very rare. Maybe
> routers that want to sell particular features in the near future.
>
No, these are just the routers going into *everyone's* home. And 5G
base stations arriving at every other street corner. I wouldn't call
that rare, exactly.

> Again, this isn't theory. This is that pesky little thing called
> "reality". It sucks, I know.
>
You very obviously have absolutely NO idea what you're talking about.
Either that or you're living in some alternate reality.

> But even if you *IGNORE* the fact that hardware doesn't exist right
> now
>
Which I've proven to be FALSE

>, and won't be widely available for years (or longer),
>
Which again doesn't match the FACTS.

> there's another little fact that you are ignoring:
>
> The async crypto interfaces are badly designed. Full stop.
>
They may not be perfect. I think you are free to come up with solutions
to improve on that? But if such a solution would make it impossible to
offload to crypto hardware then *that* would be truly bad interface 
design. Do you have similar arguments about the interfacing to e.g.
graphics processing on the GPU? I'm sure those could be simplified to
be easier to use and make a full SW implementation run that much more
efficiently ... (/sarcasm)

> Seriously. This isn't rocket science. This is very very basic Computer
> Science 101.
>
I know. Rocket science is _easy_ ;-)

> Tell me, what's _the_ most important part about optimizing something?
>
> Yeah, it's "you optimize for the common case". But let's ignore that
> part, since I already said that hardware isn't the common case but I
> promised that I'd ignore that part.
>
> The _other_ most important part of optimization is that you optimize
> for the case that _matters_.
>
Matters to _whom_. What matters to someone running a fat desktop or
server CPU is _not_ what matters to someone running on a relatively
weak embedded CPU that _does_ have powerful crypto acceleration on the
side.
And when it comes to the _common_ case: there's actually a heck of a 
lot more embedded SoCs out there than x86 server/desktop CPU's. Fact!
You probably just don't know much about most of them.

But if you're arguing that the API should be lightweight and not add
significant overhead to a pure SW implementation, then I can agree.
However, I have not seen any PROOF (i.e. actual measurements, not 
theory) that it actually DOES add a lot of overhead. Nor any suggestions 
(beyond the hopelessly naive) for improving it.

> And the async crypto case got that completely wrong, and the wireguard
> patch shows that issue very clearly.
>
> The fact is, even you admit that a few CPU cycles don't matter for the
> async case where you have hardware acceleration, because the real cost
> is going to be in the IO - whether it's DMA, cache coherent
> interconnects, or some day some SoC special bus.
>
I was talking _latency_ not _throughput_. I am _very_ interested in
reducing (API/driver) CPU overhead, if only because it doesn't allow
our HW to reach its full potential. I'm working hard on optimizing our
driver for that right now.
And if something can be improved in the Crypto API itself there, without
compromising its functionality and flexibility, then I'm all for that.

> The CPU cycles just don't matter, they are entirely in the noise.
>
> What does that mean?  Think about it.
>
> [ Time passes ]
>
> Ok, I'll tell you: it means that the interface shouldn't be designed
> for async hw-assisted crypto.
>
If you don't design them with that in mind, you simply won't be able
to effectively use the HW-assisted crypto at all. Just like you don't
design an API to a GPU for running a software implementation on the 
CPU, but for minimizing state changes and batch-queuing large strips
of triangles as that's what the _HW_ needs to be efficiently used.
Just sayin'.

> The interface should be designed for the
> case that _does_ care about CPU cycles, and then the async hw-assisted
> crypto should be hidden by a conditional, and its (extra) data
> structures should be the ones that are behind some indirect pointers
> etc.  Because, as we agreed, the async external hw case really doesn't
> care. If it has to traverse a pointer or two, and if it has to have a
> *SEPARATE* keystore that has longer lifetimes, then the async code can
> set that up on its own, but that should not affect the case that
> cares.
>
What the hardware cares about is that you can batch queue your requests
and not busy-wait for any results. What the hardware cares about is 
that you don't send separate requests for encryption, authentication,
IV generation, etc, but combine this in a single request, hence the
templates in the Crypto API. What the hardware may care about, is that
you keep your key changes limited to just those actually required.
That is _fundamental_ to getting performance. 
Note that many SW implementations that require multiple independent
operations in flight to achieve maximum efficiency (due to deep or
dual pipelines), and/or spend significant cycles on key scheduling,
will ALSO benefit from these properties.
An async interface also makes it possible to run the actual crypto ops 
in multiple independent threads, on multiple CPUs, although I'm not
sure if the current Crypto API actually leverages that right now.

> Really, this is fundamental, and really, the crypto interfaces got this wrong.
>
> This is in fact _so_ fundamental that the only possible reason you can
> disagree is because you don't care about reality or fundamentals, and
> you only care about the small particular hardware niche you work in
> and nothing else.
>
Well, to be completely honest, I for sure don't care about making the SW
implementations run faster at the expense of HW offload capabilities.
Which is obvious, as I make a _living_ creating HW offload solutions.
Why would I actively work towards obsoleting my work?

FACT is that dedicated HW, in many cases, is still MUCH faster than the
CPU. At much lower consumption to boot. So why would you NOT want to 
leverage that, if it's available? That would just be dumb.

Again, I don't see you making the same argument about moving graphics
functionality from the GPU to the CPU. So why does crypto *have* to be
on the CPU? I just don't understand _why_ you care about that so much.

> You really should think about this a bit.
>
I've been thinking about this daily for about 19 years now. It's my job.

> > However, if you're doing bulk crypto like network packet processing
> > (as Wireguard does!) or disk/filesystem encryption, then that cipher
> > allocation only happens once every blue moon and the overhead for
> > that is totally *irrelevant* as it is amortized over many hours or
> > days of runtime.
>
> This is not true. It's not true because the overhead happens ALL THE TIME.
>
The overhead for the _cipher allocation_ (because that's what _I_ was
talking about specifically) happens all the time? You understand you
really only need to do that twice per connection? (once per direction)

But there will be some overhead on the cipher requests themselves,
sure. A small compromise to make for the _possibility_ to use HW 
offload when it IS available. I would agree that if that overhead
turns out to be very significant, then something needs to be done
about that. But then PROVE that that's the case and provide solutions
that do not immediately make HW offload downright impossible.
As our HW is _for sure_ much faster than the CPU cluster (yes, all
CPU's combined at full utilization) it is _usually_ paired with.

> And in 99.9% of all cases there are no upsides from the hw crypto,
> because the hardware DOES NOT EXIST.
>
Which is a _false_ assumption, we covered that several times before.

> You think the "common case" is that hardware encryption case, but see
> above. It's really not.
>
See my argument above about there being many more embedded SoC's out
there than x86 CPU's. Which usually have some form of crypto accel
on the side. Which will, eventually, have Chacha-Poly support 
because that's what the industry is currently gravitating towards.
So define _common_ case.

> And when you _do_ have HW encryption, you could do the indirection.
>
Again, not if the API was not architected to do so from the get-go.

> But that's not an argument for always having the indirection.
>
> What indirection am I talking about?
>
> There's multiple levels of indirection in the existing bad crypto interfaces:
>
>  (a) the data structures themselves. This is things like keys and
> state storage, both of which are (often) small enough that they would
> be better off on a stack, or embedded in the data structures of the
> callers.
>
>  (b) the calling conventions. This is things like indirection -
> usually several levels - just to find the function pointer to call to,
> and then an indirect call to that function pointer.
>
> and both of those are actively bad things when you don't have those
> hardware accelerators.
>
I would say those things are not required just for hardware acceleration,
so perhaps something can be improved there in the existing code.
Ever tried suggesting these to the Crypto API maintainers before?

> When you *do* have them, you don't care. Everybody agrees about that.
> But you're ignoring the big white elephant in the room, which is that
> the hw really is very rare in the end, even if it were to exist at
> all.
>
Crypto acceleration in general is _not_ rare, almost any embedded SoC
has it. The big white elephant in the room is _actually_ that there 
never were decent, standard, ubiquitous API's available to use them
so most of them could only be used from dedicated in-house applications.
Which seriously hampered general acceptance and actual _use_.

> > While I generally dislike this whole hype of storing stuff in
> > textual formats like XML and JSON and then wasting lots of CPU
> > cycles on parsing that, I've learned to appreciate the power of
> > these textual Crypto API templates, as they allow a hardware
> > accelerator to advertise complex combined operations as single
> > atomic calls, amortizing the communication overhead between SW
> > and HW. It's actually very flexible and powerful!
>
> BUT IT IS FUNDAMENTALLY BADLY DESIGNED!
>
> Really.
>
> You can get the advantages of hw-accelerated crypto with good designs.
> The current crypto layer is not that.
>
> The current crypto layer simply has indirection at the wrong level.
>
> Here's how it should have been done:
>
>  - make the interfaces be direct calls to the crypto you know you want.
>
Which wouldn't work for stuff like IPsec and dmcrypt, where you want
to be able to configure the crypto to be used, i.e. it's _not_ fixed.
And you don't want to have to modify those applications _everytime_ a
new algorithm is added. As the application shouldn't care about that,
it should just be able to leverage it for what it is.

Also, for combined operations, there needs to be some central place
where they are decomposed into optimal sub-operations, if necessary, 
without bothering the actual applications with that.

Having a simple direct crypto call is just a very naive solution
that would require changing _every_ application (for which this is
relevant) anytime you add a ciphersuite. It does not scale.

Yes, that will - by necessity - involve some indirection but as long as
you don't process anything crazy short, a single indirect call (or 2)
should really not be that significant on the total operation.
(and modern CPU's can predict indirect branches pretty well, too)

Note that all these arguments have actually _nothing_ to do with
supporting HW acceleration, they apply just as well to SW only.

>  - make the basic key and state buffer be explicit and let people
> allocate them on their own stacks or whatever
>
Hey, no argument there. I don't see any good reason why the key can't
be on the stack. I doubt any hardware would be able to DMA that as-is
directly, and in any case, key changes should be infrequent, so copying
it to some DMA buffer should not be a performance problem.
So maybe that's an area for improvement: allow that to be on the stack.

> "But what about hw acceleration?"
>
>  - add a single indirect private pointer that the direct calls can use
> for their own state IF THEY HAVE REASON TO
>
>  - make the direct crypto calls just have a couple of conditionals
> inside of them
>
> Why? Direct calls with a couple of conditionals are really cheap for
> the non-accelerated case. MUCH cheaper than the indirection overhead
> (both on a state level and on a "crypto description" level) that the
> current code has.
>
I already explained the reasons for _not_ doing direct calls above.

> Seriously. The hw accelerated crypto won't care about the "call a
> static routine" part.
>
Correct! It's totally unrelated.

> The hw accelerated crypto won't care about the
> "I need to allocate a copy of the key because I can't have it on
> stack, and need to have it in a DMA'able region".
>
Our HW surely won't; some HW might care, but copying it should be OK.

> The hw accelerated
> crypto won't care about the two extra instructions that do "do I have
> any extra state at all, or should I just do the synchronous CPU
> version" before it gets called through some indirect pointer.
>
Actually, here I _do_ care. I want minimal CPU overhead just a much
as you do, probably even more desperately. But OK, I would be able to
live with that, if that were the _only_ downside.

> So there is absolutely NO DOWNSIDE for hw accelerated crypto to just
> do it right, and use an interface like this:
>
>        if (!chacha20poly1305_decrypt_sg(sg, sg, skb->len, NULL, 0,
>                                         PACKET_CB(skb)->nonce, key->key,
>                                         simd_context))
>                return false;
>
Well, for one thing, a HW API should not expect the result to be
available when the function call returns. (if that's what you
mean here). That would just be WRONG.
HW offload doesn't work like that. Results come much later, and 
you need to keep dispatching more requests until the HW starts
asserting backpressure. You need to keep that HW pipeline filled.

> because for the hw accelerated case the costs are all elsewhere.
>
> But for synchronous encryption code on the CPU? Avoiding the
> indirection can be a huge win. Avoiding allocations, extra cachelines,
> all that overhead. Avoiding indirect calls entirely, because doing a
> few conditional branches (that will predict perfectly) on the state
> will be a lot more efficient, both in direct CPU cycles and in things
> like I$ etc.
>
Again, HW acceleration does not depend on the indirection _at all_,
that's there for entirely different purposes I explained above.
HW acceleration _does_ depend greatly on a truly async ifc though.
So queue requests on one side, handle results from the other side
in some callback func off of an interrupt handler. (with proper
interrupt coalescing, of course, and perhaps some NAPI-like 
functionality to further reduce interrupt rates when busy)

> In contrast, forcing people to use this model:
>
>        if (unlikely(crypto_aead_reqsize(key->tfm) > 0)) {
>                req = aead_request_alloc(key->tfm, GFP_ATOMIC);
>                if (!req)
>                        return false;
>        } else {
>                req = &stackreq;
>                aead_request_set_tfm(req, key->tfm);
>        }
>
Agree that is fishy, but it is something that could be fixed.

>        aead_request_set_ad(req, 0);
>
I'd rather see this being part of the set_crypt call as well.
I never said I liked _all_ decisions made in the API.
Likely this is because AEAD was added as an afterthought.

>        aead_request_set_callback(req, 0, NULL, NULL);
>
This is just inevitable for HW acceleration ...

>        aead_request_set_crypt(req, sg, sg, skb->len,
>                               (u8 *)&PACKET_CB(skb)->ivpad);
>        err = crypto_aead_decrypt(req);
>
It would probably be more efficient if set_crypt and _decrypt 
could be combined in a single call (together with _set_ad). 
No argument there and these decisions have _nothing_ to do
with being able to do HW acceleration or not.

Trust me, I have a whole list of things I don't like about the
API myself, it's not exactly ideal for HW acceleration either.
(Note that SW overhead actually matters _more_ when you do HW 
acceleration, as the HW is often so fast that the SW is the 
actual bottleneck!).

But I have faith that, over time, I may be able to get some
improvements in (which should improve both HW _and_ SW use
cases by the way). By working _with_ the Crypto API people
and being _patient_. Not by telling them they suck.

>        if (unlikely(req != &stackreq))
>                aead_request_free(req);
>        if (err)
>                return false;
>
> isn't going to help anybody. It sure as hell doesn't help the
> CPU-synchronous case, and see above: it doesn't even help the hw
> accelerated case. It could have had _all_ that "tfm" work behind a
> private pointer that the CPU case never touches except to see "ok,
> it's NULL, I don't have any".
>
> See?
>
Yes, I agree with the point you have here. So let's fix that.

> The interface *should* be that chacha20poly1305_decrypt_sg() library
> interface, just give it a private pointer it can use and update. Then,
> *internally* it can do something like
>
>      bool chacha20poly1305_decrypt_sg(...)
>      {
>              struct cc20p1305_ptr *state = *state_p;
>              if (state) {
>                      .. do basically the above complex thing ..
>                      return ret; .. or fall back to sw if the hw
> queues are full..
>              }
>              .. do the CPU only thing..
>      }
>
But even the CPU only thing may have several implementations, of which
you want to select the fastest one supported by the _detected_ CPU
features (i.e. SSE, AES-NI, AVX, AVX512, NEON, etc. etc.)
Do you think this would still be efficient if that would be some
large if-else tree? Also, such a fixed implementation wouldn't scale.

> and now you'd have no extra overhead for the no-hw-accel case, you'd
> have a more pleasant interface to use, and you could still use hw
> accel if you wanted to.
>
I still get the impression you're mostly interested in the "pleasant
interface" while I don't see why that should be more important than
being able to use HW crypto efficiently. That reminds me of way back
when I was a junior designer and some - more senior - software
engineer forced me to implement a full hardware divider(!) for some
parameter that needed to be set only once at initialization time, just
because he was too lazy to do a few simple precomputes in the driver.
He considered that to be "unpleasant" as well. 

> THIS is why I say that the crypto interface is bad. It was designed
> for the wrong objectives. It was designed purely for a SSL-like model
> where you do a complex key and algorithm exchange dance, and you
> simply don't know ahead of time what crypto you are even using.
>
I guess it was designed for that, sure. And that's how the IPsec stack
and dmcrypt (to name a few examples) need it. It's also how Wireguard
_will_ need it when we start adding more ciphersuites to Wireguard.
Which is a MUST anyway, if Wireguard wants to be taken seriously:
there MUST be a fallback ciphersuite. At least one. Just in case 
Chacha-Poly gets broken overnight somehow, in which case you need to
switch over instantly and can't wait for some new implementation.

If you really _don't_ need that, but just need bit of fixed algorithm
crypto, then by all means, don't go through the Crypto API. I've a
already argued that on many occasions. I think people like Ard are
_already_ working on doing such crypto calls directly.

> And yes, that "I'm doing the SSL thing" used to be a major use of
> encryption. I understand why it happened. It was what people did in
> the 90's. People thought it was a good idea back then, and it was also
> most of the hw acceleration world.
>
> And yes, in that model of "I don't have a clue of what crypto I'm even
> using" the model works fine. But honestly, if you can't admit to
> yourself that it's wrong for the case where you _do_ know the
> algorithm, you have some serious blinders on.
>
But the point is - there are those cases where you _don't_ know and
_that_ is what the Crypto API is for. And just generally, crypto
really _should_ be switchable. So you don't need to wait for a
fix to ripple through a kernel release cycle when an algorithm gets
broken. I don't know many use cases for just one fixed algorithm.

> Just from a user standpoint, do you seriously think users _like_
> having to do the above 15+ lines of code, vs the single function call?
>

I know I wouldn't. I also know I would do it anyway as I would 
understand _why_ I would be doing it. 

> The crypto interface really isn't pleasant, and you're wrong to
> believe that it really helps. The hw acceleration capability could
> have been abstracted away, instead of making that indirection be front
> and center.
>

Again, the Crypto API aims to do more than just allow for HW
acceleration and your main gripes actually seem to be with the
"other" stuff.

> And again - I do realize the historical reasons for it. But
> understanding that doesn't magically make it wonderful.
>

No one said it was wonderful. Or pleasant. Or perfect.
You definitely raised some points that I think _could_ be 
improved without compromising any functionality.
But some stuff you don't like just has good reasons to exist.
Reasons you may not agree with, but that doesn't make them invalid.

>                  Linus


Regards,
Pascal van Leeuwen
Silicon IP Architect, Multi-Protocol Engines @ Verimatrix
www.insidesecure.com
Linus Torvalds Sept. 27, 2019, 1:30 a.m. UTC | #8
On Thu, Sep 26, 2019 at 5:15 PM Pascal Van Leeuwen
<pvanleeuwen@verimatrix.com> wrote:
>

> That remark is just very stupid. The hardware ALREADY exists, and
> more hardware is in the pipeline. Once this stuff is designed in, it
> usually stays in for many years to come. And these are chips sold in
> _serious_ quantities, to be used in things like wireless routers and
> DSL, cable and FTTH modems, 5G base stations, etc. etc.


Yes, I very much mentioned routers. I believe those can happen much
more quickly.

But I would very much hope that that is not the only situation where
you'd see wireguard used.

I'd want to see wireguard in an end-to-end situation from the very
client hardware. So laptops, phones, desktops. Not the untrusted (to
me) hw in between.

> No, these are just the routers going into *everyone's* home. And 5G
> basestations arriving at every other street corner. I wouldn't call
> that rare, exactly.


That's fine for a corporate tunnel between devices. Which is certainly
one use case for wireguard.

But if you want VPN for your own needs for security, you want it at
the _client_. Not at the router box. So that case really does matter.

And I really don't see the hardware happening in that space. So the
bad crypto interfaces only make the client _worse_.

See?

But on to the arguments that we actually agree on:

> Hey, no argument there. I don't see any good reason why the key can't
> be on the stack. I doubt any hardware would be able to DMA that as-is
> directly, and in any case, key changes should be infrequent, so copying
> it to some DMA buffer should not be a performance problem.
> So maybe that's an area for improvement: allow that to be on the stack.


It's not even just the stack. It's really that the crypto interfaces
are *designed* so that you have to allocate things separately, and
can't embed these things in your own data structures.

And they are that way, because the crypto interfaces aren't actually
about (just) hiding the hardware interface: they are about hiding
_all_ the encryption details.

There's no way to say "hey, I know the crypto I use, I know the key
size I have, I know the state size it needs, I can preallocate those
AS PART of my own data structures".

Because the interface is designed to be so "generic" that you simply
can't do those things, they are all external allocations, which is
inevitably slower when you don't have hardware.

And you've shown that you don't care about that "don't have hardware"
situation, and seem to think it's the only case that matters. That's
your job, after all.

But however much you try to claim otherwise, there's all these
situations where the hardware just isn't there, and the crypto
interface just forces nasty overhead for absolutely no good reason.

> I already explained the reasons for _not_ doing direct calls above.


And I've tried to explain how direct calls that do the synchronous
thing efficiently would be possible, but then _if_ there is hardware,
they can then fall back to an async interface.

> > So there is absolutely NO DOWNSIDE for hw accelerated crypto to just
> > do it right, and use an interface like this:
> >
> >        if (!chacha20poly1305_decrypt_sg(sg, sg, skb->len, NULL, 0,
> >                                         PACKET_CB(skb)->nonce, key->key,
> >                                         simd_context))
> >                return false;
> >
> Well, for one thing, a HW API should not expect the result to be
> available when the function call returns. (if that's what you
> mean here). That would just be WRONG.


Right. But that also shouldn't mean that when you have synchronous
hardware (ie CPU) you have to set everything up even though it will
never be used.

Put another way: even with hardware acceleration, the queuing
interface should be a simple "do this" interface.

The current crypto interface is basically something that requires all
the setup up-front, whether it's needed or not. And it forces those
very inconvenient and slow external allocations.

And I'm saying that causes problems, because it fundamentally means
that you can't do a good job for the common CPU case, because you're
paying all those costs even when you need absolutely none of them.
Both at setup time, but also at run-time due to the extra indirection
and cache misses etc.

> Again, HW acceleration does not depend on the indirection _at all_,
> that's there for entirely different purposes I explained above.
> HW acceleration _does_ depend greatly on a truly async ifc though.


Can you realize that the world isn't just all hw acceleration?

Can you admit that the current crypto interface is just horrid for the
non-accelerated case?

Can you perhaps then also think that "maybe there are better models".

> So queue requests on one side, handle results from the other side
> in some callback func off of an interrupt handler.


Actually, what you can do - and what people *have* done - is to admit
that the synchronous case is real and important, and then design
interfaces that work for that one too.

You don't need to allocate resources ahead of time, and you don't have
to disallow just having the state buffer allocated by the caller.

So here's the *wrong* way to do it (and the way that crypto does it):

 - dynamically allocate buffers at "init time"

 - fill in "callback fields" etc before starting the crypto, whether
they are needed or not

 - call a "decrypt" function that then uses the indirect functions you
set up at init time, and possibly waits for it (or calls the callbacks
you set up)

note how it's all this "state machine" model where you add data to the
state machine, and at some point you say "execute" and then either you
wait for things or you get callbacks.

That makes sense for a hw crypto engine. It's how a lot of them work, after all.

But it makes _zero_ sense for the synchronous case. You did a lot of
extra work for that case, and because it was all a state machine, you
did it particularly inefficiently: not only do you have those separate
allocations with pointer following, the "decrypt()" call ends up doing
an indirect call to the CPU implementation, which is just quite slow
to begin with, particularly in this day and age with retpoline etc.

So what's the alternative?

I claim that a good interface would accept that "Oh, a lot of cases
will be synchronous, and a lot of cases use one fixed
encryption/decryption model".

And it's quite doable. Instead of having those callback fields and
indirection etc, you could have something more akin to this:

 - let the caller know what the state size is and allocate the
synchronous state in its own data structures

 - let the caller just call a static "decrypt_xyz()" function for xyz
decryption.

 - if you end up doing it synchronously, that function just returns
"done". No overhead. No extra allocations. No unnecessary stuff. Just
do it, using the buffers provided. End of story. Efficient and simple.

 - BUT.

 - any hardware could have registered itself for "I can do xyz", and
the decrypt_xyz() function would know about those, and *if* it has a
list of accelerators (hopefully sorted by preference etc), it would
try to use them. And if they take the job (they might not - maybe
their queues are full, maybe they don't have room for new keys at the
moment, which might be a separate setup from the queues), the
"decrypt_xyz()" function returns a _cookie_ for that job. It's
probably a pre-allocated one (the hw accelerator might preallocate a
fixed number of in-progress data structures).

And once you have that cookie, and you see "ok, I didn't get the
answer immediately" only THEN do you start filling in things like
callback stuff, or maybe you set up a wait-queue and start waiting for
it, or whatever".

See the difference in models? One forces that asynchronous model, and
actively penalizes the synchronous one.

The other _allows_ an asynchronous model, but is fine with a synchronous one.

> >        aead_request_set_callback(req, 0, NULL, NULL);
> >
> This is just inevitable for HW acceleration ...


See above. It really isn't. You could do it *after* the fact, when
you've gotten that ticket from the hardware. Then you say "ok, if the
ticket is done, use these callbacks". Or "I'll now wait for this
ticket to be done" (which is what the above does by setting the
callbacks to zero).

Wouldn't that be lovely for a user?

I suspect it would be a nice model for a hw accelerator too. If you
have full queues or have problems allocating new memory or whatever,
you just let the code fall back to the synchronous interface.

> Trust me, I have a whole list of things I don't like about the
> API myself, it's not exactly ideal for HW acceleration either.


That's the thing. It's actively detrimental for "I have no HW acceleration".

And apparently it's not optimal for you guys either.

> But the point is - there are those cases where you _don't_ know and
> _that_ is what the Crypto API is for. And just generally, crypto
> really _should_ be switchable.


It's very much not what wireguard does.

And honestly, most of the switchable ones have caused way more
security problems than they have "fixed" by being switchable.

                 Linus
Linus Torvalds Sept. 27, 2019, 2:06 a.m. UTC | #9
On Thu, Sep 26, 2019 at 5:15 PM Pascal Van Leeuwen
<pvanleeuwen@verimatrix.com> wrote:
>

> But even the CPU only thing may have several implementations, of which

> you want to select the fastest one supported by the _detected_ CPU

> features (i.e. SSE, AES-NI, AVX, AVX512, NEON, etc. etc.)

> Do you think this would still be efficient if that would be some

> large if-else tree? Also, such a fixed implementation wouldn't scale.


Just a note on this part.

Yes, with retpoline a large if-else tree is actually *way* better for
performance these days than even just one single indirect call. I
think the cross-over point is somewhere around 20 if-statements.

But those kinds of things also are things that we already handle well
with instruction rewriting, so they can actually have even less of an
overhead than a conditional branch. Using code like

  if (static_cpu_has(X86_FEATURE_AVX2))

actually ends up patching the code at run-time, so you end up having
just an unconditional branch. Exactly because CPU feature choices
often end up being in critical code-paths where you have
one-or-the-other kind of setup.

And yes, one of the big users of this is very much the crypto library code.

The code to do the above is disgusting, and when you look at the
generated code you see odd unreachable jumps and what looks like a
slow "bts" instruction that does the testing dynamically.

And then the kernel instruction stream gets rewritten fairly early
during the boot depending on the actual CPU capabilities, and the
dynamic tests get overwritten by a direct jump.

Admittedly I don't think the arm64 people go to quite those lengths,
but it certainly wouldn't be impossible there either.  It just takes a
bit of architecture knowledge and a strong stomach ;)

                 Linus
Linus Torvalds Sept. 27, 2019, 2:54 a.m. UTC | #10
On Thu, Sep 26, 2019 at 6:30 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>

> And once you have that cookie, and you see "ok, I didn't get the
> answer immediately" only THEN do you start filling in things like
> callback stuff, or maybe you set up a wait-queue and start waiting for
> it, or whatever".


Side note: almost nobody does this.

Almost every single async interface I've ever seen ends up being "only
designed for async".

And I think the reason is that everybody first does the simply
synchronous interfaces, and people start using those, and a lot of
people are perfectly happy with them. They are simple, and they work
fine for the huge majority of users.

And then somebody comes along and says "no, _we_ need to do this
asynchronously", and by definition that person does *not* care for the
synchronous case, since that interface already existed and was simpler
and already was mostly sufficient for the people who used it, and so
the async interface ends up being _only_ designed for the new async
workflow. Because that whole new world was written with just that case
in mind, and the synchronous case clearly didn't matter.

So then you end up with that kind of dichotomous situation, where you
have a strict black-and-white either-synchronous-or-async model.

And then some people - quite reasonably - just want the simplicity of
the synchronous code and it performs better for them because the
interfaces are simpler and better suited to their lack of extra work.

And other people feel they need the async code, because they can take
advantage of it.

And never the twain shall meet, because the async interface is
actively _bad_ for the people who have sync workloads and the sync
interface doesn't work for the async people.

Non-crypto example: [p]read() vs aio_read(). They do the same thing
(on a high level) apart from that sync/async issue. And there's no way
to get the best of both worlds.

Doing aio_read() on something that is already cached is actively much
worse than just doing a synchronous read() of cached data.

But aio_read() _can_ be much better if you know your workload doesn't
cache well and read() blocks too much for you.

There's no "read_potentially_async()" interface that just does the
synchronous read for any cached portion of the data, and then delays
just the IO parts and returns a "here, I gave you X bytes right now,
use this cookie to wait for the rest".

Maybe nobody would use it. But it really should be possible to have
interfaces where a good synchronous implementation is _possible_
without the extra overhead, while also allowing async implementations.

                Linus
Herbert Xu Sept. 27, 2019, 3:53 a.m. UTC | #11
On Thu, Sep 26, 2019 at 07:54:03PM -0700, Linus Torvalds wrote:
>

> Side note: almost nobody does this.
>
> Almost every single async interface I've ever seen ends up being "only
> designed for async".
>
> And I think the reason is that everybody first does the simply
> synchronous interfaces, and people start using those, and a lot of
> people are perfectly happy with them. They are simple, and they work
> fine for the huge majority of users.


The crypto API is not the way it is because of async.  In fact, the
crypto API started out as sync only and async was essentially
bolted on top with minimal changes.

The main reason why the crypto API contains indirections is because
of the algorithmic flexibility which WireGuard does not need.

Now whether algorithmic flexibility is a good thing or not is a
different discussion.  But the fact of the matter is that the
majority of heavy crypto users in our kernel do require this
flexibility (e.g., IPsec, dmcrypt, fscrypt).

I don't have a beef with the fact that WireGuard is tied to a
single algorithm.  However, that simply does not work for the
other users that we will have to continue to support.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Herbert Xu Sept. 27, 2019, 4:01 a.m. UTC | #12
On Thu, Sep 26, 2019 at 07:54:03PM -0700, Linus Torvalds wrote:
>

> There's no "read_potentially_async()" interface that just does the
> synchronous read for any cached portion of the data, and then delays
> just the IO parts and returns a "here, I gave you X bytes right now,
> use this cookie to wait for the rest".


Incidentally this is exactly how the crypto async interface works.
For example, the same call works whether you're sync or async:

	aead_request_set_callback(req, ...);
	aead_request_set_crypt(req, ...);
	err = crypto_aead_encrypt(req);
	if (err == -EINPROGRESS)
		/*
		 * Request is processed asynchronously.
		 * This cannot occur if the algorithm is sync,
		 * e.g., when you specifically allocated sync
		 * algorithms.
		 */
	else
		/* Request was processed synchronously */

We even allow the request to be on the stack in the sync case, e.g.,
with SYNC_SKCIPHER_REQUEST_ON_STACK.

> Maybe nobody would use it. But it really should be possible to have
> interfaces where a good synchronous implementation is _possible_
> without the extra overhead, while also allowing async implementations.


So there is really no async overhead in the crypto API AFAICS if
you're always doing sync.  What you see as overheads are probably
the result of having to support multiple underlying algorithms
(not just accelerations which can indeed be handled without
indirection at least for CPU-based ones).

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Linus Torvalds Sept. 27, 2019, 4:13 a.m. UTC | #13
On Thu, Sep 26, 2019 at 9:01 PM Herbert Xu <herbert@gondor.apana.org.au> wrote:
>

> So there is really no async overhead in the crypto API AFAICS if
> you're always doing sync.  What you see as overheads are probably
> the result of having to support multiple underlying algorithms
> (not just accelerations which can indeed be handled without
> indirection at least for CPU-based ones).


Fair enough, and sounds good. The biggest overhead is that indirection
for the state data, and the fact that the code indirect calls the
actual function.

If that could be avoided by just statically saying

     crypto_xyz_encrypt()

(with the xyz being the crypto algorithm you want) and having the
state be explicit, then yes, that would remove most of the overhead.

It would still leave setting the callback fields etc that are
unnecessary for the synchronous case and that I think could be done
differently, but that's probably just a couple of stores, so not
particularly noticeable.

              Linus
Andy Lutomirski Sept. 27, 2019, 4:36 a.m. UTC | #14
> On Sep 26, 2019, at 6:38 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> - let the caller know what the state size is and allocate the
> synchronous state in its own data structures
>
> - let the caller just call a static "decrypt_xyz()" function for xyz
> decryption.
>
> - if you end up doing it synchronously, that function just returns
> "done". No overhead. No extra allocations. No unnecessary stuff. Just
> do it, using the buffers provided. End of story. Efficient and simple.
>
> - BUT.
>
> - any hardware could have registered itself for "I can do xyz", and
> the decrypt_xyz() function would know about those, and *if* it has a
> list of accelerators (hopefully sorted by preference etc), it would
> try to use them. And if they take the job (they might not - maybe
> their queues are full, maybe they don't have room for new keys at the
> moment, which might be a separate setup from the queues), the
> "decrypt_xyz()" function returns a _cookie_ for that job. It's
> probably a pre-allocated one (the hw accelerator might preallocate a
> fixed number of in-progress data structures).


To really do this right, I think this doesn't go far enough.  Suppose
I'm trying to implement send() over a VPN very efficiently.  I want to
do, roughly, this:

void __user *buf, etc;

if (crypto api thinks async is good) {
  copy buf to some kernel memory;
  set up a scatterlist;
  do it async with this callback;
} else {
  do the crypto synchronously, from *user* memory, straight to kernel memory;
  (or, if that's too complicated, maybe copy in little chunks to a
   little stack buffer; setting up a scatterlist is a waste of time.)
}

I don't know if the network code is structured in a way to make this
work easily, and the API would be more complex, but it could be nice
and fast.
Andy Lutomirski Sept. 27, 2019, 4:37 a.m. UTC | #15
On Thu, Sep 26, 2019 at 8:54 PM Herbert Xu <herbert@gondor.apana.org.au> wrote:
>

> On Thu, Sep 26, 2019 at 07:54:03PM -0700, Linus Torvalds wrote:
> >
> > Side note: almost nobody does this.
> >
> > Almost every single async interface I've ever seen ends up being "only
> > designed for async".
> >
> > And I think the reason is that everybody first does the simply
> > synchronous interfaces, and people start using those, and a lot of
> > people are perfectly happy with them. They are simple, and they work
> > fine for the huge majority of users.
>
> The crypto API is not the way it is because of async.  In fact, the
> crypto API started out as sync only and async was essentially
> bolted on top with minimal changes.


Then what's up with the insistence on using physical addresses for so
many of the buffers?

--Andy
Herbert Xu Sept. 27, 2019, 4:59 a.m. UTC | #16
On Thu, Sep 26, 2019 at 09:37:16PM -0700, Andy Lutomirski wrote:
>

> Then what's up with the insistence on using physical addresses for so
> many of the buffers?


This happens to be what async hardware wants, but the main reason
why the crypto API has them is because that's what the network
stack feeds us.  The crypto API was first created purely for IPsec
so the SG lists are intimately tied with how skbs were constructed.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Pascal Van Leeuwen Sept. 27, 2019, 9:58 a.m. UTC | #17
> > That remark is just very stupid. The hardware ALREADY exists, and
> > more hardware is in the pipeline. Once this stuff is designed in, it
> > usually stays in for many years to come. And these are chips sold in
> > _serious_ quantities, to be used in things like wireless routers and
> > DSL, cable and FTTH modems, 5G base stations, etc. etc.
>
> Yes, I very much mentioned routers. I believe those can happen much
> more quickly.
>
> But I would very much hope that that is not the only situation where
> you'd see wireguard used.
>

Same here

> I'd want to see wireguard in an end-to-end situation from the very
> client hardware. So laptops, phones, desktops. Not the untrusted (to
> me) hw in between.
>

I don't see why the crypto HW would deserve any less trust than, say,
the CPU itself. I would say CPUs don't deserve that trust at the moment.

> > No, these are just the routers going into *everyone's* home. And 5G
> > basestations arriving at every other street corner. I wouldn't call
> > that rare, exactly.
>
> That's fine for a corporate tunnel between devices. Which is certainly
> one use case for wireguard.
>
> But if you want VPN for your own needs for security, you want it at
> the _client_. Not at the router box. So that case really does matter.
>

Personally, I would really like it in my router box so my CPU is free
to do useful work instead of boring crypto. And I know there's nothing
untrusted in between my client and the router box, so I don't need to
worry about security there. But hey, that's just me.

> And I really don't see the hardware happening in that space. So the
> bad crypto interfaces only make the client _worse_.
>

Fully agree. We don't focus on the client side with our HW anyway.
(but then there may be that router box in between that can help out)

> See?
>
> But on to the arguments that we actually agree on:
>

> > Hey, no argument there. I don't see any good reason why the key can't
> > be on the stack. I doubt any hardware would be able to DMA that as-is
> > directly, and in any case, key changes should be infrequent, so copying
> > it to some DMA buffer should not be a performance problem.
> > So maybe that's an area for improvement: allow that to be on the stack.
>
> It's not even just the stack. It's really that the crypto interfaces
> are *designed* so that you have to allocate things separately, and
> can't embed these things in your own data structures.
>
> And they are that way, because the crypto interfaces aren't actually
> about (just) hiding the hardware interface: they are about hiding
> _all_ the encryption details.
>

Well, that's the general idea of abstraction. It also allows for 
swapping in any other cipher with minimal effort just _because_ the 
details were hidden from the application. So it may cost you some 
effort initially, but it may save you effort later.

> There's no way to say "hey, I know the crypto I use, I know the key
> size I have, I know the state size it needs, I can preallocate those
> AS PART of my own data structures".
>
> Because the interface is designed to be so "generic" that you simply
> can't do those things, they are all external allocations, which is
> inevitably slower when you don't have hardware.
>

Hmm, Ok, I see your point here. But most of those data structures 
(like the key) should be allocated infrequently anyway, so you can
amortize that cost over _many_ crypto operations.

You _do_ realize that performing the key schedule for e.g. AES with
AES-NI also takes quite a lot of time? So you should keep your keys
alive and not reload them all the time anyway.

But I already agreed with you that there may be cases where you just
want to call the library function directly. Wireguard just isn't one
of those cases, IMHO.

> And you've shown that you don't care about that "don't have hardware"
> situation, and seem to think it's the only case that matters. That's
> your job, after all.
>

I don't recall putting it that strongly ... and I certainly never said
the HW acceleration thing is the _only_ case that matters. But it does
matter _significantly_ to me, for blatantly obvious reasons.

> But however much you try to claim otherwise, there's all these
> situations where the hardware just isn't there, and the crypto
> interface just forces nasty overhead for absolutely no good reason.
>
> > I already explained the reasons for _not_ doing direct calls above.
>
> And I've tried to explain how direct calls that do the synchronous
> thing efficiently would be possible, but then _if_ there is hardware,
> they can then fall back to an async interface.
>

OK, I did not fully get that latter part. I would be fine with such an
approach for use cases (i.e. fixed, known crypto) where that makes sense.
It would actually be better than calling the SW-only library directly
(which was my suggestion) as it would still allow HW acceleration as
an option ... 

> > > So there is absolutely NO DOWNSIDE for hw accelerated crypto to just
> > > do it right, and use an interface like this:
> > >
> > >        if (!chacha20poly1305_decrypt_sg(sg, sg, skb->len, NULL, 0,
> > >                                         PACKET_CB(skb)->nonce, key->key,
> > >                                         simd_context))
> > >                return false;
> > >
> > Well, for one thing, a HW API should not expect the result to be
> > available when the function call returns. (if that's what you
> > mean here). That would just be WRONG.
>
> Right. But that also shouldn't mean that when you have synchronous
> hardware (ie CPU) you have to set everything up even though it will
> never be used.
>
> Put another way: even with hardware acceleration, the queuing
> interface should be a simple "do this" interface.
>

OK, I don't think we disagree there. I _like_ simple. As long as it
doesn't sacrifice functionality I care about.

> The current crypto interface is basically something that requires all
> the setup up-front, whether it's needed or not. And it forces those
> very inconvenient and slow external allocations.
>

But you should do the setup work (if by "setup" you mean things like
cipher allocation, key setup and request allocation) only _once_ in a 
_long_ while. You can just keep using it for the lifetime of the 
application (or key, for the key setup part).

If I look at my cipher fallback paths in the driver (the only places
where I actually get to _use_ the API from the "other" side), per 
actual individual request they _only_ do - the rest is all preallocated
earlier:

_set_callback()
_set_crypt()
_set_ad()
_encrypt() or _decrypt()

And now that I look at that, I think the _set_callback()  could
move to the setup phase as it's always the same callback function.
Probably, in case of Wireguard, you could even move the _set_ad()
there as it's always zero and the crypto driver is not allowed
to overwrite it in the request struct anyway.

Also, I already agreed with you that _set_crypt(), _set_ad()
and _encrypt()/_decrypt() _could_ be conveniently wrapped into
one API call instead of 3 separate ones if we think that's worth it.

BUT ... actually ... I just looked at the actual _implementation_
and it turns out these are _inlineable_ functions defined in the
header file that _just_ write to some struct fields. So they 
should not end up being function calls at all(!!).
_Only_ the _encrypt()/_decrypt() invocation will end up with a
true (indirect) function call.

So where are all those allocations you mention? Have you ever
actually _used_ the Crypto API for anything?

Yes, if you actually want to _queue_ requests you need to use one
request struct for every queued operation, but you could just
preallocate an array of them that you cycle through. No need to do
those allocations in the hot path.

So is your problem really with the API _itself_ or with incorrect/
inefficient _use_ of the API in some places?

> And I'm saying that causes problems, because it fundamentally means
> that you can't do a good job for the common CPU case, because you're
> paying all those costs even when you need absolutely none of them.
> Both at setup time, but also at run-time due to the extra indirection
> and cache misses etc.
>

There is some cost sure, but is it _significant_ for any use case that
_matters_? You started bringing up optimization rules, so how about
Amdahl's law?

> > Again, HW acceleration does not depend on the indirection _at all_,
> > that's there for entirely different purposes I explained above.
> > HW acceleration _does_ depend greatly on a truly async ifc though.
>
> Can you realize that the world isn't just all hw acceleration?
>

Sure. But there's also a lot of HW acceleration _already out there_
that _could_ have been used if only the proper SW API's had existed.
Welcome to _my_ world.

> Can you admit that the current crypto interface is just horrid for the

> non-accelerated case?

> 

Is agreeing that it is not perfect sufficient for you? :-)

> Can you perhaps then also think that "maybe there are better models".

> 

Sure. There's always better. There's also good enough though ...

> > So queue requests on one side, handle results from the other side

> > in some callback func off of an interrupt handler.

> 

> Actually, what you can do - and what people *have* done - is to admit

> that the synchronous case is real and important, and then design

> interfaces that work for that one too.

> 

But they _do_ work for that case as well. I still haven't seen any
solid evidence that they are as horribly inefficient as you are 
implying for _real life_ use cases. And even if they are, then there's
the question whether that is the fault of the API or incorrect use 
thereof.

> You don't need to allocate resources ahead of time, and you don't have

> to disallow just having the state buffer allocated by the caller.

> 

> So here's the *wrong* way to do it (and the way that crypto does it):

> 

>  - dynamically allocate buffers at "init time"

> 

Why is that so "wrong"? It sure beats doing allocations on the hot path.
But yes, some stuff should be allowed to live on the stack. Some other
stuff can't be on the stack though, as that's gone when the calling
function exits while the background crypto processing still needs it.

And you don't want to have it on the stack initially and then have
to _copy_ it to some DMA-able location that you allocate on the fly
on the hot path if you _do_ want HW acceleration.

>  - fill in "callback fields" etc before starting the crypto, whether

> they are needed or not

> 

I think this can be done _once_ at request allocation time.
But it's just one function pointer write anyway. Is that significant? 
Or: _if_ that is significant, you shouldn't be using the Crypto API for
that use case in the first place.

>  - call a "decrypt" function that then uses the indirect functions you

> set up at init time, and possibly waits for it (or calls the callbacks

> you set up)

> 

> note how it's all this "state machine" model where you add data to the

> state machine, and at some point you say "execute" and then either you

> wait for things or you get callbacks.

> 

Not sure how splitting data setup over a few separate "function" calls
suddenly makes it a "state machine model" ...

But yes, I can understand why the completion handling through this
callback function seems like unnecessary complication for the SW case.

> That makes sense for a hw crypto engine. It's how a lot of them work, after all.

> 

Oh really?

Can't speak for other people stuff, but for our hardware you post a
request to it and then go off do other stuff while the HW does its thing
after which it will inform you it's done by means of an interrupt.
I don't see how this relates to the "state machine model" above; there
is no persistent state involved, it's all included in the request.
The _only_ thing that matters is that you realize it's a pipeline that
needs to be kept filled and has latency >> throughput, just like your 
CPU pipeline.

> But it makes _zero_ sense for the synchronous case. You did a lot of

> extra work for that case, and because it was all a state machine, you

> did it particularly inefficiently: not only do you have those separate

> allocations with pointer following, the "decrypt()" call ends up doing

> an indirect call to the CPU implementation, which is just quite slow

> to begin with, particularly in this day and age with retpoline etc.

> 

> So what's the alternative?

> 

> I claim that a good interface would accept that "Oh, a lot of cases

> will be synchronous, and a lot of cases use one fixed

> encryption/decryption model".

> 

> And it's quite doable. Instead of having those callback fields and

> indirection etc, you could have something more akin to this:

> 

>  - let the caller know what the state size is and allocate the

> synchronous state in its own data structures

> 

>  - let the caller just call a static "decrypt_xyz()" function for xyz

> decryption.

> 

Fine for those few cases where the algorithm is known and fixed.
(You do realize that the primary use cases are IPsec, dmcrypt and
fscrypt where that is most definitely _not_ the case?)

Also, you're still ignoring the fact that there is not one, single,
optimal, CPU implementation either. You have to select that as well,
based on CPU features. So it's either an indirect function call that
would be well predictable - as it's always the same at that point in
the program - or it's a deep if-else tree (which might actually be
implemented by the compiler as an indirect (table) jump ...) 
selecting the fastest implementation, either SW _or_ HW.

>  - if you end up doing it synchronously, that function just returns

> "done". No overhead. No extra allocations. No unnecessary stuff. Just

> do it, using the buffers provided. End of story. Efficient and simple.

> 

I don't see which "extra allocations" you would be saving here.
Those shouldn't happen in the hot path either way.

>  - BUT.

> 

>  - any hardware could have registered itself for "I can do xyz", and

> the decrypt_xyz() function would know about those, and *if* it has a

> list of accelerators (hopefully sorted by preference etc), it would

> try to use them. And if they take the job (they might not - maybe

> their queues are full, maybe they don't have room for new keys at the

> moment, which might be a separate setup from the queues), the

> "decrypt_xyz()" function returns a _cookie_ for that job. It's

> probably a pre-allocated one (the hw accelerator might preallocate a

> fixed number of in-progress data structures).

> 

> And once you have that cookie, and you see "ok, I didn't get the

> answer immediately" only THEN do you start filling in things like

> callback stuff, or maybe you set up a wait-queue and start waiting for

> it, or whatever".

> 

I don't see the point of saving that single callback pointer write.
I mean, it's just _one_ CPU word memory write. Likely to the L1 cache.

But I can see the appeal of getting a "done" response on the _encrypt()/
_decrypt() call and then being able to immediately continue processing
the result data and having the async response handling separated off. 

I think it should actually be possible to change the API to work like
that without breaking backward compatibility, i.e. define some flag
specifying you actually _want_ this behavior and then define some
return code that says "I'm done processing, carry on please".

> See the difference in models? One forces that asynchronous model, and

> actively penalizes the synchronous one.

> 

> The other _allows_ an asynchronous model, but is fine with a synchronous one.

> 

> > >        aead_request_set_callback(req, 0, NULL, NULL);

> > >

> > This is just inevitable for HW acceration ...

> 

> See above. It really isn't. You could do it *after* the fact,

>

Before ... after ... the point was you need it. And it's a totally
insignificant saving anyway.

> when

> you've gotten that ticket from the hardware. Then you say "ok, if the

> ticket is done, use these callbacks". Or "I'll now wait for this

> ticket to be done" (which is what the above does by setting the

> callbacks to zero).

> 

> Wouldn't that be lovely for a user?

> 

Yes and no.
Because the user would _still_ need to handle the case of callbacks.
In case the request _does_ go to the HW accelerator.

So you keep the main processing path clean I suppose, saving some
cycles there, but you still have this case of callbacks and having
multiple requests queued you need to handle as well. Which now
becomes a separate _exception_ case. You now have two distinct
processing paths you have to manage from your application.
How is that an _improvement_ for the user? (notwithstanding that
it may be an improvement to SW-only performance)

> I suspect it would be a nice model for a hw accelerator too. If you

> have full queues or have problems allocating new memory or whatever,

> you just let the code fall back to the synchronous interface.

> 

HW drivers typically _do_ use SW fallback for cases they cannot
handle. Actually, that works very nicely with the current API,
with the fallback cipher just being attached to the original
request's callback function ... i.e. just do a tail call to
the fallback cipher request.

> > Trust me, I have whole list of things I don't like about the

> > API myself, it's not exacty ideal for HW acceleration  either.

> 

> That's the thing. It's actively detrimental for "I have no HW acceleration".

> 

You keep asserting that with no evidence whatsoever.

> And apparently it's not optimal for you guys either.

> 

True, but I accept the fact that it needs to be that way because some
_other_ HW may drive that requirement. I accept the fact that I'm not
alone in the world.

> > But the point is - there are those case where you _don't_ know and

> > _that_ is what the Crypto API is for. And just generally, crypto

> > really _should_ be switchable.

> 

> It's very much not what wireguard does.

> 

And that's very much a part of Wireguard that is _broken_. I like
Wireguard for a lot of things, but its single-cipher focus is not
one of them. Especially since all crypto it uses comes from a single
source (DJB), which is frowned upon in the industry.

Crypto agility is a very important _security_ feature and the whole
argument Jason makes that it is actually a weakness is _bullshit_.
(Just because SSL _implemented_ this horribly wrong doesn't mean 
it's a bad thing to do - it's not, it's actually _necessary_. As 
the alternative would be to either continue using broken crypto
or wait _months_ for a new implementation to reach your devices
when the crypto gets broken somehow. Not good.)

> And honestly, most of the switchable ones have caused way more

> security problems than they have "fixed" by being switchable.

> 

"most of the switchable ones"
You mean _just_ SSL/TLS. SSL/TLS before 1.3 just sucked security-wise,
on so many levels. That has _nothing_ to do with the very
desirable feature of crypto agility. It _can_ be done properly and 
securely. (for one thing, it does not _need_ to be negotiable)

>                  Linus


Regards,
Pascal van Leeuwen
Silicon IP Architect, Multi-Protocol Engines @ Verimatrix
www.insidesecure.com
Herbert Xu Sept. 27, 2019, 10:11 a.m. UTC | #18
On Fri, Sep 27, 2019 at 09:58:14AM +0000, Pascal Van Leeuwen wrote:
>

> But I can see the appeal of getting a "done" response on the _encrypt()/

> _decrypt() call and then being able to immediately continue processing

> the result data and having the async response handling separated off. 


This is how it works today.  If your request can be fulfilled
right away, you will get a return value other than EINPROGRESS
and you just carry on, the completion callback never happens in
this case.

aesni-intel makes heavy use of this.  In most cases it is sync.
It only goes async when the FPU is not available.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Pascal Van Leeuwen Sept. 27, 2019, 10:11 a.m. UTC | #19
> -----Original Message-----

> From: Linus Torvalds <torvalds@linux-foundation.org>

> Sent: Friday, September 27, 2019 4:06 AM

> To: Pascal Van Leeuwen <pvanleeuwen@verimatrix.com>

> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>; Linux Crypto Mailing List <linux-

> crypto@vger.kernel.org>; Linux ARM <linux-arm-kernel@lists.infradead.org>; Herbert Xu

> <herbert@gondor.apana.org.au>; David Miller <davem@davemloft.net>; Greg KH

> <gregkh@linuxfoundation.org>; Jason A . Donenfeld <Jason@zx2c4.com>; Samuel Neves

> <sneves@dei.uc.pt>; Dan Carpenter <dan.carpenter@oracle.com>; Arnd Bergmann

> <arnd@arndb.de>; Eric Biggers <ebiggers@google.com>; Andy Lutomirski <luto@kernel.org>;

> Will Deacon <will@kernel.org>; Marc Zyngier <maz@kernel.org>; Catalin Marinas

> <catalin.marinas@arm.com>

> Subject: Re: [RFC PATCH 18/18] net: wireguard - switch to crypto API for packet

> encryption

> 

> On Thu, Sep 26, 2019 at 5:15 PM Pascal Van Leeuwen

> <pvanleeuwen@verimatrix.com> wrote:

> >

> > But even the CPU only thing may have several implementations, of which

> > you want to select the fastest one supported by the _detected_ CPU

> > features (i.e. SSE, AES-NI, AVX, AVX512, NEON, etc. etc.)

> > Do you think this would still be efficient if that would be some

> > large if-else tree? Also, such a fixed implementation wouldn't scale.

> 

> Just a note on this part.

> 

> Yes, with retpoline a large if-else tree is actually *way* better for

> performance these days than even just one single indirect call. I

> think the cross-over point is somewhere around 20 if-statements.

> 

Yikes, that is just _horrible_ :-(

_However_ there's many CPU architectures out there that _don't_ need
the retpoline mitigation and would be unfairly penalized by the deep
if-else tree (as opposed to the indirect branch) for a problem they
did not cause in the first place.

Wouldn't it be more fair to impose the penalty on the CPU's actually
_causing_ this problem? Also because those are generally the more 
powerful CPU's anyway, that would suffer the least from the overhead?

> But those kinds of things also are things that we already handle well

> with instruction rewriting, so they can actually have even less of an

> overhead than a conditional branch. Using code like

> 

>   if (static_cpu_has(X86_FEATURE_AVX2))

> 

> actually ends up patching the code at run-time, so you end up having

> just an unconditional branch. Exactly because CPU feature choices

> often end up being in critical code-paths where you have

> one-or-the-other kind of setup.

> 

> And yes, one of the big users of this is very much the crypto library code.

> 

Ok, I didn't know about that. So I suppose we could have something
like if (static_soc_has(HW_CRYPTO_ACCELERATOR_XYZ)) ... Hmmm ...

> The code to do the above is disgusting, and when you look at the

> generated code you see odd unreachable jumps and what looks like a

> slow "bts" instruction that does the testing dynamically.

> 

> And then the kernel instruction stream gets rewritten fairly early

> during the boot depending on the actual CPU capabilities, and the

> dynamic tests get overwritten by a direct jump.

> 

> Admittedly I don't think the arm64 people go to quite those lengths,

> but it certainly wouldn't be impossible there either.  It just takes a

> bit of architecture knowledge and a strong stomach ;)

> 

>                  Linus


Regards,
Pascal van Leeuwen
Silicon IP Architect, Multi-Protocol Engines @ Verimatrix
www.insidesecure.com
Pascal Van Leeuwen Sept. 27, 2019, 10:44 a.m. UTC | #20
> -----Original Message-----
> From: Linus Torvalds <torvalds@linux-foundation.org>
> Sent: Friday, September 27, 2019 4:54 AM
> To: Pascal Van Leeuwen <pvanleeuwen@verimatrix.com>
> Cc: [recipients list trimmed]
> Subject: Re: [RFC PATCH 18/18] net: wireguard - switch to crypto API for packet encryption

> 

> On Thu, Sep 26, 2019 at 6:30 PM Linus Torvalds

> <torvalds@linux-foundation.org> wrote:

> >

> > And once you have that cookie, and you see "ok, I didn't get the

> > answer immediately" only THEN do you start filling in things like

> > callback stuff, or maybe you set up a wait-queue and start waiting for

> > it, or whatever".

> 

> Side note: almost nobody does this.

> 

> Almost every single async interface I've ever seen ends up being "only

> designed for async".

> 

> And I think the reason is that everybody first does the simply

> synchronous interfaces, and people start using those, and a lot of

> people are perfectly happy with them. They are simple, and they work

> fine for the huge majority of users.

> 

> And then somebody comes along and says "no, _we_ need to do this

> asynchronously", and by definition that person does *not* care for the

> synchronous case, since that interface already existed and was simpler

> and already was mostly sufficient for the people who used it, and so

> the async interface ends up being _only_ designed for the new async

> workflow. Because that whole new world was written with just that case

> in mind, and the synchronous case clearly didn't matter.

> 

> So then you end up with that kind of dichotomous situation, where you

> have a strict black-and-white either-synchronous-or-async model.

> 

> And then some people - quite reasonably - just want the simplicity of

> the synchronous code and it performs better for them because the

> interfaces are simpler and better suited to their lack of extra work.

> 

> And other people feel they need the async code, because they can take

> advantage of it.

> 

> And never the twain shall meet, because the async interface is

> actively _bad_ for the people who have sync workloads and the sync

> interface doesn't work for the async people.

> 

> Non-crypto example: [p]read() vs aio_read(). They do the same thing

> (on a high level) apart from that sync/async issue. And there's no way

> to get the best of both worlds.

> 

> Doing aio_read() on something that is already cached is actively much

> worse than just doing a synchronous read() of cached data.

> 

> But aio_read() _can_ be much better if you know your workload doesn't

> cache well and read() blocks too much for you.

> 

> There's no "read_potentially_async()" interface that just does the

> synchronous read for any cached portion of the data, and then delays

> just the IO parts and returns a "here, I gave you X bytes right now,

> use this cookie to wait for the rest".

> 

> Maybe nobody would use it. But it really should be possible to have

> interfaces where a good synchronous implementation is _possible_

> without the extra overhead, while also allowing async implementations.

> 

That's the question. I've never seen such an API yet ...

You could also just accept that those are two wildly different use 
cases with wildly different requirements and allow them to coexist,
while sharing as much of the low-level SW implementation code as
possible underneath. With the async API only used for those cases
where HW acceleration can make the difference.

I believe for hashes, the Crypto API still maintains an shash and
an ahash API. It works the other way around from how you would
like  to see though, with ahash wrapping the shash in case of SW 
implementations. Still, if you're sure you can't benefit from HW 
acceleration you have the option of using the shash directly.

I don't know why the synchronous blkcipher API was deprecated,
that happened before I joined. IMHO it would make sense to have,
so users not interested in HW crypto are not burdened by it.


>                 Linus


Regards,
Pascal van Leeuwen
Silicon IP Architect, Multi-Protocol Engines @ Verimatrix
www.insidesecure.com
Pascal Van Leeuwen Sept. 27, 2019, 11:08 a.m. UTC | #21
> -----Original Message-----
> From: linux-crypto-owner@vger.kernel.org On Behalf Of Pascal Van Leeuwen
> Sent: Friday, September 27, 2019 12:44 PM
> To: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: [recipients list trimmed]
> Subject: RE: [RFC PATCH 18/18] net: wireguard - switch to crypto API for packet encryption

> 

> > [snip]

> That's the question. I've never seen such an API yet ...

> 

> You could also just accept that those are two wildly different use

> cases with wildly different requirements and allow them to coexist,

> while  sharing as much of the low-level SW implementation code as

> possible underneath. With the async API only used for those cases

> where HW acceleration can make the difference.

> 

> I believe for hashes, the Crypto API still maintains an shash and

> an ahash API. It works the other way around from how you would

> like  to see though, with ahash wrapping the shash in case of SW

> implementations. Still, if you're sure you can't benefit from HW

> acceleration you have the option of using the shash directly.

> 

> I don't know why the synchronous blkcipher API was deprecated,

> that happened before I joined. IMHO it would make sense to have,

> so users not interested in HW crypto are not burdened by it.

> 

> 

Never mind. From what I just learned, you can achieve the same 
thing with the skcipher API by just requesting a sync implementation,
which would allow you to put your structs on the stack and would
not return from the encrypt()/decrypt() call until actually done.

Regards,
Pascal van Leeuwen
Silicon IP Architect, Multi-Protocol Engines @ Verimatrix
www.insidesecure.com
Linus Torvalds Sept. 27, 2019, 4:23 p.m. UTC | #22
On Fri, Sep 27, 2019 at 2:58 AM Pascal Van Leeuwen
<pvanleeuwen@verimatrix.com> wrote:
>

> > I'd want to see wireguard in an end-to-end situation from the very

> > client hardware. So laptops, phones, desktops. Not the untrusted (to

> > me) hw in between.

> >

> I don't see why the crypto HW would deserve any less trust than, say,

> the CPU itself. I would say CPU's don't deserve that trust at the moment.


It's not the crypto engine that is part of the untrusted hardware.
It's the box itself, and the manufacturer, and you having to trust
that the manufacturer didn't set up some magic knocking sequence to
disable the encryption.

Maybe the company that makes them is trying to do a good job. But
maybe they are based in a country that has laws that require
backdoors.

Say, France. There's a long long history of that kind of thing.

It's all to "fight terrorism", but hey, a little industrial espionage
is good too, isn't it? So let's just disable GSM encryption based on
geographic locale and local regulation, shall we.

Yeah, yeah, GSM encryption wasn't all that strong to begin with, but
it was apparently strong enough that France didn't want it.

So tell me again why I should trust that box that I have no control over?

> Well, that's the general idea of abstraction. It also allows for

> swapping in any other cipher with minimal effort just _because_ the

> details were hidden from the application. So it may cost you some

> effort initially, but it may save you effort later.


We clearly disagree on the utility of crypto agility. You point to
things like ipsec as an argument for it.

And I point to ipsec as an argument *against* that horror. It's a
bloated, inefficient, horribly complex mess. And all the "agility" is
very much part of it.

I also point to GSM as a reason against "agility". It has caused way
more security problems than it has ever solved. The "agility" is
often a way to turn off (or tune down) the encryption, not as a way to
say "ok, we can improve it later".

That "we can improve it later" is a bedtime story. It's not how it
gets used. Particularly as the weaknesses are often not primarily in
the crypto algorithm itself, but in how it gets used or other session
details.

When you actually want to *improve* security, you throw the old code
away, and start a new protocol entirely. Eg SSL -> TLS.

So cryptographic agility is way oversold, and often people are
actively lying about why they want it. And the people who aren't lying
are ignoring the costs.

One of the reasons _I_ like wireguard is that it just went for simple
and secure. No BS.

And you say

> Especially since all crypto it uses comes from a single

> source (DJB), which is frowned upon in the industry.


I'm perhaps not a fan of DJB in all respects, but there's no question
that he's at least competent.

The "industry practice" of having committees influenced by who knows
what isn't all that much better. Do you want to talk about NSA
elliptic curve constant choices?

Anyway, on the costs:

> >  - dynamically allocate buffers at "init time"

>

> Why is that so "wrong"? It sure beats doing allocations on the hot path.


It's wrong not because the allocation is costly (you do that only
once), but because the dynamic allocation means that you can't embed
stuff in your own native data structures as a user.

So now accessing those things is no longer dense in the cache.

And it's the cache that matters for a synchronous CPU algorithm. You
don't want the keys and state to be in some other location when you
already have your data structures for the stream that could just have
them right there with the other data.

> And you don't want to have it on the stack initially and then have

> to _copy_ it to some DMA-able location that you allocate on the fly

> on the hot path if you _do_ want HW acceleration.


Actually, that's *exactly* what you want. You want keys etc to be in
regular memory in a location that is convenient to the user, and then
only if the hardware has issues do you say "ok, copy the key to the
hardware". Because quite often the hardware will have very special key
caches that aren't even available to the CPU, because they are on some
hw-private buffers.

Yes, you want to have a "key identity" model so that the hardware
doesn't have to reload it all the time, but that's an invalidation
protocol, not a "put the keys or nonces in special places".

               Linus
Marc Gonzalez Sept. 30, 2019, 11:14 a.m. UTC | #23
[ Trimming recipients list ]

On 27/09/2019 18:23, Linus Torvalds wrote:

> It's not the crypto engine that is part of the untrusted hardware.

> It's the box itself, and the manufacturer, and you having to trust

> that the manufacturer didn't set up some magic knocking sequence to

> disable the encryption.

> 

> Maybe the company that makes them is trying to do a good job. But

> maybe they are based in a country that has laws that require

> backdoors.

> 

> Say, France. There's a long long history of that kind of thing.

> 

> It's all to "fight terrorism", but hey, a little industrial espionage

> is good too, isn't it? So let's just disable GSM encryption based on

> geographic locale and local regulation, shall we.

> 

> Yeah, yeah, GSM encryption wasn't all that strong to begin with, but

> it was apparently strong enough that France didn't want it.


Two statements above have raised at least one of my eyebrows.

1) France has laws that require backdoors.

2) France did not want GSM encryption.


The following article claims that it was the British who demanded that
A5/1 be weakened (not the algorithm, just the key size; which is what
the USgov did in the 90s).

https://www.aftenposten.no/verden/i/Olkl/Sources-We-were-pressured-to-weaken-the-mobile-security-in-the-80s


Additional references for myself

https://lwn.net/Articles/368861/
https://en.wikipedia.org/wiki/Export_of_cryptography_from_the_United_States
https://gsmmap.org/assets/pdfs/gsmmap.org-country_report-France-2017-06.pdf
https://gsmmap.org/assets/pdfs/gsmmap.org-country_report-France-2018-06.pdf
https://gsmmap.org/assets/pdfs/gsmmap.org-country_report-France-2019-08.pdf


As for your first claim, can you provide more information, so that I could
locate the law(s) in question? (Year the law was discussed, for example.)

I've seen a few propositions ("projet de loi") but none(?) have made it into
actual law, as far as I'm aware.

https://www.nextinpact.com/news/98039-loi-numerique-nkm-veut-backdoor-dans-chaque-materiel.htm
https://www.nextinpact.com/news/107546-lamendement-anti-huawei-porte-pour-backdoors-renseignement-francais.htm

Regards.
Pascal Van Leeuwen Sept. 30, 2019, 8:44 p.m. UTC | #24
> -----Original Message-----
> From: Linus Torvalds <torvalds@linux-foundation.org>
> Sent: Friday, September 27, 2019 6:24 PM
> To: Pascal Van Leeuwen <pvanleeuwen@verimatrix.com>
> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>; Linux Crypto Mailing List
> <linux-crypto@vger.kernel.org>; Linux ARM <linux-arm-kernel@lists.infradead.org>;
> Herbert Xu <herbert@gondor.apana.org.au>; David Miller <davem@davemloft.net>;
> Greg KH <gregkh@linuxfoundation.org>; Jason A. Donenfeld <Jason@zx2c4.com>;
> Samuel Neves <sneves@dei.uc.pt>; Dan Carpenter <dan.carpenter@oracle.com>;
> Arnd Bergmann <arnd@arndb.de>; Eric Biggers <ebiggers@google.com>;
> Andy Lutomirski <luto@kernel.org>; Will Deacon <will@kernel.org>;
> Marc Zyngier <maz@kernel.org>; Catalin Marinas <catalin.marinas@arm.com>
> Subject: Re: [RFC PATCH 18/18] net: wireguard - switch to crypto API for packet encryption
>

> On Fri, Sep 27, 2019 at 2:58 AM Pascal Van Leeuwen
> <pvanleeuwen@verimatrix.com> wrote:
> >
> > > I'd want to see wireguard in an end-to-end situation from the very
> > > client hardware. So laptops, phones, desktops. Not the untrusted (to
> > > me) hw in between.
> > >
> > I don't see why the crypto HW would deserve any less trust than, say,
> > the CPU itself. I would say CPU's don't deserve that trust at the moment.
>
> It's not the crypto engine that is part of the untrusted hardware.
> It's the box itself, and the manufacturer, and you having to trust
> that the manufacturer didn't set up some magic knocking sequence to
> disable the encryption.
>
> Maybe the company that makes them is trying to do a good job. But
> maybe they are based in a country that has laws that require
> backdoors.
>
> Say, France. There's a long long history of that kind of thing.
>
> It's all to "fight terrorism", but hey, a little industrial espionage
> is good too, isn't it? So let's just disable GSM encryption based on
> geographic locale and local regulation, shall we.
>
> Yeah, yeah, GSM encryption wasn't all that strong to begin with, but
> it was apparently strong enough that France didn't want it.
>
> So tell me again why I should trust that box that I have no control over?

Same reason you trust your PC hardware you have no control over?
(That CPU is assembled in Malaysia, your motherboard likely in China.
And not being a US citizen, *I* wouldn't trust anything out of the US
anyway, _knowing_ they've been actively spying on us for decades ...)

In case you worry about the software part: of course you'd be running
something open-source and Linux based like DD-WRT on that router ...

Personally I'm not that paranoid and I really like to offload all the
silly crypto heavy-lifting to my router box, where it belongs.

> > Well, that's the general idea of abstraction. It also allows for
> > swapping in any other cipher with minimal effort just _because_ the
> > details were hidden from the application. So it may cost you some
> > effort initially, but it may save you effort later.
>
> We clearly disagree on the utility of crypto agility. You point to
> things like ipsec as an argument for it.

I don't recall doing specifically that, but anyway.

> And I point to ipsec as an argument *against* that horror. It's a
> bloated, inefficient, horribly complex mess. And all the "agility" is
> very much part of it.

Oh really? I've been working on implementations thereof for nearly 2
decades, but I don't recognise this at all, at least not for the datapath.
IPsec actually made a significant effort to keep the packet format the
same across all extensions done over its 20+ year history. The cipher
agility is mostly abstracted away from the base protocol, allowing us to
add new ciphersuites - to hardware, no less! - with very minimal effort.

In any case, while I believe in the KISS principle, I also believe that
things should be as simple as possible, but _no simpler than that_ (A. Einstein).
Oversimplification is the evil twin of overcomplication.

> I also point to GSM as a reason against "agility". It has caused way
> more security problems than it has ever solved. The "agility" is
> often a way to turn off (or tune down) the encryption, not as a way to
> say "ok, we can improve it later".
>
> That "we can improve it later" is a bedtime story. It's not how it
> gets used. Particularly as the weaknesses are often not primarily in
> the crypto algorithm itself, but in how it gets used or other session
> details.

I don't see what this has to do with cipher agility. Cipher agility has
nothing to do with "improving things later" and everything to do with the
realisation that, someday, some clever person _will_ find some weakness.

> When you actually want to *improve* security, you throw the old code
> away, and start a new protocol entirely. Eg SSL -> TLS.

Uhm. Now you're starting to show some ignorance ...

TLS was NOT a new protocol. It was a simple rename of a very minor evolution
of SSL 3.0, and remained so for all versions up to and including TLS 1.2.
And YES, THAT was a mistake, because SSL was just a very poor starting point.
For TLS 1.3 they finally did a (reasonably) proper redesign.
(Fun fact: SSL was _not_ designed by a committee, but TLS 1.3 _was_ ...)

> So cryptographic agility is way oversold, and often people are
> actively lying about why they want it. And the people who aren't lying
> are ignoring the costs.

I wouldn't know what they could be lying about, crypto agility is 
just common sense risk spreading.

> One of the reasons _I_ like wireguard is that it just went for simple
> and secure. No BS.

You and me both, BTW. I just don't want it to be _too_ simple.

> And you say
>
> > Especially since all crypto it uses comes from a single
> > source (DJB), which is frowned upon in the industry.
>
> I'm perhaps not a fan of DJB in all respects, but there's no question
> that he's at least competent.

I have nothing against DJB, I've enjoyed many of his presentations.
I might even be a fan. I certainly don't doubt his competence.

But being as paranoid as you are: can you really TRUST the guy? ;-)
And as good as he is: there may be some weakness in the algorithm(s)
discovered _tomorrow_ and in that case _I_ would want to be able to
switch to an alternative instantly.
(and I believe for some big international organisation critically 
depending on such a VPN to connect all their branch offices around
the world while protecting their trade secrets, this is likely to
be even more important - they probably wouldn't want to wait until
Jason pulls Wireguard 2.0 out of his hat and certainly not for that
to pass certification and finally hit their devices months later ...)

I'm not talking about some convoluted and fragile negotiation scheme,
a static parameter in some config file is just fine for that. The 
textual crypto templates of the Crypto API just fit that use case
perfectly.

And I have other reasons not to want to use Chacha-Poly, while I would
like to use the Wireguard _protocol_ itself:

1) Contrary to popular belief, Chacha-Poly is NOT the best choice of
   algorithms in terms of performance on many modern systems. On the
   quad core Cortex A72 system I'm working on here, AES-GCM is over 2
   times faster, even including Ard's Poly1305-Neon patches of last
   week (current mainline code for PC is even slower than that).
   Also, on modern Intel systems with AES-NI or VAES, AES-GCM 
   outperforms Chacha-Poly by a considerable margin. And, to make
   matters worse, running Chacha-Poly at high throughput is known to
   result in excessive thermal throttling on some recent Intel CPUs.
   Even if you don't need that throughput, it's nice to have more CPU
   power left to do useful work.
2) Chacha-Poly is inefficient in terms of power. For our hardware,
   it uses about 2x the power of AES-GCM and I have indications (e.g.
   the thermal throttling mentioned above) that this is no better for
   software implementations.

> The "industry practice" of having committees influenced by who knows
> what isn't all that much better. Do you want to talk about NSA
> elliptic curve constant choices?

Which is actually an argument _in favor_ of crypto agility - you don't
want to be stuck with just one choice you may not trust ...
Options are _good_. (but do add some implementation complexity, sure)

> Anyway, on the costs:
>
> > >  - dynamically allocate buffers at "init time"
> >
> > Why is that so "wrong"? It sure beats doing allocations on the hot path.
>
> It's wrong not because the allocation is costly (you do that only
> once), but because the dynamic allocation means that you can't embed
> stuff in your own native data structures as a user.
>
> So now accessing those things is no longer dense in the cache.

I don't see how data allocated at _init time_ would be local in the 
cache at the time it is _finally_ used in some remote location, far
away in both space and time.

If you init and then immediately use, you may have a point, but
that should be the exception and not the rule.

> And it's the cache that matters for a synchronous CPU algorithm. You
> don't want the keys and state to be in some other location when you
> already have your data structures for the stream that could just have
> them right there with the other data.

Yeah yeah, we all know that. But that only works for stuff that stays
in scope in the cache, not for stuff that has long since been pushed
out by other local variables.

And "other" memory that's used frequently (i.e. when it matters!) CAN
be cached too, you known :-) Modern prefetchers tend to be quite good,
too, so it shouldn't even matter if it gets flushed out temporarily.

> > And you don't want to have it on the stack initially and then have
> > to _copy_ it to some DMA-able location that you allocate on the fly
> > on the hot path if you _do_ want HW acceleration.
>
> Actually, that's *exactly* what you want. You want keys etc to be in
> regular memory in a location that is convenient to the user, and then
> only if the hardware has issues do you say "ok, copy the key to the
> hardware". Because quite often the hardware will have very special key
> caches that aren't even available to the CPU, because they are on some
> hw-private buffers.

Unfortunately, the only way to get that _into_ the HW is usually DMA
and that relies on DMA-capable memory. And copying significant data 
around on the CPU tends to totally kill performance if you're in the
business of HW acceleration, so it's nice if it's already in a DMA
capable buffer. Assuming the cost of having it there is not excessive.

I don't care so much about the keys BTW, that should not be performance
critical as you set it only once in a long while.
But things like IV's etc. _may_ be another matter for _some_ hardware.
(Actually, for _my_ hardware I _only_ care about not having to copy the
actual _data_, so for all _I_ care everything else can be on the stack.
But alas, I'm not alone in the world ...)

> Yes, you want to have a "key identity" model so that the hardware
> doesn't have to reload it all the time, but that's an invalidation
> protocol, not a "put the keys or nonces in special places".

Actually, that _is_ exactly how (most of) _our_ hardware works :-)

But I _think_ keys and nonces and whatnot are actually not the main
reason those structs can't be on the stack. Drivers tend to add their
own local data to those structs, and this may contain buffers that
are used for DMA. I know for a fact the Inside Secure driver does
this (_not_ my design, BTW). I would personally have opted for 
embedding pointers to dynamically allocated blobs elsewhere, such
that the main struct _can_ be on the stack. Food for discussion :-)


>                Linus



Regards,
Pascal van Leeuwen
Silicon IP Architect, Multi-Protocol Engines @ Verimatrix
www.insidesecure.com
Linus Torvalds Sept. 30, 2019, 9:37 p.m. UTC | #25
On Mon, Sep 30, 2019 at 4:14 AM Marc Gonzalez <marc.w.gonzalez@free.fr> wrote:
>
> Two statements above have raised at least one of my eyebrows.
>
> 1) France has laws that require backdoors.

No. But France has a long history of being bad on encryption policies.
They've gotten better, thankfully.

France was one of the countries that had laws against strong
encryption back in the 90s. It got better in the early 2000s, but
there's a long history - and still a push - for some very questionable
practices.

It was just a couple of years ago that they had discussions about
mandatory backdoors for encryption in France. Look it up.

Are there other countries that have worse track records? Yes. And in
the west, the US (and Australia) have had similar issues.

But when it comes to Western Europe, France has been a particular
problem spot. And I wanted to point out that it's not always the
obvious problem countries (ie Middle East, China) that everybody
points to.

> 2) France did not want GSM encryption.

I'm pretty sure that France had the encryption bit off at least during the 90's.

GSM A5/1 isn't great, but as part of the spec there is also A5/0. No,
it's not used in the West any more.

France was also at least at one time considered a hotbed of industrial
espionage by other European countries. And the US.

You can try to google for it, but you won't find all that much from
the bad old days. You can find _some_ stuff still...

  https://apnews.com/4206823c63d58fd956f26fd5efc9a777

but basically French intelligence agencies have been accused of
extensive industrial espionage for French companies over the years.

Anyway, I'm not trying to point to France as some kind of "worst of
the worst". I literally picked it as an example because people
generally _don't_ think of Western European countries as having
encryption issues, and don't generally associate them with industrial
espionage. But there really is a history even there.

            Linus
Patch

diff --git a/drivers/net/wireguard/noise.c b/drivers/net/wireguard/noise.c
index bf0b8c5ab298..0e7aab9f645d 100644
--- a/drivers/net/wireguard/noise.c
+++ b/drivers/net/wireguard/noise.c
@@ -109,16 +109,37 @@  static struct noise_keypair *keypair_create(struct wg_peer *peer)
 
 	if (unlikely(!keypair))
 		return NULL;
+
+	keypair->sending.tfm = crypto_alloc_aead("rfc7539(chacha20,poly1305)",
+						 0, CRYPTO_ALG_ASYNC);
+	if (unlikely(IS_ERR(keypair->sending.tfm)))
+		goto free_keypair;
+	keypair->receiving.tfm = crypto_alloc_aead("rfc7539(chacha20,poly1305)",
+						   0, CRYPTO_ALG_ASYNC);
+	if (unlikely(IS_ERR(keypair->receiving.tfm)))
+		goto free_sending_tfm;
+
 	keypair->internal_id = atomic64_inc_return(&keypair_counter);
 	keypair->entry.type = INDEX_HASHTABLE_KEYPAIR;
 	keypair->entry.peer = peer;
 	kref_init(&keypair->refcount);
 	return keypair;
+
+free_sending_tfm:
+	crypto_free_aead(keypair->sending.tfm);
+free_keypair:
+	kzfree(keypair);
+	return NULL;
 }
 
 static void keypair_free_rcu(struct rcu_head *rcu)
 {
-	kzfree(container_of(rcu, struct noise_keypair, rcu));
+	struct noise_keypair *keypair =
+		container_of(rcu, struct noise_keypair, rcu);
+
+	crypto_free_aead(keypair->sending.tfm);
+	crypto_free_aead(keypair->receiving.tfm);
+	kzfree(keypair);
 }
 
 static void keypair_free_kref(struct kref *kref)
@@ -360,11 +381,20 @@  static void derive_keys(struct noise_symmetric_key *first_dst,
 			struct noise_symmetric_key *second_dst,
 			const u8 chaining_key[NOISE_HASH_LEN])
 {
-	kdf(first_dst->key, second_dst->key, NULL, NULL,
+	u8 key[2][NOISE_SYMMETRIC_KEY_LEN];
+	int err;
+
+	kdf(key[0], key[1], NULL, NULL,
 	    NOISE_SYMMETRIC_KEY_LEN, NOISE_SYMMETRIC_KEY_LEN, 0, 0,
 	    chaining_key);
 	symmetric_key_init(first_dst);
 	symmetric_key_init(second_dst);
+
+	err = crypto_aead_setkey(first_dst->tfm, key[0], sizeof(key[0])) ?:
+	      crypto_aead_setkey(second_dst->tfm, key[1], sizeof(key[1]));
+	memzero_explicit(key, sizeof(key));
+	if (unlikely(err))
+		pr_warn_once("crypto_aead_setkey() failed (%d)\n", err);
 }
 
 static bool __must_check mix_dh(u8 chaining_key[NOISE_HASH_LEN],
diff --git a/drivers/net/wireguard/noise.h b/drivers/net/wireguard/noise.h
index 9c2cc62dc11e..6f033d2ea52c 100644
--- a/drivers/net/wireguard/noise.h
+++ b/drivers/net/wireguard/noise.h
@@ -8,6 +8,7 @@ 
 #include "messages.h"
 #include "peerlookup.h"
 
+#include <crypto/aead.h>
 #include <linux/types.h>
 #include <linux/spinlock.h>
 #include <linux/atomic.h>
@@ -26,7 +27,7 @@  union noise_counter {
 };
 
 struct noise_symmetric_key {
-	u8 key[NOISE_SYMMETRIC_KEY_LEN];
+	struct crypto_aead *tfm;
 	union noise_counter counter;
 	u64 birthdate;
 	bool is_valid;
diff --git a/drivers/net/wireguard/queueing.h b/drivers/net/wireguard/queueing.h
index f8de703dff97..593971edf8a3 100644
--- a/drivers/net/wireguard/queueing.h
+++ b/drivers/net/wireguard/queueing.h
@@ -55,9 +55,10 @@  enum packet_state {
 };
 
 struct packet_cb {
-	u64 nonce;
-	struct noise_keypair *keypair;
 	atomic_t state;
+	__le32 ivpad;			/* pad 64-bit nonce to 96 bits */
+	__le64 nonce;
+	struct noise_keypair *keypair;
 	u32 mtu;
 	u8 ds;
 };
diff --git a/drivers/net/wireguard/receive.c b/drivers/net/wireguard/receive.c
index 900c76edb9d6..395089e7e3a6 100644
--- a/drivers/net/wireguard/receive.c
+++ b/drivers/net/wireguard/receive.c
@@ -11,7 +11,7 @@ 
 #include "cookie.h"
 #include "socket.h"
 
-#include <linux/simd.h>
+#include <crypto/aead.h>
 #include <linux/ip.h>
 #include <linux/ipv6.h>
 #include <linux/udp.h>
@@ -244,13 +244,14 @@  static void keep_key_fresh(struct wg_peer *peer)
 	}
 }
 
-static bool decrypt_packet(struct sk_buff *skb, struct noise_symmetric_key *key,
-			   simd_context_t *simd_context)
+static bool decrypt_packet(struct sk_buff *skb, struct noise_symmetric_key *key)
 {
 	struct scatterlist sg[MAX_SKB_FRAGS + 8];
+	struct aead_request *req, stackreq;
 	struct sk_buff *trailer;
 	unsigned int offset;
 	int num_frags;
+	int err;
 
 	if (unlikely(!key))
 		return false;
@@ -262,8 +263,8 @@  static bool decrypt_packet(struct sk_buff *skb, struct noise_symmetric_key *key,
 		return false;
 	}
 
-	PACKET_CB(skb)->nonce =
-		le64_to_cpu(((struct message_data *)skb->data)->counter);
+	PACKET_CB(skb)->ivpad = 0;
+	PACKET_CB(skb)->nonce = ((struct message_data *)skb->data)->counter;
 
 	/* We ensure that the network header is part of the packet before we
 	 * call skb_cow_data, so that there's no chance that data is removed
@@ -281,9 +282,23 @@  static bool decrypt_packet(struct sk_buff *skb, struct noise_symmetric_key *key,
 	if (skb_to_sgvec(skb, sg, 0, skb->len) <= 0)
 		return false;
 
-	if (!chacha20poly1305_decrypt_sg(sg, sg, skb->len, NULL, 0,
-					 PACKET_CB(skb)->nonce, key->key,
-					 simd_context))
+	if (unlikely(crypto_aead_reqsize(key->tfm) > 0)) {
+		req = aead_request_alloc(key->tfm, GFP_ATOMIC);
+		if (!req)
+			return false;
+	} else {
+		req = &stackreq;
+		aead_request_set_tfm(req, key->tfm);
+	}
+
+	aead_request_set_ad(req, 0);
+	aead_request_set_callback(req, 0, NULL, NULL);
+	aead_request_set_crypt(req, sg, sg, skb->len,
+			       (u8 *)&PACKET_CB(skb)->ivpad);
+	err = crypto_aead_decrypt(req);
+	if (unlikely(req != &stackreq))
+		aead_request_free(req);
+	if (err)
 		return false;
 
 	/* Another ugly situation of pushing and pulling the header so as to
@@ -475,10 +490,10 @@  int wg_packet_rx_poll(struct napi_struct *napi, int budget)
 			goto next;
 
 		if (unlikely(!counter_validate(&keypair->receiving.counter,
-					       PACKET_CB(skb)->nonce))) {
+					       le64_to_cpu(PACKET_CB(skb)->nonce)))) {
 			net_dbg_ratelimited("%s: Packet has invalid nonce %llu (max %llu)\n",
 					    peer->device->dev->name,
-					    PACKET_CB(skb)->nonce,
+					    le64_to_cpu(PACKET_CB(skb)->nonce),
 					    keypair->receiving.counter.receive.counter);
 			goto next;
 		}
@@ -510,21 +525,19 @@  void wg_packet_decrypt_worker(struct work_struct *work)
 {
 	struct crypt_queue *queue = container_of(work, struct multicore_worker,
 						 work)->ptr;
-	simd_context_t simd_context;
 	struct sk_buff *skb;
 
-	simd_get(&simd_context);
 	while ((skb = ptr_ring_consume_bh(&queue->ring)) != NULL) {
-		enum packet_state state = likely(decrypt_packet(skb,
-					   &PACKET_CB(skb)->keypair->receiving,
-					   &simd_context)) ?
-				PACKET_STATE_CRYPTED : PACKET_STATE_DEAD;
+		enum packet_state state;
+
+		if (likely(decrypt_packet(skb,
+					  &PACKET_CB(skb)->keypair->receiving)))
+			state = PACKET_STATE_CRYPTED;
+		else
+			state = PACKET_STATE_DEAD;
 		wg_queue_enqueue_per_peer_napi(&PACKET_PEER(skb)->rx_queue, skb,
 					       state);
-		simd_relax(&simd_context);
 	}
-
-	simd_put(&simd_context);
 }
 
 static void wg_packet_consume_data(struct wg_device *wg, struct sk_buff *skb)
diff --git a/drivers/net/wireguard/send.c b/drivers/net/wireguard/send.c
index b0df5c717502..48d1fb02f575 100644
--- a/drivers/net/wireguard/send.c
+++ b/drivers/net/wireguard/send.c
@@ -11,7 +11,7 @@ 
 #include "messages.h"
 #include "cookie.h"
 
-#include <linux/simd.h>
+#include <crypto/aead.h>
 #include <linux/uio.h>
 #include <linux/inetdevice.h>
 #include <linux/socket.h>
@@ -157,11 +157,11 @@  static unsigned int calculate_skb_padding(struct sk_buff *skb)
 	return padded_size - last_unit;
 }
 
-static bool encrypt_packet(struct sk_buff *skb, struct noise_keypair *keypair,
-			   simd_context_t *simd_context)
+static bool encrypt_packet(struct sk_buff *skb, struct noise_keypair *keypair)
 {
 	unsigned int padding_len, plaintext_len, trailer_len;
 	struct scatterlist sg[MAX_SKB_FRAGS + 8];
+	struct aead_request *req, stackreq;
 	struct message_data *header;
 	struct sk_buff *trailer;
 	int num_frags;
@@ -199,7 +199,7 @@  static bool encrypt_packet(struct sk_buff *skb, struct noise_keypair *keypair,
 	header = (struct message_data *)skb_push(skb, sizeof(*header));
 	header->header.type = cpu_to_le32(MESSAGE_DATA);
 	header->key_idx = keypair->remote_index;
-	header->counter = cpu_to_le64(PACKET_CB(skb)->nonce);
+	header->counter = PACKET_CB(skb)->nonce;
 	pskb_put(skb, trailer, trailer_len);
 
 	/* Now we can encrypt the scattergather segments */
@@ -207,9 +207,24 @@  static bool encrypt_packet(struct sk_buff *skb, struct noise_keypair *keypair,
 	if (skb_to_sgvec(skb, sg, sizeof(struct message_data),
 			 noise_encrypted_len(plaintext_len)) <= 0)
 		return false;
-	return chacha20poly1305_encrypt_sg(sg, sg, plaintext_len, NULL, 0,
-					   PACKET_CB(skb)->nonce,
-					   keypair->sending.key, simd_context);
+
+	if (unlikely(crypto_aead_reqsize(keypair->sending.tfm) > 0)) {
+		req = aead_request_alloc(keypair->sending.tfm, GFP_ATOMIC);
+		if (!req)
+			return false;
+	} else {
+		req = &stackreq;
+		aead_request_set_tfm(req, keypair->sending.tfm);
+	}
+
+	aead_request_set_ad(req, 0);
+	aead_request_set_callback(req, 0, NULL, NULL);
+	aead_request_set_crypt(req, sg, sg, plaintext_len,
+			       (u8 *)&PACKET_CB(skb)->ivpad);
+	crypto_aead_encrypt(req);
+	if (unlikely(req != &stackreq))
+		aead_request_free(req);
+	return true;
 }
 
 void wg_packet_send_keepalive(struct wg_peer *peer)
@@ -296,16 +311,13 @@  void wg_packet_encrypt_worker(struct work_struct *work)
 	struct crypt_queue *queue = container_of(work, struct multicore_worker,
 						 work)->ptr;
 	struct sk_buff *first, *skb, *next;
-	simd_context_t simd_context;
 
-	simd_get(&simd_context);
 	while ((first = ptr_ring_consume_bh(&queue->ring)) != NULL) {
 		enum packet_state state = PACKET_STATE_CRYPTED;
 
 		skb_walk_null_queue_safe(first, skb, next) {
 			if (likely(encrypt_packet(skb,
-						  PACKET_CB(first)->keypair,
-						  &simd_context))) {
+						  PACKET_CB(first)->keypair))) {
 				wg_reset_packet(skb);
 			} else {
 				state = PACKET_STATE_DEAD;
@@ -314,10 +326,7 @@  void wg_packet_encrypt_worker(struct work_struct *work)
 		}
 		wg_queue_enqueue_per_peer(&PACKET_PEER(first)->tx_queue, first,
 					  state);
-
-		simd_relax(&simd_context);
 	}
-	simd_put(&simd_context);
 }
 
 static void wg_packet_create_data(struct sk_buff *first)
@@ -389,13 +398,15 @@  void wg_packet_send_staged_packets(struct wg_peer *peer)
 	 * handshake.
 	 */
 	skb_queue_walk(&packets, skb) {
+		u64 counter = atomic64_inc_return(&key->counter.counter) - 1;
+
 		/* 0 for no outer TOS: no leak. TODO: at some later point, we
 		 * might consider using flowi->tos as outer instead.
 		 */
 		PACKET_CB(skb)->ds = ip_tunnel_ecn_encap(0, ip_hdr(skb), skb);
-		PACKET_CB(skb)->nonce =
-				atomic64_inc_return(&key->counter.counter) - 1;
-		if (unlikely(PACKET_CB(skb)->nonce >= REJECT_AFTER_MESSAGES))
+		PACKET_CB(skb)->ivpad = 0;
+		PACKET_CB(skb)->nonce = cpu_to_le64(counter);
+		if (unlikely(counter >= REJECT_AFTER_MESSAGES))
 			goto out_invalid;
 	}