
[v2] crypto: aesni - add ccm(aes) algorithm implementation

Message ID 20201201194556.5220-1-ardb@kernel.org
State New
Series [v2] crypto: aesni - add ccm(aes) algorithm implementation

Commit Message

Ard Biesheuvel Dec. 1, 2020, 7:45 p.m. UTC
Add ccm(aes) implementation from linux-wireless mailing list (see
http://permalink.gmane.org/gmane.linux.kernel.wireless.general/126679).

This eliminates the FPU context store/restore overhead present in the more
general ccm_base(ctr(aes-aesni),aes-aesni) case during the MAC calculation.

Suggested-by: Ben Greear <greearb@candelatech.com>
Co-developed-by: Steve deRosier <derosier@cal-sierra.com>
Signed-off-by: Steve deRosier <derosier@cal-sierra.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
v2: avoid the SIMD helper, as it produces a CRYPTO_ALG_ASYNC aead, which
    is not usable by the 802.11 ccmp driver

 arch/x86/crypto/aesni-intel_glue.c | 406 +++++++++++++++++++-
 1 file changed, 404 insertions(+), 2 deletions(-)
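
For context: mac80211 allocates its CCMP transform with the ASYNC bit set in the
lookup mask, so only a synchronous aead can satisfy it; this is why v2 avoids the
SIMD helper. An abridged sketch of that allocation, modelled on
net/mac80211/aes_ccm.c (the function name here is illustrative):

#include <crypto/aead.h>

/* type = 0, mask = CRYPTO_ALG_ASYNC: only implementations with the
 * CRYPTO_ALG_ASYNC flag clear can match, so a SIMD-helper-wrapped
 * (async) aead is never eligible for this lookup. */
static struct crypto_aead *ccmp_alloc_sketch(const u8 key[], size_t key_len,
					     size_t mic_len)
{
	struct crypto_aead *tfm;
	int err;

	tfm = crypto_alloc_aead("ccm(aes)", 0, CRYPTO_ALG_ASYNC);
	if (IS_ERR(tfm))
		return tfm;

	err = crypto_aead_setkey(tfm, key, key_len);
	if (!err)
		err = crypto_aead_setauthsize(tfm, mic_len);
	if (err) {
		crypto_free_aead(tfm);
		return ERR_PTR(err);
	}
	return tfm;
}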

Comments

Ben Greear Dec. 1, 2020, 10 p.m. UTC | #1
On 12/1/20 1:57 PM, Herbert Xu wrote:
> On Tue, Dec 01, 2020 at 08:45:56PM +0100, Ard Biesheuvel wrote:
>> Add ccm(aes) implementation from linux-wireless mailing list (see
>> http://permalink.gmane.org/gmane.linux.kernel.wireless.general/126679).
>>
>> This eliminates the FPU context store/restore overhead present in the more
>> general ccm_base(ctr(aes-aesni),aes-aesni) case during the MAC calculation.
>>
>> Suggested-by: Ben Greear <greearb@candelatech.com>
>> Co-developed-by: Steve deRosier <derosier@cal-sierra.com>
>> Signed-off-by: Steve deRosier <derosier@cal-sierra.com>
>> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
>> ---
>> v2: avoid the SIMD helper, as it produces a CRYPTO_ALG_ASYNC aead, which
>>      is not usable by the 802.11 ccmp driver
> 
> Sorry, but this is not the way to go.  Please fix wireless to
> use the async interface instead.

No one wanted to do this for the last 6+ years, so I don't think it is likely
to happen any time soon.  If the patch is better than
existing behaviour, please let it into the kernel.  And it is certainly
better in my test case.

Thanks,
Ben
Ard Biesheuvel Dec. 1, 2020, 10:01 p.m. UTC | #2
On Tue, 1 Dec 2020 at 22:57, Herbert Xu <herbert@gondor.apana.org.au> wrote:
>
> On Tue, Dec 01, 2020 at 08:45:56PM +0100, Ard Biesheuvel wrote:
> > Add ccm(aes) implementation from linux-wireless mailing list (see
> > http://permalink.gmane.org/gmane.linux.kernel.wireless.general/126679).
> >
> > This eliminates the FPU context store/restore overhead present in the more
> > general ccm_base(ctr(aes-aesni),aes-aesni) case during the MAC calculation.
> >
> > Suggested-by: Ben Greear <greearb@candelatech.com>
> > Co-developed-by: Steve deRosier <derosier@cal-sierra.com>
> > Signed-off-by: Steve deRosier <derosier@cal-sierra.com>
> > Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> > ---
> > v2: avoid the SIMD helper, as it produces a CRYPTO_ALG_ASYNC aead, which
> >     is not usable by the 802.11 ccmp driver
>
> Sorry, but this is not the way to go.  Please fix wireless to
> use the async interface instead.
>

This is not the first time this has come up. The point is that CCMP in
the wireless stack is not used in 99% of the cases, given that any
wifi hardware built in the last ~10 years can do it in hardware. Only
in exceptional cases, such as Ben's, is there a need for exercising
this interface.

Also, care to explain why we have synchronous AEADs in the first place
if they are not supposed to be used?
Ben Greear Dec. 1, 2020, 10:12 p.m. UTC | #3
On 12/1/20 2:04 PM, Herbert Xu wrote:
> On Tue, Dec 01, 2020 at 11:01:57PM +0100, Ard Biesheuvel wrote:
>>
>> This is not the first time this has come up. The point is that CCMP in
>> the wireless stack is not used in 99% of the cases, given that any
>> wifi hardware built in the last ~10 years can do it in hardware. Only
>> in exceptional cases, such as Ben's, is there a need for exercising
>> this interface.
> 
> Either it matters or it doesn't.  If it doesn't matter why are we
> having this discussion at all? If it does then fixing just one
> direction makes no sense.
> 
>> Also, care to explain why we have synchronous AEADs in the first place
>> if they are not supposed to be used?
> 
> Sync AEADs would make sense if you were dealing with a very small
> amount of data, e.g., one block.

Sure, I bet some part of the kernel does this.  So let the patch in to
handle that case.  It will just be a happy accident that it helps with some other
problems as well.

Thanks,
Ben
Herbert Xu Dec. 1, 2020, 10:16 p.m. UTC | #4
On Tue, Dec 01, 2020 at 11:12:32PM +0100, Ard Biesheuvel wrote:
>
> What do you mean by just one direction? Ben just confirmed a

The TX direction generally executes in process context, which
would benefit from an accelerated sync implementation.  The RX
side on the other hand is always in softirq context.

> substantial speedup for his use case, which requires the use of
> software encryption even on hardware that could support doing it in
> hardware, and that software encryption currently only supports the
> synchronous interface.

The problem is that the degradation would come at the worst time,
when the system is loaded.  IOW when you get an interrupt during
your TX path and get RX traffic that's when you'll take the fallback
path.

Cheers,
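
The fallback Herbert refers to is the scalar branch of the patch's per-request
dispatch; a condensed sketch of that logic, mirroring compute_mac() in the diff
at the bottom of this page (the helper name is illustrative):

/* The SIMD decision is taken once per request via crypto_simd_usable();
 * the scalar branch is what a request issued from softirq context takes
 * when it has interrupted a task that was already using the FPU. */
static void ccm_mac_blocks(struct crypto_aes_ctx *ctx, u8 mac[],
			   const u8 *data, unsigned int nblocks, bool do_simd)
{
	while (nblocks--) {
		if (do_simd) {
			/* caller bracketed the request in kernel_fpu_begin()/end() */
			aesni_cbc_enc(ctx, mac, data, AES_BLOCK_SIZE, mac);
		} else {
			/* generic, table-based C fallback */
			crypto_xor(mac, data, AES_BLOCK_SIZE);
			aes_encrypt(ctx, mac, mac);
		}
		data += AES_BLOCK_SIZE;
	}
}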
Herbert Xu Dec. 1, 2020, 11:11 p.m. UTC | #5
On Tue, Dec 01, 2020 at 11:27:52PM +0100, Ard Biesheuvel wrote:
>
> > The problem is that the degradation would come at the worst time,
> > when the system is loaded.  IOW when you get an interrupt during
> > your TX path and get RX traffic that's when you'll take the fallback
> > path.
> 
> I can see how in the general case, this is something you would prefer
> to avoid. However, on SMP x86_64 systems that implement AES-NI (which
> runs at ~1 cycle per byte), I don't see this as a real problem for
> this driver.

AES-NI is 1 cycle per byte but the fallback is not.

> What we could do is expose both versions, where the async version has
> a slightly higher priority, so that all users that do support the
> async interface will get it, and the wifi stack can use the sync
> interface instead.

No we've already tried this with IPsec and it doesn't work.  That's
why the async path exists in aesni.

Wireless is no different to IPsec in this respect.

Cheers,
Herbert Xu Dec. 1, 2020, 11:30 p.m. UTC | #6
On Wed, Dec 02, 2020 at 12:24:47AM +0100, Ard Biesheuvel wrote:
>
> True. But the fallback only gets executed if the scheduler is stupid
> enough to schedule the TX task onto the core that is overloaded doing
> RX softirqs. So in the general case, both TX and RX will be using
> AES-NI instructions (unless the CCMP is done in hardware which is the
> most common case by far)

I don't think this makes sense.  TX is typically done in response
to RX so the natural alignment is for it to be on the same CPU.
 
> Wireless is very different. Wifi uses a medium that is fundamentally
> shared, and so the load it can induce is bounded. There is no way a
> wifi interface is going to saturate a 64-bit AES-NI core doing CCMP in
> software.

This sounds pretty tenuous.  In any case, even if wireless itself
doesn't get you, there could be loads running on top of it, for
example, IPsec.

> Given the above, can't we be pragmatic here? This code addresses a
> niche use case, which is not affected by the general concerns
> regarding async crypto.

We already have a framework for acceleration that works properly
in aesni, I don't want to see us introduce another broken model
within the same driver.

So either just leave the whole thing alone or do it properly by
making wireless async.

Cheers,
Ben Greear Dec. 2, 2020, 12:01 a.m. UTC | #7
On 12/1/20 3:48 PM, Herbert Xu wrote:
> On Wed, Dec 02, 2020 at 12:41:36AM +0100, Ard Biesheuvel wrote:
>>
>> You just explained that TX typically runs in process context, whereas
>> RX is handled in softirq context. So how exactly are these going to
>> end up on the same core?
> 
> When you receive a TCP packet via RX, this should wake up your user-
> space thread on the same CPU as otherwise you'll pay extra cost
> on pulling the data into memory again.
> 
>> Yes, but IPsec will not use the synchronous interface.
> 
> That doesn't matter when the underlying wireless code is using
> the sync interface.  An async user is completely capable of making
> the aesni code-path unavailable to the sync user.
>   
>> Fair enough. But it is unfortunate that we cannot support Ben's use
>> case without a lot of additional work that serves no purpose
>> otherwise.
> 
> To the contrary, I think to fully support Ben's use case you must
> use the async code path.  Otherwise sure you'll see good numbers
> when you do simple benchmarks like netperf, but when you really
> load up the system it just falls apart.

I can generate some very complicated traffic to test this, including bi-directional
traffic, mix of tcp/udp/https, etc.  If numbers will sway your opinion, let me know what
traffic tests will make you happy and I'll do the testing.

I know for sure that in download traffic (which is the normally dominant direction for wifi stations),
Ard's patch gives me about a 3x increase in throughput.  Without the patch, ksoftirqd is pegged at
100% futzing around with enabling and disabling the FPU.

The wifi stack people do not want any extra complexity in their code,
and async processing of this would be a lot of complexity.  So, even if I wanted
to implement it, likely it would never make it upstream anyway.

I also suspect that general users may benefit from this aesni patch since
many older wifi chipsets don't support wpa3 in hardware and wpa3 is the new
hotness in wifi.

Thanks,
Ben
Ard Biesheuvel Dec. 10, 2020, 12:18 a.m. UTC | #8
On Wed, 2 Dec 2020 at 00:12, Herbert Xu <herbert@gondor.apana.org.au> wrote:
>
> On Tue, Dec 01, 2020 at 11:27:52PM +0100, Ard Biesheuvel wrote:
> >
> > > The problem is that the degradation would come at the worst time,
> > > when the system is loaded.  IOW when you get an interrupt during
> > > your TX path and get RX traffic that's when you'll take the fallback
> > > path.
> >
> > I can see how in the general case, this is something you would prefer
> > to avoid. However, on SMP x86_64 systems that implement AES-NI (which
> > runs at ~1 cycle per byte), I don't see this as a real problem for
> > this driver.
>
> AES-NI is 1 cycle per byte but the fallback is not.
>

One thing I realized just now is that in the current situation, all
the synchronous skciphers already degrade like this.

I.e., in Ben's case, without the special ccm implementation, ccm(aes)
will resolve to ccm(ctr(aesni),cbcmac(aesni)), which is instantiated
as a sync skcipher using the ctr and ccm/cbcmac templates built on top
of the AES-NI cipher (not skcipher).  This cipher will also fall back
to suboptimal scalar code if the SIMD is in use in process context.

> > What we could do is expose both versions, where the async version has
> > a slightly higher priority, so that all users that do support the
> > async interface will get it, and the wifi stack can use the sync
> > interface instead.
>
> No we've already tried this with IPsec and it doesn't work.  That's
> why the async path exists in aesni.
>
> Wireless is no different to IPsec in this respect.
>
> Cheers,
> --
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
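
For reference, the per-block overhead Ben measured with the template-based path
comes from the single-block aes-aesni cipher, which the cbcmac template invokes
for every 16 bytes of input. A simplified sketch of that existing cipher (the
function name is illustrative):

/* Each 16-byte block pays a full FPU save/restore round trip when the
 * ccm(ctr(aesni),cbcmac(aesni)) composition is used for the MAC. */
static void aesni_encrypt_block(struct crypto_tfm *tfm, u8 *dst, const u8 *src)
{
	struct crypto_aes_ctx *ctx = aes_ctx(crypto_tfm_ctx(tfm));

	if (!crypto_simd_usable()) {
		aes_encrypt(ctx, dst, src);	/* generic C fallback */
	} else {
		kernel_fpu_begin();		/* save FPU state */
		aesni_enc(ctx, dst, src);
		kernel_fpu_end();		/* restore FPU state */
	}
}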
Herbert Xu Dec. 10, 2020, 2:43 a.m. UTC | #9
On Thu, Dec 10, 2020 at 01:18:12AM +0100, Ard Biesheuvel wrote:
>
> One thing I realized just now is that in the current situation, all
> the synchronous skciphers already degrade like this.
>
> I.e., in Ben's case, without the special ccm implementation, ccm(aes)
> will resolve to ccm(ctr(aesni),cbcmac(aesni)), which is instantiated
> as a sync skcipher using the ctr and ccm/cbcmac templates built on top
> of the AES-NI cipher (not skcipher).  This cipher will also fall back
> to suboptimal scalar code if the SIMD is in use in process context.

Sure, your patch is not making it any worse.  But I don't think
the extra code is worth it considering that you're still going to
be running into that slow fallback path all the time.

Much better to fix the wireless code to actually go async.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Ben Greear Dec. 10, 2020, 3:01 a.m. UTC | #10
On 12/9/20 6:43 PM, Herbert Xu wrote:
> On Thu, Dec 10, 2020 at 01:18:12AM +0100, Ard Biesheuvel wrote:
>>
>> One thing I realized just now is that in the current situation, all
>> the synchronous skciphers already degrade like this.
>>
>> I.e., in Ben's case, without the special ccm implementation, ccm(aes)
>> will resolve to ccm(ctr(aesni),cbcmac(aesni)), which is instantiated
>> as a sync skcipher using the ctr and ccm/cbcmac templates built on top
>> of the AES-NI cipher (not skcipher).  This cipher will also fall back
>> to suboptimal scalar code if the SIMD is in use in process context.
>
> Sure, your patch is not making it any worse.  But I don't think
> the extra code is worth it considering that you're still going to
> be running into that slow fallback path all the time.

How can we test this assumption?  I see a 3x performance gain, so it is not hitting
the fallback path much in my case.  What traffic pattern and protocol do you think
will cause the slow fallback path to happen often enough to make this patch not
helpful?

> Much better to fix the wireless code to actually go async.

This will not happen any time soon, so better to make incremental
improvements in the crypto code.

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com
Ard Biesheuvel Dec. 10, 2020, 7:30 a.m. UTC | #11
On Thu, 10 Dec 2020 at 04:01, Ben Greear <greearb@candelatech.com> wrote:
>
> On 12/9/20 6:43 PM, Herbert Xu wrote:
> > On Thu, Dec 10, 2020 at 01:18:12AM +0100, Ard Biesheuvel wrote:
> >>
> >> One thing I realized just now is that in the current situation, all
> >> the synchronous skciphers already degrade like this.
> >>
> >> I.e., in Ben's case, without the special ccm implementation, ccm(aes)
> >> will resolve to ccm(ctr(aesni),cbcmac(aesni)), which is instantiated
> >> as a sync skcipher using the ctr and ccm/cbcmac templates built on top
> >> of the AES-NI cipher (not skcipher).  This cipher will also fall back
> >> to suboptimal scalar code if the SIMD is in use in process context.
> >
> > Sure, your patch is not making it any worse.  But I don't think
> > the extra code is worth it considering that you're still going to
> > be running into that slow fallback path all the time.
>
> How can we test this assumption?  I see a 3x performance gain, so it is not hitting
> the fallback path much in my case.  What traffic pattern and protocol do you think
> will cause the slow fallback path to happen often enough to make this patch not
> helpful?
>

Is there a way to verify Herbert's assertion that TX and RX tend to be
handled by the same core? I am not a networking guy, but that seems
dubious to me.

You could add a pr_warn_ratelimited() inside the fallback path and see
if it ever gets called at all under various loads.

> > Much better to fix the wireless code to actually go async.
>
> This will not happen any time soon, so better to make incremental
> improvements in the crypto code.
>

I would argue that these are orthogonal. My patch improves both the
accelerated and the fallback path, given that the latter no longer has
to walk the input data twice, or go through 3 layers of
templates and the associated indirect calls for every 16 bytes of
input.

Of course, it would be better to avoid using the fallback path
altogether, but I don't think one should hold up the other.
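
A hypothetical sketch of that instrumentation, dropped into the scalar branch of
the patch (the message text and context check are illustrative):

/* Log (rate limited) whenever the scalar fallback is actually taken,
 * to test how often a loaded system hits it in practice. */
if (!crypto_simd_usable()) {
	pr_warn_ratelimited("ccm-aesni: scalar fallback hit (%s context)\n",
			    in_softirq() ? "softirq" : "task");
	/* ... existing scalar path ... */
}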
Herbert Xu Dec. 10, 2020, 11:14 a.m. UTC | #12
On Thu, Dec 10, 2020 at 08:30:47AM +0100, Ard Biesheuvel wrote:
>
> I would argue that these are orthogonal. My patch improves both the
> accelerated and the fallback path, given that the latter no longer has
> to walk the input data twice, or go through 3 layers of
> templates and the associated indirect calls for every 16 bytes of
> input.

As I told you before, your patch introduces a new model into aesni
that is different to every other algorithm there for the sole purpose
of catering for legacy hardware in a subsystem that refuses to do
the right thing.

That is not acceptable.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Ard Biesheuvel Dec. 10, 2020, 12:03 p.m. UTC | #13
On Thu, 10 Dec 2020 at 12:14, Herbert Xu <herbert@gondor.apana.org.au> wrote:
>
> On Thu, Dec 10, 2020 at 08:30:47AM +0100, Ard Biesheuvel wrote:
> >
> > I would argue that these are orthogonal. My patch improves both the
> > accelerated and the fallback path, given that the latter no longer has
> > to walk the input data twice, or go through 3 layers of
> > templates and the associated indirect calls for every 16 bytes of
> > input.
>
> As I told you before, your patch introduces a new model into aesni
> that is different to every other algorithm there for the sole purpose
> of catering for legacy hardware in a subsystem that refuses to do
> the right thing.
>
> That is not acceptable.
>

OK, I will stop whining about CCM, apologies if this is getting tedious.

But we should probably start policing this a bit more. For instance, we now have

drivers/net/macsec.c:

/* Pick a sync gcm(aes) cipher to ensure order is preserved. */
tfm = crypto_alloc_aead("gcm(aes)", 0, CRYPTO_ALG_ASYNC);

(btw the comment is bogus, right?)

TLS_SW does the same thing in net/tls/tls_device_fallback.c.

So it is not only CCM in the 802.11 layer, there are now other places
where we end up using a sub-optimal algorithm (and less secure if
table based AES or GHASH end up being used) just to avoid a potential
fallback which is not even as bad as the fallback we will actually end
up with when the crypto API synthesizes it from the GCM, CTR and GHASH
templates/drivers.

Async is obviously needed for h/w accelerators, but could we perhaps
do better for s/w SIMD algorithms? Those are by far the most widely
used ones.
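
As a side note on the lookup semantics involved: the third argument of
crypto_alloc_aead() is a mask, and every bit set in the mask must match the
corresponding bit of the type argument. A sketch under that reading:

/* type = 0, mask = CRYPTO_ALG_ASYNC: only synchronous implementations
 * (ASYNC flag clear) can satisfy the lookup, which is how macsec and
 * the TLS_SW fallback exclude cryptd/SIMD-wrapped aeads. */
struct crypto_aead *tfm = crypto_alloc_aead("gcm(aes)", 0, CRYPTO_ALG_ASYNC);

if (IS_ERR(tfm))
	return PTR_ERR(tfm);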
Herbert Xu Dec. 10, 2020, 12:16 p.m. UTC | #14
On Thu, Dec 10, 2020 at 01:03:56PM +0100, Ard Biesheuvel wrote:
>
> But we should probably start policing this a bit more. For instance, we now have
>
> drivers/net/macsec.c:
>
> /* Pick a sync gcm(aes) cipher to ensure order is preserved. */
> tfm = crypto_alloc_aead("gcm(aes)", 0, CRYPTO_ALG_ASYNC);
>
> (btw the comment is bogus, right?)
>
> TLS_SW does the same thing in net/tls/tls_device_fallback.c.

Short of us volunteering to write code for every user out there
I don't see a way out.

> Async is obviously needed for h/w accelerators, but could we perhaps
> do better for s/w SIMD algorithms? Those are by far the most widely
> used ones.

If you can come up with a way that avoids the cryptd model without
using a fallback obviously that would be the ultimate solution.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Ard Biesheuvel Dec. 10, 2020, 12:19 p.m. UTC | #15
On Thu, 10 Dec 2020 at 13:16, Herbert Xu <herbert@gondor.apana.org.au> wrote:
>
> On Thu, Dec 10, 2020 at 01:03:56PM +0100, Ard Biesheuvel wrote:
> >
> > But we should probably start policing this a bit more. For instance, we now have
> >
> > drivers/net/macsec.c:
> >
> > /* Pick a sync gcm(aes) cipher to ensure order is preserved. */
> > tfm = crypto_alloc_aead("gcm(aes)", 0, CRYPTO_ALG_ASYNC);
> >
> > (btw the comment is bogus, right?)
> >
> > TLS_SW does the same thing in net/tls/tls_device_fallback.c.
>
> Short of us volunteering to write code for every user out there
> I don't see a way out.
>
> > Async is obviously needed for h/w accelerators, but could we perhaps
> > do better for s/w SIMD algorithms? Those are by far the most widely
> > used ones.
>
> If you can come up with a way that avoids the cryptd model without
> using a fallback obviously that would be the ultimate solution.
>

Could we disable softirq handling in these regions?
Ben Greear Dec. 10, 2020, 2:40 p.m. UTC | #16
On 12/9/20 11:30 PM, Ard Biesheuvel wrote:
> On Thu, 10 Dec 2020 at 04:01, Ben Greear <greearb@candelatech.com> wrote:
>>
>> On 12/9/20 6:43 PM, Herbert Xu wrote:
>>> On Thu, Dec 10, 2020 at 01:18:12AM +0100, Ard Biesheuvel wrote:
>>>>
>>>> One thing I realized just now is that in the current situation, all
>>>> the synchronous skciphers already degrade like this.
>>>>
>>>> I.e., in Ben's case, without the special ccm implementation, ccm(aes)
>>>> will resolve to ccm(ctr(aesni),cbcmac(aesni)), which is instantiated
>>>> as a sync skcipher using the ctr and ccm/cbcmac templates built on top
>>>> of the AES-NI cipher (not skcipher).  This cipher will also fall back
>>>> to suboptimal scalar code if the SIMD is in use in process context.
>>>
>>> Sure, your patch is not making it any worse.  But I don't think
>>> the extra code is worth it considering that you're still going to
>>> be running into that slow fallback path all the time.
>>
>> How can we test this assumption?  I see a 3x performance gain, so it is not hitting
>> the fallback path much in my case.  What traffic pattern and protocol do you think
>> will cause the slow fallback path to happen often enough to make this patch not
>> helpful?
>>
>
> Is there a way to verify Herbert's assertion that TX and RX tend to be
> handled by the same core? I am not a networking guy, but that seems
> dubious to me.
>
> You could add a pr_warn_ratelimited() inside the fallback path and see
> if it ever gets called at all under various loads.

Even if it does sometimes use the same core, if performance is better and
CPU usage is lower, why would it even matter?

Anyway, looks like Herbert is dead set against this code in hopes that he
can force other subsystems to re-write their code.  If you come up with
some other variant that Herbert will accept, let me know and I'll test it.

Otherwise, I will just add your patch to my kernel and carry on.

Thanks,
Ben


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com
Ard Biesheuvel Dec. 15, 2020, 8:55 a.m. UTC | #17
(+ Eric)

TL;DR can we find a way to use synchronous SIMD skciphers/aeads
without cryptd or scalar fallbacks

On Thu, 10 Dec 2020 at 13:19, Ard Biesheuvel <ardb@kernel.org> wrote:
>
> On Thu, 10 Dec 2020 at 13:16, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> >
> > On Thu, Dec 10, 2020 at 01:03:56PM +0100, Ard Biesheuvel wrote:
> > >
> > > But we should probably start policing this a bit more. For instance, we now have
> > >
> > > drivers/net/macsec.c:
> > >
> > > /* Pick a sync gcm(aes) cipher to ensure order is preserved. */
> > > tfm = crypto_alloc_aead("gcm(aes)", 0, CRYPTO_ALG_ASYNC);
> > >
> > > (btw the comment is bogus, right?)
> > >
> > > TLS_SW does the same thing in net/tls/tls_device_fallback.c.
> >
> > Short of us volunteering to write code for every user out there
> > I don't see a way out.
> >
> > > Async is obviously needed for h/w accelerators, but could we perhaps
> > > do better for s/w SIMD algorithms? Those are by far the most widely
> > > used ones.
> >
> > If you can come up with a way that avoids the cryptd model without
> > using a fallback obviously that would be the ultimate solution.
> >
>
> Could we disable softirq handling in these regions?

I have been looking into this a bit, and I wonder if we might consider
doing the following:
- forbid synchronous skcipher/aead encrypt/decrypt calls from any
other context than task or softirq (insofar this is not already the
case)
- limit kernel mode SIMD in general to task or softirq context
- reduce the scope for simd begin/end blocks, which is better for
PREEMPT in any case, and no longer results in a performance hit on x86
as it did before, now that we have lazy restore for the userland FPU
state
- disable softirq processing when enabling kernel mode SIMD

This way, we don't need a scalar fallback at all, given that any SIMD
use in softirq context is guaranteed to occur when the SIMD registers
are dead from the task's pov.

So the question is then how granular these kernel mode SIMD regions
need to be to avoid excessive latencies in softirq handling.

I think this could also be an opportunity for a bit more alignment
between architectures on this topic.

-- 
Ard.
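
A minimal sketch of what such a model could look like on x86, with hypothetical
helper names (this is a proposal sketch, not an existing API):

#include <linux/bottom_half.h>
#include <asm/fpu/api.h>

/* SIMD regions are only legal in task or softirq context, and softirq
 * processing is held off for their duration, so a softirq can never
 * find the SIMD registers live and no scalar fallback is needed. */
static inline void crypto_simd_begin(void)
{
	WARN_ON_ONCE(in_hardirq());	/* task or softirq context only */
	local_bh_disable();		/* no softirqs inside the region */
	kernel_fpu_begin();
}

static inline void crypto_simd_end(void)
{
	kernel_fpu_end();
	local_bh_enable();
}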
Herbert Xu Dec. 15, 2020, 9:19 a.m. UTC | #18
On Tue, Dec 15, 2020 at 09:55:37AM +0100, Ard Biesheuvel wrote:
>
> So the question is then how granular these kernel mode SIMD regions
> need to be to avoid excessive latencies in softirq handling.

Can you get some real world numbers on what the latency is like?

Then we could take it to the scheduler folks and see if they're
OK with it.

Thanks,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Ben Greear Nov. 8, 2022, 6:50 p.m. UTC | #19
On 12/15/20 1:19 AM, Herbert Xu wrote:
> On Tue, Dec 15, 2020 at 09:55:37AM +0100, Ard Biesheuvel wrote:
>>
>> So the question is then how granular these kernel mode SIMD regions
>> need to be to avoid excessive latencies in softirq handling.
> 
> Can you get some real world numbers on what the latency is like?
> 
> Then we could take it to the scheduler folks and see if they're
> OK with it.
> 
> Thanks,
> 

Hello,

While rebasing my patches onto 6.1-rc4, I noticed my aesni for ccm(aes) patch didn't apply cleanly,
and I found this patch described below is applied now.  Does this upstream patch mean that aesni is already
supported upstream now?  Or is it specific to whatever xctr is?  If so,
any chance the patch is wanted upstream now?

commit fd94fcf09957a75e25941f7dbfc84d30a63817ac
Author: Nathan Huckleberry <nhuck@google.com>
Date:   Fri May 20 18:14:56 2022 +0000

     crypto: x86/aesni-xctr - Add accelerated implementation of XCTR

     Add hardware accelerated version of XCTR for x86-64 CPUs with AESNI
     support.

     More information on XCTR can be found in the HCTR2 paper:
     "Length-preserving encryption with HCTR2":
     https://eprint.iacr.org/2021/1441.pdf

     Signed-off-by: Nathan Huckleberry <nhuck@google.com>
     Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
     Reviewed-by: Eric Biggers <ebiggers@google.com>
     Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Thanks,
Ben
Herbert Xu Nov. 9, 2022, 3:52 a.m. UTC | #20
On Tue, Nov 08, 2022 at 10:50:48AM -0800, Ben Greear wrote:
>
> While rebasing my patches onto 6.1-rc4, I noticed my aesni for ccm(aes) patch didn't apply cleanly,
> and I found this patch described below is applied now.  Does this upstream patch mean that aesni is already
> supported upstream now?  Or is it specific to whatever xctr is?  If so,
> any chance the patch is wanted upstream now?

AFAICS the xctr patch has nothing to do with what you were trying
to achieve with wireless.  My objection still stands with regards
to wireless, we should patch wireless to use the async crypto
interface and not hack around it in the Crypto API.

Cheers,
Ard Biesheuvel Nov. 9, 2022, 10:05 a.m. UTC | #21
On Wed, 9 Nov 2022 at 04:52, Herbert Xu <herbert@gondor.apana.org.au> wrote:
>
> On Tue, Nov 08, 2022 at 10:50:48AM -0800, Ben Greear wrote:
> >
> > While rebasing my patches onto 6.1-rc4, I noticed my aesni for ccm(aes) patch didn't apply cleanly,
> > and I found this patch described below is applied now.  Does this upstream patch mean that aesni is already
> > supported upstream now?  Or is it specific to whatever xctr is?  If so,
> > any chance the patch is wanted upstream now?
>
> AFAICS the xctr patch has nothing to do with what you were trying
> to achieve with wireless.  My objection still stands with regards
> to wireless, we should patch wireless to use the async crypto
> interface and not hack around it in the Crypto API.
>

Indeed. Those are just add/add conflicts because both patches
introduce new code into the same set of files. The resolution is
generally to keep both sides.

As for Herbert's objection: I will note here that in the meantime,
arm64 now has gotten rid of the scalar fallbacks entirely in AEAD and
skcipher implementations, because those are only callable in task or
softirq context, and the arm64 SIMD wrappers now disable softirq
processing. This means that the condition that results in the fallback
being needed can no longer occur, making the SIMD helper dead code on
arm64.

I suppose we might do the same thing on x86, but since the kernel mode
SIMD handling is highly arch specific, you'd really need to raise this
with the x86 maintainers.
Ben Greear Nov. 9, 2022, 2:12 p.m. UTC | #22
On 11/9/22 2:05 AM, Ard Biesheuvel wrote:
> On Wed, 9 Nov 2022 at 04:52, Herbert Xu <herbert@gondor.apana.org.au> wrote:
>>
>> On Tue, Nov 08, 2022 at 10:50:48AM -0800, Ben Greear wrote:
>>>
>>> While rebasing my patches onto 6.1-rc4, I noticed my aesni for ccm(aes) patch didn't apply cleanly,
>>> and I found this patch described below is applied now.  Does this upstream patch mean that aesni is already
>>> supported upstream now?  Or is it specific to whatever xctr is?  If so,
>>> any chance the patch is wanted upstream now?
>>
>> AFAICS the xctr patch has nothing to do with what you were trying
>> to achieve with wireless.  My objection still stands with regards
>> to wireless, we should patch wireless to use the async crypto
>> interface and not hack around it in the Crypto API.
>>
> 
> Indeed. Those are just add/add conflicts because both patches
> introduce new code into the same set of files. The resolution is
> generally to keep both sides.
> 
> As for Herbert's objection: I will note here that in the meantime,
> arm64 now has gotten rid of the scalar fallbacks entirely in AEAD and
> skipcher implementations, because those are only callable in task or
> softirq context, and the arm64 SIMD wrappers now disable softirq
> processing. This means that the condition that results in the fallback
> being needed can no longer occur, making the SIMD helper dead code on
> arm64.
> 
> I suppose we might do the same thing on x86, but since the kernel mode
> SIMD handling is highly arch specific, you'd really need to raise this
> with the x86 maintainers.

The wifi stack is unlikely to ever change in this regard, so I hope that it does not
become too much more trouble to keep this patch alive.

I made an attempt at the merge, but the error path cleanup looks tangled to me;
I'll post a patch for review once I get my tree cleaned up a bit from the rebase.

Thanks,
Ben
Ben Greear Nov. 11, 2022, 10:29 p.m. UTC | #23
On 11/9/22 2:05 AM, Ard Biesheuvel wrote:
> On Wed, 9 Nov 2022 at 04:52, Herbert Xu <herbert@gondor.apana.org.au> wrote:
>>
>> On Tue, Nov 08, 2022 at 10:50:48AM -0800, Ben Greear wrote:
>>>
>>> While rebasing my patches onto 6.1-rc4, I noticed my aesni for ccm(aes) patch didn't apply cleanly,
>>> and I found this patch described below is applied now.  Does this upstream patch mean that aesni is already
>>> supported upstream now?  Or is it specific to whatever xctr is?  If so,
>>> any chance the patch is wanted upstream now?
>>
>> AFAICS the xctr patch has nothing to do with what you were trying
>> to achieve with wireless.  My objection still stands with regards
>> to wireless, we should patch wireless to use the async crypto
>> interface and not hack around it in the Crypto API.
>>
> 
> Indeed. Those are just add/add conflicts because both patches
> introduce new code into the same set of files. The resolution is
> generally to keep both sides.
> 
> As for Herbert's objection: I will note here that in the meantime,
> arm64 now has gotten rid of the scalar fallbacks entirely in AEAD and
> skipcher implementations, because those are only callable in task or
> softirq context, and the arm64 SIMD wrappers now disable softirq
> processing. This means that the condition that results in the fallback
> being needed can no longer occur, making the SIMD helper dead code on
> arm64.
> 
> I suppose we might do the same thing on x86, but since the kernel mode
> SIMD handling is highly arch specific, you'd really need to raise this
> with the x86 maintainers.
> 

Hello Ard,

Could you please review the attached patch to make sure I merged it properly?  My concern
is the cleanup section and/or some problems I might have introduced related to the similarly
named code that was added upstream.

Thanks,
Ben
Ard Biesheuvel Nov. 12, 2022, 2:59 p.m. UTC | #24
On Fri, 11 Nov 2022 at 23:29, Ben Greear <greearb@candelatech.com> wrote:
>
> On 11/9/22 2:05 AM, Ard Biesheuvel wrote:
> > On Wed, 9 Nov 2022 at 04:52, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> >>
> >> On Tue, Nov 08, 2022 at 10:50:48AM -0800, Ben Greear wrote:
> >>>
> >>> While rebasing my patches onto 6.1-rc4, I noticed my aesni for ccm(aes) patch didn't apply cleanly,
> >>> and I found this patch described below is applied now.  Does this upstream patch mean that aesni is already
> >>> supported upstream now?  Or is it specific to whatever xctr is?  If so,
> >>> any chance the patch is wanted upstream now?
> >>
> >> AFAICS the xctr patch has nothing to do with what you were trying
> >> to achieve with wireless.  My objection still stands with regards
> >> to wireless, we should patch wireless to use the async crypto
> >> interface and not hack around it in the Crypto API.
> >>
> >
> > Indeed. Those are just add/add conflicts because both patches
> > introduce new code into the same set of files. The resolution is
> > generally to keep both sides.
> >
> > As for Herbert's objection: I will note here that in the meantime,
> > arm64 now has gotten rid of the scalar fallbacks entirely in AEAD and
> > skipcher implementations, because those are only callable in task or
> > softirq context, and the arm64 SIMD wrappers now disable softirq
> > processing. This means that the condition that results in the fallback
> > being needed can no longer occur, making the SIMD helper dead code on
> > arm64.
> >
> > I suppose we might do the same thing on x86, but since the kernel mode
> > SIMD handling is highly arch specific, you'd really need to raise this
> > with the x86 maintainers.
> >
>
> Hello Ard,
>
> Could you please review the attached patch to make sure I merged it properly?  My concern
> is the cleanup section and/or some problems I might have introduced related to the similarly
> named code that was added upstream.
>

I don't think the logic is quite right, although it rarely matters.

I've pushed my version here - it invokes the static call for CTR so it
will use the faster AVX version if the CPU supports it.

https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/?h=aesni-ccm-v6.1
Ben Greear Oct. 16, 2023, 8:50 p.m. UTC | #25
On 11/12/22 06:59, Ard Biesheuvel wrote:
> On Fri, 11 Nov 2022 at 23:29, Ben Greear <greearb@candelatech.com> wrote:
>>
>> On 11/9/22 2:05 AM, Ard Biesheuvel wrote:
>>> On Wed, 9 Nov 2022 at 04:52, Herbert Xu <herbert@gondor.apana.org.au> wrote:
>>>>
>>>> On Tue, Nov 08, 2022 at 10:50:48AM -0800, Ben Greear wrote:
>>>>>
>>>>> While rebasing my patches onto 6.1-rc4, I noticed my aesni for ccm(aes) patch didn't apply cleanly,
>>>>> and I found this patch described below is applied now.  Does this upstream patch mean that aesni is already
>>>>> supported upstream now?  Or is it specific to whatever xctr is?  If so,
>>>>> any chance the patch is wanted upstream now?
>>>>
>>>> AFAICS the xctr patch has nothing to do with what you were trying
>>>> to achieve with wireless.  My objection still stands with regards
>>>> to wireless, we should patch wireless to use the async crypto
>>>> interface and not hack around it in the Crypto API.
>>>>
>>>
>>> Indeed. Those are just add/add conflicts because both patches
>>> introduce new code into the same set of files. The resolution is
>>> generally to keep both sides.
>>>
>>> As for Herbert's objection: I will note here that in the meantime,
>>> arm64 now has gotten rid of the scalar fallbacks entirely in AEAD and
>>> skcipher implementations, because those are only callable in task or
>>> softirq context, and the arm64 SIMD wrappers now disable softirq
>>> processing. This means that the condition that results in the fallback
>>> being needed can no longer occur, making the SIMD helper dead code on
>>> arm64.
>>>
>>> I suppose we might do the same thing on x86, but since the kernel mode
>>> SIMD handling is highly arch specific, you'd really need to raise this
>>> with the x86 maintainers.
>>>
>>
>> Hello Ard,
>>
>> Could you please review the attached patch to make sure I merged it properly?  My concern
>> is the cleanup section and/or some problems I might have introduced related to the similarly
>> named code that was added upstream.
>>
> 
> I don't think the logic is quite right, although it rarely matters.
> 
> I've pushed my version here - it invokes the static call for CTR so it
> will use the faster AVX version if the CPU supports it.
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/?h=aesni-ccm-v6.1

Hello Ard,

It looks like something changed again in the intel-aesni logic for the 6.6 kernel.  I was able to make a small
change to the patch to get it to compile, but the kernel crashes when I bring up a wlan port in
6.6.  When I remove the aesni patch, the station comes up without crashing.  The aesni patch worked
fine in 6.5 as far as I can tell.

I'm attaching my slightly modified version of the patch you sent previous.  If you have time to
investigate this it would be much appreciated.

Thanks,
Ben
Eric Biggers Oct. 17, 2023, 3:16 a.m. UTC | #26
On Mon, Oct 16, 2023 at 01:50:05PM -0700, Ben Greear wrote:
> On 11/12/22 06:59, Ard Biesheuvel wrote:
> > On Fri, 11 Nov 2022 at 23:29, Ben Greear <greearb@candelatech.com> wrote:
> > > 
> > > On 11/9/22 2:05 AM, Ard Biesheuvel wrote:
> > > > On Wed, 9 Nov 2022 at 04:52, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> > > > > 
> > > > > On Tue, Nov 08, 2022 at 10:50:48AM -0800, Ben Greear wrote:
> > > > > > 
> > > > > > While rebasing my patches onto 6.1-rc4, I noticed my aesni for ccm(aes) patch didn't apply cleanly,
> > > > > > and I found this patch described below is applied now.  Does this upstream patch mean that aesni is already
> > > > > > supported upstream now?  Or is it specific to whatever xctr is?  If so,
> > > > > > any chance the patch is wanted upstream now?
> > > > > 
> > > > > AFAICS the xctr patch has nothing to do with what you were trying
> > > > > to achieve with wireless.  My objection still stands with regards
> > > > > to wireless, we should patch wireless to use the async crypto
> > > > > interface and not hack around it in the Crypto API.
> > > > > 
> > > > 
> > > > Indeed. Those are just add/add conflicts because both patches
> > > > introduce new code into the same set of files. The resolution is
> > > > generally to keep both sides.
> > > > 
> > > > As for Herbert's objection: I will note here that in the meantime,
> > > > arm64 now has gotten rid of the scalar fallbacks entirely in AEAD and
> > > > skipcher implementations, because those are only callable in task or
> > > > softirq context, and the arm64 SIMD wrappers now disable softirq
> > > > processing. This means that the condition that results in the fallback
> > > > being needed can no longer occur, making the SIMD helper dead code on
> > > > arm64.
> > > > 
> > > > I suppose we might do the same thing on x86, but since the kernel mode
> > > > SIMD handling is highly arch specific, you'd really need to raise this
> > > > with the x86 maintainers.
> > > > 
> > > 
> > > Hello Ard,
> > > 
> > > Could you please review the attached patch to make sure I merged it properly?  My concern
> > > is the cleanup section and/or some problems I might have introduced related to the similarly
> > > named code that was added upstream.
> > > 
> > 
> > I don't think the logic is quite right, although it rarely matters.
> > 
> > I've pushed my version here - it invokes the static call for CTR so it
> > will use the faster AVX version if the CPU supports it.
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/?h=aesni-ccm-v6.1
> 
> Hello Ard,
> 
> It looks like something changed again in the intel-aesni logic for the 6.6 kernel.  I was able to make a small
> change to the patch to get it to compile, but the kernel crashes when I bring up a wlan port in
> 6.6.  When I remove the aesni patch, the station comes up without crashing.  The aesni patch worked
> fine in 6.5 as far as I can tell.
> 
> I'm attaching my slightly modified version of the patch you sent previous.  If you have time to
> investigate this it would be much appreciated.
> 
> Thanks,
> Ben

If this patch is useful, shouldn't it be upstreamed?

- Eric
Ard Biesheuvel Oct. 17, 2023, 6:43 a.m. UTC | #27
On Tue, 17 Oct 2023 at 05:16, Eric Biggers <ebiggers@kernel.org> wrote:
>
> On Mon, Oct 16, 2023 at 01:50:05PM -0700, Ben Greear wrote:
> > On 11/12/22 06:59, Ard Biesheuvel wrote:
> > > On Fri, 11 Nov 2022 at 23:29, Ben Greear <greearb@candelatech.com> wrote:
> > > >
> > > > On 11/9/22 2:05 AM, Ard Biesheuvel wrote:
> > > > > On Wed, 9 Nov 2022 at 04:52, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> > > > > >
> > > > > > On Tue, Nov 08, 2022 at 10:50:48AM -0800, Ben Greear wrote:
> > > > > > >
> > > > > > > While rebasing my patches onto 6.1-rc4, I noticed my aesni for ccm(aes) patch didn't apply cleanly,
> > > > > > > and I found this patch described below is applied now.  Does this upstream patch mean that aesni is already
> > > > > > > supported upstream now?  Or is it specific to whatever xctr is?  If so,
> > > > > > > any chance the patch is wanted upstream now?
> > > > > >
> > > > > > AFAICS the xctr patch has nothing to do with what you were trying
> > > > > > to achieve with wireless.  My objection still stands with regards
> > > > > > to wireless, we should patch wireless to use the async crypto
> > > > > > interface and not hack around it in the Crypto API.
> > > > > >
> > > > >
> > > > > Indeed. Those are just add/add conflicts because both patches
> > > > > introduce new code into the same set of files. The resolution is
> > > > > generally to keep both sides.
> > > > >
> > > > > As for Herbert's objection: I will note here that in the meantime,
> > > > > arm64 now has gotten rid of the scalar fallbacks entirely in AEAD and
> > > > > skcipher implementations, because those are only callable in task or
> > > > > softirq context, and the arm64 SIMD wrappers now disable softirq
> > > > > processing. This means that the condition that results in the fallback
> > > > > being needed can no longer occur, making the SIMD helper dead code on
> > > > > arm64.
> > > > >
> > > > > I suppose we might do the same thing on x86, but since the kernel mode
> > > > > SIMD handling is highly arch specific, you'd really need to raise this
> > > > > with the x86 maintainers.
> > > > >
> > > >
> > > > Hello Ard,
> > > >
> > > > Could you please review the attached patch to make sure I merged it properly?  My concern
> > > > is the cleanup section and/or some problems I might have introduced related to the similarly
> > > > named code that was added upstream.
> > > >
> > >
> > > I don't think the logic is quite right, although it rarely matters.
> > >
> > > I've pushed my version here - it invokes the static call for CTR so it
> > > will use the faster AVX version if the CPU supports it.
> > >
> > > https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/?h=aesni-ccm-v6.1
> >
> > Hello Ard,
> >
> > It looks like something changed again in the intel-aesni logic for the 6.6 kernel.  I was able to make a small
> > change to the patch to get it to compile, but the kernel crashes when I bring up a wlan port in
> > 6.6.  When I remove the aesni patch, the station comes up without crashing.  The aesni patch worked
> > fine in 6.5 as far as I can tell.
> >
> > I'm attaching my slightly modified version of the patch you sent previous.  If you have time to
> > investigate this it would be much appreciated.
> >
> > Thanks,
> > Ben
>
> If this patch is useful, shouldn't it be upstreamed?
>

It was rejected by Herbert on the basis that the wireless stack should
be converted to use the async API instead.
Herbert Xu Oct. 18, 2023, 1:24 a.m. UTC | #28
On Tue, Oct 17, 2023 at 08:43:29AM +0200, Ard Biesheuvel wrote:
>
> It was rejected by Herbert on the basis that the wireless stack should
> be converted to use the async API instead.

Yes, there is no reason to special-case this one algorithm when
everything else uses the simd model.  Of course, if we can support
simd in softirqs that would be the best outcome.

Cheers,

Patch

diff --git a/arch/x86/crypto/aesni-intel_glue.c b/arch/x86/crypto/aesni-intel_glue.c
index ad8a7188a2bf..d90b03d9b420 100644
--- a/arch/x86/crypto/aesni-intel_glue.c
+++ b/arch/x86/crypto/aesni-intel_glue.c
@@ -97,12 +97,13 @@  asmlinkage void aesni_cbc_dec(struct crypto_aes_ctx *ctx, u8 *out,
 #define AVX_GEN2_OPTSIZE 640
 #define AVX_GEN4_OPTSIZE 4096
 
+asmlinkage void aesni_ctr_enc(struct crypto_aes_ctx *ctx, u8 *out,
+			      const u8 *in, unsigned int len, u8 *iv);
+
 #ifdef CONFIG_X86_64
 
 static void (*aesni_ctr_enc_tfm)(struct crypto_aes_ctx *ctx, u8 *out,
 			      const u8 *in, unsigned int len, u8 *iv);
-asmlinkage void aesni_ctr_enc(struct crypto_aes_ctx *ctx, u8 *out,
-			      const u8 *in, unsigned int len, u8 *iv);
 
 asmlinkage void aesni_xts_crypt8(const struct crypto_aes_ctx *ctx, u8 *out,
 				 const u8 *in, bool enc, le128 *iv);
@@ -454,6 +455,377 @@  static int cbc_decrypt(struct skcipher_request *req)
 	return err;
 }
 
+static int aesni_ccm_setkey(struct crypto_aead *tfm, const u8 *in_key,
+			    unsigned int key_len)
+{
+	struct crypto_aes_ctx *ctx = crypto_aead_ctx(tfm);
+
+	return aes_set_key_common(crypto_aead_tfm(tfm), ctx, in_key, key_len);
+}
+
+static int aesni_ccm_setauthsize(struct crypto_aead *tfm, unsigned int authsize)
+{
+	if ((authsize & 1) || authsize < 4)
+		return -EINVAL;
+	return 0;
+}
+
+static int ccm_set_msg_len(u8 *block, unsigned int msglen, int csize)
+{
+	__be32 data;
+
+	memset(block, 0, csize);
+	block += csize;
+
+	if (csize >= 4)
+		csize = 4;
+	else if (msglen > (1 << (8 * csize)))
+		return -EOVERFLOW;
+
+	data = cpu_to_be32(msglen);
+	memcpy(block - csize, (u8 *)&data + 4 - csize, csize);
+
+	return 0;
+}
+
+static int ccm_init_mac(struct aead_request *req, u8 maciv[], u32 msglen)
+{
+	struct crypto_aead *aead = crypto_aead_reqtfm(req);
+	__be32 *n = (__be32 *)&maciv[AES_BLOCK_SIZE - 8];
+	u32 l = req->iv[0] + 1;
+
+	/* verify that CCM dimension 'L' is set correctly in the IV */
+	if (l < 2 || l > 8)
+		return -EINVAL;
+
+	/* verify that msglen can in fact be represented in L bytes */
+	if (l < 4 && msglen >> (8 * l))
+		return -EOVERFLOW;
+
+	/*
+	 * Even if the CCM spec allows L values of up to 8, the Linux cryptoapi
+	 * uses a u32 type to represent msglen so the top 4 bytes are always 0.
+	 */
+	n[0] = 0;
+	n[1] = cpu_to_be32(msglen);
+
+	memcpy(maciv, req->iv, AES_BLOCK_SIZE - l);
+
+	/*
+	 * Meaning of byte 0 according to CCM spec (RFC 3610/NIST 800-38C)
+	 * - bits 0..2	: max # of bytes required to represent msglen, minus 1
+	 *                (already set by caller)
+	 * - bits 3..5	: size of auth tag (1 => 4 bytes, 2 => 6 bytes, etc)
+	 * - bit 6	: indicates presence of authenticate-only data
+	 */
+	maciv[0] |= (crypto_aead_authsize(aead) - 2) << 2;
+	if (req->assoclen)
+		maciv[0] |= 0x40;
+
+	memset(&req->iv[AES_BLOCK_SIZE - l], 0, l);
+	return ccm_set_msg_len(maciv + AES_BLOCK_SIZE - l, msglen, l);
+}
+
+static int compute_mac(struct crypto_aes_ctx *ctx, u8 mac[], u8 *data, int n,
+		       unsigned int ilen, u8 *idata, bool do_simd)
+{
+	unsigned int bs = AES_BLOCK_SIZE;
+	u8 *odata = mac;
+	int datalen, getlen;
+
+	datalen = n;
+
+	/* first time in here, block may be partially filled. */
+	getlen = bs - ilen;
+	if (datalen >= getlen) {
+		memcpy(idata + ilen, data, getlen);
+
+		if (likely(do_simd)) {
+			aesni_cbc_enc(ctx, odata, idata, AES_BLOCK_SIZE, odata);
+		} else {
+			crypto_xor(odata, idata, AES_BLOCK_SIZE);
+			aes_encrypt(ctx, odata, odata);
+		}
+
+		datalen -= getlen;
+		data += getlen;
+		ilen = 0;
+	}
+
+	/* now encrypt rest of data */
+	while (datalen >= bs) {
+		if (likely(do_simd)) {
+			aesni_cbc_enc(ctx, odata, data, AES_BLOCK_SIZE, odata);
+		} else {
+			crypto_xor(odata, data, AES_BLOCK_SIZE);
+			aes_encrypt(ctx, odata, odata);
+		}
+
+		datalen -= bs;
+		data += bs;
+	}
+
+	/* check and see if there's leftover data that wasn't
+	 * enough to fill a block.
+	 */
+	if (datalen) {
+		memcpy(idata + ilen, data, datalen);
+		ilen += datalen;
+	}
+	return ilen;
+}
+
+static void ccm_calculate_auth_mac(struct aead_request *req,
+				   struct crypto_aes_ctx *ctx, u8 mac[],
+				   struct scatterlist *src,
+				   bool do_simd)
+{
+	unsigned int len = req->assoclen;
+	struct scatter_walk walk;
+	u8 idata[AES_BLOCK_SIZE];
+	unsigned int ilen;
+	struct {
+		__be16 l;
+		__be32 h;
+	} __packed *ltag = (void *)idata;
+
+	/* prepend the AAD with a length tag */
+	if (len < 0xff00) {
+		ltag->l = cpu_to_be16(len);
+		ilen = 2;
+	} else  {
+		ltag->l = cpu_to_be16(0xfffe);
+		ltag->h = cpu_to_be32(len);
+		ilen = 6;
+	}
+
+	scatterwalk_start(&walk, src);
+
+	while (len) {
+		u8 *src;
+		int n;
+
+		n = scatterwalk_clamp(&walk, len);
+		if (!n) {
+			scatterwalk_start(&walk, sg_next(walk.sg));
+			n = scatterwalk_clamp(&walk, len);
+		}
+		src = scatterwalk_map(&walk);
+
+		ilen = compute_mac(ctx, mac, src, n, ilen, idata, do_simd);
+		len -= n;
+
+		scatterwalk_unmap(src);
+		scatterwalk_advance(&walk, n);
+		scatterwalk_done(&walk, 0, len);
+	}
+
+	/* any leftover needs padding and then encrypted */
+	if (ilen) {
+		crypto_xor(mac, idata, ilen);
+		if (likely(do_simd))
+			aesni_enc(ctx, mac, mac);
+		else
+			aes_encrypt(ctx, mac, mac);
+	}
+}
+
+static int aesni_ccm_encrypt(struct aead_request *req)
+{
+	struct crypto_aead *aead = crypto_aead_reqtfm(req);
+	struct crypto_aes_ctx *ctx = aes_ctx(crypto_aead_ctx(aead));
+	bool const do_simd = crypto_simd_usable();
+	u8 __aligned(8) mac[AES_BLOCK_SIZE];
+	u8 __aligned(8) buf[AES_BLOCK_SIZE];
+	struct skcipher_walk walk;
+	u32 l = req->iv[0] + 1;
+	int err;
+
+	err = ccm_init_mac(req, mac, req->cryptlen);
+	if (err)
+		return err;
+
+	if (likely(do_simd)) {
+		kernel_fpu_begin();
+		aesni_enc(ctx, mac, mac);
+	} else {
+		aes_encrypt(ctx, mac, mac);
+	}
+
+	if (req->assoclen)
+		ccm_calculate_auth_mac(req, ctx, mac, req->src, do_simd);
+
+	req->iv[AES_BLOCK_SIZE - 1] = 0x1;
+	err = skcipher_walk_aead_encrypt(&walk, req, true);
+
+	while (walk.nbytes >= AES_BLOCK_SIZE) {
+		int len = walk.nbytes & AES_BLOCK_MASK;
+		int n;
+
+		for (n = 0; n < len; n += AES_BLOCK_SIZE) {
+			if (likely(do_simd)) {
+				aesni_cbc_enc(ctx, mac, walk.src.virt.addr + n,
+					      AES_BLOCK_SIZE, mac);
+			} else {
+				crypto_xor(mac, walk.src.virt.addr + n,
+					   AES_BLOCK_SIZE);
+				aes_encrypt(ctx, mac, mac);
+
+				aes_encrypt(ctx, buf, walk.iv);
+				crypto_inc(walk.iv, AES_BLOCK_SIZE);
+				crypto_xor_cpy(walk.dst.virt.addr + n,
+					       walk.src.virt.addr + n,
+					       buf, AES_BLOCK_SIZE);
+			}
+		}
+		if (likely(do_simd))
+			aesni_ctr_enc(ctx, walk.dst.virt.addr,
+				      walk.src.virt.addr, len, walk.iv);
+
+		err = skcipher_walk_done(&walk, walk.nbytes & ~AES_BLOCK_MASK);
+	}
+	if (walk.nbytes) {
+		if (likely(do_simd))
+			aesni_enc(ctx, buf, walk.iv);
+		else
+			aes_encrypt(ctx, buf, walk.iv);
+
+		crypto_xor(mac, walk.src.virt.addr, walk.nbytes);
+		crypto_xor_cpy(walk.dst.virt.addr, walk.src.virt.addr,
+			       buf, walk.nbytes);
+
+		if (likely(do_simd))
+			aesni_enc(ctx, mac, mac);
+		else
+			aes_encrypt(ctx, mac, mac);
+
+		err = skcipher_walk_done(&walk, 0);
+	}
+
+	if (err)
+		goto fail;
+
+	memset(walk.iv + AES_BLOCK_SIZE - l, 0, l);
+
+	if (likely(do_simd)) {
+		aesni_ctr_enc(ctx, mac, mac, AES_BLOCK_SIZE, walk.iv);
+	} else {
+		aes_encrypt(ctx, buf, walk.iv);
+		crypto_xor(mac, buf, AES_BLOCK_SIZE);
+	}
+
+	/* copy authtag to end of dst */
+	scatterwalk_map_and_copy(mac, req->dst, req->assoclen + req->cryptlen,
+				 crypto_aead_authsize(aead), 1);
+
+fail:
+	if (likely(do_simd))
+		kernel_fpu_end();
+	return err;
+}
+
+static int aesni_ccm_decrypt(struct aead_request *req)
+{
+	struct crypto_aead *aead = crypto_aead_reqtfm(req);
+	struct crypto_aes_ctx *ctx = aes_ctx(crypto_aead_ctx(aead));
+	unsigned int authsize = crypto_aead_authsize(aead);
+	bool const do_simd = crypto_simd_usable();
+	u8 __aligned(8) mac[AES_BLOCK_SIZE];
+	u8 __aligned(8) tag[AES_BLOCK_SIZE];
+	u8 __aligned(8) buf[AES_BLOCK_SIZE];
+	struct skcipher_walk walk;
+	u32 l = req->iv[0] + 1;
+	int err;
+
+	err = ccm_init_mac(req, mac, req->cryptlen - authsize);
+	if (err)
+		return err;
+
+	/* copy authtag from end of src */
+	scatterwalk_map_and_copy(tag, req->src,
+				 req->assoclen + req->cryptlen - authsize,
+				 authsize, 0);
+
+	if (likely(do_simd)) {
+		kernel_fpu_begin();
+		aesni_enc(ctx, mac, mac);
+	} else {
+		aes_encrypt(ctx, mac, mac);
+	}
+
+	if (req->assoclen)
+		ccm_calculate_auth_mac(req, ctx, mac, req->src, do_simd);
+
+	req->iv[AES_BLOCK_SIZE - 1] = 0x1;
+	err = skcipher_walk_aead_decrypt(&walk, req, true);
+
+	while (walk.nbytes >= AES_BLOCK_SIZE) {
+		int len = walk.nbytes & AES_BLOCK_MASK;
+		int n;
+
+		if (likely(do_simd))
+			aesni_ctr_enc(ctx, walk.dst.virt.addr,
+				      walk.src.virt.addr, len, walk.iv);
+
+		for (n = 0; n < len; n += AES_BLOCK_SIZE) {
+			if (likely(do_simd)) {
+				aesni_cbc_enc(ctx, mac, walk.dst.virt.addr + n,
+					      AES_BLOCK_SIZE, mac);
+			} else {
+				aes_encrypt(ctx, buf, walk.iv);
+				crypto_inc(walk.iv, AES_BLOCK_SIZE);
+				crypto_xor_cpy(walk.dst.virt.addr + n,
+					       walk.src.virt.addr + n,
+					       buf, AES_BLOCK_SIZE);
+
+				crypto_xor(mac, walk.dst.virt.addr + n,
+					   AES_BLOCK_SIZE);
+				aes_encrypt(ctx, mac, mac);
+			}
+		}
+
+		err = skcipher_walk_done(&walk, walk.nbytes & ~AES_BLOCK_MASK);
+	}
+	if (walk.nbytes) {
+		if (likely(do_simd))
+			aesni_enc(ctx, buf, walk.iv);
+		else
+			aes_encrypt(ctx, buf, walk.iv);
+
+		crypto_xor_cpy(walk.dst.virt.addr, walk.src.virt.addr,
+			       buf, walk.nbytes);
+		crypto_xor(mac, walk.dst.virt.addr, walk.nbytes);
+
+		if (likely(do_simd))
+			aesni_enc(ctx, mac, mac);
+		else
+			aes_encrypt(ctx, mac, mac);
+
+		err = skcipher_walk_done(&walk, 0);
+	}
+
+	if (err)
+		goto fail;
+
+	memset(walk.iv + AES_BLOCK_SIZE - l, 0, l);
+
+	if (likely(do_simd)) {
+		aesni_ctr_enc(ctx, mac, mac, AES_BLOCK_SIZE, walk.iv);
+	} else {
+		aes_encrypt(ctx, buf, walk.iv);
+		crypto_xor(mac, buf, AES_BLOCK_SIZE);
+	}
+
+	/* compare calculated auth tag with the stored one */
+	if (crypto_memneq(mac, tag, authsize))
+		err = -EBADMSG;
+
+fail:
+	if (likely(do_simd))
+		kernel_fpu_end();
+	return err;
+}
+
 #ifdef CONFIG_X86_64
 static void ctr_crypt_final(struct crypto_aes_ctx *ctx,
 			    struct skcipher_walk *walk)
@@ -1045,10 +1417,28 @@  static struct aead_alg aesni_aeads[] = { {
 		.cra_module		= THIS_MODULE,
 	},
 } };
+
 #else
 static struct aead_alg aesni_aeads[0];
 #endif
 
+static struct aead_alg aesni_ccm_aead = {
+	.base.cra_name		= "ccm(aes)",
+	.base.cra_driver_name	= "ccm-aesni",
+	.base.cra_priority	= 400,
+	.base.cra_blocksize	= 1,
+	.base.cra_ctxsize	= sizeof(struct crypto_aes_ctx),
+	.base.cra_module	= THIS_MODULE,
+
+	.setkey			= aesni_ccm_setkey,
+	.setauthsize		= aesni_ccm_setauthsize,
+	.encrypt		= aesni_ccm_encrypt,
+	.decrypt		= aesni_ccm_decrypt,
+	.ivsize			= AES_BLOCK_SIZE,
+	.chunksize		= AES_BLOCK_SIZE,
+	.maxauthsize		= AES_BLOCK_SIZE,
+};
+
 static struct simd_aead_alg *aesni_simd_aeads[ARRAY_SIZE(aesni_aeads)];
 
 static const struct x86_cpu_id aesni_cpu_id[] = {
@@ -1098,8 +1488,17 @@  static int __init aesni_init(void)
 	if (err)
 		goto unregister_skciphers;
 
+	if (IS_ENABLED(CONFIG_X86_64)) {
+		err = crypto_register_aead(&aesni_ccm_aead);
+		if (err)
+			goto unregister_aeads;
+	}
+
 	return 0;
 
+unregister_aeads:
+	simd_unregister_aeads(aesni_aeads, ARRAY_SIZE(aesni_aeads),
+			      aesni_simd_aeads);
 unregister_skciphers:
 	simd_unregister_skciphers(aesni_skciphers, ARRAY_SIZE(aesni_skciphers),
 				  aesni_simd_skciphers);
@@ -1110,6 +1509,9 @@  static int __init aesni_init(void)
 
 static void __exit aesni_exit(void)
 {
+	if (IS_ENABLED(CONFIG_X86_64))
+		crypto_unregister_aead(&aesni_ccm_aead);
+
 	simd_unregister_aeads(aesni_aeads, ARRAY_SIZE(aesni_aeads),
 			      aesni_simd_aeads);
 	simd_unregister_skciphers(aesni_skciphers, ARRAY_SIZE(aesni_skciphers),
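
For context, a hedged sketch of how a synchronous caller such as mac80211 would
exercise this aead, modelled on net/mac80211/aead_api.c (the function name is
illustrative; error handling trimmed):

#include <crypto/aead.h>
#include <linux/scatterlist.h>

/* Encrypt 'len' bytes of 'data' in place, with 'aad_len' bytes of AAD in
 * front of it and the CCM nonce/flags pre-assembled in the 16-byte b_0
 * block; the tag is written directly after the data. */
static int ccm_encrypt_sketch(struct crypto_aead *tfm, u8 *b_0,
			      u8 *aad, size_t aad_len, u8 *data, size_t len)
{
	struct scatterlist sg[3];
	struct aead_request *req;
	int err;

	req = aead_request_alloc(tfm, GFP_KERNEL);
	if (!req)
		return -ENOMEM;

	sg_init_table(sg, 3);
	sg_set_buf(&sg[0], aad, aad_len);
	sg_set_buf(&sg[1], data, len);
	sg_set_buf(&sg[2], data + len, crypto_aead_authsize(tfm));

	aead_request_set_callback(req, 0, NULL, NULL);	/* sync tfm */
	aead_request_set_crypt(req, sg, sg, len, b_0);
	aead_request_set_ad(req, aad_len);

	err = crypto_aead_encrypt(req);
	aead_request_free(req);
	return err;
}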