[v2,05/20] crypto: mips/chacha - import accelerated 32r2 code from Zinc

Message ID 20191002141713.31189-6-ard.biesheuvel@linaro.org
State New
Series crypto: crypto API library interfaces for WireGuard

Commit Message

Ard Biesheuvel Oct. 2, 2019, 2:16 p.m. UTC
This integrates the accelerated MIPS 32r2 implementation of ChaCha
into both the API and library interfaces of the kernel crypto stack.

The significance of this is that, in addition to becoming available
as an accelerated library implementation, it can also be used by
existing crypto API code such as Adiantum (for block encryption on
ultra low performance cores) or IPsec using chacha20poly1305. These
are use cases that have already opted into using the abstract crypto
API. In order to support Adiantum, the core assembler routine has
been adapted to take the round count as a function argument rather
than hardcoding it to 20.
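
The round count becomes a fifth parameter at the C level; the new
glue code declares the assembler entry point as

asmlinkage void chacha_mips(const u32 *state, u8 *dst, const u8 *src,
			    unsigned int bytes, int nrounds);

with callers passing 20 for [X]ChaCha20 and 12 for the XChaCha12
construction used by Adiantum.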

Co-developed-by: René van Dorst <opensource@vdorst.com>
Signed-off-by: René van Dorst <opensource@vdorst.com>
Co-developed-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>

---
 arch/mips/Makefile             |   2 +-
 arch/mips/crypto/Makefile      |   3 +
 arch/mips/crypto/chacha-core.S | 424 ++++++++++++++++++++
 arch/mips/crypto/chacha-glue.c | 161 ++++++++
 crypto/Kconfig                 |   6 +
 5 files changed, 595 insertions(+), 1 deletion(-)

-- 
2.20.1

Comments

Jason A. Donenfeld Oct. 4, 2019, 1:46 p.m. UTC | #1
On Wed, Oct 02, 2019 at 04:16:58PM +0200, Ard Biesheuvel wrote:
> This integrates the accelerated MIPS 32r2 implementation of ChaCha
> into both the API and library interfaces of the kernel crypto stack.
>
> The significance of this is that, in addition to becoming available
> as an accelerated library implementation, it can also be used by
> existing crypto API code such as Adiantum (for block encryption on
> ultra low performance cores) or IPsec using chacha20poly1305. These
> are use cases that have already opted into using the abstract crypto
> API. In order to support Adiantum, the core assembler routine has
> been adapted to take the round count as a function argument rather
> than hardcoding it to 20.

Could you resubmit this with first my original commit and then with your
changes on top? I'd like to see and be able to review exactly what's
changed. If I recall correctly, René and I were really starved for
registers and tried pretty hard to avoid spilling to the stack, so I'm
interested to learn how you crammed a bit more sauce in there.

I also wonder if maybe it'd be better to just leave this as is with 20
rounds, which it was previously optimized for, and just not do
accelerated Adiantum for MIPS. Android has long since given up on the
ISA entirely.
Ard Biesheuvel Oct. 4, 2019, 2:38 p.m. UTC | #2
On Fri, 4 Oct 2019 at 15:46, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
> On Wed, Oct 02, 2019 at 04:16:58PM +0200, Ard Biesheuvel wrote:
> > This integrates the accelerated MIPS 32r2 implementation of ChaCha
> > into both the API and library interfaces of the kernel crypto stack.
> >
> > The significance of this is that, in addition to becoming available
> > as an accelerated library implementation, it can also be used by
> > existing crypto API code such as Adiantum (for block encryption on
> > ultra low performance cores) or IPsec using chacha20poly1305. These
> > are use cases that have already opted into using the abstract crypto
> > API. In order to support Adiantum, the core assembler routine has
> > been adapted to take the round count as a function argument rather
> > than hardcoding it to 20.
>
> Could you resubmit this with first my original commit and then with your
> changes on top? I'd like to see and be able to review exactly what's
> changed. If I recall correctly, René and I were really starved for
> registers and tried pretty hard to avoid spilling to the stack, so I'm
> interested to learn how you crammed a bit more sauce in there.
>

The round count is passed via the fifth function parameter, so it is
already on the stack. Reloading it for every block doesn't sound like
a huge deal to me.
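
(Under the MIPS o32 calling convention, the first four arguments are
passed in $a0-$a3 and anything beyond that goes into the caller's
argument save area. So for a call such as

chacha_mips(state, dst, src, bytes, nrounds);

nrounds sits at 16($sp) on function entry, which is the slot the
routine now loads it from.)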

> I also wonder if maybe it'd be better to just leave this as is with 20
> rounds, which it was previously optimized for, and just not do
> accelerated Adiantum for MIPS. Android has long since given up on the
> ISA entirely.

Adiantum does not depend on Android - anyone running Linux on a MIPS
router can use it if they want encrypted storage.
Ard Biesheuvel Oct. 4, 2019, 2:38 p.m. UTC | #3
On Fri, 4 Oct 2019 at 16:38, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
>
> On Fri, 4 Oct 2019 at 15:46, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> >
> > On Wed, Oct 02, 2019 at 04:16:58PM +0200, Ard Biesheuvel wrote:
> > > This integrates the accelerated MIPS 32r2 implementation of ChaCha
> > > into both the API and library interfaces of the kernel crypto stack.
> > >
> > > The significance of this is that, in addition to becoming available
> > > as an accelerated library implementation, it can also be used by
> > > existing crypto API code such as Adiantum (for block encryption on
> > > ultra low performance cores) or IPsec using chacha20poly1305. These
> > > are use cases that have already opted into using the abstract crypto
> > > API. In order to support Adiantum, the core assembler routine has
> > > been adapted to take the round count as a function argument rather
> > > than hardcoding it to 20.
> >
> > Could you resubmit this with first my original commit and then with your
> > changes on top? I'd like to see and be able to review exactly what's
> > changed. If I recall correctly, René and I were really starved for
> > registers and tried pretty hard to avoid spilling to the stack, so I'm
> > interested to learn how you crammed a bit more sauce in there.
> >
>
> The round count is passed via the fifth function parameter, so it is
> already on the stack. Reloading it for every block doesn't sound like
> a huge deal to me.
>
> > I also wonder if maybe it'd be better to just leave this as is with 20
> > rounds, which it was previously optimized for, and just not do
> > accelerated Adiantum for MIPS. Android has long since given up on the
> > ISA entirely.
>
> Adiantum does not depend on Android - anyone running Linux on a MIPS
> router can use it if they want encrypted storage.

But to answer your first question: sure, I will split off the changes.
Jason A. Donenfeld Oct. 4, 2019, 2:59 p.m. UTC | #4
On Fri, Oct 4, 2019 at 4:44 PM Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
> The round count is passed via the fifth function parameter, so it is
> already on the stack. Reloading it for every block doesn't sound like
> a huge deal to me.

Please benchmark it to demonstrate that, if it really isn't a big deal. I
recall finding that memory accesses on common mips32r2 commodity
router hardware were extremely inefficient. The whole thing is designed
to minimize memory accesses, which are the primary bottleneck on that
platform.

Seems like this thing might be best deferred for after this all lands.
IOW, let's get this in with the 20-round original now, and later you
can submit a change for the 12-round variant, and René and I can spend time
dusting off our test rigs and seeing which strategy works best. I very
nearly tossed out a bunch of old router hardware last night when
cleaning up. Glad I saved it!
Ard Biesheuvel Oct. 4, 2019, 3:05 p.m. UTC | #5
On Fri, 4 Oct 2019 at 16:59, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
> On Fri, Oct 4, 2019 at 4:44 PM Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
> > The round count is passed via the fifth function parameter, so it is
> > already on the stack. Reloading it for every block doesn't sound like
> > a huge deal to me.
>
> Please benchmark it to demonstrate that, if it really isn't a big deal. I
> recall finding that memory accesses on common mips32r2 commodity
> router hardware were extremely inefficient. The whole thing is designed
> to minimize memory accesses, which are the primary bottleneck on that
> platform.
>

Reloading a single word from the stack each time we load, xor and
store 64 bytes of data from/to memory is highly unlikely to be
noticeable.

> Seems like this thing might be best deferred for after this all lands.
> IOW, let's get this in with the 20-round original now, and later you
> can submit a change for the 12-round variant, and René and I can spend time
> dusting off our test rigs and seeing which strategy works best. I very
> nearly tossed out a bunch of old router hardware last night when
> cleaning up. Glad I saved it!

I don't agree but I don't care deeply enough to argue about it :-)
René van Dorst Oct. 4, 2019, 3:15 p.m. UTC | #6
Hi Jason,

Quoting "Jason A. Donenfeld" <Jason@zx2c4.com>:

> On Fri, Oct 4, 2019 at 4:44 PM Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
>> The round count is passed via the fifth function parameter, so it is
>> already on the stack. Reloading it for every block doesn't sound like
>> a huge deal to me.
>
> Please benchmark it to demonstrate that, if it really isn't a big deal. I
> recall finding that memory accesses on common mips32r2 commodity
> router hardware were extremely inefficient. The whole thing is designed
> to minimize memory accesses, which are the primary bottleneck on that
> platform.

I also think it isn't a big deal, but I shall benchmark it this weekend.
If I am correct, a memory write is first put in the cache. So if you read
it again and it is still in the cache, it is very fast: 1 or 2 clock cycles.
Also, the value isn't used directly after it is read, so the CPU
doesn't have to stall on this read.

Greets,

René

>
> Seems like this thing might be best deferred for after this all lands.
> IOW, let's get this in with the 20-round original now, and later you
> can submit a change for the 12-round variant, and René and I can spend time
> dusting off our test rigs and seeing which strategy works best. I very
> nearly tossed out a bunch of old router hardware last night when
> cleaning up. Glad I saved it!
Ard Biesheuvel Oct. 4, 2019, 3:23 p.m. UTC | #7
On Fri, 4 Oct 2019 at 17:15, René van Dorst <opensource@vdorst.com> wrote:
>
> Hi Jason,
>
> Quoting "Jason A. Donenfeld" <Jason@zx2c4.com>:
>
> > On Fri, Oct 4, 2019 at 4:44 PM Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
> >> The round count is passed via the fifth function parameter, so it is
> >> already on the stack. Reloading it for every block doesn't sound like
> >> a huge deal to me.
> >
> > Please benchmark it to demonstrate that, if it really isn't a big deal. I
> > recall finding that memory accesses on common mips32r2 commodity
> > router hardware were extremely inefficient. The whole thing is designed
> > to minimize memory accesses, which are the primary bottleneck on that
> > platform.
>
> I also think it isn't a big deal, but I shall benchmark it this weekend.
> If I am correct, a memory write is first put in the cache. So if you read
> it again and it is still in the cache, it is very fast: 1 or 2 clock cycles.
> Also, the value isn't used directly after it is read, so the CPU
> doesn't have to stall on this read.
>

Thanks René.

Note that the round count is not being spilled. I [re]load it from the
stack as a function parameter.

So instead of

li $at, 20

I do

lw $at, 16($sp)


Thanks a lot for taking the time to double check this. I think it
would be nice to be able to expose xchacha12 like we do on other
architectures.

Note that for xchacha, I also added a hchacha_block() routine based on
your code (with the round count as the third argument) [0]. Please let
me know if you see anything wrong with that.


+.globl hchacha_block
+.ent hchacha_block
+hchacha_block:
+ .frame $sp, STACK_SIZE, $ra
+
+ addiu $sp, -STACK_SIZE
+
+ /* Save s0-s7 */
+ sw $s0, 0($sp)
+ sw $s1, 4($sp)
+ sw $s2, 8($sp)
+ sw $s3, 12($sp)
+ sw $s4, 16($sp)
+ sw $s5, 20($sp)
+ sw $s6, 24($sp)
+ sw $s7, 28($sp)
+
+ lw X0, 0(STATE)
+ lw X1, 4(STATE)
+ lw X2, 8(STATE)
+ lw X3, 12(STATE)
+ lw X4, 16(STATE)
+ lw X5, 20(STATE)
+ lw X6, 24(STATE)
+ lw X7, 28(STATE)
+ lw X8, 32(STATE)
+ lw X9, 36(STATE)
+ lw X10, 40(STATE)
+ lw X11, 44(STATE)
+ lw X12, 48(STATE)
+ lw X13, 52(STATE)
+ lw X14, 56(STATE)
+ lw X15, 60(STATE)
+
+.Loop_hchacha_xor_rounds:
+ addiu $a2, -2
+ AXR( 0, 1, 2, 3, 4, 5, 6, 7, 12,13,14,15, 16);
+ AXR( 8, 9,10,11, 12,13,14,15, 4, 5, 6, 7, 12);
+ AXR( 0, 1, 2, 3, 4, 5, 6, 7, 12,13,14,15, 8);
+ AXR( 8, 9,10,11, 12,13,14,15, 4, 5, 6, 7, 7);
+ AXR( 0, 1, 2, 3, 5, 6, 7, 4, 15,12,13,14, 16);
+ AXR(10,11, 8, 9, 15,12,13,14, 5, 6, 7, 4, 12);
+ AXR( 0, 1, 2, 3, 5, 6, 7, 4, 15,12,13,14, 8);
+ AXR(10,11, 8, 9, 15,12,13,14, 5, 6, 7, 4, 7);
+ bnez $a2, .Loop_hchacha_xor_rounds
+
+ sw X0, 0(OUT)
+ sw X1, 4(OUT)
+ sw X2, 8(OUT)
+ sw X3, 12(OUT)
+ sw X12, 16(OUT)
+ sw X13, 20(OUT)
+ sw X14, 24(OUT)
+ sw X15, 28(OUT)
+
+ /* Restore used registers */
+ lw $s0, 0($sp)
+ lw $s1, 4($sp)
+ lw $s2, 8($sp)
+ lw $s3, 12($sp)
+ lw $s4, 16($sp)
+ lw $s5, 20($sp)
+ lw $s6, 24($sp)
+ lw $s7, 28($sp)
+
+ addiu $sp, STACK_SIZE
+ jr $ra
+.end hchacha_block
+.set at


[0] https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/commit/?h=wireguard-crypto-library-api-v3&id=cc74a037f8152d52bd17feaf8d9142b61761484f
René van Dorst Oct. 5, 2019, 9:05 a.m. UTC | #8
Hi Ard and Jason,

Quoting Ard Biesheuvel <ard.biesheuvel@linaro.org>:

> On Fri, 4 Oct 2019 at 17:15, René van Dorst <opensource@vdorst.com> wrote:
>>
>> Hi Jason,
>>
>> Quoting "Jason A. Donenfeld" <Jason@zx2c4.com>:
>>
>> > On Fri, Oct 4, 2019 at 4:44 PM Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
>> >> The round count is passed via the fifth function parameter, so it is
>> >> already on the stack. Reloading it for every block doesn't sound like
>> >> a huge deal to me.
>> >
>> > Please benchmark it to demonstrate that, if it really isn't a big deal. I
>> > recall finding that memory accesses on common mips32r2 commodity
>> > router hardware were extremely inefficient. The whole thing is designed
>> > to minimize memory accesses, which are the primary bottleneck on that
>> > platform.
>>
>> I also think it isn't a big deal, but I shall benchmark it this weekend.
>> If I am correct, a memory write is first put in the cache. So if you read
>> it again and it is still in the cache, it is very fast: 1 or 2 clock cycles.
>> Also, the value isn't used directly after it is read, so the CPU
>> doesn't have to stall on this read.
>>
>
> Thanks René.
>
> Note that the round count is not being spilled. I [re]load it from the
> stack as a function parameter.
>
> So instead of
>
> li $at, 20
>
> I do
>
> lw $at, 16($sp)
>
>
> Thanks a lot for taking the time to double check this. I think it
> would be nice to be able to expose xchacha12 like we do on other
> architectures.

I dusted off my old benchmark code and put it on top of the latest WireGuard
source [0]. It benchmarks the chacha20poly1305_{de,en}crypt functions with
different data block sizes (x bytes).
It runs two tests: the first one sees how many runs we get in 1 second,
which results in MB/sec, and the other one measures the cpu cycles used
per loop.

The test is performed on a Mediatek MT7621A SoC running at 880MHz.
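
The cycle number is measured with essentially the following sketch
(simplified here; TRIALS, the buffers and this exact helper are
illustrative, see [0] for the real code):

/* Sketch: average cycles per chacha20poly1305_encrypt() call,
 * using get_cycles() from <linux/timex.h>. Buffers are assumed
 * to be preallocated. */
static void bench_encrypt(u8 *dst, const u8 *src, size_t len,
			  const u8 key[CHACHA20POLY1305_KEY_SIZE])
{
	enum { TRIALS = 1000 };
	cycles_t start, end;
	int i;

	start = get_cycles();
	for (i = 0; i < TRIALS; ++i)
		chacha20poly1305_encrypt(dst, src, len, NULL, 0, 0, key);
	end = get_cycles();

	pr_info("chacha20poly1305_encrypt: %4zu bytes, %llu cycles\n",
		len, (unsigned long long)(end - start) / TRIALS);
}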

Baseline [1]:

root@OpenWrt:~# insmod wg-speed-baseline.ko
[ 2029.866393] wireguard: chacha20 self-tests: pass
[ 2029.894301] wireguard: poly1305 self-tests: pass
[ 2029.906428] wireguard: chacha20poly1305 self-tests: pass
[ 2030.121001] wireguard: chacha20poly1305_encrypt:    1 bytes,  0.253 MB/sec,  1598 cycles
[ 2030.340786] wireguard: chacha20poly1305_encrypt:   16 bytes,  4.178 MB/sec,  1554 cycles
[ 2030.561434] wireguard: chacha20poly1305_encrypt:   64 bytes, 15.392 MB/sec,  1692 cycles
[ 2030.784635] wireguard: chacha20poly1305_encrypt:  128 bytes, 22.106 MB/sec,  2381 cycles
[ 2031.081534] wireguard: chacha20poly1305_encrypt: 1420 bytes, 35.480 MB/sec, 16751 cycles
[ 2031.371369] wireguard: chacha20poly1305_encrypt: 1440 bytes, 36.117 MB/sec, 16712 cycles
[ 2031.589621] wireguard: chacha20poly1305_decrypt:    1 bytes,  0.246 MB/sec,  1648 cycles
[ 2031.809392] wireguard: chacha20poly1305_decrypt:   16 bytes,  4.064 MB/sec,  1598 cycles
[ 2032.030034] wireguard: chacha20poly1305_decrypt:   64 bytes, 14.990 MB/sec,  1738 cycles
[ 2032.253245] wireguard: chacha20poly1305_decrypt:  128 bytes, 21.679 MB/sec,  2428 cycles
[ 2032.540150] wireguard: chacha20poly1305_decrypt: 1420 bytes, 35.480 MB/sec, 16793 cycles
[ 2032.829954] wireguard: chacha20poly1305_decrypt: 1440 bytes, 35.979 MB/sec, 16756 cycles
[ 2032.850563] wireguard: blake2s self-tests: pass
[ 2033.073767] wireguard: curve25519 self-tests: pass
[ 2033.083600] wireguard: allowedips self-tests: pass
[ 2033.097982] wireguard: nonce counter self-tests: pass
[ 2033.535726] wireguard: ratelimiter self-tests: pass
[ 2033.545615] wireguard: WireGuard 0.0.20190913-4-g5cca99692496 loaded. See www.wireguard.com for information.
[ 2033.565197] wireguard: Copyright (C) 2015-2019 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.

Modified chacha20-mips.S [2]:

root@OpenWrt:~# rmmod wireguard.ko
root@OpenWrt:~# insmod wg-speed-nround-stack.ko
[ 2045.129910] wireguard: chacha20 self-tests: pass
[ 2045.157824] wireguard: poly1305 self-tests: pass
[ 2045.169962] wireguard: chacha20poly1305 self-tests: pass
[ 2045.381034] wireguard: chacha20poly1305_encrypt:    1 bytes,  0.251 MB/sec,  1607 cycles
[ 2045.600801] wireguard: chacha20poly1305_encrypt:   16 bytes,  4.174 MB/sec,  1555 cycles
[ 2045.821437] wireguard: chacha20poly1305_encrypt:   64 bytes, 15.392 MB/sec,  1691 cycles
[ 2046.044650] wireguard: chacha20poly1305_encrypt:  128 bytes, 22.082 MB/sec,  2379 cycles
[ 2046.341509] wireguard: chacha20poly1305_encrypt: 1420 bytes, 35.615 MB/sec, 16739 cycles
[ 2046.631333] wireguard: chacha20poly1305_encrypt: 1440 bytes, 36.117 MB/sec, 16705 cycles
[ 2046.849614] wireguard: chacha20poly1305_decrypt:    1 bytes,  0.246 MB/sec,  1647 cycles
[ 2047.069403] wireguard: chacha20poly1305_decrypt:   16 bytes,  4.056 MB/sec,  1600 cycles
[ 2047.290036] wireguard: chacha20poly1305_decrypt:   64 bytes, 15.001 MB/sec,  1736 cycles
[ 2047.513253] wireguard: chacha20poly1305_decrypt:  128 bytes, 21.666 MB/sec,  2429 cycles
[ 2047.800102] wireguard: chacha20poly1305_decrypt: 1420 bytes, 35.480 MB/sec, 16785 cycles
[ 2048.089967] wireguard: chacha20poly1305_decrypt: 1440 bytes, 35.979 MB/sec, 16759 cycles
[ 2048.110580] wireguard: blake2s self-tests: pass
[ 2048.333719] wireguard: curve25519 self-tests: pass
[ 2048.343547] wireguard: allowedips self-tests: pass
[ 2048.357926] wireguard: nonce counter self-tests: pass
[ 2048.785837] wireguard: ratelimiter self-tests: pass
[ 2048.795781] wireguard: WireGuard 0.0.20190913-5-gee7c7eec8deb loaded. See www.wireguard.com for information.
[ 2048.815389] wireguard: Copyright (C) 2015-2019 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.


I don't see the extra store/load on the stack reflected in the results.
So I think this test proves well enough that the extra nrounds load from
the stack is not a problem.

Ard, I shall take a look at your hchacha code later this weekend.

Greets,

René

[0]: https://github.com/vDorst/wireguard/commits/mips-bench
[1]: https://github.com/vDorst/wireguard/commit/5cca9969249632820cb96548813a65d1f297aa8c
[2]: https://github.com/vDorst/wireguard/commit/ee7c7eec8deb3d5d5dae2eec0be0aafca3fddbc2

>
> Note that for xchacha, I also added a hchacha_block() routine based on
> your code (with the round count as the third argument) [0]. Please let
> me know if you see anything wrong with that.
>
>
> +.globl hchacha_block
> +.ent hchacha_block
> +hchacha_block:
> + .frame $sp, STACK_SIZE, $ra
> +
> + addiu $sp, -STACK_SIZE
> +
> + /* Save s0-s7 */
> + sw $s0, 0($sp)
> + sw $s1, 4($sp)
> + sw $s2, 8($sp)
> + sw $s3, 12($sp)
> + sw $s4, 16($sp)
> + sw $s5, 20($sp)
> + sw $s6, 24($sp)
> + sw $s7, 28($sp)
> +
> + lw X0, 0(STATE)
> + lw X1, 4(STATE)
> + lw X2, 8(STATE)
> + lw X3, 12(STATE)
> + lw X4, 16(STATE)
> + lw X5, 20(STATE)
> + lw X6, 24(STATE)
> + lw X7, 28(STATE)
> + lw X8, 32(STATE)
> + lw X9, 36(STATE)
> + lw X10, 40(STATE)
> + lw X11, 44(STATE)
> + lw X12, 48(STATE)
> + lw X13, 52(STATE)
> + lw X14, 56(STATE)
> + lw X15, 60(STATE)
> +
> +.Loop_hchacha_xor_rounds:
> + addiu $a2, -2
> + AXR( 0, 1, 2, 3, 4, 5, 6, 7, 12,13,14,15, 16);
> + AXR( 8, 9,10,11, 12,13,14,15, 4, 5, 6, 7, 12);
> + AXR( 0, 1, 2, 3, 4, 5, 6, 7, 12,13,14,15, 8);
> + AXR( 8, 9,10,11, 12,13,14,15, 4, 5, 6, 7, 7);
> + AXR( 0, 1, 2, 3, 5, 6, 7, 4, 15,12,13,14, 16);
> + AXR(10,11, 8, 9, 15,12,13,14, 5, 6, 7, 4, 12);
> + AXR( 0, 1, 2, 3, 5, 6, 7, 4, 15,12,13,14, 8);
> + AXR(10,11, 8, 9, 15,12,13,14, 5, 6, 7, 4, 7);
> + bnez $a2, .Loop_hchacha_xor_rounds
> +
> + sw X0, 0(OUT)
> + sw X1, 4(OUT)
> + sw X2, 8(OUT)
> + sw X3, 12(OUT)
> + sw X12, 16(OUT)
> + sw X13, 20(OUT)
> + sw X14, 24(OUT)
> + sw X15, 28(OUT)
> +
> + /* Restore used registers */
> + lw $s0, 0($sp)
> + lw $s1, 4($sp)
> + lw $s2, 8($sp)
> + lw $s3, 12($sp)
> + lw $s4, 16($sp)
> + lw $s5, 20($sp)
> + lw $s6, 24($sp)
> + lw $s7, 28($sp)
> +
> + addiu $sp, STACK_SIZE
> + jr $ra
> +.end hchacha_block
> +.set at
>
>
> [0] https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/commit/?h=wireguard-crypto-library-api-v3&id=cc74a037f8152d52bd17feaf8d9142b61761484f
René van Dorst Oct. 6, 2019, 7:12 p.m. UTC | #9
Quoting Ard Biesheuvel <ard.biesheuvel@linaro.org>:

<snip>

Hi Ard,

> Thanks a lot for taking the time to double check this. I think it
> would be nice to be able to expose xchacha12 like we do on other
> architectures.
>
> Note that for xchacha, I also added a hchacha_block() routine based on
> your code (with the round count as the third argument) [0]. Please let
> me know if you see anything wrong with that.
>
>
> +.globl hchacha_block
> +.ent hchacha_block
> +hchacha_block:
> + .frame $sp, STACK_SIZE, $ra
> +
> + addiu $sp, -STACK_SIZE
> +
> + /* Save s0-s7 */
> + sw $s0, 0($sp)
> + sw $s1, 4($sp)
> + sw $s2, 8($sp)
> + sw $s3, 12($sp)
> + sw $s4, 16($sp)
> + sw $s5, 20($sp)
> + sw $s6, 24($sp)
> + sw $s7, 28($sp)

We only have to preserve the s registers that are actually used.
Currently X11 to X15 use the registers s6 down to s2.

But by shuffling/redefining the needed registers, so that we use all the
non-preserved registers, I can reduce the used s registers to one.

Registers we don't use and don't have to preserve are a3, at and v0.
Also STATE (a0) can be reused, because we only need that pointer while
loading the values from memory.

So:

#undef X12
#undef X13
#undef X14
#undef X15

#define X12    $a3
#define X13    $at
#define X14    $v0
#define X15    STATE

And save X11(s6) on the stack.
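
With that remapping, the register save/restore collapses to a single
pair (a sketch of the idea, not the exact commit):

/* Save used register */
sw $s6, 0($sp)

...

/* Restore used register */
lw $s6, 0($sp)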

See the full code here [0].

For the rest, the code looks good!

Greets,

René

[0]: https://github.com/vDorst/wireguard/commit/562a516ae3b282b32f57d3239369360bc926df60


> +
> + lw X0, 0(STATE)
> + lw X1, 4(STATE)
> + lw X2, 8(STATE)
> + lw X3, 12(STATE)
> + lw X4, 16(STATE)
> + lw X5, 20(STATE)
> + lw X6, 24(STATE)
> + lw X7, 28(STATE)
> + lw X8, 32(STATE)
> + lw X9, 36(STATE)
> + lw X10, 40(STATE)
> + lw X11, 44(STATE)
> + lw X12, 48(STATE)
> + lw X13, 52(STATE)
> + lw X14, 56(STATE)
> + lw X15, 60(STATE)
> +
> +.Loop_hchacha_xor_rounds:
> + addiu $a2, -2
> + AXR( 0, 1, 2, 3, 4, 5, 6, 7, 12,13,14,15, 16);
> + AXR( 8, 9,10,11, 12,13,14,15, 4, 5, 6, 7, 12);
> + AXR( 0, 1, 2, 3, 4, 5, 6, 7, 12,13,14,15, 8);
> + AXR( 8, 9,10,11, 12,13,14,15, 4, 5, 6, 7, 7);
> + AXR( 0, 1, 2, 3, 5, 6, 7, 4, 15,12,13,14, 16);
> + AXR(10,11, 8, 9, 15,12,13,14, 5, 6, 7, 4, 12);
> + AXR( 0, 1, 2, 3, 5, 6, 7, 4, 15,12,13,14, 8);
> + AXR(10,11, 8, 9, 15,12,13,14, 5, 6, 7, 4, 7);
> + bnez $a2, .Loop_hchacha_xor_rounds
> +
> + sw X0, 0(OUT)
> + sw X1, 4(OUT)
> + sw X2, 8(OUT)
> + sw X3, 12(OUT)
> + sw X12, 16(OUT)
> + sw X13, 20(OUT)
> + sw X14, 24(OUT)
> + sw X15, 28(OUT)
> +
> + /* Restore used registers */
> + lw $s0, 0($sp)
> + lw $s1, 4($sp)
> + lw $s2, 8($sp)
> + lw $s3, 12($sp)
> + lw $s4, 16($sp)
> + lw $s5, 20($sp)
> + lw $s6, 24($sp)
> + lw $s7, 28($sp)
> +
> + addiu $sp, STACK_SIZE
> + jr $ra
> +.end hchacha_block
> +.set at
>
>
> [0] https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/commit/?h=wireguard-crypto-library-api-v3&id=cc74a037f8152d52bd17feaf8d9142b61761484f

Patch

diff --git a/arch/mips/Makefile b/arch/mips/Makefile
index cdc09b71febe..8584c047ea59 100644
--- a/arch/mips/Makefile
+++ b/arch/mips/Makefile
@@ -323,7 +323,7 @@  libs-$(CONFIG_MIPS_FP_SUPPORT) += arch/mips/math-emu/
 # See arch/mips/Kbuild for content of core part of the kernel
 core-y += arch/mips/
 
-drivers-$(CONFIG_MIPS_CRC_SUPPORT) += arch/mips/crypto/
+drivers-y			+= arch/mips/crypto/
 drivers-$(CONFIG_OPROFILE)	+= arch/mips/oprofile/
 
 # suspend and hibernation support
diff --git a/arch/mips/crypto/Makefile b/arch/mips/crypto/Makefile
index e07aca572c2e..7f7ea0020cc2 100644
--- a/arch/mips/crypto/Makefile
+++ b/arch/mips/crypto/Makefile
@@ -4,3 +4,6 @@ 
 #
 
 obj-$(CONFIG_CRYPTO_CRC32_MIPS) += crc32-mips.o
+
+obj-$(CONFIG_CRYPTO_CHACHA_MIPS) += chacha-mips.o
+chacha-mips-y := chacha-core.o chacha-glue.o
diff --git a/arch/mips/crypto/chacha-core.S b/arch/mips/crypto/chacha-core.S
new file mode 100644
index 000000000000..42150d15fc88
--- /dev/null
+++ b/arch/mips/crypto/chacha-core.S
@@ -0,0 +1,424 @@ 
+/* SPDX-License-Identifier: GPL-2.0 OR MIT */
+/*
+ * Copyright (C) 2016-2018 René van Dorst <opensource@vdorst.com>. All Rights Reserved.
+ * Copyright (C) 2015-2019 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ */
+
+#define MASK_U32		0x3c
+#define CHACHA20_BLOCK_SIZE	64
+#define STACK_SIZE		32
+
+#define X0	$t0
+#define X1	$t1
+#define X2	$t2
+#define X3	$t3
+#define X4	$t4
+#define X5	$t5
+#define X6	$t6
+#define X7	$t7
+#define X8	$t8
+#define X9	$t9
+#define X10	$v1
+#define X11	$s6
+#define X12	$s5
+#define X13	$s4
+#define X14	$s3
+#define X15	$s2
+/* Use regs which are overwritten on exit for Tx so we don't leak clear data. */
+#define T0	$s1
+#define T1	$s0
+#define T(n)	T ## n
+#define X(n)	X ## n
+
+/* Input arguments */
+#define STATE		$a0
+#define OUT		$a1
+#define IN		$a2
+#define BYTES		$a3
+
+/* Output argument */
+/* NONCE[0] is kept in a register and not in memory.
+ * We don't want to touch the original value in memory.
+ * It must be incremented every loop iteration.
+ */
+#define NONCE_0		$v0
+
+/* SAVED_X and SAVED_CA are set in the jump table.
+ * Use regs which are overwritten on exit so we don't leak clear data.
+ * They are used for handling the last bytes, which are not a multiple of 4.
+ */
+#define SAVED_X		X15
+#define SAVED_CA	$s7
+
+#define IS_UNALIGNED	$s7
+
+#if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
+#define MSB 0
+#define LSB 3
+#define ROTx rotl
+#define ROTR(n) rotr n, 24
+#define	CPU_TO_LE32(n) \
+	wsbh	n; \
+	rotr	n, 16;
+#else
+#define MSB 3
+#define LSB 0
+#define ROTx rotr
+#define CPU_TO_LE32(n)
+#define ROTR(n)
+#endif
+
+#define FOR_EACH_WORD(x) \
+	x( 0); \
+	x( 1); \
+	x( 2); \
+	x( 3); \
+	x( 4); \
+	x( 5); \
+	x( 6); \
+	x( 7); \
+	x( 8); \
+	x( 9); \
+	x(10); \
+	x(11); \
+	x(12); \
+	x(13); \
+	x(14); \
+	x(15);
+
+#define FOR_EACH_WORD_REV(x) \
+	x(15); \
+	x(14); \
+	x(13); \
+	x(12); \
+	x(11); \
+	x(10); \
+	x( 9); \
+	x( 8); \
+	x( 7); \
+	x( 6); \
+	x( 5); \
+	x( 4); \
+	x( 3); \
+	x( 2); \
+	x( 1); \
+	x( 0);
+
+#define PLUS_ONE_0	 1
+#define PLUS_ONE_1	 2
+#define PLUS_ONE_2	 3
+#define PLUS_ONE_3	 4
+#define PLUS_ONE_4	 5
+#define PLUS_ONE_5	 6
+#define PLUS_ONE_6	 7
+#define PLUS_ONE_7	 8
+#define PLUS_ONE_8	 9
+#define PLUS_ONE_9	10
+#define PLUS_ONE_10	11
+#define PLUS_ONE_11	12
+#define PLUS_ONE_12	13
+#define PLUS_ONE_13	14
+#define PLUS_ONE_14	15
+#define PLUS_ONE_15	16
+#define PLUS_ONE(x)	PLUS_ONE_ ## x
+#define _CONCAT3(a,b,c)	a ## b ## c
+#define CONCAT3(a,b,c)	_CONCAT3(a,b,c)
+
+#define STORE_UNALIGNED(x) \
+CONCAT3(.Lchacha_mips_xor_unaligned_, PLUS_ONE(x), _b: ;) \
+	.if (x != 12); \
+		lw	T0, (x*4)(STATE); \
+	.endif; \
+	lwl	T1, (x*4)+MSB ## (IN); \
+	lwr	T1, (x*4)+LSB ## (IN); \
+	.if (x == 12); \
+		addu	X ## x, NONCE_0; \
+	.else; \
+		addu	X ## x, T0; \
+	.endif; \
+	CPU_TO_LE32(X ## x); \
+	xor	X ## x, T1; \
+	swl	X ## x, (x*4)+MSB ## (OUT); \
+	swr	X ## x, (x*4)+LSB ## (OUT);
+
+#define STORE_ALIGNED(x) \
+CONCAT3(.Lchacha_mips_xor_aligned_, PLUS_ONE(x), _b: ;) \
+	.if (x != 12); \
+		lw	T0, (x*4)(STATE); \
+	.endif; \
+	lw	T1, (x*4) ## (IN); \
+	.if (x == 12); \
+		addu	X ## x, NONCE_0; \
+	.else; \
+		addu	X ## x, T0; \
+	.endif; \
+	CPU_TO_LE32(X ## x); \
+	xor	X ## x, T1; \
+	sw	X ## x, (x*4) ## (OUT);
+
+/* Jump table macro.
+ * Used for setup and for handling the last bytes, which are not a multiple of 4.
+ * X15 is free to store Xn.
+ * Every jump table entry must be equal in size.
+ */
+#define JMPTBL_ALIGNED(x) \
+.Lchacha_mips_jmptbl_aligned_ ## x: ; \
+	.set	noreorder; \
+	b	.Lchacha_mips_xor_aligned_ ## x ## _b; \
+	.if (x == 12); \
+		addu	SAVED_X, X ## x, NONCE_0; \
+	.else; \
+		addu	SAVED_X, X ## x, SAVED_CA; \
+	.endif; \
+	.set	reorder
+
+#define JMPTBL_UNALIGNED(x) \
+.Lchacha_mips_jmptbl_unaligned_ ## x: ; \
+	.set	noreorder; \
+	b	.Lchacha_mips_xor_unaligned_ ## x ## _b; \
+	.if (x == 12); \
+		addu	SAVED_X, X ## x, NONCE_0; \
+	.else; \
+		addu	SAVED_X, X ## x, SAVED_CA; \
+	.endif; \
+	.set	reorder
+
+#define AXR(A, B, C, D,  K, L, M, N,  V, W, Y, Z,  S) \
+	addu	X(A), X(K); \
+	addu	X(B), X(L); \
+	addu	X(C), X(M); \
+	addu	X(D), X(N); \
+	xor	X(V), X(A); \
+	xor	X(W), X(B); \
+	xor	X(Y), X(C); \
+	xor	X(Z), X(D); \
+	rotl	X(V), S;    \
+	rotl	X(W), S;    \
+	rotl	X(Y), S;    \
+	rotl	X(Z), S;
+
+.text
+.set	reorder
+.set	noat
+.globl	chacha_mips
+.ent	chacha_mips
+chacha_mips:
+	.frame	$sp, STACK_SIZE, $ra
+
+	/* Load number of rounds */
+	lw	$at, 16($sp)
+
+	addiu	$sp, -STACK_SIZE
+
+	/* Return bytes = 0. */
+	beqz	BYTES, .Lchacha_mips_end
+
+	lw	NONCE_0, 48(STATE)
+
+	/* Save s0-s7 */
+	sw	$s0,  0($sp)
+	sw	$s1,  4($sp)
+	sw	$s2,  8($sp)
+	sw	$s3, 12($sp)
+	sw	$s4, 16($sp)
+	sw	$s5, 20($sp)
+	sw	$s6, 24($sp)
+	sw	$s7, 28($sp)
+
+	/* Test whether IN or OUT is unaligned.
+	 * IS_UNALIGNED = ( IN | OUT ) & 0x00000003
+	 */
+	or	IS_UNALIGNED, IN, OUT
+	andi	IS_UNALIGNED, 0x3
+
+	b	.Lchacha_rounds_start
+
+.align 4
+.Loop_chacha_rounds:
+	addiu	IN,  CHACHA20_BLOCK_SIZE
+	addiu	OUT, CHACHA20_BLOCK_SIZE
+	addiu	NONCE_0, 1
+
+.Lchacha_rounds_start:
+	lw	X0,  0(STATE)
+	lw	X1,  4(STATE)
+	lw	X2,  8(STATE)
+	lw	X3,  12(STATE)
+
+	lw	X4,  16(STATE)
+	lw	X5,  20(STATE)
+	lw	X6,  24(STATE)
+	lw	X7,  28(STATE)
+	lw	X8,  32(STATE)
+	lw	X9,  36(STATE)
+	lw	X10, 40(STATE)
+	lw	X11, 44(STATE)
+
+	move	X12, NONCE_0
+	lw	X13, 52(STATE)
+	lw	X14, 56(STATE)
+	lw	X15, 60(STATE)
+
+.Loop_chacha_xor_rounds:
+	addiu	$at, -2
+	AXR( 0, 1, 2, 3,  4, 5, 6, 7, 12,13,14,15, 16);
+	AXR( 8, 9,10,11, 12,13,14,15,  4, 5, 6, 7, 12);
+	AXR( 0, 1, 2, 3,  4, 5, 6, 7, 12,13,14,15,  8);
+	AXR( 8, 9,10,11, 12,13,14,15,  4, 5, 6, 7,  7);
+	AXR( 0, 1, 2, 3,  5, 6, 7, 4, 15,12,13,14, 16);
+	AXR(10,11, 8, 9, 15,12,13,14,  5, 6, 7, 4, 12);
+	AXR( 0, 1, 2, 3,  5, 6, 7, 4, 15,12,13,14,  8);
+	AXR(10,11, 8, 9, 15,12,13,14,  5, 6, 7, 4,  7);
+	bnez	$at, .Loop_chacha_xor_rounds
+
+	addiu	BYTES, -(CHACHA20_BLOCK_SIZE)
+
+	/* Is data src/dst unaligned? Jump */
+	bnez	IS_UNALIGNED, .Loop_chacha_unaligned
+
+	/* Set the number of rounds here to fill the delay slot. */
+	lw	$at, (STACK_SIZE+16)($sp)
+
+	/* BYTES < 0, it has no full block. */
+	bltz	BYTES, .Lchacha_mips_no_full_block_aligned
+
+	FOR_EACH_WORD_REV(STORE_ALIGNED)
+
+	/* BYTES > 0? Loop again. */
+	bgtz	BYTES, .Loop_chacha_rounds
+
+	/* Place this here to fill delay slot */
+	addiu	NONCE_0, 1
+
+	/* BYTES < 0? Handle last bytes */
+	bltz	BYTES, .Lchacha_mips_xor_bytes
+
+.Lchacha_mips_xor_done:
+	/* Restore used registers */
+	lw	$s0,  0($sp)
+	lw	$s1,  4($sp)
+	lw	$s2,  8($sp)
+	lw	$s3, 12($sp)
+	lw	$s4, 16($sp)
+	lw	$s5, 20($sp)
+	lw	$s6, 24($sp)
+	lw	$s7, 28($sp)
+
+	/* Write NONCE_0 back to right location in state */
+	sw	NONCE_0, 48(STATE)
+
+.Lchacha_mips_end:
+	addiu	$sp, STACK_SIZE
+	jr	$ra
+
+.Lchacha_mips_no_full_block_aligned:
+	/* Restore the offset on BYTES */
+	addiu	BYTES, CHACHA20_BLOCK_SIZE
+
+	/* Get number of full WORDS */
+	andi	$at, BYTES, MASK_U32
+
+	/* Load upper half of jump table addr */
+	lui	T0, %hi(.Lchacha_mips_jmptbl_aligned_0)
+
+	/* Calculate lower half jump table offset */
+	ins	T0, $at, 1, 6
+
+	/* Add offset to STATE */
+	addu	T1, STATE, $at
+
+	/* Add lower half jump table addr */
+	addiu	T0, %lo(.Lchacha_mips_jmptbl_aligned_0)
+
+	/* Read value from STATE */
+	lw	SAVED_CA, 0(T1)
+
+	/* Store remaining bytecounter as negative value */
+	subu	BYTES, $at, BYTES
+
+	jr	T0
+
+	/* Jump table */
+	FOR_EACH_WORD(JMPTBL_ALIGNED)
+
+
+.Loop_chacha_unaligned:
+	/* Set the number of rounds here to fill the delay slot. */
+	lw	$at, (STACK_SIZE+16)($sp)
+
+	/* BYTES < 0, it has no full block. */
+	bltz	BYTES, .Lchacha_mips_no_full_block_unaligned
+
+	FOR_EACH_WORD_REV(STORE_UNALIGNED)
+
+	/* BYTES > 0? Loop again. */
+	bgtz	BYTES, .Loop_chacha_rounds
+
+	/* Write NONCE_0 back to right location in state */
+	sw	NONCE_0, 48(STATE)
+
+	.set noreorder
+	/* Fall through to byte handling */
+	bgez	BYTES, .Lchacha_mips_xor_done
+.Lchacha_mips_xor_unaligned_0_b:
+.Lchacha_mips_xor_aligned_0_b:
+	/* Place this here to fill delay slot */
+	addiu	NONCE_0, 1
+	.set reorder
+
+.Lchacha_mips_xor_bytes:
+	addu	IN, $at
+	addu	OUT, $at
+	/* First byte */
+	lbu	T1, 0(IN)
+	addiu	$at, BYTES, 1
+	CPU_TO_LE32(SAVED_X)
+	ROTR(SAVED_X)
+	xor	T1, SAVED_X
+	sb	T1, 0(OUT)
+	beqz	$at, .Lchacha_mips_xor_done
+	/* Second byte */
+	lbu	T1, 1(IN)
+	addiu	$at, BYTES, 2
+	ROTx	SAVED_X, 8
+	xor	T1, SAVED_X
+	sb	T1, 1(OUT)
+	beqz	$at, .Lchacha_mips_xor_done
+	/* Third byte */
+	lbu	T1, 2(IN)
+	ROTx	SAVED_X, 8
+	xor	T1, SAVED_X
+	sb	T1, 2(OUT)
+	b	.Lchacha_mips_xor_done
+
+.Lchacha_mips_no_full_block_unaligned:
+	/* Restore the offset on BYTES */
+	addiu	BYTES, CHACHA20_BLOCK_SIZE
+
+	/* Get number of full WORDS */
+	andi	$at, BYTES, MASK_U32
+
+	/* Load upper half of jump table addr */
+	lui	T0, %hi(.Lchacha_mips_jmptbl_unaligned_0)
+
+	/* Calculate lower half jump table offset */
+	ins	T0, $at, 1, 6
+
+	/* Add offset to STATE */
+	addu	T1, STATE, $at
+
+	/* Add lower half jump table addr */
+	addiu	T0, %lo(.Lchacha_mips_jmptbl_unaligned_0)
+
+	/* Read value from STATE */
+	lw	SAVED_CA, 0(T1)
+
+	/* Store remaining bytecounter as negative value */
+	subu	BYTES, $at, BYTES
+
+	jr	T0
+
+	/* Jump table */
+	FOR_EACH_WORD(JMPTBL_UNALIGNED)
+.end chacha_mips
+.set at
diff --git a/arch/mips/crypto/chacha-glue.c b/arch/mips/crypto/chacha-glue.c
new file mode 100644
index 000000000000..de01dc57751e
--- /dev/null
+++ b/arch/mips/crypto/chacha-glue.c
@@ -0,0 +1,161 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * MIPS accelerated ChaCha and XChaCha stream ciphers,
+ * including ChaCha20 (RFC7539)
+ *
+ * Copyright (C) 2019 Linaro, Ltd. <ard.biesheuvel@linaro.org>
+ */
+
+#include <crypto/algapi.h>
+#include <crypto/internal/chacha.h>
+#include <crypto/internal/skcipher.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+
+asmlinkage void chacha_mips(const u32 *state, u8 *dst, const u8 *src,
+			    unsigned int bytes, int nrounds);
+
+void hchacha_block(const u32 *state, u32 *stream, int nrounds)
+{
+	hchacha_block_generic(state, stream, nrounds);
+}
+EXPORT_SYMBOL(hchacha_block);
+
+void chacha_init(u32 *state, const u32 *key, const u8 *iv)
+{
+	chacha_init_generic(state, key, iv);
+}
+EXPORT_SYMBOL(chacha_init);
+
+void chacha_crypt(u32 *state, u8 *dst, const u8 *src, unsigned int bytes,
+		  int nrounds)
+{
+	chacha_mips(state, dst, src, bytes, nrounds);
+}
+EXPORT_SYMBOL(chacha_crypt);
+
+static int chacha_mips_stream_xor(struct skcipher_request *req,
+				  const struct chacha_ctx *ctx, const u8 *iv)
+{
+	struct skcipher_walk walk;
+	u32 state[16];
+	int err;
+
+	err = skcipher_walk_virt(&walk, req, false);
+
+	crypto_chacha_init(state, ctx, iv);
+
+	while (walk.nbytes > 0) {
+		unsigned int nbytes = walk.nbytes;
+
+		if (nbytes < walk.total)
+			nbytes = round_down(nbytes, walk.stride);
+
+		chacha_mips(state, walk.dst.virt.addr, walk.src.virt.addr,
+			    nbytes, ctx->nrounds);
+		err = skcipher_walk_done(&walk, walk.nbytes - nbytes);
+	}
+
+	return err;
+}
+
+static int __chacha_mips(struct skcipher_request *req)
+{
+	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+	struct chacha_ctx *ctx = crypto_skcipher_ctx(tfm);
+
+	return chacha_mips_stream_xor(req, ctx, req->iv);
+}
+
+static int xchacha_mips(struct skcipher_request *req)
+{
+	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+	struct chacha_ctx *ctx = crypto_skcipher_ctx(tfm);
+	struct chacha_ctx subctx;
+	u32 state[16];
+	u8 real_iv[16];
+
+	crypto_chacha_init(state, ctx, req->iv);
+
+	hchacha_block_generic(state, subctx.key, ctx->nrounds);
+	subctx.nrounds = ctx->nrounds;
+
+	memcpy(&real_iv[0], req->iv + 24, 8);
+	memcpy(&real_iv[8], req->iv + 16, 8);
+	return chacha_mips_stream_xor(req, &subctx, real_iv);
+}
+
+static struct skcipher_alg algs[] = {
+	{
+		.base.cra_name		= "chacha20",
+		.base.cra_driver_name	= "chacha20-mips",
+		.base.cra_priority	= 200,
+		.base.cra_blocksize	= 1,
+		.base.cra_ctxsize	= sizeof(struct chacha_ctx),
+		.base.cra_module	= THIS_MODULE,
+
+		.min_keysize		= CHACHA_KEY_SIZE,
+		.max_keysize		= CHACHA_KEY_SIZE,
+		.ivsize			= CHACHA_IV_SIZE,
+		.chunksize		= CHACHA_BLOCK_SIZE,
+		.walksize		= 4 * CHACHA_BLOCK_SIZE,
+		.setkey			= crypto_chacha20_setkey,
+		.encrypt		= __chacha_mips,
+		.decrypt		= __chacha_mips,
+	}, {
+		.base.cra_name		= "xchacha20",
+		.base.cra_driver_name	= "xchacha20-mips",
+		.base.cra_priority	= 200,
+		.base.cra_blocksize	= 1,
+		.base.cra_ctxsize	= sizeof(struct chacha_ctx),
+		.base.cra_module	= THIS_MODULE,
+
+		.min_keysize		= CHACHA_KEY_SIZE,
+		.max_keysize		= CHACHA_KEY_SIZE,
+		.ivsize			= XCHACHA_IV_SIZE,
+		.chunksize		= CHACHA_BLOCK_SIZE,
+		.walksize		= 4 * CHACHA_BLOCK_SIZE,
+		.setkey			= crypto_chacha20_setkey,
+		.encrypt		= xchacha_mips,
+		.decrypt		= xchacha_mips,
+	}, {
+		.base.cra_name		= "xchacha12",
+		.base.cra_driver_name	= "xchacha12-mips",
+		.base.cra_priority	= 200,
+		.base.cra_blocksize	= 1,
+		.base.cra_ctxsize	= sizeof(struct chacha_ctx),
+		.base.cra_module	= THIS_MODULE,
+
+		.min_keysize		= CHACHA_KEY_SIZE,
+		.max_keysize		= CHACHA_KEY_SIZE,
+		.ivsize			= XCHACHA_IV_SIZE,
+		.chunksize		= CHACHA_BLOCK_SIZE,
+		.walksize		= 4 * CHACHA_BLOCK_SIZE,
+		.setkey			= crypto_chacha12_setkey,
+		.encrypt		= xchacha_mips,
+		.decrypt		= xchacha_mips,
+	}
+};
+
+static int __init chacha_simd_mod_init(void)
+{
+	return crypto_register_skciphers(algs, ARRAY_SIZE(algs));
+}
+
+static void __exit chacha_simd_mod_fini(void)
+{
+	crypto_unregister_skciphers(algs, ARRAY_SIZE(algs));
+}
+
+module_init(chacha_simd_mod_init);
+module_exit(chacha_simd_mod_fini);
+
+MODULE_DESCRIPTION("ChaCha and XChaCha stream ciphers (MIPS accelerated)");
+MODULE_AUTHOR("Ard Biesheuvel <ard.biesheuvel@linaro.org>");
+MODULE_LICENSE("GPL v2");
+MODULE_ALIAS_CRYPTO("chacha20");
+MODULE_ALIAS_CRYPTO("chacha20-mips");
+MODULE_ALIAS_CRYPTO("xchacha20");
+MODULE_ALIAS_CRYPTO("xchacha20-mips");
+MODULE_ALIAS_CRYPTO("xchacha12");
+MODULE_ALIAS_CRYPTO("xchacha12-mips");
diff --git a/crypto/Kconfig b/crypto/Kconfig
index f90b53a526ba..43e94ac5d117 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -1441,6 +1441,12 @@  config CRYPTO_CHACHA20_X86_64
 	  SSSE3, AVX2, and AVX-512VL optimized implementations of the ChaCha20,
 	  XChaCha20, and XChaCha12 stream ciphers.
 
+config CRYPTO_CHACHA_MIPS
+	tristate "ChaCha stream cipher algorithms (MIPS 32r2 optimized)"
+	depends on CPU_MIPS32_R2
+	select CRYPTO_CHACHA20
+	select CRYPTO_ARCH_HAVE_LIB_CHACHA
+
 config CRYPTO_SEED
 	tristate "SEED cipher algorithm"
 	select CRYPTO_ALGAPI