
[12/12] RISC-V: crypto: add Zvkb accelerated ChaCha20 implementation

Message ID 20231025183644.8735-13-jerry.shih@sifive.com
State New
Series RISC-V: provide some accelerated cryptography implementations using vector extensions

Commit Message

Jerry Shih Oct. 25, 2023, 6:36 p.m. UTC
Add a ChaCha20 vector implementation from OpenSSL (openssl/openssl#21923).

Signed-off-by: Jerry Shih <jerry.shih@sifive.com>
---
 arch/riscv/crypto/Kconfig                |  12 +
 arch/riscv/crypto/Makefile               |   7 +
 arch/riscv/crypto/chacha-riscv64-glue.c  | 120 +++++++++
 arch/riscv/crypto/chacha-riscv64-zvkb.pl | 322 +++++++++++++++++++++++
 4 files changed, 461 insertions(+)
 create mode 100644 arch/riscv/crypto/chacha-riscv64-glue.c
 create mode 100644 arch/riscv/crypto/chacha-riscv64-zvkb.pl

Comments

Eric Biggers Nov. 20, 2023, 7:18 p.m. UTC | #1
Hi Jerry!

On Mon, Nov 20, 2023 at 10:55:15AM +0800, Jerry Shih wrote:
> >> +# - RV64I
> >> +# - RISC-V Vector ('V') with VLEN >= 128
> >> +# - RISC-V Vector Cryptography Bit-manipulation extension ('Zvkb')
> >> +# - RISC-V Zicclsm(Main memory supports misaligned loads/stores)
> > 
> > How is the presence of the Zicclsm extension guaranteed?
> > 
> > - Eric
> 
> I have the additional extension parser for `Zicclsm` in the v2 patch set.

First, I can see your updated patchset at branch
"dev/jerrys/vector-crypto-upstream-v2" of https://github.com/JerryShih/linux,
but I haven't seen it on the mailing list yet.  Are you planning to send it out?

Second, with your updated patchset, I'm not seeing any of the RISC-V optimized
algorithms be registered when I boot the kernel in QEMU.  This is caused by the
new check 'riscv_isa_extension_available(NULL, ZICCLSM)' not passing.  Is
checking for "Zicclsm" the correct way to determine whether unaligned memory
accesses are supported?

I'm using 'qemu-system-riscv64 -cpu max -machine virt', with the very latest
QEMU commit (af9264da80073435), so it should have all the CPU features.

- Eric
Jerry Shih Nov. 21, 2023, 10:55 a.m. UTC | #2
On Nov 21, 2023, at 03:18, Eric Biggers <ebiggers@kernel.org> wrote:
> First, I can see your updated patchset at branch
> "dev/jerrys/vector-crypto-upstream-v2" of https://github.com/JerryShih/linux,
> but I haven't seen it on the mailing list yet.  Are you planning to send it out?

I will send it out soon.

> Second, with your updated patchset, I'm not seeing any of the RISC-V optimized
> algorithms be registered when I boot the kernel in QEMU.  This is caused by the
> new check 'riscv_isa_extension_available(NULL, ZICCLSM)' not passing.  Is
> checking for "Zicclsm" the correct way to determine whether unaligned memory
> accesses are supported?
> 
> I'm using 'qemu-system-riscv64 -cpu max -machine virt', with the very latest
> QEMU commit (af9264da80073435), so it should have all the CPU features.
> 
> - Eric

Sorry, I was using my `internal` qemu with the vector-crypto and rva22 patches.

The public qemu doesn't support the rva22 profiles yet. Here is the qemu patch[1] for
that. That thread also discusses why qemu doesn't export these
`named extensions` (e.g. Zicclsm).
I tried to add Zicclsm to the DT in the v2 patch set. Maybe we will have more discussion
about the rva22 profiles in the kernel DT.

[1]
LINK: https://lore.kernel.org/all/d1d6f2dc-55b2-4dce-a48a-4afbbf6df526@ventanamicro.com/#t

I don't know whether it's good practice to check for unaligned access using
`Zicclsm`.

Here is another related cpu feature for unaligned access:
RISCV_HWPROBE_MISALIGNED_*
But it looks like it is always initialized to `RISCV_HWPROBE_MISALIGNED_SLOW`[2].
That implies the Linux kernel always supports unaligned access. But we have
actual HW which doesn't support unaligned access in the vector unit.

[2]
LINK: https://github.com/torvalds/linux/blob/98b1cc82c4affc16f5598d4fa14b1858671b2263/arch/riscv/kernel/cpufeature.c#L575

I will still use the `Zicclsm` check at this stage for review. And I will create a qemu
branch with Zicclsm enabled for testing.

-Jerry
Conor Dooley Nov. 21, 2023, 1:14 p.m. UTC | #3
On Tue, Nov 21, 2023 at 06:55:07PM +0800, Jerry Shih wrote:
> On Nov 21, 2023, at 03:18, Eric Biggers <ebiggers@kernel.org> wrote:
> > First, I can see your updated patchset at branch
> > "dev/jerrys/vector-crypto-upstream-v2" of https://github.com/JerryShih/linux,
> > but I haven't seen it on the mailing list yet.  Are you planning to send it out?
> 
> I will send it out soon.
> 
> > Second, with your updated patchset, I'm not seeing any of the RISC-V optimized
> > algorithms be registered when I boot the kernel in QEMU.  This is caused by the
> > new check 'riscv_isa_extension_available(NULL, ZICCLSM)' not passing.  Is
> > checking for "Zicclsm" the correct way to determine whether unaligned memory
> > accesses are supported?
> > 
> > I'm using 'qemu-system-riscv64 -cpu max -machine virt', with the very latest
> > QEMU commit (af9264da80073435), so it should have all the CPU features.
> > 
> > - Eric
> 
> Sorry, I just use my `internal` qemu with vector-crypto and rva22 patches.
> 
> The public qemu haven't supported rva22 profiles. Here is the qemu patch[1] for
> that. But here is the discussion why the qemu doesn't export these
> `named extensions`(e.g. Zicclsm).
> I try to add Zicclsm in DT in the v2 patch set. Maybe we will have more discussion
> about the rva22 profiles in kernel DT.

Please do, that'll be fun! Please take some time to read what the
profiles spec actually defines Zicclsm for before you send those patches
though. I think you might come to find you have misunderstood what it
means - certainly I did the first time I saw it!

> [1]
> LINK: https://lore.kernel.org/all/d1d6f2dc-55b2-4dce-a48a-4afbbf6df526@ventanamicro.com/#t
> 
> I don't know whether it's a good practice to check unaligned access using
> `Zicclsm`. 
> 
> Here is another related cpu feature for unaligned access:
> RISCV_HWPROBE_MISALIGNED_*
> But it looks like it always be initialized with `RISCV_HWPROBE_MISALIGNED_SLOW`[2].
> It implies that linux kernel always supports unaligned access. But we have the
> actual HW which doesn't support unaligned access for vector unit.

https://docs.kernel.org/arch/riscv/uabi.html#misaligned-accesses

Misaligned accesses are part of the user ABI & the hwprobe stuff for
that allows userspace to figure out whether they're fast (likely
implemented in hardware), slow (likely emulated in firmware) or emulated
in the kernel.
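
To make the fast/slow/emulated distinction concrete, here is a minimal
userspace sketch that decodes the misaligned-access field of a hwprobe
value. The constants are local copies of what I believe the
RISCV_HWPROBE_MISALIGNED_* values in the uapi header to be, so please
treat them as assumptions and check <asm/hwprobe.h>. The function names
are hypothetical, and the value is passed in rather than fetched with the
riscv_hwprobe syscall so that the sketch builds on any host.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed copies of the uapi RISCV_HWPROBE_MISALIGNED_* values. */
#define MISALIGNED_UNKNOWN	(0 << 0)
#define MISALIGNED_EMULATED	(1 << 0)
#define MISALIGNED_SLOW		(2 << 0)
#define MISALIGNED_FAST		(3 << 0)
#define MISALIGNED_UNSUPPORTED	(4 << 0)
#define MISALIGNED_MASK		(7 << 0)

/* Hypothetical helper: name the misaligned-access class in a CPUPERF_0 value. */
static const char *misaligned_perf_name(uint64_t cpuperf0)
{
	switch (cpuperf0 & MISALIGNED_MASK) {
	case MISALIGNED_EMULATED:
		return "emulated";
	case MISALIGNED_SLOW:
		return "slow";
	case MISALIGNED_FAST:
		return "fast";
	case MISALIGNED_UNSUPPORTED:
		return "unsupported";
	default:
		return "unknown";
	}
}

/* Hypothetical helper: true only when misaligned accesses are reported fast. */
static bool scalar_misaligned_is_fast(uint64_t cpuperf0)
{
	return (cpuperf0 & MISALIGNED_MASK) == MISALIGNED_FAST;
}
```

Note that this only classifies whatever the key reports; it says nothing
about how the kernel arrived at that value.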

Cheers,
Conor.

> 
> [2]
> LINK: https://github.com/torvalds/linux/blob/98b1cc82c4affc16f5598d4fa14b1858671b2263/arch/riscv/kernel/cpufeature.c#L575
> 
> I will still use `Zicclsm` checking in this stage for reviewing. And I will create qemu
> branch with Zicclsm enabled feature for testing.
> 
> -Jerry
Eric Biggers Nov. 21, 2023, 11:37 p.m. UTC | #4
On Tue, Nov 21, 2023 at 01:14:47PM +0000, Conor Dooley wrote:
> On Tue, Nov 21, 2023 at 06:55:07PM +0800, Jerry Shih wrote:
> > On Nov 21, 2023, at 03:18, Eric Biggers <ebiggers@kernel.org> wrote:
> > > First, I can see your updated patchset at branch
> > > "dev/jerrys/vector-crypto-upstream-v2" of https://github.com/JerryShih/linux,
> > > but I haven't seen it on the mailing list yet.  Are you planning to send it out?
> > 
> > I will send it out soon.
> > 
> > > Second, with your updated patchset, I'm not seeing any of the RISC-V optimized
> > > algorithms be registered when I boot the kernel in QEMU.  This is caused by the
> > > new check 'riscv_isa_extension_available(NULL, ZICCLSM)' not passing.  Is
> > > checking for "Zicclsm" the correct way to determine whether unaligned memory
> > > accesses are supported?
> > > 
> > > I'm using 'qemu-system-riscv64 -cpu max -machine virt', with the very latest
> > > QEMU commit (af9264da80073435), so it should have all the CPU features.
> > > 
> > > - Eric
> > 
> > Sorry, I just use my `internal` qemu with vector-crypto and rva22 patches.
> > 
> > The public qemu haven't supported rva22 profiles. Here is the qemu patch[1] for
> > that. But here is the discussion why the qemu doesn't export these
> > `named extensions`(e.g. Zicclsm).
> > I try to add Zicclsm in DT in the v2 patch set. Maybe we will have more discussion
> > about the rva22 profiles in kernel DT.
> 
> Please do, that'll be fun! Please take some time to read what the
> profiles spec actually defines Zicclsm for before you send those patches
> though. I think you might come to find you have misunderstood what it
> means - certainly I did the first time I saw it!
> 
> > [1]
> > LINK: https://lore.kernel.org/all/d1d6f2dc-55b2-4dce-a48a-4afbbf6df526@ventanamicro.com/#t
> > 
> > I don't know whether it's a good practice to check unaligned access using
> > `Zicclsm`. 
> > 
> > Here is another related cpu feature for unaligned access:
> > RISCV_HWPROBE_MISALIGNED_*
> > But it looks like it always be initialized with `RISCV_HWPROBE_MISALIGNED_SLOW`[2].
> > It implies that linux kernel always supports unaligned access. But we have the
> > actual HW which doesn't support unaligned access for vector unit.
> 
> https://docs.kernel.org/arch/riscv/uabi.html#misaligned-accesses
> 
> Misaligned accesses are part of the user ABI & the hwprobe stuff for
> that allows userspace to figure out whether they're fast (likely
> implemented in hardware), slow (likely emulated in firmware) or emulated
> in the kernel.
> 
> Cheers,
> Conor.
> 
> > 
> > [2]
> > LINK: https://github.com/torvalds/linux/blob/98b1cc82c4affc16f5598d4fa14b1858671b2263/arch/riscv/kernel/cpufeature.c#L575
> > 
> > I will still use `Zicclsm` checking in this stage for reviewing. And I will create qemu
> > branch with Zicclsm enabled feature for testing.
> > 

According to https://github.com/riscv/riscv-profiles/blob/main/profiles.adoc,
Zicclsm means that "main memory supports misaligned loads/stores", but they
"might execute extremely slowly."

In general, the vector crypto routines that Jerry is adding assume that
misaligned vector loads/stores are supported *and* are fast.  I think the kernel
mustn't register those algorithms if that isn't the case.  Zicclsm sounds like
the wrong thing to check.  Maybe RISCV_HWPROBE_MISALIGNED_FAST is the right
thing to check?

BTW, something else I was wondering about is endianness.  Most of the vector
crypto routines also assume little endian byte order, but I don't see that being
explicitly checked for anywhere.  Should it be?
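
To illustrate what the byte order assumption means in practice, here is a
small standalone C sketch. It is not taken from the patch; load_le32() and
host_is_little_endian() are hypothetical names. The first assembles a
32-bit word from bytes in little-endian order regardless of the host, and
the second reports whether a plain native load would give the same result.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: read a little-endian u32 from a possibly unaligned pointer. */
static uint32_t load_le32(const uint8_t *p)
{
	return (uint32_t)p[0] |
	       (uint32_t)p[1] << 8 |
	       (uint32_t)p[2] << 16 |
	       (uint32_t)p[3] << 24;
}

/* Hypothetical helper: true if the lowest-addressed byte holds the low bits. */
static bool host_is_little_endian(void)
{
	const uint16_t probe = 1;

	return *(const uint8_t *)&probe == 1;
}
```

Code that reads words directly from a byte buffer is only equivalent to
load_le32() when host_is_little_endian() is true.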

- Eric
Conor Dooley Nov. 22, 2023, 12:39 a.m. UTC | #5
On Tue, Nov 21, 2023 at 03:37:43PM -0800, Eric Biggers wrote:
> On Tue, Nov 21, 2023 at 01:14:47PM +0000, Conor Dooley wrote:
> > On Tue, Nov 21, 2023 at 06:55:07PM +0800, Jerry Shih wrote:
> > > On Nov 21, 2023, at 03:18, Eric Biggers <ebiggers@kernel.org> wrote:
> > > > First, I can see your updated patchset at branch
> > > > "dev/jerrys/vector-crypto-upstream-v2" of https://github.com/JerryShih/linux,
> > > > but I haven't seen it on the mailing list yet.  Are you planning to send it out?
> > > 
> > > I will send it out soon.
> > > 
> > > > Second, with your updated patchset, I'm not seeing any of the RISC-V optimized
> > > > algorithms be registered when I boot the kernel in QEMU.  This is caused by the
> > > > new check 'riscv_isa_extension_available(NULL, ZICCLSM)' not passing.  Is
> > > > checking for "Zicclsm" the correct way to determine whether unaligned memory
> > > > accesses are supported?
> > > > 
> > > > I'm using 'qemu-system-riscv64 -cpu max -machine virt', with the very latest
> > > > QEMU commit (af9264da80073435), so it should have all the CPU features.
> > > > 
> > > > - Eric
> > > 
> > > Sorry, I just use my `internal` qemu with vector-crypto and rva22 patches.
> > > 
> > > The public qemu haven't supported rva22 profiles. Here is the qemu patch[1] for
> > > that. But here is the discussion why the qemu doesn't export these
> > > `named extensions`(e.g. Zicclsm).
> > > I try to add Zicclsm in DT in the v2 patch set. Maybe we will have more discussion
> > > about the rva22 profiles in kernel DT.
> > 
> > Please do, that'll be fun! Please take some time to read what the
> > profiles spec actually defines Zicclsm for before you send those patches
> > though. I think you might come to find you have misunderstood what it
> > means - certainly I did the first time I saw it!
> > 
> > > [1]
> > > LINK: https://lore.kernel.org/all/d1d6f2dc-55b2-4dce-a48a-4afbbf6df526@ventanamicro.com/#t
> > > 
> > > I don't know whether it's a good practice to check unaligned access using
> > > `Zicclsm`. 
> > > 
> > > Here is another related cpu feature for unaligned access:
> > > RISCV_HWPROBE_MISALIGNED_*
> > > But it looks like it always be initialized with `RISCV_HWPROBE_MISALIGNED_SLOW`[2].
> > > It implies that linux kernel always supports unaligned access. But we have the
> > > actual HW which doesn't support unaligned access for vector unit.
> > 
> > https://docs.kernel.org/arch/riscv/uabi.html#misaligned-accesses
> > 
> > Misaligned accesses are part of the user ABI & the hwprobe stuff for
> > that allows userspace to figure out whether they're fast (likely
> > implemented in hardware), slow (likely emulated in firmware) or emulated
> > in the kernel.
> >
> > > [2]
> > > LINK: https://github.com/torvalds/linux/blob/98b1cc82c4affc16f5598d4fa14b1858671b2263/arch/riscv/kernel/cpufeature.c#L575
> > > 
> > > I will still use `Zicclsm` checking in this stage for reviewing. And I will create qemu
> > > branch with Zicclsm enabled feature for testing.
> > > 
> 
> According to https://github.com/riscv/riscv-profiles/blob/main/profiles.adoc,
> Zicclsm means that "main memory supports misaligned loads/stores", but they
> "might execute extremely slowly."

Check the section it is defined in - it is only defined for the RVA22U64
profile which describes "features available to user-mode execution
environments". It otherwise has no meaning, so it is not suitable for
detecting anything from within the kernel. For other operating systems
it might actually mean something, but for Linux the uABI on RISC-V
unconditionally provides what Zicclsm is intended to convey:
https://www.kernel.org/doc/html/next/riscv/uabi.html#misaligned-accesses
We could (_perhaps_) set it in /proc/cpuinfo in riscv,isa there - but a
conversation would have to be had about what these non-extension
"features" actually are & whether it makes sense to put them there.

> In general, the vector crypto routines that Jerry is adding assume that
> misaligned vector loads/stores are supported *and* are fast.  I think the kernel
> mustn't register those algorithms if that isn't the case.  Zicclsm sounds like
> the wrong thing to check.  Maybe RISCV_HWPROBE_MISALIGNED_FAST is the right
> thing to check?

It actually means something, so it is certainly better ;)
I think checking it makes sense as a good surrogate for actually knowing
whether or not the hardware supports misaligned access.

> BTW, something else I was wondering about is endianness.  Most of the vector
> crypto routines also assume little endian byte order, but I don't see that being
> explicitly checked for anywhere.  Should it be?

The RISC-V kernel only supports LE at the moment. I hope that doesn't
change tbh.

Cheers,
Conor.
Eric Biggers Nov. 22, 2023, 1:29 a.m. UTC | #6
On Thu, Oct 26, 2023 at 02:36:44AM +0800, Jerry Shih wrote:
> diff --git a/arch/riscv/crypto/chacha-riscv64-glue.c b/arch/riscv/crypto/chacha-riscv64-glue.c
> new file mode 100644
> index 000000000000..72011949f705
> --- /dev/null
> +++ b/arch/riscv/crypto/chacha-riscv64-glue.c
> @@ -0,0 +1,120 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Port of the OpenSSL ChaCha20 implementation for RISC-V 64
> + *
> + * Copyright (C) 2023 SiFive, Inc.
> + * Author: Jerry Shih <jerry.shih@sifive.com>
> + */
> +
> +#include <asm/simd.h>
> +#include <asm/vector.h>
> +#include <crypto/internal/chacha.h>
> +#include <crypto/internal/simd.h>
> +#include <crypto/internal/skcipher.h>
> +#include <linux/crypto.h>
> +#include <linux/module.h>
> +#include <linux/types.h>
> +
> +#define CHACHA_BLOCK_VALID_SIZE_MASK (~(CHACHA_BLOCK_SIZE - 1))
> +#define CHACHA_BLOCK_REMAINING_SIZE_MASK (CHACHA_BLOCK_SIZE - 1)
> +#define CHACHA_KEY_OFFSET 4
> +#define CHACHA_IV_OFFSET 12
> +
> +/* chacha20 using zvkb vector crypto extension */
> +void ChaCha20_ctr32_zvkb(u8 *out, const u8 *input, size_t len, const u32 *key,
> +			 const u32 *counter);
> +
> +static int chacha20_encrypt(struct skcipher_request *req)
> +{
> +	u32 state[CHACHA_STATE_WORDS];

This function doesn't need to create the whole state matrix on the stack, since
the underlying assembly function takes as input the key and counter, not the
state matrix.  I recommend something like the following:

diff --git a/arch/riscv/crypto/chacha-riscv64-glue.c b/arch/riscv/crypto/chacha-riscv64-glue.c
index df185d0663fcc..216b4cd9d1e01 100644
--- a/arch/riscv/crypto/chacha-riscv64-glue.c
+++ b/arch/riscv/crypto/chacha-riscv64-glue.c
@@ -16,45 +16,42 @@
 #include <linux/module.h>
 #include <linux/types.h>
 
-#define CHACHA_KEY_OFFSET 4
-#define CHACHA_IV_OFFSET 12
-
 /* chacha20 using zvkb vector crypto extension */
 asmlinkage void ChaCha20_ctr32_zvkb(u8 *out, const u8 *input, size_t len,
 				    const u32 *key, const u32 *counter);
 
 static int chacha20_encrypt(struct skcipher_request *req)
 {
-	u32 state[CHACHA_STATE_WORDS];
 	u8 block_buffer[CHACHA_BLOCK_SIZE];
 	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
 	const struct chacha_ctx *ctx = crypto_skcipher_ctx(tfm);
 	struct skcipher_walk walk;
 	unsigned int nbytes;
 	unsigned int tail_bytes;
+	u32 iv[4];
 	int err;
 
-	chacha_init_generic(state, ctx->key, req->iv);
+	iv[0] = get_unaligned_le32(req->iv);
+	iv[1] = get_unaligned_le32(req->iv + 4);
+	iv[2] = get_unaligned_le32(req->iv + 8);
+	iv[3] = get_unaligned_le32(req->iv + 12);
 
 	err = skcipher_walk_virt(&walk, req, false);
 	while (walk.nbytes) {
-		nbytes = walk.nbytes & (~(CHACHA_BLOCK_SIZE - 1));
+		nbytes = walk.nbytes & ~(CHACHA_BLOCK_SIZE - 1);
 		tail_bytes = walk.nbytes & (CHACHA_BLOCK_SIZE - 1);
 		kernel_vector_begin();
 		if (nbytes) {
 			ChaCha20_ctr32_zvkb(walk.dst.virt.addr,
 					    walk.src.virt.addr, nbytes,
-					    state + CHACHA_KEY_OFFSET,
-					    state + CHACHA_IV_OFFSET);
-			state[CHACHA_IV_OFFSET] += nbytes / CHACHA_BLOCK_SIZE;
+					    ctx->key, iv);
+			iv[0] += nbytes / CHACHA_BLOCK_SIZE;
 		}
 		if (walk.nbytes == walk.total && tail_bytes > 0) {
 			memcpy(block_buffer, walk.src.virt.addr + nbytes,
 			       tail_bytes);
 			ChaCha20_ctr32_zvkb(block_buffer, block_buffer,
-					    CHACHA_BLOCK_SIZE,
-					    state + CHACHA_KEY_OFFSET,
-					    state + CHACHA_IV_OFFSET);
+					    CHACHA_BLOCK_SIZE, ctx->key, iv);
 			memcpy(walk.dst.virt.addr + nbytes, block_buffer,
 			       tail_bytes);
 			tail_bytes = 0;
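
For reference, the arithmetic in the walk loop above can be sketched in
isolation. This is a standalone userspace illustration, not kernel code:
struct chunk_split and split_chunk() are hypothetical names, and
CHACHA_BLOCK_SIZE is redefined locally as 64. Each chunk is split into a
run of whole blocks and a tail, and the block counter advances by the
number of whole blocks processed.

```c
#include <stddef.h>

#define CHACHA_BLOCK_SIZE 64	/* local copy for this sketch */

/* Hypothetical result type: how one walk chunk is divided. */
struct chunk_split {
	size_t full_bytes;	/* bytes covered by whole blocks */
	size_t tail_bytes;	/* leftover bytes, less than one block */
	unsigned int blocks;	/* amount to add to the block counter */
};

/* Hypothetical helper: relies on the block size being a power of two. */
static struct chunk_split split_chunk(size_t len)
{
	struct chunk_split s;

	s.full_bytes = len & ~(size_t)(CHACHA_BLOCK_SIZE - 1);
	s.tail_bytes = len & (CHACHA_BLOCK_SIZE - 1);
	s.blocks = (unsigned int)(s.full_bytes / CHACHA_BLOCK_SIZE);
	return s;
}
```

For a 200-byte chunk this gives 192 full bytes, an 8-byte tail, and a
counter increment of 3.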
Jerry Shih Nov. 22, 2023, 5:37 p.m. UTC | #7
On Nov 21, 2023, at 21:14, Conor Dooley <conor.dooley@microchip.com> wrote:
> On Tue, Nov 21, 2023 at 06:55:07PM +0800, Jerry Shih wrote:
>> Sorry, I just use my `internal` qemu with vector-crypto and rva22 patches.
>> 
>> The public qemu haven't supported rva22 profiles. Here is the qemu patch[1] for
>> that. But here is the discussion why the qemu doesn't export these
>> `named extensions`(e.g. Zicclsm).
>> I try to add Zicclsm in DT in the v2 patch set. Maybe we will have more discussion
>> about the rva22 profiles in kernel DT.
> 
> Please do, that'll be fun! Please take some time to read what the
> profiles spec actually defines Zicclsm for before you send those patches
> though. I think you might come to find you have misunderstood what it
> means - certainly I did the first time I saw it!

From the rva22 profile:
  This requires misaligned support for all regular load and store instructions (including
  scalar and ``vector``)

The spec includes the explicit `vector` keyword.
So, I still think we could use Zicclsm checking for these vector-crypto implementations.

My proposed patch is just a simple patch which only updates the DT document and
the isa string parser for Zicclsm. If using the Zicclsm check is still not
recommended, I will switch to `RISCV_HWPROBE_MISALIGNED_*` instead.

>> [1]
>> LINK: https://lore.kernel.org/all/d1d6f2dc-55b2-4dce-a48a-4afbbf6df526@ventanamicro.com/#t
>> 
>> I don't know whether it's a good practice to check unaligned access using
>> `Zicclsm`. 
>> 
>> Here is another related cpu feature for unaligned access:
>> RISCV_HWPROBE_MISALIGNED_*
>> But it looks like it always be initialized with `RISCV_HWPROBE_MISALIGNED_SLOW`[2].
>> It implies that linux kernel always supports unaligned access. But we have the
>> actual HW which doesn't support unaligned access for vector unit.
> 
> https://docs.kernel.org/arch/riscv/uabi.html#misaligned-accesses
> 
> Misaligned accesses are part of the user ABI & the hwprobe stuff for
> that allows userspace to figure out whether they're fast (likely
> implemented in hardware), slow (likely emulated in firmware) or emulated
> in the kernel.

The HWPROBE_MISALIGNED_* checking function is at:
https://github.com/torvalds/linux/blob/c2d5304e6c648ebcf653bace7e51e0e6742e46c8/arch/riscv/kernel/cpufeature.c#L564-L647
The tests are all scalar; there is no `vector` test inside. So I'm not sure whether
HWPROBE_MISALIGNED_* applies to the vector unit or not.

The goal in this crypto patch is to check whether `vector` supports unaligned
access or not.

I haven't seen an emulated path for unaligned vector access in OpenSBI
or the kernel. Is unaligned vector access included in the user ABI?

Thanks,
Jerry
Palmer Dabbelt Nov. 22, 2023, 6:05 p.m. UTC | #8
On Wed, 22 Nov 2023 09:37:33 PST (-0800), jerry.shih@sifive.com wrote:
> On Nov 21, 2023, at 21:14, Conor Dooley <conor.dooley@microchip.com> wrote:
>> On Tue, Nov 21, 2023 at 06:55:07PM +0800, Jerry Shih wrote:
>>> Sorry, I just use my `internal` qemu with vector-crypto and rva22 patches.
>>> 
>>> The public qemu haven't supported rva22 profiles. Here is the qemu patch[1] for
>>> that. But here is the discussion why the qemu doesn't export these
>>> `named extensions`(e.g. Zicclsm).
>>> I try to add Zicclsm in DT in the v2 patch set. Maybe we will have more discussion
>>> about the rva22 profiles in kernel DT.
>> 
>> Please do, that'll be fun! Please take some time to read what the
>> profiles spec actually defines Zicclsm for before you send those patches
>> though. I think you might come to find you have misunderstood what it
>> means - certainly I did the first time I saw it!
>
> From the rva22 profile:
>   This requires misaligned support for all regular load and store instructions (including
>   scalar and ``vector``)
>
> The spec includes the explicit `vector` keyword.
> So, I still think we could use Zicclsm checking for these vector-crypto implementations.
>
> My proposed patch is just a simple patch which only update the DT document and
> update the isa string parser for Zicclsm. If it's still not recommend to use Zicclsm
> checking, I will turn to use `RISCV_HWPROBE_MISALIGNED_*` instead.

IMO that's the way to go: even if these are required to be supported by 
Zicclsm, we still need to deal with the performance implications.

>>> [1]
>>> LINK: https://lore.kernel.org/all/d1d6f2dc-55b2-4dce-a48a-4afbbf6df526@ventanamicro.com/#t
>>> 
>>> I don't know whether it's a good practice to check unaligned access using
>>> `Zicclsm`. 
>>> 
>>> Here is another related cpu feature for unaligned access:
>>> RISCV_HWPROBE_MISALIGNED_*
>>> But it looks like it always be initialized with `RISCV_HWPROBE_MISALIGNED_SLOW`[2].
>>> It implies that linux kernel always supports unaligned access. But we have the
>>> actual HW which doesn't support unaligned access for vector unit.
>> 
>> https://docs.kernel.org/arch/riscv/uabi.html#misaligned-accesses
>> 
>> Misaligned accesses are part of the user ABI & the hwprobe stuff for
>> that allows userspace to figure out whether they're fast (likely
>> implemented in hardware), slow (likely emulated in firmware) or emulated
>> in the kernel.
>
> The HWPROBE_MISALIGNED_* checking function is at:
> https://github.com/torvalds/linux/blob/c2d5304e6c648ebcf653bace7e51e0e6742e46c8/arch/riscv/kernel/cpufeature.c#L564-L647
> The tests are all scalar. No `vector` test inside. So, I'm not sure the
> HWPROBE_MISALIGNED_* is related to vector unit or not.
>
> The goal is to check whether `vector` support unaligned access or not
> in this crypto patch.
>
> I haven't seen the emulated path for unaligned-vector-access in OpenSBI
> and kernel. Is the unaligned-vector-access included in user ABI?

I guess it's kind of a grey area, but I'd argue that it is: we merged 
support for V when the only implementation (ie, QEMU) supported 
misaligned accesses, so we're stuck with that being the de facto 
behavior.  As part of adding support for the K230 we'll need to then add 
the kernel-mode vector misaligned access handlers, but that doesn't seem 
so hard.

So I'd say we should update the hwprobe docs to say that key only 
reflects scalar accesses (or maybe even just integer accesses?  that's 
all we're testing for) -- essentially just make the documentation match 
the implementation, as that'll keep ABI compatibility.  Then we can add 
a new key for vector misaligned access performance.

>
> Thanks,
> Jerry
Conor Dooley Nov. 22, 2023, 6:20 p.m. UTC | #9
On Thu, Nov 23, 2023 at 01:37:33AM +0800, Jerry Shih wrote:
> On Nov 21, 2023, at 21:14, Conor Dooley <conor.dooley@microchip.com> wrote:
> > On Tue, Nov 21, 2023 at 06:55:07PM +0800, Jerry Shih wrote:
> >> Sorry, I just use my `internal` qemu with vector-crypto and rva22 patches.
> >> 
> >> The public qemu haven't supported rva22 profiles. Here is the qemu patch[1] for
> >> that. But here is the discussion why the qemu doesn't export these
> >> `named extensions`(e.g. Zicclsm).
> >> I try to add Zicclsm in DT in the v2 patch set. Maybe we will have more discussion
> >> about the rva22 profiles in kernel DT.
> > 
> > Please do, that'll be fun! Please take some time to read what the
> > profiles spec actually defines Zicclsm for before you send those patches
> > though. I think you might come to find you have misunderstood what it
> > means - certainly I did the first time I saw it!
> 
> From the rva22 profile:

"rva22" is not a profile. As I pointed out to Eric, this is defined in
the RVA22U64 profile (and the RVA20U64 one, but that is effectively a
moot point). The profile descriptions for these only specify "the ISA
features available to user-mode execution environments", so it is not
suitable for use in any other context.

>   This requires misaligned support for all regular load and store instructions (including
>   scalar and ``vector``)
> 
> The spec includes the explicit `vector` keyword.
> So, I still think we could use Zicclsm checking for these vector-crypto implementations.

In userspace, if Zicclsm was exported somewhere, that would be a valid
argument. Even for userspace, the hwprobe flags probably provide more
information though, since the firmware emulation is insanely slow.

> My proposed patch is just a simple patch which only update the DT document and
> update the isa string parser for Zicclsm.

Zicclsm has no meaning outside of user mode, so it's not suitable for
use in that context. Other "features" defined in the profiles spec might
be suitable for inclusion, but it'll be a case-by-case basis.

> If it's still not recommend to use Zicclsm
> checking, I will turn to use `RISCV_HWPROBE_MISALIGNED_*` instead.

Palmer has commented on the rest, so no need for me :)
Jerry Shih Nov. 22, 2023, 7:05 p.m. UTC | #10
On Nov 23, 2023, at 02:20, Conor Dooley <conor@kernel.org> wrote:
> On Thu, Nov 23, 2023 at 01:37:33AM +0800, Jerry Shih wrote:
>> On Nov 21, 2023, at 21:14, Conor Dooley <conor.dooley@microchip.com> wrote:
>>> On Tue, Nov 21, 2023 at 06:55:07PM +0800, Jerry Shih wrote:
>>>> Sorry, I just use my `internal` qemu with vector-crypto and rva22 patches.
>>>> 
>>>> The public qemu haven't supported rva22 profiles. Here is the qemu patch[1] for
>>>> that. But here is the discussion why the qemu doesn't export these
>>>> `named extensions`(e.g. Zicclsm).
>>>> I try to add Zicclsm in DT in the v2 patch set. Maybe we will have more discussion
>>>> about the rva22 profiles in kernel DT.
>>> 
>>> Please do, that'll be fun! Please take some time to read what the
>>> profiles spec actually defines Zicclsm for before you send those patches
>>> though. I think you might come to find you have misunderstood what it
>>> means - certainly I did the first time I saw it!
>> 
>> From the rva22 profile:
> 
> "rva22" is not a profile. As I pointed out to Eric, this is defined in
> the RVA22U64 profile (and the RVA20U64 one, but that is effectively a
> moot point). The profile descriptions for these only specify "the ISA
> features available to user-mode execution environments", so it is not
> suitable for use in any other context.

I missed that important part: it's for user space.
Thx.

>>  This requires misaligned support for all regular load and store instructions (including
>>  scalar and ``vector``)
>> 
>> The spec includes the explicit `vector` keyword.
>> So, I still think we could use Zicclsm checking for these vector-crypto implementations.
> 
> In userspace, if Zicclsm was exported somewhere, that would be a valid
> argument. Even for userspace, the hwprobe flags probably provide more
> information though, since the firmware emulation is insanely slow.

I agree. It will be more useful to have a flag like `VECTOR_MISALIGNED_FAST`
instead.

>> My proposed patch is just a simple patch which only update the DT document and
>> update the isa string parser for Zicclsm.
> 
> Zicclsm has no meaning outside of user mode, so it's not suitable for
> use in that context. Other "features" defined in the profiles spec might
> be suitable for inclusion, but it'll be a case-by-case basis.

I will skip the Zicclsm part in my v2 patch.

>> If it's still not recommend to use Zicclsm
>> checking, I will turn to use `RISCV_HWPROBE_MISALIGNED_*` instead.
> 
> Palmer has commented on the rest, so no need for me :)

All crypto algorithms will assume that the vector unit supports misaligned access in the
next v2 patch.
The algorithms will also not check `RISCV_HWPROBE_MISALIGNED_*`, since
it only reflects scalar accesses.
Once we have a vector performance flag, we can come back and use it.

-Jerry

Patch

diff --git a/arch/riscv/crypto/Kconfig b/arch/riscv/crypto/Kconfig
index 2797b37394bb..41ce453afafa 100644
--- a/arch/riscv/crypto/Kconfig
+++ b/arch/riscv/crypto/Kconfig
@@ -35,6 +35,18 @@  config CRYPTO_AES_BLOCK_RISCV64
 	  - Zvkg vector crypto extension (XTS)
 	  - Zvkned vector crypto extension
 
+config CRYPTO_CHACHA20_RISCV64
+	default y if RISCV_ISA_V
+	tristate "Ciphers: ChaCha20"
+	depends on 64BIT && RISCV_ISA_V
+	select CRYPTO_SKCIPHER
+	select CRYPTO_LIB_CHACHA_GENERIC
+	help
+	  Length-preserving ciphers: ChaCha20 stream cipher algorithm
+
+	  Architecture: riscv64 using:
+	  - Zvkb vector crypto extension
+
 config CRYPTO_GHASH_RISCV64
 	default y if RISCV_ISA_V
 	tristate "Hash functions: GHASH"
diff --git a/arch/riscv/crypto/Makefile b/arch/riscv/crypto/Makefile
index b772417703fd..80b0ebc956a3 100644
--- a/arch/riscv/crypto/Makefile
+++ b/arch/riscv/crypto/Makefile
@@ -9,6 +9,9 @@  aes-riscv64-y := aes-riscv64-glue.o aes-riscv64-zvkned.o
 obj-$(CONFIG_CRYPTO_AES_BLOCK_RISCV64) += aes-block-riscv64.o
 aes-block-riscv64-y := aes-riscv64-block-mode-glue.o aes-riscv64-zvbb-zvkg-zvkned.o aes-riscv64-zvkb-zvkned.o
 
+obj-$(CONFIG_CRYPTO_CHACHA20_RISCV64) += chacha-riscv64.o
+chacha-riscv64-y := chacha-riscv64-glue.o chacha-riscv64-zvkb.o
+
 obj-$(CONFIG_CRYPTO_GHASH_RISCV64) += ghash-riscv64.o
 ghash-riscv64-y := ghash-riscv64-glue.o ghash-riscv64-zvkg.o
 
@@ -36,6 +39,9 @@  $(obj)/aes-riscv64-zvbb-zvkg-zvkned.S: $(src)/aes-riscv64-zvbb-zvkg-zvkned.pl
 $(obj)/aes-riscv64-zvkb-zvkned.S: $(src)/aes-riscv64-zvkb-zvkned.pl
 	$(call cmd,perlasm)
 
+$(obj)/chacha-riscv64-zvkb.S: $(src)/chacha-riscv64-zvkb.pl
+	$(call cmd,perlasm)
+
 $(obj)/ghash-riscv64-zvkg.S: $(src)/ghash-riscv64-zvkg.pl
 	$(call cmd,perlasm)
 
@@ -54,6 +60,7 @@  $(obj)/sm4-riscv64-zvksed.S: $(src)/sm4-riscv64-zvksed.pl
 clean-files += aes-riscv64-zvkned.S
 clean-files += aes-riscv64-zvbb-zvkg-zvkned.S
 clean-files += aes-riscv64-zvkb-zvkned.S
+clean-files += chacha-riscv64-zvkb.S
 clean-files += ghash-riscv64-zvkg.S
 clean-files += sha256-riscv64-zvkb-zvknha_or_zvknhb.S
 clean-files += sha512-riscv64-zvkb-zvknhb.S
diff --git a/arch/riscv/crypto/chacha-riscv64-glue.c b/arch/riscv/crypto/chacha-riscv64-glue.c
new file mode 100644
index 000000000000..72011949f705
--- /dev/null
+++ b/arch/riscv/crypto/chacha-riscv64-glue.c
@@ -0,0 +1,120 @@ 
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Port of the OpenSSL ChaCha20 implementation for RISC-V 64
+ *
+ * Copyright (C) 2023 SiFive, Inc.
+ * Author: Jerry Shih <jerry.shih@sifive.com>
+ */
+
+#include <asm/simd.h>
+#include <asm/vector.h>
+#include <crypto/internal/chacha.h>
+#include <crypto/internal/simd.h>
+#include <crypto/internal/skcipher.h>
+#include <linux/crypto.h>
+#include <linux/module.h>
+#include <linux/types.h>
+
+#define CHACHA_BLOCK_VALID_SIZE_MASK (~(CHACHA_BLOCK_SIZE - 1))
+#define CHACHA_BLOCK_REMAINING_SIZE_MASK (CHACHA_BLOCK_SIZE - 1)
+#define CHACHA_KEY_OFFSET 4
+#define CHACHA_IV_OFFSET 12
+
+/* chacha20 using zvkb vector crypto extension */
+void ChaCha20_ctr32_zvkb(u8 *out, const u8 *input, size_t len, const u32 *key,
+			 const u32 *counter);
+
+static int chacha20_encrypt(struct skcipher_request *req)
+{
+	u32 state[CHACHA_STATE_WORDS];
+	u8 block_buffer[CHACHA_BLOCK_SIZE];
+	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+	const struct chacha_ctx *ctx = crypto_skcipher_ctx(tfm);
+	struct skcipher_walk walk;
+	unsigned int nbytes;
+	unsigned int tail_bytes;
+	int err;
+
+	chacha_init_generic(state, ctx->key, req->iv);
+
+	err = skcipher_walk_virt(&walk, req, false);
+	while (walk.nbytes) {
+		nbytes = walk.nbytes & CHACHA_BLOCK_VALID_SIZE_MASK;
+		tail_bytes = walk.nbytes & CHACHA_BLOCK_REMAINING_SIZE_MASK;
+		kernel_vector_begin();
+		if (nbytes) {
+			ChaCha20_ctr32_zvkb(walk.dst.virt.addr,
+					    walk.src.virt.addr, nbytes,
+					    state + CHACHA_KEY_OFFSET,
+					    state + CHACHA_IV_OFFSET);
+			state[CHACHA_IV_OFFSET] += nbytes / CHACHA_BLOCK_SIZE;
+		}
+		if (walk.nbytes == walk.total && tail_bytes > 0) {
+			memcpy(block_buffer, walk.src.virt.addr + nbytes,
+			       tail_bytes);
+			ChaCha20_ctr32_zvkb(block_buffer, block_buffer,
+					    CHACHA_BLOCK_SIZE,
+					    state + CHACHA_KEY_OFFSET,
+					    state + CHACHA_IV_OFFSET);
+			memcpy(walk.dst.virt.addr + nbytes, block_buffer,
+			       tail_bytes);
+			tail_bytes = 0;
+		}
+		kernel_vector_end();
+
+		err = skcipher_walk_done(&walk, tail_bytes);
+	}
+
+	return err;
+}
+
+static struct skcipher_alg riscv64_chacha_alg_zvkb[] = { {
+	.base = {
+		.cra_name = "chacha20",
+		.cra_driver_name = "chacha20-riscv64-zvkb",
+		.cra_priority = 300,
+		.cra_blocksize = 1,
+		.cra_ctxsize = sizeof(struct chacha_ctx),
+		.cra_module = THIS_MODULE,
+	},
+	.min_keysize = CHACHA_KEY_SIZE,
+	.max_keysize = CHACHA_KEY_SIZE,
+	.ivsize = CHACHA_IV_SIZE,
+	.chunksize = CHACHA_BLOCK_SIZE,
+	.walksize = CHACHA_BLOCK_SIZE * 4,
+	.setkey = chacha20_setkey,
+	.encrypt = chacha20_encrypt,
+	.decrypt = chacha20_encrypt,
+} };
+
+static inline bool check_chacha20_ext(void)
+{
+	return riscv_isa_extension_available(NULL, ZVKB) &&
+	       riscv_vector_vlen() >= 128;
+}
+
+static int __init riscv64_chacha_mod_init(void)
+{
+	if (check_chacha20_ext())
+		return crypto_register_skciphers(
+			riscv64_chacha_alg_zvkb,
+			ARRAY_SIZE(riscv64_chacha_alg_zvkb));
+
+	return -ENODEV;
+}
+
+static void __exit riscv64_chacha_mod_fini(void)
+{
+	if (check_chacha20_ext())
+		crypto_unregister_skciphers(
+			riscv64_chacha_alg_zvkb,
+			ARRAY_SIZE(riscv64_chacha_alg_zvkb));
+}
+
+module_init(riscv64_chacha_mod_init);
+module_exit(riscv64_chacha_mod_fini);
+
+MODULE_DESCRIPTION("ChaCha20 (RISC-V accelerated)");
+MODULE_AUTHOR("Jerry Shih <jerry.shih@sifive.com>");
+MODULE_LICENSE("GPL");
+MODULE_ALIAS_CRYPTO("chacha20");
diff --git a/arch/riscv/crypto/chacha-riscv64-zvkb.pl b/arch/riscv/crypto/chacha-riscv64-zvkb.pl
new file mode 100644
index 000000000000..9caf7b247804
--- /dev/null
+++ b/arch/riscv/crypto/chacha-riscv64-zvkb.pl
@@ -0,0 +1,322 @@ 
+#! /usr/bin/env perl
+# SPDX-License-Identifier: Apache-2.0 OR BSD-2-Clause
+#
+# This file is dual-licensed, meaning that you can use it under your
+# choice of either of the following two licenses:
+#
+# Copyright 2023-2023 The OpenSSL Project Authors. All Rights Reserved.
+#
+# Licensed under the Apache License 2.0 (the "License").  You may not use
+# this file except in compliance with the License.  You can obtain a copy
+# in the file LICENSE in the source distribution or at
+# https://www.openssl.org/source/license.html
+#
+# or
+#
+# Copyright (c) 2023, Jerry Shih <jerry.shih@sifive.com>
+# All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+# 1. Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+# 2. Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+# "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+# LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+# A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+# OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+# DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+# THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+# - RV64I
+# - RISC-V Vector ('V') with VLEN >= 128
+# - RISC-V Vector Cryptography Bit-manipulation extension ('Zvkb')
+# - RISC-V Zicclsm (Main memory supports misaligned loads/stores)
+
+use strict;
+use warnings;
+
+use FindBin qw($Bin);
+use lib "$Bin";
+use lib "$Bin/../../perlasm";
+use riscv;
+
+# $output is the last argument if it looks like a file (it has an extension)
+# $flavour is the first argument if it doesn't look like a file
+my $output  = $#ARGV >= 0 && $ARGV[$#ARGV] =~ m|\.\w+$| ? pop   : undef;
+my $flavour = $#ARGV >= 0 && $ARGV[0] !~ m|\.|          ? shift : undef;
+
+$output and open STDOUT, ">$output";
+
+my $code = <<___;
+.text
+___
+
+# void ChaCha20_ctr32_zvkb(unsigned char *out, const unsigned char *inp,
+#                          size_t len, const unsigned int key[8],
+#                          const unsigned int counter[4]);
+################################################################################
+my ( $OUTPUT, $INPUT, $LEN, $KEY, $COUNTER ) = ( "a0", "a1", "a2", "a3", "a4" );
+my ( $T0 ) = ( "t0" );
+my ( $CONST_DATA0, $CONST_DATA1, $CONST_DATA2, $CONST_DATA3 ) =
+  ( "a5", "a6", "a7", "t1" );
+my ( $KEY0, $KEY1, $KEY2,$KEY3, $KEY4, $KEY5, $KEY6, $KEY7,
+     $COUNTER0, $COUNTER1, $NONCE0, $NONCE1
+) = ( "s0", "s1", "s2", "s3", "s4", "s5", "s6",
+    "s7", "s8", "s9", "s10", "s11" );
+my ( $VL, $STRIDE, $CHACHA_LOOP_COUNT ) = ( "t2", "t3", "t4" );
+my (
+    $V0,  $V1,  $V2,  $V3,  $V4,  $V5,  $V6,  $V7,  $V8,  $V9,  $V10,
+    $V11, $V12, $V13, $V14, $V15, $V16, $V17, $V18, $V19, $V20, $V21,
+    $V22, $V23, $V24, $V25, $V26, $V27, $V28, $V29, $V30, $V31,
+) = map( "v$_", ( 0 .. 31 ) );
+
+sub chacha_quad_round_group {
+    my (
+        $A0, $B0, $C0, $D0, $A1, $B1, $C1, $D1,
+        $A2, $B2, $C2, $D2, $A3, $B3, $C3, $D3
+    ) = @_;
+
+    my $code = <<___;
+    # a += b; d ^= a; d <<<= 16;
+    @{[vadd_vv $A0, $A0, $B0]}
+    @{[vadd_vv $A1, $A1, $B1]}
+    @{[vadd_vv $A2, $A2, $B2]}
+    @{[vadd_vv $A3, $A3, $B3]}
+    @{[vxor_vv $D0, $D0, $A0]}
+    @{[vxor_vv $D1, $D1, $A1]}
+    @{[vxor_vv $D2, $D2, $A2]}
+    @{[vxor_vv $D3, $D3, $A3]}
+    @{[vror_vi $D0, $D0, 32 - 16]}
+    @{[vror_vi $D1, $D1, 32 - 16]}
+    @{[vror_vi $D2, $D2, 32 - 16]}
+    @{[vror_vi $D3, $D3, 32 - 16]}
+    # c += d; b ^= c; b <<<= 12;
+    @{[vadd_vv $C0, $C0, $D0]}
+    @{[vadd_vv $C1, $C1, $D1]}
+    @{[vadd_vv $C2, $C2, $D2]}
+    @{[vadd_vv $C3, $C3, $D3]}
+    @{[vxor_vv $B0, $B0, $C0]}
+    @{[vxor_vv $B1, $B1, $C1]}
+    @{[vxor_vv $B2, $B2, $C2]}
+    @{[vxor_vv $B3, $B3, $C3]}
+    @{[vror_vi $B0, $B0, 32 - 12]}
+    @{[vror_vi $B1, $B1, 32 - 12]}
+    @{[vror_vi $B2, $B2, 32 - 12]}
+    @{[vror_vi $B3, $B3, 32 - 12]}
+    # a += b; d ^= a; d <<<= 8;
+    @{[vadd_vv $A0, $A0, $B0]}
+    @{[vadd_vv $A1, $A1, $B1]}
+    @{[vadd_vv $A2, $A2, $B2]}
+    @{[vadd_vv $A3, $A3, $B3]}
+    @{[vxor_vv $D0, $D0, $A0]}
+    @{[vxor_vv $D1, $D1, $A1]}
+    @{[vxor_vv $D2, $D2, $A2]}
+    @{[vxor_vv $D3, $D3, $A3]}
+    @{[vror_vi $D0, $D0, 32 - 8]}
+    @{[vror_vi $D1, $D1, 32 - 8]}
+    @{[vror_vi $D2, $D2, 32 - 8]}
+    @{[vror_vi $D3, $D3, 32 - 8]}
+    # c += d; b ^= c; b <<<= 7;
+    @{[vadd_vv $C0, $C0, $D0]}
+    @{[vadd_vv $C1, $C1, $D1]}
+    @{[vadd_vv $C2, $C2, $D2]}
+    @{[vadd_vv $C3, $C3, $D3]}
+    @{[vxor_vv $B0, $B0, $C0]}
+    @{[vxor_vv $B1, $B1, $C1]}
+    @{[vxor_vv $B2, $B2, $C2]}
+    @{[vxor_vv $B3, $B3, $C3]}
+    @{[vror_vi $B0, $B0, 32 - 7]}
+    @{[vror_vi $B1, $B1, 32 - 7]}
+    @{[vror_vi $B2, $B2, 32 - 7]}
+    @{[vror_vi $B3, $B3, 32 - 7]}
+___
+
+    return $code;
+}
+
+$code .= <<___;
+.p2align 3
+.globl ChaCha20_ctr32_zvkb
+.type ChaCha20_ctr32_zvkb,\@function
+ChaCha20_ctr32_zvkb:
+    srli $LEN, $LEN, 6
+    beqz $LEN, .Lend
+
+    addi sp, sp, -96
+    sd s0, 0(sp)
+    sd s1, 8(sp)
+    sd s2, 16(sp)
+    sd s3, 24(sp)
+    sd s4, 32(sp)
+    sd s5, 40(sp)
+    sd s6, 48(sp)
+    sd s7, 56(sp)
+    sd s8, 64(sp)
+    sd s9, 72(sp)
+    sd s10, 80(sp)
+    sd s11, 88(sp)
+
+    li $STRIDE, 64
+
+    #### chacha block data
+    # "expa" little endian
+    li $CONST_DATA0, 0x61707865
+    # "nd 3" little endian
+    li $CONST_DATA1, 0x3320646e
+    # "2-by" little endian
+    li $CONST_DATA2, 0x79622d32
+    # "te k" little endian
+    li $CONST_DATA3, 0x6b206574
+
+    lw $KEY0, 0($KEY)
+    lw $KEY1, 4($KEY)
+    lw $KEY2, 8($KEY)
+    lw $KEY3, 12($KEY)
+    lw $KEY4, 16($KEY)
+    lw $KEY5, 20($KEY)
+    lw $KEY6, 24($KEY)
+    lw $KEY7, 28($KEY)
+
+    lw $COUNTER0, 0($COUNTER)
+    lw $COUNTER1, 4($COUNTER)
+    lw $NONCE0, 8($COUNTER)
+    lw $NONCE1, 12($COUNTER)
+
+.Lblock_loop:
+    @{[vsetvli $VL, $LEN, "e32", "m1", "ta", "ma"]}
+
+    # init chacha const states
+    @{[vmv_v_x $V0, $CONST_DATA0]}
+    @{[vmv_v_x $V1, $CONST_DATA1]}
+    @{[vmv_v_x $V2, $CONST_DATA2]}
+    @{[vmv_v_x $V3, $CONST_DATA3]}
+
+    # init chacha key states
+    @{[vmv_v_x $V4, $KEY0]}
+    @{[vmv_v_x $V5, $KEY1]}
+    @{[vmv_v_x $V6, $KEY2]}
+    @{[vmv_v_x $V7, $KEY3]}
+    @{[vmv_v_x $V8, $KEY4]}
+    @{[vmv_v_x $V9, $KEY5]}
+    @{[vmv_v_x $V10, $KEY6]}
+    @{[vmv_v_x $V11, $KEY7]}
+
+    # init chacha counter states
+    @{[vid_v $V12]}
+    @{[vadd_vx $V12, $V12, $COUNTER0]}
+    @{[vmv_v_x $V13, $COUNTER1]}
+
+    # init chacha nonce states
+    @{[vmv_v_x $V14, $NONCE0]}
+    @{[vmv_v_x $V15, $NONCE1]}
+
+    # load the top-half of input data
+    @{[vlsseg_nf_e32_v 8, $V16, $INPUT, $STRIDE]}
+
+    li $CHACHA_LOOP_COUNT, 10
+.Lround_loop:
+    addi $CHACHA_LOOP_COUNT, $CHACHA_LOOP_COUNT, -1
+    @{[chacha_quad_round_group
+      $V0, $V4, $V8, $V12,
+      $V1, $V5, $V9, $V13,
+      $V2, $V6, $V10, $V14,
+      $V3, $V7, $V11, $V15]}
+    @{[chacha_quad_round_group
+      $V0, $V5, $V10, $V15,
+      $V1, $V6, $V11, $V12,
+      $V2, $V7, $V8, $V13,
+      $V3, $V4, $V9, $V14]}
+    bnez $CHACHA_LOOP_COUNT, .Lround_loop
+
+    # load the bottom-half of input data
+    addi $T0, $INPUT, 32
+    @{[vlsseg_nf_e32_v 8, $V24, $T0, $STRIDE]}
+
+    # add chacha top-half initial block states
+    @{[vadd_vx $V0, $V0, $CONST_DATA0]}
+    @{[vadd_vx $V1, $V1, $CONST_DATA1]}
+    @{[vadd_vx $V2, $V2, $CONST_DATA2]}
+    @{[vadd_vx $V3, $V3, $CONST_DATA3]}
+    @{[vadd_vx $V4, $V4, $KEY0]}
+    @{[vadd_vx $V5, $V5, $KEY1]}
+    @{[vadd_vx $V6, $V6, $KEY2]}
+    @{[vadd_vx $V7, $V7, $KEY3]}
+    # xor with the top-half input
+    @{[vxor_vv $V16, $V16, $V0]}
+    @{[vxor_vv $V17, $V17, $V1]}
+    @{[vxor_vv $V18, $V18, $V2]}
+    @{[vxor_vv $V19, $V19, $V3]}
+    @{[vxor_vv $V20, $V20, $V4]}
+    @{[vxor_vv $V21, $V21, $V5]}
+    @{[vxor_vv $V22, $V22, $V6]}
+    @{[vxor_vv $V23, $V23, $V7]}
+
+    # save the top-half of output
+    @{[vssseg_nf_e32_v 8, $V16, $OUTPUT, $STRIDE]}
+
+    # add chacha bottom-half initial block states
+    @{[vadd_vx $V8, $V8, $KEY4]}
+    @{[vadd_vx $V9, $V9, $KEY5]}
+    @{[vadd_vx $V10, $V10, $KEY6]}
+    @{[vadd_vx $V11, $V11, $KEY7]}
+    @{[vid_v $V0]}
+    @{[vadd_vx $V12, $V12, $COUNTER0]}
+    @{[vadd_vx $V13, $V13, $COUNTER1]}
+    @{[vadd_vx $V14, $V14, $NONCE0]}
+    @{[vadd_vx $V15, $V15, $NONCE1]}
+    @{[vadd_vv $V12, $V12, $V0]}
+    # xor with the bottom-half input
+    @{[vxor_vv $V24, $V24, $V8]}
+    @{[vxor_vv $V25, $V25, $V9]}
+    @{[vxor_vv $V26, $V26, $V10]}
+    @{[vxor_vv $V27, $V27, $V11]}
+    @{[vxor_vv $V28, $V28, $V12]}
+    @{[vxor_vv $V29, $V29, $V13]}
+    @{[vxor_vv $V30, $V30, $V14]}
+    @{[vxor_vv $V31, $V31, $V15]}
+
+    # save the bottom-half of output
+    addi $T0, $OUTPUT, 32
+    @{[vssseg_nf_e32_v 8, $V24, $T0, $STRIDE]}
+
+    # update counter
+    add $COUNTER0, $COUNTER0, $VL
+    sub $LEN, $LEN, $VL
+    # advance the input/output pointers by `4 * 16 * VL = 64 * VL` bytes
+    slli $T0, $VL, 6
+    add $INPUT, $INPUT, $T0
+    add $OUTPUT, $OUTPUT, $T0
+    bnez $LEN, .Lblock_loop
+
+    ld s0, 0(sp)
+    ld s1, 8(sp)
+    ld s2, 16(sp)
+    ld s3, 24(sp)
+    ld s4, 32(sp)
+    ld s5, 40(sp)
+    ld s6, 48(sp)
+    ld s7, 56(sp)
+    ld s8, 64(sp)
+    ld s9, 72(sp)
+    ld s10, 80(sp)
+    ld s11, 88(sp)
+    addi sp, sp, 96
+
+.Lend:
+    ret
+.size ChaCha20_ctr32_zvkb,.-ChaCha20_ctr32_zvkb
+___
+
+print $code;
+
+close STDOUT or die "error closing STDOUT: $!";
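
The perlasm routine chacha_quad_round_group above expresses every "rotate left by n" step of the ChaCha quarter round as a rotate right by 32 - n (vror_vi ..., 32 - 16 and so on). The following is a minimal scalar sketch of that equivalence in plain C, not part of the patch. The helper names (rotl32, rotr32, quarter_round_rol, quarter_round_ror, quarter_round_selfcheck) are hypothetical and chosen only for illustration. The check uses the quarter-round test vector from RFC 8439, section 2.1.1.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Rotate helpers; callers only pass n in 1..31, so no shift is undefined. */
uint32_t rotl32(uint32_t x, unsigned int n)
{
	return (x << n) | (x >> (32 - n));
}

uint32_t rotr32(uint32_t x, unsigned int n)
{
	return (x >> n) | (x << (32 - n));
}

/* Quarter round as written in RFC 8439: rotate left by 16, 12, 8, 7. */
void quarter_round_rol(uint32_t s[4])
{
	uint32_t a = s[0], b = s[1], c = s[2], d = s[3];

	a += b; d ^= a; d = rotl32(d, 16);
	c += d; b ^= c; b = rotl32(b, 12);
	a += b; d ^= a; d = rotl32(d, 8);
	c += d; b ^= c; b = rotl32(b, 7);

	s[0] = a; s[1] = b; s[2] = c; s[3] = d;
}

/*
 * Same quarter round, but written the way the vector code does it:
 * rotate right by (32 - n) instead of rotate left by n.
 */
void quarter_round_ror(uint32_t s[4])
{
	uint32_t a = s[0], b = s[1], c = s[2], d = s[3];

	a += b; d ^= a; d = rotr32(d, 32 - 16);
	c += d; b ^= c; b = rotr32(b, 32 - 12);
	a += b; d ^= a; d = rotr32(d, 32 - 8);
	c += d; b ^= c; b = rotr32(b, 32 - 7);

	s[0] = a; s[1] = b; s[2] = c; s[3] = d;
}

/*
 * Returns 0 if both forms reproduce the RFC 8439 section 2.1.1
 * quarter-round test vector, non-zero otherwise.
 */
int quarter_round_selfcheck(void)
{
	const uint32_t in[4] = {
		0x11111111, 0x01020304, 0x9b8d6f43, 0x01234567
	};
	const uint32_t want[4] = {
		0xea2a92f4, 0xcb1cf8ce, 0x4581472e, 0x5881c4bb
	};
	uint32_t x[4], y[4];

	memcpy(x, in, sizeof(x));
	memcpy(y, in, sizeof(y));
	quarter_round_rol(x);
	quarter_round_ror(y);

	if (memcmp(x, want, sizeof(want)) != 0)
		return 1;
	if (memcmp(y, want, sizeof(want)) != 0)
		return 2;
	return 0;
}
```

Calling quarter_round_selfcheck() from a small test program should return 0. The vector code applies the same four steps to four independent column (or diagonal) groups at once, with one 32-bit lane per ChaCha block.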