Message ID | 20250613103309.22440-1-giovanni.cabiddu@intel.com |
---|---|
State | New |
Series | crypto: qat - lower priority for skcipher and aead algorithms |
On Fri, Jun 13, 2025 at 12:01:50PM -0700, Eric Biggers wrote:
> On Fri, Jun 13, 2025 at 11:32:27AM +0100, Giovanni Cabiddu wrote:
> > Most kernel applications utilizing the crypto API operate synchronously
> > and on small buffer sizes, therefore do not benefit from QAT acceleration.
> >
> > Reduce the priority of QAT implementations for both skcipher and aead
> > algorithms, allowing more suitable alternatives to be selected by default.
> >
> > Signed-off-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com>
> > Link: https://lore.kernel.org/all/20250613012357.GA3603104@google.com/
> > Cc: stable@vger.kernel.org
> > ---
> >  drivers/crypto/intel/qat/qat_common/qat_algs.c | 12 ++++++------
> >  1 file changed, 6 insertions(+), 6 deletions(-)
> >
> > diff --git a/drivers/crypto/intel/qat/qat_common/qat_algs.c b/drivers/crypto/intel/qat/qat_common/qat_algs.c
> > index 3c4bba4a8779..d69cc1e5e023 100644
> > --- a/drivers/crypto/intel/qat/qat_common/qat_algs.c
> > +++ b/drivers/crypto/intel/qat/qat_common/qat_algs.c
> > @@ -1277,7 +1277,7 @@ static struct aead_alg qat_aeads[] = { {
> >  	.base = {
> >  		.cra_name = "authenc(hmac(sha1),cbc(aes))",
> >  		.cra_driver_name = "qat_aes_cbc_hmac_sha1",
> > -		.cra_priority = 4001,
> > +		.cra_priority = 100,
> >  		.cra_flags = CRYPTO_ALG_ASYNC | CRYPTO_ALG_ALLOCATES_MEMORY,
> >  		.cra_blocksize = AES_BLOCK_SIZE,
> >  		.cra_ctxsize = sizeof(struct qat_alg_aead_ctx),
> > @@ -1294,7 +1294,7 @@ static struct aead_alg qat_aeads[] = { {
> >  	.base = {
> >  		.cra_name = "authenc(hmac(sha256),cbc(aes))",
> >  		.cra_driver_name = "qat_aes_cbc_hmac_sha256",
> > -		.cra_priority = 4001,
> > +		.cra_priority = 100,
> >  		.cra_flags = CRYPTO_ALG_ASYNC | CRYPTO_ALG_ALLOCATES_MEMORY,
> >  		.cra_blocksize = AES_BLOCK_SIZE,
> >  		.cra_ctxsize = sizeof(struct qat_alg_aead_ctx),
> > @@ -1311,7 +1311,7 @@ static struct aead_alg qat_aeads[] = { {
> >  	.base = {
> >  		.cra_name = "authenc(hmac(sha512),cbc(aes))",
> >  		.cra_driver_name = "qat_aes_cbc_hmac_sha512",
> > -		.cra_priority = 4001,
> > +		.cra_priority = 100,
> >  		.cra_flags = CRYPTO_ALG_ASYNC | CRYPTO_ALG_ALLOCATES_MEMORY,
> >  		.cra_blocksize = AES_BLOCK_SIZE,
> >  		.cra_ctxsize = sizeof(struct qat_alg_aead_ctx),
> > @@ -1329,7 +1329,7 @@ static struct aead_alg qat_aeads[] = { {
> >  static struct skcipher_alg qat_skciphers[] = { {
> >  	.base.cra_name = "cbc(aes)",
> >  	.base.cra_driver_name = "qat_aes_cbc",
> > -	.base.cra_priority = 4001,
> > +	.base.cra_priority = 100,
> >  	.base.cra_flags = CRYPTO_ALG_ASYNC | CRYPTO_ALG_ALLOCATES_MEMORY,
> >  	.base.cra_blocksize = AES_BLOCK_SIZE,
> >  	.base.cra_ctxsize = sizeof(struct qat_alg_skcipher_ctx),
> > @@ -1347,7 +1347,7 @@ static struct skcipher_alg qat_skciphers[] = { {
> >  }, {
> >  	.base.cra_name = "ctr(aes)",
> >  	.base.cra_driver_name = "qat_aes_ctr",
> > -	.base.cra_priority = 4001,
> > +	.base.cra_priority = 100,
> >  	.base.cra_flags = CRYPTO_ALG_ASYNC | CRYPTO_ALG_ALLOCATES_MEMORY,
> >  	.base.cra_blocksize = 1,
> >  	.base.cra_ctxsize = sizeof(struct qat_alg_skcipher_ctx),
> > @@ -1365,7 +1365,7 @@ static struct skcipher_alg qat_skciphers[] = { {
> >  }, {
> >  	.base.cra_name = "xts(aes)",
> >  	.base.cra_driver_name = "qat_aes_xts",
> > -	.base.cra_priority = 4001,
> > +	.base.cra_priority = 100,
> >  	.base.cra_flags = CRYPTO_ALG_ASYNC | CRYPTO_ALG_NEED_FALLBACK |
> >  			  CRYPTO_ALG_ALLOCATES_MEMORY,
> >  	.base.cra_blocksize = AES_BLOCK_SIZE,
> > --
> > 2.49.0
> >
>
> Acked-by: Eric Biggers <ebiggers@kernel.org>
>
> But, I think your commit message may be misleading:
>
> > Most kernel applications utilizing the crypto API operate synchronously
> > and on small buffer sizes, therefore do not benefit from QAT acceleration.
>
> That implies that QAT acceleration *would* be beneficial for kernel applications
> using asynchronous processing, large buffer sizes, or both.
>
> But as far as I know, that hasn't been shown to be true either.  Those things
> would likely make the performance characteristics of the QAT driver less bad,
> but that doesn't necessarily mean it would become better than VAES.  VAES is
> already incredibly fast, and far easier to use.

Also, it seems that you're considering 4096 bytes to be a "small" buffer size,
which is kind of misleading too.  In dm-crypt for example 4096-byte sectors
(the recommended setting these days) are considered "large sectors",
differentiating them from the traditional 512-byte sectors.

- Eric
On Fri, Jun 13, 2025 at 11:32:27AM +0100, Giovanni Cabiddu wrote:
> Most kernel applications utilizing the crypto API operate synchronously
> and on small buffer sizes, therefore do not benefit from QAT acceleration.

So what performance numbers should we be getting with QAT if the buffer
sizes were large enough?

Cheers,
On Mon, Jun 16, 2025 at 12:18:02PM +0800, Herbert Xu wrote:
> On Fri, Jun 13, 2025 at 11:32:27AM +0100, Giovanni Cabiddu wrote:
> > Most kernel applications utilizing the crypto API operate synchronously
> > and on small buffer sizes, therefore do not benefit from QAT acceleration.
>
> So what performance numbers should we be getting with QAT if the
> buffer sizes were large enough?

Specifically for AES128-XTS, under optimal conditions, the current generation of QAT (GEN4) can achieve approximately 12 GB/s throughput at 4KB block sizes using a single device. Systems typically include between 1 and 4 QAT devices per socket, and each device contains two internal engines capable of performing that algorithm.

This level of performance is observed in userspace, where it is possible to (1) batch requests to amortize MMIO overhead (e.g., multiple requests per write), (2) submit requests asynchronously, (3) use flat buffers instead of scatter-gather lists, and (4) rely on polling rather than interrupts.

However, in the kernel, we are currently unable to keep the accelerator sufficiently busy. For example, using a synthetic, synchronous, single-threaded benchmark on a Sapphire Rapids system, with interrupts properly affinitized, I observed throughput of around 500 Mbps with 4KB buffers. Debugfs statistics (telemetry) indicated that the accelerator was utilized at only ~4%.

Given this, VAES is currently the more suitable choice for kernel use cases. The patch to lower the priority of QAT's symmetric crypto algorithms reflects this practical reality. The original high priority (4001) was set when the driver was first upstreamed in 2014 and had not been revisited until now.

That said, symmetric encryption support in QAT remains relevant for chained operations, where compression and encryption can be offloaded in a single request. This capability is planned for full support in GEN6 and may be backported to GEN4 if there is sufficient demand. However, there is currently no infrastructure or in-kernel user for this functionality.

Regards,
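[Editorial note: for context on the "synchronous, single-threaded" pattern described above, the sketch below shows roughly what each iteration of such a kernel-side benchmark looks like: one 4 KiB request submitted, then a sleep until the completion interrupt fires. The function name, parameters, and error handling are assumptions for illustration, not the actual benchmark code; only the "qat_aes_xts" driver name comes from the driver itself.]

/*
 * Minimal sketch (assumed names, simplified error handling) of a
 * synchronous, single-threaded 4 KiB request: submit one request and
 * sleep in crypto_wait_req() until the device interrupt completes it.
 * This usage pattern leaves the accelerator mostly idle.
 */
#include <crypto/skcipher.h>
#include <linux/crypto.h>
#include <linux/scatterlist.h>

static int one_sync_4k_request(u8 *buf, u8 *iv, const u8 *key, unsigned int keylen)
{
	struct crypto_skcipher *tfm;
	struct skcipher_request *req = NULL;
	struct scatterlist sg;
	DECLARE_CRYPTO_WAIT(wait);
	int err;

	/* Pick the QAT implementation explicitly by driver name. */
	tfm = crypto_alloc_skcipher("qat_aes_xts", 0, 0);
	if (IS_ERR(tfm))
		return PTR_ERR(tfm);

	err = crypto_skcipher_setkey(tfm, key, keylen);
	if (err)
		goto out;

	req = skcipher_request_alloc(tfm, GFP_KERNEL);
	if (!req) {
		err = -ENOMEM;
		goto out;
	}

	/* buf must be DMA-able memory (not stack) for an offload driver. */
	sg_init_one(&sg, buf, 4096);
	skcipher_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
				      crypto_req_done, &wait);
	skcipher_request_set_crypt(req, &sg, &sg, 4096, iv);

	/* One request at a time: block until its completion interrupt. */
	err = crypto_wait_req(crypto_skcipher_encrypt(req), &wait);
out:
	skcipher_request_free(req);
	crypto_free_skcipher(tfm);
	return err;
}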
On Mon, Jun 16, 2025 at 04:02:30PM +0100, Giovanni Cabiddu wrote:
> On Mon, Jun 16, 2025 at 12:18:02PM +0800, Herbert Xu wrote:
> > On Fri, Jun 13, 2025 at 11:32:27AM +0100, Giovanni Cabiddu wrote:
> > > Most kernel applications utilizing the crypto API operate synchronously
> > > and on small buffer sizes, therefore do not benefit from QAT acceleration.
> >
> > So what performance numbers should we be getting with QAT if the
> > buffer sizes were large enough?
>
> Specifically for AES128-XTS, under optimal conditions, the current generation of QAT (GEN4) can achieve approximately 12 GB/s throughput at 4KB block sizes using a single device. Systems typically include between 1 and 4 QAT devices per socket and each device contains two internal engines capable of performing that algorithm.
>
> This level of performance is observed in userspace, where it is possible to (1) batch requests to amortize MMIO overhead (e.g., multiple requests per write), (2) submit requests asynchronously, (3) use flat buffers instead of scatter-gather lists, and (4) rely on polling rather than interrupts.
>
> However, in the kernel, we are currently unable to keep the accelerator sufficiently busy. For example, using a synthetic synchronous and single threaded benchmark on a Sapphire Rapids system, with interrupts properly affinitized, I observed throughput of around 500 Mbps with 4KB buffers. Debugfs statistics (telemetry) indicated that the accelerator was utilized at only ~4%.
>
> Given this, VAES is currently the more suitable choice for kernel use cases. The patch to lower the priority of QAT's symmetric crypto algorithms reflects this practical reality. The original high priority (4001) was set when the driver was first upstreamed in 2014 and had not been revisited until now.

For some perspective, encrypting or decrypting 4 KiB messages with AES-128-XTS serially, I get 18.4 GB/s per thread with the VAES-accelerated code on an Intel Emerald Rapids processor. (The code is arch/x86/crypto/aes-xts-avx-x86_64.S, which I wrote and contributed in Linux 6.10.) The processor appeared to be running at about 3.28 GHz. That's about 5.6 bytes per cycle.

Emerald Rapids processors have 6 to 60 cores per socket. Even assuming that a second thread in each core provides no benefit due to competing for the same core's resources, that would be an AES-128-XTS throughput of 110 to 1100 GB/s. That's way more than QAT could provide, even under the optimal conditions which do not exist in reality as QAT is much harder to use than VAES.

FWIW, on an AMD EPYC 9B45 (Zen 5 / Turin) server processor, I get 35.2 GB/s. This processor appeared to run at about 4.15 GHz, so that's about 8.5 bytes per cycle. That's 51% more bytes per cycle than Intel. This shows that there is still room for improvement in VAES, even when it's already much better than QAT.

It's unclear why Intel's efforts seem to be focused on QAT instead of VAES.

- Eric
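[Editorial note: the bytes-per-cycle and per-socket figures above follow directly from throughput divided by clock frequency and core count. The small userspace snippet below is not part of the thread; it simply redoes that arithmetic using Eric's quoted numbers.]

/*
 * Illustrative only: recompute the bytes/cycle and per-socket figures
 * quoted above.  The GB/s, GHz, and core-count values are from the
 * thread; everything else is an assumption for the example.
 */
#include <stdio.h>

int main(void)
{
	double emr_gbps = 18.4, emr_ghz = 3.28;     /* Emerald Rapids, VAES, single thread */
	double turin_gbps = 35.2, turin_ghz = 4.15; /* AMD EPYC 9B45 (Zen 5), single thread */

	/* GB/s divided by Gcycles/s gives bytes per cycle. */
	printf("EMR:   %.1f bytes/cycle\n", emr_gbps / emr_ghz);     /* ~5.6 */
	printf("Turin: %.1f bytes/cycle\n", turin_gbps / turin_ghz); /* ~8.5 */

	/* Per-socket estimate assuming one thread per core, 6 to 60 cores. */
	printf("EMR socket: %.0f to %.0f GB/s\n", 6 * emr_gbps, 60 * emr_gbps);
	return 0;
}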
On Mon, Jun 16, 2025 at 09:47:52AM -0700, Eric Biggers wrote:
> Emerald Rapids processors have 6 to 60 cores per socket. Even assuming that a
> second thread in each core provides no benefit due to competing for the same
> core's resources, that would be an AES-128-XTS throughput of 110 to 1100 GB/s.

Small correction: Emerald Rapids is 8-64 cores per socket, not 6-60. (I was
accidentally looking at Sapphire Rapids info.) That just makes the VAES
throughput even higher.

- Eric
On Mon, Jun 16, 2025 at 09:47:52AM -0700, Eric Biggers wrote:
> FWIW, on an AMD EPYC 9B45 (Zen 5 / Turin) server processor, I get 35.2 GB/s.
> This processor appeared to run at about 4.15 GHz, so that's about 8.5 bytes per
> cycle. That's 51% more bytes per cycle than Intel. This shows that there is
> still room for improvement in VAES, even when it's already much better than QAT.

Also, to be clear, the 35.2 GB/s (and the corresponding bytes/cycle number of
8.5) is single-thread throughput. This should have been clear since I compared
it to the single-thread throughput on Emerald Rapids. But I just wanted to make
sure to state it explicitly, as an earlier part of my email discussed
whole-processor throughput which it could be confused with.

- Eric
On Mon, Jun 16, 2025 at 04:02:30PM +0100, Giovanni Cabiddu wrote:
>
> This level of performance is observed in userspace, where it is possible
> to (1) batch requests to amortize MMIO overhead (e.g., multiple requests
> per write), (2) submit requests asynchronously, (3) use flat buffers
> instead of scatter-gather lists, and (4) rely on polling rather than
> interrupts.

So is batching a large number of 4K requests sufficient to achieve the
maximum throughput? Or does it require physically contiguous memory much
greater than 4K in size?

Cheers,
On Tue, Jun 17, 2025 at 12:57:05PM +0800, Herbert Xu wrote:
> On Mon, Jun 16, 2025 at 04:02:30PM +0100, Giovanni Cabiddu wrote:
> >
> > This level of performance is observed in userspace, where it is possible
> > to (1) batch requests to amortize MMIO overhead (e.g., multiple requests
> > per write), (2) submit requests asynchronously, (3) use flat buffers
> > instead of scatter-gather lists, and (4) rely on polling rather than
> > interrupts.
>
> So is batching a large number of 4K requests sufficient to achieve the
> maximum throughput? Or does it require physically contiguous memory much
> greater than 4K in size?

Yes, batching a large number of 4KiB requests is sufficient to achieve near-maximum throughput. In an experiment using the skcipher APIs in asynchronous mode, I was able to reach approximately 11 GB/s throughput with 4KiB buffers. To achieve this, I had to increase the request queue depth and adjust the interrupt coalescing timer, which is set quite high by default.

I'm continuing to experiment. For example, I modified the code to send a direct pointer to the device when the source and destination scatterlists each contain only a single segment. This should reduce I/O overhead by avoiding the need to read the scatter-gather list descriptors.

Regarding the synchronous use case, preliminary analysis shows that the main bottlenecks are (1) interrupt handling, in particular the overhead of completion handling, with significant time spent in the tasklet executing crypto_req_done(), and (2) latency waiting on the device. I'm exploring ways to improve these.

While this work might seem moot given that AES is faster on the CPU cores, the same optimizations are applicable to the compression service, where QAT can still provide benefits.

Regards,
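[Editorial note: to make the asynchronous pattern described above concrete, here is a minimal sketch of keeping many 4 KiB requests in flight and reaping completions from the callback rather than waiting per request. It is illustrative only: the queue depth, the pre-allocated requests and buffers, and the completion accounting are assumptions, not the code used in the experiment.]

/*
 * Rough sketch of asynchronous, deep-queue submission: keep QDEPTH
 * requests outstanding so the device ring stays full, and count
 * completions in the callback.
 */
#include <crypto/aes.h>
#include <crypto/skcipher.h>
#include <linux/atomic.h>
#include <linux/completion.h>
#include <linux/scatterlist.h>

#define QDEPTH 128	/* assumed depth, tune to the device ring size */

struct batch_state {
	atomic_t inflight;
	struct completion all_done;
};

static void batch_done(void *data, int err)
{
	struct batch_state *st = data;

	if (atomic_dec_and_test(&st->inflight))
		complete(&st->all_done);
}

/*
 * reqs[] were allocated with skcipher_request_alloc() against a QAT tfm,
 * and each sg[i] describes one pre-allocated 4 KiB buffer.
 */
static void submit_batch(struct skcipher_request *reqs[QDEPTH],
			 struct scatterlist sg[QDEPTH],
			 u8 ivs[QDEPTH][AES_BLOCK_SIZE])
{
	struct batch_state st;
	int i, ret;

	atomic_set(&st.inflight, QDEPTH);
	init_completion(&st.all_done);

	for (i = 0; i < QDEPTH; i++) {
		skcipher_request_set_callback(reqs[i],
					      CRYPTO_TFM_REQ_MAY_BACKLOG,
					      batch_done, &st);
		skcipher_request_set_crypt(reqs[i], &sg[i], &sg[i],
					   4096, ivs[i]);

		ret = crypto_skcipher_encrypt(reqs[i]);
		if (ret != -EINPROGRESS && ret != -EBUSY)
			batch_done(&st, ret); /* completed (or failed) inline */
	}

	/* Reap all completions at once instead of waiting per request. */
	wait_for_completion(&st.all_done);
}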
diff --git a/drivers/crypto/intel/qat/qat_common/qat_algs.c b/drivers/crypto/intel/qat/qat_common/qat_algs.c
index 3c4bba4a8779..d69cc1e5e023 100644
--- a/drivers/crypto/intel/qat/qat_common/qat_algs.c
+++ b/drivers/crypto/intel/qat/qat_common/qat_algs.c
@@ -1277,7 +1277,7 @@ static struct aead_alg qat_aeads[] = { {
 	.base = {
 		.cra_name = "authenc(hmac(sha1),cbc(aes))",
 		.cra_driver_name = "qat_aes_cbc_hmac_sha1",
-		.cra_priority = 4001,
+		.cra_priority = 100,
 		.cra_flags = CRYPTO_ALG_ASYNC | CRYPTO_ALG_ALLOCATES_MEMORY,
 		.cra_blocksize = AES_BLOCK_SIZE,
 		.cra_ctxsize = sizeof(struct qat_alg_aead_ctx),
@@ -1294,7 +1294,7 @@ static struct aead_alg qat_aeads[] = { {
 	.base = {
 		.cra_name = "authenc(hmac(sha256),cbc(aes))",
 		.cra_driver_name = "qat_aes_cbc_hmac_sha256",
-		.cra_priority = 4001,
+		.cra_priority = 100,
 		.cra_flags = CRYPTO_ALG_ASYNC | CRYPTO_ALG_ALLOCATES_MEMORY,
 		.cra_blocksize = AES_BLOCK_SIZE,
 		.cra_ctxsize = sizeof(struct qat_alg_aead_ctx),
@@ -1311,7 +1311,7 @@ static struct aead_alg qat_aeads[] = { {
 	.base = {
 		.cra_name = "authenc(hmac(sha512),cbc(aes))",
 		.cra_driver_name = "qat_aes_cbc_hmac_sha512",
-		.cra_priority = 4001,
+		.cra_priority = 100,
 		.cra_flags = CRYPTO_ALG_ASYNC | CRYPTO_ALG_ALLOCATES_MEMORY,
 		.cra_blocksize = AES_BLOCK_SIZE,
 		.cra_ctxsize = sizeof(struct qat_alg_aead_ctx),
@@ -1329,7 +1329,7 @@ static struct aead_alg qat_aeads[] = { {
 static struct skcipher_alg qat_skciphers[] = { {
 	.base.cra_name = "cbc(aes)",
 	.base.cra_driver_name = "qat_aes_cbc",
-	.base.cra_priority = 4001,
+	.base.cra_priority = 100,
 	.base.cra_flags = CRYPTO_ALG_ASYNC | CRYPTO_ALG_ALLOCATES_MEMORY,
 	.base.cra_blocksize = AES_BLOCK_SIZE,
 	.base.cra_ctxsize = sizeof(struct qat_alg_skcipher_ctx),
@@ -1347,7 +1347,7 @@ static struct skcipher_alg qat_skciphers[] = { {
 }, {
 	.base.cra_name = "ctr(aes)",
 	.base.cra_driver_name = "qat_aes_ctr",
-	.base.cra_priority = 4001,
+	.base.cra_priority = 100,
 	.base.cra_flags = CRYPTO_ALG_ASYNC | CRYPTO_ALG_ALLOCATES_MEMORY,
 	.base.cra_blocksize = 1,
 	.base.cra_ctxsize = sizeof(struct qat_alg_skcipher_ctx),
@@ -1365,7 +1365,7 @@ static struct skcipher_alg qat_skciphers[] = { {
 }, {
 	.base.cra_name = "xts(aes)",
 	.base.cra_driver_name = "qat_aes_xts",
-	.base.cra_priority = 4001,
+	.base.cra_priority = 100,
 	.base.cra_flags = CRYPTO_ALG_ASYNC | CRYPTO_ALG_NEED_FALLBACK |
 			  CRYPTO_ALG_ALLOCATES_MEMORY,
 	.base.cra_blocksize = AES_BLOCK_SIZE,
Most kernel applications utilizing the crypto API operate synchronously
and on small buffer sizes, therefore do not benefit from QAT acceleration.

Reduce the priority of QAT implementations for both skcipher and aead
algorithms, allowing more suitable alternatives to be selected by default.

Signed-off-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com>
Link: https://lore.kernel.org/all/20250613012357.GA3603104@google.com/
Cc: stable@vger.kernel.org
---
 drivers/crypto/intel/qat/qat_common/qat_algs.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)
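[Editorial note: for callers of the crypto API, the practical effect of this priority change can be illustrated with a minimal sketch. The function name and logging below are assumptions; only the "xts(aes)" algorithm name and the "qat_aes_xts" driver name come from the patch. A generic allocation now resolves to whichever implementation advertises the highest cra_priority, while the QAT implementation can still be selected explicitly by driver name.]

/*
 * Minimal sketch: show which driver the crypto core picks for a generic
 * "xts(aes)" request versus an explicit request for the QAT driver.
 */
#include <crypto/skcipher.h>
#include <linux/crypto.h>
#include <linux/printk.h>

static void show_xts_selection(void)
{
	struct crypto_skcipher *tfm;

	/* Generic name: the core picks the highest-priority "xts(aes)". */
	tfm = crypto_alloc_skcipher("xts(aes)", 0, 0);
	if (!IS_ERR(tfm)) {
		pr_info("xts(aes) -> %s\n",
			crypto_tfm_alg_driver_name(crypto_skcipher_tfm(tfm)));
		crypto_free_skcipher(tfm);
	}

	/* Explicit driver name still selects the QAT implementation. */
	tfm = crypto_alloc_skcipher("qat_aes_xts", 0, 0);
	if (!IS_ERR(tfm)) {
		pr_info("qat_aes_xts -> %s\n",
			crypto_tfm_alg_driver_name(crypto_skcipher_tfm(tfm)));
		crypto_free_skcipher(tfm);
	}
}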