crypto: scompress - eliminate percpu scratch buffers

Message ID 20170720114000.13840-1-ard.biesheuvel@linaro.org
State New

Commit Message

Ard Biesheuvel July 20, 2017, 11:40 a.m.
The scompress code unconditionally allocates 2 per-CPU scratch buffers
of 128 KB each, in order to avoid allocation overhead in the async
wrapper that encapsulates the synchronous compression algorithm, since
it may execute in atomic context.

There are a couple of issues with this:
- The code may be invoked in process context with CRYPTO_TFM_REQ_MAY_SLEEP
  set in the request flags. If the CRYPTO_ACOMP_ALLOC_OUTPUT flag is also
  set, we end up calling kmalloc_array() and alloc_page() with the GFP
  flags set to GFP_KERNEL, and so we may attempt to sleep even though we
  are running with preemption disabled (due to the use of get_cpu()).

- On a system such as Cavium ThunderX, which has 96 cores, the per-CPU
  buffers waste 24 MB of memory.

- On 32-bit systems, claiming vmalloc space of this order (e.g., 1 MB on
  a 4-core system) is undesirable as well, given how small the vmalloc
  space typically is on such systems.

- Lengthy CPU bound tasks such as compression should not be run with
  preemption disabled if we can avoid it.

So replace the per-CPU buffers with on-demand page allocations in the
handler. This reduces the memory and vmalloc space footprint of this
driver to ~0 unless you actually use it.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>

---

The req->dst case could be further optimized to reuse the allocated
scratch_dst buffer as the buffer that is returned to the caller. Given
that there are no occurrences of crypto_alloc_acomp() anywhere in the
kernel, I will leave that to the next person.

 crypto/scompress.c | 126 ++++----------------
 1 file changed, 22 insertions(+), 104 deletions(-)

-- 
2.9.3

Comments

Herbert Xu July 21, 2017, 12:42 p.m. | #1
On Thu, Jul 20, 2017 at 12:40:00PM +0100, Ard Biesheuvel wrote:
> The scompress code unconditionally allocates 2 per-CPU scratch buffers
> of 128 KB each, in order to avoid allocation overhead in the async
> wrapper that encapsulates the synchronous compression algorithm, since
> it may execute in atomic context.

The whole point of pre-allocation is that we cannot allocate 128K
(or 64K as it was before scomp) at run-time, and in particular,
for IPsec which runs in softirq path.  Am I missing something?

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Ard Biesheuvel July 21, 2017, 1:09 p.m. | #2
On 21 July 2017 at 13:42, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> On Thu, Jul 20, 2017 at 12:40:00PM +0100, Ard Biesheuvel wrote:
>> The scompress code unconditionally allocates 2 per-CPU scratch buffers
>> of 128 KB each, in order to avoid allocation overhead in the async
>> wrapper that encapsulates the synchronous compression algorithm, since
>> it may execute in atomic context.
>
> The whole point of pre-allocation is that we cannot allocate 128K
> (or 64K as it was before scomp) at run-time, and in particular,
> for IPsec which runs in softirq path.  Am I missing something?

Right. And is req->dst guaranteed to be assigned in that case? Because
crypto_scomp_sg_alloc() happily allocates pages and kmalloc()s the
scatterlist if req->dst == NULL.

Is there any way we could make these scratch buffers part of the
request structure instead? Or at least defer allocating them until the
first call to crypto_scomp_init_tfm()? And on top of that, we should
probably only use the per-CPU scratch buffers if
CRYPTO_TFM_REQ_MAY_SLEEP is cleared, because in that case, we are not
pre-emptible to begin with, and the concern does not apply.
Herbert Xu July 21, 2017, 1:11 p.m. | #3
On Fri, Jul 21, 2017 at 02:09:39PM +0100, Ard Biesheuvel wrote:
> Right. And is req->dst guaranteed to be assigned in that case? Because
> crypto_scomp_sg_alloc() happily allocates pages and kmalloc()s the
> scatterlist if req->dst == NULL.
>
> Is there any way we could make these scratch buffers part of the
> request structure instead? Or at least defer allocating them until the
> first call to crypto_scomp_init_tfm()? And on top of that, we should
> probably only use the per-CPU scratch buffers if
> CRYPTO_TFM_REQ_MAY_SLEEP is cleared, because in that case, we are not
> pre-emptible to begin with, and the concern does not apply.

Well for now scomp is new code so nothing actually uses it.  But
the idea is to convert IPcomp over to acomp and scomp will fill in
as the compatibility glue.

The medium/long-term plan is to convert IPcomp over to a partial
decompression model so that we can allocate pages instead of a
contiguous chunk as we do now.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Ard Biesheuvel July 21, 2017, 1:24 p.m. | #4
On 21 July 2017 at 14:11, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> On Fri, Jul 21, 2017 at 02:09:39PM +0100, Ard Biesheuvel wrote:
>> Right. And is req->dst guaranteed to be assigned in that case? Because
>> crypto_scomp_sg_alloc() happily allocates pages and kmalloc()s the
>> scatterlist if req->dst == NULL.
>>
>> Is there any way we could make these scratch buffers part of the
>> request structure instead? Or at least defer allocating them until the
>> first call to crypto_scomp_init_tfm()? And on top of that, we should
>> probably only use the per-CPU scratch buffers if
>> CRYPTO_TFM_REQ_MAY_SLEEP is cleared, because in that case, we are not
>> pre-emptible to begin with, and the concern does not apply.
>
> Well for now scomp is new code so nothing actually uses it.  But
> the idea is to convert IPcomp over to acomp and scomp will fill in
> as the compatibility glue.

Yes, I guessed as much.

> The medium/long-term plan is to convert IPcomp over to a partial
> decompression model so that we can allocate pages instead of a
> contiguous chunk as we do now.

OK, but that doesn't really answer any of my questions:
- Should we enforce that CRYPTO_ACOMP_ALLOC_OUTPUT is mutually
exclusive with CRYPTO_TFM_REQ_MAY_SLEEP, or should
crypto_scomp_sg_alloc() always use GFP_ATOMIC? We need one or the
other, because the current situation is buggy.
- Could we move the scratch buffers to the request structure instead?
Or may that be allocated from softirq context as well?
- Would you mind a patch that defers the allocation of the per-CPU
buffers to the first invocation of crypto_scomp_init_tfm()?
- Would you mind a patch that makes the code only use the per-CPU
buffers if we are running atomically to begin with?

Thanks,
Ard.
Ard Biesheuvel July 21, 2017, 1:28 p.m. | #5
On 21 July 2017 at 14:24, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
> On 21 July 2017 at 14:11, Herbert Xu <herbert@gondor.apana.org.au> wrote:
>> On Fri, Jul 21, 2017 at 02:09:39PM +0100, Ard Biesheuvel wrote:
>>> Right. And is req->dst guaranteed to be assigned in that case? Because
>>> crypto_scomp_sg_alloc() happily allocates pages and kmalloc()s the
>>> scatterlist if req->dst == NULL.
>>>
>>> Is there any way we could make these scratch buffers part of the
>>> request structure instead? Or at least defer allocating them until the
>>> first call to crypto_scomp_init_tfm()? And on top of that, we should
>>> probably only use the per-CPU scratch buffers if
>>> CRYPTO_TFM_REQ_MAY_SLEEP is cleared, because in that case, we are not
>>> pre-emptible to begin with, and the concern does not apply.
>>
>> Well for now scomp is new code so nothing actually uses it.  But
>> the idea is to convert IPcomp over to acomp and scomp will fill in
>> as the compatibility glue.
>
> Yes, I guessed as much.
>
>> The medium/long-term plan is to convert IPcomp over to a partial
>> decompression model so that we can allocate pages instead of a
>> contiguous chunk as we do now.
>
> OK, but that doesn't really answer any of my questions:
> - Should we enforce that CRYPTO_ACOMP_ALLOC_OUTPUT is mutually
> exclusive with CRYPTO_TFM_REQ_MAY_SLEEP, or should
> crypto_scomp_sg_alloc() always use GFP_ATOMIC? We need one or the
> other, because the current situation is buggy.

I mean CRYPTO_ACOMP_ALLOC_OUTPUT should not be set if
CRYPTO_TFM_REQ_MAY_SLEEP is cleared.

> - Could we move the scratch buffers to the request structure instead?
> Or may that be allocated from softirq context as well?
> - Would you mind a patch that defers the allocation of the per-CPU
> buffers to the first invocation of crypto_scomp_init_tfm()?
> - Would you mind a patch that makes the code only use the per-CPU
> buffers if we are running atomically to begin with?
>
> Thanks,
> Ard.
Herbert Xu July 21, 2017, 1:31 p.m. | #6
On Fri, Jul 21, 2017 at 02:24:02PM +0100, Ard Biesheuvel wrote:
> OK, but that doesn't really answer any of my questions:
> - Should we enforce that CRYPTO_ACOMP_ALLOC_OUTPUT is mutually
> exclusive with CRYPTO_TFM_REQ_MAY_SLEEP, or should
> crypto_scomp_sg_alloc() always use GFP_ATOMIC? We need one or the
> other, because the current situation is buggy.

Yeah I'm not surprised that it is broken, because there are currently
no users of the API.  We should fix the acomp API.

> - Could we move the scratch buffers to the request structure instead?

> Or may that be allocated from softirq context as well?


Yes the IPcomp requests would be allocated from softirq context.

> - Would you mind a patch that defers the allocation of the per-CPU

> buffers to the first invocation of crypto_scomp_init_tfm()?


Sure.  That's what we currently do for IPcomp too.

> - Would you mind a patch that makes the code only use the per-CPU

> buffers if we are running atomically to begin with?


That would mean dropping the first packet so no.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Ard Biesheuvel July 21, 2017, 1:42 p.m. | #7
On 21 July 2017 at 14:31, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> On Fri, Jul 21, 2017 at 02:24:02PM +0100, Ard Biesheuvel wrote:
>> OK, but that doesn't really answer any of my questions:
>> - Should we enforce that CRYPTO_ACOMP_ALLOC_OUTPUT is mutually
>> exclusive with CRYPTO_TFM_REQ_MAY_SLEEP, or should
>> crypto_scomp_sg_alloc() always use GFP_ATOMIC? We need one or the
>> other, because the current situation is buggy.
>
> Yeah I'm not surprised that it is broken, because there are currently
> no users of the API.  We should fix the acomp API.
>
>> - Could we move the scratch buffers to the request structure instead?
>> Or may that be allocated from softirq context as well?
>
> Yes the IPcomp requests would be allocated from softirq context.

__acomp_request_alloc() currently uses GFP_KERNEL explicitly, so that
needs to be fixed as well then.

>> - Would you mind a patch that defers the allocation of the per-CPU
>> buffers to the first invocation of crypto_scomp_init_tfm()?
>
> Sure.  That's what we currently do for IPcomp too.

OK.

>> - Would you mind a patch that makes the code only use the per-CPU
>> buffers if we are running atomically to begin with?
>
> That would mean dropping the first packet so no.

I think you misunderstood me: the idea is not to allocate the per-CPU
buffers at that time, but avoiding them and use the heap directly (as
my patch does now). This way, we don't unnecessarily disable
preemption for calls executing in process context.

I.e.:

if (!(req->base.flags & CRYPTO_TFM_REQ_MAY_SLEEP)) {
  cpu = get_cpu();
  scratch_src = *per_cpu_ptr(scomp_src_scratches, cpu);
  ...
} else {
  scratch_src = (u8 *)__get_free_pages(gfp, alloc_order);
}

and make the put_cpu() at the end conditional in the same way.

In any case, my main gripe with this code is the unconditional
allocation of the scratch buffers (taking up memory and vmalloc space)
so if we can defer that to crypto_scomp_init_tfm() instead, I already
have most of what I need.

-- 
Ard.
Herbert Xu July 21, 2017, 1:44 p.m. | #8
On Fri, Jul 21, 2017 at 02:42:20PM +0100, Ard Biesheuvel wrote:
> 

> >> - Would you mind a patch that makes the code only use the per-CPU

> >> buffers if we are running atomically to begin with?

> >

> > That would mean dropping the first packet so no.

> >

> 

> I think you misunderstood me: the idea is not to allocate the per-CPU

> buffers at that time, but avoiding them and use the heap directly (as

> my patch does now). This way, we don't unnecessarily disable

> preemption for calls executing in process context.


When we actually have a user other than IPcomp then we could
try this optimisation.  Otherwise I think it'd be a waste of
time.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Ard Biesheuvel July 21, 2017, 2:34 p.m. | #9
On 21 July 2017 at 14:44, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> On Fri, Jul 21, 2017 at 02:42:20PM +0100, Ard Biesheuvel wrote:

>>

>> >> - Would you mind a patch that makes the code only use the per-CPU

>> >> buffers if we are running atomically to begin with?

>> >

>> > That would mean dropping the first packet so no.

>> >

>>

>> I think you misunderstood me: the idea is not to allocate the per-CPU

>> buffers at that time, but avoiding them and use the heap directly (as

>> my patch does now). This way, we don't unnecessarily disable

>> preemption for calls executing in process context.

>

> When we actually have a user other than IPcomp then we could

> try this optimisation.  Otherwise I think it'd be a waste of

> time.

>


Fair enough.

Patch

diff --git a/crypto/scompress.c b/crypto/scompress.c
index ae1d3cf209e4..3d4c35a4c5a8 100644
--- a/crypto/scompress.c
+++ b/crypto/scompress.c
@@ -30,10 +30,6 @@ 
 #include "internal.h"
 
 static const struct crypto_type crypto_scomp_type;
-static void * __percpu *scomp_src_scratches;
-static void * __percpu *scomp_dst_scratches;
-static int scomp_scratch_users;
-static DEFINE_MUTEX(scomp_lock);
 
 #ifdef CONFIG_NET
 static int crypto_scomp_report(struct sk_buff *skb, struct crypto_alg *alg)
@@ -70,67 +66,6 @@  static int crypto_scomp_init_tfm(struct crypto_tfm *tfm)
 	return 0;
 }
 
-static void crypto_scomp_free_scratches(void * __percpu *scratches)
-{
-	int i;
-
-	if (!scratches)
-		return;
-
-	for_each_possible_cpu(i)
-		vfree(*per_cpu_ptr(scratches, i));
-
-	free_percpu(scratches);
-}
-
-static void * __percpu *crypto_scomp_alloc_scratches(void)
-{
-	void * __percpu *scratches;
-	int i;
-
-	scratches = alloc_percpu(void *);
-	if (!scratches)
-		return NULL;
-
-	for_each_possible_cpu(i) {
-		void *scratch;
-
-		scratch = vmalloc_node(SCOMP_SCRATCH_SIZE, cpu_to_node(i));
-		if (!scratch)
-			goto error;
-		*per_cpu_ptr(scratches, i) = scratch;
-	}
-
-	return scratches;
-
-error:
-	crypto_scomp_free_scratches(scratches);
-	return NULL;
-}
-
-static void crypto_scomp_free_all_scratches(void)
-{
-	if (!--scomp_scratch_users) {
-		crypto_scomp_free_scratches(scomp_src_scratches);
-		crypto_scomp_free_scratches(scomp_dst_scratches);
-		scomp_src_scratches = NULL;
-		scomp_dst_scratches = NULL;
-	}
-}
-
-static int crypto_scomp_alloc_all_scratches(void)
-{
-	if (!scomp_scratch_users++) {
-		scomp_src_scratches = crypto_scomp_alloc_scratches();
-		if (!scomp_src_scratches)
-			return -ENOMEM;
-		scomp_dst_scratches = crypto_scomp_alloc_scratches();
-		if (!scomp_dst_scratches)
-			return -ENOMEM;
-	}
-	return 0;
-}
-
 static void crypto_scomp_sg_free(struct scatterlist *sgl)
 {
 	int i, n;
@@ -184,24 +119,31 @@  static int scomp_acomp_comp_decomp(struct acomp_req *req, int dir)
 	void **tfm_ctx = acomp_tfm_ctx(tfm);
 	struct crypto_scomp *scomp = *tfm_ctx;
 	void **ctx = acomp_request_ctx(req);
-	const int cpu = get_cpu();
-	u8 *scratch_src = *per_cpu_ptr(scomp_src_scratches, cpu);
-	u8 *scratch_dst = *per_cpu_ptr(scomp_dst_scratches, cpu);
+	u8 *scratch_src;
+	u8 *scratch_dst;
+	int alloc_order;
+	gfp_t gfp;
 	int ret;
 
-	if (!req->src || !req->slen || req->slen > SCOMP_SCRATCH_SIZE) {
-		ret = -EINVAL;
-		goto out;
-	}
+	if (!req->src || !req->slen || req->slen > SCOMP_SCRATCH_SIZE)
+		return -EINVAL;
 
-	if (req->dst && !req->dlen) {
-		ret = -EINVAL;
-		goto out;
-	}
+	if (req->dst && !req->dlen)
+		return -EINVAL;
 
 	if (!req->dlen || req->dlen > SCOMP_SCRATCH_SIZE)
 		req->dlen = SCOMP_SCRATCH_SIZE;
 
+	gfp = req->base.flags & CRYPTO_TFM_REQ_MAY_SLEEP ? GFP_KERNEL
+							 : GFP_ATOMIC;
+
+	alloc_order = get_order(PAGE_ALIGN(req->slen) + req->dlen);
+	scratch_src = (u8 *)__get_free_pages(gfp, alloc_order);
+	if (!scratch_src)
+		return -ENOMEM;
+
+	scratch_dst = scratch_src + PAGE_ALIGN(req->slen);
+
 	scatterwalk_map_and_copy(scratch_src, req->src, 0, req->slen, 0);
 	if (dir)
 		ret = crypto_scomp_compress(scomp, scratch_src, req->slen,
@@ -211,9 +153,7 @@  static int scomp_acomp_comp_decomp(struct acomp_req *req, int dir)
 					      scratch_dst, &req->dlen, *ctx);
 	if (!ret) {
 		if (!req->dst) {
-			req->dst = crypto_scomp_sg_alloc(req->dlen,
-				   req->base.flags & CRYPTO_TFM_REQ_MAY_SLEEP ?
-				   GFP_KERNEL : GFP_ATOMIC);
+			req->dst = crypto_scomp_sg_alloc(req->dlen, gfp);
 			if (!req->dst)
 				goto out;
 		}
@@ -221,7 +161,7 @@  static int scomp_acomp_comp_decomp(struct acomp_req *req, int dir)
 					 1);
 	}
 out:
-	put_cpu();
+	free_pages((unsigned long)scratch_src, alloc_order);
 	return ret;
 }
 
@@ -316,40 +256,18 @@  static const struct crypto_type crypto_scomp_type = {
 int crypto_register_scomp(struct scomp_alg *alg)
 {
 	struct crypto_alg *base = &alg->base;
-	int ret = -ENOMEM;
-
-	mutex_lock(&scomp_lock);
-	if (crypto_scomp_alloc_all_scratches())
-		goto error;
 
 	base->cra_type = &crypto_scomp_type;
 	base->cra_flags &= ~CRYPTO_ALG_TYPE_MASK;
 	base->cra_flags |= CRYPTO_ALG_TYPE_SCOMPRESS;
 
-	ret = crypto_register_alg(base);
-	if (ret)
-		goto error;
-
-	mutex_unlock(&scomp_lock);
-	return ret;
-
-error:
-	crypto_scomp_free_all_scratches();
-	mutex_unlock(&scomp_lock);
-	return ret;
+	return crypto_register_alg(base);
 }
 EXPORT_SYMBOL_GPL(crypto_register_scomp);
 
 int crypto_unregister_scomp(struct scomp_alg *alg)
 {
-	int ret;
-
-	mutex_lock(&scomp_lock);
-	ret = crypto_unregister_alg(&alg->base);
-	crypto_scomp_free_all_scratches();
-	mutex_unlock(&scomp_lock);
-
-	return ret;
+	return crypto_unregister_alg(&alg->base);
 }
 EXPORT_SYMBOL_GPL(crypto_unregister_scomp);