
[v2,09/11] arm64/crypto: add voluntary preemption to Crypto Extensions SHA1

Message ID 1400091451-9117-10-git-send-email-ard.biesheuvel@linaro.org
State New

Commit Message

Ard Biesheuvel May 14, 2014, 6:17 p.m. UTC
The Crypto Extensions based SHA1 implementation uses the NEON register file,
and hence runs with preemption disabled. This patch adds a TIF_NEED_RESCHED
check to its inner loop so we at least give up the CPU voluntarily when we
are running in process context and have been tagged for preemption by the
scheduler.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
---
 arch/arm64/crypto/sha1-ce-core.S | 19 ++++++++-------
 arch/arm64/crypto/sha1-ce-glue.c | 52 ++++++++++++++++++++++++++--------------
 2 files changed, 44 insertions(+), 27 deletions(-)
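
Note that the assembly change below relies on a b_if_no_resched helper introduced earlier in this series, which is not shown on this page. Roughly, it tests TIF_NEED_RESCHED in the thread_info whose pointer arrives in x5 (skipping the check when the pointer is NULL) and branches back to the loop label while no reschedule is pending. A C model of the resulting contract, using illustrative names only, could look like this:

/*
 * Illustrative C model of the contract sha1_ce_transform() gains in this
 * patch (not actual kernel code): process up to 'blocks' 64-byte blocks,
 * but stop early if the scheduler has flagged the task for preemption.
 * The number of unprocessed blocks is returned so the glue code can drop
 * the NEON context and retry.
 */
static int sha1_ce_transform_model(int blocks, const u8 *src, u32 *state,
				   u8 *head, long bytes,
				   struct thread_info *ti)
{
	while (blocks > 0) {
		/* ... one 64-byte block via the SHA1 CE instructions ... */
		src += SHA1_BLOCK_SIZE;
		blocks--;

		/* give up early if a reschedule is pending */
		if (ti && test_ti_thread_flag(ti, TIF_NEED_RESCHED))
			break;
	}
	return blocks;		/* blocks still left to process */
}

The sha1_do_update() loop in the glue code then drops the NEON context between attempts, re-enabling preemption (which, as the discussion below notes, only leads to an actual reschedule under CONFIG_PREEMPT).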

Comments

Catalin Marinas May 15, 2014, 5:24 p.m. UTC | #1
On Wed, May 14, 2014 at 07:17:29PM +0100, Ard Biesheuvel wrote:
> The Crypto Extensions based SHA1 implementation uses the NEON register file,
> and hence runs with preemption disabled. This patch adds a TIF_NEED_RESCHED
> check to its inner loop so we at least give up the CPU voluntarily when we
> are running in process context and have been tagged for preemption by the
> scheduler.

Sorry, I hadn't got to the bottom of your series earlier, and I've only
now realised that the last patches are not just new crypto algorithms.

> +static u8 const *sha1_do_update(struct shash_desc *desc, const u8 *data,
> +				int blocks, u8 *head, unsigned int len)
> +{
> +	struct sha1_state *sctx = shash_desc_ctx(desc);
> +	struct thread_info *ti = NULL;
> +
> +	/*
> +	 * Pass current's thread info pointer to sha1_ce_transform()
> +	 * below if we want it to play nice under preemption.
> +	 */
> +	if ((IS_ENABLED(CONFIG_PREEMPT_VOLUNTARY) || IS_ENABLED(CONFIG_PREEMPT))
> +	    && (desc->flags & CRYPTO_TFM_REQ_MAY_SLEEP))
> +		ti = current_thread_info();
> +
> +	do {
> +		int rem;
> +
> +		kernel_neon_begin_partial(16);
> +		rem = sha1_ce_transform(blocks, data, sctx->state, head, 0, ti);
> +		kernel_neon_end();
> +
> +		data += (blocks - rem) * SHA1_BLOCK_SIZE;
> +		blocks = rem;
> +		head = NULL;
> +	} while (unlikely(ti && blocks > 0));
> +	return data;
> +}

What latencies are we talking about? Would it make sense to always
call cond_resched() even if preemption is disabled?

With PREEMPT_VOLUNTARY I don't think the above code does any voluntary
preemption. The preempt_enable() in kernel_neon_end() only reschedules
if PREEMPT.

But I think we should have this loop always rescheduling if
CRYPTO_TFM_REQ_MAY_SLEEP. I can see there is a crypto_yield() function
that conditionally reschedules. What's the overhead of calling
sha1_ce_transform() in a loop vs a single call for the entire data?
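
For reference, crypto_yield() boils down to a cond_resched() gated on the MAY_SLEEP flag; a simplified sketch, paraphrased rather than quoted from the crypto headers:

static inline void crypto_yield(u32 flags)
{
	/* yield only if the transform was flagged as allowed to sleep */
	if (flags & CRYPTO_TFM_REQ_MAY_SLEEP)
		cond_resched();
}
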
Ard Biesheuvel May 15, 2014, 9:35 p.m. UTC | #2
On 15 May 2014 10:24, Catalin Marinas <catalin.marinas@arm.com> wrote:
> On Wed, May 14, 2014 at 07:17:29PM +0100, Ard Biesheuvel wrote:
>> The Crypto Extensions based SHA1 implementation uses the NEON register file,
>> and hence runs with preemption disabled. This patch adds a TIF_NEED_RESCHED
>> check to its inner loop so we at least give up the CPU voluntarily when we
>> are running in process context and have been tagged for preemption by the
>> scheduler.
>
> Sorry, I haven't got to the bottom of your series earlier and I now
> realised that the last patches are not just new crypto algorithms.
>
>> +static u8 const *sha1_do_update(struct shash_desc *desc, const u8 *data,
>> +                             int blocks, u8 *head, unsigned int len)
>> +{
>> +     struct sha1_state *sctx = shash_desc_ctx(desc);
>> +     struct thread_info *ti = NULL;
>> +
>> +     /*
>> +      * Pass current's thread info pointer to sha1_ce_transform()
>> +      * below if we want it to play nice under preemption.
>> +      */
>> +     if ((IS_ENABLED(CONFIG_PREEMPT_VOLUNTARY) || IS_ENABLED(CONFIG_PREEMPT))
>> +         && (desc->flags & CRYPTO_TFM_REQ_MAY_SLEEP))
>> +             ti = current_thread_info();
>> +
>> +     do {
>> +             int rem;
>> +
>> +             kernel_neon_begin_partial(16);
>> +             rem = sha1_ce_transform(blocks, data, sctx->state, head, 0, ti);
>> +             kernel_neon_end();
>> +
>> +             data += (blocks - rem) * SHA1_BLOCK_SIZE;
>> +             blocks = rem;
>> +             head = NULL;
>> +     } while (unlikely(ti && blocks > 0));
>> +     return data;
>> +}
>
> What latencies are we talking about? Would it make sense to always
> call cond_resched() even if preemption is disabled?
>

Calculating the SHA256 hash (the most striking example) of a 4k buffer
takes ~10 microseconds, so perhaps this is all a bit moot, as I don't
think there will generally be many in-kernel users hashing substantially
larger quantities of data at a time.

> With PREEMPT_VOLUNTARY I don't think the above code does any voluntary
> preemption. The preempt_enable() in kernel_neon_end() only reschedules
> if PREEMPT.
>

Right, my mistake.
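
For context: only the CONFIG_PREEMPT flavour of preempt_enable() rechecks for a pending reschedule once the preempt count drops to zero; with PREEMPT_VOLUNTARY or no preemption it merely decrements the count. Roughly, paraphrasing include/linux/preempt.h rather than quoting it:

#ifdef CONFIG_PREEMPT
#define preempt_enable() \
do { \
	barrier(); \
	if (unlikely(preempt_count_dec_and_test())) \
		__preempt_schedule(); \
} while (0)
#else
#define preempt_enable() \
do { \
	barrier(); \
	preempt_count_dec(); \
} while (0)
#endif

So kernel_neon_end() only yields the CPU here when full preemption is configured.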

> But I think we should have this loop always rescheduling if
> CRYPTO_TFM_REQ_MAY_SLEEP. I can see there is a crypto_yield() function
> that conditionally reschedules. What's the overhead of calling
> sha1_ce_transform() in a loop vs a single call for the entire data?
>

For SHA256 we lose at least 10% if we call into the core function for
each 64-byte block rather than letting it chew away at all the data
itself. (This is due to the fact that it needs to reload the 64 32-bit
round keys on every call.) But perhaps (considering the above) it is
better to just shelve the last 4 patches in this series until it becomes
a problem for anyone.
Catalin Marinas May 15, 2014, 9:47 p.m. UTC | #3
On 15 May 2014, at 22:35, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
> On 15 May 2014 10:24, Catalin Marinas <catalin.marinas@arm.com> wrote:
>> On Wed, May 14, 2014 at 07:17:29PM +0100, Ard Biesheuvel wrote:
>>> +static u8 const *sha1_do_update(struct shash_desc *desc, const u8 *data,
>>> +                             int blocks, u8 *head, unsigned int len)
>>> +{
>>> +     struct sha1_state *sctx = shash_desc_ctx(desc);
>>> +     struct thread_info *ti = NULL;
>>> +
>>> +     /*
>>> +      * Pass current's thread info pointer to sha1_ce_transform()
>>> +      * below if we want it to play nice under preemption.
>>> +      */
>>> +     if ((IS_ENABLED(CONFIG_PREEMPT_VOLUNTARY) || IS_ENABLED(CONFIG_PREEMPT))
>>> +         && (desc->flags & CRYPTO_TFM_REQ_MAY_SLEEP))
>>> +             ti = current_thread_info();
>>> +
>>> +     do {
>>> +             int rem;
>>> +
>>> +             kernel_neon_begin_partial(16);
>>> +             rem = sha1_ce_transform(blocks, data, sctx->state, head, 0, ti);
>>> +             kernel_neon_end();
>>> +
>>> +             data += (blocks - rem) * SHA1_BLOCK_SIZE;
>>> +             blocks = rem;
>>> +             head = NULL;
>>> +     } while (unlikely(ti && blocks > 0));
>>> +     return data;
>>> +}
>> 
>> What latencies are we talking about? Would it make sense to always
>> call cond_resched() even if preemption is disabled?
> 
> Calculating the SHA256 (which is the most striking example) hash of a
> 4k buffer takes ~10 microseconds, so perhaps this is all a bit moot as
> I don't think there will generally be many in-kernel users hashing
> substantially larger quantities of data at a time.

And what’s the maximum size? Could we do it in blocks of 4K?

>> But I think we should have this loop always rescheduling if
>> CRYPTO_TFM_REQ_MAY_SLEEP. I can see there is a crypto_yield() function
>> that conditionally reschedules. What's the overhead of calling
>> sha1_ce_transform() in a loop vs a single call for the entire data?
> 
> For SHA256 we lose at least 10% if we call into the core function for
> each 64 byte block rather than letting it chew away at all the data
> itself.

I was thinking more like calling it for bigger buffers (e.g. page) at
once and a crypto_yield() in between.

> (This is due to the fact that it needs to load 64 32-bit round
> keys) But perhaps (considering the above) it is better to just shelve
> the last 4 patches in this series until it becomes a problem for
> anyone.

For the time being I would drop the last 4 patches. We can revisit them
for the next kernel version.

Catalin
Ard Biesheuvel May 15, 2014, 10:10 p.m. UTC | #4
On 15 May 2014 14:47, Catalin Marinas <catalin.marinas@arm.com> wrote:
> On 15 May 2014, at 22:35, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
>> On 15 May 2014 10:24, Catalin Marinas <catalin.marinas@arm.com> wrote:
>>> On Wed, May 14, 2014 at 07:17:29PM +0100, Ard Biesheuvel wrote:
>>>> +static u8 const *sha1_do_update(struct shash_desc *desc, const u8 *data,
>>>> +                             int blocks, u8 *head, unsigned int len)
>>>> +{
>>>> +     struct sha1_state *sctx = shash_desc_ctx(desc);
>>>> +     struct thread_info *ti = NULL;
>>>> +
>>>> +     /*
>>>> +      * Pass current's thread info pointer to sha1_ce_transform()
>>>> +      * below if we want it to play nice under preemption.
>>>> +      */
>>>> +     if ((IS_ENABLED(CONFIG_PREEMPT_VOLUNTARY) || IS_ENABLED(CONFIG_PREEMPT))
>>>> +         && (desc->flags & CRYPTO_TFM_REQ_MAY_SLEEP))
>>>> +             ti = current_thread_info();
>>>> +
>>>> +     do {
>>>> +             int rem;
>>>> +
>>>> +             kernel_neon_begin_partial(16);
>>>> +             rem = sha1_ce_transform(blocks, data, sctx->state, head, 0, ti);
>>>> +             kernel_neon_end();
>>>> +
>>>> +             data += (blocks - rem) * SHA1_BLOCK_SIZE;
>>>> +             blocks = rem;
>>>> +             head = NULL;
>>>> +     } while (unlikely(ti && blocks > 0));
>>>> +     return data;
>>>> +}
>>>
>>> What latencies are we talking about? Would it make sense to always
>>> call cond_resched() even if preemption is disabled?
>>
>> Calculating the SHA256 (which is the most striking example) hash of a
>> 4k buffer takes ~10 microseconds, so perhaps this is all a bit moot as
>> I don't think there will generally be many in-kernel users hashing
>> substantially larger quantities of data at a time.
>
> And what’s the maximum size? Could we do it in blocks of 4K?
>

Yes, that should be doable as well. Once we feel we actually need this
feature, that is probably a better approach.
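
Something along the following lines, say; just a sketch against the current five-argument sha1_ce_transform() (i.e. without the assembly-side check), with illustrative names only:

/*
 * Sketch only: feed sha1_ce_transform() at most a page worth of blocks
 * per NEON section and let crypto_yield() reschedule in between, so the
 * voluntary preemption happens entirely in the C glue code. This mirrors
 * the existing sha1_update() usage; the finup path would additionally
 * need to pass the byte count for the final block.
 */
static u8 const *sha1_do_update_chunked(struct shash_desc *desc,
					const u8 *data, int blocks,
					u8 *head)
{
	struct sha1_state *sctx = shash_desc_ctx(desc);
	int chunk = PAGE_SIZE / SHA1_BLOCK_SIZE;	/* 64 blocks per 4k page */

	while (blocks > 0) {
		int n = min(blocks, chunk);

		kernel_neon_begin_partial(16);
		sha1_ce_transform(n, data, sctx->state, head, 0);
		kernel_neon_end();

		data += n * SHA1_BLOCK_SIZE;
		blocks -= n;
		head = NULL;

		crypto_yield(desc->flags);	/* cond_resched() if MAY_SLEEP */
	}
	return data;
}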

>>> But I think we should have this loop always rescheduling if
>>> CRYPTO_TFM_REQ_MAY_SLEEP. I can see there is a crypto_yield() function
>>> that conditionally reschedules. What's the overhead of calling
>>> sha1_ce_transform() in a loop vs a single call for the entire data?
>>
>> For SHA256 we lose at least 10% if we call into the core function for
>> each 64 byte block rather than letting it chew away at all the data
>> itself.
>
> I was thinking more like calling it for bigger buffers (e.g. page) at
> once and a crypto_yield() in between.
>
>> (This is due to the fact that it needs to load 64 32-bit round
>> keys) But perhaps (considering the above) it is better to just shelve
>> the last 4 patches in this series until it becomes a problem for
>> anyone.
>
> For the time being I would drop the last 4 patches. We can revisit them
> for the next kernel version.
>

Agreed.

Should I send an updated pull request?
Catalin Marinas May 16, 2014, 8:57 a.m. UTC | #5
On Thu, May 15, 2014 at 11:10:20PM +0100, Ard Biesheuvel wrote:
> On 15 May 2014 14:47, Catalin Marinas <catalin.marinas@arm.com> wrote:
> > For the time being I would drop the last 4 patches. We can revisit them
> > for the next kernel version.
> 
> Agreed.
> 
> Should I send an updated pull request?

As it's not a signed tag anyway, I can merge up to the relevant commit.

Patch

diff --git a/arch/arm64/crypto/sha1-ce-core.S b/arch/arm64/crypto/sha1-ce-core.S
index 09d57d98609c..0cb9b8f4906b 100644
--- a/arch/arm64/crypto/sha1-ce-core.S
+++ b/arch/arm64/crypto/sha1-ce-core.S
@@ -66,8 +66,8 @@ 
 	.word		0x5a827999, 0x6ed9eba1, 0x8f1bbcdc, 0xca62c1d6
 
 	/*
-	 * void sha1_ce_transform(int blocks, u8 const *src, u32 *state,
-	 * 			  u8 *head, long bytes)
+	 * int sha1_ce_transform(int blocks, u8 const *src, u32 *state,
+	 * 			 u8 *head, long bytes, struct thread_info *ti)
 	 */
 ENTRY(sha1_ce_transform)
 	/* load round constants */
@@ -127,7 +127,13 @@  CPU_LE(	rev32		v11.16b, v11.16b	)
 	add		dgbv.2s, dgbv.2s, dg1v.2s
 	add		dgav.4s, dgav.4s, dg0v.4s
 
-	cbnz		w0, 0b
+	cbz		w0, 4f
+	b_if_no_resched	x5, x8, 0b
+
+	/* store new state */
+3:	str		dga, [x2]
+	str		dgb, [x2, #16]
+	ret
 
 	/*
 	 * Final block: add padding and total bit count.
@@ -135,7 +141,7 @@  CPU_LE(	rev32		v11.16b, v11.16b	)
 	 * size was not a round multiple of the block size, and the padding is
 	 * handled by the C code.
 	 */
-	cbz		x4, 3f
+4:	cbz		x4, 3b
 	movi		v9.2d, #0
 	mov		x8, #0x80000000
 	movi		v10.2d, #0
@@ -145,9 +151,4 @@  CPU_LE(	rev32		v11.16b, v11.16b	)
 	mov		v11.d[0], xzr
 	mov		v11.d[1], x7
 	b		2b
-
-	/* store new state */
-3:	str		dga, [x2]
-	str		dgb, [x2, #16]
-	ret
 ENDPROC(sha1_ce_transform)
diff --git a/arch/arm64/crypto/sha1-ce-glue.c b/arch/arm64/crypto/sha1-ce-glue.c
index 6fe83f37a750..b195f7104706 100644
--- a/arch/arm64/crypto/sha1-ce-glue.c
+++ b/arch/arm64/crypto/sha1-ce-glue.c
@@ -20,8 +20,8 @@  MODULE_DESCRIPTION("SHA1 secure hash using ARMv8 Crypto Extensions");
 MODULE_AUTHOR("Ard Biesheuvel <ard.biesheuvel@linaro.org>");
 MODULE_LICENSE("GPL v2");
 
-asmlinkage void sha1_ce_transform(int blocks, u8 const *src, u32 *state,
-				  u8 *head, long bytes);
+asmlinkage int sha1_ce_transform(int blocks, u8 const *src, u32 *state,
+				 u8 *head, long bytes, struct thread_info *ti);
 
 static int sha1_init(struct shash_desc *desc)
 {
@@ -33,6 +33,34 @@  static int sha1_init(struct shash_desc *desc)
 	return 0;
 }
 
+static u8 const *sha1_do_update(struct shash_desc *desc, const u8 *data,
+				int blocks, u8 *head, unsigned int len)
+{
+	struct sha1_state *sctx = shash_desc_ctx(desc);
+	struct thread_info *ti = NULL;
+
+	/*
+	 * Pass current's thread info pointer to sha1_ce_transform()
+	 * below if we want it to play nice under preemption.
+	 */
+	if ((IS_ENABLED(CONFIG_PREEMPT_VOLUNTARY) || IS_ENABLED(CONFIG_PREEMPT))
+	    && (desc->flags & CRYPTO_TFM_REQ_MAY_SLEEP))
+		ti = current_thread_info();
+
+	do {
+		int rem;
+
+		kernel_neon_begin_partial(16);
+		rem = sha1_ce_transform(blocks, data, sctx->state, head, 0, ti);
+		kernel_neon_end();
+
+		data += (blocks - rem) * SHA1_BLOCK_SIZE;
+		blocks = rem;
+		head = NULL;
+	} while (unlikely(ti && blocks > 0));
+	return data;
+}
+
 static int sha1_update(struct shash_desc *desc, const u8 *data,
 		       unsigned int len)
 {
@@ -42,8 +70,6 @@  static int sha1_update(struct shash_desc *desc, const u8 *data,
 	sctx->count += len;
 
 	if ((partial + len) >= SHA1_BLOCK_SIZE) {
-		int blocks;
-
 		if (partial) {
 			int p = SHA1_BLOCK_SIZE - partial;
 
@@ -52,15 +78,10 @@  static int sha1_update(struct shash_desc *desc, const u8 *data,
 			len -= p;
 		}
 
-		blocks = len / SHA1_BLOCK_SIZE;
-		len %= SHA1_BLOCK_SIZE;
+		data = sha1_do_update(desc, data, len / SHA1_BLOCK_SIZE,
+				      partial ? sctx->buffer : NULL, 0);
 
-		kernel_neon_begin_partial(16);
-		sha1_ce_transform(blocks, data, sctx->state,
-				  partial ? sctx->buffer : NULL, 0);
-		kernel_neon_end();
-
-		data += blocks * SHA1_BLOCK_SIZE;
+		len %= SHA1_BLOCK_SIZE;
 		partial = 0;
 	}
 	if (len)
@@ -95,7 +116,6 @@  static int sha1_finup(struct shash_desc *desc, const u8 *data,
 {
 	struct sha1_state *sctx = shash_desc_ctx(desc);
 	__be32 *dst = (__be32 *)out;
-	int blocks;
 	int i;
 
 	if (sctx->count || !len || (len % SHA1_BLOCK_SIZE)) {
@@ -109,11 +129,7 @@  static int sha1_finup(struct shash_desc *desc, const u8 *data,
 	 * perform the entire digest calculation in a single invocation
 	 * of sha1_ce_transform()
 	 */
-	blocks = len / SHA1_BLOCK_SIZE;
-
-	kernel_neon_begin_partial(16);
-	sha1_ce_transform(blocks, data, sctx->state, NULL, len);
-	kernel_neon_end();
+	sha1_do_update(desc, data, len / SHA1_BLOCK_SIZE, NULL, len);
 
 	for (i = 0; i < SHA1_DIGEST_SIZE / sizeof(__be32); i++)
 		put_unaligned_be32(sctx->state[i], dst++);