[1/4] crypto/arm64: ghash - reduce performance impact of NEON yield checks

Message ID 20180724171224.17363-2-ard.biesheuvel@linaro.org
State New
Series
  • crypto/arm64: reduce impact of NEON yield checks

Commit Message

Ard Biesheuvel July 24, 2018, 5:12 p.m.
As reported by Vakul, checking the TIF_NEED_RESCHED flag after every
iteration of the GHASH and AES-GCM core routines is having a considerable
performance impact on cores such as the Cortex-A53 with Crypto Extensions
implemented.

GHASH performance is down by 22% for large block sizes, and AES-GCM is
down by 16% for large block sizes and 128 bit keys. This appears to be
a result of the high performance of the crypto instructions on the one
hand (2.0 cycles per byte for GHASH, 3.0 cpb for AES-GCM), combined with
the relatively poor load/store performance of this simple core.

So let's reduce this performance impact by only doing the yield check
once every 32 blocks for GHASH (or 4 when using the version based on
8-bit polynomial multiplication), and once every 16 blocks for AES-GCM.
This way, we recover most of the performance while still limiting the
duration of scheduling blackouts due to disabling preemption to ~1000
cycles.

Cc: Vakul Garg <vakul.garg@nxp.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>

---
 arch/arm64/crypto/ghash-ce-core.S | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

-- 
2.11.0

Comments

Vakul Garg July 25, 2018, 6:57 a.m. | #1
> -----Original Message-----
> From: Ard Biesheuvel [mailto:ard.biesheuvel@linaro.org]
> Sent: Tuesday, July 24, 2018 10:42 PM
> To: linux-crypto@vger.kernel.org
> Cc: herbert@gondor.apana.org.au; will.deacon@arm.com;
> dave.martin@arm.com; Vakul Garg <vakul.garg@nxp.com>;
> bigeasy@linutronix.de; Ard Biesheuvel <ard.biesheuvel@linaro.org>
> Subject: [PATCH 1/4] crypto/arm64: ghash - reduce performance impact of
> NEON yield checks
>
> As reported by Vakul, checking the TIF_NEED_RESCHED flag after every
> iteration of the GHASH and AES-GCM core routines is having a considerable
> performance impact on cores such as the Cortex-A53 with Crypto Extensions
> implemented.
>
> GHASH performance is down by 22% for large block sizes, and AES-GCM is
> down by 16% for large block sizes and 128 bit keys. This appears to be a
> result of the high performance of the crypto instructions on the one hand
> (2.0 cycles per byte for GHASH, 3.0 cpb for AES-GCM), combined with the
> relatively poor load/store performance of this simple core.
>
> So let's reduce this performance impact by only doing the yield check once
> every 32 blocks for GHASH (or 4 when using the version based on 8-bit
> polynomial multiplication), and once every 16 blocks for AES-GCM.
> This way, we recover most of the performance while still limiting the
> duration of scheduling blackouts due to disabling preemption to ~1000
> cycles.

I tested this patch. It helped, but it did not restore performance to the previous level.
Are there more files remaining to be fixed? (Your original patch series that added the
preemptibility checks changed many more files than the four in this series.)

Instead of a hardcoded 32-block/16-block limit, should it be controlled via Kconfig?
I believe different cores may require different values.
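Such a knob could look roughly like the following Kconfig fragment. This is purely hypothetical: the symbol name, dependency, and help text are all invented here for illustration, and no such option exists in the series (which, as discussed below, rejects this approach in favor of the CONFIG_PREEMPT choice itself).

```
config CRYPTO_GHASH_ARM64_YIELD_BLOCKS
	int "Blocks between NEON yield checks (power of two)"
	depends on ARM64 && PREEMPT
	default 32
	help
	  Number of GHASH blocks the NEON code processes before checking
	  TIF_NEED_RESCHED. Larger values improve throughput at the cost
	  of longer non-preemptible sections.
```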

>
> Cc: Vakul Garg <vakul.garg@nxp.com>
> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> ---
>  arch/arm64/crypto/ghash-ce-core.S | 12 +++++++++---
>  1 file changed, 9 insertions(+), 3 deletions(-)
>
> diff --git a/arch/arm64/crypto/ghash-ce-core.S b/arch/arm64/crypto/ghash-ce-core.S
> index dcffb9e77589..9c14beaabeee 100644
> --- a/arch/arm64/crypto/ghash-ce-core.S
> +++ b/arch/arm64/crypto/ghash-ce-core.S
> @@ -212,7 +212,7 @@
>  	ushr		XL.2d, XL.2d, #1
>  	.endm
>
> -	.macro		__pmull_ghash, pn
> +	.macro		__pmull_ghash, pn, yield_count
>  	frame_push	5
>
>  	mov		x19, x0
> @@ -259,6 +259,9 @@ CPU_LE(	rev64		T1.16b, T1.16b	)
>  	eor		T2.16b, T2.16b, XH.16b
>  	eor		XL.16b, XL.16b, T2.16b
>
> +	tst		w19, #(\yield_count - 1)
> +	b.ne		1b
> +
>  	cbz		w19, 3f
>
>  	if_will_cond_yield_neon
> @@ -279,11 +282,11 @@ CPU_LE(	rev64		T1.16b, T1.16b	)
>  	 *			   struct ghash_key const *k, const char *head)
>  	 */
>  ENTRY(pmull_ghash_update_p64)
> -	__pmull_ghash	p64
> +	__pmull_ghash	p64, 32
>  ENDPROC(pmull_ghash_update_p64)
>
>  ENTRY(pmull_ghash_update_p8)
> -	__pmull_ghash	p8
> +	__pmull_ghash	p8, 4
>  ENDPROC(pmull_ghash_update_p8)
>
>  	KS		.req	v8
> @@ -428,6 +431,9 @@ CPU_LE(	rev		x28, x28	)
>  	st1		{INP.16b}, [x21], #16
>  	.endif
>
> +	tst		w19, #0xf			// do yield check only
> +	b.ne		1b				// once every 16 blocks
> +
>  	cbz		w19, 3f
>
>  	if_will_cond_yield_neon
> --
> 2.11.0
Sebastian Andrzej Siewior July 25, 2018, 7:03 a.m. | #2
On 2018-07-25 06:57:42 [+0000], Vakul Garg wrote:
> I tested this patch. It helped, but it did not restore performance to the previous level.
> Are there more files remaining to be fixed? (Your original patch series that added the
> preemptibility checks changed many more files than the four in this series.)
>
> Instead of a hardcoded 32-block/16-block limit, should it be controlled via Kconfig?
> I believe different cores may require different values.

What about PREEMPT_NONE (server)?

Sebastian
Vakul Garg July 25, 2018, 7:04 a.m. | #3
> -----Original Message-----
> From: bigeasy@linutronix.de [mailto:bigeasy@linutronix.de]
> Sent: Wednesday, July 25, 2018 12:33 PM
> To: Vakul Garg <vakul.garg@nxp.com>
> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>; linux-crypto@vger.kernel.org;
> herbert@gondor.apana.org.au; will.deacon@arm.com; dave.martin@arm.com
> Subject: Re: [PATCH 1/4] crypto/arm64: ghash - reduce performance impact
> of NEON yield checks
>
> On 2018-07-25 06:57:42 [+0000], Vakul Garg wrote:
> > I tested this patch. It helped, but it did not restore performance to the previous level.
> > Are there more files remaining to be fixed? (Your original patch series that added the
> > preemptibility checks changed many more files than the four in this series.)
> >
> > Instead of a hardcoded 32-block/16-block limit, should it be controlled via Kconfig?
> > I believe different cores may require different values.
>
> What about PREEMPT_NONE (server)?


Why not have the best of both worlds? :)

>
> Sebastian
Sebastian Andrzej Siewior July 25, 2018, 7:11 a.m. | #4
On 2018-07-25 07:04:55 [+0000], Vakul Garg wrote:
> > What about PREEMPT_NONE (server)?
>
> Why not have the best of both worlds? :)


The NEON code gets interrupted because another task wants to run and the
scheduler allows it. With "low latency desktop" that preemption happens
right away; the lower preemption levels won't reschedule as quickly. So if
you are after raw performance, a lower preemption level should give you
more of it. If you are after low latency…

Sebastian
Ard Biesheuvel July 25, 2018, 7:27 a.m. | #5
(+ Mark)

On 25 July 2018 at 08:57, Vakul Garg <vakul.garg@nxp.com> wrote:
>
>> -----Original Message-----
>> From: Ard Biesheuvel [mailto:ard.biesheuvel@linaro.org]
>> Sent: Tuesday, July 24, 2018 10:42 PM
>> To: linux-crypto@vger.kernel.org
>> Cc: herbert@gondor.apana.org.au; will.deacon@arm.com;
>> dave.martin@arm.com; Vakul Garg <vakul.garg@nxp.com>;
>> bigeasy@linutronix.de; Ard Biesheuvel <ard.biesheuvel@linaro.org>
>> Subject: [PATCH 1/4] crypto/arm64: ghash - reduce performance impact of
>> NEON yield checks
>>
>> As reported by Vakul, checking the TIF_NEED_RESCHED flag after every
>> iteration of the GHASH and AES-GCM core routines is having a considerable
>> performance impact on cores such as the Cortex-A53 with Crypto Extensions
>> implemented.
>>
>> GHASH performance is down by 22% for large block sizes, and AES-GCM is
>> down by 16% for large block sizes and 128 bit keys. This appears to be a
>> result of the high performance of the crypto instructions on the one hand
>> (2.0 cycles per byte for GHASH, 3.0 cpb for AES-GCM), combined with the
>> relatively poor load/store performance of this simple core.
>>
>> So let's reduce this performance impact by only doing the yield check once
>> every 32 blocks for GHASH (or 4 when using the version based on 8-bit
>> polynomial multiplication), and once every 16 blocks for AES-GCM.
>> This way, we recover most of the performance while still limiting the
>> duration of scheduling blackouts due to disabling preemption to ~1000
>> cycles.
>
> I tested this patch. It helped, but it did not restore performance to the previous level.
> Are there more files remaining to be fixed? (Your original patch series that added the
> preemptibility checks changed many more files than the four in this series.)
>
> Instead of a hardcoded 32-block/16-block limit, should it be controlled via Kconfig?
> I believe different cores may require different values.
>

Simply enabling CONFIG_PREEMPT already causes an 8% performance hit on
my 24xA53 system, probably because each per-CPU variable access
involves disabling and re-enabling preemption, turning every per-CPU
load into 2 loads and a store, which hurts on this particular core.
Mark and I have played around a bit with using a GPR to record the
per-CPU offset, which would make this unnecessary, but this has its
own set of problems so that is not expected to land any time soon.

So if you care that much about squeezing the last drop of throughput
out of your system without regard for worst case scheduling latency,
disabling CONFIG_PREEMPT is a much better idea than playing around
with tunables to tweak the maximum quantum of work that is executed
with preemption disabled, especially since distro kernels will pick
the default anyway.
Ard Biesheuvel July 25, 2018, 8:09 a.m. | #6
On 25 July 2018 at 09:27, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
> (+ Mark)
>
> On 25 July 2018 at 08:57, Vakul Garg <vakul.garg@nxp.com> wrote:
>>
>>> -----Original Message-----
>>> From: Ard Biesheuvel [mailto:ard.biesheuvel@linaro.org]
>>> Sent: Tuesday, July 24, 2018 10:42 PM
>>> To: linux-crypto@vger.kernel.org
>>> Cc: herbert@gondor.apana.org.au; will.deacon@arm.com;
>>> dave.martin@arm.com; Vakul Garg <vakul.garg@nxp.com>;
>>> bigeasy@linutronix.de; Ard Biesheuvel <ard.biesheuvel@linaro.org>
>>> Subject: [PATCH 1/4] crypto/arm64: ghash - reduce performance impact of
>>> NEON yield checks
>>>
>>> As reported by Vakul, checking the TIF_NEED_RESCHED flag after every
>>> iteration of the GHASH and AES-GCM core routines is having a considerable
>>> performance impact on cores such as the Cortex-A53 with Crypto Extensions
>>> implemented.
>>>
>>> GHASH performance is down by 22% for large block sizes, and AES-GCM is
>>> down by 16% for large block sizes and 128 bit keys. This appears to be a
>>> result of the high performance of the crypto instructions on the one hand
>>> (2.0 cycles per byte for GHASH, 3.0 cpb for AES-GCM), combined with the
>>> relatively poor load/store performance of this simple core.
>>>
>>> So let's reduce this performance impact by only doing the yield check once
>>> every 32 blocks for GHASH (or 4 when using the version based on 8-bit
>>> polynomial multiplication), and once every 16 blocks for AES-GCM.
>>> This way, we recover most of the performance while still limiting the
>>> duration of scheduling blackouts due to disabling preemption to ~1000
>>> cycles.
>>
>> I tested this patch. It helped, but it did not restore performance to the previous level.
>> Are there more files remaining to be fixed? (Your original patch series that added the
>> preemptibility checks changed many more files than the four in this series.)
>>
>> Instead of a hardcoded 32-block/16-block limit, should it be controlled via Kconfig?
>> I believe different cores may require different values.
>>
>
> Simply enabling CONFIG_PREEMPT already causes an 8% performance hit on
> my 24xA53 system, probably because each per-CPU variable access
> involves disabling and re-enabling preemption, turning every per-CPU
> load into 2 loads and a store,

Actually, it is more like

    load/store
    load
    load/store

so 3 loads and 2 stores.

> which hurts on this particular core.
> Mark and I have played around a bit with using a GPR to record the
> per-CPU offset, which would make this unnecessary, but this has its
> own set of problems so that is not expected to land any time soon.
>
> So if you care that much about squeezing the last drop of throughput
> out of your system without regard for worst case scheduling latency,
> disabling CONFIG_PREEMPT is a much better idea than playing around
> with tunables to tweak the maximum quantum of work that is executed
> with preemption disabled, especially since distro kernels will pick
> the default anyway.
Dave Martin July 25, 2018, 9:05 a.m. | #7
On Tue, Jul 24, 2018 at 06:12:21PM +0100, Ard Biesheuvel wrote:
> As reported by Vakul, checking the TIF_NEED_RESCHED flag after every
> iteration of the GHASH and AES-GCM core routines is having a considerable
> performance impact on cores such as the Cortex-A53 with Crypto Extensions
> implemented.
>
> GHASH performance is down by 22% for large block sizes, and AES-GCM is
> down by 16% for large block sizes and 128 bit keys. This appears to be
> a result of the high performance of the crypto instructions on the one
> hand (2.0 cycles per byte for GHASH, 3.0 cpb for AES-GCM), combined with
> the relatively poor load/store performance of this simple core.
>
> So let's reduce this performance impact by only doing the yield check
> once every 32 blocks for GHASH (or 4 when using the version based on
> 8-bit polynomial multiplication), and once every 16 blocks for AES-GCM.
> This way, we recover most of the performance while still limiting the
> duration of scheduling blackouts due to disabling preemption to ~1000
> cycles.
>
> Cc: Vakul Garg <vakul.garg@nxp.com>
> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> ---
>  arch/arm64/crypto/ghash-ce-core.S | 12 +++++++++---
>  1 file changed, 9 insertions(+), 3 deletions(-)
>
> diff --git a/arch/arm64/crypto/ghash-ce-core.S b/arch/arm64/crypto/ghash-ce-core.S
> index dcffb9e77589..9c14beaabeee 100644
> --- a/arch/arm64/crypto/ghash-ce-core.S
> +++ b/arch/arm64/crypto/ghash-ce-core.S
> @@ -212,7 +212,7 @@
>  	ushr		XL.2d, XL.2d, #1
>  	.endm
>
> -	.macro		__pmull_ghash, pn
> +	.macro		__pmull_ghash, pn, yield_count
>  	frame_push	5
>
>  	mov		x19, x0
> @@ -259,6 +259,9 @@ CPU_LE(	rev64		T1.16b, T1.16b	)
>  	eor		T2.16b, T2.16b, XH.16b
>  	eor		XL.16b, XL.16b, T2.16b
>
> +	tst		w19, #(\yield_count - 1)

This should perhaps be (\yield_count) - 1.

It would be a bit silly to pass a non-trivial expression for yield_count
though.

> +	b.ne		1b
> +


Is it worth having a build-time check that \yield_count is a power of two?
(i.e., (\yield_count) & ((\yield_count) - 1) != 0).  We could have a
generic macro for that.

Otherwise this code may poll more often than expected.  Not the end of
the world, though.

[...]

Cheers
---Dave
Ard Biesheuvel July 25, 2018, 9:11 a.m. | #8
On 25 July 2018 at 11:05, Dave Martin <Dave.Martin@arm.com> wrote:
> On Tue, Jul 24, 2018 at 06:12:21PM +0100, Ard Biesheuvel wrote:
>> As reported by Vakul, checking the TIF_NEED_RESCHED flag after every
>> iteration of the GHASH and AES-GCM core routines is having a considerable
>> performance impact on cores such as the Cortex-A53 with Crypto Extensions
>> implemented.
>>
>> GHASH performance is down by 22% for large block sizes, and AES-GCM is
>> down by 16% for large block sizes and 128 bit keys. This appears to be
>> a result of the high performance of the crypto instructions on the one
>> hand (2.0 cycles per byte for GHASH, 3.0 cpb for AES-GCM), combined with
>> the relatively poor load/store performance of this simple core.
>>
>> So let's reduce this performance impact by only doing the yield check
>> once every 32 blocks for GHASH (or 4 when using the version based on
>> 8-bit polynomial multiplication), and once every 16 blocks for AES-GCM.
>> This way, we recover most of the performance while still limiting the
>> duration of scheduling blackouts due to disabling preemption to ~1000
>> cycles.
>>
>> Cc: Vakul Garg <vakul.garg@nxp.com>
>> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
>> ---
>>  arch/arm64/crypto/ghash-ce-core.S | 12 +++++++++---
>>  1 file changed, 9 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/arm64/crypto/ghash-ce-core.S b/arch/arm64/crypto/ghash-ce-core.S
>> index dcffb9e77589..9c14beaabeee 100644
>> --- a/arch/arm64/crypto/ghash-ce-core.S
>> +++ b/arch/arm64/crypto/ghash-ce-core.S
>> @@ -212,7 +212,7 @@
>>       ushr            XL.2d, XL.2d, #1
>>       .endm
>>
>> -     .macro          __pmull_ghash, pn
>> +     .macro          __pmull_ghash, pn, yield_count
>>       frame_push      5
>>
>>       mov             x19, x0
>> @@ -259,6 +259,9 @@ CPU_LE(   rev64           T1.16b, T1.16b  )
>>       eor             T2.16b, T2.16b, XH.16b
>>       eor             XL.16b, XL.16b, T2.16b
>>
>> +     tst             w19, #(\yield_count - 1)
>
> This should perhaps be (\yield_count) - 1.
>
> It would be a bit silly to pass a non-trivial expression for yield_count
> though.
>
>> +     b.ne            1b
>> +
>
> Is it worth having a build-time check that \yield_count is a power of two?
> (i.e., (\yield_count) & ((\yield_count) - 1) != 0).  We could have a
> generic macro for that.
>
> Otherwise this code may poll more often than expected.  Not the end of
> the world, though.
>

Thanks for taking a look.

Given that the macro in question is used in exactly two places in the
same file, and is unlikely to be reused unless the architecture gains
support for another optional instruction that can be used as a drop-in
replacement, I don't think any of this truly matters tbh.
Dave Martin July 25, 2018, 9:48 a.m. | #9
On Wed, Jul 25, 2018 at 10:11:42AM +0100, Ard Biesheuvel wrote:
> On 25 July 2018 at 11:05, Dave Martin <Dave.Martin@arm.com> wrote:
> > On Tue, Jul 24, 2018 at 06:12:21PM +0100, Ard Biesheuvel wrote:
> >> As reported by Vakul, checking the TIF_NEED_RESCHED flag after every
> >> iteration of the GHASH and AES-GCM core routines is having a considerable
> >> performance impact on cores such as the Cortex-A53 with Crypto Extensions
> >> implemented.
> >>
> >> GHASH performance is down by 22% for large block sizes, and AES-GCM is
> >> down by 16% for large block sizes and 128 bit keys. This appears to be
> >> a result of the high performance of the crypto instructions on the one
> >> hand (2.0 cycles per byte for GHASH, 3.0 cpb for AES-GCM), combined with
> >> the relatively poor load/store performance of this simple core.
> >>
> >> So let's reduce this performance impact by only doing the yield check
> >> once every 32 blocks for GHASH (or 4 when using the version based on
> >> 8-bit polynomial multiplication), and once every 16 blocks for AES-GCM.
> >> This way, we recover most of the performance while still limiting the
> >> duration of scheduling blackouts due to disabling preemption to ~1000
> >> cycles.
> >>
> >> Cc: Vakul Garg <vakul.garg@nxp.com>
> >> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> >> ---
> >>  arch/arm64/crypto/ghash-ce-core.S | 12 +++++++++---
> >>  1 file changed, 9 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/arch/arm64/crypto/ghash-ce-core.S b/arch/arm64/crypto/ghash-ce-core.S
> >> index dcffb9e77589..9c14beaabeee 100644
> >> --- a/arch/arm64/crypto/ghash-ce-core.S
> >> +++ b/arch/arm64/crypto/ghash-ce-core.S
> >> @@ -212,7 +212,7 @@
> >>       ushr            XL.2d, XL.2d, #1
> >>       .endm
> >>
> >> -     .macro          __pmull_ghash, pn
> >> +     .macro          __pmull_ghash, pn, yield_count
> >>       frame_push      5
> >>
> >>       mov             x19, x0
> >> @@ -259,6 +259,9 @@ CPU_LE(   rev64           T1.16b, T1.16b  )
> >>       eor             T2.16b, T2.16b, XH.16b
> >>       eor             XL.16b, XL.16b, T2.16b
> >>
> >> +     tst             w19, #(\yield_count - 1)
> >
> > This should perhaps be (\yield_count) - 1.
> >
> > It would be a bit silly to pass a non-trivial expression for yield_count
> > though.
> >
> >> +     b.ne            1b
> >> +
> >
> > Is it worth having a build-time check that \yield_count is a power of two?
> > (i.e., (\yield_count) & ((\yield_count) - 1) != 0).  We could have a
> > generic macro for that.
> >
> > Otherwise this code may poll more often than expected.  Not the end of
> > the world, though.
> >
>
> Thanks for taking a look.
>
> Given that the macro in question is used in exactly two places in the
> same file, and is unlikely to be reused unless the architecture gains
> support for another optional instruction that can be used as a drop-in
> replacement, I don't think any of this truly matters tbh.

Fair enough.  A comment on the macro definition might help, but beyond
that it is probably overkill.

Cheers
---Dave
Ard Biesheuvel July 25, 2018, 9:49 a.m. | #10
On 25 July 2018 at 11:48, Dave Martin <Dave.Martin@arm.com> wrote:
> On Wed, Jul 25, 2018 at 10:11:42AM +0100, Ard Biesheuvel wrote:
>> On 25 July 2018 at 11:05, Dave Martin <Dave.Martin@arm.com> wrote:
>> > On Tue, Jul 24, 2018 at 06:12:21PM +0100, Ard Biesheuvel wrote:
>> >> As reported by Vakul, checking the TIF_NEED_RESCHED flag after every
>> >> iteration of the GHASH and AES-GCM core routines is having a considerable
>> >> performance impact on cores such as the Cortex-A53 with Crypto Extensions
>> >> implemented.
>> >>
>> >> GHASH performance is down by 22% for large block sizes, and AES-GCM is
>> >> down by 16% for large block sizes and 128 bit keys. This appears to be
>> >> a result of the high performance of the crypto instructions on the one
>> >> hand (2.0 cycles per byte for GHASH, 3.0 cpb for AES-GCM), combined with
>> >> the relatively poor load/store performance of this simple core.
>> >>
>> >> So let's reduce this performance impact by only doing the yield check
>> >> once every 32 blocks for GHASH (or 4 when using the version based on
>> >> 8-bit polynomial multiplication), and once every 16 blocks for AES-GCM.
>> >> This way, we recover most of the performance while still limiting the
>> >> duration of scheduling blackouts due to disabling preemption to ~1000
>> >> cycles.
>> >>
>> >> Cc: Vakul Garg <vakul.garg@nxp.com>
>> >> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
>> >> ---
>> >>  arch/arm64/crypto/ghash-ce-core.S | 12 +++++++++---
>> >>  1 file changed, 9 insertions(+), 3 deletions(-)
>> >>
>> >> diff --git a/arch/arm64/crypto/ghash-ce-core.S b/arch/arm64/crypto/ghash-ce-core.S
>> >> index dcffb9e77589..9c14beaabeee 100644
>> >> --- a/arch/arm64/crypto/ghash-ce-core.S
>> >> +++ b/arch/arm64/crypto/ghash-ce-core.S
>> >> @@ -212,7 +212,7 @@
>> >>       ushr            XL.2d, XL.2d, #1
>> >>       .endm
>> >>
>> >> -     .macro          __pmull_ghash, pn
>> >> +     .macro          __pmull_ghash, pn, yield_count
>> >>       frame_push      5
>> >>
>> >>       mov             x19, x0
>> >> @@ -259,6 +259,9 @@ CPU_LE(   rev64           T1.16b, T1.16b  )
>> >>       eor             T2.16b, T2.16b, XH.16b
>> >>       eor             XL.16b, XL.16b, T2.16b
>> >>
>> >> +     tst             w19, #(\yield_count - 1)
>> >
>> > This should perhaps be (\yield_count) - 1.
>> >
>> > It would be a bit silly to pass a non-trivial expression for yield_count
>> > though.
>> >
>> >> +     b.ne            1b
>> >> +
>> >
>> > Is it worth having a build-time check that \yield_count is a power of two?
>> > (i.e., (\yield_count) & ((\yield_count) - 1) != 0).  We could have a
>> > generic macro for that.
>> >
>> > Otherwise this code may poll more often than expected.  Not the end of
>> > the world, though.
>> >
>>
>> Thanks for taking a look.
>>
>> Given that the macro in question is used in exactly two places in the
>> same file, and is unlikely to be reused unless the architecture gains
>> support for another optional instruction that can be used as a drop-in
>> replacement, I don't think any of this truly matters tbh.
>
> Fair enough.  A comment on the macro definition might help, but beyond
> that it is probably overkill.
>

Sure. I'll add that in v2.

Patch

diff --git a/arch/arm64/crypto/ghash-ce-core.S b/arch/arm64/crypto/ghash-ce-core.S
index dcffb9e77589..9c14beaabeee 100644
--- a/arch/arm64/crypto/ghash-ce-core.S
+++ b/arch/arm64/crypto/ghash-ce-core.S
@@ -212,7 +212,7 @@ 
 	ushr		XL.2d, XL.2d, #1
 	.endm
 
-	.macro		__pmull_ghash, pn
+	.macro		__pmull_ghash, pn, yield_count
 	frame_push	5
 
 	mov		x19, x0
@@ -259,6 +259,9 @@  CPU_LE(	rev64		T1.16b, T1.16b	)
 	eor		T2.16b, T2.16b, XH.16b
 	eor		XL.16b, XL.16b, T2.16b
 
+	tst		w19, #(\yield_count - 1)
+	b.ne		1b
+
 	cbz		w19, 3f
 
 	if_will_cond_yield_neon
@@ -279,11 +282,11 @@  CPU_LE(	rev64		T1.16b, T1.16b	)
 	 *			   struct ghash_key const *k, const char *head)
 	 */
 ENTRY(pmull_ghash_update_p64)
-	__pmull_ghash	p64
+	__pmull_ghash	p64, 32
 ENDPROC(pmull_ghash_update_p64)
 
 ENTRY(pmull_ghash_update_p8)
-	__pmull_ghash	p8
+	__pmull_ghash	p8, 4
 ENDPROC(pmull_ghash_update_p8)
 
 	KS		.req	v8
@@ -428,6 +431,9 @@  CPU_LE(	rev		x28, x28	)
 	st1		{INP.16b}, [x21], #16
 	.endif
 
+	tst		w19, #0xf			// do yield check only
+	b.ne		1b				// once every 16 blocks
+
 	cbz		w19, 3f
 
 	if_will_cond_yield_neon