
[0/4] crypto/arm64: reduce impact of NEON yield checks

Message ID 20180724171224.17363-1-ard.biesheuvel@linaro.org

Message

Ard Biesheuvel July 24, 2018, 5:12 p.m. UTC
Vakul reports a considerable performance hit when running the accelerated
arm64 crypto routines with CONFIG_PREEMPT=y configured, now that they have
been updated to take the TIF_NEED_RESCHED flag into account.

The issue appears to be caused by the fact that Cortex-A53, the core in
question, has a high-end implementation of the Crypto Extensions and a
shallow pipeline, which means that even sequential algorithms that may be
held back by pipeline stalls on high-end out-of-order cores run at maximum
speed here. As a result, SHA-1, SHA-2, GHASH and AES in GCM and CCM modes
run at a speed on the order of 2 to 4 cycles per byte, and are currently
implemented to check the TIF_NEED_RESCHED flag after each iteration, which
may process as little as 16 bytes (for GHASH).

Obviously, every cycle of overhead hurts in this context, and given that
the A53's load/store unit is not quite high end, any delays caused by
memory accesses that occur in the inner loop of the algorithms are going
to be quite significant, hence the performance regression.

So reduce the frequency at which the NEON yield checks are performed, so
that they occur roughly once every 1000 cycles, which is hopefully a
reasonable tradeoff between throughput and worst-case scheduling latency.
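
(For illustration only: the patches below adjust the assembly yield hooks,
but the batching idea is roughly the following C-level sketch. YIELD_BLOCKS,
BLOCK_SIZE and process_blocks() are hypothetical placeholders, not names
from the actual code.)

/*
 * Illustrative C-level sketch of the batching idea only -- the real
 * changes live in the .S files listed below.
 */
#include <linux/kernel.h>	/* min()          */
#include <linux/sched.h>	/* need_resched() */
#include <linux/types.h>	/* u8             */
#include <asm/neon.h>		/* kernel_neon_begin()/kernel_neon_end() */

#define YIELD_BLOCKS	16	/* placeholder: sized for ~1000 cycles per batch */
#define BLOCK_SIZE	16	/* placeholder: e.g. one GHASH block             */

void process_blocks(u8 *dst, const u8 *src, int blocks);	/* placeholder NEON inner loop */

static void process_all(u8 *dst, const u8 *src, int blocks)
{
	kernel_neon_begin();
	while (blocks > 0) {
		int n = min(blocks, YIELD_BLOCKS);

		process_blocks(dst, src, n);
		dst += n * BLOCK_SIZE;
		src += n * BLOCK_SIZE;
		blocks -= n;

		/* check for a pending reschedule once per batch, not per block */
		if (blocks > 0 && need_resched()) {
			kernel_neon_end();	/* gives the scheduler a chance to run */
			kernel_neon_begin();
		}
	}
	kernel_neon_end();
}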

Ard Biesheuvel (4):
  crypto/arm64: ghash - reduce performance impact of NEON yield checks
  crypto/arm64: aes-ccm - reduce performance impact of NEON yield checks
  crypto/arm64: sha1 - reduce performance impact of NEON yield checks
  crypto/arm64: sha2 - reduce performance impact of NEON yield checks

 arch/arm64/crypto/aes-ce-ccm-core.S |  3 +++
 arch/arm64/crypto/ghash-ce-core.S   | 12 +++++++++---
 arch/arm64/crypto/sha1-ce-core.S    |  3 +++
 arch/arm64/crypto/sha2-ce-core.S    |  3 +++
 4 files changed, 18 insertions(+), 3 deletions(-)

-- 
2.11.0

Comments

Sebastian Andrzej Siewior July 24, 2018, 5:21 p.m. UTC | #1
On 2018-07-24 19:12:20 [+0200], Ard Biesheuvel wrote:
> Vakul reports a considerable performance hit when running the accelerated
> arm64 crypto routines with CONFIG_PREEMPT=y configured, now that they have
> been updated to take the TIF_NEED_RESCHED flag into account.

Just in time. I will try to come up with some numbers on RT with the
original patch and with that one. I have it almost working…

Sebastian
Dave Martin July 25, 2018, 9:09 a.m. UTC | #2
On Tue, Jul 24, 2018 at 06:12:20PM +0100, Ard Biesheuvel wrote:
> Vakul reports a considerable performance hit when running the accelerated
> arm64 crypto routines with CONFIG_PREEMPT=y configured, now that they have
> been updated to take the TIF_NEED_RESCHED flag into account.
>
> The issue appears to be caused by the fact that Cortex-A53, the core in
> question, has a high-end implementation of the Crypto Extensions and a
> shallow pipeline, which means that even sequential algorithms that may be
> held back by pipeline stalls on high-end out-of-order cores run at maximum
> speed here. As a result, SHA-1, SHA-2, GHASH and AES in GCM and CCM modes
> run at a speed on the order of 2 to 4 cycles per byte, and are currently
> implemented to check the TIF_NEED_RESCHED flag after each iteration, which
> may process as little as 16 bytes (for GHASH).
>
> Obviously, every cycle of overhead hurts in this context, and given that
> the A53's load/store unit is not quite high end, any delays caused by
> memory accesses that occur in the inner loop of the algorithms are going
> to be quite significant, hence the performance regression.
>
> So reduce the frequency at which the NEON yield checks are performed, so
> that they occur roughly once every 1000 cycles, which is hopefully a
> reasonable tradeoff between throughput and worst-case scheduling latency.

Is there any way to tune this automatically, or is that likely to be more
trouble than it's worth?

Also, how did you come up with 1000 cycles?  At what point does
preemption latency become more/less important than throughput?

Maybe someone already invented a similar framework somewhere else in the
kernel.  I seem to remember some automatic selection of memcpy
implementation based on a boot-time benchmark, but I may have
misremembered.

Cheers
---Dave
Ard Biesheuvel July 25, 2018, 9:23 a.m. UTC | #3
On 25 July 2018 at 11:09, Dave Martin <Dave.Martin@arm.com> wrote:
> On Tue, Jul 24, 2018 at 06:12:20PM +0100, Ard Biesheuvel wrote:
>> [...]
>>
>> So reduce the frequency at which the NEON yield checks are performed, so
>> that they occur roughly once every 1000 cycles, which is hopefully a
>> reasonable tradeoff between throughput and worst-case scheduling latency.
>
> Is there any way to tune this automatically, or is that likely to be more
> trouble than it's worth?
>

Good question. I think A53 is a reasonable worst case, and these
changes reduce the impact to the same ballpark as the impact of
enabling CONFIG_PREEMPT in the first place.

> Also, how did you come up with 1000 cycles?  At what point does
> preemption latency become more/less important than throughput?
>

Another good question. I was hoping Sebastian or the other -rt folks
would be triggered by this. Given the above, I ended up with a ~1000-cycle
quantum, and hopefully this is considered to be small enough.

> Maybe someone already invented a similar framework somewhere else in the
> kernel.  I seem to remember some automatic selection of memcpy
> implementation based on a boot-time benchmark, but I may have
> misremembered.
>

We have crypto benchmarking code in the kernel, and at some point, I
even did some work on selecting the best algo based on performance.

But to be honest, I think this is a bit overkill. If you need those
final 5% of throughput at any cost, you're better off running with
CONFIG_PREEMPT=n anyway.
Dave Martin July 25, 2018, 9:45 a.m. UTC | #4
On Wed, Jul 25, 2018 at 10:23:00AM +0100, Ard Biesheuvel wrote:
> On 25 July 2018 at 11:09, Dave Martin <Dave.Martin@arm.com> wrote:
> > On Tue, Jul 24, 2018 at 06:12:20PM +0100, Ard Biesheuvel wrote:
> >> [...]
> >
> > Is there any way to tune this automatically, or is that likely to be more
> > trouble than it's worth?
> >
>
> Good question. I think A53 is a reasonable worst case, and these
> changes reduce the impact to the same ballpark as the impact of
> enabling CONFIG_PREEMPT in the first place.
>
> > Also, how did you come up with 1000 cycles?  At what point does
> > preemption latency become more/less important than throughput?
> >
>
> Another good question. I was hoping Sebastian or the other -rt folks
> would be triggered by this. Given the above, I ended up with a ~1000-cycle
> quantum, and hopefully this is considered to be small enough.
>
> > Maybe someone already invented a similar framework somewhere else in the
> > kernel.  I seem to remember some automatic selection of memcpy
> > implementation based on a boot-time benchmark, but I may have
> > misremembered.
> >
>
> We have crypto benchmarking code in the kernel, and at some point, I
> even did some work on selecting the best algo based on performance.
>
> But to be honest, I think this is a bit overkill. If you need those
> final 5% of throughput at any cost, you're better off running with
> CONFIG_PREEMPT=n anyway.

Can't really argue with any of that -- I was just wondering whether
there was precedent.

Hopefully the ~1000 cycles ballpark will satisfy most people.  For
the rest, it's too bad: if somebody is relying on the last 1-2% of
performance, they probably have a broken use case.

Cheers
---Dave
Ard Biesheuvel July 25, 2018, 9:54 a.m. UTC | #5
On 25 July 2018 at 11:45, Dave Martin <Dave.Martin@arm.com> wrote:
> On Wed, Jul 25, 2018 at 10:23:00AM +0100, Ard Biesheuvel wrote:
>> [...]
>>
>> But to be honest, I think this is a bit overkill. If you need those
>> final 5% of throughput at any cost, you're better off running with
>> CONFIG_PREEMPT=n anyway.
>
> Can't really argue with any of that -- I was just wondering whether
> there was precedent.
>
> Hopefully the ~1000 cycles ballpark will satisfy most people.  For
> the rest, it's too bad: if somebody is relying on the last 1-2% of
> performance, they probably have a broken use case.
>

Indeed. OTOH, if the -rt people (Sebastian?) turn up and say that a
1000 cycle limit to the quantum of work performed with preemption
disabled is unreasonably low, we can increase the yield block counts
and approach the optimal numbers a bit closer. But with diminishing
returns.

Also, if the cost of enabling CONFIG_PREEMPT by itself is
significantly reduced, e.g., by moving the per-CPU offset into a GPR,
we can always revisit this of course.
Sebastian Andrzej Siewior July 25, 2018, 4:50 p.m. UTC | #6
On 2018-07-25 11:54:53 [+0200], Ard Biesheuvel wrote:
> Indeed. OTOH, if the -rt people (Sebastian?) turn up and say that a
> 1000 cycle limit to the quantum of work performed with preemption
> disabled is unreasonably low, we can increase the yield block counts
> and approach the optimal numbers a bit closer. But with diminishing
> returns.

So I tested on SoftIron Overdrive 1000, which has A57 cores. I added this
series and didn't notice any spikes: cyclictest reported a max value of
~20us, which means the crypto code was not noticeable.
I played a little with it and ran tcrypt tests for aes/sha1, and again saw
no huge spikes. So at this point this looks fantastic. I also set up
cryptsetup / dm-crypt with the usual xts(aes) mode and saw no spikes.
At this point, on this hardware, I wouldn't mind if you want to raise the
block count.

I remember on x86 the SIMD accelerated ciphers led to ~1ms+ spikes once
dm-crypt started its jobs.

Sebastian
Ard Biesheuvel July 26, 2018, 7:25 a.m. UTC | #7
On 25 July 2018 at 18:50, bigeasy@linutronix.de <bigeasy@linutronix.de> wrote:
> On 2018-07-25 11:54:53 [+0200], Ard Biesheuvel wrote:
>> [...]
>
> So I tested on SoftIron Overdrive 1000, which has A57 cores. I added this
> series and didn't notice any spikes: cyclictest reported a max value of
> ~20us, which means the crypto code was not noticeable.
> I played a little with it and ran tcrypt tests for aes/sha1, and again saw
> no huge spikes. So at this point this looks fantastic. I also set up
> cryptsetup / dm-crypt with the usual xts(aes) mode and saw no spikes.
> At this point, on this hardware, I wouldn't mind if you want to raise the
> block count.
>
> I remember on x86 the SIMD accelerated ciphers led to ~1ms+ spikes once
> dm-crypt started its jobs.

Thanks a lot.

So 20 us ~= 20,000 cycles on my 1 GHz Cortex-A53, and if I am
understanding you correctly, you wouldn't mind the quantum of work to
be on the order of 16,000 cycles or even substantially more?

That is good news, but it is also rather interesting, given that these
algorithms run at ~4 cycles per byte, meaning that you'd manage an
entire 4 KB page without ever yielding. (GCM is used on network
packets, XTS on disk sectors which are all smaller than that)
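
(As a quick sanity check of that arithmetic -- using only the ballpark
figures quoted above, ~4 cycles per byte at 1 GHz, nothing measured:)

#include <stdio.h>

int main(void)
{
	const double cycles_per_byte = 4.0;	/* ballpark figure quoted above */
	const double clock_hz = 1e9;		/* 1 GHz Cortex-A53 */
	const double page_bytes = 4096.0;	/* one 4 KB page */

	double cycles = page_bytes * cycles_per_byte;	/* 16384 cycles */
	double usecs = cycles / clock_hz * 1e6;		/* ~16.4 us */

	printf("4 KB page: %.0f cycles, ~%.1f us (vs. a ~20 us budget)\n",
	       cycles, usecs);
	return 0;
}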

Do you remember how you found out NEON use is a problem for -rt on
arm64 in the first place? Which algorithm did you test at the time to
arrive at this conclusion?

Note that AES-GCM using ordinary SIMD instructions runs at 29 cpb, and
plain AES at ~20 (on A53), so perhaps it would make sense to
distinguish between algos using crypto instructions and ones using
plain SIMD.
Sebastian Andrzej Siewior July 26, 2018, 8:23 a.m. UTC | #8
On 2018-07-26 09:25:40 [+0200], Ard Biesheuvel wrote:
> Thanks a lot.
>
> So 20 us ~= 20,000 cycles on my 1 GHz Cortex-A53, and if I am
> understanding you correctly, you wouldn't mind the quantum of work to
> be on the order of 16,000 cycles or even substantially more?

I currently have that one box and it does not seem to be a problem.
It reports around 20us max when idle. So if we add "only" 20us for NEON /
your preempt-disable section, then we may end up at 20+20 = 40us.
At this point I am not sure how "bad" that is. It works, it does not seem
like much, and you can disable it if you don't want the extra 20us here.

> That is good news, but it is also rather interesting, given that these
> algorithms run at ~4 cycles per byte, meaning that you'd manage an
> entire 4 KB page without ever yielding. (GCM is used on network
> packets, XTS on disk sectors which are all smaller than that)
>
> Do you remember how you found out NEON use is a problem for -rt on
> arm64 in the first place? Which algorithm did you test at the time to
> arrive at this conclusion?

I *think* that the yield check got in there by chance. The main problem
back at the time was that the scatterlist walk happened within the neon
begin/end section. That walk may invoke kmap() / kmalloc() / kfree(),
which is not allowed on RT within a preempt-disable section. This was my
main concern.
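
(For readers unfamiliar with the glue code, the shape of that concern is
roughly the following -- a simplified sketch, not the actual driver code;
do_neon_blocks() is a placeholder and error handling is omitted:)

#include <crypto/internal/skcipher.h>	/* struct skcipher_walk, skcipher_walk_done() */
#include <asm/neon.h>			/* kernel_neon_begin()/kernel_neon_end()      */

void do_neon_blocks(void *dst, const void *src, unsigned int nbytes);	/* placeholder */

/* Problematic on RT: the scatterlist walk is advanced with the NEON
 * context held, i.e. with preemption disabled, yet skcipher_walk_done()
 * may need to kmap()/allocate/free memory. */
static int encrypt_walk_inside(struct skcipher_walk *walk)
{
	int err = 0;

	kernel_neon_begin();
	while (walk->nbytes) {
		do_neon_blocks(walk->dst.virt.addr, walk->src.virt.addr,
			       walk->nbytes);
		err = skcipher_walk_done(walk, 0);	/* simplified: all bytes consumed */
	}
	kernel_neon_end();
	return err;
}

/* Preferred: keep the walk outside the NEON section. */
static int encrypt_walk_outside(struct skcipher_walk *walk)
{
	int err = 0;

	while (walk->nbytes) {
		kernel_neon_begin();
		do_neon_blocks(walk->dst.virt.addr, walk->src.virt.addr,
			       walk->nbytes);
		kernel_neon_end();
		err = skcipher_walk_done(walk, 0);	/* simplified */
	}
	return err;
}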

> Note that AES-GCM using ordinary SIMD instructions runs at 29 cpb, and
> plain AES at ~20 (on A53), so perhaps it would make sense to
> distinguish between algos using crypto instructions and ones using
> plain SIMD.

I was looking at AES-CE and AES-NEON (aes-neon-blk / aes_ce_blk) with
	modprobe tcrypt mode=200 sec=1

and mode=403 / mode=404 for the sha1/sha256 tests.

Sebastian