[v2,01/17] asm: simd context helper API

Message ID 20180824213849.23647-2-Jason@zx2c4.com
State Superseded
Series WireGuard: Secure Network Tunnel

Commit Message

Jason A. Donenfeld Aug. 24, 2018, 9:38 p.m. UTC
Sometimes it's useful to amortize calls to XSAVE/XRSTOR and the related
FPU/SIMD functions over a number of calls, because FPU restoration is
quite expensive. This adds a simple header for carrying out this pattern:

    simd_context_t simd_context = simd_get();
    while ((item = get_item_from_queue()) != NULL) {
        encrypt_item(item, simd_context);
        simd_context = simd_relax(simd_context);
    }
    simd_put(simd_context);

The relaxation step ensures that we don't trample over preemption, and
the get/put API should be a familiar paradigm in the kernel.
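
For illustration, a callee such as the encrypt_item() above can simply
branch on the context it is handed. This is a hypothetical sketch; the
_simd/_generic helpers are made up for the example and not part of this
patch:

    static void encrypt_item(struct item *item, simd_context_t simd_context)
    {
        if (simd_context == HAVE_FULL_SIMD)
            encrypt_item_simd(item);    /* vectorized path */
        else
            encrypt_item_generic(item); /* scalar fallback */
    }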

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>

Cc: Andy Lutomirski <luto@kernel.org>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Samuel Neves <sneves@dei.uc.pt>
Cc: linux-arch@vger.kernel.org
---
 arch/alpha/include/asm/Kbuild      |  5 ++--
 arch/arc/include/asm/Kbuild        |  1 +
 arch/arm/include/asm/simd.h        | 42 ++++++++++++++++++++++++++++++
 arch/arm64/include/asm/simd.h      | 37 +++++++++++++++++++++-----
 arch/c6x/include/asm/Kbuild        |  3 ++-
 arch/h8300/include/asm/Kbuild      |  3 ++-
 arch/hexagon/include/asm/Kbuild    |  1 +
 arch/ia64/include/asm/Kbuild       |  1 +
 arch/m68k/include/asm/Kbuild       |  1 +
 arch/microblaze/include/asm/Kbuild |  1 +
 arch/mips/include/asm/Kbuild       |  1 +
 arch/nds32/include/asm/Kbuild      |  7 ++---
 arch/nios2/include/asm/Kbuild      |  1 +
 arch/openrisc/include/asm/Kbuild   |  7 ++---
 arch/parisc/include/asm/Kbuild     |  1 +
 arch/powerpc/include/asm/Kbuild    |  3 ++-
 arch/riscv/include/asm/Kbuild      |  3 ++-
 arch/s390/include/asm/Kbuild       |  3 ++-
 arch/sh/include/asm/Kbuild         |  1 +
 arch/sparc/include/asm/Kbuild      |  1 +
 arch/um/include/asm/Kbuild         |  3 ++-
 arch/unicore32/include/asm/Kbuild  |  1 +
 arch/x86/include/asm/simd.h        | 30 ++++++++++++++++++++-
 arch/xtensa/include/asm/Kbuild     |  1 +
 include/asm-generic/simd.h         | 15 +++++++++++
 include/linux/simd.h               | 28 ++++++++++++++++++++
 26 files changed, 180 insertions(+), 21 deletions(-)
 create mode 100644 arch/arm/include/asm/simd.h
 create mode 100644 include/linux/simd.h

-- 
2.18.0

Comments

Thomas Gleixner Aug. 26, 2018, 12:10 p.m. UTC | #1
On Fri, 24 Aug 2018, Jason A. Donenfeld wrote:

> Sometimes it's useful to amortize calls to XSAVE/XRSTOR and the related
> FPU/SIMD functions over a number of calls, because FPU restoration is
> quite expensive. This adds a simple header for carrying out this pattern:
>
>     simd_context_t simd_context = simd_get();
>     while ((item = get_item_from_queue()) != NULL) {
>         encrypt_item(item, simd_context);
>         simd_context = simd_relax(simd_context);
>     }
>     simd_put(simd_context);

I'm not too fond of this simply because it requires that relax() step in
all code paths. I'd rather make that completely transparent by just
marking the task as FPU-using and letting the context switch code deal
with it in case it gets preempted. I'll let one of my engineers look into
that next week.

Thanks,

	tglx
Jason A. Donenfeld Aug. 26, 2018, 1:45 p.m. UTC | #2
Hey Thomas,

On Sun, Aug 26, 2018 at 6:10 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> I'm not too fond of this simply because it requires that relax() step in
> all code paths. I'd rather make that completely transparent by just
> marking the task as FPU-using and letting the context switch code deal
> with it in case it gets preempted. I'll let one of my engineers look into
> that next week.

Do you mean to say you intend to make kernel_fpu_end() and
kernel_neon_end() only actually do something upon context switch, but
not when it's actually called? So that multiple calls to
kernel_fpu_begin() and kernel_neon_begin() can be made without
penalty? If so, that'd be great, and I'd certainly prefer this to the
simd_context_t passing. I consider the simd_get/put/relax API a
stopgap measure until something like that is implemented.

Jason
Thomas Gleixner Aug. 26, 2018, 2:06 p.m. UTC | #3
Jason,

On Sun, 26 Aug 2018, Jason A. Donenfeld wrote:
> On Sun, Aug 26, 2018 at 6:10 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> > I'm not too fond of this simply because it requires that relax() step in
> > all code paths. I'd rather make that completely transparent by just
> > marking the task as FPU-using and letting the context switch code deal
> > with it in case it gets preempted. I'll let one of my engineers look into
> > that next week.
>
> Do you mean to say you intend to make kernel_fpu_end() and
> kernel_neon_end() only actually do something upon context switch, but
> not when it's actually called? So that multiple calls to
> kernel_fpu_begin() and kernel_neon_begin() can be made without
> penalty?

On context switch and exit to user. That allows us to keep those code
paths fully preemptible. Still twisting my brain around the details.

> If so, that'd be great, and I'd certainly prefer this to the
> simd_context_t passing. I consider the simd_get/put/relax API a
> stopgap measure until something like that is implemented.

I really want to avoid this stopgap^Wducttape thing.

Thanks,

	tglx
Jason A. Donenfeld Aug. 26, 2018, 2:18 p.m. UTC | #4
On Sun, Aug 26, 2018 at 8:06 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> > Do you mean to say you intend to make kernel_fpu_end() and
> > kernel_neon_end() only actually do something upon context switch, but
> > not when it's actually called? So that multiple calls to
> > kernel_fpu_begin() and kernel_neon_begin() can be made without
> > penalty?
>
> On context switch and exit to user. That allows us to keep those code
> paths fully preemptible. Still twisting my brain around the details.

Just to make sure we're on the same page, the goal is that this code:

kernel_fpu_begin();
kernel_fpu_end();
kernel_fpu_begin();
kernel_fpu_end();
kernel_fpu_begin();
kernel_fpu_end();
kernel_fpu_begin();
kernel_fpu_end();
kernel_fpu_begin();
kernel_fpu_end();
kernel_fpu_begin();
kernel_fpu_end();
...

has the same performance as this code:

kernel_fpu_begin();
kernel_fpu_end();

(Unless of course the process is preempted or the like.)

At present, the performance of the two is wildly different, since
kernel_fpu_end() does something immediately.

What about something like this (see the sketch after this list):
- Add a tristate flag connected to task_struct (or in the global fpu
struct in the case that this happens in irq and there isn't a valid
current).
- On kernel_fpu_begin(), if the flag is 0, do the usual expensive
XSAVE stuff, and set the flag to 1.
- On kernel_fpu_begin(), if the flag is non-0, just set the flag to 1
and return.
- On kernel_fpu_end(), if the flag is non-0, set the flag to 2.
(Otherwise WARN() or BUG() or something.)
- On context switch / preemption / etc away from the task, if the flag
is non-0, XRSTOR and such.
- On context switch / preemption / etc back to the task, if the flag
is 1, XSAVE and such. If the flag is 2, set it to 0.
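
A rough C sketch of that tristate scheme, just to pin the idea down. The
flag values, struct, and helper/hook names below are all hypothetical,
not the real arch/x86 FPU code:

enum { KFPU_NONE = 0, KFPU_ACTIVE = 1, KFPU_ENDED = 2 };

struct fpu_lazy {
	int kernel_flag;		/* the tristate described above */
};

void xsave_user_state(struct fpu_lazy *f);	/* expensive XSAVE */
void xrstor_user_state(struct fpu_lazy *f);	/* expensive XRSTOR */

static void lazy_kernel_fpu_begin(struct fpu_lazy *f)
{
	if (f->kernel_flag == KFPU_NONE)
		xsave_user_state(f);	/* only pay the cost once */
	f->kernel_flag = KFPU_ACTIVE;	/* non-0: just mark and return */
}

static void lazy_kernel_fpu_end(struct fpu_lazy *f)
{
	WARN_ON(f->kernel_flag == KFPU_NONE);	/* unbalanced end */
	f->kernel_flag = KFPU_ENDED;	/* defer the actual restore */
}

/* On context switch / preemption away from the task: */
static void lazy_fpu_switch_out(struct fpu_lazy *f)
{
	if (f->kernel_flag != KFPU_NONE)
		xrstor_user_state(f);	/* the deferred XRSTOR */
}

/* On context switch / preemption back to the task: */
static void lazy_fpu_switch_in(struct fpu_lazy *f)
{
	if (f->kernel_flag == KFPU_ACTIVE)
		xsave_user_state(f);	/* begin/end section still open */
	else if (f->kernel_flag == KFPU_ENDED)
		f->kernel_flag = KFPU_NONE;
}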

Jason
Andy Lutomirski Aug. 26, 2018, 2:18 p.m. UTC | #5
> On Aug 26, 2018, at 7:06 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Jason,
>
>> On Sun, 26 Aug 2018, Jason A. Donenfeld wrote:
>>> On Sun, Aug 26, 2018 at 6:10 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>>> I'm not too fond of this simply because it requires that relax() step in
>>> all code paths. I'd rather make that completely transparent by just
>>> marking the task as FPU-using and letting the context switch code deal
>>> with it in case it gets preempted. I'll let one of my engineers look into
>>> that next week.
>>
>> Do you mean to say you intend to make kernel_fpu_end() and
>> kernel_neon_end() only actually do something upon context switch, but
>> not when it's actually called? So that multiple calls to
>> kernel_fpu_begin() and kernel_neon_begin() can be made without
>> penalty?
>
> On context switch and exit to user. That allows us to keep those code
> paths fully preemptible. Still twisting my brain around the details.

I think you’ll have to treat exit to user and context switch as different things. For exit to user, we want to restore the *user* state, but, for context switch, we’ll need to restore *kernel* state.

Do user first as its own patch set. It’ll be less painful that way.

And someone needs to rework PKRU for this to make sense. See previous threads.
Rik van Riel Aug. 26, 2018, 4:53 p.m. UTC | #6
On Sun, 2018-08-26 at 07:18 -0700, Andy Lutomirski wrote:
> > On Aug 26, 2018, at 7:06 AM, Thomas Gleixner <tglx@linutronix.de>
> > wrote:
> >
> > Jason,
> >
> > > On Sun, 26 Aug 2018, Jason A. Donenfeld wrote:
> > > > On Sun, Aug 26, 2018 at 6:10 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> > > > I'm not too fond of this simply because it requires that relax()
> > > > step in all code paths. I'd rather make that completely
> > > > transparent by just marking the task as FPU-using and letting the
> > > > context switch code deal with it in case it gets preempted. I'll
> > > > let one of my engineers look into that next week.
> > >
> > > Do you mean to say you intend to make kernel_fpu_end() and
> > > kernel_neon_end() only actually do something upon context switch,
> > > but not when it's actually called? So that multiple calls to
> > > kernel_fpu_begin() and kernel_neon_begin() can be made without
> > > penalty?
> >
> > On context switch and exit to user. That allows us to keep those code
> > paths fully preemptible. Still twisting my brain around the details.
>
> I think you’ll have to treat exit to user and context switch as
> different things. For exit to user, we want to restore the *user*
> state, but, for context switch, we’ll need to restore *kernel* state.

For non-preemptible kernel_fpu_begin/end (which seems
like a good starting point, since it gets the code
halfway to where Thomas would like it to go), the
rules would be a little simpler (sketched in code below):

- For exit to userspace, restore the user FPU state.
- At kernel_fpu_begin(), save the user FPU state (if still loaded).
- At context switch time, save the user FPU state (if still loaded).
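
Concretely, a rough sketch of those rules; the user_fpu_loaded flag and
the save/restore helpers are illustrative names, not existing kernel API:

struct fpu_track {
	bool user_fpu_loaded;	/* user registers still live in the CPU */
};

void xsave_user_state(struct fpu_track *f);
void xrstor_user_state(struct fpu_track *f);

static void nonpreempt_kernel_fpu_begin(struct fpu_track *f)
{
	preempt_disable();
	if (f->user_fpu_loaded) {
		xsave_user_state(f);	/* save user state, once */
		f->user_fpu_loaded = false;
	}
}

/* Context switch applies the same rule: */
static void fpu_switch_out(struct fpu_track *f)
{
	if (f->user_fpu_loaded) {
		xsave_user_state(f);
		f->user_fpu_loaded = false;
	}
}

/* Exit to userspace is the only place that restores: */
static void fpu_exit_to_user(struct fpu_track *f)
{
	if (!f->user_fpu_loaded) {
		xrstor_user_state(f);
		f->user_fpu_loaded = true;
	}
}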

> Do user first as its own patch set. It’ll be less painful that way.
>
> And someone needs to rework PKRU for this to make sense. See previous
> threads.

I sent Thomas the patches I worked on in the past.

That series is likely incomplete, but should be a
reasonable starting point.

-- 
All Rights Reversed.
Palmer Dabbelt Aug. 27, 2018, 7:50 p.m. UTC | #7
On Fri, 24 Aug 2018 14:38:33 PDT (-0700), Jason@zx2c4.com wrote:
> Sometimes it's useful to amortize calls to XSAVE/XRSTOR and the related
> FPU/SIMD functions over a number of calls, because FPU restoration is
> quite expensive. This adds a simple header for carrying out this pattern:
>
>     simd_context_t simd_context = simd_get();
>     while ((item = get_item_from_queue()) != NULL) {
>         encrypt_item(item, simd_context);
>         simd_context = simd_relax(simd_context);
>     }
>     simd_put(simd_context);
>
> The relaxation step ensures that we don't trample over preemption, and
> the get/put API should be a familiar paradigm in the kernel.
>
> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Greg KH <gregkh@linuxfoundation.org>
> Cc: Samuel Neves <sneves@dei.uc.pt>
> Cc: linux-arch@vger.kernel.org
> ---
>  arch/alpha/include/asm/Kbuild      |  5 ++--
>  arch/arc/include/asm/Kbuild        |  1 +
>  arch/arm/include/asm/simd.h        | 42 ++++++++++++++++++++++++++++++
>  arch/arm64/include/asm/simd.h      | 37 +++++++++++++++++++++-----
>  arch/c6x/include/asm/Kbuild        |  3 ++-
>  arch/h8300/include/asm/Kbuild      |  3 ++-
>  arch/hexagon/include/asm/Kbuild    |  1 +
>  arch/ia64/include/asm/Kbuild       |  1 +
>  arch/m68k/include/asm/Kbuild       |  1 +
>  arch/microblaze/include/asm/Kbuild |  1 +
>  arch/mips/include/asm/Kbuild       |  1 +
>  arch/nds32/include/asm/Kbuild      |  7 ++---
>  arch/nios2/include/asm/Kbuild      |  1 +
>  arch/openrisc/include/asm/Kbuild   |  7 ++---
>  arch/parisc/include/asm/Kbuild     |  1 +
>  arch/powerpc/include/asm/Kbuild    |  3 ++-
>  arch/riscv/include/asm/Kbuild      |  3 ++-
>  arch/s390/include/asm/Kbuild       |  3 ++-
>  arch/sh/include/asm/Kbuild         |  1 +
>  arch/sparc/include/asm/Kbuild      |  1 +
>  arch/um/include/asm/Kbuild         |  3 ++-
>  arch/unicore32/include/asm/Kbuild  |  1 +
>  arch/x86/include/asm/simd.h        | 30 ++++++++++++++++++++-
>  arch/xtensa/include/asm/Kbuild     |  1 +
>  include/asm-generic/simd.h         | 15 +++++++++++
>  include/linux/simd.h               | 28 ++++++++++++++++++++
>  26 files changed, 180 insertions(+), 21 deletions(-)
>  create mode 100644 arch/arm/include/asm/simd.h
>  create mode 100644 include/linux/simd.h

...

> diff --git a/arch/riscv/include/asm/Kbuild b/arch/riscv/include/asm/Kbuild
> index 576ffdca06ba..8d3e7aef3234 100644
> --- a/arch/riscv/include/asm/Kbuild
> +++ b/arch/riscv/include/asm/Kbuild
> @@ -4,9 +4,9 @@ generic-y += checksum.h
>  generic-y += cputime.h
>  generic-y += device.h
>  generic-y += div64.h
> -generic-y += dma.h
>  generic-y += dma-contiguous.h
>  generic-y += dma-mapping.h
> +generic-y += dma.h
>  generic-y += emergency-restart.h
>  generic-y += errno.h
>  generic-y += exec.h

If this is the canonical ordering and doing so makes your life easier then I'm 
OK taking this as a separate patch into the RISC-V tree, but if not then feel 
free to roll something like this up into your next patch set.

> @@ -45,6 +45,7 @@ generic-y += setup.h
>  generic-y += shmbuf.h
>  generic-y += shmparam.h
>  generic-y += signal.h
> +generic-y += simd.h
>  generic-y += socket.h
>  generic-y += sockios.h
>  generic-y += stat.h

Either way, this looks fine as far as the RISC-V stuff goes, as it's pretty
much a NOP.  As long as it stays a NOP then feel free to add a

Reviewed-by: Palmer Dabbelt <palmer@sifive.com>

as far as the RISC-V parts are concerned.  It looks like there are a lot of
other issues, though, so it's not much of a review :)
Jason A. Donenfeld Sept. 1, 2018, 8:19 p.m. UTC | #8
Hey Thomas,

I'd like to move ahead with my patchset and make some forward progress
in LKML submission. If you've got something brewing regarding the FPU
context on x86 and ARM, I'm happy to wait a bit longer so as to build
on that. But if that is instead a far-off theoretical eventual thing,
perhaps it's better for me to move ahead as planned, and we can switch
to the superior FPU semantics whenever you get around to it? Either
way, please let me know what you have in mind so our plans can stay
somewhat sync'd.

Talk soon,
Jason
Andy Lutomirski Sept. 1, 2018, 8:32 p.m. UTC | #9
On Sat, Sep 1, 2018 at 1:19 PM, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> Hey Thomas,
>
> I'd like to move ahead with my patchset and make some forward progress
> in LKML submission. If you've got something brewing regarding the FPU
> context on x86 and ARM, I'm happy to wait a bit longer so as to build
> on that. But if that is instead a far-off theoretical eventual thing,
> perhaps it's better for me to move ahead as planned, and we can switch
> to the superior FPU semantics whenever you get around to it? Either
> way, please let me know what you have in mind so our plans can stay
> somewhat sync'd.

I tend to think the right approach is to merge Jason's code and then
make it better later.  Even with a totally perfect lazy FPU restore
implementation on x86, we'll probably still need some way of dealing
with SIMD contexts.  I think we're highly unlikely to ever allow
SIMD usage in all NMI contexts, for example, and there will always be
cases where we specifically don't want to use all available SIMD
capabilities even if we can.  For example, generating random numbers
does crypto, but we probably don't want to do *SIMD* crypto, since
that will force a save and restore and will probably fire up the
AVX512 unit, and that's not worth it unless we're already using it for
some other reason.

Also, as Rik has discovered, lazy FPU restore is conceptually
straightforward but isn't entirely trivial :)

--Andy
Jason A. Donenfeld Sept. 1, 2018, 8:34 p.m. UTC | #10
On Sat, Sep 1, 2018 at 2:32 PM Andy Lutomirski <luto@kernel.org> wrote:
> I tend to think the right approach is to merge Jason's code and then
> make it better later.  Even with a totally perfect lazy FPU restore
> implementation on x86, we'll probably still need some way of dealing
> with SIMD contexts.  I think we're highly unlikely to ever allow
> SIMD usage in all NMI contexts, for example, and there will always be
> cases where we specifically don't want to use all available SIMD
> capabilities even if we can.  For example, generating random numbers
> does crypto, but we probably don't want to do *SIMD* crypto, since
> that will force a save and restore and will probably fire up the
> AVX512 unit, and that's not worth it unless we're already using it for
> some other reason.
>
> Also, as Rik has discovered, lazy FPU restore is conceptually
> straightforward but isn't entirely trivial :)

Sounds good. I'll move ahead on this basis.
Thomas Gleixner Sept. 6, 2018, 1:42 p.m. UTC | #11
On Sat, 1 Sep 2018, Jason A. Donenfeld wrote:
> On Sat, Sep 1, 2018 at 2:32 PM Andy Lutomirski <luto@kernel.org> wrote:
> > I tend to think the right approach is to merge Jason's code and then
> > make it better later.  Even with a totally perfect lazy FPU restore
> > implementation on x86, we'll probably still need some way of dealing
> > with SIMD contexts.  I think we're highly unlikely to ever allow
> > SIMD usage in all NMI contexts, for example, and there will always be
> > cases where we specifically don't want to use all available SIMD
> > capabilities even if we can.  For example, generating random numbers
> > does crypto, but we probably don't want to do *SIMD* crypto, since
> > that will force a save and restore and will probably fire up the
> > AVX512 unit, and that's not worth it unless we're already using it for
> > some other reason.
> >
> > Also, as Rik has discovered, lazy FPU restore is conceptually
> > straightforward but isn't entirely trivial :)
>
> Sounds good. I'll move ahead on this basis.

Fine with me.
Jason A. Donenfeld Sept. 6, 2018, 3:52 p.m. UTC | #12
Hi Thomas,

On Thu, Sep 6, 2018 at 9:29 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Sat, 1 Sep 2018, Jason A. Donenfeld wrote:
> > On Sat, Sep 1, 2018 at 2:32 PM Andy Lutomirski <luto@kernel.org> wrote:
> > > I tend to think the right approach is to merge Jason's code and then
> > > make it better later.  Even with a totally perfect lazy FPU restore
> > > implementation on x86, we'll probably still need some way of dealing
> > > with SIMD contexts.  I think we're highly unlikely to ever allow
> > > SIMD usage in all NMI contexts, for example, and there will always be
> > > cases where we specifically don't want to use all available SIMD
> > > capabilities even if we can.  For example, generating random numbers
> > > does crypto, but we probably don't want to do *SIMD* crypto, since
> > > that will force a save and restore and will probably fire up the
> > > AVX512 unit, and that's not worth it unless we're already using it for
> > > some other reason.
> > >
> > > Also, as Rik has discovered, lazy FPU restore is conceptually
> > > straightforward but isn't entirely trivial :)
> >
> > Sounds good. I'll move ahead on this basis.
>
> Fine with me.

Do you want to pull this single patch [01/17] into your tree now, and
then when I submit v3 of WireGuard and such, I can just drop this
patch from it, and then the rest will enter like usual networking
stuff through Dave's tree?

Jason

Patch

diff --git a/arch/alpha/include/asm/Kbuild b/arch/alpha/include/asm/Kbuild
index 0580cb8c84b2..07b2c1025d34 100644
--- a/arch/alpha/include/asm/Kbuild
+++ b/arch/alpha/include/asm/Kbuild
@@ -2,14 +2,15 @@ 
 
 
 generic-y += compat.h
+generic-y += current.h
 generic-y += exec.h
 generic-y += export.h
 generic-y += fb.h
 generic-y += irq_work.h
+generic-y += kprobes.h
 generic-y += mcs_spinlock.h
 generic-y += mm-arch-hooks.h
 generic-y += preempt.h
 generic-y += sections.h
+generic-y += simd.h
 generic-y += trace_clock.h
-generic-y += current.h
-generic-y += kprobes.h
diff --git a/arch/arc/include/asm/Kbuild b/arch/arc/include/asm/Kbuild
index feed50ce89fa..a7f4255f1649 100644
--- a/arch/arc/include/asm/Kbuild
+++ b/arch/arc/include/asm/Kbuild
@@ -22,6 +22,7 @@  generic-y += parport.h
 generic-y += pci.h
 generic-y += percpu.h
 generic-y += preempt.h
+generic-y += simd.h
 generic-y += topology.h
 generic-y += trace_clock.h
 generic-y += user.h
diff --git a/arch/arm/include/asm/simd.h b/arch/arm/include/asm/simd.h
new file mode 100644
index 000000000000..bf468993bbef
--- /dev/null
+++ b/arch/arm/include/asm/simd.h
@@ -0,0 +1,42 @@ 
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * Copyright (C) 2015-2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ */
+
+#include <linux/simd.h>
+#ifndef _ASM_SIMD_H
+#define _ASM_SIMD_H
+
+static __must_check inline bool may_use_simd(void)
+{
+	return !in_interrupt();
+}
+
+#ifdef CONFIG_KERNEL_MODE_NEON
+#include <asm/neon.h>
+
+static inline simd_context_t simd_get(void)
+{
+	bool have_simd = may_use_simd();
+	if (have_simd)
+		kernel_neon_begin();
+	return have_simd ? HAVE_FULL_SIMD : HAVE_NO_SIMD;
+}
+
+static inline void simd_put(simd_context_t prior_context)
+{
+	if (prior_context != HAVE_NO_SIMD)
+		kernel_neon_end();
+}
+#else
+static inline simd_context_t simd_get(void)
+{
+	return HAVE_NO_SIMD;
+}
+
+static inline void simd_put(simd_context_t prior_context)
+{
+}
+#endif
+
+#endif /* _ASM_SIMD_H */
diff --git a/arch/arm64/include/asm/simd.h b/arch/arm64/include/asm/simd.h
index 6495cc51246f..058c336de38d 100644
--- a/arch/arm64/include/asm/simd.h
+++ b/arch/arm64/include/asm/simd.h
@@ -1,11 +1,10 @@ 
-/*
- * Copyright (C) 2017 Linaro Ltd. <ard.biesheuvel@linaro.org>
+/* SPDX-License-Identifier: GPL-2.0
  *
- * This program is free software; you can redistribute it and/or modify it
- * under the terms of the GNU General Public License version 2 as published
- * by the Free Software Foundation.
+ * Copyright (C) 2017 Linaro Ltd. <ard.biesheuvel@linaro.org>
+ * Copyright (C) 2015-2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
  */
 
+#include <linux/simd.h>
 #ifndef __ASM_SIMD_H
 #define __ASM_SIMD_H
 
@@ -16,6 +15,8 @@ 
 #include <linux/types.h>
 
 #ifdef CONFIG_KERNEL_MODE_NEON
+#include <asm/neon.h>
+#include <asm/simd.h>
 
 DECLARE_PER_CPU(bool, kernel_neon_busy);
 
@@ -40,12 +41,36 @@  static __must_check inline bool may_use_simd(void)
 		!this_cpu_read(kernel_neon_busy);
 }
 
+static inline simd_context_t simd_get(void)
+{
+	bool have_simd = may_use_simd();
+	if (have_simd)
+		kernel_neon_begin();
+	return have_simd ? HAVE_FULL_SIMD : HAVE_NO_SIMD;
+}
+
+static inline void simd_put(simd_context_t prior_context)
+{
+	if (prior_context != HAVE_NO_SIMD)
+		kernel_neon_end();
+}
+
 #else /* ! CONFIG_KERNEL_MODE_NEON */
 
-static __must_check inline bool may_use_simd(void) {
+static __must_check inline bool may_use_simd(void)
+{
 	return false;
 }
 
+static inline simd_context_t simd_get(void)
+{
+	return HAVE_NO_SIMD;
+}
+
+static inline void simd_put(simd_context_t prior_context)
+{
+}
+
 #endif /* ! CONFIG_KERNEL_MODE_NEON */
 
 #endif
diff --git a/arch/c6x/include/asm/Kbuild b/arch/c6x/include/asm/Kbuild
index 33a2c94fed0d..22f3d8333c74 100644
--- a/arch/c6x/include/asm/Kbuild
+++ b/arch/c6x/include/asm/Kbuild
@@ -5,8 +5,8 @@  generic-y += compat.h
 generic-y += current.h
 generic-y += device.h
 generic-y += div64.h
-generic-y += dma.h
 generic-y += dma-mapping.h
+generic-y += dma.h
 generic-y += emergency-restart.h
 generic-y += exec.h
 generic-y += extable.h
@@ -30,6 +30,7 @@  generic-y += pgalloc.h
 generic-y += preempt.h
 generic-y += segment.h
 generic-y += serial.h
+generic-y += simd.h
 generic-y += tlbflush.h
 generic-y += topology.h
 generic-y += trace_clock.h
diff --git a/arch/h8300/include/asm/Kbuild b/arch/h8300/include/asm/Kbuild
index a5d0b2991f47..f5c2f12d593e 100644
--- a/arch/h8300/include/asm/Kbuild
+++ b/arch/h8300/include/asm/Kbuild
@@ -8,8 +8,8 @@  generic-y += current.h
 generic-y += delay.h
 generic-y += device.h
 generic-y += div64.h
-generic-y += dma.h
 generic-y += dma-mapping.h
+generic-y += dma.h
 generic-y += emergency-restart.h
 generic-y += exec.h
 generic-y += extable.h
@@ -39,6 +39,7 @@  generic-y += preempt.h
 generic-y += scatterlist.h
 generic-y += sections.h
 generic-y += serial.h
+generic-y += simd.h
 generic-y += sizes.h
 generic-y += spinlock.h
 generic-y += timex.h
diff --git a/arch/hexagon/include/asm/Kbuild b/arch/hexagon/include/asm/Kbuild
index dd2fd9c0d292..217d4695fd8a 100644
--- a/arch/hexagon/include/asm/Kbuild
+++ b/arch/hexagon/include/asm/Kbuild
@@ -29,6 +29,7 @@  generic-y += rwsem.h
 generic-y += sections.h
 generic-y += segment.h
 generic-y += serial.h
+generic-y += simd.h
 generic-y += sizes.h
 generic-y += topology.h
 generic-y += trace_clock.h
diff --git a/arch/ia64/include/asm/Kbuild b/arch/ia64/include/asm/Kbuild
index 557bbc8ba9f5..41c5ebdf79e5 100644
--- a/arch/ia64/include/asm/Kbuild
+++ b/arch/ia64/include/asm/Kbuild
@@ -4,6 +4,7 @@  generic-y += irq_work.h
 generic-y += mcs_spinlock.h
 generic-y += mm-arch-hooks.h
 generic-y += preempt.h
+generic-y += simd.h
 generic-y += trace_clock.h
 generic-y += vtime.h
 generic-y += word-at-a-time.h
diff --git a/arch/m68k/include/asm/Kbuild b/arch/m68k/include/asm/Kbuild
index a4b8d3331a9e..73898dd1a4d0 100644
--- a/arch/m68k/include/asm/Kbuild
+++ b/arch/m68k/include/asm/Kbuild
@@ -19,6 +19,7 @@  generic-y += mm-arch-hooks.h
 generic-y += percpu.h
 generic-y += preempt.h
 generic-y += sections.h
+generic-y += simd.h
 generic-y += spinlock.h
 generic-y += topology.h
 generic-y += trace_clock.h
diff --git a/arch/microblaze/include/asm/Kbuild b/arch/microblaze/include/asm/Kbuild
index fe6a6c6e5003..9002fb24888c 100644
--- a/arch/microblaze/include/asm/Kbuild
+++ b/arch/microblaze/include/asm/Kbuild
@@ -24,6 +24,7 @@  generic-y += parport.h
 generic-y += percpu.h
 generic-y += preempt.h
 generic-y += serial.h
+generic-y += simd.h
 generic-y += syscalls.h
 generic-y += topology.h
 generic-y += trace_clock.h
diff --git a/arch/mips/include/asm/Kbuild b/arch/mips/include/asm/Kbuild
index 58351e48421e..e8868e0fb2c3 100644
--- a/arch/mips/include/asm/Kbuild
+++ b/arch/mips/include/asm/Kbuild
@@ -16,6 +16,7 @@  generic-y += qrwlock.h
 generic-y += qspinlock.h
 generic-y += sections.h
 generic-y += segment.h
+generic-y += simd.h
 generic-y += trace_clock.h
 generic-y += unaligned.h
 generic-y += user.h
diff --git a/arch/nds32/include/asm/Kbuild b/arch/nds32/include/asm/Kbuild
index dbc4e5422550..603c1d020620 100644
--- a/arch/nds32/include/asm/Kbuild
+++ b/arch/nds32/include/asm/Kbuild
@@ -7,14 +7,14 @@  generic-y += bug.h
 generic-y += bugs.h
 generic-y += checksum.h
 generic-y += clkdev.h
-generic-y += cmpxchg.h
 generic-y += cmpxchg-local.h
+generic-y += cmpxchg.h
 generic-y += compat.h
 generic-y += cputime.h
 generic-y += device.h
 generic-y += div64.h
-generic-y += dma.h
 generic-y += dma-mapping.h
+generic-y += dma.h
 generic-y += emergency-restart.h
 generic-y += errno.h
 generic-y += exec.h
@@ -46,14 +46,15 @@  generic-y += sections.h
 generic-y += segment.h
 generic-y += serial.h
 generic-y += shmbuf.h
+generic-y += simd.h
 generic-y += sizes.h
 generic-y += stat.h
 generic-y += switch_to.h
 generic-y += timex.h
 generic-y += topology.h
 generic-y += trace_clock.h
-generic-y += xor.h
 generic-y += unaligned.h
 generic-y += user.h
 generic-y += vga.h
 generic-y += word-at-a-time.h
+generic-y += xor.h
diff --git a/arch/nios2/include/asm/Kbuild b/arch/nios2/include/asm/Kbuild
index 8fde4fa2c34f..571a9d9ad107 100644
--- a/arch/nios2/include/asm/Kbuild
+++ b/arch/nios2/include/asm/Kbuild
@@ -33,6 +33,7 @@  generic-y += preempt.h
 generic-y += sections.h
 generic-y += segment.h
 generic-y += serial.h
+generic-y += simd.h
 generic-y += spinlock.h
 generic-y += topology.h
 generic-y += trace_clock.h
diff --git a/arch/openrisc/include/asm/Kbuild b/arch/openrisc/include/asm/Kbuild
index 65964d390b10..81a39e274f6f 100644
--- a/arch/openrisc/include/asm/Kbuild
+++ b/arch/openrisc/include/asm/Kbuild
@@ -27,12 +27,13 @@  generic-y += module.h
 generic-y += pci.h
 generic-y += percpu.h
 generic-y += preempt.h
-generic-y += qspinlock_types.h
-generic-y += qspinlock.h
-generic-y += qrwlock_types.h
 generic-y += qrwlock.h
+generic-y += qrwlock_types.h
+generic-y += qspinlock.h
+generic-y += qspinlock_types.h
 generic-y += sections.h
 generic-y += segment.h
+generic-y += simd.h
 generic-y += string.h
 generic-y += switch_to.h
 generic-y += topology.h
diff --git a/arch/parisc/include/asm/Kbuild b/arch/parisc/include/asm/Kbuild
index 2013d639e735..97970b4d05ab 100644
--- a/arch/parisc/include/asm/Kbuild
+++ b/arch/parisc/include/asm/Kbuild
@@ -17,6 +17,7 @@  generic-y += percpu.h
 generic-y += preempt.h
 generic-y += seccomp.h
 generic-y += segment.h
+generic-y += simd.h
 generic-y += topology.h
 generic-y += trace_clock.h
 generic-y += user.h
diff --git a/arch/powerpc/include/asm/Kbuild b/arch/powerpc/include/asm/Kbuild
index 3196d227e351..64290f48e733 100644
--- a/arch/powerpc/include/asm/Kbuild
+++ b/arch/powerpc/include/asm/Kbuild
@@ -4,7 +4,8 @@  generic-y += irq_regs.h
 generic-y += irq_work.h
 generic-y += local64.h
 generic-y += mcs_spinlock.h
+generic-y += msi.h
 generic-y += preempt.h
 generic-y += rwsem.h
+generic-y += simd.h
 generic-y += vtime.h
-generic-y += msi.h
diff --git a/arch/riscv/include/asm/Kbuild b/arch/riscv/include/asm/Kbuild
index 576ffdca06ba..8d3e7aef3234 100644
--- a/arch/riscv/include/asm/Kbuild
+++ b/arch/riscv/include/asm/Kbuild
@@ -4,9 +4,9 @@  generic-y += checksum.h
 generic-y += cputime.h
 generic-y += device.h
 generic-y += div64.h
-generic-y += dma.h
 generic-y += dma-contiguous.h
 generic-y += dma-mapping.h
+generic-y += dma.h
 generic-y += emergency-restart.h
 generic-y += errno.h
 generic-y += exec.h
@@ -45,6 +45,7 @@  generic-y += setup.h
 generic-y += shmbuf.h
 generic-y += shmparam.h
 generic-y += signal.h
+generic-y += simd.h
 generic-y += socket.h
 generic-y += sockios.h
 generic-y += stat.h
diff --git a/arch/s390/include/asm/Kbuild b/arch/s390/include/asm/Kbuild
index e3239772887a..7a26dc6ce815 100644
--- a/arch/s390/include/asm/Kbuild
+++ b/arch/s390/include/asm/Kbuild
@@ -7,9 +7,9 @@  generated-y += unistd_nr.h
 generic-y += asm-offsets.h
 generic-y += cacheflush.h
 generic-y += device.h
+generic-y += div64.h
 generic-y += dma-contiguous.h
 generic-y += dma-mapping.h
-generic-y += div64.h
 generic-y += emergency-restart.h
 generic-y += export.h
 generic-y += fb.h
@@ -22,6 +22,7 @@  generic-y += mcs_spinlock.h
 generic-y += mm-arch-hooks.h
 generic-y += preempt.h
 generic-y += rwsem.h
+generic-y += simd.h
 generic-y += trace_clock.h
 generic-y += unaligned.h
 generic-y += word-at-a-time.h
diff --git a/arch/sh/include/asm/Kbuild b/arch/sh/include/asm/Kbuild
index 6a5609a55965..8e64ff35a933 100644
--- a/arch/sh/include/asm/Kbuild
+++ b/arch/sh/include/asm/Kbuild
@@ -16,6 +16,7 @@  generic-y += percpu.h
 generic-y += preempt.h
 generic-y += rwsem.h
 generic-y += serial.h
+generic-y += simd.h
 generic-y += sizes.h
 generic-y += trace_clock.h
 generic-y += xor.h
diff --git a/arch/sparc/include/asm/Kbuild b/arch/sparc/include/asm/Kbuild
index 410b263ef5c8..72b9e08fb350 100644
--- a/arch/sparc/include/asm/Kbuild
+++ b/arch/sparc/include/asm/Kbuild
@@ -17,5 +17,6 @@  generic-y += msi.h
 generic-y += preempt.h
 generic-y += rwsem.h
 generic-y += serial.h
+generic-y += simd.h
 generic-y += trace_clock.h
 generic-y += word-at-a-time.h
diff --git a/arch/um/include/asm/Kbuild b/arch/um/include/asm/Kbuild
index b10dde6cb793..d37288b08dd2 100644
--- a/arch/um/include/asm/Kbuild
+++ b/arch/um/include/asm/Kbuild
@@ -16,15 +16,16 @@  generic-y += io.h
 generic-y += irq_regs.h
 generic-y += irq_work.h
 generic-y += kdebug.h
+generic-y += kprobes.h
 generic-y += mcs_spinlock.h
 generic-y += mm-arch-hooks.h
 generic-y += param.h
 generic-y += pci.h
 generic-y += percpu.h
 generic-y += preempt.h
+generic-y += simd.h
 generic-y += switch_to.h
 generic-y += topology.h
 generic-y += trace_clock.h
 generic-y += word-at-a-time.h
 generic-y += xor.h
-generic-y += kprobes.h
diff --git a/arch/unicore32/include/asm/Kbuild b/arch/unicore32/include/asm/Kbuild
index bfc7abe77905..98a908720bbd 100644
--- a/arch/unicore32/include/asm/Kbuild
+++ b/arch/unicore32/include/asm/Kbuild
@@ -27,6 +27,7 @@  generic-y += preempt.h
 generic-y += sections.h
 generic-y += segment.h
 generic-y += serial.h
+generic-y += simd.h
 generic-y += sizes.h
 generic-y += syscalls.h
 generic-y += topology.h
diff --git a/arch/x86/include/asm/simd.h b/arch/x86/include/asm/simd.h
index a341c878e977..79411178988a 100644
--- a/arch/x86/include/asm/simd.h
+++ b/arch/x86/include/asm/simd.h
@@ -1,4 +1,11 @@ 
-/* SPDX-License-Identifier: GPL-2.0 */
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * Copyright (C) 2015-2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ */
+
+#include <linux/simd.h>
+#ifndef _ASM_SIMD_H
+#define _ASM_SIMD_H
 
 #include <asm/fpu/api.h>
 
@@ -10,3 +17,24 @@  static __must_check inline bool may_use_simd(void)
 {
 	return irq_fpu_usable();
 }
+
+static inline simd_context_t simd_get(void)
+{
+	bool have_simd = false;
+#if !defined(CONFIG_UML)
+	have_simd = may_use_simd();
+	if (have_simd)
+		kernel_fpu_begin();
+#endif
+	return have_simd ? HAVE_FULL_SIMD : HAVE_NO_SIMD;
+}
+
+static inline void simd_put(simd_context_t prior_context)
+{
+#if !defined(CONFIG_UML)
+	if (prior_context != HAVE_NO_SIMD)
+		kernel_fpu_end();
+#endif
+}
+
+#endif /* _ASM_SIMD_H */
diff --git a/arch/xtensa/include/asm/Kbuild b/arch/xtensa/include/asm/Kbuild
index e5e1e61c538c..e3b194a187f9 100644
--- a/arch/xtensa/include/asm/Kbuild
+++ b/arch/xtensa/include/asm/Kbuild
@@ -23,6 +23,7 @@  generic-y += percpu.h
 generic-y += preempt.h
 generic-y += rwsem.h
 generic-y += sections.h
+generic-y += simd.h
 generic-y += topology.h
 generic-y += trace_clock.h
 generic-y += word-at-a-time.h
diff --git a/include/asm-generic/simd.h b/include/asm-generic/simd.h
index d0343d58a74a..fad899a5a92d 100644
--- a/include/asm-generic/simd.h
+++ b/include/asm-generic/simd.h
@@ -1,5 +1,9 @@ 
 /* SPDX-License-Identifier: GPL-2.0 */
 
+#include <linux/simd.h>
+#ifndef _ASM_SIMD_H
+#define _ASM_SIMD_H
+
 #include <linux/hardirq.h>
 
 /*
@@ -13,3 +17,14 @@  static __must_check inline bool may_use_simd(void)
 {
 	return !in_interrupt();
 }
+
+static inline simd_context_t simd_get(void)
+{
+	return HAVE_NO_SIMD;
+}
+
+static inline void simd_put(simd_context_t prior_context)
+{
+}
+
+#endif /* _ASM_SIMD_H */
diff --git a/include/linux/simd.h b/include/linux/simd.h
new file mode 100644
index 000000000000..f62d047188bf
--- /dev/null
+++ b/include/linux/simd.h
@@ -0,0 +1,28 @@ 
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * Copyright (C) 2015-2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ */
+
+#ifndef _SIMD_H
+#define _SIMD_H
+
+typedef enum {
+	HAVE_NO_SIMD,
+	HAVE_FULL_SIMD
+} simd_context_t;
+
+#include <linux/sched.h>
+#include <asm/simd.h>
+
+static inline simd_context_t simd_relax(simd_context_t prior_context)
+{
+#ifdef CONFIG_PREEMPT
+	if (prior_context != HAVE_NO_SIMD && need_resched()) {
+		simd_put(prior_context);
+		return simd_get();
+	}
+#endif
+	return prior_context;
+}
+
+#endif /* _SIMD_H */