[RFC,V3,0/9] PKS: Add Protection Keys Supervisor (PKS) support RFC v3

Message ID 20201009194258.3207172-1-ira.weiny@intel.com

Message

Ira Weiny Oct. 9, 2020, 7:42 p.m. UTC
From: Ira Weiny <ira.weiny@intel.com>

This RFC series has been reviewed by Dave Hansen.

Introduce a new page protection mechanism for supervisor pages, Protection Keys
Supervisor (PKS).

Two use cases for PKS are being developed: trusted keys and PMEM.  Trusted keys
is a newer use case which is still being explored.  PMEM was submitted as part
of the RFC (v2) series[1].  However, since then it was found that some callers
of kmap() require a global implementation of PKS.  Specifically, some users of
kmap() expect mappings to be available to all kernel threads.  While global use
of PKS is rare, it needs to be included for correctness.  Unfortunately, the
kmap() updates required a large patch series to make the needed changes at the
various kmap() call sites, so that patch set has been split out.  Because the
global PKS feature is only required for that use case, it will be deferred to
that set as well.[2]  This patch set is being submitted as a precursor to both
use cases.

For an overview of the entire PKS ecosystem, a git tree including this series
and the 2 use cases can be found here:

	https://github.com/weiny2/linux-kernel/tree/pks-rfc-v3


PKS enables protections on 'domains' of supervisor pages to limit supervisor
mode access to those pages beyond the normal paging protections.  PKS works in
a similar fashion to the user space pkeys feature, PKU.  As with PKU, supervisor
pkeys are checked in addition to the normal paging protections, and access or
writes can be disabled via an MSR update without TLB flushes when permissions
change.  Also like PKU, a page mapping is assigned to a domain by setting pkey
bits in the page table entry for that mapping.

Access is controlled through the PKRS MSR, which is read and written via
RDMSR/WRMSR.
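
For illustration, each pkey gets two bits (Access-Disable and Write-Disable) in
that register; these are the PKR_* helpers the pkeys_common.h patch introduces
(they also appear in the review below):

	#define PKR_AD_BIT 0x1
	#define PKR_WD_BIT 0x2
	#define PKR_BITS_PER_PKEY 2

	/* Access-Disable mask for a pkey; OR several together for a full PKRS value */
	#define PKR_AD_KEY(pkey)	(PKR_AD_BIT << ((pkey) * PKR_BITS_PER_PKEY))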

XSAVE is not supported for the PKRS MSR.  Therefore the implementation
saves/restores the MSR across context switches and during exceptions.  Nested
exceptions are supported by each exception getting a new PKS state.

For consistent behavior with current paging protections, pkey 0 is reserved and
configured to allow full access via the pkey mechanism, thus preserving the
default paging protections on mappings with the default pkey value of 0.

The other keys (1-15) are allocated by an allocator, which prepares us for key
contention from day one.  Kernel users should be prepared for the allocator to
fail, either because of key exhaustion or because PKS is not supported on the
arch and/or CPU instance.
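
A minimal allocation sketch (the helper names follow the "x86/pks: Add PKS
kernel API" patch in this series; treat the exact signatures as illustrative):

	int pkey = pks_key_alloc("example user");	/* illustrative caller name */

	if (pkey < 0)
		return pkey;	/* key exhaustion or no PKS: fall back to normal paging protections */

	/* ... map pages with the pkey and toggle access around critical sections ... */

	pks_key_free(pkey);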

The following are key attributes of PKS.

   1) Fast switching of permissions
	1a) Prevents access without page table manipulations
	1b) No TLB flushes required
   2) Works on a per thread basis

PKS is available with 4- and 5-level paging.  Like PKU, it consumes 4 bits of
the PTE to store the pkey for that entry.


[1] https://lore.kernel.org/lkml/20200717072056.73134-1-ira.weiny@intel.com/
[2] https://github.com/weiny2/linux-kernel/commit/f10abb0f0d7b4e14f03fc8890313a5830cde1e49
	and a testing patch
    https://github.com/weiny2/linux-kernel/commit/2a8e0fc7654a7c69b243d628f63b01ff26a5a797


Fenghua Yu (3):
  x86/fpu: Refactor arch_set_user_pkey_access() for PKS support
  x86/pks: Enable Protection Keys Supervisor (PKS)
  x86/pks: Add PKS kernel API

Ira Weiny (6):
  x86/pkeys: Create pkeys_common.h
  x86/pks: Preserve the PKRS MSR on context switch
  x86/entry: Pass irqentry_state_t by reference
  x86/entry: Preserve PKRS MSR across exceptions
  x86/fault: Report the PKRS state on fault
  x86/pks: Add PKS test code

 Documentation/core-api/protection-keys.rst  | 102 ++-
 arch/x86/Kconfig                            |   1 +
 arch/x86/entry/common.c                     |  57 +-
 arch/x86/include/asm/cpufeatures.h          |   1 +
 arch/x86/include/asm/idtentry.h             |  29 +-
 arch/x86/include/asm/msr-index.h            |   1 +
 arch/x86/include/asm/pgtable.h              |  13 +-
 arch/x86/include/asm/pgtable_types.h        |  12 +
 arch/x86/include/asm/pkeys.h                |  15 +
 arch/x86/include/asm/pkeys_common.h         |  36 +
 arch/x86/include/asm/processor.h            |  13 +
 arch/x86/include/uapi/asm/processor-flags.h |   2 +
 arch/x86/kernel/cpu/common.c                |  17 +
 arch/x86/kernel/cpu/mce/core.c              |   4 +
 arch/x86/kernel/fpu/xstate.c                |  22 +-
 arch/x86/kernel/kvm.c                       |   4 +-
 arch/x86/kernel/nmi.c                       |   7 +-
 arch/x86/kernel/process.c                   |  21 +
 arch/x86/kernel/traps.c                     |  21 +-
 arch/x86/mm/fault.c                         |  86 ++-
 arch/x86/mm/pkeys.c                         | 188 +++++-
 include/linux/entry-common.h                |  19 +-
 include/linux/pgtable.h                     |   4 +
 include/linux/pkeys.h                       |  23 +-
 kernel/entry/common.c                       |  28 +-
 lib/Kconfig.debug                           |  12 +
 lib/Makefile                                |   3 +
 lib/pks/Makefile                            |   3 +
 lib/pks/pks_test.c                          | 690 ++++++++++++++++++++
 mm/Kconfig                                  |   2 +
 tools/testing/selftests/x86/Makefile        |   3 +-
 tools/testing/selftests/x86/test_pks.c      |  65 ++
 32 files changed, 1376 insertions(+), 128 deletions(-)
 create mode 100644 arch/x86/include/asm/pkeys_common.h
 create mode 100644 lib/pks/Makefile
 create mode 100644 lib/pks/pks_test.c
 create mode 100644 tools/testing/selftests/x86/test_pks.c

Comments

Ira Weiny Oct. 9, 2020, 8:18 p.m. UTC | #1
On Fri, Oct 09, 2020 at 12:42:49PM -0700, 'Ira Weiny' wrote:
> From: Ira Weiny <ira.weiny@intel.com>
> 
> This RFC series has been reviewed by Dave Hansen.
> 
> Introduce a new page protection mechanism for supervisor pages, Protection Key
> Supervisor (PKS).
> 
> 2 use cases for PKS are being developed, trusted keys and PMEM.

RFC patch sets for these use cases have also been posted:

PMEM:
https://lore.kernel.org/lkml/20201009195033.3208459-1-ira.weiny@intel.com/

Trusted Keys:
https://lore.kernel.org/lkml/20201009201410.3209180-1-ira.weiny@intel.com/

Ira

Dave Hansen Oct. 13, 2020, 5:46 p.m. UTC | #2
On 10/9/20 12:42 PM, ira.weiny@intel.com wrote:
> Protection Keys User (PKU) and Protection Keys Supervisor (PKS) work
> in similar fashions and can share common defines.

Could we be a bit less abstract?  PKS and PKU each have:
1. A single control register
2. The same number of keys
3. The same number of bits in the register per key
4. Access and Write disable in the same bit locations

That means that we can share all the macros that synthesize and
manipulate register values between the two features.

> +++ b/arch/x86/include/asm/pkeys_common.h
> @@ -0,0 +1,11 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _ASM_X86_PKEYS_INTERNAL_H
> +#define _ASM_X86_PKEYS_INTERNAL_H
> +
> +#define PKR_AD_BIT 0x1
> +#define PKR_WD_BIT 0x2
> +#define PKR_BITS_PER_PKEY 2
> +
> +#define PKR_AD_KEY(pkey)	(PKR_AD_BIT << ((pkey) * PKR_BITS_PER_PKEY))

Now that this has moved away from its use-site, it's a bit less
self-documenting.  Let's add a comment:

/*
 * Generate an Access-Disable mask for the given pkey.  Several of these
 * can be OR'd together to generate pkey register values.
 */

Once that's in place, along with the updated changelog:

Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Dave Hansen Oct. 13, 2020, 6:23 p.m. UTC | #3
On 10/9/20 12:42 PM, ira.weiny@intel.com wrote:
> +/*
> + * PKS is independent of PKU and either or both may be supported on a CPU.
> + * Configure PKS if the cpu supports the feature.
> + */

Let's at least be consistent about CPU vs. cpu in a single comment. :)

> +static void setup_pks(void)
> +{
> +	if (!IS_ENABLED(CONFIG_ARCH_HAS_SUPERVISOR_PKEYS))
> +		return;
> +	if (!cpu_feature_enabled(X86_FEATURE_PKS))
> +		return;

If you put X86_FEATURE_PKS in disabled-features.h, you can get rid of
the explicit CONFIG_ check.
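
(A sketch of that disabled-features.h change, assuming X86_FEATURE_PKS sits in
CPUID word 16 with the other CPUID.7:ECX bits; the exact DISABLED_MASK word
follows from where the feature bit is defined:)

	#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
	# define DISABLE_PKS	0
	#else
	# define DISABLE_PKS	(1 << (X86_FEATURE_PKS & 31))
	#endif

	/* ... and OR it into the matching mask so cpu_feature_enabled() compiles it out: */
	#define DISABLED_MASK16	(DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57|DISABLE_UMIP| \
				 DISABLE_PKS)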

> +	cr4_set_bits(X86_CR4_PKS);
> +}
> +
>  /*
>   * This does the hard work of actually picking apart the CPU stuff...
>   */
> @@ -1544,6 +1558,7 @@ static void identify_cpu(struct cpuinfo_x86 *c)
>  
>  	x86_init_rdrand(c);
>  	setup_pku(c);
> +	setup_pks();
>  
>  	/*
>  	 * Clear/Set all flags overridden by options, need do it
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 6c974888f86f..1b9bc004d9bc 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -822,6 +822,8 @@ config ARCH_USES_HIGH_VMA_FLAGS
>  	bool
>  config ARCH_HAS_PKEYS
>  	bool
> +config ARCH_HAS_SUPERVISOR_PKEYS
> +	bool
>  
>  config PERCPU_STATS
>  	bool "Collect percpu memory statistics"
>
Dave Hansen Oct. 13, 2020, 6:31 p.m. UTC | #4
On 10/9/20 12:42 PM, ira.weiny@intel.com wrote:
> From: Ira Weiny <ira.weiny@intel.com>
> 
> The PKRS MSR is defined as a per-logical-processor register.  This
> isolates memory access by logical CPU.  Unfortunately, the MSR is not
> managed by XSAVE.  Therefore, tasks must save/restore the MSR value on
> context switch.
> 
> Define a saved PKRS value in the task struct, as well as a cached
> per-logical-processor MSR value which mirrors the MSR value of the
> current CPU.  Initialize all tasks with the default MSR value.  Then, on
> schedule in, check the saved task MSR vs the per-cpu value.  If
> different proceed to write the MSR.  If not avoid the overhead of the
> MSR write and continue.

It's probably nice to note how the WRMSR is special here, in addition to
the comments below.

>  #endif /*_ASM_X86_PKEYS_INTERNAL_H */
> diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
> index 97143d87994c..da2381136b2d 100644
> --- a/arch/x86/include/asm/processor.h
> +++ b/arch/x86/include/asm/processor.h
> @@ -18,6 +18,7 @@ struct vm86;
>  #include <asm/cpufeatures.h>
>  #include <asm/page.h>
>  #include <asm/pgtable_types.h>
> +#include <asm/pkeys_common.h>
>  #include <asm/percpu.h>
>  #include <asm/msr.h>
>  #include <asm/desc_defs.h>
> @@ -542,6 +543,11 @@ struct thread_struct {
>  
>  	unsigned int		sig_on_uaccess_err:1;
>  
> +#ifdef	CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
> +	/* Saved Protection key register for supervisor mappings */
> +	u32			saved_pkrs;
> +#endif

Could you take a look around thread_struct and see if there are some
other MSRs near which you can stash this?  This seems like a bit of a
lonely place.

...
>  void flush_thread(void)
>  {
>  	struct task_struct *tsk = current;
> @@ -195,6 +212,8 @@ void flush_thread(void)
>  	memset(tsk->thread.tls_array, 0, sizeof(tsk->thread.tls_array));
>  
>  	fpu__clear_all(&tsk->thread.fpu);
> +
> +	pks_init_task(tsk);
>  }
>  
>  void disable_TSC(void)
> @@ -644,6 +663,8 @@ void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p)
>  
>  	if ((tifp ^ tifn) & _TIF_SLD)
>  		switch_to_sld(tifn);
> +
> +	pks_sched_in();
>  }
>  
>  /*
> diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
> index 3cf8f775f36d..30f65dd3d0c5 100644
> --- a/arch/x86/mm/pkeys.c
> +++ b/arch/x86/mm/pkeys.c
> @@ -229,3 +229,31 @@ u32 update_pkey_val(u32 pk_reg, int pkey, unsigned int flags)
>  
>  	return pk_reg;
>  }
> +
> +DEFINE_PER_CPU(u32, pkrs_cache);
> +
> +/**
> + * It should also be noted that the underlying WRMSR(MSR_IA32_PKRS) is not
> + * serializing but still maintains ordering properties similar to WRPKRU.
> + * The current SDM section on PKRS needs updating but should be the same as
> + * that of WRPKRU.  So to quote from the WRPKRU text:
> + *
> + * 	WRPKRU will never execute transiently. Memory accesses
> + * 	affected by PKRU register will not execute (even transiently)
> + * 	until all prior executions of WRPKRU have completed execution
> + * 	and updated the PKRU register.
> + */
> +void write_pkrs(u32 new_pkrs)
> +{
> +	u32 *pkrs;
> +
> +	if (!static_cpu_has(X86_FEATURE_PKS))
> +		return;
> +
> +	pkrs = get_cpu_ptr(&pkrs_cache);
> +	if (*pkrs != new_pkrs) {
> +		*pkrs = new_pkrs;
> +		wrmsrl(MSR_IA32_PKRS, new_pkrs);
> +	}
> +	put_cpu_ptr(pkrs);
> +}
> 

It bugs me a *bit* that this is being called in a preempt-disabled
region, but we still bother with the get/put_cpu jazz.  Are there other
future call-sites for this that aren't in preempt-disabled regions?
Dave Hansen Oct. 13, 2020, 6:52 p.m. UTC | #5
On 10/9/20 12:42 PM, ira.weiny@intel.com wrote:
> @@ -341,6 +341,9 @@ noinstr void irqentry_enter(struct pt_regs *regs, irqentry_state_t *state)
>  	/* Use the combo lockdep/tracing function */
>  	trace_hardirqs_off();
>  	instrumentation_end();
> +
> +done:
> +	irq_save_pkrs(state);
>  }

One nit: This saves *and* sets PKRS.  It's not obvious from the call
here that PKRS is altered at this site.  Seems like there could be a
better name.

Even if we did:

	irq_save_set_pkrs(state, INIT_VAL);

It would probably compile down to the same thing, but be *really*
obvious what's going on.
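
(A sketch of such a helper, using the pkrs/thread_pkrs members this series adds
to irqentry_state_t; the body here is illustrative, not the posted patch:)

	static __always_inline void irq_save_set_pkrs(irqentry_state_t *state, u32 val)
	{
		if (!cpu_feature_enabled(X86_FEATURE_PKS))
			return;

		/* save the current thread value and the live per-cpu value ... */
		state->thread_pkrs = current->thread.saved_pkrs;
		state->pkrs = this_cpu_read(pkrs_cache);
		/* ... then switch to the default value for the duration of the exception */
		write_pkrs(val);
	}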

>  void irqentry_exit_cond_resched(void)
> @@ -362,7 +365,12 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t *state)
>  	/* Check whether this returns to user mode */
>  	if (user_mode(regs)) {
>  		irqentry_exit_to_user_mode(regs);
> -	} else if (!regs_irqs_disabled(regs)) {
> +		return;
> +	}
> +
> +	irq_restore_pkrs(state);
> +
> +	if (!regs_irqs_disabled(regs)) {
>  		/*
>  		 * If RCU was not watching on entry this needs to be done
>  		 * carefully and needs the same ordering of lockdep/tracing
>
Ira Weiny Oct. 13, 2020, 7:44 p.m. UTC | #6
On Tue, Oct 13, 2020 at 10:46:16AM -0700, Dave Hansen wrote:
> On 10/9/20 12:42 PM, ira.weiny@intel.com wrote:
> > Protection Keys User (PKU) and Protection Keys Supervisor (PKS) work
> > in similar fashions and can share common defines.
> 
> Could we be a bit less abstract?  PKS and PKU each have:
> 1. A single control register
> 2. The same number of keys
> 3. The same number of bits in the register per key
> 4. Access and Write disable in the same bit locations
> 
> That means that we can share all the macros that synthesize and
> manipulate register values between the two features.

Sure.  Done.

> 
> > +++ b/arch/x86/include/asm/pkeys_common.h
> > @@ -0,0 +1,11 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +#ifndef _ASM_X86_PKEYS_INTERNAL_H
> > +#define _ASM_X86_PKEYS_INTERNAL_H
> > +
> > +#define PKR_AD_BIT 0x1
> > +#define PKR_WD_BIT 0x2
> > +#define PKR_BITS_PER_PKEY 2
> > +
> > +#define PKR_AD_KEY(pkey)	(PKR_AD_BIT << ((pkey) * PKR_BITS_PER_PKEY))
> 
> Now that this has moved away from its use-site, it's a bit less
> self-documenting.  Let's add a comment:
> 
> /*
>  * Generate an Access-Disable mask for the given pkey.  Several of these
>  * can be OR'd together to generate pkey register values.
>  */

Fair enough. done.

> 
> Once that's in place, along with the updated changelog:
> 
> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>

Thanks,
Ira
Ira Weiny Oct. 14, 2020, 2:08 a.m. UTC | #7
On Tue, Oct 13, 2020 at 11:23:08AM -0700, Dave Hansen wrote:
> On 10/9/20 12:42 PM, ira.weiny@intel.com wrote:
> > +/*
> > + * PKS is independent of PKU and either or both may be supported on a CPU.
> > + * Configure PKS if the cpu supports the feature.
> > + */
> 
> Let's at least be consistent about CPU vs. cpu in a single comment. :)

Sorry, done.

> 
> > +static void setup_pks(void)
> > +{
> > +	if (!IS_ENABLED(CONFIG_ARCH_HAS_SUPERVISOR_PKEYS))
> > +		return;
> > +	if (!cpu_feature_enabled(X86_FEATURE_PKS))
> > +		return;
> 
> If you put X86_FEATURE_PKS in disabled-features.h, you can get rid of
> the explicit CONFIG_ check.

Done.

> 
> > +	cr4_set_bits(X86_CR4_PKS);
> > +}
> > +
> >  /*
> >   * This does the hard work of actually picking apart the CPU stuff...
> >   */
> > @@ -1544,6 +1558,7 @@ static void identify_cpu(struct cpuinfo_x86 *c)
> >  
> >  	x86_init_rdrand(c);
> >  	setup_pku(c);
> > +	setup_pks();
> >  
> >  	/*
> >  	 * Clear/Set all flags overridden by options, need do it
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index 6c974888f86f..1b9bc004d9bc 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -822,6 +822,8 @@ config ARCH_USES_HIGH_VMA_FLAGS
> >  	bool
> >  config ARCH_HAS_PKEYS
> >  	bool
> > +config ARCH_HAS_SUPERVISOR_PKEYS
> > +	bool
> >  
> >  config PERCPU_STATS
> >  	bool "Collect percpu memory statistics"
> > 
> 
Ira Weiny Oct. 14, 2020, 10:36 p.m. UTC | #8
On Tue, Oct 13, 2020 at 11:31:45AM -0700, Dave Hansen wrote:
> On 10/9/20 12:42 PM, ira.weiny@intel.com wrote:
> > From: Ira Weiny <ira.weiny@intel.com>
> > 
> > The PKRS MSR is defined as a per-logical-processor register.  This
> > isolates memory access by logical CPU.  Unfortunately, the MSR is not
> > managed by XSAVE.  Therefore, tasks must save/restore the MSR value on
> > context switch.
> > 
> > Define a saved PKRS value in the task struct, as well as a cached
> > per-logical-processor MSR value which mirrors the MSR value of the
> > current CPU.  Initialize all tasks with the default MSR value.  Then, on
> > schedule in, check the saved task MSR vs the per-cpu value.  If
> > different proceed to write the MSR.  If not avoid the overhead of the
> > MSR write and continue.
> 
> It's probably nice to note how the WRMSR is special here, in addition to
> the comments below.

Sure,

> 
> >  #endif /*_ASM_X86_PKEYS_INTERNAL_H */
> > diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
> > index 97143d87994c..da2381136b2d 100644
> > --- a/arch/x86/include/asm/processor.h
> > +++ b/arch/x86/include/asm/processor.h
> > @@ -18,6 +18,7 @@ struct vm86;
> >  #include <asm/cpufeatures.h>
> >  #include <asm/page.h>
> >  #include <asm/pgtable_types.h>
> > +#include <asm/pkeys_common.h>
> >  #include <asm/percpu.h>
> >  #include <asm/msr.h>
> >  #include <asm/desc_defs.h>
> > @@ -542,6 +543,11 @@ struct thread_struct {
> >  
> >  	unsigned int		sig_on_uaccess_err:1;
> >  
> > +#ifdef	CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
> > +	/* Saved Protection key register for supervisor mappings */
> > +	u32			saved_pkrs;
> > +#endif
> 
> Could you take a look around thread_struct and see if there are some
> other MSRs near which you can stash this?  This seems like a bit of a
> lonely place.

Are you more concerned with aesthetics or the in-memory struct layout?

How about I put it after error_code?

	unsigned long           error_code;                                     
+                                                                               
+#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS                                        
+       /* Saved Protection key register for supervisor mappings */             
+       u32                     saved_pkrs;                                     
+#endif                                                                         
+                                                                               

?

> 
> ...
> >  void flush_thread(void)
> >  {
> >  	struct task_struct *tsk = current;
> > @@ -195,6 +212,8 @@ void flush_thread(void)
> >  	memset(tsk->thread.tls_array, 0, sizeof(tsk->thread.tls_array));
> >  
> >  	fpu__clear_all(&tsk->thread.fpu);
> > +
> > +	pks_init_task(tsk);
> >  }
> >  
> >  void disable_TSC(void)
> > @@ -644,6 +663,8 @@ void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p)
> >  
> >  	if ((tifp ^ tifn) & _TIF_SLD)
> >  		switch_to_sld(tifn);
> > +
> > +	pks_sched_in();
> >  }
> >  
> >  /*
> > diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
> > index 3cf8f775f36d..30f65dd3d0c5 100644
> > --- a/arch/x86/mm/pkeys.c
> > +++ b/arch/x86/mm/pkeys.c
> > @@ -229,3 +229,31 @@ u32 update_pkey_val(u32 pk_reg, int pkey, unsigned int flags)
> >  
> >  	return pk_reg;
> >  }
> > +
> > +DEFINE_PER_CPU(u32, pkrs_cache);
> > +
> > +/**
> > + * It should also be noted that the underlying WRMSR(MSR_IA32_PKRS) is not
> > + * serializing but still maintains ordering properties similar to WRPKRU.
> > + * The current SDM section on PKRS needs updating but should be the same as
> > + * that of WRPKRU.  So to quote from the WRPKRU text:
> > + *
> > + * 	WRPKRU will never execute transiently. Memory accesses
> > + * 	affected by PKRU register will not execute (even transiently)
> > + * 	until all prior executions of WRPKRU have completed execution
> > + * 	and updated the PKRU register.
> > + */
> > +void write_pkrs(u32 new_pkrs)
> > +{
> > +	u32 *pkrs;
> > +
> > +	if (!static_cpu_has(X86_FEATURE_PKS))
> > +		return;
> > +
> > +	pkrs = get_cpu_ptr(&pkrs_cache);
> > +	if (*pkrs != new_pkrs) {
> > +		*pkrs = new_pkrs;
> > +		wrmsrl(MSR_IA32_PKRS, new_pkrs);
> > +	}
> > +	put_cpu_ptr(pkrs);
> > +}
> > 
> 
> It bugs me a *bit* that this is being called in a preempt-disabled
> region, but we still bother with the get/put_cpu jazz.  Are there other
> future call-sites for this that aren't in preempt-disabled regions?

I'm not specifically disabling preempt before calling write_pkrs except in the
next patch (which is buggy because I meant to have it around the modification
of thread.saved_pkrs as well).  But that was to protect the thread variable not
the percpu cache vs MSR.

My thought above was it is safer for this call to ensure the per-cpu variable
is consistent with the register.  The other calls to write_pkrs() may require
preemption disable but for reasons unrelated to write_pkrs' state.

After some research I've now fully confused myself about whether this is needed
in patch 7/9, where write_pkrs() is called from the exception handling code.
But I think it is needed there.  Isn't it?

Since preempt_disable() is nestable I think this is OK, correct?

Ira
Ira Weiny Oct. 15, 2020, 3:46 a.m. UTC | #9
On Tue, Oct 13, 2020 at 11:52:32AM -0700, Dave Hansen wrote:
> On 10/9/20 12:42 PM, ira.weiny@intel.com wrote:
> > @@ -341,6 +341,9 @@ noinstr void irqentry_enter(struct pt_regs *regs, irqentry_state_t *state)
> >  	/* Use the combo lockdep/tracing function */
> >  	trace_hardirqs_off();
> >  	instrumentation_end();
> > +
> > +done:
> > +	irq_save_pkrs(state);
> >  }
> 
> One nit: This saves *and* sets PKRS.  It's not obvious from the call
> here that PKRS is altered at this site.  Seems like there could be a
> better name.
> 
> Even if we did:
> 
> 	irq_save_set_pkrs(state, INIT_VAL);
> 
> It would probably compile down to the same thing, but be *really*
> obvious what's going on.

I suppose that is true.  But I think it is odd having a parameter which is the
same for every call site.

But I'm not going to quibble over something like this.

Changed,
Ira

> 
> >  void irqentry_exit_cond_resched(void)
> > @@ -362,7 +365,12 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t *state)
> >  	/* Check whether this returns to user mode */
> >  	if (user_mode(regs)) {
> >  		irqentry_exit_to_user_mode(regs);
> > -	} else if (!regs_irqs_disabled(regs)) {
> > +		return;
> > +	}
> > +
> > +	irq_restore_pkrs(state);
> > +
> > +	if (!regs_irqs_disabled(regs)) {
> >  		/*
> >  		 * If RCU was not watching on entry this needs to be done
> >  		 * carefully and needs the same ordering of lockdep/tracing
> > 
>
Dave Hansen Oct. 15, 2020, 4:06 a.m. UTC | #10
On 10/14/20 8:46 PM, Ira Weiny wrote:
> On Tue, Oct 13, 2020 at 11:52:32AM -0700, Dave Hansen wrote:
>> On 10/9/20 12:42 PM, ira.weiny@intel.com wrote:
>>> @@ -341,6 +341,9 @@ noinstr void irqentry_enter(struct pt_regs *regs, irqentry_state_t *state)
>>>  	/* Use the combo lockdep/tracing function */
>>>  	trace_hardirqs_off();
>>>  	instrumentation_end();
>>> +
>>> +done:
>>> +	irq_save_pkrs(state);
>>>  }
>> One nit: This saves *and* sets PKRS.  It's not obvious from the call
>> here that PKRS is altered at this site.  Seems like there could be a
>> better name.
>>
>> Even if we did:
>>
>> 	irq_save_set_pkrs(state, INIT_VAL);
>>
>> It would probably compile down to the same thing, but be *really*
>> obvious what's going on.
> I suppose that is true.  But I think it is odd having a parameter which is the
> same for every call site.

Well, it depends on what you optimize for.  I'm trying to optimize for
the code being understood quickly the first time someone reads it.  To
me, that's more important than minimizing the number of function
parameters (which are essentially free).
Ira Weiny Oct. 15, 2020, 4:18 a.m. UTC | #11
On Wed, Oct 14, 2020 at 09:06:44PM -0700, Dave Hansen wrote:
> On 10/14/20 8:46 PM, Ira Weiny wrote:
> > On Tue, Oct 13, 2020 at 11:52:32AM -0700, Dave Hansen wrote:
> >> On 10/9/20 12:42 PM, ira.weiny@intel.com wrote:
> >>> @@ -341,6 +341,9 @@ noinstr void irqentry_enter(struct pt_regs *regs, irqentry_state_t *state)
> >>>  	/* Use the combo lockdep/tracing function */
> >>>  	trace_hardirqs_off();
> >>>  	instrumentation_end();
> >>> +
> >>> +done:
> >>> +	irq_save_pkrs(state);
> >>>  }
> >> One nit: This saves *and* sets PKRS.  It's not obvious from the call
> >> here that PKRS is altered at this site.  Seems like there could be a
> >> better name.
> >>
> >> Even if we did:
> >>
> >> 	irq_save_set_pkrs(state, INIT_VAL);
> >>
> >> It would probably compile down to the same thing, but be *really*
> >> obvious what's going on.
> > I suppose that is true.  But I think it is odd having a parameter which is the
> > same for every call site.
> 
> Well, it depends on what you optimize for.  I'm trying to optimize for
> the code being understood quickly the first time someone reads it.  To
> me, that's more important than minimizing the number of function
> parameters (which are essentially free).
> 

Agreed.  Sorry I was not trying to be confrontational.  There is just enough
other things which are going to take me time to get right I need to focus on
them...  :-D

Sorry,
Ira
Peter Zijlstra Oct. 16, 2020, 11:06 a.m. UTC | #12
On Fri, Oct 09, 2020 at 12:42:53PM -0700, ira.weiny@intel.com wrote:

> @@ -644,6 +663,8 @@ void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p)
>  
>  	if ((tifp ^ tifn) & _TIF_SLD)
>  		switch_to_sld(tifn);
> +
> +	pks_sched_in();
>  }
>  

You seem to have lost the comment proposed here:

  https://lkml.kernel.org/r/20200717083140.GW10769@hirez.programming.kicks-ass.net

It is useful and important information that the wrmsr normally doesn't
happen.
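
(For reference, a sketch of pks_sched_in() carrying that comment; the comment
wording is the one quoted later in this thread, and the body is illustrative:)

	static inline void pks_sched_in(void)
	{
		/*
		 * PKRS is only temporarily changed during specific code paths.
		 * Only a preemption during these windows away from the default
		 * value would require updating the MSR; write_pkrs() skips the
		 * WRMSR when the cached per-cpu value already matches.
		 */
		write_pkrs(current->thread.saved_pkrs);
	}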

> diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
> index 3cf8f775f36d..30f65dd3d0c5 100644
> --- a/arch/x86/mm/pkeys.c
> +++ b/arch/x86/mm/pkeys.c
> @@ -229,3 +229,31 @@ u32 update_pkey_val(u32 pk_reg, int pkey, unsigned int flags)
>  
>  	return pk_reg;
>  }
> +
> +DEFINE_PER_CPU(u32, pkrs_cache);
> +
> +/**
> + * It should also be noted that the underlying WRMSR(MSR_IA32_PKRS) is not
> + * serializing but still maintains ordering properties similar to WRPKRU.
> + * The current SDM section on PKRS needs updating but should be the same as
> + * that of WRPKRU.  So to quote from the WRPKRU text:
> + *
> + * 	WRPKRU will never execute transiently. Memory accesses
> + * 	affected by PKRU register will not execute (even transiently)
> + * 	until all prior executions of WRPKRU have completed execution
> + * 	and updated the PKRU register.

(whitespace damage; space followed by tabstop)

> + */
> +void write_pkrs(u32 new_pkrs)
> +{
> +	u32 *pkrs;
> +
> +	if (!static_cpu_has(X86_FEATURE_PKS))
> +		return;
> +
> +	pkrs = get_cpu_ptr(&pkrs_cache);
> +	if (*pkrs != new_pkrs) {
> +		*pkrs = new_pkrs;
> +		wrmsrl(MSR_IA32_PKRS, new_pkrs);
> +	}
> +	put_cpu_ptr(pkrs);
> +}

looks familiar that... :-)
Peter Zijlstra Oct. 16, 2020, 11:12 a.m. UTC | #13
On Tue, Oct 13, 2020 at 11:31:45AM -0700, Dave Hansen wrote:
> > +/**
> > + * It should also be noted that the underlying WRMSR(MSR_IA32_PKRS) is not
> > + * serializing but still maintains ordering properties similar to WRPKRU.
> > + * The current SDM section on PKRS needs updating but should be the same as
> > + * that of WRPKRU.  So to quote from the WRPKRU text:
> > + *
> > + * 	WRPKRU will never execute transiently. Memory accesses
> > + * 	affected by PKRU register will not execute (even transiently)
> > + * 	until all prior executions of WRPKRU have completed execution
> > + * 	and updated the PKRU register.
> > + */
> > +void write_pkrs(u32 new_pkrs)
> > +{
> > +	u32 *pkrs;
> > +
> > +	if (!static_cpu_has(X86_FEATURE_PKS))
> > +		return;
> > +
> > +	pkrs = get_cpu_ptr(&pkrs_cache);
> > +	if (*pkrs != new_pkrs) {
> > +		*pkrs = new_pkrs;
> > +		wrmsrl(MSR_IA32_PKRS, new_pkrs);
> > +	}
> > +	put_cpu_ptr(pkrs);
> > +}
> > 
> 
> It bugs me a *bit* that this is being called in a preempt-disabled
> region, but we still bother with the get/put_cpu jazz.  Are there other
> future call-sites for this that aren't in preempt-disabled regions?

So the previous version had a useful comment that got lost. This stuff
needs to fundamentally be preempt disabled, so it either needs to
explicitly do so, or have an assertion that preemption is indeed
disabled.
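
(One way to spell that out, shown only as a sketch: drop the get/put_cpu_ptr()
pair, require callers to have preemption disabled, and assert it:)

	void write_pkrs(u32 new_pkrs)
	{
		u32 *pkrs;

		if (!static_cpu_has(X86_FEATURE_PKS))
			return;

		/* the per-cpu cache and MSR update must not be torn by preemption */
		lockdep_assert_preemption_disabled();

		pkrs = this_cpu_ptr(&pkrs_cache);
		if (*pkrs != new_pkrs) {
			*pkrs = new_pkrs;
			wrmsrl(MSR_IA32_PKRS, new_pkrs);
		}
	}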
Peter Zijlstra Oct. 16, 2020, 11:45 a.m. UTC | #14
On Fri, Oct 09, 2020 at 12:42:55PM -0700, ira.weiny@intel.com wrote:
> -noinstr bool idtentry_enter_nmi(struct pt_regs *regs)
> +noinstr void idtentry_enter_nmi(struct pt_regs *regs, irqentry_state_t *irq_state)
>  {
> -	bool irq_state = lockdep_hardirqs_enabled();
> +	irq_state->exit_rcu = lockdep_hardirqs_enabled();
>  
>  	__nmi_enter();
>  	lockdep_hardirqs_off(CALLER_ADDR0);
> @@ -222,15 +222,13 @@ noinstr bool idtentry_enter_nmi(struct pt_regs *regs)
>  	trace_hardirqs_off_finish();
>  	ftrace_nmi_enter();
>  	instrumentation_end();
> -
> -	return irq_state;
>  }
>  
> -noinstr void idtentry_exit_nmi(struct pt_regs *regs, bool restore)
> +noinstr void idtentry_exit_nmi(struct pt_regs *regs, irqentry_state_t *irq_state)
>  {
>  	instrumentation_begin();
>  	ftrace_nmi_exit();
> -	if (restore) {
> +	if (irq_state->exit_rcu) {
>  		trace_hardirqs_on_prepare();
>  		lockdep_hardirqs_on_prepare(CALLER_ADDR0);
>  	}
> @@ -238,7 +236,7 @@ noinstr void idtentry_exit_nmi(struct pt_regs *regs, bool restore)
>  
>  	rcu_nmi_exit();
>  	lockdep_hardirq_exit();
> -	if (restore)
> +	if (irq_state->exit_rcu)
>  		lockdep_hardirqs_on(CALLER_ADDR0);
>  	__nmi_exit();
>  }

That's not nice.. The NMI path is different from the IRQ path and has a
different variable. Yes, this works, but *groan*.

Maybe union them if you want to avoid bloating the structure, but the
above makes it really hard to read.
Thomas Gleixner Oct. 16, 2020, 12:55 p.m. UTC | #15
On Fri, Oct 16 2020 at 13:45, Peter Zijlstra wrote:
> On Fri, Oct 09, 2020 at 12:42:55PM -0700, ira.weiny@intel.com wrote:
>> @@ -238,7 +236,7 @@ noinstr void idtentry_exit_nmi(struct pt_regs *regs, bool restore)
>>  
>>  	rcu_nmi_exit();
>>  	lockdep_hardirq_exit();
>> -	if (restore)
>> +	if (irq_state->exit_rcu)
>>  		lockdep_hardirqs_on(CALLER_ADDR0);
>>  	__nmi_exit();
>>  }
>
> That's not nice.. The NMI path is different from the IRQ path and has a
> different variable. Yes, this works, but *groan*.
>
> Maybe union them if you want to avoid bloating the structure, but the
> above makes it really hard to read.

Right, and also that nmi entry thing should not be in x86. Something
like the untested below as first cleanup.

Thanks,

        tglx
----
Subject: x86/entry: Move nmi entry/exit into common code
From: Thomas Gleixner <tglx@linutronix.de>

Date: Fri, 11 Sep 2020 10:09:56 +0200

Add blurb here.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

---
 arch/x86/entry/common.c         |   34 ----------------------------------
 arch/x86/include/asm/idtentry.h |    3 ---
 arch/x86/kernel/cpu/mce/core.c  |    6 +++---
 arch/x86/kernel/nmi.c           |    6 +++---
 arch/x86/kernel/traps.c         |   13 +++++++------
 include/linux/entry-common.h    |   20 ++++++++++++++++++++
 kernel/entry/common.c           |   36 ++++++++++++++++++++++++++++++++++++
 7 files changed, 69 insertions(+), 49 deletions(-)

--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -209,40 +209,6 @@ SYSCALL_DEFINE0(ni_syscall)
 	return -ENOSYS;
 }
 
-noinstr bool idtentry_enter_nmi(struct pt_regs *regs)
-{
-	bool irq_state = lockdep_hardirqs_enabled();
-
-	__nmi_enter();
-	lockdep_hardirqs_off(CALLER_ADDR0);
-	lockdep_hardirq_enter();
-	rcu_nmi_enter();
-
-	instrumentation_begin();
-	trace_hardirqs_off_finish();
-	ftrace_nmi_enter();
-	instrumentation_end();
-
-	return irq_state;
-}
-
-noinstr void idtentry_exit_nmi(struct pt_regs *regs, bool restore)
-{
-	instrumentation_begin();
-	ftrace_nmi_exit();
-	if (restore) {
-		trace_hardirqs_on_prepare();
-		lockdep_hardirqs_on_prepare(CALLER_ADDR0);
-	}
-	instrumentation_end();
-
-	rcu_nmi_exit();
-	lockdep_hardirq_exit();
-	if (restore)
-		lockdep_hardirqs_on(CALLER_ADDR0);
-	__nmi_exit();
-}
-
 #ifdef CONFIG_XEN_PV
 #ifndef CONFIG_PREEMPTION
 /*
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -11,9 +11,6 @@
 
 #include <asm/irq_stack.h>
 
-bool idtentry_enter_nmi(struct pt_regs *regs);
-void idtentry_exit_nmi(struct pt_regs *regs, bool irq_state);
-
 /**
  * DECLARE_IDTENTRY - Declare functions for simple IDT entry points
  *		      No error code pushed by hardware
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1983,7 +1983,7 @@ void (*machine_check_vector)(struct pt_r
 
 static __always_inline void exc_machine_check_kernel(struct pt_regs *regs)
 {
-	bool irq_state;
+	irqentry_state_t irq_state;
 
 	WARN_ON_ONCE(user_mode(regs));
 
@@ -1995,7 +1995,7 @@ static __always_inline void exc_machine_
 	    mce_check_crashing_cpu())
 		return;
 
-	irq_state = idtentry_enter_nmi(regs);
+	irq_state = irqentry_nmi_enter(regs);
 	/*
 	 * The call targets are marked noinstr, but objtool can't figure
 	 * that out because it's an indirect call. Annotate it.
@@ -2006,7 +2006,7 @@ static __always_inline void exc_machine_
 	if (regs->flags & X86_EFLAGS_IF)
 		trace_hardirqs_on_prepare();
 	instrumentation_end();
-	idtentry_exit_nmi(regs, irq_state);
+	irqentry_nmi_exit(regs, irq_state);
 }
 
 static __always_inline void exc_machine_check_user(struct pt_regs *regs)
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -475,7 +475,7 @@ static DEFINE_PER_CPU(unsigned long, nmi
 
 DEFINE_IDTENTRY_RAW(exc_nmi)
 {
-	bool irq_state;
+	irqentry_state_t irq_state;
 
 	/*
 	 * Re-enable NMIs right here when running as an SEV-ES guest. This might
@@ -502,14 +502,14 @@ DEFINE_IDTENTRY_RAW(exc_nmi)
 
 	this_cpu_write(nmi_dr7, local_db_save());
 
-	irq_state = idtentry_enter_nmi(regs);
+	irq_state = irqentry_nmi_enter(regs);
 
 	inc_irq_stat(__nmi_count);
 
 	if (!ignore_nmis)
 		default_do_nmi(regs);
 
-	idtentry_exit_nmi(regs, irq_state);
+	irqentry_nmi_exit(regs, irq_state);
 
 	local_db_restore(this_cpu_read(nmi_dr7));
 
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -405,7 +405,7 @@ DEFINE_IDTENTRY_DF(exc_double_fault)
 	}
 #endif
 
-	idtentry_enter_nmi(regs);
+	irqentry_nmi_enter(regs);
 	instrumentation_begin();
 	notify_die(DIE_TRAP, str, regs, error_code, X86_TRAP_DF, SIGSEGV);
 
@@ -651,12 +651,13 @@ DEFINE_IDTENTRY_RAW(exc_int3)
 		instrumentation_end();
 		irqentry_exit_to_user_mode(regs);
 	} else {
-		bool irq_state = idtentry_enter_nmi(regs);
+		irqentry_state_t irq_state = irqentry_nmi_enter(regs);
+
 		instrumentation_begin();
 		if (!do_int3(regs))
 			die("int3", regs, 0);
 		instrumentation_end();
-		idtentry_exit_nmi(regs, irq_state);
+		irqentry_nmi_exit(regs, irq_state);
 	}
 }
 
@@ -864,7 +865,7 @@ static __always_inline void exc_debug_ke
 	 * includes the entry stack is excluded for everything.
 	 */
 	unsigned long dr7 = local_db_save();
-	bool irq_state = idtentry_enter_nmi(regs);
+	irqentry_state_t irq_state = irqentry_nmi_enter(regs);
 	instrumentation_begin();
 
 	/*
@@ -907,7 +908,7 @@ static __always_inline void exc_debug_ke
 		regs->flags &= ~X86_EFLAGS_TF;
 out:
 	instrumentation_end();
-	idtentry_exit_nmi(regs, irq_state);
+	irqentry_nmi_exit(regs, irq_state);
 
 	local_db_restore(dr7);
 }
@@ -925,7 +926,7 @@ static __always_inline void exc_debug_us
 
 	/*
 	 * NB: We can't easily clear DR7 here because
-	 * idtentry_exit_to_usermode() can invoke ptrace, schedule, access
+	 * irqentry_exit_to_usermode() can invoke ptrace, schedule, access
 	 * user memory, etc.  This means that a recursive #DB is possible.  If
 	 * this happens, that #DB will hit exc_debug_kernel() and clear DR7.
 	 * Since we're not on the IST stack right now, everything will be
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -343,6 +343,7 @@ void irqentry_exit_to_user_mode(struct p
 #ifndef irqentry_state
 typedef struct irqentry_state {
 	bool	exit_rcu;
+	bool	lockdep;
 } irqentry_state_t;
 #endif
 
@@ -402,4 +403,23 @@ void irqentry_exit_cond_resched(void);
  */
 void noinstr irqentry_exit(struct pt_regs *regs, irqentry_state_t state);
 
+/**
+ * irqentry_nmi_enter - Handle NMI entry
+ * @regs:	Pointer to currents pt_regs
+ *
+ * Similar to irqentry_enter() but taking care of the NMI constraints.
+ */
+irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs);
+
+/**
+ * irqentry_nmi_exit - Handle return from NMI handling
+ * @regs:	Pointer to pt_regs (NMI entry regs)
+ * @state:	Return value from matching call to irqentry_nmi_enter()
+ *
+ * Last action before returning to the low level assmenbly code.
+ *
+ * Counterpart to irqentry_nmi_enter().
+ */
+void noinstr irqentry_nmi_exit(struct pt_regs *regs, irqentry_state_t irq_state);
+
 #endif
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -398,3 +398,39 @@ noinstr void irqentry_exit(struct pt_reg
 			rcu_irq_exit();
 	}
 }
+
+irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs)
+{
+	irqentry_state_t irq_state;
+
+	irq_state.lockdep = lockdep_hardirqs_enabled();
+
+	__nmi_enter();
+	lockdep_hardirqs_off(CALLER_ADDR0);
+	lockdep_hardirq_enter();
+	rcu_nmi_enter();
+
+	instrumentation_begin();
+	trace_hardirqs_off_finish();
+	ftrace_nmi_enter();
+	instrumentation_end();
+
+	return irq_state;
+}
+
+void noinstr irqentry_nmi_exit(struct pt_regs *regs, irqentry_state_t irq_state)
+{
+	instrumentation_begin();
+	ftrace_nmi_exit();
+	if (irq_state.lockdep) {
+		trace_hardirqs_on_prepare();
+		lockdep_hardirqs_on_prepare(CALLER_ADDR0);
+	}
+	instrumentation_end();
+
+	rcu_nmi_exit();
+	lockdep_hardirq_exit();
+	if (irq_state.lockdep)
+		lockdep_hardirqs_on(CALLER_ADDR0);
+	__nmi_exit();
+}
Ira Weiny Oct. 17, 2020, 5:14 a.m. UTC | #16
On Fri, Oct 16, 2020 at 01:12:26PM +0200, Peter Zijlstra wrote:
> On Tue, Oct 13, 2020 at 11:31:45AM -0700, Dave Hansen wrote:
> > > +/**
> > > + * It should also be noted that the underlying WRMSR(MSR_IA32_PKRS) is not
> > > + * serializing but still maintains ordering properties similar to WRPKRU.
> > > + * The current SDM section on PKRS needs updating but should be the same as
> > > + * that of WRPKRU.  So to quote from the WRPKRU text:
> > > + *
> > > + * 	WRPKRU will never execute transiently. Memory accesses
> > > + * 	affected by PKRU register will not execute (even transiently)
> > > + * 	until all prior executions of WRPKRU have completed execution
> > > + * 	and updated the PKRU register.
> > > + */
> > > +void write_pkrs(u32 new_pkrs)
> > > +{
> > > +	u32 *pkrs;
> > > +
> > > +	if (!static_cpu_has(X86_FEATURE_PKS))
> > > +		return;
> > > +
> > > +	pkrs = get_cpu_ptr(&pkrs_cache);
> > > +	if (*pkrs != new_pkrs) {
> > > +		*pkrs = new_pkrs;
> > > +		wrmsrl(MSR_IA32_PKRS, new_pkrs);
> > > +	}
> > > +	put_cpu_ptr(pkrs);
> > > +}
> > > 
> > 
> > It bugs me a *bit* that this is being called in a preempt-disabled
> > region, but we still bother with the get/put_cpu jazz.  Are there other
> > future call-sites for this that aren't in preempt-disabled regions?
> 
> So the previous version had a useful comment that got lost.

Ok Looking back I see what happened...  This comment...

 /*
  * PKRS is only temporarily changed during specific code paths.
  * Only a preemption during these windows away from the default
  * value would require updating the MSR.
  */

... was added to pks_sched_in() but that got simplified down because cleaning
up write_pkrs() made the code there obsolete.

> This stuff
> needs to fundamentally be preempt disabled,

I agree, the update to the percpu cache value and MSR can not be torn.

> so it either needs to
> explicitly do so, or have an assertion that preemption is indeed
> disabled.

However, I don't think I understand clearly.  Doesn't [get|put]_cpu_ptr()
handle the preempt_disable() for us?  Is it not sufficient to rely on that?

Dave's comment seems to be the opposite where we need to eliminate preempt
disable before calling write_pkrs().

FWIW I think I'm mistaken in my response to Dave regarding the
preempt_disable() in pks_update_protection().

Ira
Ira Weiny Oct. 17, 2020, 5:37 a.m. UTC | #17
On Fri, Oct 16, 2020 at 01:06:36PM +0200, Peter Zijlstra wrote:
> On Fri, Oct 09, 2020 at 12:42:53PM -0700, ira.weiny@intel.com wrote:
> 
> > @@ -644,6 +663,8 @@ void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p)
> >  
> >  	if ((tifp ^ tifn) & _TIF_SLD)
> >  		switch_to_sld(tifn);
> > +
> > +	pks_sched_in();
> >  }
> >  
> 
> You seem to have lost the comment proposed here:
> 
>   https://lkml.kernel.org/r/20200717083140.GW10769@hirez.programming.kicks-ass.net
> 
> It is useful and important information that the wrmsr normally doesn't
> happen.

Added back in here.

> 
> > diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
> > index 3cf8f775f36d..30f65dd3d0c5 100644
> > --- a/arch/x86/mm/pkeys.c
> > +++ b/arch/x86/mm/pkeys.c
> > @@ -229,3 +229,31 @@ u32 update_pkey_val(u32 pk_reg, int pkey, unsigned int flags)
> >  
> >  	return pk_reg;
> >  }
> > +
> > +DEFINE_PER_CPU(u32, pkrs_cache);
> > +
> > +/**
> > + * It should also be noted that the underlying WRMSR(MSR_IA32_PKRS) is not
> > + * serializing but still maintains ordering properties similar to WRPKRU.
> > + * The current SDM section on PKRS needs updating but should be the same as
> > + * that of WRPKRU.  So to quote from the WRPKRU text:
> > + *
> > + * 	WRPKRU will never execute transiently. Memory accesses
> > + * 	affected by PKRU register will not execute (even transiently)
> > + * 	until all prior executions of WRPKRU have completed execution
> > + * 	and updated the PKRU register.
> 
> (whitespace damage; space followed by tabstop)

Fixed thanks.

> 
> > + */
> > +void write_pkrs(u32 new_pkrs)
> > +{
> > +	u32 *pkrs;
> > +
> > +	if (!static_cpu_has(X86_FEATURE_PKS))
> > +		return;
> > +
> > +	pkrs = get_cpu_ptr(&pkrs_cache);
> > +	if (*pkrs != new_pkrs) {
> > +		*pkrs = new_pkrs;
> > +		wrmsrl(MSR_IA32_PKRS, new_pkrs);
> > +	}
> > +	put_cpu_ptr(pkrs);
> > +}
> 
> looks familiar that... :-)

Added you as a co-developer if that is ok?

Ira
Ira Weiny Oct. 19, 2020, 5:37 a.m. UTC | #18
On Fri, Oct 16, 2020 at 02:55:21PM +0200, Thomas Gleixner wrote:
> On Fri, Oct 16 2020 at 13:45, Peter Zijlstra wrote:
> > On Fri, Oct 09, 2020 at 12:42:55PM -0700, ira.weiny@intel.com wrote:
> >> @@ -238,7 +236,7 @@ noinstr void idtentry_exit_nmi(struct pt_regs *regs, bool restore)
> >>  
> >>  	rcu_nmi_exit();
> >>  	lockdep_hardirq_exit();
> >> -	if (restore)
> >> +	if (irq_state->exit_rcu)
> >>  		lockdep_hardirqs_on(CALLER_ADDR0);
> >>  	__nmi_exit();
> >>  }
> >
> > That's not nice.. The NMI path is different from the IRQ path and has a
> > different variable. Yes, this works, but *groan*.
> >
> > Maybe union them if you want to avoid bloating the structure, but the
> > above makes it really hard to read.
> 
> Right, and also that nmi entry thing should not be in x86. Something
> like the untested below as first cleanup.

Ok, I see what Peter was talking about.  I've added this patch to the series.

> 
> Thanks,
> 
>         tglx
> ----
> Subject: x86/entry: Move nmi entry/exit into common code
> From: Thomas Gleixner <tglx@linutronix.de>
> Date: Fri, 11 Sep 2020 10:09:56 +0200
> 
> Add blurb here.

How about:

To prepare for saving PKRS values across NMI's we lift the
idtentry_[enter|exit]_nmi() to the common code.  Rename them to
irqentry_nmi_[enter|exit]() to reflect the new generic nature and store the
state in the same irqentry_state_t structure as the other irqentry_*()
functions.  Finally, differentiate the state being stored between the NMI and
IRQ path by adding 'lockdep' to irqentry_state_t.

?

> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>  arch/x86/entry/common.c         |   34 ----------------------------------
>  arch/x86/include/asm/idtentry.h |    3 ---
>  arch/x86/kernel/cpu/mce/core.c  |    6 +++---
>  arch/x86/kernel/nmi.c           |    6 +++---
>  arch/x86/kernel/traps.c         |   13 +++++++------
>  include/linux/entry-common.h    |   20 ++++++++++++++++++++
>  kernel/entry/common.c           |   36 ++++++++++++++++++++++++++++++++++++
>  7 files changed, 69 insertions(+), 49 deletions(-)
> 

[snip]

> --- a/include/linux/entry-common.h
> +++ b/include/linux/entry-common.h
> @@ -343,6 +343,7 @@ void irqentry_exit_to_user_mode(struct p
>  #ifndef irqentry_state
>  typedef struct irqentry_state {
>  	bool	exit_rcu;
> +	bool	lockdep;
>  } irqentry_state_t;

Building on what Peter said do you agree this should be made into a union?

It may not be strictly necessary in this patch but I think it would reflect the
mutual exclusivity better and could be changed easy enough in the follow on
patch which adds the pkrs state.

Ira
Thomas Gleixner Oct. 19, 2020, 9:32 a.m. UTC | #19
On Sun, Oct 18 2020 at 22:37, Ira Weiny wrote:
> On Fri, Oct 16, 2020 at 02:55:21PM +0200, Thomas Gleixner wrote:
>> Subject: x86/entry: Move nmi entry/exit into common code
>> From: Thomas Gleixner <tglx@linutronix.de>
>> Date: Fri, 11 Sep 2020 10:09:56 +0200
>> 
>> Add blurb here.
>
> How about:
>
> To prepare for saving PKRS values across NMI's we lift the
> idtentry_[enter|exit]_nmi() to the common code.  Rename them to
> irqentry_nmi_[enter|exit]() to reflect the new generic nature and store the
> state in the same irqentry_state_t structure as the other irqentry_*()
> functions.  Finally, differentiate the state being stored between the NMI and
> IRQ path by adding 'lockdep' to irqentry_state_t.

No. This has absolutely nothing to do with PKRS. It's a cleanup valuable
by itself and that's how it should have been done right away.

So the proper changelog is:

  Lockdep state handling on NMI enter and exit is nothing specific to
  X86. It's not any different on other architectures. Also the extra
  state type is not necessary, irqentry_state_t can carry the necessary
  information as well.

  Move it to common code and extend irqentry_state_t to carry lockdep
  state.

>> --- a/include/linux/entry-common.h
>> +++ b/include/linux/entry-common.h
>> @@ -343,6 +343,7 @@ void irqentry_exit_to_user_mode(struct p
>>  #ifndef irqentry_state
>>  typedef struct irqentry_state {
>>  	bool	exit_rcu;
>> +	bool	lockdep;
>>  } irqentry_state_t;
>
> Building on what Peter said do you agree this should be made into a union?
>
> It may not be strictly necessary in this patch but I think it would reflect the
> mutual exclusivity better and could be changed easy enough in the follow on
> patch which adds the pkrs state.

Why the heck should it be changed in a patch which adds something
completely different?

Either it's mutually exclusive or not and if so it want's to be done in
this patch and not in a change which extends the struct for other
reasons.

Thanks,

        tglx
Peter Zijlstra Oct. 19, 2020, 9:37 a.m. UTC | #20
On Fri, Oct 16, 2020 at 10:14:10PM -0700, Ira Weiny wrote:
> > so it either needs to
> > explicitly do so, or have an assertion that preemption is indeed
> > disabled.
> 
> However, I don't think I understand clearly.  Doesn't [get|put]_cpu_ptr()
> handle the preempt_disable() for us? 

It does.

> Is it not sufficient to rely on that?

It is.

> Dave's comment seems to be the opposite where we need to eliminate preempt
> disable before calling write_pkrs().
> 
> FWIW I think I'm mistaken in my response to Dave regarding the
> preempt_disable() in pks_update_protection().

Dave's concern is that we're calling with preemption already
disabled so disabling it again is superfluous.
Ira Weiny Oct. 19, 2020, 6:48 p.m. UTC | #21
On Mon, Oct 19, 2020 at 11:37:14AM +0200, Peter Zijlstra wrote:
> On Fri, Oct 16, 2020 at 10:14:10PM -0700, Ira Weiny wrote:
> > > so it either needs to
> > > explicitly do so, or have an assertion that preemption is indeed
> > > disabled.
> > 
> > However, I don't think I understand clearly.  Doesn't [get|put]_cpu_ptr()
> > handle the preempt_disable() for us?
> 
> It does.
> 
> > Is it not sufficient to rely on that?
> 
> It is.
> 
> > Dave's comment seems to be the opposite where we need to eliminate preempt
> > disable before calling write_pkrs().
> > 
> > FWIW I think I'm mistaken in my response to Dave regarding the
> > preempt_disable() in pks_update_protection().
> 
> Dave's concern is that we're calling with preemption already
> disabled so disabling it again is superfluous.

Ok, thanks, and after getting my head straight I think I agree with him, and
you.

Thanks, I've reworked the code to remove the superfluous calls.  Sorry about
being so dense...  :-D

Ira
Ira Weiny Oct. 19, 2020, 8:26 p.m. UTC | #22
On Mon, Oct 19, 2020 at 11:32:50AM +0200, Thomas Gleixner wrote:
> On Sun, Oct 18 2020 at 22:37, Ira Weiny wrote:
> > On Fri, Oct 16, 2020 at 02:55:21PM +0200, Thomas Gleixner wrote:
> >> Subject: x86/entry: Move nmi entry/exit into common code
> >> From: Thomas Gleixner <tglx@linutronix.de>
> >> Date: Fri, 11 Sep 2020 10:09:56 +0200
> >> 
> >> Add blurb here.
> >
> > How about:
> >
> > To prepare for saving PKRS values across NMI's we lift the
> > idtentry_[enter|exit]_nmi() to the common code.  Rename them to
> > irqentry_nmi_[enter|exit]() to reflect the new generic nature and store the
> > state in the same irqentry_state_t structure as the other irqentry_*()
> > functions.  Finally, differentiate the state being stored between the NMI and
> > IRQ path by adding 'lockdep' to irqentry_state_t.
> 
> No. This has absolutely nothing to do with PKRS. It's a cleanup valuable
> by itself and that's how it should have been done right away.
> 
> So the proper changelog is:
> 
>   Lockdep state handling on NMI enter and exit is nothing specific to
>   X86. It's not any different on other architectures. Also the extra
>   state type is not necessary, irqentry_state_t can carry the necessary
>   information as well.
> 
>   Move it to common code and extend irqentry_state_t to carry lockdep
>   state.

Ok sounds good, thanks.

> 
> >> --- a/include/linux/entry-common.h
> >> +++ b/include/linux/entry-common.h
> >> @@ -343,6 +343,7 @@ void irqentry_exit_to_user_mode(struct p
> >>  #ifndef irqentry_state
> >>  typedef struct irqentry_state {
> >>  	bool	exit_rcu;
> >> +	bool	lockdep;
> >>  } irqentry_state_t;
> >
> > Building on what Peter said do you agree this should be made into a union?
> >
> > It may not be strictly necessary in this patch but I think it would reflect the
> > mutual exclusivity better and could be changed easy enough in the follow on
> > patch which adds the pkrs state.
> 
> Why the heck should it be changed in a patch which adds something
> completely different?

Because the PKRS stuff is used in both NMI and IRQ state.

> 
> Either it's mutually exclusive or not and if so it want's to be done in
> this patch and not in a change which extends the struct for other
> reasons.

Sorry, let me clarify.  After this patch we have.

typedef union irqentry_state {
	bool	exit_rcu;
	bool	lockdep;
} irqentry_state_t;

Which reflects the mutual exclusion of the 2 variables.

But then when the pkrs stuff is added the union changes back to a structure and
looks like this.

typedef struct irqentry_state {
#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
        u32 pkrs;
        u32 thread_pkrs;
#endif  
	union {
		bool    exit_rcu;
		bool	lockdep;
	};
} irqentry_state_t;

Because the pkrs information is in addition to exit_rcu OR lockdep.

So this is what I meant by 'could be changed easy enough in the follow on
patch'.

Is that clear?

Ira
Thomas Gleixner Oct. 19, 2020, 9:12 p.m. UTC | #23
On Mon, Oct 19 2020 at 13:26, Ira Weiny wrote:
> On Mon, Oct 19, 2020 at 11:32:50AM +0200, Thomas Gleixner wrote:
> Sorry, let me clarify.  After this patch we have.
>
> typedef union irqentry_state {
> 	bool	exit_rcu;
> 	bool	lockdep;
> } irqentry_state_t;
>
> Which reflects the mutual exclusion of the 2 variables.

Huch? From the patch I gave you:

 #ifndef irqentry_state
 typedef struct irqentry_state {
 	bool    exit_rcu;
+       bool    lockdep;
 } irqentry_state_t;
 #endif

How is that a union?

> But then when the pkrs stuff is added the union changes back to a structure and
> looks like this.

So you want:

  1) Move stuff to struct irqentry_state (my patch)

  2) Change it to a union and pass it as pointer at the same time

  3) Change it back to struct to add PKRS

> Is that clear?


What's clear is that the above is nonsense. We can just do

 #ifndef irqentry_state
 typedef struct irqentry_state {
 	union {
         	bool    exit_rcu;
                bool    lockdep;
        };        
 } irqentry_state_t;
 #endif

right in the patch which I gave you. Because that actually makes sense.

Thanks,

        tglx
Ira Weiny Oct. 20, 2020, 2:10 p.m. UTC | #24
On Mon, Oct 19, 2020 at 11:12:44PM +0200, Thomas Gleixner wrote:
> On Mon, Oct 19 2020 at 13:26, Ira Weiny wrote:
> > On Mon, Oct 19, 2020 at 11:32:50AM +0200, Thomas Gleixner wrote:
> > Sorry, let me clarify.  After this patch we have.
> >
> > typedef union irqentry_state {
> > 	bool	exit_rcu;
> > 	bool	lockdep;
> > } irqentry_state_t;
> >
> > Which reflects the mutual exclusion of the 2 variables.
> 
> Huch? From the patch I gave you:
> 
>  #ifndef irqentry_state
>  typedef struct irqentry_state {
>  	bool    exit_rcu;
> +       bool    lockdep;
>  } irqentry_state_t;
>  #endif
> 
> How is that a union?

I was proposing to make it a union.

> 
> > But then when the pkrs stuff is added the union changes back to a structure and
> > looks like this.
> 
> So you want:
> 
>   1) Move stuff to struct irqentry_state (my patch)
> 
>   2) Change it to a union and pass it as pointer at the same time

No, I would have made it a union in your patch.

Pass by reference would remain largely the same.

> 
>   3) Change it back to struct to add PKRS

Yes.  :-/

> 
> > Is that clear?
> 
> What's clear is that the above is nonsense. We can just do
> 
>  #ifndef irqentry_state
>  typedef struct irqentry_state {
>  	union {
>          	bool    exit_rcu;
>                 bool    lockdep;
>         };        
>  } irqentry_state_t;
>  #endif
> 
> right in the patch which I gave you. Because that actually makes sense.

Ok, I'm very sorry.  I was thinking that having a struct containing nothing but
an anonymous union would be unacceptable as a standalone item in your patch.
In my experience other maintainers would have rejected such a change and
would have asked, 'why not just make it a union'?

I'm very happy skipping the gymnastics on individual patches in favor of making
the whole series work out in the end.

Thank you for your help again.  :-)

Ira