[RFC,v2,00/20] context_tracking,x86: Defer some IPIs until a user->kernel transition

Message ID 20230720163056.2564824-1-vschneid@redhat.com

Message

Valentin Schneider July 20, 2023, 4:30 p.m. UTC
Context
=======

We've observed within Red Hat that isolated, NOHZ_FULL CPUs running a
pure-userspace application get regularly interrupted by IPIs sent from
housekeeping CPUs. Those IPIs are caused by activity on the housekeeping CPUs
leading to various on_each_cpu() calls, e.g.:

  64359.052209596    NetworkManager       0    1405     smp_call_function_many_cond (cpu=0, func=do_kernel_range_flush)
    smp_call_function_many_cond+0x1
    smp_call_function+0x39
    on_each_cpu+0x2a
    flush_tlb_kernel_range+0x7b
    __purge_vmap_area_lazy+0x70
    _vm_unmap_aliases.part.42+0xdf
    change_page_attr_set_clr+0x16a
    set_memory_ro+0x26
    bpf_int_jit_compile+0x2f9
    bpf_prog_select_runtime+0xc6
    bpf_prepare_filter+0x523
    sk_attach_filter+0x13
    sock_setsockopt+0x92c
    __sys_setsockopt+0x16a
    __x64_sys_setsockopt+0x20
    do_syscall_64+0x87
    entry_SYSCALL_64_after_hwframe+0x65

The heart of this series is the thought that while we cannot remove NOHZ_FULL
CPUs from the list of CPUs targeted by these IPIs, they may not have to execute
the callbacks immediately. Anything that only affects kernelspace can wait
until the next user->kernel transition, provided it can be executed "early
enough" in the entry code.

The original implementation is from Peter [1]. Nicolas then added kernel TLB
invalidation deferral to that [2], and I picked it up from there.

Deferral approach
=================

Storing each and every callback, like a secondary call_single_queue, turned out
to be a no-go: the whole point of deferral is to keep NOHZ_FULL CPUs in
userspace for as long as possible - no signal of any form would be sent when
deferring an IPI. This means that any form of queuing for deferred callbacks
would end up as a convoluted memory leak.

Deferred IPIs must thus be coalesced, which this series achieves by assigning
IPIs a "type" and having a mapping of IPI type to callback, leveraged upon
kernel entry.
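
As a rough sketch - illustrative only, names and details differ in the actual
patches (see context_tracking_work.h) - the mechanism boils down to:

  /*
   * Each deferrable IPI gets a work bit; bits are accumulated in a per-CPU
   * variable while the target CPU runs in userspace, and flushed upon
   * kernel entry.
   */
  enum ct_work {
          CT_WORK_SYNC_CORE,      /* deferred text_poke() serialization */
          CT_WORK_TLB_FLUSH,      /* deferred kernel TLB invalidation */
          CT_WORK_MAX
  };

  static void ct_sync_core(void)
  {
          sync_core();
  }

  static void ct_flush_tlb_all(void)
  {
          __flush_tlb_all();
  }

  static void (*ct_work_map[CT_WORK_MAX])(void) = {
          [CT_WORK_SYNC_CORE] = ct_sync_core,
          [CT_WORK_TLB_FLUSH] = ct_flush_tlb_all,
  };

  /* Executed early in the user->kernel transition */
  static void ct_work_flush(unsigned long work)
  {
          unsigned int bit;

          /* Setting the same bit N times coalesces into a single run */
          for_each_set_bit(bit, &work, CT_WORK_MAX)
                  ct_work_map[bit]();
  }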

What about IPIs whose callbacks take a parameter, you may ask?

Peter suggested during OSPM23 [3] that since on_each_cpu() targets
housekeeping CPUs *and* isolated CPUs, isolated CPUs can access either global or
housekeeping-CPU-local state to "reconstruct" the data that would have been sent
via the IPI.

This series does not affect any IPI callback that requires an argument, but the
approach would remain the same (one coalescable callback executed on kernel
entry).
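
For illustration - entirely hypothetical, nothing in this series does this - a
parameterized callback could be reconstructed by having the sender publish the
"argument" in global state:

  /* IPI path today: on_each_cpu(do_thing, &arg, 1); */
  static unsigned long thing_seq;       /* published by the sender */

  /* Deferred variant: no argument, re-read global state on kernel entry */
  static void ct_do_thing(void)
  {
          do_thing_up_to(READ_ONCE(thing_seq));
  }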

Kernel entry vs execution of the deferred operation
===================================================

There is a non-zero length of code that is executed upon kernel entry before the
deferred operation can itself be executed (i.e. before we start getting into
context_tracking.c proper).

This means one must take extra care about what can happen in the early entry
code, and ensure that <bad things> cannot happen. For instance, we really don't want to hit
instructions that have been modified by a remote text_poke() while we're on our
way to execute a deferred sync_core().

Patches
=======

o Patches 1-9 have been submitted separately and are included for the sake of
  testing 

o Patches 10-14 focus on having objtool detect problematic static key usage in
  early entry
  
o Patch 15 adds the infrastructure for IPI deferral.
o Patches 16-17 add some RCU testing infrastructure
o Patch 18 adds text_poke() IPI deferral.

o Patches 19-20 add vunmap() flush_tlb_kernel_range() IPI deferral

  These ones I'm a lot less confident about, mostly due to lacking
  instrumentation/verification.
  
  The actual deferred callback is also incomplete as it's not properly noinstr:
    vmlinux.o: warning: objtool: __flush_tlb_all_noinstr+0x19: call to native_write_cr4() leaves .noinstr.text section
  and it doesn't support PARAVIRT - it's going to need a pv_ops.mmu entry, but I
  have *no idea* what a sane implementation would be for Xen so I haven't
  touched that yet.

Patches are also available at:

https://gitlab.com/vschneid/linux.git -b redhat/isolirq/defer/v2

Testing
=======

Note: this is a different machine than the one used for v1, because that machine decided
to act difficult.

Xeon E5-2699 system with SMToff, NOHZ_FULL, isolated CPUs.
RHEL9 userspace.

Workload is using rteval (kernel compilation + hackbench) on housekeeping CPUs
and a dummy stay-in-userspace loop on the isolated CPUs. The main invocation is:

$ trace-cmd record -e "csd_queue_cpu" -f "cpu & CPUS{$ISOL_CPUS}" \
 	           -e "ipi_send_cpumask" -f "cpumask & CPUS{$ISOL_CPUS}" \
	           -e "ipi_send_cpu"     -f "cpu & CPUS{$ISOL_CPUS}" \
		   rteval --onlyload --loads-cpulist=$HK_CPUS \
		   --hackbench-runlowmem=True --duration=$DURATION

This only records IPIs sent to isolated CPUs, so any event there is interference
(with a bit of fuzz at the start/end of the workload when spawning the
processes). All tests were done with a duration of 30 minutes.

v6.5-rc1 (+ cpumask filtering patches):
# This is the actual IPI count
$ trace-cmd report | grep callback | awk '{ print $(NF) }' | sort | uniq -c | sort -nr
    338 callback=generic_smp_call_function_single_interrupt+0x0

# These are the different CSDs that caused IPIs
$ trace-cmd report | grep csd_queue | awk '{ print $(NF-1) }' | sort | uniq -c | sort -nr
   9207 func=do_flush_tlb_all
   1116 func=do_sync_core
     62 func=do_kernel_range_flush
      3 func=nohz_full_kick_func

v6.5-rc1 + patches:
# This is the actual IPI count
$ trace-cmd report | grep callback | awk '{ print $(NF) }' | sort | uniq -c | sort -nr
      2 callback=generic_smp_call_function_single_interrupt+0x0

# These are the different CSDs that caused IPIs
$ trace-cmd report | grep csd_queue | awk '{ print $(NF-1) }' | sort | uniq -c | sort -nr
      2 func=nohz_full_kick_func

The incriminating IPIs are all gone, but note that on the machine I used to test
v1 there were still some do_flush_tlb_all() IPIs caused by
pcpu_balance_workfn(), since only vmalloc is affected by the deferral
mechanism.


Acknowledgements
================

Special thanks to:
o Clark Williams for listening to my ramblings about this and throwing ideas my way
o Josh Poimboeuf for his guidance regarding objtool and hinting at the
  .data..ro_after_init section.

Links
=====

[1]: https://lore.kernel.org/all/20210929151723.162004989@infradead.org/
[2]: https://github.com/vianpl/linux.git -b ct-work-defer-wip
[3]: https://youtu.be/0vjE6fjoVVE

Revisions
=========

RFCv1 -> RFCv2
++++++++++++++

o Rebased onto v6.5-rc1

o Updated the trace filter patches (Steven)

o Fixed __ro_after_init keys used in modules (Peter)
o Dropped the extra context_tracking atomic, squashed the new bits in the
  existing .state field (Peter, Frederic)
  
o Added an RCU_EXPERT config for the RCU dynticks counter size, and added an
  rcutorture case for a low-size counter (Paul)

  The new TREE11 case with a 2-bit dynticks counter seems to pass when run
  against this series.

o Fixed flush_tlb_kernel_range_deferrable() definition

Peter Zijlstra (1):
  jump_label,module: Don't alloc static_key_mod for __ro_after_init keys

Valentin Schneider (19):
  tracing/filters: Dynamically allocate filter_pred.regex
  tracing/filters: Enable filtering a cpumask field by another cpumask
  tracing/filters: Enable filtering a scalar field by a cpumask
  tracing/filters: Enable filtering the CPU common field by a cpumask
  tracing/filters: Optimise cpumask vs cpumask filtering when user mask
    is a single CPU
  tracing/filters: Optimise scalar vs cpumask filtering when the user
    mask is a single CPU
  tracing/filters: Optimise CPU vs cpumask filtering when the user mask
    is a single CPU
  tracing/filters: Further optimise scalar vs cpumask comparison
  tracing/filters: Document cpumask filtering
  objtool: Flesh out warning related to pv_ops[] calls
  objtool: Warn about non __ro_after_init static key usage in .noinstr
  context_tracking: Make context_tracking_key __ro_after_init
  x86/kvm: Make kvm_async_pf_enabled __ro_after_init
  context-tracking: Introduce work deferral infrastructure
  rcu: Make RCU dynticks counter size configurable
  rcutorture: Add a test config to torture test low RCU_DYNTICKS width
  context_tracking,x86: Defer kernel text patching IPIs
  context_tracking,x86: Add infrastructure to defer kernel TLBI
  x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL
    CPUs

 Documentation/trace/events.rst                |  14 +
 arch/Kconfig                                  |   9 +
 arch/x86/Kconfig                              |   1 +
 arch/x86/include/asm/context_tracking_work.h  |  20 ++
 arch/x86/include/asm/text-patching.h          |   1 +
 arch/x86/include/asm/tlbflush.h               |   2 +
 arch/x86/kernel/alternative.c                 |  24 +-
 arch/x86/kernel/kprobes/core.c                |   4 +-
 arch/x86/kernel/kprobes/opt.c                 |   4 +-
 arch/x86/kernel/kvm.c                         |   2 +-
 arch/x86/kernel/module.c                      |   2 +-
 arch/x86/mm/tlb.c                             |  40 ++-
 include/asm-generic/sections.h                |   5 +
 include/linux/context_tracking.h              |  26 ++
 include/linux/context_tracking_state.h        |  65 +++-
 include/linux/context_tracking_work.h         |  28 ++
 include/linux/jump_label.h                    |   1 +
 include/linux/trace_events.h                  |   1 +
 init/main.c                                   |   1 +
 kernel/context_tracking.c                     |  53 ++-
 kernel/jump_label.c                           |  49 +++
 kernel/rcu/Kconfig                            |  33 ++
 kernel/time/Kconfig                           |   5 +
 kernel/trace/trace_events_filter.c            | 302 ++++++++++++++++--
 mm/vmalloc.c                                  |  19 +-
 tools/objtool/check.c                         |  22 +-
 tools/objtool/include/objtool/check.h         |   1 +
 tools/objtool/include/objtool/special.h       |   2 +
 tools/objtool/special.c                       |   3 +
 .../selftests/rcutorture/configs/rcu/TREE11   |  19 ++
 .../rcutorture/configs/rcu/TREE11.boot        |   1 +
 31 files changed, 695 insertions(+), 64 deletions(-)
 create mode 100644 arch/x86/include/asm/context_tracking_work.h
 create mode 100644 include/linux/context_tracking_work.h
 create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TREE11
 create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TREE11.boot

--
2.31.1

Comments

Paul E. McKenney July 21, 2023, 4 a.m. UTC | #1
On Thu, Jul 20, 2023 at 12:53:05PM -0700, Paul E. McKenney wrote:
> On Thu, Jul 20, 2023 at 05:30:53PM +0100, Valentin Schneider wrote:
> > We now have an RCU_EXPERT knob for configuring the size of the dynticks
> > counter: CONFIG_RCU_DYNTICKS_BITS.
> > 
> > Add a torture config for a ridiculously small counter (2 bits). This is a
> > copy of TREE04 with the added counter size restriction.
> > 
> > Link: http://lore.kernel.org/r/4c2cb573-168f-4806-b1d9-164e8276e66a@paulmck-laptop
> > Suggested-by: Paul E. McKenney <paulmck@kernel.org>
> > Signed-off-by: Valentin Schneider <vschneid@redhat.com>
> > ---
> >  .../selftests/rcutorture/configs/rcu/TREE11   | 19 +++++++++++++++++++
> >  .../rcutorture/configs/rcu/TREE11.boot        |  1 +
> >  2 files changed, 20 insertions(+)
> >  create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TREE11
> >  create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TREE11.boot
> > 
> > diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE11 b/tools/testing/selftests/rcutorture/configs/rcu/TREE11
> > new file mode 100644
> > index 0000000000000..aa7274efd9819
> > --- /dev/null
> > +++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE11
> > @@ -0,0 +1,19 @@
> > +CONFIG_SMP=y
> > +CONFIG_NR_CPUS=8
> > +CONFIG_PREEMPT_NONE=n
> > +CONFIG_PREEMPT_VOLUNTARY=y
> > +CONFIG_PREEMPT=n
> > +CONFIG_PREEMPT_DYNAMIC=n
> > +#CHECK#CONFIG_TREE_RCU=y
> > +CONFIG_HZ_PERIODIC=n
> > +CONFIG_NO_HZ_IDLE=n
> > +CONFIG_NO_HZ_FULL=y
> > +CONFIG_RCU_TRACE=y
> > +CONFIG_RCU_FANOUT=4
> > +CONFIG_RCU_FANOUT_LEAF=3
> > +CONFIG_DEBUG_LOCK_ALLOC=n
> > +CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
> > +CONFIG_RCU_EXPERT=y
> > +CONFIG_RCU_EQS_DEBUG=y
> > +CONFIG_RCU_LAZY=y
> > +CONFIG_RCU_DYNTICKS_BITS=2
> 
> Why not just add this last line to the existing TREE04 scenario?
> That would ensure that it gets tested regularly without extending the
> time required to run a full set of rcutorture tests.

Please see below for the version of this patch that I am running overnight
tests with.  Does this one work for you?

							Thanx, Paul

------------------------------------------------------------------------

commit 1aa13731e665193cd833edac5ebc86a9c3fea2b7
Author: Valentin Schneider <vschneid@redhat.com>
Date:   Thu Jul 20 20:58:41 2023 -0700

    rcutorture: Add a test config to torture test low RCU_DYNTICKS width
    
    We now have an RCU_EXPERT knob for configuring the size of the dynticks
    counter: CONFIG_RCU_DYNTICKS_BITS.
    
    Modify scenario TREE04 to exercise a ridiculously small counter
    (2 bits).
    
    Link: http://lore.kernel.org/r/4c2cb573-168f-4806-b1d9-164e8276e66a@paulmck-laptop
    Suggested-by: Paul E. McKenney <paulmck@kernel.org>
    Signed-off-by: Valentin Schneider <vschneid@redhat.com>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE04 b/tools/testing/selftests/rcutorture/configs/rcu/TREE04
index dc4985064b3a..aa7274efd981 100644
--- a/tools/testing/selftests/rcutorture/configs/rcu/TREE04
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE04
@@ -16,3 +16,4 @@ CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
 CONFIG_RCU_EXPERT=y
 CONFIG_RCU_EQS_DEBUG=y
 CONFIG_RCU_LAZY=y
+CONFIG_RCU_DYNTICKS_BITS=2
Valentin Schneider July 21, 2023, 3:08 p.m. UTC | #2
On 21/07/23 07:07, Paul E. McKenney wrote:
> On Fri, Jul 21, 2023 at 08:58:53AM +0100, Valentin Schneider wrote:
>> On 20/07/23 21:00, Paul E. McKenney wrote:
>> > On Thu, Jul 20, 2023 at 12:53:05PM -0700, Paul E. McKenney wrote:
>> >> On Thu, Jul 20, 2023 at 05:30:53PM +0100, Valentin Schneider wrote:
>> >> > diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE11 b/tools/testing/selftests/rcutorture/configs/rcu/TREE11
>> >> > new file mode 100644
>> >> > index 0000000000000..aa7274efd9819
>> >> > --- /dev/null
>> >> > +++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE11
>> >> > @@ -0,0 +1,19 @@
>> >> > +CONFIG_SMP=y
>> >> > +CONFIG_NR_CPUS=8
>> >> > +CONFIG_PREEMPT_NONE=n
>> >> > +CONFIG_PREEMPT_VOLUNTARY=y
>> >> > +CONFIG_PREEMPT=n
>> >> > +CONFIG_PREEMPT_DYNAMIC=n
>> >> > +#CHECK#CONFIG_TREE_RCU=y
>> >> > +CONFIG_HZ_PERIODIC=n
>> >> > +CONFIG_NO_HZ_IDLE=n
>> >> > +CONFIG_NO_HZ_FULL=y
>> >> > +CONFIG_RCU_TRACE=y
>> >> > +CONFIG_RCU_FANOUT=4
>> >> > +CONFIG_RCU_FANOUT_LEAF=3
>> >> > +CONFIG_DEBUG_LOCK_ALLOC=n
>> >> > +CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
>> >> > +CONFIG_RCU_EXPERT=y
>> >> > +CONFIG_RCU_EQS_DEBUG=y
>> >> > +CONFIG_RCU_LAZY=y
>> >> > +CONFIG_RCU_DYNTICKS_BITS=2
>> >>
>> >> Why not just add this last line to the existing TREE04 scenario?
>> >> That would ensure that it gets tested regularly without extending the
>> >> time required to run a full set of rcutorture tests.
>> >
>> > Please see below for the version of this patch that I am running overnight
>> > tests with.  Does this one work for you?
>>
>> Yep that's fine with me. I only went with a separate test file as I wasn't
>> sure how new test options should be handled (merged into existing tests vs
>> new tests created), and didn't want to negatively impact TREE04 or
>> TREE06. If merging into TREE04 is preferred, then I'll do just that and
>> carry this patch moving forwards.
>
> Things worked fine for this one-hour-per-scenario test run on my laptop,

Many thanks for testing!

> except for the CONFIG_SMP=n runs, which all got build errors like the
> following.
>

Harumph, yes !SMP (and !CONTEXT_TRACKING_WORK) doesn't compile nicely, I'll
fix that for v3.
Valentin Schneider July 24, 2023, 4:55 p.m. UTC | #3
On 24/07/23 16:52, Frederic Weisbecker wrote:
> On Thu, Jul 20, 2023 at 05:30:51PM +0100, Valentin Schneider wrote:
>> +enum ctx_state {
>> +	/* Following are values */
>> +	CONTEXT_DISABLED	= -1,	/* returned by ct_state() if unknown */
>> +	CONTEXT_KERNEL		= 0,
>> +	CONTEXT_IDLE		= 1,
>> +	CONTEXT_USER		= 2,
>> +	CONTEXT_GUEST		= 3,
>> +	CONTEXT_MAX             = 4,
>> +};
>> +
>> +/*
>> + * We cram three different things within the same atomic variable:
>> + *
>> + *                CONTEXT_STATE_END                        RCU_DYNTICKS_END
>> + *                         |       CONTEXT_WORK_END                |
>> + *                         |               |                       |
>> + *                         v               v                       v
>> + *         [ context_state ][ context work ][ RCU dynticks counter ]
>> + *         ^                ^               ^
>> + *         |                |               |
>> + *         |        CONTEXT_WORK_START      |
>> + * CONTEXT_STATE_START              RCU_DYNTICKS_START
>
> Should the layout be displayed in reverse? Well at least I always picture
> bitmaps in reverse, that's probably due to the direction of the shift arrows.
> Not sure what is the usual way to picture it though...
>

Surprisingly, I managed to confuse myself with that comment :-)  I think I
am subconsciously more used to the reverse as well. I've flipped that and
put "MSB" / "LSB" at either end.

>> + */
>> +
>> +#define CT_STATE_SIZE (sizeof(((struct context_tracking *)0)->state) * BITS_PER_BYTE)
>> +
>> +#define CONTEXT_STATE_START 0
>> +#define CONTEXT_STATE_END   (bits_per(CONTEXT_MAX - 1) - 1)
>
> Since you have non overlapping *_START symbols, perhaps the *_END
> are superfluous?
>

They're only really there to tidy up the GENMASK() further down - it keeps
the range and index definitions in one hunk. I tried defining that directly
within the GENMASK() themselves but it got too ugly IMO.
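
That is, with the *_END symbols the masks read as something like:

  #define CONTEXT_STATE_MASK GENMASK(CONTEXT_STATE_END, CONTEXT_STATE_START)
  #define CONTEXT_WORK_MASK  GENMASK(CONTEXT_WORK_END,  CONTEXT_WORK_START)

rather than having the bits_per() arithmetic open-coded in each GENMASK().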

>> +
>> +#define RCU_DYNTICKS_BITS  (IS_ENABLED(CONFIG_CONTEXT_TRACKING_WORK) ? 16 : 31)
>> +#define RCU_DYNTICKS_START (CT_STATE_SIZE - RCU_DYNTICKS_BITS)
>> +#define RCU_DYNTICKS_END   (CT_STATE_SIZE - 1)
>> +#define RCU_DYNTICKS_IDX   BIT(RCU_DYNTICKS_START)
>
> Might be the right time to standardize and fix our naming:
>
> CT_STATE_START,
> CT_STATE_KERNEL,
> CT_STATE_USER,
> ...
> CT_WORK_START,
> CT_WORK_*,
> ...
> CT_RCU_DYNTICKS_START,
> CT_RCU_DYNTICKS_IDX
>

Heh, I have actually already done this for v3, though I hadn't touched the
RCU_DYNTICKS* family. I'll fold that in.

>> +bool ct_set_cpu_work(unsigned int cpu, unsigned int work)
>> +{
>> +	struct context_tracking *ct = per_cpu_ptr(&context_tracking, cpu);
>> +	unsigned int old;
>> +	bool ret = false;
>> +
>> +	preempt_disable();
>> +
>> +	old = atomic_read(&ct->state);
>> +	/*
>> +	 * Try setting the work until either
>> +	 * - the target CPU no longer accepts any more deferred work
>> +	 * - the work has been set
>> +	 *
>> +	 * NOTE: CONTEXT_GUEST intersects with CONTEXT_USER and CONTEXT_IDLE
>> +	 * as they are regular integers rather than bits, but that doesn't
>> +	 * matter here: if any of the context state bit is set, the CPU isn't
>> +	 * in kernel context.
>> +	 */
>> +	while ((old & (CONTEXT_GUEST | CONTEXT_USER | CONTEXT_IDLE)) && !ret)
>
> That may still miss a recent entry to userspace due to the first plain read, ending
> with an undesired interrupt.
>
> You need at least one cmpxchg. Well, of course that stays racy by nature because
> between the cmpxchg() returning CONTEXT_KERNEL and the actual IPI raised and
> received, the remote CPU may have gone to userspace already. But still it limits
> a little the window.
>

I can make that a 'do {} while ()' instead to force at least one execution
of the cmpxchg().

This is only about reducing the race window, right? If we're executing this
just as the target CPU is about to enter userspace, we're going to be in
racy territory anyway. Regardless, I'm happy to do that change.
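
Something like this (sketch, untested):

  old = atomic_read(&ct->state);
  do {
          /*
           * Unlike the initial plain read, a failed cmpxchg refreshes @old
           * with the current value, so we always get at least one up-to-date
           * look at the remote CPU's state.
           */
          ret = atomic_try_cmpxchg(&ct->state, &old,
                                   old | (work << CONTEXT_WORK_START));
  } while (!ret && (old & (CONTEXT_GUEST | CONTEXT_USER | CONTEXT_IDLE)));

  /*
   * A cmpxchg succeeding against a kernel-context state must not count as
   * a deferral: the caller still has to send the IPI, and the spuriously
   * set work bit just gets flushed (harmlessly) at the next transition.
   */
  ret = ret && (old & (CONTEXT_GUEST | CONTEXT_USER | CONTEXT_IDLE));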
Frederic Weisbecker July 24, 2023, 7:18 p.m. UTC | #4
On Mon, Jul 24, 2023 at 05:55:44PM +0100, Valentin Schneider wrote:
> I can make that a 'do {} while ()' instead to force at least one execution
> of the cmpxchg().
> 
> This is only about reducing the race window, right? If we're executing this
> just as the target CPU is about to enter userspace, we're going to be in
> racy territory anyway. Regardless, I'm happy to do that change.

Right, it's only about narrowing down the race window. It probably doesn't
matter in practice, but it's one less thing to consider for the brain :-)

Also, why bothering with handling CONTEXT_IDLE?

Thanks.
Valentin Schneider July 25, 2023, 10:10 a.m. UTC | #5
On 24/07/23 21:18, Frederic Weisbecker wrote:
> On Mon, Jul 24, 2023 at 05:55:44PM +0100, Valentin Schneider wrote:
>> I can make that a 'do {} while ()' instead to force at least one execution
>> of the cmpxchg().
>>
>> This is only about reducing the race window, right? If we're executing this
>> just as the target CPU is about to enter userspace, we're going to be in
>> racy territory anyway. Regardless, I'm happy to do that change.
>
> Right, it's only about narrowing down the race window. It probably doesn't
> matter in practice, but it's one less thing to consider for the brain :-)
>

Ack

> Also, why bothering with handling CONTEXT_IDLE?
>

I have reasons! I just swept them under the rug and didn't mention them :D
Also, looking at the config dependencies again, I got it wrong, but
nevertheless that means I get to ramble about it.

With NO_HZ_IDLE, we get CONTEXT_TRACKING_IDLE, so we get these
transitions:

  ct_idle_enter()
    ct_kernel_exit()
      ct_state_inc_clear_work()

  ct_idle_exit()
    ct_kernel_enter()
      ct_work_flush()

Now, if we just make CONTEXT_TRACKING_WORK depend on CONTEXT_TRACKING_IDLE
rather than CONTEXT_TRACKING_USER, we get to leverage the IPI deferral for
NO_HZ_IDLE kernels - in other words, we get to keep idle CPUs idle longer.

It's a completely different argument than reducing interference for
NOHZ_FULL userspace applications and I should have at the very least
mentioned it in the cover letter, but it's the exact same backing
mechanism.

Looking at it again, I'll probably make the CONTEXT_IDLE thing a separate
patch with a proper changelog.
Valentin Schneider July 25, 2023, 1:05 p.m. UTC | #6
On 25/07/23 13:22, Frederic Weisbecker wrote:
> On Tue, Jul 25, 2023 at 11:10:31AM +0100, Valentin Schneider wrote:
>> I have reasons! I just swept them under the rug and didn't mention them :D
>> Also looking at the config dependencies again I got it wrong, but
>> nevertheless that means I get to ramble about it.
>>
>> With NO_HZ_IDLE, we get CONTEXT_TRACKING_IDLE, so we get these
>> transitions:
>>
>>   ct_idle_enter()
>>     ct_kernel_exit()
>>       ct_state_inc_clear_work()
>>
>>   ct_idle_exit()
>>     ct_kernel_enter()
>>       ct_work_flush()
>>
>> Now, if we just make CONTEXT_TRACKING_WORK depend on CONTEXT_TRACKING_IDLE
>> rather than CONTEXT_TRACKING_USER, we get to leverage the IPI deferral for
>> NO_HZ_IDLE kernels - in other words, we get to keep idle CPUs idle longer.
>>
>> It's a completely different argument than reducing interference for
>> NOHZ_FULL userspace applications and I should have at the very least
>> mentioned it in the cover letter, but it's the exact same backing
>> mechanism.
>>
>> Looking at it again, I'll probably make the CONTEXT_IDLE thing a separate
>> patch with a proper changelog.
>
> Ok, should that be a separate Kconfig? This indeed can bring power improvement
> but at the cost of more overhead from the sender. A balance to be measured...

Yep agreed, I'll make that an optional config.
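
Roughly (sketch; symbol name and wording TBD):

  config CONTEXT_TRACKING_WORK_IDLE
          bool "Defer kernel IPIs targeting idle CPUs"
          depends on CONTEXT_TRACKING_WORK && CONTEXT_TRACKING_IDLE
          default n
          help
            Defer some kernel-only IPIs targeting idle CPUs until their next
            idle exit. This keeps idle CPUs idle for longer, at the cost of
            extra work on the IPI sender's side.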
Valentin Schneider July 25, 2023, 2:03 p.m. UTC | #7
Sorry, I missed Dave's email, so now I'm taking my time to page (hah!) all
of this back in.

On 25/07/23 15:21, Peter Zijlstra wrote:
> On Mon, Jul 24, 2023 at 10:40:04AM -0700, Dave Hansen wrote:
>
>> TLB flushes for freed page tables are another game entirely.  The CPU is
>> free to cache any part of the paging hierarchy it wants at any time.
>> It's also free to set accessed and dirty bits at any time, even for
>> instructions that may never execute architecturally.
>>
>> That basically means that if you have *ANY* freed page table page
>> *ANYWHERE* in the page table hierarchy of any CPU at any time ... you're
>> screwed.
>>
>> There's no reasoning about accesses or ordering.  As soon as the CPU
>> does *anything*, it's out to get you.
>>

OK, I feel like I need to go back and do some more reading now, but I think I
get the difference. Thanks for spelling it out.

>> You're going to need to do something a lot more radical to deal with
>> free page table pages.
>
> Ha! IIRC the only thing we can reasonably do there is to have strict
> per-cpu page-tables such that NOHZ_FULL CPUs can be isolated. That is,
> as long as the per-cpu tables do not contain -- and have never contained
> -- a particular table page, we can avoid flushing it. Because if it
> never was there, it also couldn't have speculatively loaded it.
>
> Now, x86 doesn't really do per-cpu page tables easily (otherwise we'd
> have done them ages ago) and doing them is going to be *major* surgery
> and pain.
>
> Other than that, we must take the TLBI-IPI when freeing
> page-table-pages.
>
>
> But yeah, I think Nadav is right, vmalloc.c never frees page-tables (or
> at least, I couldn't find it in a hurry either), but if we're going to
> be doing this, then that file must include a very prominent comment
> explaining it must never actually do so either.
>

I also couldn't find any freeing of the page-table-pages, I'll do another
pass and sharpen my quill for a big fat comment.
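
Something along these lines (draft):

  /*
   * !!! BIG FAT WARNING !!!
   *
   * vmalloc does not - and must never - free page-table pages: NOHZ_FULL
   * CPUs may defer kernel range TLB flushes to their next kernel entry,
   * which is only safe against stale *leaf* entries. The CPU is free to
   * cache any part of the paging hierarchy and to speculatively set
   * accessed/dirty bits at any time, so freeing a page-table page without
   * an immediate TLB flush on all CPUs would be a use-after-free.
   */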

> Not being able to free page-tables might be a 'problem' if we're going
> to be doing more of HUGE_VMALLOC, because that means it becomes rather
> hard to swizzle from small to large pages.
Dave Hansen July 25, 2023, 5:12 p.m. UTC | #8
On 7/25/23 09:37, Marcelo Tosatti wrote:
>> TLB flushes for freed page tables are another game entirely.  The CPU is
>> free to cache any part of the paging hierarchy it wants at any time.
> Depend on CONFIG_PAGE_TABLE_ISOLATION=y, which flushes TLB (and page
> table caches) on user->kernel and kernel->user context switches ?

Well, first of all, CONFIG_PAGE_TABLE_ISOLATION doesn't flush the TLB at
all on user<->kernel switches when PCIDs are enabled.

Second, even if it did, the CPU is still free to cache any portion of
the paging hierarchy at any time.  Without LASS[1], userspace can even
_compel_ walks of the kernel portion of the address space, and we don't
have any infrastructure to tell if a freed kernel page is exposed in the
user copy of the page tables with PTI.

Third, (also ignoring PCIDs) there are plenty of instructions between
kernel entry and the MOV-to-CR3 that can flush the TLB.  All those
instructions architecturally permitted to speculatively set Accessed or
Dirty bits in any part of the address space.  If they run into a free
page table page, things get ugly.

These accesses are not _likely_.  There probably isn't a predictor out
there that's going to see a:

	movq    %rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)

and go off trying to dirty memory in the vmalloc() area.  But we'd need
some backward *and* forward-looking guarantees from our intrepid CPU
designers to promise that this kind of thing is safe yesterday, today
and tomorrow.  I suspect such a guarantee is going to be hard to obtain.

1. https://lkml.kernel.org/r/20230110055204.3227669-1-yian.chen@intel.com
Josh Poimboeuf July 26, 2023, 7:41 p.m. UTC | #9
On Thu, Jul 20, 2023 at 05:30:38PM +0100, Valentin Schneider wrote:
>  int filter_assign_type(const char *type)
>  {
> -	if (strstr(type, "__data_loc") && strstr(type, "char"))
> -		return FILTER_DYN_STRING;
> +	if (strstr(type, "__data_loc")) {
> +		if (strstr(type, "char"))
> +			return FILTER_DYN_STRING;
> +		if (strstr(type, "cpumask_t"))
> +			return FILTER_CPUMASK;
> +		}

The closing bracket has the wrong indentation.

> +		/* Copy the cpulist between { and } */
> +		tmp = kmalloc((i - maskstart) + 1, GFP_KERNEL);
> +		strscpy(tmp, str + maskstart, (i - maskstart) + 1);

Need to check kmalloc() failure?  And also free tmp?
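
IOW, something like (sketch):

	/* Copy the cpulist between { and } */
	tmp = kmalloc((i - maskstart) + 1, GFP_KERNEL);
	if (!tmp)
		goto err_mem;
	strscpy(tmp, str + maskstart, (i - maskstart) + 1);

	pred->mask = kzalloc(cpumask_size(), GFP_KERNEL);
	if (!pred->mask) {
		kfree(tmp);
		goto err_mem;
	}

	/* Now parse it */
	if (cpulist_parse(tmp, pred->mask)) {
		kfree(tmp);
		parse_error(pe, FILT_ERR_INVALID_CPULIST, pos + i);
		goto err_free;
	}
	kfree(tmp);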

> +
> +		pred->mask = kzalloc(cpumask_size(), GFP_KERNEL);
> +		if (!pred->mask)
> +			goto err_mem;
> +
> +		/* Now parse it */
> +		if (cpulist_parse(tmp, pred->mask)) {
> +			parse_error(pe, FILT_ERR_INVALID_CPULIST, pos + i);
> +			goto err_free;
> +		}
> +
> +		/* Move along */
> +		i++;
> +		if (field->filter_type == FILTER_CPUMASK)
> +			pred->fn_num = FILTER_PRED_FN_CPUMASK;
> +
Valentin Schneider July 27, 2023, 9:46 a.m. UTC | #10
On 26/07/23 12:41, Josh Poimboeuf wrote:
> On Thu, Jul 20, 2023 at 05:30:38PM +0100, Valentin Schneider wrote:
>>  int filter_assign_type(const char *type)
>>  {
>> -	if (strstr(type, "__data_loc") && strstr(type, "char"))
>> -		return FILTER_DYN_STRING;
>> +	if (strstr(type, "__data_loc")) {
>> +		if (strstr(type, "char"))
>> +			return FILTER_DYN_STRING;
>> +		if (strstr(type, "cpumask_t"))
>> +			return FILTER_CPUMASK;
>> +		}
>
> The closing bracket has the wrong indentation.
>
>> +		/* Copy the cpulist between { and } */
>> +		tmp = kmalloc((i - maskstart) + 1, GFP_KERNEL);
>> +		strscpy(tmp, str + maskstart, (i - maskstart) + 1);
>
> Need to check kmalloc() failure?  And also free tmp?
>

Heh, indeed, shoddy that :-)

Thanks!

>> +
>> +		pred->mask = kzalloc(cpumask_size(), GFP_KERNEL);
>> +		if (!pred->mask)
>> +			goto err_mem;
>> +
>> +		/* Now parse it */
>> +		if (cpulist_parse(tmp, pred->mask)) {
>> +			parse_error(pe, FILT_ERR_INVALID_CPULIST, pos + i);
>> +			goto err_free;
>> +		}
>> +
>> +		/* Move along */
>> +		i++;
>> +		if (field->filter_type == FILTER_CPUMASK)
>> +			pred->fn_num = FILTER_PRED_FN_CPUMASK;
>> +
>
> --
> Josh
Josh Poimboeuf July 28, 2023, 3:33 p.m. UTC | #11
On Thu, Jul 20, 2023 at 05:30:47PM +0100, Valentin Schneider wrote:
> I had to look into objtool itself to understand what this warning was
> about; make it more explicit.
> 
> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
> ---
>  tools/objtool/check.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/tools/objtool/check.c b/tools/objtool/check.c
> index 8936a05f0e5ac..d308330f2910e 100644
> --- a/tools/objtool/check.c
> +++ b/tools/objtool/check.c
> @@ -3360,7 +3360,7 @@ static bool pv_call_dest(struct objtool_file *file, struct instruction *insn)
>  
>  	list_for_each_entry(target, &file->pv_ops[idx].targets, pv_target) {
>  		if (!target->sec->noinstr) {
> -			WARN("pv_ops[%d]: %s", idx, target->name);
> +			WARN("pv_ops[%d]: indirect call to %s() leaves .noinstr.text section", idx, target->name);
>  			file->pv_ops[idx].clean = false;

This is an improvement, though I think it still results in two warnings,
with the second not-so-useful warning happening in validate_call().

Ideally it would only show a single warning; I guess that would need a
little bit of code restructuring.
Josh Poimboeuf July 28, 2023, 4 p.m. UTC | #12
On Thu, Jul 20, 2023 at 05:30:49PM +0100, Valentin Schneider wrote:
> objtool now warns about it:
> 
>   vmlinux.o: warning: objtool: enter_from_user_mode+0x4e: Non __ro_after_init static key "context_tracking_key" in .noinstr section
>   vmlinux.o: warning: objtool: enter_from_user_mode+0x50: Non __ro_after_init static key "context_tracking_key" in .noinstr section
>   vmlinux.o: warning: objtool: syscall_enter_from_user_mode+0x60: Non __ro_after_init static key "context_tracking_key" in .noinstr section
>   vmlinux.o: warning: objtool: syscall_enter_from_user_mode+0x62: Non __ro_after_init static key "context_tracking_key" in .noinstr section
>   [...]
> 
> The key can only be enabled (and not disabled) in the __init function
> ct_cpu_tracker_user(), so mark it as __ro_after_init.
> 
> Signed-off-by: Valentin Schneider <vschneid@redhat.com>

It's best to avoid temporarily introducing warnings.  Bots will
rightfully complain about that.  This patch and the next one should come
before the objtool patches.

Also it would be helpful for the commit log to have a brief
justification for the patch beyond "fix the objtool warning".  Something
roughly like:

  Soon, runtime-mutable text won't be allowed in .noinstr sections, so
  that a code patching IPI to a userspace-bound CPU can be safely
  deferred to the next kernel entry.

  'context_tracking_key' is only enabled in __init ct_cpu_tracker_user().
  Mark it as __ro_after_init.
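
With that, the change itself can stay a one-liner, something like (assuming
the existing DEFINE_STATIC_KEY_FALSE_RO() helper, which tags the key as
__ro_after_init):

  -DEFINE_STATIC_KEY_FALSE(context_tracking_key);
  +DEFINE_STATIC_KEY_FALSE_RO(context_tracking_key);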
Steven Rostedt July 29, 2023, 7:09 p.m. UTC | #13
On Wed, 26 Jul 2023 12:41:48 -0700
Josh Poimboeuf <jpoimboe@kernel.org> wrote:

> On Thu, Jul 20, 2023 at 05:30:38PM +0100, Valentin Schneider wrote:
> >  int filter_assign_type(const char *type)
> >  {
> > -	if (strstr(type, "__data_loc") && strstr(type, "char"))
> > -		return FILTER_DYN_STRING;
> > +	if (strstr(type, "__data_loc")) {
> > +		if (strstr(type, "char"))
> > +			return FILTER_DYN_STRING;
> > +		if (strstr(type, "cpumask_t"))
> > +			return FILTER_CPUMASK;
> > +		}  
> 
> The closing bracket has the wrong indentation.
> 
> > +		/* Copy the cpulist between { and } */
> > +		tmp = kmalloc((i - maskstart) + 1, GFP_KERNEL);
> > +		strscpy(tmp, str + maskstart, (i - maskstart) + 1);  
> 
> Need to check kmalloc() failure?  And also free tmp?

I came to state the same thing.

Also, when you do an empty for loop:

	for (; str[i] && str[i] != '}'; i++);

Always put the semicolon on the next line, otherwise it is really easy
to think that the next line is part of the for loop. That is, instead
of the above, do:

	for (; str[i] && str[i] != '}'; i++)
		;


-- Steve


> 
> > +
> > +		pred->mask = kzalloc(cpumask_size(), GFP_KERNEL);
> > +		if (!pred->mask)
> > +			goto err_mem;
> > +
> > +		/* Now parse it */
> > +		if (cpulist_parse(tmp, pred->mask)) {
> > +			parse_error(pe, FILT_ERR_INVALID_CPULIST, pos + i);
> > +			goto err_free;
> > +		}
> > +
> > +		/* Move along */
> > +		i++;
> > +		if (field->filter_type == FILTER_CPUMASK)
> > +			pred->fn_num = FILTER_PRED_FN_CPUMASK;
> > +  
>