diff mbox series

[v5,9/9] task_isolation: kick_all_cpus_sync: don't kick isolated cpus

Message ID 3236b13f42679031960c5605be20664e90e75223.camel@marvell.com
State New
Headers show
Series "Task_isolation" mode | expand

Commit Message

Alex Belits Nov. 23, 2020, 5:58 p.m. UTC
From: Yuri Norov <ynorov@marvell.com>


Make sure that kick_all_cpus_sync() does not call CPUs that are running
isolated tasks.

Signed-off-by: Yuri Norov <ynorov@marvell.com>

[abelits@marvell.com: use safe task_isolation_cpumask() implementation]
Signed-off-by: Alex Belits <abelits@marvell.com>

---
 kernel/smp.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

-- 
2.20.1

Comments

Alex Belits Nov. 25, 2020, 3:20 a.m. UTC | #1
On Tue, 2020-11-24 at 00:21 +0100, Frederic Weisbecker wrote:
> On Mon, Nov 23, 2020 at 10:39:34PM +0000, Alex Belits wrote:

> > 

> > This is different from timers. The original design was based on the

> > idea that every CPU should be able to enter kernel at any time and

> > run

> > kernel code with no additional preparation. Then the only solution

> > is

> > to always do full broadcast and require all CPUs to process it.

> > 

> > What I am trying to introduce is the idea of CPU that is not likely

> > to

> > run kernel code any soon, and can afford to go through an

> > additional

> > synchronization procedure on the next entry into kernel. The

> > synchronization is not skipped, it simply happens later, early in

> > kernel entry code.

> 

> Ah I see, this is ordered that way:

> 

> ll_isol_flags = ISOLATED

> 

>          CPU 0                                CPU 1

>     ------------------                       -----------------

>                                             // kernel entry

>     data_to_sync = 1                        ll_isol_flags = ISOLATED_BROKEN

>     smp_mb()                                smp_mb()

>     if ll_isol_flags(CPU 1) == ISOLATED     READ data_to_sync

>          smp_call(CPU 1)

> 


The check for ll_isol_flags(CPU 1) is reversed, and it's a bit more
complex. In terms of scenarios, on entry from isolation the following
can happen:

1. Kernel entry happens simultaneously with operation that requires
synchronization, kernel entry processing happens before the check for
isolation on the sender side:

ll_isol_flags(CPU 1) = ISOLATED

         CPU 0                                CPU 1
    ------------------                      -----------------
                                            // kernel entry
                                            if (ll_isol_flags == ISOLATED) {
                                                  ll_isol_flags = ISOLATED_BROKEN
    data_to_sync = 1                              smp_mb()
                                                  // data_to_sync undetermined
    smp_mb()                                }
    // ll_isol_flags(CPU 1) updated
    if ll_isol_flags(CPU 1) != ISOLATED
                                            // interrupts enabled
         smp_call(CPU 1)                          // kernel entry again
                                                  if (ll_isol_flags == ISOLATED)
                                                        // nothing happens
                                                  // explicit or implied barriers
                                                  // data_to_sync updated
                                                  // kernel exit
    // CPU 0 assumes, CPU 1 will see        READ data_to_sync
    // data_to_sync = 1 when in kernel

2. Kernel entry happens simultaneously with operation that requires
synchronization, kernel entry processing happens after the check for
isolation on the sender side:

ll_isol_flags(CPU 1) = ISOLATED

         CPU 0                                CPU 1
    ------------------                      -----------------
    data_to_sync = 1                        // kernel entry
    smp_mb()                                // data_to_sync undetermined
                                            // should not access data_to_sync here
                                            if (ll_isol_flags == ISOLATED) {      
                                                   ll_isol_flags = ISOLATED_BROKEN
    // ll_isol_flags(CPU 1) undetermined           smp_mb()
                                                   // data_to_sync updated
    if ll_isol_flags(CPU 1) != ISOLATED     }
         // possibly nothing happens
    // CPU 0 assumes, CPU 1 will see        READ data_to_sync
    // data_to_sync = 1 when in kernel

3. Kernel entry processing completed before the check for isolation on the sender
side:

ll_isol_flags(CPU 1) = ISOLATED

         CPU 0                                CPU 1
    ------------------                      -----------------
                                            // kernel entry
                                            if (ll_isol_flags == ISOLATED) {
                                                  ll_isol_flags = ISOLATED_BROKEN
                                                  smp_mb()
                                            }
                                            // interrupts are enabled at some
    data_to_sync = 1                        // point here, data_to_sync value
    smp_mb()                                // is undetermined, CPU 0 makes no
    // ll_isol_flags(CPU 1) updated         // assumptions about it
    if ll_isol_flags(CPU 1) != ISOLATED     //
          smp_call(CPU 1)                         // kernel entry again
                                                  if (ll_isol_flags == ISOLATED)
                                                        // nothing happens
                                                  // explicit or implied barriers
                                                  // data_to_sync updated
                                                  // kernel exit
    // CPU 0 assumes, CPU 1 will see
    // data_to_sync = 1 when in kernel
                                            READ data_to_sync

4. Kernel entry processing happens after the check for isolation on the sender
side:

ll_isol_flags(CPU 1) = ISOLATED

         CPU 0                                CPU 1
    ------------------                      -----------------
    data_to_sync = 1                        // userspace or early kernel entry
    smp_mb()
    if ll_isol_flags(CPU 1) != ISOLATED
          smp_call(CPU 1) // skipped        // userspace or early kernel entry
    // CPU 0 assumes, CPU 1 will see        // continues undisturbed
    // data_to_sync = 1 when in kernel
                                            // kernel entry
                                            // data_to_sync undetermined
                                            // should not access data_to_sync here
                                            if (ll_isol_flags == ISOLATED) {
                                                  ll_isol_flags = ISOLATED_BROKEN
                                                  smp_mb()
                                                  // data_to_sync updated
                                            }
                                            READ data_to_sync

This also applies to exit to userspace after enabling isolation -- once
ll_isol_flags is set to ISOLATED, synchronization will be missed, so
one final barrier should happen before returning to userspace and
enabling interrupts in the process. In this case "unlucky" timing would
result in smp call interrupting userspace that is already supposed to
be isolated, it will trigger normal isolation breaking procedure but
otherwise will be an unremarkable synchronization call. On the other
hand, synchronization that was supposed to happen after setting
ll_isol_flags, will be delayed to the next kernel entry, so there
should be nothing that needs synchronization between the end of
task_isolation_exit_to_user_mode() and entering userspace (interrupts
are ok, though, they will trigger isolation breaking).

This is why I have placed task_isolation_kernel_enter()
and task_isolation_exit_to_user_mode() in entry/exit code, and when I
have seen any ambiguity or had any doubt allowed duplicate calls to
task_isolation_kernel_enter() on entry. If I overdid any of that, I
would appreciate fixes and clarifications, however it should be taken
into account that for now different architectures and even drivers may
call common and specific functions in a slightly different order.

Synchronization also applies to possible effects on pipeline (when
originating CPU writes instructions).

There is a version under development that delays TLB flushes in this
manner. On arm64 that requires a switch to soft TLB flushes, and that's
another potential ball of wax. On architectures that already use soft
TLB flushes, this will be unavoidable because TLB flush becomes another
IPI, and IPIs break isolation. Maybe it will be better to make a
somewhat smarter TLB handling in general, so it will be possible to
avoid bothering unrelated CPUs. But then, I guess, hardware may still
overdo it.Then it will be closer to the situation with timers. For now,
I want to first do the obvious and exclude isolated tasks from TLB
updates until they are back in kernel, and do a slow full flush on
isolation breaking.

> You should document that, ie: explain why what you're doing is safe.


I tried to do that in comments, however this is clearly insufficient
despite their verbosity. Would it make sense to create a separate
documentation text file? Current documentation seems to be light on
detailed specifications for things under heavy development except for
something very significant that affects nearly everyone.

Would it make sense to include all scenarios like the above plus exit
to userspace, and existing sequences of calls that should lead to
synchronization calls, task_isolation_kernel_enter() or
task_isolation_exit_to_user_mode()?


> Also Beware though that the data to sync in question doesn't need to

> be visible

> in the entry code before task_isolation_kernel_enter().


Right. And after task_isolation_exit_to_user_mode() as well. This is
why I had to dig through entry/exit code. I can produce a somewhat
usable call sequence diagram.

If by any chance I am wrong there somewhere, I welcome all relevant
comments and corrections.

At this point I am not sure only about things that are called when
CONFIG_TRACE_IRQFLAGS is enabled, simply because I have not yet checked
what they depend on, and virtualization-specific entry/exit is not
entirely clear to me. Maybe for correctness sake I should have declared
task isolation incompatible with those until I either know for sure or
added working support for them.

>  You need to audit all

> the callers of kick_all_cpus_sync().


Right now it's just three places.

do_tune_cpucache() does the right thing, keeps CPUs from seeing old
values of cachep->cpu_cache as they are being deallocated.

powerpc kvm_arch_destroy_vm() cares only about CPUs with VCPUs on them,
what for now should not be used with isolation.

arm64 flush_icache_range() is the main reason why this could be
potentially problematic, it makes sure that other CPUs will not run
stale instructions, and it's called from all places where code is
modified. If in other situation some kind of delayed processing could
be possible, this one has to be done in the right sequence. And that's
the reason for instr_sync() after smp_mb() in
task_isolation_kernel_enter(). Since flush_icache_range() could skip
this CPU, we have to synchronize it by ourselves.

There is an important issue of not having early kernel entry / late
kernel exit rely on something that may be stale (with both data and
code). With the above mentioned exceptions for now, it can be
demonstrated that this is correct. It should be taken into account that
even though static keys are used in early kernel entry code, static
keys setting does not synchronize flushes, and therefore already should
be done before use.

Task isolation in virtualization guests can be a perfectly valid thing
to support (it just has to be propagated to the host if permitted),
however this is something I will want to revisit in the future. For
now, I assume that virtualization-related events are not supposed to
break isolation, however it would be nice to be ready for handling that
possibility.

-- 
Alex
Marcelo Tosatti Jan. 22, 2021, 3 p.m. UTC | #2
On Tue, Nov 24, 2020 at 12:21:06AM +0100, Frederic Weisbecker wrote:
> On Mon, Nov 23, 2020 at 10:39:34PM +0000, Alex Belits wrote:

> > 

> > On Mon, 2020-11-23 at 23:29 +0100, Frederic Weisbecker wrote:

> > > External Email

> > > 

> > > -------------------------------------------------------------------

> > > ---

> > > On Mon, Nov 23, 2020 at 05:58:42PM +0000, Alex Belits wrote:

> > > > From: Yuri Norov <ynorov@marvell.com>

> > > > 

> > > > Make sure that kick_all_cpus_sync() does not call CPUs that are

> > > > running

> > > > isolated tasks.

> > > > 

> > > > Signed-off-by: Yuri Norov <ynorov@marvell.com>

> > > > [abelits@marvell.com: use safe task_isolation_cpumask()

> > > > implementation]

> > > > Signed-off-by: Alex Belits <abelits@marvell.com>

> > > > ---

> > > >  kernel/smp.c | 14 +++++++++++++-

> > > >  1 file changed, 13 insertions(+), 1 deletion(-)

> > > > 

> > > > diff --git a/kernel/smp.c b/kernel/smp.c

> > > > index 4d17501433be..b2faecf58ed0 100644

> > > > --- a/kernel/smp.c

> > > > +++ b/kernel/smp.c

> > > > @@ -932,9 +932,21 @@ static void do_nothing(void *unused)

> > > >   */

> > > >  void kick_all_cpus_sync(void)

> > > >  {

> > > > +	struct cpumask mask;

> > > > +

> > > >  	/* Make sure the change is visible before we kick the cpus */

> > > >  	smp_mb();

> > > > -	smp_call_function(do_nothing, NULL, 1);

> > > > +

> > > > +	preempt_disable();

> > > > +#ifdef CONFIG_TASK_ISOLATION

> > > > +	cpumask_clear(&mask);

> > > > +	task_isolation_cpumask(&mask);

> > > > +	cpumask_complement(&mask, &mask);

> > > > +#else

> > > > +	cpumask_setall(&mask);

> > > > +#endif

> > > > +	smp_call_function_many(&mask, do_nothing, NULL, 1);

> > > > +	preempt_enable();

> > > 

> > > Same comment about IPIs here.

> > 

> > This is different from timers. The original design was based on the

> > idea that every CPU should be able to enter kernel at any time and run

> > kernel code with no additional preparation. Then the only solution is

> > to always do full broadcast and require all CPUs to process it.

> > 

> > What I am trying to introduce is the idea of CPU that is not likely to

> > run kernel code any soon, and can afford to go through an additional

> > synchronization procedure on the next entry into kernel. The

> > synchronization is not skipped, it simply happens later, early in

> > kernel entry code.


Perhaps a bitmask of pending flushes makes more sense? 
static_key_enable IPIs is one of the users, but for its case it would 
be necessary to differentiate between in-kernel mode and out of kernel 
mode atomically (since i-cache flush must be performed if isolated CPU 
is in kernel mode).

> Ah I see, this is ordered that way:

> 

> ll_isol_flags = ISOLATED

> 

>          CPU 0                                CPU 1

>     ------------------                       -----------------

>                                             // kernel entry

>     data_to_sync = 1                        ll_isol_flags = ISOLATED_BROKEN

>     smp_mb()                                smp_mb()

>     if ll_isol_flags(CPU 1) == ISOLATED     READ data_to_sync

>          smp_call(CPU 1)


Since isolated mode with syscalls is a desired feature, having a
separate atomic with in_kernel_mode = 0/1 (that is set/cleared 
on kernel entry / kernel exit, while on TIF_TASK_ISOLATION), would be
necessary (and a similar race-free logic as above).

> You should document that, ie: explain why what you're doing is safe.

> 

> Also Beware though that the data to sync in question doesn't need to be visible

> in the entry code before task_isolation_kernel_enter(). You need to audit all

> the callers of kick_all_cpus_sync().


Cscope tag: flush_icache_range
   #   line  filename / context / line
   1     96  arch/arc/kernel/jump_label.c <<arch_jump_label_transform>>
             flush_icache_range(entry->code, entry->code + JUMP_LABEL_NOP_SIZE);

This case would be OK for delayed processing before kernel entry, as long as
no code before task_isolation_kernel_enter can be modified (which i am
not sure about).

But:

  36     28  arch/ia64/include/asm/cacheflush.h <<flush_icache_user_page>>
             flush_icache_range(_addr, _addr + (len)); \

Is less certain.

Alex do you recall if arch_jump_label_transform was the only offender or 
there were others as well? (suppose handling only the ones which matter
in production at the moment, and later fixing individual ones makes most
sense).
diff mbox series

Patch

diff --git a/kernel/smp.c b/kernel/smp.c
index 4d17501433be..b2faecf58ed0 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -932,9 +932,21 @@  static void do_nothing(void *unused)
  */
 void kick_all_cpus_sync(void)
 {
+	struct cpumask mask;
+
 	/* Make sure the change is visible before we kick the cpus */
 	smp_mb();
-	smp_call_function(do_nothing, NULL, 1);
+
+	preempt_disable();
+#ifdef CONFIG_TASK_ISOLATION
+	cpumask_clear(&mask);
+	task_isolation_cpumask(&mask);
+	cpumask_complement(&mask, &mask);
+#else
+	cpumask_setall(&mask);
+#endif
+	smp_call_function_many(&mask, do_nothing, NULL, 1);
+	preempt_enable();
 }
 EXPORT_SYMBOL_GPL(kick_all_cpus_sync);