[v5,0/9] "Task_isolation" mode

Message ID 8d887e59ca713726f4fcb25a316e1e932b02823e.camel@marvell.com

Message

Alex Belits Nov. 23, 2020, 5:42 p.m. UTC
This is an update of the task isolation work that was originally done by
Chris Metcalf <cmetcalf@mellanox.com> and maintained by him until
November 2017. It has been adapted to the current kernel and reworked to
implement its functionality in a more complete and cleaner manner.

The previous version is at
https://lore.kernel.org/netdev/04be044c1bcd76b7438b7563edc35383417f12c8.camel@marvell.com/

The last version by Chris Metcalf (now obsolete but may be relevant
for comparison and understanding the origin of the changes) is at
https://lore.kernel.org/lkml/1509728692-10460-1-git-send-email-cmetcalf@mellanox.com

Supported architectures

This version includes only the architecture-independent code and arm64
support. x86 and arm support, and everything related to virtualization,
will be re-added later, once the new kernel entry/exit implementation
is accommodated. Support for other architectures can be added in a
somewhat modular manner; however, it depends heavily on the details of
kernel entry/exit support on each particular architecture.
Development of the common entry/exit code and conversion to it should
simplify that task. For now, this is the version that is currently
being developed on arm64.

Major changes since v4

The goal was to make isolation-breaking detection as generic as
possible, and to remove everything related to determining _why_
isolation was broken. Originally, reporting of isolation breaking was
done with a large number of hooks in specific code paths (hardware
interrupts, syscalls, IPIs, page faults, etc.), and it was necessary
to cover all possible such events to reliably notify a task that its
isolation was broken. To avoid such a fragile mechanism, this version
relies on the mere fact of the kernel being entered while in isolation
mode. As a result, reporting happens later in kernel code, but it
covers everything.

This means that there is now no specific reporting, in the kernel log
or elsewhere, of the reasons for breaking isolation. Information about
that may be valuable at runtime, so a separate mechanism for generic
reporting of "why did the CPU enter the kernel" (with isolation or
under other conditions) may be a good thing. That can be done later;
at this point it is important that task isolation does not require it,
and such a mechanism will not be developed for the limited purpose of
supporting isolation alone.

General description

This is the result of development and maintenance of the task
isolation functionality that originally started from task isolation
patch v15 and was later updated to include v16. It provided a
predictable environment for userspace tasks running on arm64
processors alongside a full-featured Linux environment. It is intended
to provide a reliable, interruption-free environment from the point
when a userspace task enters isolation until the moment it leaves
isolation or receives a signal intentionally sent to it, and it was
successfully used for this purpose. While CPU isolation with nohz
provides an environment that is close to this requirement, the
remaining IPIs and other disturbances keep it from being usable for
tasks that require complete predictability of CPU timing.
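
For orientation, the mainline CPU-isolation mechanisms this builds on
are configured with kernel boot parameters. A typical configuration
dedicating CPUs 2-3 (these parameters are documented in mainline's
kernel-parameters.txt and are independent of this series; the CPU list
is an arbitrary example) looks like:

```
isolcpus=nohz,domain,2-3 nohz_full=2-3 rcu_nocbs=2-3
```

With that in place, scheduler activity, the timer tick, and RCU
callbacks on those CPUs are already greatly reduced; this series
addresses the disturbances that remain.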

This set of patches covers only the implementation of task isolation;
additional functionality, such as selective TLB flushes, may be
implemented later to avoid other kinds of disturbances that affect the
latency and performance of isolated tasks.

The userspace support library and test program are now at
https://github.com/abelits/libtmc . They were originally developed for
an earlier implementation, so they contain some checks that may be
redundant now but are kept for compatibility.

My thanks to Chris Metcalf for the design and maintenance of the
original task isolation patch, Francis Giraldeau
<francis.giraldeau@gmail.com> and Yuri Norov <ynorov@marvell.com> for
various contributions to this work, Frederic Weisbecker
<frederic@kernel.org> for his work on CPU isolation and housekeeping
that made it possible to remove some less elegant solutions that I had
to devise for earlier (pre-4.17) kernels, and Nitesh Narayan Lal
<nitesh@redhat.com> for adapting earlier patches related to interrupt
and work distribution in the presence of CPU isolation.

-- 
Alex

Comments

Thomas Gleixner Nov. 23, 2020, 10:01 p.m. UTC | #1
Alex,

On Mon, Nov 23 2020 at 17:56, Alex Belits wrote:
>  .../admin-guide/kernel-parameters.txt         |   6 +
>  drivers/base/cpu.c                            |  23 +
>  include/linux/hrtimer.h                       |   4 +
>  include/linux/isolation.h                     | 326 ++++++++
>  include/linux/sched.h                         |   5 +
>  include/linux/tick.h                          |   3 +
>  include/uapi/linux/prctl.h                    |   6 +
>  init/Kconfig                                  |  27 +
>  kernel/Makefile                               |   2 +
>  kernel/isolation.c                            | 714 ++++++++++++++++++
>  kernel/signal.c                               |   2 +
>  kernel/sys.c                                  |   6 +
>  kernel/time/hrtimer.c                         |  27 +
>  kernel/time/tick-sched.c                      |  18 +

I asked you before to split this up into bits and pieces and argue and
justify each change. Throwing this wholesale over the fence is going
nowhere. It's not reviewable at all.

Aside of that ignoring review comments is a sure path to make yourself
ignored:

> +/*
> + * Logging
> + */
> +int task_isolation_message(int cpu, int level, bool supp, const char *fmt, ...);
> +
> +#define pr_task_isol_emerg(cpu, fmt, ...)			\
> +	task_isolation_message(cpu, LOGLEVEL_EMERG, false, fmt, ##__VA_ARGS__)

The comments various people made about that are not going away and none
of this is going near anything I'm responsible for unless you provide
these independent of the rest and with a reasonable justification why
you can't use any other existing mechanism or extend it for your use
case.

Thanks,

        tglx
Tom Rix Nov. 24, 2020, 4:36 p.m. UTC | #2
On 11/23/20 9:42 AM, Alex Belits wrote:
> This is an update of task isolation work that was originally done by
> Chris Metcalf <cmetcalf@mellanox.com> and maintained by him until
> November 2017. It is adapted to the current kernel and cleaned up to
> implement its functionality in a more complete and cleaner manner.

I am having problems applying the patchset to today's linux-next.

Which kernel should I be using ?

Thanks,

Tom
Alex Belits Nov. 24, 2020, 5:40 p.m. UTC | #3
On Tue, 2020-11-24 at 08:36 -0800, Tom Rix wrote:
> On 11/23/20 9:42 AM, Alex Belits wrote:
> > This is an update of task isolation work that was originally done
> > by Chris Metcalf <cmetcalf@mellanox.com> and maintained by him until
> > November 2017. It is adapted to the current kernel and cleaned up
> > to implement its functionality in a more complete and cleaner manner.
>
> I am having problems applying the patchset to today's linux-next.
>
> Which kernel should I be using ?

The patches are against Linus' tree, in particular, commit
a349e4c659609fd20e4beea89e5c4a4038e33a95
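
For reference, applying the series on that base looks roughly like the
following (assuming a clone of Linus' tree and the patch mails saved
locally as an mbox file named task-isolation-v5.mbox; the branch and
file names are arbitrary examples):

```
git checkout -b task-isolation-v5 a349e4c659609fd20e4beea89e5c4a4038e33a95
git am task-isolation-v5.mbox
```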

-- 
Alex
Mark Rutland Dec. 2, 2020, 2:02 p.m. UTC | #4
On Tue, Nov 24, 2020 at 05:40:49PM +0000, Alex Belits wrote:
> On Tue, 2020-11-24 at 08:36 -0800, Tom Rix wrote:
> > On 11/23/20 9:42 AM, Alex Belits wrote:
> > > This is an update of task isolation work that was originally done
> > > by Chris Metcalf <cmetcalf@mellanox.com> and maintained by him until
> > > November 2017. It is adapted to the current kernel and cleaned up
> > > to implement its functionality in a more complete and cleaner manner.
> >
> > I am having problems applying the patchset to today's linux-next.
> >
> > Which kernel should I be using ?
>
> The patches are against Linus' tree, in particular, commit
> a349e4c659609fd20e4beea89e5c4a4038e33a95

Is there any reason to base on that commit in particular?

Generally it's preferred that a series is based on a tag (so either a
release or an -rc kernel), and that the cover letter explains what the
base is. If you can do that in future it'll make the series much easier
to work with.

Thanks,
Mark.
Alex Belits Dec. 4, 2020, 12:39 a.m. UTC | #5
On Wed, 2020-12-02 at 14:02 +0000, Mark Rutland wrote:
> On Tue, Nov 24, 2020 at 05:40:49PM +0000, Alex Belits wrote:
> > > I am having problems applying the patchset to today's linux-next.
> > >
> > > Which kernel should I be using ?
> >
> > The patches are against Linus' tree, in particular, commit
> > a349e4c659609fd20e4beea89e5c4a4038e33a95
>
> Is there any reason to base on that commit in particular?

No specific reason for that particular commit.

> Generally it's preferred that a series is based on a tag (so either a
> release or an -rc kernel), and that the cover letter explains what the
> base is. If you can do that in future it'll make the series much
> easier to work with.

Ok.

-- 
Alex
Pavel Machek Dec. 5, 2020, 8:40 p.m. UTC | #6
Hi!

> General description
>
> This is the result of development and maintenance of task isolation
> functionality that originally started based on task isolation patch
> v15 and was later updated to include v16. It provided predictable
> environment for userspace tasks running on arm64 processors alongside
> with full-featured Linux environment. It is intended to provide
> reliable interruption-free environment from the point when a userspace
> task enters isolation and until the moment it leaves isolation or
> receives a signal intentionally sent to it, and was successfully used
> for this purpose. While CPU isolation with nohz provides an
> environment that is close to this requirement, the remaining IPIs and
> other disturbances keep it from being usable for tasks that require
> complete predictability of CPU timing.

So... what kind of guarantees does this aim to provide / what tasks it
is useful for?

For real time response, we have other approaches.

If you want to guarantee performance of the "isolated" task... I don't
see how that works. Other tasks on the system still compete for DRAM
bandwidth, caches, etc...

So... what is the usecase?
								Pavel
-- 
http://www.livejournal.com/~pavelmachek
Thomas Gleixner Dec. 5, 2020, 11:25 p.m. UTC | #7
Pavel,

On Sat, Dec 05 2020 at 21:40, Pavel Machek wrote:
> So... what kind of guarantees does this aim to provide / what tasks it

> is useful for?

>

> For real time response, we have other approaches.


Depends on your requirements. Some problems are actually better solved
with busy polling. See below.

> If you want to guarantee performance of the "isolated" task... I don't
> see how that works. Other tasks on the system still compete for DRAM
> bandwidth, caches, etc...

Applications which want to run as undisturbed as possible. There is
quite a range of those:

  - Hardware in the loop simulation is today often done with that crude
    approach of "offlining" a CPU and then instead of playing dead
    jumping to a preloaded bare metal executable. That's a horrible hack
    and impossible to debug, but gives them the results they need to
    achieve. These applications are well optimized vs. cache and memory
    foot print, so they don't worry about these things too much and they
    surely don't run on SMI and BIOS value add inflicted machines.

    Don't even think about waiting for an interrupt to achieve what
    these folks are doing. So no, there are problems which a general
    purpose realtime OS cannot solve ever.

  - HPC computations on large data sets. While the memory foot print is
    large the access patterns are cache optimized. 

    The problem there is that any unnecessary IPI, tick interrupt or
    whatever nuisance is disturbing the carefully optimized cache usage
    and alone getting rid of the timer interrupt gained them measurable
    performance. Even very low single digit percentage of runtime saving
    is valuable for these folks because the compute time on such beasts
    is expensive.

  - Realtime guests in KVM. With posted interrupts and a fully populated
    host side page table there is no point in running host side
    interrupts or IPIs for random accounting or whatever purposes as
    they affect the latency in the guest. With all the side effects
    mitigated and a properly set up guest and host it is possible to get
    to a zero exit situation after the bootup phase which means pretty
    much matching bare metal behaviour.

    Yes, you can do that with e.g. Jailhouse as well, but you lose lots
    of the fancy things KVM provides. And people care about these not
    just because they are fancy. They care because their application
    scenario needs them.

There are more reasons why people want to be able to get as much
isolation from the OS as possible but at the same time have a sane
execution environment, debugging, performance monitoring and the OS
provided protection mechanisms instead of horrible hacks.

Isolation makes sense for a range of applications and there is no reason
why Linux should not support them. 

Thanks,

        tglx
Yury Norov Dec. 11, 2020, 6:08 p.m. UTC | #8
On Sun, Dec 06, 2020 at 12:25:45AM +0100, Thomas Gleixner wrote:
> Pavel,
>
> On Sat, Dec 05 2020 at 21:40, Pavel Machek wrote:
> > So... what kind of guarantees does this aim to provide / what tasks it
> > is useful for?
> >
> > For real time response, we have other approaches.
>
> Depends on your requirements. Some problems are actually better solved
> with busy polling. See below.
>
> > If you want to guarantee performance of the "isolated" task... I don't
> > see how that works. Other tasks on the system still compete for DRAM
> > bandwidth, caches, etc...
>
> Applications which want to run as undisturbed as possible. There is
> quite a range of those:
>
>   - Hardware in the loop simulation is today often done with that crude
>     approach of "offlining" a CPU and then instead of playing dead
>     jumping to a preloaded bare metal executable. That's a horrible hack
>     and impossible to debug, but gives them the results they need to
>     achieve. These applications are well optimized vs. cache and memory
>     foot print, so they don't worry about these things too much and they
>     surely don't run on SMI and BIOS value add inflicted machines.
>
>     Don't even think about waiting for an interrupt to achieve what
>     these folks are doing. So no, there are problems which a general
>     purpose realtime OS cannot solve ever.
>
>   - HPC computations on large data sets. While the memory foot print is
>     large the access patterns are cache optimized.
>
>     The problem there is that any unnecessary IPI, tick interrupt or
>     whatever nuisance is disturbing the carefully optimized cache usage
>     and alone getting rid of the timer interrupt gained them measurable
>     performance. Even very low single digit percentage of runtime saving
>     is valuable for these folks because the compute time on such beasts
>     is expensive.
>
>   - Realtime guests in KVM. With posted interrupts and a fully populated
>     host side page table there is no point in running host side
>     interrupts or IPIs for random accounting or whatever purposes as
>     they affect the latency in the guest. With all the side effects
>     mitigated and a properly set up guest and host it is possible to get
>     to a zero exit situation after the bootup phase which means pretty
>     much matching bare metal behaviour.
>
>     Yes, you can do that with e.g. Jailhouse as well, but you lose lots
>     of the fancy things KVM provides. And people care about these not
>     just because they are fancy. They care because their application
>     scenario needs them.
>
> There are more reasons why people want to be able to get as much
> isolation from the OS as possible but at the same time have a sane
> execution environment, debugging, performance monitoring and the OS
> provided protection mechanisms instead of horrible hacks.
>
> Isolation makes sense for a range of applications and there is no reason
> why Linux should not support them.

One good client for task isolation is Open Data Plane. There are even
some code stubs intended to enable isolation where needed.

> > If you want to guarantee performance of the "isolated" task... I don't
> > see how that works. Other tasks on the system still compete for DRAM
> > bandwidth, caches, etc...

My experiments say that typical delay caused by dry IPI or syscall is
2000-20000 'ticks'. Typical delay caused by cache miss is 3-30 ticks.

To guarantee cache / memory bandwidth, one can use resctrl. Linux has
an implementation of it for x86 only, but arm64 has support for
resctrl on the CPU side.
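
On x86 that looks roughly like the following (the resctrl filesystem
interface is mainline; the cache-mask value and group name are
arbitrary examples and hardware-dependent, root required):

```
# Mount the resctrl filesystem (x86 with CAT support)
mount -t resctrl resctrl /sys/fs/resctrl

# Create a resource group and reserve part of the L3 cache for it;
# masks and the set of cache domains depend on the CPU
mkdir /sys/fs/resctrl/isolated
echo "L3:0=ff0;1=ff0" > /sys/fs/resctrl/isolated/schemata

# Move the isolated task into the group
echo $TASK_PID > /sys/fs/resctrl/isolated/tasks
```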

Thanks,
Yury
