
[v2,0/3] newidle_balance() PREEMPT_RT latency mitigations

Message ID 20210428232821.2506201-1-swood@redhat.com

Message

Crystal Wood April 28, 2021, 11:28 p.m. UTC
These patches mitigate latency caused by newidle_balance() on large
systems when PREEMPT_RT is enabled, by enabling interrupts when the lock
is dropped, and exiting early at various points if an RT task is runnable
on the current CPU.
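As a rough sketch of the early-exit idea (illustrative only; the actual
patches differ in detail), the domain scan in newidle_balance() bails
out as soon as an RT task becomes runnable on this runqueue, so the
scan cannot delay it:

	for_each_domain(this_cpu, sd) {
		/*
		 * An RT task appeared on this rq while we were scanning;
		 * abandon the balance attempt so it can run immediately.
		 */
		if (this_rq->rt.rt_nr_running) {
			pulled_task = 0;
			break;
		}

		pulled_task = load_balance(this_cpu, this_rq, sd,
					   CPU_NEWLY_IDLE,
					   &continue_balancing);
	}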

On a system with 128 CPUs, these patches dropped latency (as measured by
a 12 hour rteval run) from 1045us to 317us (when applied to
5.12.0-rc3-rt3).

I tried a couple of scheduler benchmarks (perf bench sched pipe, and
sysbench threads) to try to determine whether the overhead is measurable
on non-RT, but the results varied widely enough (with or without the
patches) that I couldn't draw any conclusions from them.  So at least
for now, I limited the balance callback change to when PREEMPT_RT is
enabled.
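As a hypothetical illustration of that gating (the callback and list
head names below are placeholders, not taken from the patches),
!PREEMPT_RT builds keep today's inline call while PREEMPT_RT defers the
work to the balance_callback stage, where the rq lock can be dropped
safely:

	if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
		/* Run newidle balancing later, via the balance callback. */
		queue_balance_callback(rq, &rq->newidle_head,
				       run_newidle_balance);
	} else {
		pulled_task = newidle_balance(rq, rf);
	}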

Link to v1 RFC patches:
https://lore.kernel.org/lkml/20200428050242.17717-1-swood@redhat.com/

Scott Wood (3):
  sched/fair: Call newidle_balance() from balance_callback on PREEMPT_RT
  sched/fair: Enable interrupts when dropping lock in newidle_balance()
  sched/fair: break out of newidle balancing if an RT task appears

 kernel/sched/fair.c  | 66 ++++++++++++++++++++++++++++++++++++++------
 kernel/sched/sched.h |  6 ++++
 2 files changed, 64 insertions(+), 8 deletions(-)

Comments

Crystal Wood May 1, 2021, 10:03 p.m. UTC | #1
On Thu, 2021-04-29 at 09:12 +0200, Vincent Guittot wrote:
> Hi Scott,
> 
> On Thu, 29 Apr 2021 at 01:28, Scott Wood <swood@redhat.com> wrote:
> > These patches mitigate latency caused by newidle_balance() on large
> > systems when PREEMPT_RT is enabled, by enabling interrupts when the lock
> > is dropped, and exiting early at various points if an RT task is runnable
> > on the current CPU.
> > 
> > On a system with 128 CPUs, these patches dropped latency (as measured by
> > a 12 hour rteval run) from 1045us to 317us (when applied to
> > 5.12.0-rc3-rt3).
> 
> The patch below has been queued for v5.13 and removes the update of
> blocked load, which seemed to be the major reason for long
> preempt/irq-off periods during newly idle balance:
> https://lore.kernel.org/lkml/20210224133007.28644-1-vincent.guittot@linaro.org/
> 
> I would be curious to see how it impacts your cases

I still get 1000+ us latencies with those patches applied.

-Scott
Mike Galbraith May 2, 2021, 3:25 a.m. UTC | #2
On Sat, 2021-05-01 at 17:03 -0500, Scott Wood wrote:
> On Thu, 2021-04-29 at 09:12 +0200, Vincent Guittot wrote:
> > Hi Scott,
> > 
> > On Thu, 29 Apr 2021 at 01:28, Scott Wood <swood@redhat.com> wrote:
> > > These patches mitigate latency caused by newidle_balance() on large
> > > systems when PREEMPT_RT is enabled, by enabling interrupts when the
> > > lock is dropped, and exiting early at various points if an RT task
> > > is runnable on the current CPU.
> > > 
> > > On a system with 128 CPUs, these patches dropped latency (as measured
> > > by a 12 hour rteval run) from 1045us to 317us (when applied to
> > > 5.12.0-rc3-rt3).
> > 
> > The patch below has been queued for v5.13 and removes the update of
> > blocked load, which seemed to be the major reason for long
> > preempt/irq-off periods during newly idle balance:
> > https://lore.kernel.org/lkml/20210224133007.28644-1-vincent.guittot@linaro.org/
> > 
> > I would be curious to see how it impacts your cases
> 
> I still get 1000+ us latencies with those patches applied.

If NEWIDLE balancing migrates one task, how does that manage to consume
a full *millisecond*, and why would that only be a problem for RT?

	-Mike

(rt tasks don't play !rt balancer here, if CPU goes idle, tough titty)
Crystal Wood May 3, 2021, 4:33 p.m. UTC | #3
On Sun, 2021-05-02 at 05:25 +0200, Mike Galbraith wrote:
> On Sat, 2021-05-01 at 17:03 -0500, Scott Wood wrote:
> > On Thu, 2021-04-29 at 09:12 +0200, Vincent Guittot wrote:
> > > Hi Scott,
> > > 
> > > On Thu, 29 Apr 2021 at 01:28, Scott Wood <swood@redhat.com> wrote:
> > > > These patches mitigate latency caused by newidle_balance() on large
> > > > systems when PREEMPT_RT is enabled, by enabling interrupts when the
> > > > lock is dropped, and exiting early at various points if an RT task
> > > > is runnable on the current CPU.
> > > > 
> > > > On a system with 128 CPUs, these patches dropped latency (as
> > > > measured by a 12 hour rteval run) from 1045us to 317us (when
> > > > applied to 5.12.0-rc3-rt3).
> > > 
> > > The patch below has been queued for v5.13 and removes the update of
> > > blocked load, which seemed to be the major reason for long
> > > preempt/irq-off periods during newly idle balance:
> > > https://lore.kernel.org/lkml/20210224133007.28644-1-vincent.guittot@linaro.org/
> > > 
> > > I would be curious to see how it impacts your cases
> > 
> > I still get 1000+ us latencies with those patches applied.
> 
> If NEWIDLE balancing migrates one task, how does that manage to consume
> a full *millisecond*, and why would that only be a problem for RT?
> 
> 	-Mike
> 
> (rt tasks don't play !rt balancer here, if CPU goes idle, tough titty)

Determining which task to pull is apparently taking that long (again, this
is on a 128-cpu system).  RT is singled out because that is the config that
makes significant tradeoffs to keep latencies down (I expect this would be
far from the only possible 1ms+ latency on a non-RT kernel), and there was
concern about the overhead of a double context switch when pulling a task to
a newidle cpu.

-Scott
Mike Galbraith May 3, 2021, 6:52 p.m. UTC | #4
On Mon, 2021-05-03 at 11:33 -0500, Scott Wood wrote:
> On Sun, 2021-05-02 at 05:25 +0200, Mike Galbraith wrote:
> > On Sat, 2021-05-01 at 17:03 -0500, Scott Wood wrote:
> > > On Thu, 2021-04-29 at 09:12 +0200, Vincent Guittot wrote:
> > > > Hi Scott,
> > > >
> > > > On Thu, 29 Apr 2021 at 01:28, Scott Wood <swood@redhat.com> wrote:
> > > > > These patches mitigate latency caused by newidle_balance() on
> > > > > large systems when PREEMPT_RT is enabled, by enabling interrupts
> > > > > when the lock is dropped, and exiting early at various points if
> > > > > an RT task is runnable on the current CPU.
> > > > >
> > > > > On a system with 128 CPUs, these patches dropped latency (as
> > > > > measured by a 12 hour rteval run) from 1045us to 317us (when
> > > > > applied to 5.12.0-rc3-rt3).
> > > >
> > > > The patch below has been queued for v5.13 and removes the update of
> > > > blocked load, which seemed to be the major reason for long
> > > > preempt/irq-off periods during newly idle balance:
> > > > https://lore.kernel.org/lkml/20210224133007.28644-1-vincent.guittot@linaro.org/
> > > >
> > > > I would be curious to see how it impacts your cases
> > >
> > > I still get 1000+ us latencies with those patches applied.
> >
> > If NEWIDLE balancing migrates one task, how does that manage to consume
> > a full *millisecond*, and why would that only be a problem for RT?
> >
> > 	-Mike
> >
> > (rt tasks don't play !rt balancer here, if CPU goes idle, tough titty)
>
> Determining which task to pull is apparently taking that long (again, this
> is on a 128-cpu system).  RT is singled out because that is the config that
> makes significant tradeoffs to keep latencies down (I expect this would be
> far from the only possible 1ms+ latency on a non-RT kernel), and there was
> concern about the overhead of a double context switch when pulling a task
> to a newidle cpu.

What I think has been going on is that you're running a synchronized RT
load, many CPUs go idle as a thundering herd, and meet at the focal
point, the busiest queue.  What I was alluding to was that preventing
pile-ups of that scale would be way better than poking holes in them
for RT to try to sneak through.  If pile-up it is, while not
particularly likely, the same should happen with normal tasks, wasting
cycles generating heat.

The main issue I see with these patches is that the resulting number is
still so gawd awful as to mean "nope, not rt ready", making the whole
exercise look a bit like a noop.

	-Mike
Crystal Wood May 3, 2021, 9:57 p.m. UTC | #5
On Mon, 2021-05-03 at 20:52 +0200, Mike Galbraith wrote:
> On Mon, 2021-05-03 at 11:33 -0500, Scott Wood wrote:
> > On Sun, 2021-05-02 at 05:25 +0200, Mike Galbraith wrote:
> > > If NEWIDLE balancing migrates one task, how does that manage to
> > > consume a full *millisecond*, and why would that only be a problem
> > > for RT?
> > > 
> > > 	-Mike
> > > 
> > > (rt tasks don't play !rt balancer here, if CPU goes idle, tough titty)
> > 
> > Determining which task to pull is apparently taking that long (again,
> > this is on a 128-cpu system).  RT is singled out because that is the
> > config that makes significant tradeoffs to keep latencies down (I
> > expect this would be far from the only possible 1ms+ latency on a
> > non-RT kernel), and there was concern about the overhead of a double
> > context switch when pulling a task to a newidle cpu.
> 
> What I think has been going on is that you're running a synchronized RT
> load, many CPUs go idle as a thundering herd, and meet at the focal
> point, the busiest queue.  What I was alluding to was that preventing
> pile-ups of that scale would be way better than poking holes in them
> for RT to try to sneak through.  If pile-up it is, while not
> particularly likely, the same should happen with normal tasks, wasting
> cycles generating heat.
> 
> The main issue I see with these patches is that the resulting number is
> still so gawd awful as to mean "nope, not rt ready", making the whole
> exercise look a bit like a noop.

It doesn't look like rteval asks cyclictest to synchronize, but
regardless, how is this "poking holes"?  Making sure interrupts are
enabled during potentially long-running activities is pretty fundamental
to PREEMPT_RT.  What specifically is your suggestion?
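To be concrete, the fundamental pattern at issue is just an interrupt
window around the long-running section (a minimal sketch of the idea,
not the patch itself):

	raw_spin_unlock(&this_rq->lock);
	local_irq_enable();	/* pending IRQs, and RT wakeups, get in here */

	/* ... potentially long newidle balance scan ... */

	local_irq_disable();
	raw_spin_lock(&this_rq->lock);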

And yes, 317 us is still not a very good number for PREEMPT_RT, but
progress is progress.  It's hard to address the moderate latency spikes
if they're obscured by large latency spikes.  One also needs to have
realistic expectations when it comes to RT on large systems, particularly
when not isolating the latency-sensitive CPUs.

-Scott
Mike Galbraith May 4, 2021, 4:07 a.m. UTC | #6
On Mon, 2021-05-03 at 16:57 -0500, Scott Wood wrote:
> On Mon, 2021-05-03 at 20:52 +0200, Mike Galbraith wrote:
> > On Mon, 2021-05-03 at 11:33 -0500, Scott Wood wrote:
> > > On Sun, 2021-05-02 at 05:25 +0200, Mike Galbraith wrote:
> > > > If NEWIDLE balancing migrates one task, how does that manage to
> > > > consume a full *millisecond*, and why would that only be a
> > > > problem for RT?
> > > >
> > > > 	-Mike
> > > >
> > > > (rt tasks don't play !rt balancer here, if CPU goes idle, tough titty)
> > >
> > > Determining which task to pull is apparently taking that long (again,
> > > this is on a 128-cpu system).  RT is singled out because that is the
> > > config that makes significant tradeoffs to keep latencies down (I
> > > expect this would be far from the only possible 1ms+ latency on a
> > > non-RT kernel), and there was concern about the overhead of a double
> > > context switch when pulling a task to a newidle cpu.
> >
> > What I think has been going on is that you're running a synchronized RT
> > load, many CPUs go idle as a thundering herd, and meet at the focal
> > point, the busiest queue.  What I was alluding to was that preventing
> > pile-ups of that scale would be way better than poking holes in them
> > for RT to try to sneak through.  If pile-up it is, while not
> > particularly likely, the same should happen with normal tasks, wasting
> > cycles generating heat.
> >
> > The main issue I see with these patches is that the resulting number is
> > still so gawd awful as to mean "nope, not rt ready", making the whole
> > exercise look a bit like a noop.
>
> It doesn't look like rteval asks cyclictest to synchronize, but
> regardless, how is this "poking holes"?

Pulling a single task is taking _a full millisecond_, which I see as a
mountain of cycles, directly through which you open a path for wakeups.
That "poking holes" isn't meant to be some kind of crude derogatory
remark, it's just the way I see what was done.  The mountain still
stands, you didn't remove it.

>   Making sure interrupts are
> enabled during potentially long-running activities is pretty fundamental
> to PREEMPT_RT.  What specifically is your suggestion?

Try to include the fair class in any LB improvement if at all possible,
because that's where most of the real-world benefit is to be had.

> And yes, 317 us is still not a very good number for PREEMPT_RT, but
> progress is progress.  It's hard to address the moderate latency spikes
> if they're obscured by large latency spikes.  One also needs to have
> realistic expectations when it comes to RT on large systems, particularly
> when not isolating the latency-sensitive CPUs.

Agreed.  But.  Specifically because the result remains intolerable to
anything remotely sensitive, users running such on their big boxen are
not going to be doing mixed load; that flat does not work, which is why
I said the patch set looks a bit like a noop: it excludes the audience
that stands to gain.. nearly anything.  Big box HPC (acronym includes
RT) gains absolutely nothing, as does big box general case with its not
particularly prevalent, but definitely existent RT tasks.  Big box
users who are NOT running anything they care deeply about do receive
some love.. but don't care deeply, and certainly don't care any more
deeply than general case users do about these collision-induced
latency spikes.

	-Mike