diff mbox

[03/16] sched/fair: Disregard idle task wakee_flips in wake_wide

Message ID 1464001138-25063-4-git-send-email-morten.rasmussen@arm.com
State New
Headers show

Commit Message

Morten Rasmussen May 23, 2016, 10:58 a.m. UTC
wake_wide() is based on task wakee_flips of the waker and the wakee to
decide whether an affine wakeup is desirable. On lightly loaded systems
the waker is frequently the idle task (pid=0) which can accumulate a lot
of wakee_flips in that scenario. It makes little sense to prevent affine
wakeups on an idle cpu due to the idle task wakee_flips, so it makes
more sense to ignore them in wake_wide().

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>

---
 kernel/sched/fair.c | 4 ++++
 1 file changed, 4 insertions(+)

-- 
1.9.1

Comments

Morten Rasmussen May 23, 2016, noon UTC | #1
On Mon, May 23, 2016 at 01:12:07PM +0200, Mike Galbraith wrote:
> On Mon, 2016-05-23 at 11:58 +0100, Morten Rasmussen wrote:

> > wake_wide() is based on task wakee_flips of the waker and the wakee to

> > decide whether an affine wakeup is desirable. On lightly loaded systems

> > the waker is frequently the idle task (pid=0) which can accumulate a lot

> > of wakee_flips in that scenario. It makes little sense to prevent affine

> > wakeups on an idle cpu due to the idle task wakee_flips, so it makes

> > more sense to ignore them in wake_wide().

> 

> You sure?  What's the difference between a task flipping enough to

> warrant spreading the load, and an interrupt source doing the same? 

>  I've both witnessed firsthand, and received user confirmation of this

> very thing improving utilization.


Right, I didn't consider the interrupt source scenario, my fault.

The problem then seems to be distinguishing truly idle and busy doing
interrupts. The issue that I observe is that wake_wide() likes pushing
tasks around in lightly scenarios which isn't desirable for power
management. Selecting the same cpu again may potentially let others
reach deeper C-state.

With that in mind I will if I can do better. Suggestions are welcome :-)

> 

> > cc: Ingo Molnar <mingo@redhat.com>

> > cc: Peter Zijlstra <peterz@infradead.org>

> > 

> > Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>

> > ---

> >  kernel/sched/fair.c | 4 ++++

> >  1 file changed, 4 insertions(+)

> > 

> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c

> > index c49e25a..0fe3020 100644

> > --- a/kernel/sched/fair.c

> > +++ b/kernel/sched/fair.c

> > @@ -5007,6 +5007,10 @@ static int wake_wide(struct task_struct *p)

> >  	unsigned int slave = p->wakee_flips;

> >  	int factor = this_cpu_read(sd_llc_size);

> >  

> > +	/* Don't let the idle task prevent affine wakeups */

> > +	if (is_idle_task(current))

> > +		return 0;

> > +

> >  	if (master < slave)

> >  		swap(master, slave);

> >  	if (slave < factor || master < slave * factor)
Morten Rasmussen May 23, 2016, 2:10 p.m. UTC | #2
On Mon, May 23, 2016 at 03:00:46PM +0200, Mike Galbraith wrote:
> On Mon, 2016-05-23 at 13:00 +0100, Morten Rasmussen wrote:

> 

> > The problem then seems to be distinguishing truly idle and busy doing

> > interrupts. The issue that I observe is that wake_wide() likes pushing

> > tasks around in lightly scenarios which isn't desirable for power

> > management. Selecting the same cpu again may potentially let others

> > reach deeper C-state.

> > 

> > With that in mind I will if I can do better. Suggestions are welcome :-)

> 

> None here.  For big boxen that are highly idle, you'd likely want to

> shut down nodes and consolidate load, but otoh, all that slows response

> to burst, which I hate.  I prefer race to idle, let power gating do its

> job.  If I had a server farm with enough capacity vs load variability

> to worry about, I suspect I'd become highly interested in routing.


I don't disagree for systems of that scale, but at the other end of the
spectrum it is a single SoC we are trying squeeze the best possible
mileage out of. That implies optimizing for power gating to reach deeper
C-states when possible by consolidating idle-time and grouping
idle cpus. Migrating task unnecessarily isn't helping us in achieving
that, unfortunately :-(
Morten Rasmussen June 7, 2016, 12:08 p.m. UTC | #3
On Thu, Jun 02, 2016 at 10:05:09AM +0200, Peter Zijlstra wrote:
> On Wed, Jun 01, 2016 at 09:57:23PM +0200, Peter Zijlstra wrote:

> > On Mon, May 23, 2016 at 01:00:10PM +0100, Morten Rasmussen wrote:

> > > On Mon, May 23, 2016 at 01:12:07PM +0200, Mike Galbraith wrote:

> > > > On Mon, 2016-05-23 at 11:58 +0100, Morten Rasmussen wrote:

> > > > > wake_wide() is based on task wakee_flips of the waker and the wakee to

> > > > > decide whether an affine wakeup is desirable. On lightly loaded systems

> > > > > the waker is frequently the idle task (pid=0) which can accumulate a lot

> > > > > of wakee_flips in that scenario. It makes little sense to prevent affine

> > > > > wakeups on an idle cpu due to the idle task wakee_flips, so it makes

> > > > > more sense to ignore them in wake_wide().

> > > > 

> > > > You sure?  What's the difference between a task flipping enough to

> > > > warrant spreading the load, and an interrupt source doing the same? 

> > > >  I've both witnessed firsthand, and received user confirmation of this

> > > > very thing improving utilization.

> > > 

> > > Right, I didn't consider the interrupt source scenario, my fault.

> > > 

> > > The problem then seems to be distinguishing truly idle and busy doing

> > > interrupts. The issue that I observe is that wake_wide() likes pushing

> > > tasks around in lightly scenarios which isn't desirable for power

> > > management. Selecting the same cpu again may potentially let others

> > > reach deeper C-state.

> > > 

> > > With that in mind I will if I can do better. Suggestions are welcome :-)

> > 

> > Seeing how we always so select_idle_siblings() after affine_sd, the only

> > wake_affine movement that matters is cross-llc.

> > 

> > So intra-llc wakeups can avoid the movement, no?


I think so. I don't see a point in setting affine_sd if it is an
intra-llc wakeup. Maybe make it conditional on tmp->flags &
SD_SHARE_PKG_RESOURCES ?

I would help minimizing some intra-llc wakeup migrations, but not
inter-llc migrations in largely idle scenarios.

> Won't help I think; the interrupt that got us in this situation will

> already have wrecked your idle time/state to begin with. You really want

> to help interrupt routing.


Improved interrupt routing would be nice, but we need to get the
scheduler to not migrate the tasks away from the interrupts in those
partially idle scenarios if we should have chance optimizing idle
time/states by getting the interrupt routed to the cpu where the
consumer task is.
diff mbox

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c49e25a..0fe3020 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5007,6 +5007,10 @@  static int wake_wide(struct task_struct *p)
 	unsigned int slave = p->wakee_flips;
 	int factor = this_cpu_read(sd_llc_size);
 
+	/* Don't let the idle task prevent affine wakeups */
+	if (is_idle_task(current))
+		return 0;
+
 	if (master < slave)
 		swap(master, slave);
 	if (slave < factor || master < slave * factor)