[v4] sched/freq: move call to cpufreq_update_util

Message ID 1573751251-3505-1-git-send-email-vincent.guittot@linaro.org
State New
Series [v4] sched/freq: move call to cpufreq_update_util

Commit Message

Vincent Guittot Nov. 14, 2019, 5:07 p.m. UTC
update_cfs_rq_load_avg() calls cfs_rq_util_change() every time PELT decays,
which might be inefficient when the cpufreq driver has a rate limit.

When a task is attached to a CPU, we have the following call path:

update_load_avg()
  update_cfs_rq_load_avg()
    cfs_rq_util_change --> trigger frequency update
  attach_entity_load_avg()
    cfs_rq_util_change --> trigger frequency update

The 1st frequency update will not take into account the utilization of the
newly attached task, and the 2nd one might be discarded because of the rate
limit in the cpufreq driver.

update_cfs_rq_load_avg() is only called by update_blocked_averages()
and update_load_avg(), so we can move the call to
cfs_rq_util_change()/cpufreq_update_util() into these 2 functions. It's also
interesting to notice that update_load_avg() already calls
cfs_rq_util_change() directly for the !SMP case.

This change also ensures that cpufreq_update_util() is called even when
there is no CFS runqueue left in the leaf_cfs_rq_list to update, but only
irq, rt or dl PELT signals.

Reported-by: Doug Smythies <dsmythies@telus.net>
Fixes: 039ae8bcf7a5 ("sched/fair: Fix O(nr_cgroups) in the load balancing path")
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>

---

This patch applies on tip/sched/urgent as there is a dependency on
commit b90f7c9d2198 ("sched/pelt: Fix update of blocked PELT ordering")

Changes for v4:
- updated comments
- added Reviewed-by and Acked-by 

 kernel/sched/fair.c | 47 ++++++++++++++++++++++++++++-------------------
 1 file changed, 28 insertions(+), 19 deletions(-)

-- 
2.7.4

Comments

Vincent Guittot Nov. 15, 2019, 10:29 a.m. UTC | #1
On Fri, 15 Nov 2019 at 11:18, Vincent Guittot
<vincent.guittot@linaro.org> wrote:
>
> On Fri, 15 Nov 2019 at 10:55, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Thu, Nov 14, 2019 at 06:07:31PM +0100, Vincent Guittot wrote:
> > > update_cfs_rq_load_avg() calls cfs_rq_util_change() everytime pelt decays,
> > > which might be inefficient when cpufreq driver has rate limitation.
> > >
> > > When a task is attached on a CPU, we have call path:
> > >
> > > update_load_avg()
> > >   update_cfs_rq_load_avg()
> > >     cfs_rq_util_change -- > trig frequency update
> > >   attach_entity_load_avg()
> > >     cfs_rq_util_change -- > trig frequency update
> > >
> > > The 1st frequency update will not take into account the utilization of the
> > > newly attached task and the 2nd one might be discard because of rate
> > > limitation of the cpufreq driver.
> >
> > Doesn't this just show that a dumb rate limit in the driver is broken?
>
> But the rate limit may come from HW constraints that forces to wait
> let say 4ms or even 10ms between each frequency update.
>
> >
> > > update_cfs_rq_load_avg() is only called by update_blocked_averages()
> > > and update_load_avg() so we can move the call to
> > > cfs_rq_util_change/cpufreq_update_util() into these 2 functions. It's also
> > > interesting to notice that update_load_avg() already calls directly
> > > cfs_rq_util_change() for !SMP case.
> > >
> > > This changes will also ensure that cpufreq_update_util() is called even
> > > when there is no more CFS rq in the leaf_cfs_rq_list to update but only
> > > irq, rt or dl pelt signals.
> >
> > I don't think it does that; that is, iirc the return value of
> > ___update_load_sum() is 1 every time a period lapses. So even if the avg
> > is 0 and doesn't change, it'll still return 1 on every period.
> >
> > Which is what that dumb rate-limit thing wants of course. But I'm still
> > thinking that it's stupid to do. If nothing changes, don't generate
> > events.
>
> When everything (irq, dl, rt, cfs) is 0, we don't generate events
> because update_blocked_averages is no more called because
> rq->has_blocked_load is clear
>
> With current implementation, if cfs is 0 but not irq, dl or rt, we
> don't call cpufreq_update_util because it is only called through cfs
>
> >
> > If anything, update_blocked_avgerages() should look at
> > @done/others_have_blocked() to emit events for rt,dl,irq.
>
> other_have_blocked can be set but no decay happened during the update
> and we don't need to call cpufreq_update_util
>
> >
> > So why are we making the scheduler code more ugly instead of fixing that
> > driver?

Also, I think that calling cfs_rq_util_change() in
attach_entity_load_avg() is not optimal because the attach can happen at
a child level before it has been propagated down to the root.
So I'm working on removing it from attach_entity_load_avg() and keeping
it only in update_load_avg(); this would help clean up the ugly part below:

-       } else if (decayed && (flags & UPDATE_TG))
-               update_tg_load_avg(cfs_rq, 0);
+       } else if (decayed) {
+               cfs_rq_util_change(cfs_rq, 0);
+
+               if (flags & UPDATE_TG)
+                       update_tg_load_avg(cfs_rq, 0);
+       }
 }
Peter Zijlstra Nov. 15, 2019, 10:37 a.m. UTC | #2
On Fri, Nov 15, 2019 at 11:18:00AM +0100, Vincent Guittot wrote:
> On Fri, 15 Nov 2019 at 10:55, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Thu, Nov 14, 2019 at 06:07:31PM +0100, Vincent Guittot wrote:
> > > update_cfs_rq_load_avg() calls cfs_rq_util_change() everytime pelt decays,
> > > which might be inefficient when cpufreq driver has rate limitation.
> > >
> > > When a task is attached on a CPU, we have call path:
> > >
> > > update_load_avg()
> > >   update_cfs_rq_load_avg()
> > >     cfs_rq_util_change -- > trig frequency update
> > >   attach_entity_load_avg()
> > >     cfs_rq_util_change -- > trig frequency update
> > >
> > > The 1st frequency update will not take into account the utilization of the
> > > newly attached task and the 2nd one might be discard because of rate
> > > limitation of the cpufreq driver.
> >
> > Doesn't this just show that a dumb rate limit in the driver is broken?
>
> But the rate limit may come from HW constraints that forces to wait
> let say 4ms or even 10ms between each frequency update.

Sure, but then it can still remember the value passed in last and use
that state later.

It doesn't _have_ to completely discard values.
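
Something along these lines, purely as a sketch (hypothetical names and
structure, not the actual intel_pstate code): rate limit the hardware write,
but keep the most recent request around so a later event can still apply it.

#include <stdbool.h>
#include <stdint.h>

struct ratelimited_gov {
        uint64_t last_write_ns;         /* time of the last hardware update */
        uint64_t min_interval_ns;       /* HW constraint, e.g. 4ms or 10ms */
        unsigned int pending_freq;      /* most recent requested frequency */
        bool pending;                   /* a request is waiting to be applied */
};

/* Placeholder for the real hardware write (e.g. an MSR write). */
static void hw_set_freq(unsigned int freq) { (void)freq; }

static void gov_update(struct ratelimited_gov *g, uint64_t now_ns,
                       unsigned int freq)
{
        /* Always remember the latest request, even if we can't act on it. */
        g->pending_freq = freq;
        g->pending = true;

        /* Only touch the hardware when the rate limit allows it. */
        if (now_ns - g->last_write_ns < g->min_interval_ns)
                return;

        hw_set_freq(g->pending_freq);
        g->last_write_ns = now_ns;
        g->pending = false;
}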

> > > update_cfs_rq_load_avg() is only called by update_blocked_averages()
> > > and update_load_avg() so we can move the call to
> > > cfs_rq_util_change/cpufreq_update_util() into these 2 functions. It's also
> > > interesting to notice that update_load_avg() already calls directly
> > > cfs_rq_util_change() for !SMP case.
> > >
> > > This changes will also ensure that cpufreq_update_util() is called even
> > > when there is no more CFS rq in the leaf_cfs_rq_list to update but only
> > > irq, rt or dl pelt signals.
> >
> > I don't think it does that; that is, iirc the return value of
> > ___update_load_sum() is 1 every time a period lapses. So even if the avg
> > is 0 and doesn't change, it'll still return 1 on every period.
> >
> > Which is what that dumb rate-limit thing wants of course. But I'm still
> > thinking that it's stupid to do. If nothing changes, don't generate
> > events.
>
> When everything (irq, dl, rt, cfs) is 0, we don't generate events
> because update_blocked_averages is no more called because
> rq->has_blocked_load is clear

Aah.. Ok, let me look at this thing again.

> > If anything, update_blocked_avgerages() should look at
> > @done/others_have_blocked() to emit events for rt,dl,irq.
>
> other_have_blocked can be set but no decay happened during the update
> and we don't need to call cpufreq_update_util

True.
Vincent Guittot Nov. 15, 2019, 11:05 a.m. UTC | #3
On Fri, 15 Nov 2019 at 11:41, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Nov 15, 2019 at 11:03:20AM +0100, Rafael J. Wysocki wrote:
> > On Fri, Nov 15, 2019 at 10:55 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> > > So why are we making the scheduler code more ugly instead of fixing that
> > > driver?
> >
> > I guess we could "fix" the driver by making it rate limit MSR writes
> > only, but I'm not sure if that would help.
>
> So it is not clear to me what exactly intel_pstate needs and why. Like I
> wrote in my reply to Vincent just now, it can still store the last
> value, even if it doesn't act on it right away.
>
> And it can then act on that stored value at a later event, whatever is
> appropriate.
>
> I'm just saying that generating superfluous events is silly. But
> possibly I read the patch wrong.

This is not the intent of the patch.

Before 039ae8bcf7a5 ("sched/fair: Fix O(nr_cgroups) in the load
balancing path"), the call to cpufreq was made through
update_cfs_rq_load_avg() even if cfs was already null but irq, rt
or dl was not. After that patch, cpufreq was no longer called in
that case.

This patch fixes that, making sure that cpufreq is called while
irq, rt or dl is not null.

It also removes a spurious call to cpufreq just before attaching a task.
Peter Zijlstra Nov. 15, 2019, 12:17 p.m. UTC | #4
On Fri, Nov 15, 2019 at 11:46:01AM +0100, Vincent Guittot wrote:
> On Fri, 15 Nov 2019 at 11:37, Peter Zijlstra <peterz@infradead.org> wrote:

> > Sure, but then it can still remember the value passed in last and use
> > that state later.
> >
> > It doesn't _have_ to completely discard values.
>
> yes but it means that we run at the "wrong" frequency during this
> period and also that the cpufreq must in this case set a kind of timer
> to resubmit a new frequency change out of scheduler event

It always runs at the wrong frequency. Almost per definition. We're
doing near future predictions based on recent past, and if we can only
set the hardware once every N [ms] then there's really nothing better we
can do. We'll _have_ to live with the value we 'randomly' pick at the
start of those N [ms] for the whole period.

And I'm not sure it needs to set a timer, it can simply probe the value
when we go idle, if this is the kind of system that cares about OPP on
idle.
Vincent Guittot Nov. 15, 2019, 1:30 p.m. UTC | #5
On Fri, 15 Nov 2019 at 14:02, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Nov 15, 2019 at 12:03:31PM +0100, Vincent Guittot wrote:
>
> > This patch does 2 things:
> > - fix the spurious call to cpufreq just before attaching a task
>
> Right, so that one doesn't concern me too much.
>
> > - make sure cpufreq is still called when cfs is 0 but not irq/rt or dl
>
> But per the rq->has_blocked_load logic we would mostly stop sending
> events once we reach all 0s.
>
> Now, most of those updates will be through _nohz_idle_balance() ->
> update_nohz_stats(), which are remote, which means intel_pstate is
> ignoring them anyway.
>
> Now the _nohz_idle_balance() -> update_blocked_averages() thing runs
> local, and that will update the one random idle CPU we picked to run
> nohz balance, but all others will be left where they were.
>
> So why does intel_pstate care... Esp. on SKL+ with per-core P state this
> is of dubious value.

Doug mentioned some periodic timers that were running on the CPUs

>
> Also, and maybe I should go read back, why do we care what the P state
> is when we're mostly in C states anyway? These are all idle CPUs,
> otherwise we wouldkn't be running update_blocked_averages() on them
> anyway.

AFAIU, they are not 100% idle, but they have periodic timers that will
fire and run at a higher P-state.

>
>
> Much confusion..
Vincent Guittot Nov. 15, 2019, 1:37 p.m. UTC | #6
On Fri, 15 Nov 2019 at 14:25, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Nov 14, 2019 at 06:07:31PM +0100, Vincent Guittot wrote:
>
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 69a81a5..3be44e1 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -3504,9 +3504,6 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
> >       cfs_rq->load_last_update_time_copy = sa->last_update_time;
> >  #endif
> >
> > -     if (decayed)
> > -             cfs_rq_util_change(cfs_rq, 0);
> > -
> >       return decayed;
> >  }
>
> This removes the call from the for_each_leaf_cfs_rq_safe() loop.
>
> > @@ -7543,18 +7544,19 @@ static void update_blocked_averages(int cpu)
> >       const struct sched_class *curr_class;
> >       struct rq_flags rf;
> >       bool done = true;
> > +     int decayed;
> >
> >       rq_lock_irqsave(rq, &rf);
> >       update_rq_clock(rq);
> >
> >       /*
> > -      * update_cfs_rq_load_avg() can call cpufreq_update_util(). Make sure
> > -      * that RT, DL and IRQ signals have been updated before updating CFS.
> > +      * update_load_avg() can call cpufreq_update_util(). Make sure that RT,
> > +      * DL and IRQ signals have been updated before updating CFS.
> >        */
> >       curr_class = rq->curr->sched_class;
> > -     update_rt_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &rt_sched_class);
> > -     update_dl_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &dl_sched_class);
> > -     update_irq_load_avg(rq, 0);
> > +     decayed = update_rt_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &rt_sched_class);
> > +     decayed |= update_dl_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &dl_sched_class);
> > +     decayed |= update_irq_load_avg(rq, 0);
>
> Should not all 3 have their windows aligned and thus alway return the
> exact same value?

rt and dl, yes, but not irq.

But having aligned windows doesn't mean that they will all decay.
One can have been updated just before (during a dequeue, for example),
or at least less than 1ms before.
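
A rough toy illustration of that point (simplified: real PELT carries the
leftover phase of each signal in period_contrib, but the idea is that a
signal only decays when a full ~1024us period has elapsed since its own
last update, so signals with aligned windows can still disagree):

#include <stdio.h>

static unsigned int periods_elapsed(unsigned long long last_us,
                                    unsigned long long now_us)
{
        return (unsigned int)((now_us - last_us) / 1024);
}

int main(void)
{
        unsigned long long now = 10800; /* time of update_blocked_averages() */

        /* rt was updated 500us ago at a dequeue: no full period, no decay */
        printf("rt  periods: %u\n", periods_elapsed(10300, now));
        /* irq was last updated ~3ms ago: it does decay at this update */
        printf("irq periods: %u\n", periods_elapsed(7700, now));
        return 0;
}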

>
> >
> >       /* Don't need periodic decay once load/util_avg are null */
> >       if (others_have_blocked(rq))
> > @@ -7567,9 +7569,13 @@ static void update_blocked_averages(int cpu)
> >       for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) {
> >               struct sched_entity *se;
> >
> > -             if (update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq))
> > +             if (update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq)) {
> >                       update_tg_load_avg(cfs_rq, 0);
> >
> > +                     if (cfs_rq == &rq->cfs)
> > +                             decayed = 1;
>
> And that restores it.
>
> But should not also rq->cfs's window be aligned with the above 3?
> Meaning that this one, with exception of the list_del, covers all 4.
>
> > +             }
> > +
> >               /* Propagate pending load changes to the parent, if any: */
> >               se = cfs_rq->tg->se[cpu];
> >               if (se && !skip_blocked_update(se))
> > @@ -7588,6 +7594,9 @@ static void update_blocked_averages(int cpu)
> >       }
> >
> >       update_blocked_load_status(rq, !done);
> > +
> > +     if (decayed)
> > +             cpufreq_update_util(rq, 0);
> >       rq_unlock_irqrestore(rq, &rf);
> >  }
> >
> > @@ -7644,22 +7653,22 @@ static inline void update_blocked_averages(int cpu)
> >       struct cfs_rq *cfs_rq = &rq->cfs;
> >       const struct sched_class *curr_class;
> >       struct rq_flags rf;
> > +     int decayed;
> >
> >       rq_lock_irqsave(rq, &rf);
> >       update_rq_clock(rq);
> >
> > -     /*
> > -      * update_cfs_rq_load_avg() can call cpufreq_update_util(). Make sure
> > -      * that RT, DL and IRQ signals have been updated before updating CFS.
> > -      */
> >       curr_class = rq->curr->sched_class;
> > -     update_rt_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &rt_sched_class);
> > -     update_dl_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &dl_sched_class);
> > -     update_irq_load_avg(rq, 0);
> > +     decayed = update_rt_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &rt_sched_class);
> > +     decayed |= update_dl_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &dl_sched_class);
> > +     decayed |= update_irq_load_avg(rq, 0);
> >
> > -     update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq);
> > +     decayed |= update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq);
>
> And that thus this one makes all 3 above redundant.
>
> >
> >       update_blocked_load_status(rq, cfs_rq_has_blocked(cfs_rq) || others_have_blocked(rq));
> > +
> > +     if (decayed)
> > +             cpufreq_update_util(rq, 0);
> >       rq_unlock_irqrestore(rq, &rf);
> >  }
>
> That is, I'm almost tempted to prefer a variant of your initial hack,
> that refuses to remove rq->cfs from the list.
>
> That avoids having to care about the rt,dl,irq decays (their windows
> align with rq->cfs) and makes smp/up similar.
>
>
> I still don't actually understand how any of this makes intel_pstate
> happy though.
Peter Zijlstra Nov. 15, 2019, 3:12 p.m. UTC | #7
On Fri, Nov 15, 2019 at 02:37:27PM +0100, Vincent Guittot wrote:
> On Fri, 15 Nov 2019 at 14:25, Peter Zijlstra <peterz@infradead.org> wrote:

> > Should not all 3 have their windows aligned and thus alway return the
> > exact same value?
>
> rt and dl yes but not irq
>
> But having aligned window doesn't mean that they will all decay.
> One can have been updated just before (during a dequeue as an example)
> or at least less than 1ms before

Now, the thing is, if that update happened in sched/rt, then it wouldn't
have called cpufreq anyway. And once we're idle longer than a period,
they'll all decay at once.

Except indeed that IRQ stuff, which runs out of sync.

That is, I'm just not convinced it matters much if we keep rq->cfs
on the list forever (like UP), because we'll only stop calling
update_blocked_averages() when everything has hit 0, and up until that
point we'll get one update per period from rq->cfs.

For good measure we can force an update when @done, at that point we
know all 0s.

How about something like this?

---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 545bcb90b4de..a99ac2aa4a23 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3508,9 +3508,6 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
 	cfs_rq->load_last_update_time_copy = sa->last_update_time;
 #endif
 
-	if (decayed)
-		cfs_rq_util_change(cfs_rq, 0);
-
 	return decayed;
 }
 
@@ -3620,8 +3617,12 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 		attach_entity_load_avg(cfs_rq, se, SCHED_CPUFREQ_MIGRATION);
 		update_tg_load_avg(cfs_rq, 0);
 
-	} else if (decayed && (flags & UPDATE_TG))
-		update_tg_load_avg(cfs_rq, 0);
+	} else if (decayed) {
+		cfs_rq_util_change(cfs_rq, 0);
+
+		if (flags & UPDATE_TG)
+			update_tg_load_avg(cfs_rq, 0);
+	}
 }
 
 #ifndef CONFIG_64BIT
@@ -7453,7 +7454,7 @@ static void update_blocked_averages(int cpu)
 	struct cfs_rq *cfs_rq, *pos;
 	const struct sched_class *curr_class;
 	struct rq_flags rf;
-	bool done = true;
+	bool done = true, decayed = false;
 
 	rq_lock_irqsave(rq, &rf);
 	update_rq_clock(rq);
@@ -7476,10 +7477,14 @@ static void update_blocked_averages(int cpu)
 	 * list_add_leaf_cfs_rq() for details.
 	 */
 	for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) {
+		bool last = cfs_rq == &rq->cfs;
 		struct sched_entity *se;
 
-		if (update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq))
+		if (update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq)) {
 			update_tg_load_avg(cfs_rq, 0);
+			if (last)
+				decayed = true;
+		}
 
 		/* Propagate pending load changes to the parent, if any: */
 		se = cfs_rq->tg->se[cpu];
@@ -7490,7 +7495,7 @@ static void update_blocked_averages(int cpu)
 		 * There can be a lot of idle CPU cgroups.  Don't let fully
 		 * decayed cfs_rqs linger on the list.
 		 */
-		if (cfs_rq_is_decayed(cfs_rq))
+		if (!last && cfs_rq_is_decayed(cfs_rq))
 			list_del_leaf_cfs_rq(cfs_rq);
 
 		/* Don't need periodic decay once load/util_avg are null */
@@ -7498,6 +7503,9 @@ static void update_blocked_averages(int cpu)
 			done = false;
 	}
 
+	if (decayed || done)
+		cpufreq_update_util(rq, 0);
+
 	update_blocked_load_status(rq, !done);
 	rq_unlock_irqrestore(rq, &rf);
 }
@@ -7555,6 +7563,7 @@ static inline void update_blocked_averages(int cpu)
 	struct cfs_rq *cfs_rq = &rq->cfs;
 	const struct sched_class *curr_class;
 	struct rq_flags rf;
+	bool done, decayed;
 
 	rq_lock_irqsave(rq, &rf);
 	update_rq_clock(rq);
@@ -7568,9 +7577,13 @@ static inline void update_blocked_averages(int cpu)
 	update_dl_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &dl_sched_class);
 	update_irq_load_avg(rq, 0);
 
-	update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq);
+	decayed = update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq);
+	done = !(cfs_rq_has_blocked(cfs_rq) || others_have_blocked(rq));
 
-	update_blocked_load_status(rq, cfs_rq_has_blocked(cfs_rq) || others_have_blocked(rq));
+	if (decayed || done)
+		cpufreq_update_util(rq, 0);
+
+	update_blocked_load_status(rq, !done);
 	rq_unlock_irqrestore(rq, &rf);
 }
Doug Smythies Nov. 15, 2019, 3:30 p.m. UTC | #8
Hi Peter,

On 2019.11.15 05:02 Peter Zijlstra wrote:
> On Fri, Nov 15, 2019 at 12:03:31PM +0100, Vincent Guittot wrote:
>
>> This patch does 2 things:
>> - fix the spurious call to cpufreq just before attaching a task
>
> Right, so that one doesn't concern me too much.
>
>> - make sure cpufreq is still called when cfs is 0 but not irq/rt or dl
>
> But per the rq->has_blocked_load logic we would mostly stop sending
> events once we reach all 0s.
>
> Now, most of those updates will be through _nohz_idle_balance() ->
> update_nohz_stats(), which are remote, which means intel_pstate is
> ignoring them anyway.
>
> Now the _nohz_idle_balance() -> update_blocked_averages() thing runs
> local, and that will update the one random idle CPU we picked to run
> nohz balance, but all others will be left where they were.
>
> So why does intel_pstate care... Esp. on SKL+ with per-core P state this
> is of dubious value.
>
> Also, and maybe I should go read back, why do we care what the P state
> is when we're mostly in C states anyway? These are all idle CPUs,
> otherwise we wouldkn't be running update_blocked_averages() on them
> anyway.
>
> Much confusion..

Background:

It is true that this is very likely a rare use case.
Apparently, I can make my test system considerably more "idle"
than most.

For many years now, I have never seen the time between calls,
per CPU, to the intel_pstate driver exceed 4 seconds.

Then, as of "sched/fair: Fix O(nr_cgroups) in the load balancing path",
and for an idle system, the time between calls could be as much as a
few hundred seconds. Not knowing much (anything) about scheduler
details, I found this odd and so investigated.

And yes, so who cares if we are in deep C states anyhow?
If, for whatever reason, the system is running with
"intel_idle.max_cstate=1" my findings were that
the processor could end up consuming a lot more energy
for a long, long time. Why? Because, at least for my
processor, an older i7-2600K (no HWP), in idle state 1 the
CPU does not relinquish its vote to the PLL, and with
no calls to the driver the requested P-state doesn't decay.

Not previously mentioned: the situation is considerably
exacerbated by this piece of boost code within the intel_pstate
driver:

        /*
         * If the average P-state during the previous cycle was higher than the
         * current target, add 50% of the difference to the target to reduce
         * possible performance oscillations and offset possible performance
         * loss related to moving the workload from one CPU to another within
         * a package/module.
         */
        avg_pstate = get_avg_pstate(cpu);
        if (avg_pstate > target)
                target += (avg_pstate - target) >> 1;
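
For a concrete feel (made-up numbers, not a measured trace), here is a toy
standalone version of that boost calculation; if the driver then stops being
called, the boosted request simply stays in place.

#include <stdio.h>

int main(void)
{
        int target = 16;        /* e.g. the low P-state the governor computed */
        int avg_pstate = 38;    /* average P-state over the previous cycle */

        /* Same 50% boost as the snippet above: 16 + (38 - 16) / 2 = 27. */
        if (avg_pstate > target)
                target += (avg_pstate - target) >> 1;

        printf("requested P-state: %d\n", target);      /* 27 instead of 16 */
        return 0;
}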

Hope this helps.

... Doug
Doug Smythies Nov. 16, 2019, 3:07 p.m. UTC | #9
Hi,

I tested both Vincent's V4 and Peter's version, which I called V1.

8000 seconds of intel_pstate_tracer on a very idle system.
Everything was fine.

Vincent-V4:

Maximum time between calls to the intel_pstate driver,
per CPU: 4.00813 seconds: GOOD/PASS

Total entries/exits to/from driver: 90,730:
Consistent with previous tests / expectations.

228,600 aperf clocks / second / cpu.  
Consistent with previous tests / expectations.

Peter-V1:

Maximum time between calls to the intel_pstate driver,
per CPU: 4.00407 seconds: GOOD/PASS

Total entries/exits to/from driver: 96,440:
Consistent with previous tests / expectations.

241,310 aperf clocks / second / cpu.  
Consistent with previous tests / expectations.

Baseline reference:

Maximum time between calls to the intel_pstate driver,
per CPU: 225.03 seconds: BAD/FAIL

Number of durations over an arbitrary threshold of
10 seconds: 379.

Total entries/exits to/from driver: 75,969:
Consistent with previous tests / expectations,
when the issue is present.

226,963 aperf clocks / second / cpu.
Consistent with previous tests / expectations.

I did not do the load no-load energy test.
It is not possible for it to be a problem
if the long duration issue is solved.
I'll do the test, if someone wants the proof.

Kernels were 5.4-rc7 + linux next.
Idle governor was TEO, but this isn't actually relevant.

Tested-by: Doug Smythies <dsmythies@telus.net>


... Doug

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 69a81a5..3be44e1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3504,9 +3504,6 @@  update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
 	cfs_rq->load_last_update_time_copy = sa->last_update_time;
 #endif
 
-	if (decayed)
-		cfs_rq_util_change(cfs_rq, 0);
-
 	return decayed;
 }
 
@@ -3616,8 +3613,12 @@  static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 		attach_entity_load_avg(cfs_rq, se, SCHED_CPUFREQ_MIGRATION);
 		update_tg_load_avg(cfs_rq, 0);
 
-	} else if (decayed && (flags & UPDATE_TG))
-		update_tg_load_avg(cfs_rq, 0);
+	} else if (decayed) {
+		cfs_rq_util_change(cfs_rq, 0);
+
+		if (flags & UPDATE_TG)
+			update_tg_load_avg(cfs_rq, 0);
+	}
 }
 
 #ifndef CONFIG_64BIT
@@ -7543,18 +7544,19 @@  static void update_blocked_averages(int cpu)
 	const struct sched_class *curr_class;
 	struct rq_flags rf;
 	bool done = true;
+	int decayed;
 
 	rq_lock_irqsave(rq, &rf);
 	update_rq_clock(rq);
 
 	/*
-	 * update_cfs_rq_load_avg() can call cpufreq_update_util(). Make sure
-	 * that RT, DL and IRQ signals have been updated before updating CFS.
+	 * update_load_avg() can call cpufreq_update_util(). Make sure that RT,
+	 * DL and IRQ signals have been updated before updating CFS.
 	 */
 	curr_class = rq->curr->sched_class;
-	update_rt_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &rt_sched_class);
-	update_dl_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &dl_sched_class);
-	update_irq_load_avg(rq, 0);
+	decayed = update_rt_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &rt_sched_class);
+	decayed |= update_dl_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &dl_sched_class);
+	decayed |= update_irq_load_avg(rq, 0);
 
 	/* Don't need periodic decay once load/util_avg are null */
 	if (others_have_blocked(rq))
@@ -7567,9 +7569,13 @@  static void update_blocked_averages(int cpu)
 	for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) {
 		struct sched_entity *se;
 
-		if (update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq))
+		if (update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq)) {
 			update_tg_load_avg(cfs_rq, 0);
 
+			if (cfs_rq == &rq->cfs)
+				decayed = 1;
+		}
+
 		/* Propagate pending load changes to the parent, if any: */
 		se = cfs_rq->tg->se[cpu];
 		if (se && !skip_blocked_update(se))
@@ -7588,6 +7594,9 @@  static void update_blocked_averages(int cpu)
 	}
 
 	update_blocked_load_status(rq, !done);
+
+	if (decayed)
+		cpufreq_update_util(rq, 0);
 	rq_unlock_irqrestore(rq, &rf);
 }
 
@@ -7644,22 +7653,22 @@  static inline void update_blocked_averages(int cpu)
 	struct cfs_rq *cfs_rq = &rq->cfs;
 	const struct sched_class *curr_class;
 	struct rq_flags rf;
+	int decayed;
 
 	rq_lock_irqsave(rq, &rf);
 	update_rq_clock(rq);
 
-	/*
-	 * update_cfs_rq_load_avg() can call cpufreq_update_util(). Make sure
-	 * that RT, DL and IRQ signals have been updated before updating CFS.
-	 */
 	curr_class = rq->curr->sched_class;
-	update_rt_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &rt_sched_class);
-	update_dl_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &dl_sched_class);
-	update_irq_load_avg(rq, 0);
+	decayed = update_rt_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &rt_sched_class);
+	decayed |= update_dl_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &dl_sched_class);
+	decayed |= update_irq_load_avg(rq, 0);
 
-	update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq);
+	decayed |= update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq);
 
 	update_blocked_load_status(rq, cfs_rq_has_blocked(cfs_rq) || others_have_blocked(rq));
+
+	if (decayed)
+		cpufreq_update_util(rq, 0);
 	rq_unlock_irqrestore(rq, &rf);
 }