[RFC] sched: Ensure cpu_power periodic update

Message ID 1323717668-2143-1-git-send-email-vincent.guittot@linaro.org
State Accepted
Headers show

Commit Message

Vincent Guittot Dec. 12, 2011, 7:21 p.m.
With a lot of small tasks, the softirq sched is nearly never called
when no_hz is enable. In this case the load_balance is mainly called with
the newly_idle mode which doesn't update the cpu_power.
Add a next_update field which ensure a maximum update period when
there is short activity

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 include/linux/sched.h |    1 +
 kernel/sched/fair.c   |   24 ++++++++++++++++--------
 2 files changed, 17 insertions(+), 8 deletions(-)

Comments

Peter Zijlstra Dec. 15, 2011, 10:08 a.m. | #1
On Mon, 2011-12-12 at 20:21 +0100, Vincent Guittot wrote:
> With a lot of small tasks, the softirq sched is nearly never called
> when no_hz is enable. In this case the load_balance is mainly called with
> the newly_idle mode which doesn't update the cpu_power.
> Add a next_update field which ensure a maximum update period when
> there is short activity

> +	if (local_group) {
> +		if (idle != CPU_NEWLY_IDLE) {
> +			if (balance_cpu != this_cpu) {
> +				*balance = 0;
> +				return;
> +			}
> +			update_group_power(sd, this_cpu);
> +		} else if (time_after_eq(jiffies, group->sgp->next_update))
> +			update_group_power(sd, this_cpu);
>  	}

Hmm, I would have expected it to be called from the NOHZ balancing path
instead of the new_idle path. Your changelog fails to mentions any
considerations on this..

Then again, its probably easier to keep update_group_power on this_cpu
than to allow a remote update of your cpu_power.

So I'm not opposed to this patch, I'd just like a little extra
clarification.
Vincent Guittot Dec. 15, 2011, 1:36 p.m. | #2
On 15 December 2011 11:08, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Mon, 2011-12-12 at 20:21 +0100, Vincent Guittot wrote:
>> With a lot of small tasks, the softirq sched is nearly never called
>> when no_hz is enable. In this case the load_balance is mainly called with
>> the newly_idle mode which doesn't update the cpu_power.
>> Add a next_update field which ensure a maximum update period when
>> there is short activity
>
>> +     if (local_group) {
>> +             if (idle != CPU_NEWLY_IDLE) {
>> +                     if (balance_cpu != this_cpu) {
>> +                             *balance = 0;
>> +                             return;
>> +                     }
>> +                     update_group_power(sd, this_cpu);
>> +             } else if (time_after_eq(jiffies, group->sgp->next_update))
>> +                     update_group_power(sd, this_cpu);
>>       }
>
> Hmm, I would have expected it to be called from the NOHZ balancing path
> instead of the new_idle path. Your changelog fails to mentions any
> considerations on this..
>

As we are not lucky, the small tasks are mainly running between ticks
and  the timer interrupt doesn't fire which implies that both
rebalance_domain of the cpu and nohz_balancer_kick are not called. We
have a lot of call to idle_balance() when cpus become idle and very
few calls to rebalance or nohz_idle_balance. If some tasks are rt
tasks, the cpu_power should be updated regularly to reflect current
use of the cpu by rt scheduler.

I'm using cyclictest to easily reproduce the problem on my dual cortex-A9

> Then again, its probably easier to keep update_group_power on this_cpu
> than to allow a remote update of your cpu_power.
>

This additional path for updating the cpu_power will only be used by
this_cpu because it is called by idle_balance. But we still have a
call to update_group_power by a remote cpu when nohz_idle_balance is
called.

> So I'm not opposed to this patch, I'd just like a little extra
> clarification.

Vincent
Suresh Siddha Dec. 16, 2011, 12:58 a.m. | #3
On Thu, 2011-12-15 at 05:36 -0800, Vincent Guittot wrote:
> I'm using cyclictest to easily reproduce the problem on my dual cortex-A9

So does the cyclictest itself exhibit the problem or running cyclictest
with another workload showed the problem? In other words, what numbers
of the workload did you see change with this patch?

> 
> > Then again, its probably easier to keep update_group_power on this_cpu
> > than to allow a remote update of your cpu_power.
> >
> 
> This additional path for updating the cpu_power will only be used by
> this_cpu because it is called by idle_balance. But we still have a
> call to update_group_power by a remote cpu when nohz_idle_balance is
> called.

As Vincent mentioned, the current mainline kernel already updates the
remote cpu's group_power in the nohz idle load balancing patch.

Also with all the recent nohz idle load balancing using kick, on a
dual-core system there may not be any nohz idle load balancing if
multiple tasks wakeup, run for short time and go back to idle before the
next tick. We rely on the wakeup balance to get it right in this case.

thanks,
suresh
Vincent Guittot Dec. 16, 2011, 8:23 a.m. | #4
On 16 December 2011 01:58, Suresh Siddha <suresh.b.siddha@intel.com> wrote:
> On Thu, 2011-12-15 at 05:36 -0800, Vincent Guittot wrote:
>> I'm using cyclictest to easily reproduce the problem on my dual cortex-A9
>
> So does the cyclictest itself exhibit the problem or running cyclictest
> with another workload showed the problem? In other words, what numbers
> of the workload did you see change with this patch?
>

Using a cyclictest -q -t 5 -D 4 on my dual cortex-A9 shows the fact
that the softirqs timer and sched are not called very often and the
cpu_power is nearly never updated.

I have also used the following sequence :

cyclictest -q -t 5 -D 4 &
sleep 2
cyclictest -q -t 3 --affinity=0 -p 99 -D 2

The cpu_power of cpu0 should start to decrease when the rt threads are
started. Without the patch, we must wait for the next sched softirq
for starting to update the cpu_power and we have no guarantee of the
maximum interval. With the patch, the cpu_power is updated regularly
using the balance_interval value.

Vincent

>>
>> > Then again, its probably easier to keep update_group_power on this_cpu
>> > than to allow a remote update of your cpu_power.
>> >
>>
>> This additional path for updating the cpu_power will only be used by
>> this_cpu because it is called by idle_balance. But we still have a
>> call to update_group_power by a remote cpu when nohz_idle_balance is
>> called.
>
> As Vincent mentioned, the current mainline kernel already updates the
> remote cpu's group_power in the nohz idle load balancing patch.
>
> Also with all the recent nohz idle load balancing using kick, on a
> dual-core system there may not be any nohz idle load balancing if
> multiple tasks wakeup, run for short time and go back to idle before the
> next tick. We rely on the wakeup balance to get it right in this case.
>
> thanks,
> suresh
>
Vincent Guittot Jan. 4, 2012, 8:22 a.m. | #5
Hi Peter,

Does it sound good for you if I update the comment of this patch with
the explanation of the previous mails or do you need more information
?

Thanks,
Vincent

On 16 December 2011 09:23, Vincent Guittot <vincent.guittot@linaro.org> wrote:
> On 16 December 2011 01:58, Suresh Siddha <suresh.b.siddha@intel.com> wrote:
>> On Thu, 2011-12-15 at 05:36 -0800, Vincent Guittot wrote:
>>> I'm using cyclictest to easily reproduce the problem on my dual cortex-A9
>>
>> So does the cyclictest itself exhibit the problem or running cyclictest
>> with another workload showed the problem? In other words, what numbers
>> of the workload did you see change with this patch?
>>
>
> Using a cyclictest -q -t 5 -D 4 on my dual cortex-A9 shows the fact
> that the softirqs timer and sched are not called very often and the
> cpu_power is nearly never updated.
>
> I have also used the following sequence :
>
> cyclictest -q -t 5 -D 4 &
> sleep 2
> cyclictest -q -t 3 --affinity=0 -p 99 -D 2
>
> The cpu_power of cpu0 should start to decrease when the rt threads are
> started. Without the patch, we must wait for the next sched softirq
> for starting to update the cpu_power and we have no guarantee of the
> maximum interval. With the patch, the cpu_power is updated regularly
> using the balance_interval value.
>
> Vincent
>
>>>
>>> > Then again, its probably easier to keep update_group_power on this_cpu
>>> > than to allow a remote update of your cpu_power.
>>> >
>>>
>>> This additional path for updating the cpu_power will only be used by
>>> this_cpu because it is called by idle_balance. But we still have a
>>> call to update_group_power by a remote cpu when nohz_idle_balance is
>>> called.
>>
>> As Vincent mentioned, the current mainline kernel already updates the
>> remote cpu's group_power in the nohz idle load balancing patch.
>>
>> Also with all the recent nohz idle load balancing using kick, on a
>> dual-core system there may not be any nohz idle load balancing if
>> multiple tasks wakeup, run for short time and go back to idle before the
>> next tick. We rely on the wakeup balance to get it right in this case.
>>
>> thanks,
>> suresh
>>
Peter Zijlstra Jan. 6, 2012, 12:10 p.m. | #6
On Wed, 2012-01-04 at 09:22 +0100, Vincent Guittot wrote:
> Does it sound good for you if I update the comment of this patch with
> the explanation of the previous mails or do you need more information
> ?
> 
No, looks ok. Just got stuck in the mailbox + extra holidays delay.

Got it queued now, Thanks!

Patch

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 64527c4..7178446 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -903,6 +903,7 @@  struct sched_group_power {
 	 * single CPU.
 	 */
 	unsigned int power, power_orig;
+	unsigned long next_update;
 	/*
 	 * Number of busy cpus in this group.
 	 */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a4d2b7a..809f484 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -215,6 +215,8 @@  calc_delta_mine(unsigned long delta_exec, unsigned long weight,
 
 const struct sched_class fair_sched_class;
 
+static unsigned long __read_mostly max_load_balance_interval = HZ/10;
+
 /**************************************************************
  * CFS operations on generic schedulable entities:
  */
@@ -3786,6 +3788,11 @@  void update_group_power(struct sched_domain *sd, int cpu)
 	struct sched_domain *child = sd->child;
 	struct sched_group *group, *sdg = sd->groups;
 	unsigned long power;
+	unsigned long interval;
+
+	interval = msecs_to_jiffies(sd->balance_interval);
+	interval = clamp(interval, 1UL, max_load_balance_interval);
+	sdg->sgp->next_update = jiffies + interval;
 
 	if (!child) {
 		update_cpu_power(sd, cpu);
@@ -3893,12 +3900,15 @@  static inline void update_sg_lb_stats(struct sched_domain *sd,
 	 * domains. In the newly idle case, we will allow all the cpu's
 	 * to do the newly idle load balance.
 	 */
-	if (idle != CPU_NEWLY_IDLE && local_group) {
-		if (balance_cpu != this_cpu) {
-			*balance = 0;
-			return;
-		}
-		update_group_power(sd, this_cpu);
+	if (local_group) {
+		if (idle != CPU_NEWLY_IDLE) {
+			if (balance_cpu != this_cpu) {
+				*balance = 0;
+				return;
+			}
+			update_group_power(sd, this_cpu);
+		} else if (time_after_eq(jiffies, group->sgp->next_update))
+			update_group_power(sd, this_cpu);
 	}
 
 	/* Adjust by relative CPU power of the group */
@@ -4917,8 +4927,6 @@  void select_nohz_load_balancer(int stop_tick)
 
 static DEFINE_SPINLOCK(balancing);
 
-static unsigned long __read_mostly max_load_balance_interval = HZ/10;
-
 /*
  * Scale the max load_balance interval with the number of CPUs in the system.
  * This trades load-balance latency on larger machines for less cross talk.