
[v3,08/12] sched: move cfs task on a CPU with higher capacity

Message ID 1404144343-18720-9-git-send-email-vincent.guittot@linaro.org
State New

Commit Message

Vincent Guittot June 30, 2014, 4:05 p.m. UTC
If a CPU spends a significant amount of time handling IRQs, trigger a load
balance to check whether it is worth moving its tasks to another CPU that has
more capacity available.

As a side note, this will not generate more spurious idle load balance (ILB)
kicks, because we already trigger an ILB if there is more than one busy CPU.
If this CPU is the only one that has a task, the ILB will be triggered once to
migrate the task.

The nohz_kick_needed() function has been cleaned up a bit while adding the new
test.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 58 +++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 41 insertions(+), 17 deletions(-)

Comments

Peter Zijlstra July 10, 2014, 11:18 a.m. UTC | #1
On Mon, Jun 30, 2014 at 06:05:39PM +0200, Vincent Guittot wrote:
> +		/*
> +		 * If the CPUs share their cache and the src_cpu's capacity is
> +		 * reduced because of other sched_class or IRQs, we trig an
> +		 * active balance to move the task
> +		 */
> +		if ((sd->flags & SD_SHARE_PKG_RESOURCES)
> +		 && ((capacity_orig_of(src_cpu) * 100) > (capacity_of(src_cpu) *
> +							sd->imbalance_pct)))
>  			return 1;

Why is this tied to shared caches?
Peter Zijlstra July 10, 2014, 11:24 a.m. UTC | #2
On Mon, Jun 30, 2014 at 06:05:39PM +0200, Vincent Guittot wrote:
> +		if ((sd->flags & SD_SHARE_PKG_RESOURCES)
> +		 && ((capacity_orig_of(src_cpu) * 100) > (capacity_of(src_cpu) *
> +							sd->imbalance_pct)))


> +		if ((rq->cfs.h_nr_running >= 1)
> +		 && ((rq->cpu_capacity * sd->imbalance_pct) <
> +					(rq->cpu_capacity_orig * 100))) {

That's just weird indentation, and operators should be at the end, not at
the beginning, of a line break.
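
For reference, a reformatted version of the condition along those lines could
read as follows; this is only a formatting sketch, not the actual v4 hunk:

		if ((sd->flags & SD_SHARE_PKG_RESOURCES) &&
		    (capacity_orig_of(src_cpu) * 100 >
		     capacity_of(src_cpu) * sd->imbalance_pct))
			return 1;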
Peter Zijlstra July 10, 2014, 11:31 a.m. UTC | #3
On Mon, Jun 30, 2014 at 06:05:39PM +0200, Vincent Guittot wrote:

You 'forgot' to update the comment that goes with nohz_kick_needed().

> @@ -7233,9 +7253,10 @@ static inline int nohz_kick_needed(struct rq *rq)
>  	struct sched_domain *sd;
>  	struct sched_group_capacity *sgc;
>  	int nr_busy, cpu = rq->cpu;
> +	bool kick = false;
>  
>  	if (unlikely(rq->idle_balance))
> +		return false;
>  
>         /*
>  	* We may be recently in ticked or tickless idle mode. At the first
> @@ -7249,38 +7270,41 @@ static inline int nohz_kick_needed(struct rq *rq)
>  	 * balancing.
>  	 */
>  	if (likely(!atomic_read(&nohz.nr_cpus)))
> +		return false;
>  
>  	if (time_before(now, nohz.next_balance))
> +		return false;
>  
>  	if (rq->nr_running >= 2)
> +		return true;
>  
>  	rcu_read_lock();
>  	sd = rcu_dereference(per_cpu(sd_busy, cpu));
>  	if (sd) {
>  		sgc = sd->groups->sgc;
>  		nr_busy = atomic_read(&sgc->nr_busy_cpus);
>  
> +		if (nr_busy > 1) {
> +			kick = true;
> +			goto unlock;
> +		}
> +
> +		if ((rq->cfs.h_nr_running >= 1)
> +		 && ((rq->cpu_capacity * sd->imbalance_pct) <
> +					(rq->cpu_capacity_orig * 100))) {
> +			kick = true;
> +			goto unlock;
> +		}

Again, why only for shared caches?

>  	}
>  
>  	sd = rcu_dereference(per_cpu(sd_asym, cpu));
>  	if (sd && (cpumask_first_and(nohz.idle_cpus_mask,
>  				  sched_domain_span(sd)) < cpu))
> +		kick = true;
>  
> +unlock:
>  	rcu_read_unlock();
> +	return kick;
>  }
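
As an illustration of the comment update requested above, the header comment
of nohz_kick_needed() could be extended along these lines (the wording is a
sketch, not taken from any posted version of the patch):

	/*
	 * Current heuristic for kicking the idle load balancer in the presence
	 * of an idle CPU in the system:
	 *   - this rq has more than one runnable task, or
	 *   - this rq still has a CFS task but its capacity is significantly
	 *     reduced by other sched classes or IRQ activity, or
	 *   - more than one CPU is busy in the sd_busy domain, or
	 *   - for SD_ASYM_PACKING, a lower-numbered CPU in the domain is idle.
	 */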

Vincent Guittot July 10, 2014, 1:59 p.m. UTC | #4
On 10 July 2014 13:24, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, Jun 30, 2014 at 06:05:39PM +0200, Vincent Guittot wrote:
>> +             if ((sd->flags & SD_SHARE_PKG_RESOURCES)
>> +              && ((capacity_orig_of(src_cpu) * 100) > (capacity_of(src_cpu) *
>> +                                                     sd->imbalance_pct)))
>
>
>> +             if ((rq->cfs.h_nr_running >= 1)
>> +              && ((rq->cpu_capacity * sd->imbalance_pct) <
>> +                                     (rq->cpu_capacity_orig * 100))) {
>
> That's just weird indentation, and operators should be at the end, not at
> the beginning, of a line break.

OK, I'm going to fix it.
Vincent Guittot July 10, 2014, 2:03 p.m. UTC | #5
On 10 July 2014 13:18, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, Jun 30, 2014 at 06:05:39PM +0200, Vincent Guittot wrote:
>> +             /*
>> +              * If the CPUs share their cache and the src_cpu's capacity is
>> +              * reduced because of other sched_class or IRQs, we trig an
>> +              * active balance to move the task
>> +              */
>> +             if ((sd->flags & SD_SHARE_PKG_RESOURCES)
>> +              && ((capacity_orig_of(src_cpu) * 100) > (capacity_of(src_cpu) *
>> +                                                     sd->imbalance_pct)))
>>                       return 1;
>
> Why is this tied to shared caches?

It's just to limit the policy change to a level where it can bring a
benefit without a performance regression. I'm not sure that we can do
that at any level without risk.
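
To make the threshold concrete: assuming, for example, imbalance_pct = 117 at
a cache-sharing level (the exact value depends on how the topology is set up),
the test capacity_orig * 100 > capacity * imbalance_pct fires once roughly 15%
of the CPU's original capacity is consumed by IRQs or other sched classes;
with capacity_orig = 1024, an active balance is considered as soon as
capacity_of(src_cpu) drops below about 875 (1024 * 100 / 117).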
Peter Zijlstra July 11, 2014, 2:51 p.m. UTC | #6
On Thu, Jul 10, 2014 at 04:03:51PM +0200, Vincent Guittot wrote:
> On 10 July 2014 13:18, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Mon, Jun 30, 2014 at 06:05:39PM +0200, Vincent Guittot wrote:
> >> +             /*
> >> +              * If the CPUs share their cache and the src_cpu's capacity is
> >> +              * reduced because of other sched_class or IRQs, we trig an
> >> +              * active balance to move the task
> >> +              */
> >> +             if ((sd->flags & SD_SHARE_PKG_RESOURCES)
> >> +              && ((capacity_orig_of(src_cpu) * 100) > (capacity_of(src_cpu) *
> >> +                                                     sd->imbalance_pct)))
> >>                       return 1;
> >
> > Why is this tied to shared caches?
> 
> It's just to limit the policy change to a level where it can bring a
> benefit without a performance regression. I'm not sure that we can do
> that at any level without risk.

Similar to the other change; so both details _should_ have been in the
changelogs etc.

In any case, it feels rather arbitrary to me. What about machines where
there's no cache sharing at all (the traditional SMP systems)? This
thing you're trying to do still seems to make sense there.

Vincent Guittot July 11, 2014, 3:17 p.m. UTC | #7
On 11 July 2014 16:51, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, Jul 10, 2014 at 04:03:51PM +0200, Vincent Guittot wrote:
>> On 10 July 2014 13:18, Peter Zijlstra <peterz@infradead.org> wrote:
>> > On Mon, Jun 30, 2014 at 06:05:39PM +0200, Vincent Guittot wrote:
>> >> +             /*
>> >> +              * If the CPUs share their cache and the src_cpu's capacity is
>> >> +              * reduced because of other sched_class or IRQs, we trig an
>> >> +              * active balance to move the task
>> >> +              */
>> >> +             if ((sd->flags & SD_SHARE_PKG_RESOURCES)
>> >> +              && ((capacity_orig_of(src_cpu) * 100) > (capacity_of(src_cpu) *
>> >> +                                                     sd->imbalance_pct)))
>> >>                       return 1;
>> >
>> > Why is this tied to shared caches?
>>
>> It's just to limit the policy change to a level where it can bring a
>> benefit without a performance regression. I'm not sure that we can do
>> that at any level without risk.
>
> Similar to the other change; so both details _should_ have been in the
> changelogs etc.

I'm going to add details in v4.

>
> In any case, it feels rather arbitrary to me. What about machines where
> there's no cache sharing at all (the traditional SMP systems)? This
> thing you're trying to do still seems to make sense there.

ok, I thought that traditional SMP systems have this flag set at core
level. I mean ARM platforms have the flag for CPUs in the same cluster
(which includes current ARM SMP systems), and the Core i7 of my laptop has
the flag at the core level.

>
Peter Zijlstra July 14, 2014, 1:51 p.m. UTC | #8
On Fri, Jul 11, 2014 at 05:17:44PM +0200, Vincent Guittot wrote:
> > In any case, it feels rather arbitrary to me. What about machines where
> > there's no cache sharing at all (the traditional SMP systems)? This
> > thing you're trying to do still seems to make sense there.
> 
> ok, I thought that traditional SMP systems have this flag set at core
> level. 

Yeah, with 1 core, so it's effectively disabled.

> I mean ARM platforms have the flag for CPUs in the same cluster
> (which includes current ARM SMP systems), and the Core i7 of my laptop has
> the flag at the core level.

So I can see 'small' parts reducing shared caches in order to improve
idle performance.

The point being that LLC seems a somewhat arbitrary measure for this.

Can we try and see what happens if you remove the limit? It's always best
to try the simpler things first and only make it more complex if we have
to.
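
For illustration only (an assumption about what the simplified version could
look like, not a quote from v4), dropping the LLC condition would reduce the
new need_active_balance() test to the capacity comparison alone:

		if ((capacity_orig_of(src_cpu) * 100) >
		    (capacity_of(src_cpu) * sd->imbalance_pct))
			return 1;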
Vincent Guittot July 15, 2014, 9:21 a.m. UTC | #9
On 14 July 2014 15:51, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Jul 11, 2014 at 05:17:44PM +0200, Vincent Guittot wrote:
>> > In any case, it feels rather arbitrary to me. What about machines where
>> > there's no cache sharing at all (the traditional SMP systems)? This
>> > thing you're trying to do still seems to make sense there.
>>
>> ok, I thought that traditional SMP systems have this flag set at core
>> level.
>
> Yeah, with 1 core, so it's effectively disabled.
>
>> I mean ARM platforms have the flag for CPUs in the same cluster
>> (which includes current ARM SMP systems), and the Core i7 of my laptop has
>> the flag at the core level.
>
> So I can see 'small' parts reducing shared caches in order to improve
> idle performance.
>
> The point being that LLC seems a somewhat arbitrary measure for this.
>
> Can we try and see what happens if you remove the limit? It's always best
> to try the simpler things first and only make it more complex if we have
> to.

OK, I will remove the condition in the next version.

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a23c938..742ad88 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5944,6 +5944,14 @@  static bool update_sd_pick_busiest(struct lb_env *env,
 			return true;
 	}
 
+	/*
+	 * The group capacity is reduced probably because of activity from other
+	 * sched class or interrupts which use part of the available capacity
+	 */
+	if ((sg->sgc->capacity_orig * 100) > (sgs->group_capacity *
+				env->sd->imbalance_pct))
+		return true;
+
 	return false;
 }
 
@@ -6421,13 +6429,24 @@  static int need_active_balance(struct lb_env *env)
 	struct sched_domain *sd = env->sd;
 
 	if (env->idle == CPU_NEWLY_IDLE) {
+		int src_cpu = env->src_cpu;
 
 		/*
 		 * ASYM_PACKING needs to force migrate tasks from busy but
 		 * higher numbered CPUs in order to pack all tasks in the
 		 * lowest numbered CPUs.
 		 */
-		if ((sd->flags & SD_ASYM_PACKING) && env->src_cpu > env->dst_cpu)
+		if ((sd->flags & SD_ASYM_PACKING) && src_cpu > env->dst_cpu)
+			return 1;
+
+		/*
+		 * If the CPUs share their cache and the src_cpu's capacity is
+		 * reduced because of other sched_class or IRQs, we trig an
+		 * active balance to move the task
+		 */
+		if ((sd->flags & SD_SHARE_PKG_RESOURCES)
+		 && ((capacity_orig_of(src_cpu) * 100) > (capacity_of(src_cpu) *
+							sd->imbalance_pct)))
 			return 1;
 	}
 
@@ -6529,6 +6548,8 @@  redo:
 
 	schedstat_add(sd, lb_imbalance[idle], env.imbalance);
 
+	env.src_cpu = busiest->cpu;
+
 	ld_moved = 0;
 	if (busiest->nr_running > 1) {
 		/*
@@ -6538,7 +6559,6 @@  redo:
 		 * correctly treated as an imbalance.
 		 */
 		env.flags |= LBF_ALL_PINNED;
-		env.src_cpu   = busiest->cpu;
 		env.src_rq    = busiest;
 		env.loop_max  = min(sysctl_sched_nr_migrate, busiest->nr_running);
 
@@ -7233,9 +7253,10 @@  static inline int nohz_kick_needed(struct rq *rq)
 	struct sched_domain *sd;
 	struct sched_group_capacity *sgc;
 	int nr_busy, cpu = rq->cpu;
+	bool kick = false;
 
 	if (unlikely(rq->idle_balance))
-		return 0;
+		return false;
 
        /*
 	* We may be recently in ticked or tickless idle mode. At the first
@@ -7249,38 +7270,41 @@  static inline int nohz_kick_needed(struct rq *rq)
 	 * balancing.
 	 */
 	if (likely(!atomic_read(&nohz.nr_cpus)))
-		return 0;
+		return false;
 
 	if (time_before(now, nohz.next_balance))
-		return 0;
+		return false;
 
 	if (rq->nr_running >= 2)
-		goto need_kick;
+		return true;
 
 	rcu_read_lock();
 	sd = rcu_dereference(per_cpu(sd_busy, cpu));
-
 	if (sd) {
 		sgc = sd->groups->sgc;
 		nr_busy = atomic_read(&sgc->nr_busy_cpus);
 
-		if (nr_busy > 1)
-			goto need_kick_unlock;
+		if (nr_busy > 1) {
+			kick = true;
+			goto unlock;
+		}
+
+		if ((rq->cfs.h_nr_running >= 1)
+		 && ((rq->cpu_capacity * sd->imbalance_pct) <
+					(rq->cpu_capacity_orig * 100))) {
+			kick = true;
+			goto unlock;
+		}
 	}
 
 	sd = rcu_dereference(per_cpu(sd_asym, cpu));
-
 	if (sd && (cpumask_first_and(nohz.idle_cpus_mask,
 				  sched_domain_span(sd)) < cpu))
-		goto need_kick_unlock;
+		kick = true;
 
+unlock:
 	rcu_read_unlock();
-	return 0;
-
-need_kick_unlock:
-	rcu_read_unlock();
-need_kick:
-	return 1;
+	return kick;
 }
 #else
 static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle) { }