[v2,02/11] sched: remove a wake_affine condition

Message ID 1400860385-14555-3-git-send-email-vincent.guittot@linaro.org
State New

Commit Message

Vincent Guittot May 23, 2014, 3:52 p.m. UTC
I have tried to understand the meaning of the condition:
 (this_load <= load &&
  this_load + target_load(prev_cpu, idx) <= tl_per_task)
but I failed to find a use case that can take advantage of it, and I haven't
found a description of it in the previous commits' logs.
Furthermore, the comment above the condition refers to the task_hot() function
that was used before being replaced by the current condition:
/*
 * This domain has SD_WAKE_AFFINE and
 * p is cache cold in this domain, and
 * there is no bad imbalance.
 */

If we look more closely at the condition below:
 this_load + target_load(prev_cpu, idx) <= tl_per_task

When sync is clear, we have:
 tl_per_task = runnable_load_avg / nr_running
 this_load = max(runnable_load_avg, cpuload[idx])
 target_load = max(runnable_load_avg', cpuload'[idx])

Matching the condition implies that runnable_load_avg' == 0 and nr_running <= 1.
The first test, this_load <= load, then implies that runnable_load_avg == 0 too,
but if this_load is zero, balanced has already been set and the test is
redundant.
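
To make the reduction concrete, here is a minimal user-space sketch of the
sync-clear case (all values are hypothetical), showing that the second test
can only pass when both loads are already zero, i.e. when balanced has
already been set:

#include <stdio.h>

/* Toy model of the sync == 0 case described above; the numbers are made
 * up and only illustrate the inequality, not the real scheduler state. */
static unsigned long max_ul(unsigned long a, unsigned long b)
{
	return a > b ? a : b;
}

int main(void)
{
	unsigned long runnable_load_avg = 2048, cpuload = 1024;	/* this cpu */
	unsigned long prev_runnable = 0, prev_cpuload = 0;	/* prev cpu */
	unsigned long nr_running = 2;

	unsigned long this_load = max_ul(runnable_load_avg, cpuload);
	unsigned long target_load = max_ul(prev_runnable, prev_cpuload);
	unsigned long tl_per_task = nr_running ?
			runnable_load_avg / nr_running : 0;

	/* this_load >= runnable_load_avg >= tl_per_task whenever anything
	 * runs here, so the sum only fits under tl_per_task when this_load
	 * and target_load are both 0 -- the balanced = true case. */
	printf("%d\n", this_load + target_load <= tl_per_task);	/* 0 */
	return 0;
}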

If sync is set, it's not as straightforward as above (especially if cgroups
are involved), but the policy should be similar, as we have removed a task
that is going to sleep in order to get more accurate load and this_load
values.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 25 +++++++------------------
 1 file changed, 7 insertions(+), 18 deletions(-)

Comments

Peter Zijlstra May 27, 2014, 12:48 p.m. UTC | #1
On Fri, May 23, 2014 at 05:52:56PM +0200, Vincent Guittot wrote:
> I have tried to understand the meaning of the condition :
>  (this_load <= load &&
>   this_load + target_load(prev_cpu, idx) <= tl_per_task)
> but i failed to find a use case that can take advantage of it and i haven't
> found description of it in the previous commits' log.

commit 2dd73a4f09beacadde827a032cf15fd8b1fa3d48

    int try_to_wake_up():
    
    in this function the value SCHED_LOAD_BALANCE is used to represent the load
    contribution of a single task in various calculations in the code that
    decides which CPU to put the waking task on.  While this would be a valid
    on a system where the nice values for the runnable tasks were distributed
    evenly around zero it will lead to anomalous load balancing if the
    distribution is skewed in either direction.  To overcome this problem
    SCHED_LOAD_SCALE has been replaced by the load_weight for the relevant task
    or by the average load_weight per task for the queue in question (as
    appropriate).

                        if ((tl <= load &&
-                               tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) ||
-                               100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {
+                               tl + target_load(cpu, idx) <= tl_per_task) ||
+                               100*(tl + p->load_weight) <= imbalance*load) {


commit a3f21bce1fefdf92a4d1705e888d390b10f3ac6f


+                       if ((tl <= load &&
+                               tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) ||
+                               100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {


So back when the code got introduced, it read:

	target_load(prev_cpu, idx) - sync*SCHED_LOAD_SCALE < source_load(this_cpu, idx) &&
	target_load(prev_cpu, idx) - sync*SCHED_LOAD_SCALE + target_load(this_cpu, idx) < SCHED_LOAD_SCALE

So while the first line makes some sense, the second line is still
somewhat challenging.

I read the second line as something like: if there's less than one full
task's worth of load running on the combined cpus.

Now for idx==0 this is hard, because even when sync=1 you can only make
it true if both cpus are completely idle, in which case you really want
to move to the waking cpu, I suppose.

One task running will have it == SCHED_LOAD_SCALE.

But for idx>0 this can trigger in all kinds of situations of light load.
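
To put toy numbers on that reading (all values hypothetical, assuming
SCHED_LOAD_SCALE == 1024):

#include <stdio.h>

#define SCHED_LOAD_SCALE 1024UL

int main(void)
{
	/* idx == 0, sync == 1: the waker's (maximum possible) contribution
	 * is subtracted from this_cpu, so with only the waker running both
	 * sides drop to 0 and the test passes only when the cpus are idle. */
	unsigned long this_load = 1 * SCHED_LOAD_SCALE - SCHED_LOAD_SCALE;
	unsigned long prev_load = 0;
	printf("idx==0: %d\n", this_load + prev_load < SCHED_LOAD_SCALE);	/* 1 */

	/* idx > 0: cpu_load[idx] is a slowly decaying average, so two
	 * lightly loaded cpus can report e.g. 300 each and still pass. */
	printf("idx>0:  %d\n", 300UL + 300UL < SCHED_LOAD_SCALE);	/* 1 */
	return 0;
}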
Peter Zijlstra May 27, 2014, 1:43 p.m. UTC | #2
On Fri, May 23, 2014 at 05:52:56PM +0200, Vincent Guittot wrote:
> 
> If sync is set, it's not as straight forward as above (especially if cgroup
> are involved) 

avg load with cgroups is 'interesting' alright.
Peter Zijlstra May 27, 2014, 1:45 p.m. UTC | #3
On Fri, May 23, 2014 at 05:52:56PM +0200, Vincent Guittot wrote:
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 9587ed1..30240ab 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4238,7 +4238,6 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
>  {
>  	s64 this_load, load;
>  	int idx, this_cpu, prev_cpu;
> -	unsigned long tl_per_task;
>  	struct task_group *tg;
>  	unsigned long weight;
>  	int balanced;
> @@ -4296,32 +4295,22 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
>  		balanced = this_eff_load <= prev_eff_load;
>  	} else
>  		balanced = true;
> +	schedstat_inc(p, se.statistics.nr_wakeups_affine_attempts);
>  
> +	if (!balanced)
> +		return 0;
>  	/*
>  	 * If the currently running task will sleep within
>  	 * a reasonable amount of time then attract this newly
>  	 * woken task:
>  	 */
> +	if (sync)
>  		return 1;
>  
> +	schedstat_inc(sd, ttwu_move_affine);
> +	schedstat_inc(p, se.statistics.nr_wakeups_affine);
>  
> +	return 1;
>  }

So I'm not usually one for schedstat nitpicking, but should we fix it in
the sync case? That is, for sync we return 1 but do not inc
nr_wakeups_affine, even though it's going to be an affine wakeup.


Vincent Guittot May 27, 2014, 3:19 p.m. UTC | #4
On 27 May 2014 14:48, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, May 23, 2014 at 05:52:56PM +0200, Vincent Guittot wrote:
>> I have tried to understand the meaning of the condition :
>>  (this_load <= load &&
>>   this_load + target_load(prev_cpu, idx) <= tl_per_task)
>> but i failed to find a use case that can take advantage of it and i haven't
>> found description of it in the previous commits' log.
>
> commit 2dd73a4f09beacadde827a032cf15fd8b1fa3d48
>
>     int try_to_wake_up():
>
>     in this function the value SCHED_LOAD_BALANCE is used to represent the load
>     contribution of a single task in various calculations in the code that
>     decides which CPU to put the waking task on.  While this would be a valid
>     on a system where the nice values for the runnable tasks were distributed
>     evenly around zero it will lead to anomalous load balancing if the
>     distribution is skewed in either direction.  To overcome this problem
>     SCHED_LOAD_SCALE has been replaced by the load_weight for the relevant task
>     or by the average load_weight per task for the queue in question (as
>     appropriate).
>
>                         if ((tl <= load &&
> -                               tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) ||
> -                               100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {
> +                               tl + target_load(cpu, idx) <= tl_per_task) ||
> +                               100*(tl + p->load_weight) <= imbalance*load) {

The oldest patch I had found was: https://lkml.org/lkml/2005/2/24/34
where task_hot had been replaced by:
+ if ((tl <= load &&
+     tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) ||
+     100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {

but as explained, I haven't found a clear explanation of this condition.

>
>
> commit a3f21bce1fefdf92a4d1705e888d390b10f3ac6f
>
>
> +                       if ((tl <= load &&
> +                               tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) ||
> +                               100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {
>
>
> So back when the code got introduced, it read:
>
>         target_load(prev_cpu, idx) - sync*SCHED_LOAD_SCALE < source_load(this_cpu, idx) &&
>         target_load(prev_cpu, idx) - sync*SCHED_LOAD_SCALE + target_load(this_cpu, idx) < SCHED_LOAD_SCALE
>
> So while the first line makes some sense, the second line is still
> somewhat challenging.
>
> I read the second line something like: if there's less than one full
> task running on the combined cpus.

OK, your explanation makes sense.

>
> Now for idx==0 this is hard, because even when sync=1 you can only make
> it true if both cpus are completely idle, in which case you really want
> to move to the waking cpu I suppose.

This use case is already taken into account by

if (this_load > 0)
..
else
 balanced = true

>
> One task running will have it == SCHED_LOAD_SCALE.
>
> But for idx>0 this can trigger in all kinds of situations of light load.

target_load() is the max between the load for idx == 0 and the load for the
selected idx, so we have even less chance of matching the condition: both
cpus must be completely idle.
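
For reference, the load-bias helpers being discussed looked roughly like
this in the 3.x-era kernel/sched/fair.c (a sketch from memory, not a
verbatim quote):

/* weighted_cpuload() is the cfs runnable load of the cpu; cpu_load[]
 * is the decayed load history indexed by idx. */
static unsigned long source_load(int cpu, int type)
{
	struct rq *rq = cpu_rq(cpu);
	unsigned long total = weighted_cpuload(cpu);

	if (type == 0 || !sched_feat(LB_BIAS))
		return total;

	/* bias towards keeping the task where it is */
	return min(rq->cpu_load[type-1], total);
}

static unsigned long target_load(int cpu, int type)
{
	struct rq *rq = cpu_rq(cpu);
	unsigned long total = weighted_cpuload(cpu);

	if (type == 0 || !sched_feat(LB_BIAS))
		return total;

	/* bias against pulling the task towards this cpu */
	return max(rq->cpu_load[type-1], total);
}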
Vincent Guittot May 27, 2014, 3:20 p.m. UTC | #5
On 27 May 2014 15:45, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, May 23, 2014 at 05:52:56PM +0200, Vincent Guittot wrote:
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 9587ed1..30240ab 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -4238,7 +4238,6 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
>>  {
>>       s64 this_load, load;
>>       int idx, this_cpu, prev_cpu;
>> -     unsigned long tl_per_task;
>>       struct task_group *tg;
>>       unsigned long weight;
>>       int balanced;
>> @@ -4296,32 +4295,22 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
>>               balanced = this_eff_load <= prev_eff_load;
>>       } else
>>               balanced = true;
>> +     schedstat_inc(p, se.statistics.nr_wakeups_affine_attempts);
>>
>> +     if (!balanced)
>> +             return 0;
>>       /*
>>        * If the currently running task will sleep within
>>        * a reasonable amount of time then attract this newly
>>        * woken task:
>>        */
>> +     if (sync)
>>               return 1;
>>
>> +     schedstat_inc(sd, ttwu_move_affine);
>> +     schedstat_inc(p, se.statistics.nr_wakeups_affine);
>>
>> +     return 1;
>>  }
>
> So I'm not usually one for schedstat nitpicking, but should we fix it in
> the sync case? That is, for sync we return 1 but do no inc
> nr_wakeups_affine, even though its going to be an affine wakeup.

OK, I'm going to move the schedstat update to the right place.
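
A minimal sketch of what that could look like (my reading, not the posted
v3): bump the affine counters before the early return, at which point the
sync shortcut no longer needs its own return path:

	schedstat_inc(p, se.statistics.nr_wakeups_affine_attempts);

	if (!balanced)
		return 0;

	/*
	 * Count the affine wakeup before any shortcut, so the sync case
	 * is accounted as well.
	 */
	schedstat_inc(sd, ttwu_move_affine);
	schedstat_inc(p, se.statistics.nr_wakeups_affine);

	return 1;
}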

Peter Zijlstra May 27, 2014, 3:39 p.m. UTC | #6
On Tue, May 27, 2014 at 05:19:02PM +0200, Vincent Guittot wrote:
> On 27 May 2014 14:48, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Fri, May 23, 2014 at 05:52:56PM +0200, Vincent Guittot wrote:
> >> I have tried to understand the meaning of the condition :
> >>  (this_load <= load &&
> >>   this_load + target_load(prev_cpu, idx) <= tl_per_task)
> >> but i failed to find a use case that can take advantage of it and i haven't
> >> found description of it in the previous commits' log.
> >
> > commit 2dd73a4f09beacadde827a032cf15fd8b1fa3d48
> >
> >     int try_to_wake_up():
> >
> >     in this function the value SCHED_LOAD_BALANCE is used to represent the load
> >     contribution of a single task in various calculations in the code that
> >     decides which CPU to put the waking task on.  While this would be a valid
> >     on a system where the nice values for the runnable tasks were distributed
> >     evenly around zero it will lead to anomalous load balancing if the
> >     distribution is skewed in either direction.  To overcome this problem
> >     SCHED_LOAD_SCALE has been replaced by the load_weight for the relevant task
> >     or by the average load_weight per task for the queue in question (as
> >     appropriate).
> >
> >                         if ((tl <= load &&
> > -                               tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) ||
> > -                               100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {
> > +                               tl + target_load(cpu, idx) <= tl_per_task) ||
> > +                               100*(tl + p->load_weight) <= imbalance*load) {
> 
> The oldest patch i had found was: https://lkml.org/lkml/2005/2/24/34
> where task_hot had been replaced by
> + if ((tl <= load &&
> + tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) ||
> + 100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {
> 
> but as explained, i haven't found a clear explanation of this condition

Yeah, that's the commit I quoted below; but I suppose we could ask Nick if
we really want to. I've heard he still replies to email, even though he's
locked up in a basement somewhere :-)

> > commit a3f21bce1fefdf92a4d1705e888d390b10f3ac6f
> >
> >
> > +                       if ((tl <= load &&
> > +                               tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) ||
> > +                               100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {
> >
> >
> > So back when the code got introduced, it read:
> >
> >         target_load(prev_cpu, idx) - sync*SCHED_LOAD_SCALE < source_load(this_cpu, idx) &&
> >         target_load(prev_cpu, idx) - sync*SCHED_LOAD_SCALE + target_load(this_cpu, idx) < SCHED_LOAD_SCALE
> >
> > So while the first line makes some sense, the second line is still
> > somewhat challenging.
> >
> > I read the second line something like: if there's less than one full
> > task running on the combined cpus.
> 
> ok. your explanation makes sense

Maybe, it's still slightly weird :-)

> >
> > Now for idx==0 this is hard, because even when sync=1 you can only make
> > it true if both cpus are completely idle, in which case you really want
> > to move to the waking cpu I suppose.
> 
> This use case is already taken into account by
> 
> if (this_load > 0)
> ..
> else
>  balance = true

Agreed.

> > One task running will have it == SCHED_LOAD_SCALE.
> >
> > But for idx>0 this can trigger in all kinds of situations of light load.
> 
> target_load is the max between load for idx == 0 and load for the
> selected idx so we have even less chance to match the condition : both
> cpu are completely idle

Ah, yes, I forgot to look at the target_load() thing and missed the max;
yes, that makes it all the less likely.
Vincent Guittot May 27, 2014, 4:14 p.m. UTC | #7
On 27 May 2014 17:39, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, May 27, 2014 at 05:19:02PM +0200, Vincent Guittot wrote:
>> On 27 May 2014 14:48, Peter Zijlstra <peterz@infradead.org> wrote:
>> > On Fri, May 23, 2014 at 05:52:56PM +0200, Vincent Guittot wrote:
>> >> I have tried to understand the meaning of the condition :
>> >>  (this_load <= load &&
>> >>   this_load + target_load(prev_cpu, idx) <= tl_per_task)
>> >> but i failed to find a use case that can take advantage of it and i haven't
>> >> found description of it in the previous commits' log.
>> >
>> > commit 2dd73a4f09beacadde827a032cf15fd8b1fa3d48
>> >
>> >     int try_to_wake_up():
>> >
>> >     in this function the value SCHED_LOAD_BALANCE is used to represent the load
>> >     contribution of a single task in various calculations in the code that
>> >     decides which CPU to put the waking task on.  While this would be a valid
>> >     on a system where the nice values for the runnable tasks were distributed
>> >     evenly around zero it will lead to anomalous load balancing if the
>> >     distribution is skewed in either direction.  To overcome this problem
>> >     SCHED_LOAD_SCALE has been replaced by the load_weight for the relevant task
>> >     or by the average load_weight per task for the queue in question (as
>> >     appropriate).
>> >
>> >                         if ((tl <= load &&
>> > -                               tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) ||
>> > -                               100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {
>> > +                               tl + target_load(cpu, idx) <= tl_per_task) ||
>> > +                               100*(tl + p->load_weight) <= imbalance*load) {
>>
>> The oldest patch i had found was: https://lkml.org/lkml/2005/2/24/34
>> where task_hot had been replaced by
>> + if ((tl <= load &&
>> + tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) ||
>> + 100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {
>>
>> but as explained, i haven't found a clear explanation of this condition
>
> Yeah, that's the commit I had below; but I suppose we could ask Nick if
> we really want, I've heard he still replies to email, even though he's
> locked up in a basement somewhere :-)

OK, I have added him to the list.

Nick,

While doing some rework on the wake-affine part of the scheduler, I
failed to catch the use case that takes advantage of a condition that
you added a while ago with commit
a3f21bce1fefdf92a4d1705e888d390b10f3ac6f.

Could you help us clarify the first 2 lines of the test that you added?

+                       if ((tl <= load &&
+                               tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) ||
+                               100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {

Regards,
Vincent

>
>> > commit a3f21bce1fefdf92a4d1705e888d390b10f3ac6f
>> >
>> >
>> > +                       if ((tl <= load &&
>> > +                               tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) ||
>> > +                               100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {
>> >
>> >
>> > So back when the code got introduced, it read:
>> >
>> >         target_load(prev_cpu, idx) - sync*SCHED_LOAD_SCALE < source_load(this_cpu, idx) &&
>> >         target_load(prev_cpu, idx) - sync*SCHED_LOAD_SCALE + target_load(this_cpu, idx) < SCHED_LOAD_SCALE
>> >
>> > So while the first line makes some sense, the second line is still
>> > somewhat challenging.
>> >
>> > I read the second line something like: if there's less than one full
>> > task running on the combined cpus.
>>
>> ok. your explanation makes sense
>
> Maybe, its still slightly weird :-)
>
>> >
>> > Now for idx==0 this is hard, because even when sync=1 you can only make
>> > it true if both cpus are completely idle, in which case you really want
>> > to move to the waking cpu I suppose.
>>
>> This use case is already taken into account by
>>
>> if (this_load > 0)
>> ..
>> else
>>  balance = true
>
> Agreed.
>
>> > One task running will have it == SCHED_LOAD_SCALE.
>> >
>> > But for idx>0 this can trigger in all kinds of situations of light load.
>>
>> target_load is the max between load for idx == 0 and load for the
>> selected idx so we have even less chance to match the condition : both
>> cpu are completely idle
>
> Ah, yes, I forgot to look at the target_load() thing and missed the max,
> yes that all makes it entirely less likely.
Vincent Guittot May 28, 2014, 6:49 a.m. UTC | #8
Using another email address for Nick

On 27 May 2014 18:14, Vincent Guittot <vincent.guittot@linaro.org> wrote:
> On 27 May 2014 17:39, Peter Zijlstra <peterz@infradead.org> wrote:
>> On Tue, May 27, 2014 at 05:19:02PM +0200, Vincent Guittot wrote:
>>> On 27 May 2014 14:48, Peter Zijlstra <peterz@infradead.org> wrote:
>>> > On Fri, May 23, 2014 at 05:52:56PM +0200, Vincent Guittot wrote:
>>> >> I have tried to understand the meaning of the condition :
>>> >>  (this_load <= load &&
>>> >>   this_load + target_load(prev_cpu, idx) <= tl_per_task)
>>> >> but i failed to find a use case that can take advantage of it and i haven't
>>> >> found description of it in the previous commits' log.
>>> >
>>> > commit 2dd73a4f09beacadde827a032cf15fd8b1fa3d48
>>> >
>>> >     int try_to_wake_up():
>>> >
>>> >     in this function the value SCHED_LOAD_BALANCE is used to represent the load
>>> >     contribution of a single task in various calculations in the code that
>>> >     decides which CPU to put the waking task on.  While this would be a valid
>>> >     on a system where the nice values for the runnable tasks were distributed
>>> >     evenly around zero it will lead to anomalous load balancing if the
>>> >     distribution is skewed in either direction.  To overcome this problem
>>> >     SCHED_LOAD_SCALE has been replaced by the load_weight for the relevant task
>>> >     or by the average load_weight per task for the queue in question (as
>>> >     appropriate).
>>> >
>>> >                         if ((tl <= load &&
>>> > -                               tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) ||
>>> > -                               100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {
>>> > +                               tl + target_load(cpu, idx) <= tl_per_task) ||
>>> > +                               100*(tl + p->load_weight) <= imbalance*load) {
>>>
>>> The oldest patch i had found was: https://lkml.org/lkml/2005/2/24/34
>>> where task_hot had been replaced by
>>> + if ((tl <= load &&
>>> + tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) ||
>>> + 100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {
>>>
>>> but as explained, i haven't found a clear explanation of this condition
>>
>> Yeah, that's the commit I had below; but I suppose we could ask Nick if
>> we really want, I've heard he still replies to email, even though he's
>> locked up in a basement somewhere :-)

OK, I have added him to the list.

Nick,

While doing some rework on the wake-affine part of the scheduler, I
failed to catch the use case that takes advantage of a condition that
you added a while ago with commit
a3f21bce1fefdf92a4d1705e888d390b10f3ac6f.

Could you help us clarify the first 2 lines of the test that you added?

+                       if ((tl <= load &&
+                               tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) ||
+                               100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {

Regards,
Vincent
>
>>
>>> > commit a3f21bce1fefdf92a4d1705e888d390b10f3ac6f
>>> >
>>> >
>>> > +                       if ((tl <= load &&
>>> > +                               tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) ||
>>> > +                               100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {
>>> >
>>> >
>>> > So back when the code got introduced, it read:
>>> >
>>> >         target_load(prev_cpu, idx) - sync*SCHED_LOAD_SCALE < source_load(this_cpu, idx) &&
>>> >         target_load(prev_cpu, idx) - sync*SCHED_LOAD_SCALE + target_load(this_cpu, idx) < SCHED_LOAD_SCALE
>>> >
>>> > So while the first line makes some sense, the second line is still
>>> > somewhat challenging.
>>> >
>>> > I read the second line something like: if there's less than one full
>>> > task running on the combined cpus.
>>>
>>> ok. your explanation makes sense
>>
>> Maybe, its still slightly weird :-)
>>
>>> >
>>> > Now for idx==0 this is hard, because even when sync=1 you can only make
>>> > it true if both cpus are completely idle, in which case you really want
>>> > to move to the waking cpu I suppose.
>>>
>>> This use case is already taken into account by
>>>
>>> if (this_load > 0)
>>> ..
>>> else
>>>  balance = true
>>
>> Agreed.
>>
>>> > One task running will have it == SCHED_LOAD_SCALE.
>>> >
>>> > But for idx>0 this can trigger in all kinds of situations of light load.
>>>
>>> target_load is the max between load for idx == 0 and load for the
>>> selected idx so we have even less chance to match the condition : both
>>> cpu are completely idle
>>
>> Ah, yes, I forgot to look at the target_load() thing and missed the max,
>> yes that all makes it entirely less likely.
Dietmar Eggemann May 28, 2014, 3:09 p.m. UTC | #9
Hi Vincent & Peter,

On 28/05/14 07:49, Vincent Guittot wrote:
[...]
> 
> Nick,
> 
> While doing some rework on the wake affine part of the scheduler, i
> failed to catch the use case that takes advantage of a condition that
> you added some while ago with the commit
> a3f21bce1fefdf92a4d1705e888d390b10f3ac6f
> 
> Could you help us to clarify the 2 first lines of the test that you added ?
> +                       if ((tl <= load &&
> +                               tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) ||
> +                               100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {
> 
> Regards,
> Vincent
>>
>>>
>>>>> commit a3f21bce1fefdf92a4d1705e888d390b10f3ac6f
>>>>>
>>>>>
>>>>> +                       if ((tl <= load &&
>>>>> +                               tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) ||
>>>>> +                               100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {
>>>>>
>>>>>
>>>>> So back when the code got introduced, it read:
>>>>>
>>>>>         target_load(prev_cpu, idx) - sync*SCHED_LOAD_SCALE < source_load(this_cpu, idx) &&
>>>>>         target_load(prev_cpu, idx) - sync*SCHED_LOAD_SCALE + target_load(this_cpu, idx) < SCHED_LOAD_SCALE
>>>>>

Shouldn't this be 

target_load(this_cpu, idx) - sync*SCHED_LOAD_SCALE <= source_load(prev_cpu, idx) &&
target_load(this_cpu, idx) - sync*SCHED_LOAD_SCALE + target_load(prev_cpu, idx) <= SCHED_LOAD_SCALE

"[PATCH] sched: implement smpnice" (2dd73a4f09beacadde827a032cf15fd8b1fa3d48)
mentions that SCHED_LOAD_BALANCE (IMHO, should be SCHED_LOAD_SCALE) represents
the load contribution of a single task. So I read the second part as if
the sum of the load of this_cpu and prev_cpu is smaller or equal to the
(maximal) load contribution (maximal possible effect) of a single task.

There is even a comment in "[PATCH] sched: tweak affine wakeups"
(a3f21bce1fefdf92a4d1705e888d390b10f3ac6f) in try_to_wake_up() when
SCHED_LOAD_SCALE gets subtracted from tl = this_load =
target_load(this_cpu, idx):

+ * If sync wakeup then subtract the (maximum possible)
+ * effect of the currently running task from the load
+ * of the current CPU:

"[PATCH] sched: implement smpnice" then replaces SCHED_LOAD_SCALE w/ 

+static inline unsigned long cpu_avg_load_per_task(int cpu)
+{
+       runqueue_t *rq = cpu_rq(cpu);
+       unsigned long n = rq->nr_running;
+
+       return n ?  rq->raw_weighted_load / n : SCHED_LOAD_SCALE;

-- Dietmar

>>>>> So while the first line makes some sense, the second line is still
>>>>> somewhat challenging.
>>>>>
>>>>> I read the second line something like: if there's less than one full
>>>>> task running on the combined cpus.
>>>>
>>>> ok. your explanation makes sense
>>>
>>> Maybe, its still slightly weird :-)
>>>
>>>>>
[...]

Vincent Guittot May 28, 2014, 3:25 p.m. UTC | #10
On 28 May 2014 17:09, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> Hi Vincent & Peter,
>
> On 28/05/14 07:49, Vincent Guittot wrote:
> [...]
>>
>> Nick,
>>
>> While doing some rework on the wake affine part of the scheduler, i
>> failed to catch the use case that takes advantage of a condition that
>> you added some while ago with the commit
>> a3f21bce1fefdf92a4d1705e888d390b10f3ac6f
>>
>> Could you help us to clarify the 2 first lines of the test that you added ?
>> +                       if ((tl <= load &&
>> +                               tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) ||
>> +                               100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {
>>
>> Regards,
>> Vincent
>>>
>>>>
>>>>>> commit a3f21bce1fefdf92a4d1705e888d390b10f3ac6f
>>>>>>
>>>>>>
>>>>>> +                       if ((tl <= load &&
>>>>>> +                               tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) ||
>>>>>> +                               100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {
>>>>>>
>>>>>>
>>>>>> So back when the code got introduced, it read:
>>>>>>
>>>>>>         target_load(prev_cpu, idx) - sync*SCHED_LOAD_SCALE < source_load(this_cpu, idx) &&
>>>>>>         target_load(prev_cpu, idx) - sync*SCHED_LOAD_SCALE + target_load(this_cpu, idx) < SCHED_LOAD_SCALE
>>>>>>
>
> Shouldn't this be
>
> target_load(this_cpu, idx) - sync*SCHED_LOAD_SCALE <= source_load(prev_cpu, idx) &&
> target_load(this_cpu, idx) - sync*SCHED_LOAD_SCALE + target_load(prev_cpu, idx) <= SCHED_LOAD_SCALE

yes, there was a typo in Peter's explanation.

>
> "[PATCH] sched: implement smpnice" (2dd73a4f09beacadde827a032cf15fd8b1fa3d48)
> mentions that SCHED_LOAD_BALANCE (IMHO, should be SCHED_LOAD_SCALE) represents
> the load contribution of a single task. So I read the second part as if
> the sum of the load of this_cpu and prev_cpu is smaller or equal to the
> (maximal) load contribution (maximal possible effect) of a single task.
>
> There is even a comment in "[PATCH] sched: tweak affine wakeups"
> (a3f21bce1fefdf92a4d1705e888d390b10f3ac6f) in try_to_wake_up() when
> SCHED_LOAD_SCALE gets subtracted from tl = this_load =
> target_load(this_cpu, idx):
>
> + * If sync wakeup then subtract the (maximum possible)
> + * effect of the currently running task from the load
> + * of the current CPU:
>
> "[PATCH] sched: implement smpnice" then replaces SCHED_LOAD_SCALE w/
>
> +static inline unsigned long cpu_avg_load_per_task(int cpu)
> +{
> +       runqueue_t *rq = cpu_rq(cpu);
> +       unsigned long n = rq->nr_running;
> +
> +       return n ?  rq->raw_weighted_load / n : SCHED_LOAD_SCALE;
>
> -- Dietmar
>
>>>>>> So while the first line makes some sense, the second line is still
>>>>>> somewhat challenging.
>>>>>>
>>>>>> I read the second line something like: if there's less than one full
>>>>>> task running on the combined cpus.
>>>>>
>>>>> ok. your explanation makes sense
>>>>
>>>> Maybe, its still slightly weird :-)
>>>>
>>>>>>
> [...]
>

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9587ed1..30240ab 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4238,7 +4238,6 @@  static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 {
 	s64 this_load, load;
 	int idx, this_cpu, prev_cpu;
-	unsigned long tl_per_task;
 	struct task_group *tg;
 	unsigned long weight;
 	int balanced;
@@ -4296,32 +4295,22 @@  static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 		balanced = this_eff_load <= prev_eff_load;
 	} else
 		balanced = true;
+	schedstat_inc(p, se.statistics.nr_wakeups_affine_attempts);
 
+	if (!balanced)
+		return 0;
 	/*
 	 * If the currently running task will sleep within
 	 * a reasonable amount of time then attract this newly
 	 * woken task:
 	 */
-	if (sync && balanced)
+	if (sync)
 		return 1;
 
-	schedstat_inc(p, se.statistics.nr_wakeups_affine_attempts);
-	tl_per_task = cpu_avg_load_per_task(this_cpu);
-
-	if (balanced ||
-	    (this_load <= load &&
-	     this_load + target_load(prev_cpu, idx) <= tl_per_task)) {
-		/*
-		 * This domain has SD_WAKE_AFFINE and
-		 * p is cache cold in this domain, and
-		 * there is no bad imbalance.
-		 */
-		schedstat_inc(sd, ttwu_move_affine);
-		schedstat_inc(p, se.statistics.nr_wakeups_affine);
+	schedstat_inc(sd, ttwu_move_affine);
+	schedstat_inc(p, se.statistics.nr_wakeups_affine);
 
-		return 1;
-	}
-	return 0;
+	return 1;
 }
 
 /*