sched/fair: fix rework of find_idlest_group()

Message ID 1571762798-25900-1-git-send-email-vincent.guittot@linaro.org
State Accepted
Commit 3318544b721d3072fdd1f85ee0f1f214c0b211ee
Series sched/fair: fix rework of find_idlest_group()

Commit Message

Vincent Guittot Oct. 22, 2019, 4:46 p.m. UTC
The task, for which the scheduler looks for the idlest group of CPUs, must
be discounted from all statistics in order to get a fair comparison
between groups. This includes utilization, load, nr_running and idle_cpus.

Such unfairness can be easily highlighted with the unixbench execl 1 task.
This test continuously calls execve() and the scheduler looks for the idlest
group/CPU on which it should place the task. Because the task runs on the
local group/CPU, the latter seems already busy even if there is nothing
else running on it. As a result, the scheduler will always select a
group/CPU other than the local one.

Fixes: 57abff067a08 ("sched/fair: Rework find_idlest_group()")
Reported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

---

This recovers most of the perf regression on my system and I have asked
Rong if he can rerun the test with the patch to check that it fixes his
system as well.

 kernel/sched/fair.c | 90 ++++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 83 insertions(+), 7 deletions(-)

-- 
2.7.4

Comments

kernel test robot Oct. 23, 2019, 7:50 a.m. UTC | #1
Tested-by: kernel test robot <rong.a.chen@intel.com>


On 10/23/2019 12:46 AM, Vincent Guittot wrote:
> The task, for which the scheduler looks for the idlest group of CPUs, must
> be discounted from all statistics in order to get a fair comparison
> between groups. This includes utilization, load, nr_running and idle_cpus.
>
> Such unfairness can be easily highlighted with the unixbench execl 1 task.
> This test continuously calls execve() and the scheduler looks for the idlest
> group/CPU on which it should place the task. Because the task runs on the
> local group/CPU, the latter seems already busy even if there is nothing
> else running on it. As a result, the scheduler will always select a
> group/CPU other than the local one.
>
> Fixes: 57abff067a08 ("sched/fair: Rework find_idlest_group()")
> Reported-by: kernel test robot <rong.a.chen@intel.com>
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>
> This recovers most of the perf regression on my system and I have asked
> Rong if he can rerun the test with the patch to check that it fixes his
> system as well.
>
>   kernel/sched/fair.c | 90 ++++++++++++++++++++++++++++++++++++++++++++++++-----
>   1 file changed, 83 insertions(+), 7 deletions(-)

Mel Gorman Oct. 30, 2019, 4:07 p.m. UTC | #2
On Tue, Oct 22, 2019 at 06:46:38PM +0200, Vincent Guittot wrote:
> The task, for which the scheduler looks for the idlest group of CPUs, must
> be discounted from all statistics in order to get a fair comparison
> between groups. This includes utilization, load, nr_running and idle_cpus.
>
> Such unfairness can be easily highlighted with the unixbench execl 1 task.
> This test continuously calls execve() and the scheduler looks for the idlest
> group/CPU on which it should place the task. Because the task runs on the
> local group/CPU, the latter seems already busy even if there is nothing
> else running on it. As a result, the scheduler will always select a
> group/CPU other than the local one.
>
> Fixes: 57abff067a08 ("sched/fair: Rework find_idlest_group()")
> Reported-by: kernel test robot <rong.a.chen@intel.com>
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

I had gotten fried by this point and had not queued this patch in advance,
so I don't want to comment one way or the other. However, I note this was
not picked up in tip, and it is probably best that this series go in as one
lump rather than separating out the fixes in the final merge. Otherwise
it'll trigger false positives from LKP.

-- 
Mel Gorman
SUSE Labs
Valentin Schneider Nov. 22, 2019, 2:37 p.m. UTC | #3
Hi Vincent,

I took the liberty of adding some commenting nits in my review. I
know this is already in tip, but as Mel pointed out this should be merged
with the rework when sent out to mainline (similar to the removal of
fix_small_imbalance() & the LB rework).

On 22/10/2019 17:46, Vincent Guittot wrote:
> The task, for which the scheduler looks for the idlest group of CPUs, must
> be discounted from all statistics in order to get a fair comparison
> between groups. This includes utilization, load, nr_running and idle_cpus.
>
> Such unfairness can be easily highlighted with the unixbench execl 1 task.
> This test continuously calls execve() and the scheduler looks for the idlest
> group/CPU on which it should place the task. Because the task runs on the
> local group/CPU, the latter seems already busy even if there is nothing
> else running on it. As a result, the scheduler will always select a
> group/CPU other than the local one.
>
> Fixes: 57abff067a08 ("sched/fair: Rework find_idlest_group()")
> Reported-by: kernel test robot <rong.a.chen@intel.com>
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>
> This recovers most of the perf regression on my system and I have asked
> Rong if he can rerun the test with the patch to check that it fixes his
> system as well.
>
>  kernel/sched/fair.c | 90 ++++++++++++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 83 insertions(+), 7 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a81c364..0ad4b21 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5379,6 +5379,36 @@ static unsigned long cpu_load(struct rq *rq)
>  {
>  	return cfs_rq_load_avg(&rq->cfs);
>  }
> +/*
> + * cpu_load_without - compute cpu load without any contributions from *p
> + * @cpu: the CPU which load is requested
> + * @p: the task which load should be discounted

For both @cpu and @p, s/which/whose/ (also applies to cpu_util_without()
which inspired this).

> + *
> + * The load of a CPU is defined by the load of tasks currently enqueued on that
> + * CPU as well as tasks which are currently sleeping after an execution on that
> + * CPU.
> + *
> + * This method returns the load of the specified CPU by discounting the load of
> + * the specified task, whenever the task is currently contributing to the CPU
> + * load.
> + */
> +static unsigned long cpu_load_without(struct rq *rq, struct task_struct *p)
> +{
> +	struct cfs_rq *cfs_rq;
> +	unsigned int load;
> +
> +	/* Task has no contribution or is new */
> +	if (cpu_of(rq) != task_cpu(p) || !READ_ONCE(p->se.avg.last_update_time))
> +		return cpu_load(rq);
> +
> +	cfs_rq = &rq->cfs;
> +	load = READ_ONCE(cfs_rq->avg.load_avg);
> +
> +	/* Discount task's util from CPU's util */

s/util/load

> +	lsub_positive(&load, task_h_load(p));
> +
> +	return load;
> +}
>  
>  static unsigned long capacity_of(int cpu)
>  {
> @@ -8117,10 +8147,55 @@ static inline enum fbq_type fbq_classify_rq(struct rq *rq)
>  struct sg_lb_stats;
>  
>  /*
> + * task_running_on_cpu - return 1 if @p is running on @cpu.
> + */
> +
> +static unsigned int task_running_on_cpu(int cpu, struct task_struct *p)
          ^^^^^^^^^^^^
That could very well be bool, right?


> +{
> +	/* Task has no contribution or is new */
> +	if (cpu != task_cpu(p) || !READ_ONCE(p->se.avg.last_update_time))
> +		return 0;
> +
> +	if (task_on_rq_queued(p))
> +		return 1;
> +
> +	return 0;
> +}
> +
> +/**
> + * idle_cpu_without - would a given CPU be idle without p ?
> + * @cpu: the processor on which idleness is tested.
> + * @p: task which should be ignored.
> + *
> + * Return: 1 if the CPU would be idle. 0 otherwise.
> + */
> +static int idle_cpu_without(int cpu, struct task_struct *p)
          ^^^
Ditto on the boolean return values

Vincent Guittot Nov. 25, 2019, 9:16 a.m. UTC | #4
On Fri, 22 Nov 2019 at 15:37, Valentin Schneider
<valentin.schneider@arm.com> wrote:
>
> Hi Vincent,
>
> I took the liberty of adding some commenting nits in my review. I
> know this is already in tip, but as Mel pointed out this should be merged
> with the rework when sent out to mainline (similar to the removal of
> fix_small_imbalance() & the LB rework).
>
> On 22/10/2019 17:46, Vincent Guittot wrote:
> > The task, for which the scheduler looks for the idlest group of CPUs, must
> > be discounted from all statistics in order to get a fair comparison
> > between groups. This includes utilization, load, nr_running and idle_cpus.
> >
> > Such unfairness can be easily highlighted with the unixbench execl 1 task.
> > This test continuously calls execve() and the scheduler looks for the idlest
> > group/CPU on which it should place the task. Because the task runs on the
> > local group/CPU, the latter seems already busy even if there is nothing
> > else running on it. As a result, the scheduler will always select a
> > group/CPU other than the local one.
> >
> > Fixes: 57abff067a08 ("sched/fair: Rework find_idlest_group()")
> > Reported-by: kernel test robot <rong.a.chen@intel.com>
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > ---
> >
> > This recovers most of the perf regression on my system and I have asked
> > Rong if he can rerun the test with the patch to check that it fixes his
> > system as well.
> >
> >  kernel/sched/fair.c | 90 ++++++++++++++++++++++++++++++++++++++++++++++++-----
> >  1 file changed, 83 insertions(+), 7 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index a81c364..0ad4b21 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -5379,6 +5379,36 @@ static unsigned long cpu_load(struct rq *rq)
> >  {
> >       return cfs_rq_load_avg(&rq->cfs);
> >  }
> > +/*
> > + * cpu_load_without - compute cpu load without any contributions from *p
> > + * @cpu: the CPU which load is requested
> > + * @p: the task which load should be discounted
>
> For both @cpu and @p, s/which/whose/ (also applies to cpu_util_without()
> which inspired this).


As you mentioned, this is inspired by cpu_util_without() and stays
consistent with it.

>
> > + *
> > + * The load of a CPU is defined by the load of tasks currently enqueued on that
> > + * CPU as well as tasks which are currently sleeping after an execution on that
> > + * CPU.
> > + *
> > + * This method returns the load of the specified CPU by discounting the load of
> > + * the specified task, whenever the task is currently contributing to the CPU
> > + * load.
> > + */
> > +static unsigned long cpu_load_without(struct rq *rq, struct task_struct *p)
> > +{
> > +     struct cfs_rq *cfs_rq;
> > +     unsigned int load;
> > +
> > +     /* Task has no contribution or is new */
> > +     if (cpu_of(rq) != task_cpu(p) || !READ_ONCE(p->se.avg.last_update_time))
> > +             return cpu_load(rq);
> > +
> > +     cfs_rq = &rq->cfs;
> > +     load = READ_ONCE(cfs_rq->avg.load_avg);
> > +
> > +     /* Discount task's util from CPU's util */
>
> s/util/load
>
> > +     lsub_positive(&load, task_h_load(p));
> > +
> > +     return load;
> > +}
> >
> >  static unsigned long capacity_of(int cpu)
> >  {
> > @@ -8117,10 +8147,55 @@ static inline enum fbq_type fbq_classify_rq(struct rq *rq)
> >  struct sg_lb_stats;
> >
> >  /*
> > + * task_running_on_cpu - return 1 if @p is running on @cpu.
> > + */
> > +
> > +static unsigned int task_running_on_cpu(int cpu, struct task_struct *p)
>           ^^^^^^^^^^^^
> That could very well be bool, right?
>
>
> > +{
> > +     /* Task has no contribution or is new */
> > +     if (cpu != task_cpu(p) || !READ_ONCE(p->se.avg.last_update_time))
> > +             return 0;
> > +
> > +     if (task_on_rq_queued(p))
> > +             return 1;
> > +
> > +     return 0;
> > +}
> > +
> > +/**
> > + * idle_cpu_without - would a given CPU be idle without p ?
> > + * @cpu: the processor on which idleness is tested.
> > + * @p: task which should be ignored.
> > + *
> > + * Return: 1 if the CPU would be idle. 0 otherwise.
> > + */
> > +static int idle_cpu_without(int cpu, struct task_struct *p)
>           ^^^
> Ditto on the boolean return values


This is an extension of idle_cpu(), which also returns int, and I wanted
to stay consistent with it.

So we might want to do some kind of cleanup or rewording of these
interfaces and their descriptions, but this should be done as a whole,
out of the scope of this patch, and it would be worth a dedicated patch
IMO because it would imply modifying other parts of the code that are not
covered by this patch, like idle_cpu() or cpu_util_without().


Valentin Schneider Nov. 25, 2019, 11:03 a.m. UTC | #5
On 25/11/2019 09:16, Vincent Guittot wrote:
> This is an extension of idle_cpu(), which also returns int, and I wanted
> to stay consistent with it.
>
> So we might want to do some kind of cleanup or rewording of these
> interfaces and their descriptions, but this should be done as a whole,
> out of the scope of this patch, and it would be worth a dedicated patch
> IMO because it would imply modifying other parts of the code that are not
> covered by this patch, like idle_cpu() or cpu_util_without().


Fair enough.
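
For reference, a minimal sketch of what the bool-returning helpers discussed
above might look like; this is illustrative only and is not part of the
applied patch (it only changes the return type of the functions shown in the
diff below):

static inline bool task_running_on_cpu(int cpu, struct task_struct *p)
{
	/* Task has no contribution or is new */
	if (cpu != task_cpu(p) || !READ_ONCE(p->se.avg.last_update_time))
		return false;

	return task_on_rq_queued(p);
}

static inline bool idle_cpu_without(int cpu, struct task_struct *p)
{
	struct rq *rq = cpu_rq(cpu);

	if (rq->curr != rq->idle && rq->curr != p)
		return false;

#ifdef CONFIG_SMP
	/* A pending remote wakeup means the CPU will not stay idle */
	if (!llist_empty(&rq->wake_list))
		return false;
#endif

	return true;
}

As noted above, the patch keeps int to stay consistent with idle_cpu(), so a
change like this would belong to a separate, broader cleanup.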

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a81c364..0ad4b21 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5379,6 +5379,36 @@  static unsigned long cpu_load(struct rq *rq)
 {
 	return cfs_rq_load_avg(&rq->cfs);
 }
+/*
+ * cpu_load_without - compute cpu load without any contributions from *p
+ * @cpu: the CPU which load is requested
+ * @p: the task which load should be discounted
+ *
+ * The load of a CPU is defined by the load of tasks currently enqueued on that
+ * CPU as well as tasks which are currently sleeping after an execution on that
+ * CPU.
+ *
+ * This method returns the load of the specified CPU by discounting the load of
+ * the specified task, whenever the task is currently contributing to the CPU
+ * load.
+ */
+static unsigned long cpu_load_without(struct rq *rq, struct task_struct *p)
+{
+	struct cfs_rq *cfs_rq;
+	unsigned int load;
+
+	/* Task has no contribution or is new */
+	if (cpu_of(rq) != task_cpu(p) || !READ_ONCE(p->se.avg.last_update_time))
+		return cpu_load(rq);
+
+	cfs_rq = &rq->cfs;
+	load = READ_ONCE(cfs_rq->avg.load_avg);
+
+	/* Discount task's util from CPU's util */
+	lsub_positive(&load, task_h_load(p));
+
+	return load;
+}
 
 static unsigned long capacity_of(int cpu)
 {
@@ -8117,10 +8147,55 @@  static inline enum fbq_type fbq_classify_rq(struct rq *rq)
 struct sg_lb_stats;
 
 /*
+ * task_running_on_cpu - return 1 if @p is running on @cpu.
+ */
+
+static unsigned int task_running_on_cpu(int cpu, struct task_struct *p)
+{
+	/* Task has no contribution or is new */
+	if (cpu != task_cpu(p) || !READ_ONCE(p->se.avg.last_update_time))
+		return 0;
+
+	if (task_on_rq_queued(p))
+		return 1;
+
+	return 0;
+}
+
+/**
+ * idle_cpu_without - would a given CPU be idle without p ?
+ * @cpu: the processor on which idleness is tested.
+ * @p: task which should be ignored.
+ *
+ * Return: 1 if the CPU would be idle. 0 otherwise.
+ */
+static int idle_cpu_without(int cpu, struct task_struct *p)
+{
+	struct rq *rq = cpu_rq(cpu);
+
+	if ((rq->curr != rq->idle) && (rq->curr != p))
+		return 0;
+
+	/*
+	 * rq->nr_running can't be used but an updated version without the
+	 * impact of p on cpu must be used instead. The updated nr_running
+	 * be computed and tested before calling idle_cpu_without().
+	 */
+
+#ifdef CONFIG_SMP
+	if (!llist_empty(&rq->wake_list))
+		return 0;
+#endif
+
+	return 1;
+}
+
+/*
  * update_sg_wakeup_stats - Update sched_group's statistics for wakeup.
- * @denv: The ched_domain level to look for idlest group.
+ * @sd: The sched_domain level to look for idlest group.
  * @group: sched_group whose statistics are to be updated.
  * @sgs: variable to hold the statistics for this group.
+ * @p: The task for which we look for the idlest group/CPU.
  */
 static inline void update_sg_wakeup_stats(struct sched_domain *sd,
 					  struct sched_group *group,
@@ -8133,21 +8208,22 @@  static inline void update_sg_wakeup_stats(struct sched_domain *sd,
 
 	for_each_cpu(i, sched_group_span(group)) {
 		struct rq *rq = cpu_rq(i);
+		unsigned int local;
 
-		sgs->group_load += cpu_load(rq);
+		sgs->group_load += cpu_load_without(rq, p);
 		sgs->group_util += cpu_util_without(i, p);
-		sgs->sum_h_nr_running += rq->cfs.h_nr_running;
+		local = task_running_on_cpu(i, p);
+		sgs->sum_h_nr_running += rq->cfs.h_nr_running - local;
 
-		nr_running = rq->nr_running;
+		nr_running = rq->nr_running - local;
 		sgs->sum_nr_running += nr_running;
 
 		/*
-		 * No need to call idle_cpu() if nr_running is not 0
+		 * No need to call idle_cpu_without() if nr_running is not 0
 		 */
-		if (!nr_running && idle_cpu(i))
+		if (!nr_running && idle_cpu_without(i, p))
 			sgs->idle_cpus++;
 
-
 	}
 
 	/* Check if task fits in the group */
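
To make the effect of the discounting concrete, here is a small stand-alone
sketch (illustrative only; it is not kernel code and the numbers are made up)
of how counting versus discounting the waking task changes the comparison
between its local CPU and a genuinely idle remote CPU:

#include <stdio.h>

/* Clamp-to-zero subtraction, in the spirit of lsub_positive() */
static unsigned long sub_positive(unsigned long val, unsigned long sub)
{
	return val > sub ? val - sub : 0;
}

struct toy_cpu {
	unsigned int nr_running;	/* includes the waking task if it runs here */
	unsigned long load_avg;		/* includes the waking task's load */
};

int main(void)
{
	/* CPU0 runs only the exec'ing task; CPU1 is idle. */
	struct toy_cpu cpu0 = { .nr_running = 1, .load_avg = 600 };
	struct toy_cpu cpu1 = { .nr_running = 0, .load_avg = 0 };
	unsigned long task_load = 600;	/* load of the task being placed */

	/* Raw statistics: CPU0 looks busy, so the task would be moved away. */
	printf("raw:        cpu0 load=%lu nr=%u | cpu1 load=%lu nr=%u\n",
	       cpu0.load_avg, cpu0.nr_running, cpu1.load_avg, cpu1.nr_running);

	/* Discounted statistics: CPU0 compares as idle, just like CPU1. */
	printf("discounted: cpu0 load=%lu nr=%u | cpu1 load=%lu nr=%u\n",
	       sub_positive(cpu0.load_avg, task_load), cpu0.nr_running - 1,
	       cpu1.load_avg, cpu1.nr_running);

	return 0;
}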