[RFC,PATCHC,3/3] sched/fair: use the idle state info to choose the idlest cpu

Message ID 1396009796-31598-4-git-send-email-daniel.lezcano@linaro.org
State New
Headers show

Commit Message

Daniel Lezcano March 28, 2014, 12:29 p.m.
As we know in which idle state the cpu is, we can investigate the following:

1. when did the cpu entered the idle state ? the longer the cpu is idle, the
deeper it is idle
2. what exit latency is ? the greater the exit latency is, the deeper it is

With both information, when all cpus are idle, we can choose the idlest cpu.

When one cpu is not idle, the old check against weighted load applies.

Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
---
 kernel/sched/fair.c |   46 ++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 40 insertions(+), 6 deletions(-)

Comments

Nicolas Pitre April 2, 2014, 3:05 a.m. | #1
On Fri, 28 Mar 2014, Daniel Lezcano wrote:

> As we know in which idle state the cpu is, we can investigate the following:
> 
> 1. when did the cpu entered the idle state ? the longer the cpu is idle, the
> deeper it is idle
> 2. what exit latency is ? the greater the exit latency is, the deeper it is
> 
> With both information, when all cpus are idle, we can choose the idlest cpu.
> 
> When one cpu is not idle, the old check against weighted load applies.
> 
> Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>

There seems to be some problems with the implementation.

> @@ -4336,20 +4337,53 @@ static int
>  find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
>  {
>  	unsigned long load, min_load = ULONG_MAX;
> -	int idlest = -1;
> +	unsigned int min_exit_latency = UINT_MAX;
> +	u64 idle_stamp, min_idle_stamp = ULONG_MAX;

I don't think you really meant to assign an u64 variable with ULONG_MAX.
You probably want ULLONG_MAX here.  And probably not in fact (more 
later).

> +
> +	struct rq *rq;
> +	struct cpuidle_power *power;
> +
> +	int cpu_idle = -1;
> +	int cpu_busy = -1;
>  	int i;
>  
>  	/* Traverse only the allowed CPUs */
>  	for_each_cpu_and(i, sched_group_cpus(group), tsk_cpus_allowed(p)) {
> -		load = weighted_cpuload(i);
>  
> -		if (load < min_load || (load == min_load && i == this_cpu)) {
> -			min_load = load;
> -			idlest = i;
> +		if (idle_cpu(i)) {
> +
> +			rq = cpu_rq(i);
> +			power = rq->power;
> +			idle_stamp = rq->idle_stamp;
> +
> +			/* The cpu is idle since a shorter time */
> +			if (idle_stamp < min_idle_stamp) {
> +				min_idle_stamp = idle_stamp;
> +				cpu_idle = i;
> +				continue;

Don't you want the highest time stamp in order to select the most 
recently idled CPU?  Favoring the CPU which has been idle the longest 
makes little sense.

> +			}
> +
> +			/* The cpu is idle but the exit_latency is shorter */
> +			if (power && power->exit_latency < min_exit_latency) {
> +				min_exit_latency = power->exit_latency;
> +				cpu_idle = i;
> +				continue;
> +			}

I think this is wrong.  This gives priority to CPUs which have been idle 
for a (longer... although this should have been) shorter period of time 
over those with a shallower idle state.  I think this should rather be:

	if (power && power->exit_latency < min_exit_latency) {
		min_exit_latency = power->exit_latency;
		latest_idle_stamp = idle_stamp;
	       	cpu_idle = i;
	} else if ((!power || power->exit_latency == min_exit_latency) &&
		   idle_stamp > latest_idle_stamp) {
		latest_idle_stamp = idle_stamp;
		cpu_idle = i;
	}

So the CPU with the shallowest idle state is selected in priority, and 
if many CPUs are in the same state then the time stamp is used to 
select the most recent one. Whenever 
a shallower idle state is found then the latest_idle_stamp is reset for 
that state even if it is further in the past.

> +		} else {
> +
> +			load = weighted_cpuload(i);
> +
> +			if (load < min_load ||
> +			    (load == min_load && i == this_cpu)) {
> +				min_load = load;
> +				cpu_busy = i;
> +				continue;
> +			}
>  		}

I think this is wrong to do an if-else based on idle_cpu() here.  What 
if a CPU is heavily loaded, but for some reason it happens to be idle at 
this very moment?  With your patch it could be selected as an idle CPU 
while it would be discarded as being too busy otherwise.

It is important to determine both cpu_busy and cpu_idle for all CPUs.

And cpu_busy is a bad name for this.  Something like least_loaded would 
be more self explanatory.  Same thing for cpu_idle which could be 
clearer if named shalloest_idle.

> -	return idlest;
> +	/* Busy cpus are considered less idle than idle cpus ;) */
> +	return cpu_busy != -1 ? cpu_busy : cpu_idle;

And finally it is a policy decision whether or not we want to return 
least_loaded over shallowest_idle e.g do we pack tasks on non idle CPUs 
first or not.  That in itself needs more investigation.  To keep the 
existing policy unchanged for now the above condition should have its 
variables swapped.


Nicolas
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Rafael J. Wysocki April 4, 2014, 11:57 a.m. | #2
On Tuesday, April 01, 2014 11:05:49 PM Nicolas Pitre wrote:
> On Fri, 28 Mar 2014, Daniel Lezcano wrote:
> 
> > As we know in which idle state the cpu is, we can investigate the following:
> > 
> > 1. when did the cpu entered the idle state ? the longer the cpu is idle, the
> > deeper it is idle
> > 2. what exit latency is ? the greater the exit latency is, the deeper it is
> > 
> > With both information, when all cpus are idle, we can choose the idlest cpu.
> > 
> > When one cpu is not idle, the old check against weighted load applies.
> > 
> > Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
> 
> There seems to be some problems with the implementation.
> 
> > @@ -4336,20 +4337,53 @@ static int
> >  find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
> >  {
> >  	unsigned long load, min_load = ULONG_MAX;
> > -	int idlest = -1;
> > +	unsigned int min_exit_latency = UINT_MAX;
> > +	u64 idle_stamp, min_idle_stamp = ULONG_MAX;
> 
> I don't think you really meant to assign an u64 variable with ULONG_MAX.
> You probably want ULLONG_MAX here.  And probably not in fact (more 
> later).
> 
> > +
> > +	struct rq *rq;
> > +	struct cpuidle_power *power;
> > +
> > +	int cpu_idle = -1;
> > +	int cpu_busy = -1;
> >  	int i;
> >  
> >  	/* Traverse only the allowed CPUs */
> >  	for_each_cpu_and(i, sched_group_cpus(group), tsk_cpus_allowed(p)) {
> > -		load = weighted_cpuload(i);
> >  
> > -		if (load < min_load || (load == min_load && i == this_cpu)) {
> > -			min_load = load;
> > -			idlest = i;
> > +		if (idle_cpu(i)) {
> > +
> > +			rq = cpu_rq(i);
> > +			power = rq->power;
> > +			idle_stamp = rq->idle_stamp;
> > +
> > +			/* The cpu is idle since a shorter time */
> > +			if (idle_stamp < min_idle_stamp) {
> > +				min_idle_stamp = idle_stamp;
> > +				cpu_idle = i;
> > +				continue;
> 
> Don't you want the highest time stamp in order to select the most 
> recently idled CPU?  Favoring the CPU which has been idle the longest 
> makes little sense.

It may make sense if the hardware can auto-promote CPUs to deeper C-states.

Something like that happens with package C-states that are only entered when
all cores have entered a particular core C-state already.  In that case the
probability of the core being in a deeper state grows with time.

That said I would just drop this heuristics for the time being.  If auto-promotion
is disregarded, it doesn't really matter how much time the given CPU has been idle
except for one case: When the target residency of its idle state hasn't been
reached yet, waking up the CPU may be a mistake (depending on how deep the state
actually is, but for the majority of drivers in the tree we don't have any measure
of that).

> > +			}
> > +
> > +			/* The cpu is idle but the exit_latency is shorter */
> > +			if (power && power->exit_latency < min_exit_latency) {
> > +				min_exit_latency = power->exit_latency;
> > +				cpu_idle = i;
> > +				continue;
> > +			}
> 
> I think this is wrong.  This gives priority to CPUs which have been idle 
> for a (longer... although this should have been) shorter period of time 
> over those with a shallower idle state.  I think this should rather be:
> 
> 	if (power && power->exit_latency < min_exit_latency) {
> 		min_exit_latency = power->exit_latency;
> 		latest_idle_stamp = idle_stamp;
> 	       	cpu_idle = i;
> 	} else if ((!power || power->exit_latency == min_exit_latency) &&
> 		   idle_stamp > latest_idle_stamp) {
> 		latest_idle_stamp = idle_stamp;
> 		cpu_idle = i;
> 	}
> 
> So the CPU with the shallowest idle state is selected in priority, and 
> if many CPUs are in the same state then the time stamp is used to 
> select the most recent one.

Again, if auto-promotion is disregarded, it doesn't really matter which of them
is woken up.

> Whenever a shallower idle state is found then the latest_idle_stamp is reset for 
> that state even if it is further in the past.
> 
> > +		} else {
> > +
> > +			load = weighted_cpuload(i);
> > +
> > +			if (load < min_load ||
> > +			    (load == min_load && i == this_cpu)) {
> > +				min_load = load;
> > +				cpu_busy = i;
> > +				continue;
> > +			}
> >  		}
> 
> I think this is wrong to do an if-else based on idle_cpu() here.  What 
> if a CPU is heavily loaded, but for some reason it happens to be idle at 
> this very moment?  With your patch it could be selected as an idle CPU 
> while it would be discarded as being too busy otherwise.

But see below ->

> It is important to determine both cpu_busy and cpu_idle for all CPUs.
> 
> And cpu_busy is a bad name for this.  Something like least_loaded would 
> be more self explanatory.  Same thing for cpu_idle which could be 
> clearer if named shalloest_idle.

shallowest_idle?

> > -	return idlest;
> > +	/* Busy cpus are considered less idle than idle cpus ;) */
> > +	return cpu_busy != -1 ? cpu_busy : cpu_idle;
> 
> And finally it is a policy decision whether or not we want to return 
> least_loaded over shallowest_idle e.g do we pack tasks on non idle CPUs 
> first or not.  That in itself needs more investigation.  To keep the 
> existing policy unchanged for now the above condition should have its 
> variables swapped.

Which means that once we've find the first idle CPU, it is not useful to
continue computing least_loaded, because we will return the idle one anyway,
right?
Nicolas Pitre April 4, 2014, 4:56 p.m. | #3
On Fri, 4 Apr 2014, Rafael J. Wysocki wrote:

> On Tuesday, April 01, 2014 11:05:49 PM Nicolas Pitre wrote:
> > On Fri, 28 Mar 2014, Daniel Lezcano wrote:
> > 
> > > As we know in which idle state the cpu is, we can investigate the following:
> > > 
> > > 1. when did the cpu entered the idle state ? the longer the cpu is idle, the
> > > deeper it is idle
> > > 2. what exit latency is ? the greater the exit latency is, the deeper it is
> > > 
> > > With both information, when all cpus are idle, we can choose the idlest cpu.
> > > 
> > > When one cpu is not idle, the old check against weighted load applies.
> > > 
> > > Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
> > 
> > There seems to be some problems with the implementation.
> > 
> > > @@ -4336,20 +4337,53 @@ static int
> > >  find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
> > >  {
> > >  	unsigned long load, min_load = ULONG_MAX;
> > > -	int idlest = -1;
> > > +	unsigned int min_exit_latency = UINT_MAX;
> > > +	u64 idle_stamp, min_idle_stamp = ULONG_MAX;
> > 
> > I don't think you really meant to assign an u64 variable with ULONG_MAX.
> > You probably want ULLONG_MAX here.  And probably not in fact (more 
> > later).
> > 
> > > +
> > > +	struct rq *rq;
> > > +	struct cpuidle_power *power;
> > > +
> > > +	int cpu_idle = -1;
> > > +	int cpu_busy = -1;
> > >  	int i;
> > >  
> > >  	/* Traverse only the allowed CPUs */
> > >  	for_each_cpu_and(i, sched_group_cpus(group), tsk_cpus_allowed(p)) {
> > > -		load = weighted_cpuload(i);
> > >  
> > > -		if (load < min_load || (load == min_load && i == this_cpu)) {
> > > -			min_load = load;
> > > -			idlest = i;
> > > +		if (idle_cpu(i)) {
> > > +
> > > +			rq = cpu_rq(i);
> > > +			power = rq->power;
> > > +			idle_stamp = rq->idle_stamp;
> > > +
> > > +			/* The cpu is idle since a shorter time */
> > > +			if (idle_stamp < min_idle_stamp) {
> > > +				min_idle_stamp = idle_stamp;
> > > +				cpu_idle = i;
> > > +				continue;
> > 
> > Don't you want the highest time stamp in order to select the most 
> > recently idled CPU?  Favoring the CPU which has been idle the longest 
> > makes little sense.
> 
> It may make sense if the hardware can auto-promote CPUs to deeper C-states.

If so the promotion will happen over time, no?  What I'm saying here is 
that those CPUs which have been idle longer should not be favored when 
it is time to select a CPU for a task to run. More recently idled CPUs 
are more likely to be in a shallower C-state.

> Something like that happens with package C-states that are only entered when
> all cores have entered a particular core C-state already.  In that case the
> probability of the core being in a deeper state grows with time.

Exactly what I'm saying.

Also here it is worth remembering that the scheduling domains should 
represent those packages that share common C-states at a higher level.  
The scheduler can then be told not to balance across domains if it 
doesn't need to in order to favor the conditions for those package 
C-states to be used.  That's what the task packing patch series is 
about, independently of this one.

> That said I would just drop this heuristics for the time being.  If auto-promotion
> is disregarded, it doesn't really matter how much time the given CPU has been idle
> except for one case: When the target residency of its idle state hasn't been
> reached yet, waking up the CPU may be a mistake (depending on how deep the state
> actually is, but for the majority of drivers in the tree we don't have any measure
> of that).

There is one reason for considering the time a CPU has been idle, 
assuming equivalent C-state, and that is cache snooping.  The longer a 
CPU is idle, the more likely its cache content will have been claimed 
and migrated by other CPUs.  Of course that doesn't make much difference 
for deeper C-states where the cache isn't preserved, but it is probably 
simpler and cheaper to apply this heuristic in all cases.

> > > +			}
> > > +
> > > +			/* The cpu is idle but the exit_latency is shorter */
> > > +			if (power && power->exit_latency < min_exit_latency) {
> > > +				min_exit_latency = power->exit_latency;
> > > +				cpu_idle = i;
> > > +				continue;
> > > +			}
> > 
> > I think this is wrong.  This gives priority to CPUs which have been idle 
> > for a (longer... although this should have been) shorter period of time 
> > over those with a shallower idle state.  I think this should rather be:
> > 
> > 	if (power && power->exit_latency < min_exit_latency) {
> > 		min_exit_latency = power->exit_latency;
> > 		latest_idle_stamp = idle_stamp;
> > 	       	cpu_idle = i;
> > 	} else if ((!power || power->exit_latency == min_exit_latency) &&
> > 		   idle_stamp > latest_idle_stamp) {
> > 		latest_idle_stamp = idle_stamp;
> > 		cpu_idle = i;
> > 	}
> > 
> > So the CPU with the shallowest idle state is selected in priority, and 
> > if many CPUs are in the same state then the time stamp is used to 
> > select the most recent one.
> 
> Again, if auto-promotion is disregarded, it doesn't really matter which of them
> is woken up.

If it doesn't matter then it doesn't hurt.  But in some cases it 
matters.

> > Whenever a shallower idle state is found then the latest_idle_stamp is reset for 
> > that state even if it is further in the past.
> > 
> > > +		} else {
> > > +
> > > +			load = weighted_cpuload(i);
> > > +
> > > +			if (load < min_load ||
> > > +			    (load == min_load && i == this_cpu)) {
> > > +				min_load = load;
> > > +				cpu_busy = i;
> > > +				continue;
> > > +			}
> > >  		}
> > 
> > I think this is wrong to do an if-else based on idle_cpu() here.  What 
> > if a CPU is heavily loaded, but for some reason it happens to be idle at 
> > this very moment?  With your patch it could be selected as an idle CPU 
> > while it would be discarded as being too busy otherwise.
> 
> But see below ->
> 
> > It is important to determine both cpu_busy and cpu_idle for all CPUs.
> > 
> > And cpu_busy is a bad name for this.  Something like least_loaded would 
> > be more self explanatory.  Same thing for cpu_idle which could be 
> > clearer if named shalloest_idle.
> 
> shallowest_idle?

Something that means the CPU with the shallowest C-state.  Using 
"cpu_idle" for this variable doesn't cut it.

> > > -	return idlest;
> > > +	/* Busy cpus are considered less idle than idle cpus ;) */
> > > +	return cpu_busy != -1 ? cpu_busy : cpu_idle;
> > 
> > And finally it is a policy decision whether or not we want to return 
> > least_loaded over shallowest_idle e.g do we pack tasks on non idle CPUs 
> > first or not.  That in itself needs more investigation.  To keep the 
> > existing policy unchanged for now the above condition should have its 
> > variables swapped.
> 
> Which means that once we've find the first idle CPU, it is not useful to
> continue computing least_loaded, because we will return the idle one anyway,
> right?

Good point.  Currently, that should be the case.

Eventually we'll want to put new tasks on lightly loaded CPUs instead of 
waking up a fully idle CPU in order to favor deeper C-states. But that 
requires a patch series of its own just to determine how loaded a CPU is 
and how much work it can still accommodate before being oversubscribed, 
etc.


Nicolas
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rafael J. Wysocki April 5, 2014, 2:01 a.m. | #4
On Friday, April 04, 2014 12:56:52 PM Nicolas Pitre wrote:
> On Fri, 4 Apr 2014, Rafael J. Wysocki wrote:
> 
> > On Tuesday, April 01, 2014 11:05:49 PM Nicolas Pitre wrote:
> > > On Fri, 28 Mar 2014, Daniel Lezcano wrote:
> > > 
> > > > As we know in which idle state the cpu is, we can investigate the following:
> > > > 
> > > > 1. when did the cpu entered the idle state ? the longer the cpu is idle, the
> > > > deeper it is idle
> > > > 2. what exit latency is ? the greater the exit latency is, the deeper it is
> > > > 
> > > > With both information, when all cpus are idle, we can choose the idlest cpu.
> > > > 
> > > > When one cpu is not idle, the old check against weighted load applies.
> > > > 
> > > > Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
> > > 
> > > There seems to be some problems with the implementation.
> > > 
> > > > @@ -4336,20 +4337,53 @@ static int
> > > >  find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
> > > >  {
> > > >  	unsigned long load, min_load = ULONG_MAX;
> > > > -	int idlest = -1;
> > > > +	unsigned int min_exit_latency = UINT_MAX;
> > > > +	u64 idle_stamp, min_idle_stamp = ULONG_MAX;
> > > 
> > > I don't think you really meant to assign an u64 variable with ULONG_MAX.
> > > You probably want ULLONG_MAX here.  And probably not in fact (more 
> > > later).
> > > 
> > > > +
> > > > +	struct rq *rq;
> > > > +	struct cpuidle_power *power;
> > > > +
> > > > +	int cpu_idle = -1;
> > > > +	int cpu_busy = -1;
> > > >  	int i;
> > > >  
> > > >  	/* Traverse only the allowed CPUs */
> > > >  	for_each_cpu_and(i, sched_group_cpus(group), tsk_cpus_allowed(p)) {
> > > > -		load = weighted_cpuload(i);
> > > >  
> > > > -		if (load < min_load || (load == min_load && i == this_cpu)) {
> > > > -			min_load = load;
> > > > -			idlest = i;
> > > > +		if (idle_cpu(i)) {
> > > > +
> > > > +			rq = cpu_rq(i);
> > > > +			power = rq->power;
> > > > +			idle_stamp = rq->idle_stamp;
> > > > +
> > > > +			/* The cpu is idle since a shorter time */
> > > > +			if (idle_stamp < min_idle_stamp) {
> > > > +				min_idle_stamp = idle_stamp;
> > > > +				cpu_idle = i;
> > > > +				continue;
> > > 
> > > Don't you want the highest time stamp in order to select the most 
> > > recently idled CPU?  Favoring the CPU which has been idle the longest 
> > > makes little sense.
> > 
> > It may make sense if the hardware can auto-promote CPUs to deeper C-states.
> 
> If so the promotion will happen over time, no?  What I'm saying here is 
> that those CPUs which have been idle longer should not be favored when 
> it is time to select a CPU for a task to run. More recently idled CPUs 
> are more likely to be in a shallower C-state.
> 
> > Something like that happens with package C-states that are only entered when
> > all cores have entered a particular core C-state already.  In that case the
> > probability of the core being in a deeper state grows with time.
> 
> Exactly what I'm saying.

Right, I got that the other way around by mistake.

> Also here it is worth remembering that the scheduling domains should 
> represent those packages that share common C-states at a higher level.  
> The scheduler can then be told not to balance across domains if it 
> doesn't need to in order to favor the conditions for those package 
> C-states to be used.  That's what the task packing patch series is 
> about, independently of this one.
> 
> > That said I would just drop this heuristics for the time being.  If auto-promotion
> > is disregarded, it doesn't really matter how much time the given CPU has been idle
> > except for one case: When the target residency of its idle state hasn't been
> > reached yet, waking up the CPU may be a mistake (depending on how deep the state
> > actually is, but for the majority of drivers in the tree we don't have any measure
> > of that).
> 
> There is one reason for considering the time a CPU has been idle, 
> assuming equivalent C-state, and that is cache snooping.  The longer a 
> CPU is idle, the more likely its cache content will have been claimed 
> and migrated by other CPUs.  Of course that doesn't make much difference 
> for deeper C-states where the cache isn't preserved, but it is probably 
> simpler and cheaper to apply this heuristic in all cases.

Yes, that sounds like it might be a reason, but I'd like to see numbers
confirming that to be honest.

> > > > +			}
> > > > +
> > > > +			/* The cpu is idle but the exit_latency is shorter */
> > > > +			if (power && power->exit_latency < min_exit_latency) {
> > > > +				min_exit_latency = power->exit_latency;
> > > > +				cpu_idle = i;
> > > > +				continue;
> > > > +			}
> > > 
> > > I think this is wrong.  This gives priority to CPUs which have been idle 
> > > for a (longer... although this should have been) shorter period of time 
> > > over those with a shallower idle state.  I think this should rather be:
> > > 
> > > 	if (power && power->exit_latency < min_exit_latency) {
> > > 		min_exit_latency = power->exit_latency;
> > > 		latest_idle_stamp = idle_stamp;
> > > 	       	cpu_idle = i;
> > > 	} else if ((!power || power->exit_latency == min_exit_latency) &&
> > > 		   idle_stamp > latest_idle_stamp) {
> > > 		latest_idle_stamp = idle_stamp;
> > > 		cpu_idle = i;
> > > 	}
> > > 
> > > So the CPU with the shallowest idle state is selected in priority, and 
> > > if many CPUs are in the same state then the time stamp is used to 
> > > select the most recent one.
> > 
> > Again, if auto-promotion is disregarded, it doesn't really matter which of them
> > is woken up.
> 
> If it doesn't matter then it doesn't hurt.  But in some cases it 
> matters.
> 
> > > Whenever a shallower idle state is found then the latest_idle_stamp is reset for 
> > > that state even if it is further in the past.
> > > 
> > > > +		} else {
> > > > +
> > > > +			load = weighted_cpuload(i);
> > > > +
> > > > +			if (load < min_load ||
> > > > +			    (load == min_load && i == this_cpu)) {
> > > > +				min_load = load;
> > > > +				cpu_busy = i;
> > > > +				continue;
> > > > +			}
> > > >  		}
> > > 
> > > I think this is wrong to do an if-else based on idle_cpu() here.  What 
> > > if a CPU is heavily loaded, but for some reason it happens to be idle at 
> > > this very moment?  With your patch it could be selected as an idle CPU 
> > > while it would be discarded as being too busy otherwise.
> > 
> > But see below ->
> > 
> > > It is important to determine both cpu_busy and cpu_idle for all CPUs.
> > > 
> > > And cpu_busy is a bad name for this.  Something like least_loaded would 
> > > be more self explanatory.  Same thing for cpu_idle which could be 
> > > clearer if named shalloest_idle.
> > 
> > shallowest_idle?
> 
> Something that means the CPU with the shallowest C-state.  Using 
> "cpu_idle" for this variable doesn't cut it.

Yes, that was about the typo above only. :-)

> > > > -	return idlest;
> > > > +	/* Busy cpus are considered less idle than idle cpus ;) */
> > > > +	return cpu_busy != -1 ? cpu_busy : cpu_idle;
> > > 
> > > And finally it is a policy decision whether or not we want to return 
> > > least_loaded over shallowest_idle e.g do we pack tasks on non idle CPUs 
> > > first or not.  That in itself needs more investigation.  To keep the 
> > > existing policy unchanged for now the above condition should have its 
> > > variables swapped.
> > 
> > Which means that once we've find the first idle CPU, it is not useful to
> > continue computing least_loaded, because we will return the idle one anyway,
> > right?
> 
> Good point.  Currently, that should be the case.
> 
> Eventually we'll want to put new tasks on lightly loaded CPUs instead of 
> waking up a fully idle CPU in order to favor deeper C-states. But that 
> requires a patch series of its own just to determine how loaded a CPU is 
> and how much work it can still accommodate before being oversubscribed, 
> etc.

Wouldn't we need power consumption numbers for that realistically?

Rafael

--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Peter Zijlstra April 15, 2014, 1:03 p.m. | #5
On Fri, Mar 28, 2014 at 01:29:56PM +0100, Daniel Lezcano wrote:
> @@ -4336,20 +4337,53 @@ static int
>  find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
>  {
>  	unsigned long load, min_load = ULONG_MAX;
> -	int idlest = -1;
> +	unsigned int min_exit_latency = UINT_MAX;
> +	u64 idle_stamp, min_idle_stamp = ULONG_MAX;
> +
> +	struct rq *rq;
> +	struct cpuidle_power *power;
> +
> +	int cpu_idle = -1;
> +	int cpu_busy = -1;
>  	int i;
>  
>  	/* Traverse only the allowed CPUs */
>  	for_each_cpu_and(i, sched_group_cpus(group), tsk_cpus_allowed(p)) {
> -		load = weighted_cpuload(i);
>  
> -		if (load < min_load || (load == min_load && i == this_cpu)) {
> -			min_load = load;
> -			idlest = i;
> +		if (idle_cpu(i)) {
> +
> +			rq = cpu_rq(i);
> +			power = rq->power;
> +			idle_stamp = rq->idle_stamp;
> +
> +			/* The cpu is idle since a shorter time */
> +			if (idle_stamp < min_idle_stamp) {
> +				min_idle_stamp = idle_stamp;
> +				cpu_idle = i;
> +				continue;
> +			}
> +
> +			/* The cpu is idle but the exit_latency is shorter */
> +			if (power && power->exit_latency < min_exit_latency) {
> +				min_exit_latency = power->exit_latency;
> +				cpu_idle = i;
> +				continue;
> +			}

Aside from the arguments made by Nico (which I agree with), depending on
the life time rules of the power object we might need
smp_read_barrier_depends() between reading and using.

If all these objects are static and never change content we do not, if
there's dynamic objects involved we probably should.
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Daniel Lezcano April 17, 2014, 1:53 p.m. | #6
On 04/02/2014 05:05 AM, Nicolas Pitre wrote:
> On Fri, 28 Mar 2014, Daniel Lezcano wrote:
>
>> As we know in which idle state the cpu is, we can investigate the following:
>>
>> 1. when did the cpu entered the idle state ? the longer the cpu is idle, the
>> deeper it is idle
>> 2. what exit latency is ? the greater the exit latency is, the deeper it is
>>
>> With both information, when all cpus are idle, we can choose the idlest cpu.
>>
>> When one cpu is not idle, the old check against weighted load applies.
>>
>> Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
>
> There seems to be some problems with the implementation.
>
>> @@ -4336,20 +4337,53 @@ static int
>>   find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
>>   {
>>   	unsigned long load, min_load = ULONG_MAX;
>> -	int idlest = -1;
>> +	unsigned int min_exit_latency = UINT_MAX;
>> +	u64 idle_stamp, min_idle_stamp = ULONG_MAX;
>
> I don't think you really meant to assign an u64 variable with ULONG_MAX.
> You probably want ULLONG_MAX here.  And probably not in fact (more
> later).
>
>> +
>> +	struct rq *rq;
>> +	struct cpuidle_power *power;
>> +
>> +	int cpu_idle = -1;
>> +	int cpu_busy = -1;
>>   	int i;
>>
>>   	/* Traverse only the allowed CPUs */
>>   	for_each_cpu_and(i, sched_group_cpus(group), tsk_cpus_allowed(p)) {
>> -		load = weighted_cpuload(i);
>>
>> -		if (load < min_load || (load == min_load && i == this_cpu)) {
>> -			min_load = load;
>> -			idlest = i;
>> +		if (idle_cpu(i)) {
>> +
>> +			rq = cpu_rq(i);
>> +			power = rq->power;
>> +			idle_stamp = rq->idle_stamp;
>> +
>> +			/* The cpu is idle since a shorter time */
>> +			if (idle_stamp < min_idle_stamp) {
>> +				min_idle_stamp = idle_stamp;
>> +				cpu_idle = i;
>> +				continue;
>
> Don't you want the highest time stamp in order to select the most
> recently idled CPU?  Favoring the CPU which has been idle the longest
> makes little sense.
>
>> +			}
>> +
>> +			/* The cpu is idle but the exit_latency is shorter */
>> +			if (power && power->exit_latency < min_exit_latency) {
>> +				min_exit_latency = power->exit_latency;
>> +				cpu_idle = i;
>> +				continue;
>> +			}
>
> I think this is wrong.  This gives priority to CPUs which have been idle
> for a (longer... although this should have been) shorter period of time
> over those with a shallower idle state.  I think this should rather be:
>
> 	if (power && power->exit_latency < min_exit_latency) {
> 		min_exit_latency = power->exit_latency;
> 		latest_idle_stamp = idle_stamp;
> 	       	cpu_idle = i;
> 	} else if ((!power || power->exit_latency == min_exit_latency) &&
> 		   idle_stamp > latest_idle_stamp) {
> 		latest_idle_stamp = idle_stamp;
> 		cpu_idle = i;
> 	}
>
> So the CPU with the shallowest idle state is selected in priority, and
> if many CPUs are in the same state then the time stamp is used to
> select the most recent one. Whenever
> a shallower idle state is found then the latest_idle_stamp is reset for
> that state even if it is further in the past.
>
>> +		} else {
>> +
>> +			load = weighted_cpuload(i);
>> +
>> +			if (load < min_load ||
>> +			    (load == min_load && i == this_cpu)) {
>> +				min_load = load;
>> +				cpu_busy = i;
>> +				continue;
>> +			}
>>   		}
>
> I think this is wrong to do an if-else based on idle_cpu() here.  What
> if a CPU is heavily loaded, but for some reason it happens to be idle at
> this very moment?  With your patch it could be selected as an idle CPU
> while it would be discarded as being too busy otherwise.
>
> It is important to determine both cpu_busy and cpu_idle for all CPUs.
>
> And cpu_busy is a bad name for this.  Something like least_loaded would
> be more self explanatory.  Same thing for cpu_idle which could be
> clearer if named shalloest_idle.
>
>> -	return idlest;
>> +	/* Busy cpus are considered less idle than idle cpus ;) */
>> +	return cpu_busy != -1 ? cpu_busy : cpu_idle;
>
> And finally it is a policy decision whether or not we want to return
> least_loaded over shallowest_idle e.g do we pack tasks on non idle CPUs
> first or not.  That in itself needs more investigation.  To keep the
> existing policy unchanged for now the above condition should have its
> variables swapped.


Ok, refreshed the patchset but before sending it out I would to discuss 
about the rational of the changes and the policy, and change the 
patchset consequently.

What order to choose if the cpu is idle ?

Let's assume all cpus are idle on a dual socket quad core.

Also, we can reasonably do the hypothesis if the cluster is in low power 
mode, the cpus belonging to the same cluster are in the same idle state 
(putting apart the auto-promote where we don't have control on).

If the policy you talk above is 'aggressive power saving', we can follow 
the rules with decreasing priority:

1. We want to prevent to wakeup the entire cluster
	=> as the cpus are in the same idle state, by choosing a cpu in shallow 
state, we should have the guarantee we won't wakeup a cluster (except if 
no shallowest idle cpu are found).

2. We want to prevent to wakeup a cpu which did not reach the target 
residency time (will need some work to unify cpuidle idle time and idle 
task run time)
	=> with the target residency and, as a first step, with the idle stamp, 
we can determine if the cpu slept enough

3. We want to prevent to wakeup a cpu in deep idle state
	=> by looking for the cpu in shallowest idle state

4. We want to prevent to wakeup a cpu where the exit latency is longer 
than the expected run time of the task (and the time to migrate the task ?)

Concerning the policy, I would suggest to create an entry in
/proc/sys/kernel/sched_power, where a couple of values could be 
performance - power saving (0 / 1).

Does it make sense ? Any ideas ?

Thanks
   -- Daniel
Peter Zijlstra April 17, 2014, 2:47 p.m. | #7
On Thu, Apr 17, 2014 at 03:53:32PM +0200, Daniel Lezcano wrote:
> Concerning the policy, I would suggest to create an entry in
> /proc/sys/kernel/sched_power, where a couple of values could be performance
> - power saving (0 / 1).

Ingo wanted a sched_balance_policy file with 3 values:
  "performance, power, auto"

Where the auto thing switches between them, initially based off of
having AC or not.
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Daniel Lezcano April 17, 2014, 3:03 p.m. | #8
On 04/17/2014 04:47 PM, Peter Zijlstra wrote:
> On Thu, Apr 17, 2014 at 03:53:32PM +0200, Daniel Lezcano wrote:
>> Concerning the policy, I would suggest to create an entry in
>> /proc/sys/kernel/sched_power, where a couple of values could be performance
>> - power saving (0 / 1).
>
> Ingo wanted a sched_balance_policy file with 3 values:
>    "performance, power, auto"
>
> Where the auto thing switches between them, initially based off of
> having AC or not.

oh, good. Thanks !
Nicolas Pitre April 17, 2014, 3:53 p.m. | #9
On Thu, 17 Apr 2014, Daniel Lezcano wrote:

> Ok, refreshed the patchset but before sending it out I would to discuss about
> the rational of the changes and the policy, and change the patchset
> consequently.
> 
> What order to choose if the cpu is idle ?
> 
> Let's assume all cpus are idle on a dual socket quad core.
> 
> Also, we can reasonably do the hypothesis if the cluster is in low power mode,
> the cpus belonging to the same cluster are in the same idle state (putting
> apart the auto-promote where we don't have control on).
> 
> If the policy you talk above is 'aggressive power saving', we can follow the
> rules with decreasing priority:
> 
> 1. We want to prevent to wakeup the entire cluster
> 	=> as the cpus are in the same idle state, by choosing a cpu in
> 	=> shallow 
> state, we should have the guarantee we won't wakeup a cluster (except if no
> shallowest idle cpu are found).

This is unclear to me.  Obviously, if an entire cluster is down, that 
means all the CPUs it contains have been idle for a long time. And 
therefore they shouldn't be subject to selection unless there is no 
other CPUs available.  Is that what you mean?

> 2. We want to prevent to wakeup a cpu which did not reach the target residency
> time (will need some work to unify cpuidle idle time and idle task run time)
> 	=> with the target residency and, as a first step, with the idle
> 	=> stamp, 
> we can determine if the cpu slept enough

Agreed. However, right now, the scheduler does not have any 
consideration for that.  So this should be done as a separate patch.

> 3. We want to prevent to wakeup a cpu in deep idle state
> 	=> by looking for the cpu in shallowest idle state

Obvious.

> 4. We want to prevent to wakeup a cpu where the exit latency is longer than
> the expected run time of the task (and the time to migrate the task ?)

Sure.  That would be a case for using task packing even if the policy is 
set to performance rather than powersave whereas task packing is 
normally for powersave.


Nicolas
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Daniel Lezcano April 17, 2014, 4:05 p.m. | #10
On 04/17/2014 05:53 PM, Nicolas Pitre wrote:
> On Thu, 17 Apr 2014, Daniel Lezcano wrote:
>
>> Ok, refreshed the patchset but before sending it out I would to discuss about
>> the rational of the changes and the policy, and change the patchset
>> consequently.
>>
>> What order to choose if the cpu is idle ?
>>
>> Let's assume all cpus are idle on a dual socket quad core.
>>
>> Also, we can reasonably do the hypothesis if the cluster is in low power mode,
>> the cpus belonging to the same cluster are in the same idle state (putting
>> apart the auto-promote where we don't have control on).
>>
>> If the policy you talk above is 'aggressive power saving', we can follow the
>> rules with decreasing priority:
>>
>> 1. We want to prevent to wakeup the entire cluster
>> 	=> as the cpus are in the same idle state, by choosing a cpu in
>> 	=> shallow
>> state, we should have the guarantee we won't wakeup a cluster (except if no
>> shallowest idle cpu are found).
>
> This is unclear to me.  Obviously, if an entire cluster is down, that
> means all the CPUs it contains have been idle for a long time.  And
> therefore they shouldn't be subject to selection unless there is no
> other CPUs available.  Is that what you mean?

Yes, this is what I meant. But also what I meant is we can get rid for 
the moment of the cpu topology and the coupling idle state because if we 
do this described approach, as the idle state will be the same for the 
cpus belonging to the same cluster we won't select a cluster down 
(except if there is no other CPUs available).

>> 2. We want to prevent to wakeup a cpu which did not reach the target residency
>> time (will need some work to unify cpuidle idle time and idle task run time)
>> 	=> with the target residency and, as a first step, with the idle
>> 	=> stamp,
>> we can determine if the cpu slept enough
>
> Agreed. However, right now, the scheduler does not have any
> consideration for that.  So this should be done as a separate patch.

Yes, I thought as a very first step we can rely on the idle stamp until 
we unify the times with a big comment. Or I can first unify the idle 
times and then take into account the target residency. It is to comply 
with Rafael's request to have the 'big picture'.

>> 3. We want to prevent to wakeup a cpu in deep idle state
>> 	=> by looking for the cpu in shallowest idle state
>
> Obvious.
>
>> 4. We want to prevent to wakeup a cpu where the exit latency is longer than
>> the expected run time of the task (and the time to migrate the task ?)
>
> Sure.  That would be a case for using task packing even if the policy is
> set to performance rather than powersave whereas task packing is
> normally for powersave.

Yes, I agree, task packing improves also the performances and it makes 
really sense to prevent task migration under some circumstances for a 
better cache efficiency.

Thanks for the comments

   -- Daniel
Nicolas Pitre April 17, 2014, 4:21 p.m. | #11
On Thu, 17 Apr 2014, Daniel Lezcano wrote:

> On 04/17/2014 05:53 PM, Nicolas Pitre wrote:
> > On Thu, 17 Apr 2014, Daniel Lezcano wrote:
> >
> > > Ok, refreshed the patchset but before sending it out I would to discuss
> > > about
> > > the rational of the changes and the policy, and change the patchset
> > > consequently.
> > >
> > > What order to choose if the cpu is idle ?
> > >
> > > Let's assume all cpus are idle on a dual socket quad core.
> > >
> > > Also, we can reasonably do the hypothesis if the cluster is in low power
> > > mode,
> > > the cpus belonging to the same cluster are in the same idle state (putting
> > > apart the auto-promote where we don't have control on).
> > >
> > > If the policy you talk above is 'aggressive power saving', we can follow
> > > the
> > > rules with decreasing priority:
> > >
> > > 1. We want to prevent to wakeup the entire cluster
> > > => as the cpus are in the same idle state, by choosing a cpu in
> > > => shallow
> > > state, we should have the guarantee we won't wakeup a cluster (except if
> > > no
> > > shallowest idle cpu are found).
> >
> > This is unclear to me.  Obviously, if an entire cluster is down, that
> > means all the CPUs it contains have been idle for a long time.  And
> > therefore they shouldn't be subject to selection unless there is no
> > other CPUs available.  Is that what you mean?
> 
> Yes, this is what I meant. But also what I meant is we can get rid for the
> moment of the cpu topology and the coupling idle state because if we do this
> described approach, as the idle state will be the same for the cpus belonging
> to the same cluster we won't select a cluster down (except if there is no
> other CPUs available).

CPU topology is needed to properly describe scheduling domains.  Whether 
we balance across domains or pack using as few domains as possible is a 
separate issue.  In other words, you shouldn't have to care in this 
patch series.

And IMHO coupled C-state is a low-level mechanism that should remain 
private to cpuidle which the scheduler shouldn't be aware of.

> > > 2. We want to prevent to wakeup a cpu which did not reach the target
> > > residency
> > > time (will need some work to unify cpuidle idle time and idle task run
> > > time)
> > > => with the target residency and, as a first step, with the idle
> > > => stamp,
> > > we can determine if the cpu slept enough
> >
> > Agreed. However, right now, the scheduler does not have any
> > consideration for that.  So this should be done as a separate patch.
> 
> Yes, I thought as a very first step we can rely on the idle stamp until we
> unify the times with a big comment. Or I can first unify the idle times and
> then take into account the target residency. It is to comply with Rafael's
> request to have the 'big picture'.

I agree, but that should be done incrementally.  Even without this 
consideration, what you proposed is already an improvement over the 
current state of affairs.


Nicolas
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ingo Molnar April 18, 2014, 8:09 a.m. | #12
* Daniel Lezcano <daniel.lezcano@linaro.org> wrote:

> On 04/17/2014 04:47 PM, Peter Zijlstra wrote:
> >On Thu, Apr 17, 2014 at 03:53:32PM +0200, Daniel Lezcano wrote:
> >>Concerning the policy, I would suggest to create an entry in
> >>/proc/sys/kernel/sched_power, where a couple of values could be performance
> >>- power saving (0 / 1).
> >
> >Ingo wanted a sched_balance_policy file with 3 values:
> >   "performance, power, auto"
> >
> >Where the auto thing switches between them, initially based off of
> >having AC or not.
> 
> oh, good. Thanks !

Also, 'auto' should be the default, because the kernel doing TRT is 
really what users want.

Userspace can sill tweak it all and make it all user-space controlled, 
by flipping between 'performance' and 'power'. (and those modes are 
also helpful for development and debugging.)

Thanks,

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Daniel Lezcano April 18, 2014, 8:36 a.m. | #13
On 04/18/2014 10:09 AM, Ingo Molnar wrote:
>
> * Daniel Lezcano <daniel.lezcano@linaro.org> wrote:
>
>> On 04/17/2014 04:47 PM, Peter Zijlstra wrote:
>>> On Thu, Apr 17, 2014 at 03:53:32PM +0200, Daniel Lezcano wrote:
>>>> Concerning the policy, I would suggest to create an entry in
>>>> /proc/sys/kernel/sched_power, where a couple of values could be performance
>>>> - power saving (0 / 1).
>>>
>>> Ingo wanted a sched_balance_policy file with 3 values:
>>>    "performance, power, auto"
>>>
>>> Where the auto thing switches between them, initially based off of
>>> having AC or not.
>>
>> oh, good. Thanks !
>
> Also, 'auto' should be the default, because the kernel doing TRT is
> really what users want.
>
> Userspace can sill tweak it all and make it all user-space controlled,
> by flipping between 'performance' and 'power'. (and those modes are
> also helpful for development and debugging.)

Copy that.

   Thanks !

   -- Daniel
Peter Zijlstra April 18, 2014, 9:38 a.m. | #14
On Thu, Apr 17, 2014 at 12:21:28PM -0400, Nicolas Pitre wrote:
> CPU topology is needed to properly describe scheduling domains.  Whether 
> we balance across domains or pack using as few domains as possible is a 
> separate issue.  In other words, you shouldn't have to care in this 
> patch series.
> 
> And IMHO coupled C-state is a low-level mechanism that should remain 
> private to cpuidle which the scheduler shouldn't be aware of.

I'm confused.. why wouldn't you want to expose these?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Daniel Lezcano April 18, 2014, 12:13 p.m. | #15
On 04/18/2014 11:38 AM, Peter Zijlstra wrote:
> On Thu, Apr 17, 2014 at 12:21:28PM -0400, Nicolas Pitre wrote:
>> CPU topology is needed to properly describe scheduling domains.  Whether
>> we balance across domains or pack using as few domains as possible is a
>> separate issue.  In other words, you shouldn't have to care in this
>> patch series.
>>
>> And IMHO coupled C-state is a low-level mechanism that should remain
>> private to cpuidle which the scheduler shouldn't be aware of.
>
> I'm confused.. why wouldn't you want to expose these?

The couple C-state is used as a mechanism for cpuidle to sync the cpus 
when entering a specific c-state. This mechanism is usually used to 
handle the cluster power down. It is only used for a two drivers (soon 
three) but it is not the only mechanism used for syncing the cpus. There 
are also the MCPM (tc2), the hand made sync when the hardware allows it 
(ux500), and an abstraction from the firmware (mwait), transparent to 
the kernel.

Taking into account the couple c-state only does not make sense because 
of the other mechanisms above. This is why it should stay inside the 
cpuidle framework.

The extension of the cpu topology will provide a generic way to describe 
and abstracting such dependencies.

Does it answer your question ?

   -- Daniel
Peter Zijlstra April 18, 2014, 12:53 p.m. | #16
On Fri, Apr 18, 2014 at 02:13:48PM +0200, Daniel Lezcano wrote:
> On 04/18/2014 11:38 AM, Peter Zijlstra wrote:
> >On Thu, Apr 17, 2014 at 12:21:28PM -0400, Nicolas Pitre wrote:
> >>CPU topology is needed to properly describe scheduling domains.  Whether
> >>we balance across domains or pack using as few domains as possible is a
> >>separate issue.  In other words, you shouldn't have to care in this
> >>patch series.
> >>
> >>And IMHO coupled C-state is a low-level mechanism that should remain
> >>private to cpuidle which the scheduler shouldn't be aware of.
> >
> >I'm confused.. why wouldn't you want to expose these?
> 
> The couple C-state is used as a mechanism for cpuidle to sync the cpus when
> entering a specific c-state. This mechanism is usually used to handle the
> cluster power down. It is only used for a two drivers (soon three) but it is
> not the only mechanism used for syncing the cpus. There are also the MCPM
> (tc2), the hand made sync when the hardware allows it (ux500), and an
> abstraction from the firmware (mwait), transparent to the kernel.
> 
> Taking into account the couple c-state only does not make sense because of
> the other mechanisms above. This is why it should stay inside the cpuidle
> framework.
> 
> The extension of the cpu topology will provide a generic way to describe and
> abstracting such dependencies.
> 
> Does it answer your question ?

I suppose so; its still a bit like we won't but we will :-)

So we _will_ actually expose coupled C states through the topology bits,
that's good.
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Daniel Lezcano April 18, 2014, 1:04 p.m. | #17
On 04/18/2014 02:53 PM, Peter Zijlstra wrote:
> On Fri, Apr 18, 2014 at 02:13:48PM +0200, Daniel Lezcano wrote:
>> On 04/18/2014 11:38 AM, Peter Zijlstra wrote:
>>> On Thu, Apr 17, 2014 at 12:21:28PM -0400, Nicolas Pitre wrote:
>>>> CPU topology is needed to properly describe scheduling domains.  Whether
>>>> we balance across domains or pack using as few domains as possible is a
>>>> separate issue.  In other words, you shouldn't have to care in this
>>>> patch series.
>>>>
>>>> And IMHO coupled C-state is a low-level mechanism that should remain
>>>> private to cpuidle which the scheduler shouldn't be aware of.
>>>
>>> I'm confused.. why wouldn't you want to expose these?
>>
>> The couple C-state is used as a mechanism for cpuidle to sync the cpus when
>> entering a specific c-state. This mechanism is usually used to handle the
>> cluster power down. It is only used for a two drivers (soon three) but it is
>> not the only mechanism used for syncing the cpus. There are also the MCPM
>> (tc2), the hand made sync when the hardware allows it (ux500), and an
>> abstraction from the firmware (mwait), transparent to the kernel.
>>
>> Taking into account the couple c-state only does not make sense because of
>> the other mechanisms above. This is why it should stay inside the cpuidle
>> framework.
>>
>> The extension of the cpu topology will provide a generic way to describe and
>> abstracting such dependencies.
>>
>> Does it answer your question ?
>
> I suppose so; its still a bit like we won't but we will :-)
>
> So we _will_ actually expose coupled C states through the topology bits,
> that's good.

Ah, ok. I think I understood where the confusion is coming from.

A couple of definitions for the same thing :)

1. Coupled C-states : *mechanism* implemented in the cpuidle framework: 
drivers/cpuidle/coupled.c

2. Coupled C-states : *constraint* to reach a cluster power down state, 
will be described through the topology and could be implemented by 
different mechanism (MCPM, handmade sync, cpuidle-coupled-c-state, 
firmware).

We want to expose 2. not 1. to the scheduler.
Nicolas Pitre April 18, 2014, 4 p.m. | #18
On Fri, 18 Apr 2014, Daniel Lezcano wrote:

> On 04/18/2014 02:53 PM, Peter Zijlstra wrote:
> > I suppose so; its still a bit like we won't but we will :-)
> >
> > So we _will_ actually expose coupled C states through the topology bits,
> > that's good.
> 
> Ah, ok. I think I understood where the confusion is coming from.
> 
> A couple of definitions for the same thing :)
> 
> 1. Coupled C-states : *mechanism* implemented in the cpuidle framework:
> drivers/cpuidle/coupled.c
> 
> 2. Coupled C-states : *constraint* to reach a cluster power down state, will
> be described through the topology and could be implemented by different
> mechanism (MCPM, handmade sync, cpuidle-coupled-c-state, firmware).
> 
> We want to expose 2. not 1. to the scheduler.

I couldn't explain it better.

Sorry for creating confusion.


Nicolas
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 16042b5..068e503 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -23,6 +23,7 @@ 
 #include <linux/latencytop.h>
 #include <linux/sched.h>
 #include <linux/cpumask.h>
+#include <linux/cpuidle.h>
 #include <linux/slab.h>
 #include <linux/profile.h>
 #include <linux/interrupt.h>
@@ -4336,20 +4337,53 @@  static int
 find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 {
 	unsigned long load, min_load = ULONG_MAX;
-	int idlest = -1;
+	unsigned int min_exit_latency = UINT_MAX;
+	u64 idle_stamp, min_idle_stamp = ULONG_MAX;
+
+	struct rq *rq;
+	struct cpuidle_power *power;
+
+	int cpu_idle = -1;
+	int cpu_busy = -1;
 	int i;
 
 	/* Traverse only the allowed CPUs */
 	for_each_cpu_and(i, sched_group_cpus(group), tsk_cpus_allowed(p)) {
-		load = weighted_cpuload(i);
 
-		if (load < min_load || (load == min_load && i == this_cpu)) {
-			min_load = load;
-			idlest = i;
+		if (idle_cpu(i)) {
+
+			rq = cpu_rq(i);
+			power = rq->power;
+			idle_stamp = rq->idle_stamp;
+
+			/* The cpu is idle since a shorter time */
+			if (idle_stamp < min_idle_stamp) {
+				min_idle_stamp = idle_stamp;
+				cpu_idle = i;
+				continue;
+			}
+
+			/* The cpu is idle but the exit_latency is shorter */
+			if (power && power->exit_latency < min_exit_latency) {
+				min_exit_latency = power->exit_latency;
+				cpu_idle = i;
+				continue;
+			}
+		} else {
+
+			load = weighted_cpuload(i);
+
+			if (load < min_load ||
+			    (load == min_load && i == this_cpu)) {
+				min_load = load;
+				cpu_busy = i;
+				continue;
+			}
 		}
 	}
 
-	return idlest;
+	/* Busy cpus are considered less idle than idle cpus ;) */
+	return cpu_busy != -1 ? cpu_busy : cpu_idle;
 }
 
 /*