diff mbox

[3/3] sched: fair: Fix wrong idle timestamp usage

Message ID 1429092024-20498-3-git-send-email-daniel.lezcano@linaro.org
State New
Headers show

Commit Message

Daniel Lezcano April 15, 2015, 10 a.m. UTC
The find_idlest_cpu is assuming the rq->idle_stamp information reflects when
the cpu entered the idle state. This is wrong as the cpu may exit and enter
the idle state several times without the rq->idle_stamp being updated.

We have two informations here:

 * rq->idle_stamp gives when the idle task has been scheduled
 * idle->idle_stamp gives when the cpu entered the idle state

The patch fixes that by using the latter information and fallbacks to
the rq's timestamp when the idle state is not accessible

Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
---
 kernel/sched/fair.c | 42 ++++++++++++++++++++++++++++--------------
 1 file changed, 28 insertions(+), 14 deletions(-)

Comments

Daniel Lezcano April 15, 2015, 3:43 p.m. UTC | #1
On 04/15/2015 02:18 PM, Peter Zijlstra wrote:
> On Wed, Apr 15, 2015 at 12:00:24PM +0200, Daniel Lezcano wrote:
>> The find_idlest_cpu is assuming the rq->idle_stamp information reflects when
>> the cpu entered the idle state. This is wrong as the cpu may exit and enter
>> the idle state several times without the rq->idle_stamp being updated.
>
> Sure, but you forgot to tell us why it matters.

Yes, right. Thanks for pointing this out.

Assuming we are in the situation where there are several idle cpus in 
the same idle state.

With the current code, the function find_idlest_cpu will choose a cpu 
with the shortest idle duration. This information is based on the 
rq->idle_stamp variable and is correct until one of the idle cpu is 
exiting the cpuidle_enter function and re-entering it again. As soon as 
this happen, the rq->idle_stamp value is no longer a reliable information.

Example:

  * CPU0 and CPU1 are running
  * CPU2 and CPU3 are in the C3 state.
  * CPU2 entered idle at T2
  * CPU3 entered idle at T3
  * T2 < T3

The function find_idlest_cpu will choose CPU3 because it has a shorter 
idle duration.

Then CPU3 is woken up by an interrupt, process it and re-enter idle C3.

The information will still give the out to date information T2 < T3 and 
find_idlest_cpu will choose CPU2 instead of CPU3.

Even if that shouldn't have a big impact on the performance and energy 
side, we are dealing with a wrong information preventing us to improve 
the energy side later (eg. prevent to wakeup a cpu which did not reach 
its target residency yet).

>> We have two informations here:
>>
>>   * rq->idle_stamp gives when the idle task has been scheduled
>>   * idle->idle_stamp gives when the cpu entered the idle state
>
> I'm not a native speaker, but I'm pretty sure 'information' is a word
> without a plural, a google search suggests it to be a non-countable
> noun.

Ha, sounds like it is a common mistake for non native speaker :)

"Informations" is correct but apparently very uncommon, so uncommon it 
is considered incorrect.

Thanks for the tip, I will keep it in mind :)

>> The patch fixes that by using the latter information and fallbacks to
>> the rq's timestamp when the idle state is not accessible
>>
>> Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
>> ---
>>   kernel/sched/fair.c | 42 ++++++++++++++++++++++++++++--------------
>>   1 file changed, 28 insertions(+), 14 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 46855d0..b44f1ad 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -4704,21 +4704,35 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
>>   		if (idle_cpu(i)) {
>>   			struct rq *rq = cpu_rq(i);
>>   			struct cpuidle_state *idle = idle_get_state(rq);
>> +
>> +			if (idle) {
>> +				if (idle->exit_latency < min_exit_latency) {
>> +					/*
>> +					 * We give priority to a CPU
>> +					 * whose idle state has the
>> +					 * smallest exit latency
>> +					 * irrespective of any idle
>> +					 * timestamp.
>> +					 */
>> +					min_exit_latency = idle->exit_latency;
>> +					latest_idle_timestamp = idle->idle_stamp;
>> +					shallowest_idle_cpu = i;
>> +				} else if (idle->exit_latency == min_exit_latency &&
>> +					   idle->idle_stamp > latest_idle_timestamp) {
>> +					/*
>> +					 * If the CPU is in the same
>> +					 * idle state, choose the more
>> +					 * recent one as it might have
>> +					 * a warmer cache
>> +					 */
>> +					latest_idle_timestamp = idle->idle_stamp;
>> +					shallowest_idle_cpu = i;
>> +				}
>> +			} else if (rq->idle_stamp > latest_idle_timestamp) {
>>   				/*
>> +				 * If no active idle state, then the
>> +				 * most recent idled CPU might have a
>> +				 * warmer cache
>>   				 */
>>   				latest_idle_timestamp = rq->idle_stamp;
>>   				shallowest_idle_cpu = i;
>
> Urgh, you made horrid code more horrible.
>
> And all without reason.

Ok. What is horrible ? The 'if then else' blocks or the algorithm itself ?
Daniel Lezcano April 16, 2015, 8:46 a.m. UTC | #2
On 04/15/2015 07:10 PM, Morten Rasmussen wrote:
> On Wed, Apr 15, 2015 at 04:43:17PM +0100, Daniel Lezcano wrote:
>> On 04/15/2015 02:18 PM, Peter Zijlstra wrote:
>>> On Wed, Apr 15, 2015 at 12:00:24PM +0200, Daniel Lezcano wrote:
>>>> The find_idlest_cpu is assuming the rq->idle_stamp information reflects when
>>>> the cpu entered the idle state. This is wrong as the cpu may exit and enter
>>>> the idle state several times without the rq->idle_stamp being updated.
>>>
>>> Sure, but you forgot to tell us why it matters.
>>
>> Yes, right. Thanks for pointing this out.
>>
>> Assuming we are in the situation where there are several idle cpus in
>> the same idle state.
>>
>> With the current code, the function find_idlest_cpu will choose a cpu
>> with the shortest idle duration. This information is based on the
>> rq->idle_stamp variable and is correct until one of the idle cpu is
>> exiting the cpuidle_enter function and re-entering it again. As soon as
>> this happen, the rq->idle_stamp value is no longer a reliable information.
>>
>> Example:
>>
>>    * CPU0 and CPU1 are running
>>    * CPU2 and CPU3 are in the C3 state.
>>    * CPU2 entered idle at T2
>>    * CPU3 entered idle at T3
>>    * T2 < T3
>>
>> The function find_idlest_cpu will choose CPU3 because it has a shorter
>> idle duration.
>>
>> Then CPU3 is woken up by an interrupt, process it and re-enter idle C3.
>>
>> The information will still give the out to date information T2 < T3 and
>> find_idlest_cpu will choose CPU2 instead of CPU3.
>
> I can't get the example to match your description of how
> find_idlest_cpu() is supposed to work :-(
>
> Did you mean CPU2 (not CPU3) getting woken up by an interrupt and
> find_busiest_cpu() choosing CPU3 instead of CPU2 after the interrupt?
>
> In your example find_busiest_cpu() should return CPU3 before the
> interrupt as it went to sleep last and the interrupt on CPU3 should not
> affect that choice as CPU3 is still the last cpu to go to sleep
> (regardless of your patch). No?

Yes you are right. I meant CPU2 is woken up by the interrupt.
diff mbox

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 46855d0..b44f1ad 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4704,21 +4704,35 @@  find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 		if (idle_cpu(i)) {
 			struct rq *rq = cpu_rq(i);
 			struct cpuidle_state *idle = idle_get_state(rq);
-			if (idle && idle->exit_latency < min_exit_latency) {
-				/*
-				 * We give priority to a CPU whose idle state
-				 * has the smallest exit latency irrespective
-				 * of any idle timestamp.
-				 */
-				min_exit_latency = idle->exit_latency;
-				latest_idle_timestamp = rq->idle_stamp;
-				shallowest_idle_cpu = i;
-			} else if ((!idle || idle->exit_latency == min_exit_latency) &&
-				   rq->idle_stamp > latest_idle_timestamp) {
+
+			if (idle) {
+				if (idle->exit_latency < min_exit_latency) {
+					/*
+					 * We give priority to a CPU
+					 * whose idle state has the
+					 * smallest exit latency
+					 * irrespective of any idle
+					 * timestamp.
+					 */
+					min_exit_latency = idle->exit_latency;
+					latest_idle_timestamp = idle->idle_stamp;
+					shallowest_idle_cpu = i;
+				} else if (idle->exit_latency == min_exit_latency &&
+					   idle->idle_stamp > latest_idle_timestamp) {
+					/*
+					 * If the CPU is in the same
+					 * idle state, choose the more
+					 * recent one as it might have
+					 * a warmer cache
+					 */
+					latest_idle_timestamp = idle->idle_stamp;
+					shallowest_idle_cpu = i;
+				}
+			} else if (rq->idle_stamp > latest_idle_timestamp) {
 				/*
-				 * If equal or no active idle state, then
-				 * the most recently idled CPU might have
-				 * a warmer cache.
+				 * If no active idle state, then the
+				 * most recent idled CPU might have a
+				 * warmer cache
 				 */
 				latest_idle_timestamp = rq->idle_stamp;
 				shallowest_idle_cpu = i;