
[RFC] sched: find the latest idle cpu

Message ID 1389758879-19951-1-git-send-email-alex.shi@linaro.org
State New

Commit Message

Alex Shi Jan. 15, 2014, 4:07 a.m. UTC
Currently we just try to find the least loaded cpu. If some cpus are
idle, we just pick the first such cpu in the cpu mask.

In fact we can pick the idle cpu that is currently handling an interrupt,
or the one that went idle most recently; then we may get a benefit in
both latency and power. The selected cpu may not be the best, since
another cpu may be interrupted while we are selecting, but being more
precise costs too much.

Signed-off-by: Alex Shi <alex.shi@linaro.org>
---
 kernel/sched/fair.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

Comments

Alex Shi Jan. 15, 2014, 6:45 a.m. UTC | #1
On 01/15/2014 01:33 PM, Michael wang wrote:
> On 01/15/2014 12:07 PM, Alex Shi wrote:
>> > Currently we just try to find the least loaded cpu. If some cpus are
>> > idle, we just pick the first such cpu in the cpu mask.
>> > 
>> > In fact we can pick the idle cpu that is currently handling an
>> > interrupt, or the one that went idle most recently; then we may get a
>> > benefit in both latency and power. The selected cpu may not be the
>> > best, since another cpu may be interrupted while we are selecting,
>> > but being more precise costs too much.
> So the idea here is we want to choose the latest idle cpu if we have
> multiple idle cpus to choose from, correct?

yes.
> 
> And I guess that was in order to avoid choosing a tickless cpu while
> there are idle cpus that are still ticking, is that right?

No, the current logic chooses the least loaded cpu no matter whether it
is idle or not.
> 
> What confuses me is, what about those cpus which are just about to
> recover from tickless state, as you mentioned? That means the latest
> idle cpu is not necessarily the best choice, and could even be the
> worst (if there are just two choices, and the longer-tickless one is
> just about to recover while the latest one is just going tickless).

Yes, to handle your scenario we would need to know the next timer of
each idle cpu, but even that is not enough, since interrupts are totally
unpredictable. So I'd rather bear with the coarse method for now.
> 
> So what about just checking 'ts->tick_stopped' and recording one
> still-ticking idle cpu? The cost could be lower than a time comparison,
> and we could reduce the risk maybe... (well, not so risky since the
> logic only kicks in when the system is relaxed, with several cpus idle)

First, nohz_full also stops the tick. Second, tick_stopped cannot
reflect interrupts: when an idle cpu is interrupted it is already awake,
which makes it a good candidate to run a task.
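
(For context: the distinction being argued here maps onto fields of the
per-cpu struct tick_sched used by the nohz code. The fragment below is
only an uncompiled, illustrative sketch of how those fields would be
read; it mirrors the hunk at the bottom of this page rather than any
merged code.)

	struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);

	if (!ts->inidle) {
		/* cpu has left the idle loop: it was rescheduled, already busy */
	} else if (!ts->idle_active) {
		/*
		 * In the idle loop but idle accounting is paused: the cpu is
		 * currently servicing an interrupt, so it is already awake.
		 */
	} else if (ts->tick_stopped) {
		/*
		 * Deep nohz idle. Note that on a nohz_full cpu the tick is
		 * stopped even while a task runs, so tick_stopped alone
		 * (without the inidle check) cannot identify an idle cpu.
		 */
	}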
Alex Shi Jan. 15, 2014, 2:28 p.m. UTC | #2
On 01/15/2014 04:05 PM, Michael wang wrote:
> On 01/15/2014 02:45 PM, Alex Shi wrote:
> [snip]
>>
>> Yes, to handle your scenario we would need to know the next timer of
>> each idle cpu, but even that is not enough, since interrupts are
>> totally unpredictable. So I'd rather bear with the coarse method for now.
>>>
>>> So what about just checking 'ts->tick_stopped' and recording one
>>> still-ticking idle cpu? The cost could be lower than a time comparison,
>>> and we could reduce the risk maybe... (well, not so risky since the
>>> logic only kicks in when the system is relaxed, with several cpus idle)
>>
>> First, nohz_full also stops the tick. Second, tick_stopped cannot
>> reflect interrupts: when an idle cpu is interrupted it is already
>> awake, which makes it a good candidate to run a task.
> 
> IMHO, if we have to gamble here, we'd better choose the cheaper bet,
> unless we can prove this 'coarse method' has a higher chance of a
> BINGO than just checking 'tick_stopped'...

The tick is also stopped on a nohz_full CPU even though that cpu still has a task running...
> 
> BTW, maybe the logic should be in select_idle_sibling()?

Both functions need to be considered.
> 
> Regards,
> Michael Wang
> 
>>
>
Alex Shi Jan. 15, 2014, 2:37 p.m. UTC | #3
On 01/15/2014 03:35 PM, Peter Zijlstra wrote:
> On Wed, Jan 15, 2014 at 12:07:59PM +0800, Alex Shi wrote:
>> Currently we just try to find the least loaded cpu. If some cpus are
>> idle, we just pick the first such cpu in the cpu mask.
>>
>> In fact we can pick the idle cpu that is currently handling an
>> interrupt, or the one that went idle most recently; then we may get a
>> benefit in both latency and power. The selected cpu may not be the
>> best, since another cpu may be interrupted while we are selecting, but
>> being more precise costs too much.
> 
> No, we should not do anything like this without first integrating
> cpuidle.
> 
> At which point we have a sane view of the idle states and can make a
> sane choice between them.
> 


Daniel,

Any comments to make it better?
Daniel Lezcano Jan. 16, 2014, 11:03 a.m. UTC | #4
On 01/15/2014 03:37 PM, Alex Shi wrote:
> On 01/15/2014 03:35 PM, Peter Zijlstra wrote:
>> On Wed, Jan 15, 2014 at 12:07:59PM +0800, Alex Shi wrote:
>>> Currently we just try to find the least loaded cpu. If some cpus are
>>> idle, we just pick the first such cpu in the cpu mask.
>>>
>>> In fact we can pick the idle cpu that is currently handling an
>>> interrupt, or the one that went idle most recently; then we may get a
>>> benefit in both latency and power. The selected cpu may not be the
>>> best, since another cpu may be interrupted while we are selecting, but
>>> being more precise costs too much.
>>
>> No, we should not do anything like this without first integrating
>> cpuidle.
>>
>> At which point we have a sane view of the idle states and can make a
>> sane choice between them.
>>
>
>
> Daniel,
>
> Any comments to make it better?

Hi Alex,

It is a nice optimization attempt, but I agree with Peter that we should
focus on integrating cpuidle.

The question is "how do we integrate cpuidle ?"

IMHO, the main problem is the governors, especially the menu governor.

The menu governor tries to predict events per cpu. This approach, which
gave us a nice benefit for power saving, may not fit well with the
scheduler.

I think we can classify the events in three categories:

1. fully predictable (timers)
2. partially predictable (e.g. MMC, ssd or network)
3. unpredictable (e.g. keyboard, network ingress after a quiescent period)

The menu governor mixes 2 and 3, using statistics and a performance
multiplier to reach shallower states based on heuristics and
experimentation for a specific platform.

I was wondering if we shouldn't create per-task io latency tracking.

Mostly based on io_schedule and io_schedule_timeout, we would track the
latency for each task and each device, keeping up to date an rb-tree
whose left-most leaf is the minimum latency among all the tasks running
on a specific cpu. That allows better tracking when tasks move across
cpus.

With this approach, we have something consistent with the per-task load
tracking.

This io latency tracking gives us the next wakeup event, which we can
inject directly into the cpuidle framework. That removes all the code
related to the menu governor's IO-event statistics and simplifies the
menu governor code a lot. So we would replace a piece of the cpuidle
code with scheduler code which I hope could be better at prediction,
getting us part of the way to integration.
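
(A rough, uncompiled sketch of what such per-task io latency tracking
could look like; all identifiers below are hypothetical and do not exist
in the kernel. One tree per cpu, ordered by the predicted wakeup time of
tasks that blocked for io; the left-most node is the next expected io
wakeup, which is the value a governor would consume.)

#include <linux/rbtree.h>
#include <linux/types.h>

struct io_latency_node {
	struct rb_node	node;
	u64		expected_wakeup;	/* ns; learned from past io waits */
};

struct io_latency_tree {
	struct rb_root		root;
	struct io_latency_node	*leftmost;	/* earliest expected wakeup */
};

/* called when a task blocks for io: queue its predicted wakeup time */
static void io_latency_enqueue(struct io_latency_tree *tree,
			       struct io_latency_node *n)
{
	struct rb_node **link = &tree->root.rb_node, *parent = NULL;
	bool leftmost = true;

	while (*link) {
		struct io_latency_node *entry;

		parent = *link;
		entry = rb_entry(parent, struct io_latency_node, node);
		if (n->expected_wakeup < entry->expected_wakeup) {
			link = &parent->rb_left;
		} else {
			link = &parent->rb_right;
			leftmost = false;
		}
	}
	if (leftmost)
		tree->leftmost = n;

	rb_link_node(&n->node, parent, link);
	rb_insert_color(&n->node, &tree->root);
}

/* next expected io wakeup on this cpu, for injection into cpuidle */
static u64 io_latency_next_event(struct io_latency_tree *tree)
{
	return tree->leftmost ? tree->leftmost->expected_wakeup : (u64)-1;
}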

In order to finish integrating the cpuidle framework into the scheduler,
there are pending questions about the impact on the current design.

Peter or Ingo, if you have time, could you have a look at the email I 
sent previously [1] ?

Thanks

   -- Daniel


[1] https://lkml.org/lkml/2013/12/17/106
Daniel Lezcano Jan. 16, 2014, 12:16 p.m. UTC | #5
On 01/16/2014 12:38 PM, Peter Zijlstra wrote:
> On Thu, Jan 16, 2014 at 12:03:13PM +0100, Daniel Lezcano wrote:
>> Hi Alex,
>>
>> It is a nice optimization attempt, but I agree with Peter that we should
>> focus on integrating cpuidle.
>>
>> The question is "how do we integrate cpuidle ?"
>>
>> IMHO, the main problem is the governors, especially the menu governor.
>
> Yah.
>
>> The menu governor tries to predict events per cpu. This approach, which
>> gave us a nice benefit for power saving, may not fit well with the
>> scheduler.
>
> So the way to start all this is I think to gradually share more and
> more.
>
> Start by pulling in the actual idle state; such that we can indeed
> observe what the relative cost is of waking a cpu (against another), and
> maybe even the predicted wakeup time.

Ok, I will send a patch for this.
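
(A minimal sketch of what "pulling in the actual idle state" could look
like: a per-cpu record of the cpuidle state a cpu entered, which the
scheduler can compare by exit latency. The names below are illustrative
assumptions only, not the patch being promised here.)

#include <linux/cpuidle.h>
#include <linux/percpu.h>

/* hypothetical per-cpu record of the idle state a cpu has entered */
static DEFINE_PER_CPU(struct cpuidle_state *, entered_idle_state);

/* set from the idle loop around the cpuidle_enter() call, cleared on exit */
static inline void record_idle_state(int cpu, struct cpuidle_state *state)
{
	per_cpu(entered_idle_state, cpu) = state;
}

/* scheduler side: among idle candidates, prefer the cheapest one to wake */
static bool cheaper_to_wake(int cpu, int best)
{
	struct cpuidle_state *a = per_cpu(entered_idle_state, cpu);
	struct cpuidle_state *b = per_cpu(entered_idle_state, best);

	if (!a)			/* cpu is not (or no longer) in an idle state */
		return true;
	if (!b)
		return false;
	return a->exit_latency < b->exit_latency;
}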

> Then pull in the various statistics gathering bits -- without improving
> them.
>
> Then improve the statistics; try and remove duplicate statistics -- if
> there are such things, try and use the extra information the scheduler
> has, etc.
>
> Then worry about the governors, or what's left of them.
>
>> In order to finish integrating the cpuidle framework into the scheduler,
>> there are pending questions about the impact on the current design.
>>
>> Peter or Ingo, if you have time, could you have a look at the email I sent
>> previously [1] ?
>
> I read it once, it didn't make sense at the time, I just read it again,
> still doesn't make sense.

:)

The question arose when I looked closely at how to fully integrate
cpuidle with the scheduler; in particular, the idle time.
The scheduler's idle time is not the same as cpuidle's idle time.
A cpu can be idle for 1s from the scheduler's point of view, but it may
be interrupted several times during that second, so the idle time seen
by cpuidle is different. But anyway ...

> We need the idle task, since we need to DO something to go idle, the
> scheduler needs to pick a task to go do that something. This is the idle
> task.
>
> You cannot get rid of that.
>
> In fact, the 'doing' of that task is running much of the cpuidle code,
> so by getting rid of it, there's nobody left to execute that code.
>
> Also, since it's already running that cpuidle stuff, integrating it more
> closely with the scheduler will not in fact change much, it will still
> run it.
>
> Could of course be I'm not reading what you meant to write, if so, do
> try again ;-)

Well, I wanted to clarify what your feeling was about how to integrate
cpuidle into the scheduler. If removing the idle task (in the future)
does not make sense to you, I will not insist. Let's see how the code
evolves as we integrate cpuidle, and we will figure out what the impact
on the idle task will be.

Thanks for your feedback

   -- Daniel

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c7395d9..fb52d26 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4167,6 +4167,26 @@  find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 			min_load = load;
 			idlest = i;
 		}
+#ifdef CONFIG_NO_HZ_COMMON
+		/*
+		 * Coarse attempt to get the latest idle cpu, for shorter
+		 * latency and a possible power benefit.
+		 */
+		if (!min_load) {
+			struct tick_sched *ts = &per_cpu(tick_cpu_sched, i);
+
+			s64 latest_wake = 0;
+			/* idle cpu doing irq */
+			if (ts->inidle && !ts->idle_active)
+				idlest = i;
+			/* the cpu resched */
+			else if (!ts->inidle)
+				idlest = i;
+			/* find latest idle cpu */
+			else if (ktime_to_us(ts->idle_entrytime) > latest_wake)
+				idlest = i;
+		}
+#endif
 	}
 
 	return idlest;
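
(For illustration only: the same "latest idle cpu" scan written as a
standalone fragment mirroring the body of the 3.13-era find_idlest_cpu(),
with the wakeup timestamp carried across loop iterations. In the hunk
above, latest_wake is re-initialised to 0 on every pass, so the final
comparison never sees an earlier candidate. Nothing here is merged code.)

	s64 latest_wake = 0;

	for_each_cpu_and(i, sched_group_cpus(group), tsk_cpus_allowed(p)) {
		/* ... existing least-load selection elided ... */

		if (!min_load) {
			struct tick_sched *ts = &per_cpu(tick_cpu_sched, i);
			s64 entry = ktime_to_us(ts->idle_entrytime);

			if (ts->inidle && !ts->idle_active) {
				/* idle cpu currently handling an irq: already awake */
				idlest = i;
			} else if (!ts->inidle) {
				/* cpu has left the idle loop: it was rescheduled */
				idlest = i;
			} else if (entry > latest_wake) {
				/* still idle: remember the most recently idled cpu */
				latest_wake = entry;
				idlest = i;
			}
		}
	}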