[RFC,3/3] idle: store the idle state index in the struct rq

Message ID 1391090962-15032-4-git-send-email-daniel.lezcano@linaro.org
State New
Headers show

Commit Message

Daniel Lezcano Jan. 30, 2014, 2:09 p.m. UTC
The main cpuidle function is now part of idle.c, which belongs to the sched
directory, allowing it to access the private structures used by the
scheduler.

We can store the index of the idle state the cpu is in and reuse it in the
scheduler to make better decisions regarding load balancing and the choice
of the idlest cpu.

Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
---
 kernel/sched/idle.c  |    9 +++++++++
 kernel/sched/sched.h |    3 +++
 2 files changed, 12 insertions(+)

Comments

Daniel Lezcano Jan. 30, 2014, 4:27 p.m. UTC | #1
On 01/30/2014 04:31 PM, Peter Zijlstra wrote:
> On Thu, Jan 30, 2014 at 03:09:22PM +0100, Daniel Lezcano wrote:
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index 90aef084..130debf 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -654,6 +654,9 @@ struct rq {
>>   #endif
>>
>>   	struct sched_avg avg;
>> +#ifdef CONFIG_CPU_IDLE
>> +	int idle_index;
>> +#endif
>>   };
>
> So my 'problem' with this is that I've no ff'ing clue what the
> idle_index is and what I can do with it.

The cpuidle framework works with an index into the idle states array. This 
array contains all the idle states the backend driver supports. The cpuidle 
framework asks the governor which idle state it should use for the cpu; the 
governor looks at the different characteristics of the idle states plus 
timing information and returns the index of the idle state it thinks suits 
best. This step is 'cpuidle_select'.

Then the cpuidle framework calls the idle callback associated with the 
idle state at that index. The cpu stays blocked in this callback until an 
interrupt occurs. This step is 'cpuidle_enter'.
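
In code, the flow looks roughly like this (a condensed sketch of the idle 
path, error handling omitted; the full version is in the patch below):

/* Condensed sketch of cpuidle_idle_call(); error handling omitted. */
static int cpuidle_idle_call(void)
{
	struct cpuidle_device *dev = __this_cpu_read(cpuidle_devices);
	struct cpuidle_driver *drv = cpuidle_get_cpu_driver(dev);
	int next_state, entered_state;

	/* 'cpuidle_select': the governor picks an index into drv->states[] */
	next_state = cpuidle_select(drv, dev);

	/* 'cpuidle_enter': the cpu blocks here until an interrupt occurs */
	entered_state = cpuidle_enter(drv, dev, next_state);

	/* give the governor an opportunity to reflect on the outcome */
	cpuidle_reflect(dev, entered_state);

	return 0;
}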

Between 'cpuidle_select' and 'cpuidle_enter', this patch stores the index 
of the idle state the cpu is about to enter. So from the scheduler, with 
the current cpuidle api, it is easy to get information about the idle 
state a cpu is in:

struct cpuidle_device *dev = __this_cpu_read(cpuidle_devices);
struct cpuidle_driver *drv = cpuidle_get_cpu_driver(dev);

struct cpuidle_state *state = &drv->states[rq->index];

And from the state, we have the following information:

struct cpuidle_state {

	[ ... ]

         unsigned int    exit_latency; /* in US */
         int             power_usage; /* in mW */
         unsigned int    target_residency; /* in US */
         bool            disabled; /* disabled on all CPUs */

	[ ... ]
};
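
For instance, the scheduler could read the pending wake-up cost of an idle 
cpu along these lines (a sketch only; it assumes the rq->idle_index added 
by this patch is valid, i.e. >= 0, while the cpu is idle, and that cpuidle 
is active on that cpu):

/* Sketch: wake-up latency of an idle cpu via the stored index. */
static unsigned int idle_exit_latency(int cpu)
{
	struct cpuidle_device *dev = per_cpu(cpuidle_devices, cpu);
	struct cpuidle_driver *drv = cpuidle_get_cpu_driver(dev);

	return drv->states[cpu_rq(cpu)->idle_index].exit_latency; /* us */
}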

IIRC, Alex Shi sent a patchset to improve the choosing of the idlest cpu 
and the exit_latency was needed.
Daniel Lezcano Jan. 30, 2014, 5:25 p.m. UTC | #2
On 01/30/2014 05:35 PM, Peter Zijlstra wrote:
> On Thu, Jan 30, 2014 at 05:27:54PM +0100, Daniel Lezcano wrote:
>> struct cpuidle_state *state = &drv->states[rq->index];
>>
>> And from the state, we have the following informations:
>>
>> struct cpuidle_state {
>>
>> 	[ ... ]
>>
>>          unsigned int    exit_latency; /* in US */
>>          int             power_usage; /* in mW */
>>          unsigned int    target_residency; /* in US */
>>          bool            disabled; /* disabled on all CPUs */
>>
>> 	[ ... ]
>> };
>
> Right, but can we say that a higher index will save more power and have
> a higher exit latency? Or is a driver free to have a random mapping from
> idle_index to state?

If the driver does its own random mapping, that will break the governor 
logic. So yes, the states are ordered: the higher the index, the more 
power you save and the higher the exit latency.

> Also, we should probably create a pretty function to get that state,
> just like you did in patch 1.

Yes, right.
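
Something like this, perhaps (a sketch; the name and the NULL convention 
for a non-idle cpu are made up):

/* Sketch of such a helper; name and placement are tentative. */
static inline struct cpuidle_state *idle_get_state(struct rq *rq)
{
	struct cpuidle_device *dev = per_cpu(cpuidle_devices, cpu_of(rq));
	struct cpuidle_driver *drv = cpuidle_get_cpu_driver(dev);

	return rq->idle_index >= 0 ? &drv->states[rq->idle_index] : NULL;
}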

>> IIRC, Alex Shi sent a patchset to improve the choosing of the idlest cpu and
>> the exit_latency was needed.
>
> Right. However if we have a 'natural' order in the state array the index
> itself might often be sufficient to find the least idle state, in this
> specific case the absolute exit latency doesn't matter, all we want is
> the lowest one.

Indeed. It could be as simple as that. I feel we may need more 
information in the future, but comparing the indexes could be a nice, 
simple and efficient solution.
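
For example, something like this sketch (the function name is made up; 
an index of -1 is assumed to mean 'not idle'):

/* Sketch: pick the least idle cpu in a mask by comparing raw indexes. */
static int find_shallowest_idle_cpu(const struct cpumask *mask)
{
	int cpu, best_cpu = -1, best_index = INT_MAX;

	for_each_cpu(cpu, mask) {
		int index = cpu_rq(cpu)->idle_index;

		if (index >= 0 && index < best_index) {
			best_index = index;
			best_cpu = cpu;
		}
	}

	return best_cpu;
}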

> Not dereferencing the state array saves hitting cold cachelines.

Yeah, always good to be reminded of that. Something to keep in mind for later.

Thanks for your comments.

   -- Daniel
Lorenzo Pieralisi Jan. 30, 2014, 5:50 p.m. UTC | #3
On Thu, Jan 30, 2014 at 05:25:27PM +0000, Daniel Lezcano wrote:
> On 01/30/2014 05:35 PM, Peter Zijlstra wrote:
> > On Thu, Jan 30, 2014 at 05:27:54PM +0100, Daniel Lezcano wrote:
> >> struct cpuidle_state *state = &drv->states[rq->index];
> >>
> >> And from the state, we have the following informations:
> >>
> >> struct cpuidle_state {
> >>
> >> 	[ ... ]
> >>
> >>          unsigned int    exit_latency; /* in US */
> >>          int             power_usage; /* in mW */
> >>          unsigned int    target_residency; /* in US */
> >>          bool            disabled; /* disabled on all CPUs */
> >>
> >> 	[ ... ]
> >> };
> >
> > Right, but can we say that a higher index will save more power and have
> > a higher exit latency? Or is a driver free to have a random mapping from
> > idle_index to state?
> 
> If the driver does its own random mapping that will break the governor 
> logic. So yes, the states are ordered, the higher the index is, the more 
> you save power and the higher the exit latency is.
> 
> > Also, we should probably create a pretty function to get that state,
> > just like you did in patch 1.
> 
> Yes, right.
> 
> >> IIRC, Alex Shi sent a patchset to improve the choosing of the idlest cpu and
> >> the exit_latency was needed.
> >
> > Right. However if we have a 'natural' order in the state array the index
> > itself might often be sufficient to find the least idle state, in this
> > specific case the absolute exit latency doesn't matter, all we want is
> > the lowest one.
> 
> Indeed. It could be simple as that. I feel we may need more informations 
> in the future but comparing the indexes could be a nice simple and 
> efficient solution.

As long as we take into account that some states might require multiple
CPUs to be idle in order to be entered, fine by me. But we should
certainly avoid waking up a CPU in a cluster that is in eg C2 (all CPUs in
C2, so cluster in C2) when there are CPUs in C3 in other clusters with
some CPUs running in those clusters, because there C3 means "CPU in C3, not
cluster in C3". Overall what I am saying is that what you are doing
makes perfect sense but we have to take the above into account.

Some states have CPU and cluster (or we can call it package) components,
and that's true on ARM and other architectures too, to the best of my
knowledge.

Lorenzo

Nicolas Pitre Jan. 30, 2014, 9:02 p.m. UTC | #4
On Thu, 30 Jan 2014, Lorenzo Pieralisi wrote:

> On Thu, Jan 30, 2014 at 05:25:27PM +0000, Daniel Lezcano wrote:
> > On 01/30/2014 05:35 PM, Peter Zijlstra wrote:
> > > On Thu, Jan 30, 2014 at 05:27:54PM +0100, Daniel Lezcano wrote:
> > >> IIRC, Alex Shi sent a patchset to improve the choosing of the idlest cpu and
> > >> the exit_latency was needed.
> > >
> > > Right. However if we have a 'natural' order in the state array the index
> > > itself might often be sufficient to find the least idle state, in this
> > > specific case the absolute exit latency doesn't matter, all we want is
> > > the lowest one.
> > 
> > Indeed. It could be simple as that. I feel we may need more informations 
> > in the future but comparing the indexes could be a nice simple and 
> > efficient solution.
> 
> As long as we take into account that some states might require multiple
> CPUs to be idle in order to be entered, fine by me. But we should
> certainly avoid waking up a CPU in a cluster that is in eg C2 (all CPUs in
> C2, so cluster in C2) when there are CPUs in C3 in other clusters with
> some CPUs running in those clusters, because there C3 means "CPU in C3, not
> cluster in C3". Overall what I am saying is that what you are doing
> makes perfect sense but we have to take the above into account.
> 
> Some states have CPU and cluster (or we can call it package) components,
> and that's true on ARM and other architectures too, to the best of my
> knowledge.

The notion of cluster or package maps pretty naturally onto scheduling 
domains.  And the search for an idle CPU to wake up should avoid a 
scheduling domain with a load of zero (which is obviously a prerequisite 
for a power save mode to be applied at the cluster level) if there 
already exist idle CPUs in another domain whose load is not zero (all 
other considerations being equal).  Hence your concern would be 
addressed without any particular issue, even if the individual CPU idle 
state index is not exactly in sync with reality because of other 
hardware related constraints.

The other solution consists in making the index dynamic.  That means 
letting backend idle drivers change it, i.e. when the last man in a 
cluster goes idle it could update the index for all the other CPUs in 
the cluster.  There is no locking needed, as the scheduler is only 
consuming this info, and the scheduler getting it wrong on rare 
occasions is not a big deal either.  But that looks pretty ugly, as at 
least 2 levels of abstraction would be breached in this case.


Nicolas
Vincent Guittot Jan. 31, 2014, 9:46 a.m. UTC | #5
On 30 January 2014 22:02, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
> On Thu, 30 Jan 2014, Lorenzo Pieralisi wrote:
>
>> On Thu, Jan 30, 2014 at 05:25:27PM +0000, Daniel Lezcano wrote:
>> > On 01/30/2014 05:35 PM, Peter Zijlstra wrote:
>> > > On Thu, Jan 30, 2014 at 05:27:54PM +0100, Daniel Lezcano wrote:
>> > >> IIRC, Alex Shi sent a patchset to improve the choosing of the idlest cpu and
>> > >> the exit_latency was needed.
>> > >
>> > > Right. However if we have a 'natural' order in the state array the index
>> > > itself might often be sufficient to find the least idle state, in this
>> > > specific case the absolute exit latency doesn't matter, all we want is
>> > > the lowest one.
>> >
>> > Indeed. It could be simple as that. I feel we may need more informations
>> > in the future but comparing the indexes could be a nice simple and
>> > efficient solution.
>>
>> As long as we take into account that some states might require multiple
>> CPUs to be idle in order to be entered, fine by me. But we should
>> certainly avoid waking up a CPU in a cluster that is in eg C2 (all CPUs in
>> C2, so cluster in C2) when there are CPUs in C3 in other clusters with
>> some CPUs running in those clusters, because there C3 means "CPU in C3, not
>> cluster in C3". Overall what I am saying is that what you are doing
>> makes perfect sense but we have to take the above into account.
>>
>> Some states have CPU and cluster (or we can call it package) components,
>> and that's true on ARM and other architectures too, to the best of my
>> knowledge.
>
> The notion of cluster or package maps pretty naturally onto scheduling
> domains.  And the search for an idle CPU to wake up should avoid a
> scheduling domain with a load of zero (which is obviously a prerequisite
> for a power save mode to be applied to the cluster level) if there exist
> idle CPUs in another domain already which load is not zero (all other
> considerations being equal).  Hence your concern would be addressed
> without any particular issue even if the individual CPU idle state index
> is not exactly in sync with reality because of other hardware related
> constraints.

It's not only a problem of packing in one cluster, but also of checking
the cost of waking up a CPU against the estimated load of the task.
The main problem with only having the index is that the reality
(latency and power consumption) can differ from the targeted c-state,
because the system waits until all the conditions for entering this
state have been met. So you will have the wrong values when looking
for the best core for a task.

>
> The other solution consists in making the index dynamic.  That means
> letting backend idle drivers change it i.e. when the last man in a
> cluster goes idle it could update the index for all the other CPUs in
> the cluster.  There is no locking needed as the scheduler is only
> consuming this info, and the scheduler getting it wrong on rare
> occasions is not a big deal either.  But that looks pretty ugly as at
> least 2 levels of abstractions would be breached in this case.

But it's the only way to get a good view of the current state of a core.

Vincent

>
>
> Nicolas
Lorenzo Pieralisi Jan. 31, 2014, 10:04 a.m. UTC | #6
On Thu, Jan 30, 2014 at 09:02:15PM +0000, Nicolas Pitre wrote:
> On Thu, 30 Jan 2014, Lorenzo Pieralisi wrote:
> 
> > On Thu, Jan 30, 2014 at 05:25:27PM +0000, Daniel Lezcano wrote:
> > > On 01/30/2014 05:35 PM, Peter Zijlstra wrote:
> > > > On Thu, Jan 30, 2014 at 05:27:54PM +0100, Daniel Lezcano wrote:
> > > >> IIRC, Alex Shi sent a patchset to improve the choosing of the idlest cpu and
> > > >> the exit_latency was needed.
> > > >
> > > > Right. However if we have a 'natural' order in the state array the index
> > > > itself might often be sufficient to find the least idle state, in this
> > > > specific case the absolute exit latency doesn't matter, all we want is
> > > > the lowest one.
> > > 
> > > Indeed. It could be simple as that. I feel we may need more informations 
> > > in the future but comparing the indexes could be a nice simple and 
> > > efficient solution.
> > 
> > As long as we take into account that some states might require multiple
> > CPUs to be idle in order to be entered, fine by me. But we should
> > certainly avoid waking up a CPU in a cluster that is in eg C2 (all CPUs in
> > C2, so cluster in C2) when there are CPUs in C3 in other clusters with
> > some CPUs running in those clusters, because there C3 means "CPU in C3, not
> > cluster in C3". Overall what I am saying is that what you are doing
> > makes perfect sense but we have to take the above into account.
> > 
> > Some states have CPU and cluster (or we can call it package) components,
> > and that's true on ARM and other architectures too, to the best of my
> > knowledge.
> 
> The notion of cluster or package maps pretty naturally onto scheduling 
> domains.  And the search for an idle CPU to wake up should avoid a 
> scheduling domain with a load of zero (which is obviously a prerequisite 
> for a power save mode to be applied to the cluster level) if there exist 
> idle CPUs in another domain already which load is not zero (all other 
> considerations being equal).  Hence your concern would be addressed 
> without any particular issue even if the individual CPU idle state index 
> is not exactly in sync with reality because of other hardware related 
> constraints.

Yes, I just wanted to mention that relying solely on the C-state index
is not enough and can lead to surprises; as long as we take that into
account, it is ok. The highest index does not mean the idlest CPU in
power consumption terms, because of cluster/package shared components.

It is probably worth noting that the meaning of parameters like eg
exit_latency necessarily depends on the other CPUs in a cluster/package
for a given C-state; the same logic as in my reasoning above applies.

> The other solution consists in making the index dynamic.  That means 
> letting backend idle drivers change it i.e. when the last man in a 
> cluster goes idle it could update the index for all the other CPUs in 
> the cluster.  There is no locking needed as the scheduler is only 
> consuming this info, and the scheduler getting it wrong on rare 
> occasions is not a big deal either.  But that looks pretty ugly as at 
> least 2 levels of abstractions would be breached in this case.

Yes, that's ugly, and that's the reason why tracing cluster C-states
requires external tools (or HW probing), like the one Daniel wrote to
analyze the time spent in cluster states.

Overall, it is not a concern; it is something we should take into account,
which is why I mentioned it.

Thanks,
Lorenzo

Daniel Lezcano Jan. 31, 2014, 10:15 a.m. UTC | #7
On 01/31/2014 09:45 AM, Preeti Murthy wrote:
> Hi,
>
> On Thu, Jan 30, 2014 at 10:55 PM, Daniel Lezcano
> <daniel.lezcano@linaro.org> wrote:
>> On 01/30/2014 05:35 PM, Peter Zijlstra wrote:
>>>
>>> On Thu, Jan 30, 2014 at 05:27:54PM +0100, Daniel Lezcano wrote:
>>>>
>>>> struct cpuidle_state *state = &drv->states[rq->index];
>>>>
>>>> And from the state, we have the following informations:
>>>>
>>>> struct cpuidle_state {
>>>>
>>>>          [ ... ]
>>>>
>>>>           unsigned int    exit_latency; /* in US */
>>>>           int             power_usage; /* in mW */
>>>>           unsigned int    target_residency; /* in US */
>>>>           bool            disabled; /* disabled on all CPUs */
>>>>
>>>>          [ ... ]
>>>> };
>>>
>>>
>>> Right, but can we say that a higher index will save more power and have
>>> a higher exit latency? Or is a driver free to have a random mapping from
>>> idle_index to state?
>>
>>
>> If the driver does its own random mapping that will break the governor
>> logic. So yes, the states are ordered, the higher the index is, the more you
>> save power and the higher the exit latency is.
>
> The above point holds true for only the ladder governor which sees the idle
> states indexed in the increasing order of target_residency/exit_latency.

The cpuidle framework has been modified for both governors, see commit 
8aef33a7.

The power field was initially used to do the selection, but no real 
power value was ever provided by any hardware to fill this field. So the 
field was arbitrarily filled with decreasing values (-1, -2, -3 ...) and 
used by the governor's select function. The patch above just removed 
this field and the condition on power in 'select', assuming the idle 
states are power-ordered in the array.

> However this is not true as far as I can see in the menu governor. It
> acknowledges the dynamic ordering of idle states as can be seen in the
> menu_select() function in the menu governor, where the idle state for the
> CPU gets chosen.  You will notice that, even if it is found that the predicted
> idle time of the CPU is smaller than the target residency of an idle state,
> the governor continues to search for suitable idle states in the higher indexed
> states although it should have halted if the idle states' were ordered according
> to their target residency.. The same holds for exit_latency.

I am not sure I get the point. Actually, this loop should just be 
optimized to search the idle states backward, like cpuidle_play_dead() 
does.

There is also a patch proposed by Alex Shi about this loop:

[RFC PATCH] cpuidle: reduce unnecessary loop in c-state selection

http://comments.gmane.org/gmane.linux.power-management.general/42124
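
i.e. something along these lines (a sketch only: predicted_us and 
latency_req stand in for the menu governor's internal values, and the 
states are assumed ordered by depth):

/* Sketch of a backward search, as cpuidle_play_dead() does: return
 * the deepest enabled state that fits the constraints.
 */
static int menu_select_backward(struct cpuidle_driver *drv,
				struct cpuidle_device *dev,
				unsigned int predicted_us,
				unsigned int latency_req)
{
	int i;

	for (i = drv->state_count - 1; i >= CPUIDLE_DRIVER_STATE_START; i--) {
		struct cpuidle_state *s = &drv->states[i];

		if (s->disabled || dev->states_usage[i].disable)
			continue;
		if (s->target_residency <= predicted_us &&
		    s->exit_latency <= latency_req)
			return i;
	}

	return CPUIDLE_DRIVER_STATE_START;
}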

> Hence I think this patch would make sense only with additional information
> like exit_latency or target_residency is present for the scheduler. The idle
> state index alone will not be sufficient.

Maybe I misunderstood, but if you have the index, you can get the idle 
state, hence the exit_latency and the target_residency, no?

>>
>>> Also, we should probably create a pretty function to get that state,
>>> just like you did in patch 1.
>>
>>
>> Yes, right.
>>
>>
>>>> IIRC, Alex Shi sent a patchset to improve the choosing of the idlest cpu
>>>> and
>>>> the exit_latency was needed.
>>>
>>>
>>> Right. However if we have a 'natural' order in the state array the index
>>> itself might often be sufficient to find the least idle state, in this
>>> specific case the absolute exit latency doesn't matter, all we want is
>>> the lowest one.
>>
>>
>> Indeed. It could be simple as that. I feel we may need more informations in
>> the future but comparing the indexes could be a nice simple and efficient
>> solution.
>>
>>
>>> Not dereferencing the state array saves hitting cold cachelines.
>>
>>
>> Yeah, always good to remind that. Should keep in mind for later.
>>
>> Thanks for your comments.
>>
>>    -- Daniel
Daniel Lezcano Jan. 31, 2014, 10:44 a.m. UTC | #8
On 01/30/2014 10:02 PM, Nicolas Pitre wrote:
> On Thu, 30 Jan 2014, Lorenzo Pieralisi wrote:
>
>> On Thu, Jan 30, 2014 at 05:25:27PM +0000, Daniel Lezcano wrote:
>>> On 01/30/2014 05:35 PM, Peter Zijlstra wrote:
>>>> On Thu, Jan 30, 2014 at 05:27:54PM +0100, Daniel Lezcano wrote:
>>>>> IIRC, Alex Shi sent a patchset to improve the choosing of the idlest cpu and
>>>>> the exit_latency was needed.
>>>>
>>>> Right. However if we have a 'natural' order in the state array the index
>>>> itself might often be sufficient to find the least idle state, in this
>>>> specific case the absolute exit latency doesn't matter, all we want is
>>>> the lowest one.
>>>
>>> Indeed. It could be simple as that. I feel we may need more informations
>>> in the future but comparing the indexes could be a nice simple and
>>> efficient solution.
>>
>> As long as we take into account that some states might require multiple
>> CPUs to be idle in order to be entered, fine by me. But we should
>> certainly avoid waking up a CPU in a cluster that is in eg C2 (all CPUs in
>> C2, so cluster in C2) when there are CPUs in C3 in other clusters with
>> some CPUs running in those clusters, because there C3 means "CPU in C3, not
>> cluster in C3". Overall what I am saying is that what you are doing
>> makes perfect sense but we have to take the above into account.
>>
>> Some states have CPU and cluster (or we can call it package) components,
>> and that's true on ARM and other architectures too, to the best of my
>> knowledge.
>
> The notion of cluster or package maps pretty naturally onto scheduling
> domains.  And the search for an idle CPU to wake up should avoid a
> scheduling domain with a load of zero (which is obviously a prerequisite
> for a power save mode to be applied to the cluster level) if there exist
> idle CPUs in another domain already which load is not zero (all other
> considerations being equal).  Hence your concern would be addressed
> without any particular issue even if the individual CPU idle state index
> is not exactly in sync with reality because of other hardware related
> constraints.
>
> The other solution consists in making the index dynamic.  That means
> letting backend idle drivers change it i.e. when the last man in a
> cluster goes idle it could update the index for all the other CPUs in
> the cluster.  There is no locking needed as the scheduler is only
> consuming this info, and the scheduler getting it wrong on rare
> occasions is not a big deal either.  But that looks pretty ugly as at
> least 2 levels of abstractions would be breached in this case.

Yes, I agree it would break the levels of abstraction and I don't think 
it is worth taking this into account for now.

Let's consider the current status:

1. there are archs where the cluster dependency is handled by the 
firmware and where the 'intermediate' idle state used to wait for the 
cpu sync is hidden by the firmware's level of abstraction. This is the 
case for the x86 arch and ARM platforms with PSCI, which represent most 
of the hardware.

2. there are archs where the cluster dependency is handled by the 
cpuidle coupled idle states and where the cpumask (stored in the idle 
state structure) gives us this dependency; this is a very small part of 
the hardware, and most of those boards are at EOL (omap4, tegra2).

3. there are archs where the cluster dependency is built from the device 
tree and where a mapping for the cluster topology is under discussion.

4. there are archs where the cluster dependency is reflected by the 
usage of the multiple cpuidle driver support (big.LITTLE).

Having the index stored in the struct rq is a good first step to 
integrate cpuidle with the scheduler, even if the result is not fully 
accurate at the beginning.
Daniel Lezcano Jan. 31, 2014, 2:04 p.m. UTC | #9
On 01/31/2014 10:39 AM, Preeti U Murthy wrote:
> Hi Peter,
>
> On 01/31/2014 02:32 PM, Peter Zijlstra wrote:
>> On Fri, Jan 31, 2014 at 02:15:47PM +0530, Preeti Murthy wrote:
>>>>
>>>> If the driver does its own random mapping that will break the governor
>>>> logic. So yes, the states are ordered, the higher the index is, the more you
>>>> save power and the higher the exit latency is.
>>>
>>> The above point holds true for only the ladder governor which sees the idle
>>> states indexed in the increasing order of target_residency/exit_latency.
>>>
>>> However this is not true as far as I can see in the menu governor. It
>>> acknowledges the dynamic ordering of idle states as can be seen in the
>>> menu_select() function in the menu governor, where the idle state for the
>>> CPU gets chosen.  You will notice that, even if it is found that the predicted
>>> idle time of the CPU is smaller than the target residency of an idle state,
>>> the governor continues to search for suitable idle states in the higher indexed
>>> states although it should have halted if the idle states' were ordered according
>>> to their target residency.. The same holds for exit_latency.
>>>
>>> Hence I think this patch would make sense only with additional information
>>> like exit_latency or target_residency is present for the scheduler. The idle
>>> state index alone will not be sufficient.
>>
>> Alternatively, can we enforce sanity on the cpuidle infrastructure to
>> make the index naturally ordered? If not, please explain why :-)
>
> The commit id 71abbbf856a0e70 says that there are SOCs which could have
> their target_residency and exit_latency values change at runtime. This
> commit thus removed the ordering of the idle states according to their
> target_residency/exit_latency. Adding Len and Arjan to the CC.

This commit is outdated, AFAICT.

Indeed, there are dynamic idle states. Some idle states are added or 
removed when a laptop switches to battery or is plugged in.

In ACPI, the power event leads the acpi cpuidle driver to disable the 
cpuidle framework, fetch the idle states (which are ordered), and 
re-enable the cpuidle framework, which in turn kicks all the cpus. So 
the index in the struct rq should always be ok.
Dietmar Eggemann Jan. 31, 2014, 2:12 p.m. UTC | #10
On 31/01/14 14:04, Daniel Lezcano wrote:
> On 01/31/2014 10:39 AM, Preeti U Murthy wrote:
>> Hi Peter,
>>
>> On 01/31/2014 02:32 PM, Peter Zijlstra wrote:
>>> On Fri, Jan 31, 2014 at 02:15:47PM +0530, Preeti Murthy wrote:
>>>>>
>>>>> If the driver does its own random mapping that will break the governor
>>>>> logic. So yes, the states are ordered, the higher the index is, the more you
>>>>> save power and the higher the exit latency is.
>>>>
>>>> The above point holds true for only the ladder governor which sees the idle
>>>> states indexed in the increasing order of target_residency/exit_latency.
>>>>
>>>> However this is not true as far as I can see in the menu governor. It
>>>> acknowledges the dynamic ordering of idle states as can be seen in the
>>>> menu_select() function in the menu governor, where the idle state for the
>>>> CPU gets chosen.  You will notice that, even if it is found that the predicted
>>>> idle time of the CPU is smaller than the target residency of an idle state,
>>>> the governor continues to search for suitable idle states in the higher indexed
>>>> states although it should have halted if the idle states' were ordered according
>>>> to their target residency.. The same holds for exit_latency.
>>>>
>>>> Hence I think this patch would make sense only with additional information
>>>> like exit_latency or target_residency is present for the scheduler. The idle
>>>> state index alone will not be sufficient.
>>>
>>> Alternatively, can we enforce sanity on the cpuidle infrastructure to
>>> make the index naturally ordered? If not, please explain why :-)
>>
>> The commit id 71abbbf856a0e70 says that there are SOCs which could have
>> their target_residency and exit_latency values change at runtime. This
>> commit thus removed the ordering of the idle states according to their
>> target_residency/exit_latency. Adding Len and Arjan to the CC.
> 
> This commit is outdated, AFAICT.

Yes, this is also my impression. It's removed by

commit b25edc42bfb9602f0503474b2c94701d5536ce60
Author: Deepthi Dharwar <deepthi@linux.vnet.ibm.com>
Date:   Fri Oct 28 16:20:24 2011 +0530

    cpuidle: Remove CPUIDLE_FLAG_IGNORE and dev->prepare()

So far, I'm under the impression that target_residency/exit_latency is
static data and can be propagated towards the scheduler via topology
information.

-- Dietmar

> 
> Indeed, there are dynamic idle states. Some idle states are added or 
> removed when a laptop is going to battery or plugged in.
> 
> In ACPI, the power event leads the acpi cpuidle driver to disable the 
> cpuidle framework, get the idle states which are ordered, and re-enable 
> the cpuidle framework which in turn kicks all the cpus. So the index in 
> the struct rq should be always ok.
> 
> 
> 
> 


Daniel Lezcano Jan. 31, 2014, 3:37 p.m. UTC | #11
On 01/31/2014 04:07 PM, Arjan van de Ven wrote:
>>>>
>>>> Hence I think this patch would make sense only with additional
>>>> information
>>>> like exit_latency or target_residency is present for the scheduler.
>>>> The idle
>>>> state index alone will not be sufficient.
>>>
>>> Alternatively, can we enforce sanity on the cpuidle infrastructure to
>>> make the index naturally ordered? If not, please explain why :-)
>>
>> The commit id 71abbbf856a0e70 says that there are SOCs which could have
>> their target_residency and exit_latency values change at runtime. This
>> commit thus removed the ordering of the idle states according to their
>> target_residency/exit_latency. Adding Len and Arjan to the CC.
>
> the ARM folks wanted a dynamic exit latency, so.... it makes much more
> sense
> to me to store the thing you want to use (exit latency) than the number
> of the state.
>
> more than that, you can order either by target residency OR by exit
> latency,
> if you sort by one, there is no guarantee that you're also sorted by the
> other

IMO, it would be preferable to store the index for the moment, as we are 
integrating cpuidle with the scheduler. The index allows access to more 
information. Then, when everything is fully integrated, we can improve 
the result, no?

> (for example, you can on a hardware level make a "fast exit" state, and
> burn power for this faster exit,
> which means your break even gets longer to recoup this extra power
> compared to the same state without
> the fast exit)
>
Daniel Lezcano Jan. 31, 2014, 4:35 p.m. UTC | #12
On 01/31/2014 04:50 PM, Arjan van de Ven wrote:
> On 1/31/2014 7:37 AM, Daniel Lezcano wrote:
>> On 01/31/2014 04:07 PM, Arjan van de Ven wrote:
>>>>>>
>>>>>> Hence I think this patch would make sense only with additional
>>>>>> information
>>>>>> like exit_latency or target_residency is present for the scheduler.
>>>>>> The idle
>>>>>> state index alone will not be sufficient.
>>>>>
>>>>> Alternatively, can we enforce sanity on the cpuidle infrastructure to
>>>>> make the index naturally ordered? If not, please explain why :-)
>>>>
>>>> The commit id 71abbbf856a0e70 says that there are SOCs which could have
>>>> their target_residency and exit_latency values change at runtime. This
>>>> commit thus removed the ordering of the idle states according to their
>>>> target_residency/exit_latency. Adding Len and Arjan to the CC.
>>>
>>> the ARM folks wanted a dynamic exit latency, so.... it makes much more
>>> sense
>>> to me to store the thing you want to use (exit latency) than the number
>>> of the state.
>>>
>>> more than that, you can order either by target residency OR by exit
>>> latency,
>>> if you sort by one, there is no guarantee that you're also sorted by the
>>> other
>>
>> IMO, it would be preferable to store the index for the moment as we
>> are integrating cpuidle with the scheduler. The index allows to access
>> more informations. Then when
>> everything is fully integrated we can improve the result, no ?
>
> more information, yes. but if the information isn't actually accurate
> (because it keeps changing
> in the datastructure away from what it was for the cpu)... are you
> really achieving what you want?
>
> on x86 I don't care; we don't actually change these dynamically much[1].
> But if you have 1 or 2 things in mind to use,
> I would suggest copying those 2 integers instead as we go, rather than
> the index.
> Saves refcounting/locking etc etc nightmare as well on the other
> subsystems' datastructures..
> ... which you likely need to do to actually follow that index.

Hmm, yeah. That's a fair argument. It is true that the races and 
locks/refcounting are something we have to worry about. But we may also 
want to avoid duplicating the data across subsystems.

> [1] Although in an ACPI world, the total number of C states can vary,
> for example it used to be quite common
> that you got an extra C state on battery versus on wall power. This sort
> of dynamic thing requires refcounting
> if more than the local cpuidle uses the data structures.
>
Nicolas Pitre Jan. 31, 2014, 6:19 p.m. UTC | #13
On Fri, 31 Jan 2014, Arjan van de Ven wrote:

> On 1/31/2014 7:37 AM, Daniel Lezcano wrote:
> > On 01/31/2014 04:07 PM, Arjan van de Ven wrote:
> > > > > >
> > > > > > Hence I think this patch would make sense only with additional
> > > > > > information
> > > > > > like exit_latency or target_residency is present for the scheduler.
> > > > > > The idle
> > > > > > state index alone will not be sufficient.
> > > > >
> > > > > Alternatively, can we enforce sanity on the cpuidle infrastructure to
> > > > > make the index naturally ordered? If not, please explain why :-)
> > > >
> > > > The commit id 71abbbf856a0e70 says that there are SOCs which could have
> > > > their target_residency and exit_latency values change at runtime. This
> > > > commit thus removed the ordering of the idle states according to their
> > > > target_residency/exit_latency. Adding Len and Arjan to the CC.
> > >
> > > the ARM folks wanted a dynamic exit latency, so.... it makes much more
> > > sense
> > > to me to store the thing you want to use (exit latency) than the number
> > > of the state.
> > >
> > > more than that, you can order either by target residency OR by exit
> > > latency,
> > > if you sort by one, there is no guarantee that you're also sorted by the
> > > other
> >
> > IMO, it would be preferable to store the index for the moment as we are
> > integrating cpuidle with the scheduler. The index allows to access more
> > informations. Then when
> > everything is fully integrated we can improve the result, no ?
> 
> more information, yes. but if the information isn't actually accurate (because
> it keeps changing
> in the datastructure away from what it was for the cpu)... are you really
> achieving what you want?

Right now (on ARM at least but I imagine this is pretty universal), the 
biggest impact on information accuracy for a CPU depends on what the 
other CPUs are doing.  The most obvious example is cluster power down.  
For a cluster to be powered down, all the CPUs sharing this cluster must 
also be powered down.  And all those CPUs must have agreed to a possible 
cluster power down in advance as well.  But it is not because an idle 
CPU has agreed to the extra latency imposed by a cluster power down that 
the cluster has actually powered down since another CPU in that cluster 
might still be running, in which case the recorded latency information 
for that idle CPU would be higher than it would be in practice at that 
moment.

A cluster should map naturally to a scheduling domain.  If we need to 
wake up a CPU, it is quite obvious that we should prefer an idle CPU 
from a scheduling domain whose load is not zero.  If the load is not 
zero then this means that any idle CPU in that domain, even if it 
indicated it was ready for a cluster power down, will not require the 
cluster power-up latency as some other CPUs must still be running.  But 
we already know that of course, even if the recorded latency might not 
say so.

In other words, the hardware latency information is dynamic of course.  
But we might not _need_ to have it reflected at the scheduler domain all 
the time, as in this case it can be inferred from the scheduling domain 
load.

Within a scheduling domain it is OK to pick the best idle CPU by 
looking at the index, as it is best to leave those CPUs ready for a 
cluster power down set to that state and prefer one which is not.  And a 
scheduling domain with a load of zero should be left alone if idle CPUs 
are found in another domain whose load is not zero, irrespective of 
absolute latency information.  So all the existing heuristics already in 
place to optimize cache utilization and so on will make things just work 
for idle as well.

All this to say that it is not justified at the moment to worry about 
how to convey the full details to the scheduler and the complexity that 
goes with it, since in practice we might be able to achieve our goal 
just as well using simpler hints like some arbitrary index.  Once this 
is in place, we could then look at the actual benefits of having more 
detailed information and weigh that against the complexity that comes 
with it.
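
Concretely, the kind of selection I have in mind looks like this sketch 
(group_is_idle() is a hypothetical stand-in for "this group's load is 
zero"):

/* Sketch: prefer an idle cpu from a group that is not fully idle,
 * and within it the one with the shallowest idle state.
 */
static int pick_wakeup_cpu(struct sched_domain *sd)
{
	struct sched_group *sg = sd->groups;
	int cpu, best_cpu = -1, best_index = INT_MAX;

	do {
		/* leave fully idle groups (cluster power down candidates) alone */
		if (!group_is_idle(sg)) {
			for_each_cpu(cpu, sched_group_cpus(sg)) {
				int index = cpu_rq(cpu)->idle_index;

				if (index >= 0 && index < best_index) {
					best_index = index;
					best_cpu = cpu;
				}
			}
		}
		sg = sg->next;
	} while (sg != sd->groups);

	return best_cpu;
}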


Nicolas
Nicolas Pitre Feb. 1, 2014, 3:31 p.m. UTC | #14
On Sat, 1 Feb 2014, Brown, Len wrote:

> > Right now (on ARM at least but I imagine this is pretty universal), the
> > biggest impact on information accuracy for a CPU depends on what the
> > other CPUs are doing.  The most obvious example is cluster power down.
> > For a cluster to be powered down, all the CPUs sharing this cluster must
> > also be powered down.  And all those CPUs must have agreed to a possible
> > cluster power down in advance as well.  But it is not because an idle
> > CPU has agreed to the extra latency imposed by a cluster power down that
> > the cluster has actually powered down since another CPU in that cluster
> > might still be running, in which case the recorded latency information
> > for that idle CPU would be higher than it would be in practice at that
> > moment.
> 
> That will not work.

What will not work?

> When a CPU goes idle, it uses the CURRENT criteria for entering that state.
> If the criteria change after it has entered the state, are you going
> to wake it up so it can re-evaluate?  No.

That's not what I'm saying at all.

> That is why the state must describe the worst case latency
> that CPU may see when waking from the state on THAT entry.

No disagreement there.  Isn't that what I'm saying?

> That is why we use the package C-state numbers to describe
> core C-states on IA.

And your point is?


Nicolas
Lorenzo Pieralisi Feb. 1, 2014, 3:40 p.m. UTC | #15
On Sat, Feb 01, 2014 at 06:00:40AM +0000, Brown, Len wrote:
> > Right now (on ARM at least but I imagine this is pretty universal), the
> > biggest impact on information accuracy for a CPU depends on what the
> > other CPUs are doing.  The most obvious example is cluster power down.
> > For a cluster to be powered down, all the CPUs sharing this cluster must
> > also be powered down.  And all those CPUs must have agreed to a possible
> > cluster power down in advance as well.  But it is not because an idle
> > CPU has agreed to the extra latency imposed by a cluster power down that
> > the cluster has actually powered down since another CPU in that cluster
> > might still be running, in which case the recorded latency information
> > for that idle CPU would be higher than it would be in practice at that
> > moment.
> 
> That will not work.
> 
> When a CPU goes idle, it uses the CURRENT criteria for entering that state.
> If the criteria change after it has entered the state, are you going
> to wake it up so it can re-evaluate?  No.
> 
> That is why the state must describe the worst case latency
> that CPU may see when waking from the state on THAT entry.
> 
> That is why we use the package C-state numbers to describe
> core C-states on IA.

That's what we do on ARM too for cluster states. But the state decision
might turn out suboptimal in this case too for the same reasons you have
just mentioned.

There are some use cases where it matters (and where monitoring the
timers on all CPUs in a cluster shows that aborting cluster shutdown is
required because some CPUs have a pending timer and the governor decision is
stale), and there are some use cases where it does not matter at all.

We talked about this at LPC, and I guess x86 FW/HW plays a role in
package states demotion too; we can do it in FW on ARM.

Overall we all know that whatever we do, it is impossible to know the
precise C-state a CPU is in, even if we resort to HW probing, it is just
a matter of deciding what level of abstraction is necessary for the
scheduler to work properly.

Thanks for bringing this topic up.

Lorenzo

Nicolas Pitre Feb. 1, 2014, 8:13 p.m. UTC | #16
On Sat, 1 Feb 2014, Brown, Len wrote:

> > And your point is?
> 
> It is a bad idea for an individual CPU to track the C-state
> of another CPU, which can change the cycle after it was checked.

Absolutely.  And I'm far from advocating we do this either.

> We know it is a bad idea because we used to do it,
> until we realized code here can easily impact the
> performance critical path.
> 
> In general, it is the OS's job to communicate constraints
> to the HW, and the HW to act on them.  Not all HW is smart,
> so sometimes the OS has to do more hand-holding -- but
> less is more.

Less is more indeed.  I'm certainly a big fan of that principle.

Just so you understand more about the context we need to care for on 
ARM, I'd invite you to have a look at 
Documentation/arm/cluster-pm-race-avoidance.txt.

I'm advocating we do _not_ track everything at the scheduler domain 
simply because some cluster states are possible only if all the CPUs in 
a cluster are idle, and that idleness is already tracked by the 
scheduler at the scheduling domain level.  So the information we don't 
update can already be inferred indirectly and cheaply with the 
information in place today.


Nicolas
Morten Rasmussen Feb. 3, 2014, 12:54 p.m. UTC | #17
On Fri, Jan 31, 2014 at 06:19:26PM +0000, Nicolas Pitre wrote:
> Right now (on ARM at least but I imagine this is pretty universal), the 
> biggest impact on information accuracy for a CPU depends on what the 
> other CPUs are doing.  The most obvious example is cluster power down.  
> For a cluster to be powered down, all the CPUs sharing this cluster must 
> also be powered down.  And all those CPUs must have agreed to a possible 
> cluster power down in advance as well.  But it is not because an idle 
> CPU has agreed to the extra latency imposed by a cluster power down that 
> the cluster has actually powered down since another CPU in that cluster 
> might still be running, in which case the recorded latency information 
> for that idle CPU would be higher than it would be in practice at that 
> moment.
> 
> A cluster should map naturally to a scheduling domain.  If we need to 
> wake up a CPU, it is quite obvious that we should prefer an idle CPU 
> from a scheduling domain which load is not zero.  If the load is not 
> zero then this means that any idle CPU in that domain, even if it 
> indicated it was ready for a cluster power down, will not require the 
> cluster power-up latency as some other CPUs must still be running.  But 
> we already know that of course even if the recorded latency might not 
> say so.
> 
> In other words, the hardware latency information is dynamic of course.  
> But we might not _need_ to have it reflected at the scheduler domain all 
> the time as in this case it can be inferred by the scheduling domain 
> load.

I agree that the existing sched domain hierarchy should be used to
represent the power topology. But it is not clear to me how much we can
say about the C-state of a cpu without checking the load of the entire
cluster every time.

We would need to know which C-states (indexes) are per-cpu and which are
per-cluster, and ignore the cluster states when the cluster load is
non-zero.

Current sched domain load is not maintained in the scheduler; it is only
produced when needed. But I guess you could derive the necessary
information from the idle cpu masks.

> 
> Within a scheduling domain it is OK to pick up the best idle CPU by 
> looking at the index as it is best to leave those CPUs ready for a 
> cluster power down set to that state and prefer one which is not.  And a 
> scheduling domain with a load of zero should be left alone if idle CPUs 
> are found in another domain which load is not zero, irrespective of 
> absolute latency information. So all the existing heuristics already in 
> place to optimize cache utilization and so on will make things just work 
> for idle as well.

IIUC, you propose to only use the index when picking an idle cpu inside
an already busy sched domain and leave idle sched domains alone if
possible. It may work for homogeneous SMP systems, but I don't think it
will work for heterogeneous systems like big.LITTLE.

If the little cluster has zero load and the big has stuff running, it
doesn't mean that it is a good idea to wake up another big cpu. It may
be more power efficient to wake up the little cluster. Comparing idle
state index of a big and little cpu won't help us in making that choice
as the clusters may have different idle states and the costs associated
with each state are different.

I'm therefore not convinced that idle state index is the right thing to
give the scheduler. Using a cost metric would be better in my
opinion.

Morten

Nicolas Pitre Feb. 3, 2014, 2:58 p.m. UTC | #18
On Mon, 3 Feb 2014, Morten Rasmussen wrote:

> On Fri, Jan 31, 2014 at 06:19:26PM +0000, Nicolas Pitre wrote:
> > A cluster should map naturally to a scheduling domain.  If we need to 
> > wake up a CPU, it is quite obvious that we should prefer an idle CPU 
> > from a scheduling domain which load is not zero.  If the load is not 
> > zero then this means that any idle CPU in that domain, even if it 
> > indicated it was ready for a cluster power down, will not require the 
> > cluster power-up latency as some other CPUs must still be running.  But 
> > we already know that of course even if the recorded latency might not 
> > say so.
> > 
> > In other words, the hardware latency information is dynamic of course.  
> > But we might not _need_ to have it reflected at the scheduler domain all 
> > the time as in this case it can be inferred by the scheduling domain 
> > load.
> 
> I agree that the existing sched domain hierarchy should be used to
> represent the power topology. But, it is not clear to me how much we can say
> about the C-state of cpu without checking the load of the entire cluster
> every time?
> 
> We would need to know which C-states (index) that are per cpu and per
> cluster and ignore the cluster states when the cluster load is non-zero.

In any case i.e. whether the cluster load is zero or not, we want to 
select the CPU to wake up with the shallowest C-state.  That should 
correspond to the actual cluster C-state already without having to track 
it explicitly.

> Current sched domain load is not maintained in the scheduler, it is only
> produced when needed. But I guess you could derive the necessary
> information from the idle cpu masks.

Even better.

> > Within a scheduling domain it is OK to pick up the best idle CPU by 
> > looking at the index as it is best to leave those CPUs ready for a 
> > cluster power down set to that state and prefer one which is not.  And a 
> > scheduling domain with a load of zero should be left alone if idle CPUs 
> > are found in another domain which load is not zero, irrespective of 
> > absolute latency information. So all the existing heuristics already in 
> > place to optimize cache utilization and so on will make things just work 
> > for idle as well.
> 
> IIUC, you propose to only use the index when picking an idle cpu inside
> an already busy sched domain and leave idle sched domains alone if
> possible. It may work for homogeneous SMP systems, but I don't think it
> will work for heterogeneous systems like big.LITTLE.

Hence the caveat "everything else being equal" I said previously.

> If the little cluster has zero load and the big has stuff running, it
> doesn't mean that it is a good idea to wake up another big cpu. It may
> be more power efficient to wake up the little cluster. Comparing idle
> state index of a big and little cpu won't help us in making that choice
> as the clusters may have different idle states and the costs associated
> with each state are different.

Agreed.  But let's evolve this in manageable steps.

> I'm therefore not convinced that idle state index is the right thing to
> give the scheduler. Using a cost metric would be better in my
> opinion.

It won't be difficult to move from the idle state index to some other 
cost metric once we've proven that the simple index has benefits on 
homogeneous systems.


Nicolas
Lorenzo Pieralisi Feb. 12, 2014, 3:16 p.m. UTC | #19
On Mon, Feb 03, 2014 at 04:17:47PM +0000, Arjan van de Ven wrote:

[...]

> >> 1) A latency driven one
> >> 2) A performance impact on
> >>
> >> first one is pretty much the exit latency related time, sort of a
> >> "expected time to first instruction" (currently menuidle has the
> >> 99.999% worst case number, which is not useful for this, but is a
> >> first approximation). This is obviously the dominating number for
> >> expected-short running tasks
> >>
> >> second on is more of a "is there any cache/TLB left or is it flushed"
> >> kind of metric. It's more tricky to compute, since what is the cost of
> >> an empty cache (or even a cache migration) after all....  .... but I
> >> suspect it's in part what the scheduler will care about more for
> >> expected-long  running tasks.
> >
> > Yeah, so currently we 'assume' cache hotness based on runtime; see
> > task_hot(). A hint that the CPU wiped its caches might help there.
> 
> if there's a simple api like
> 
> sched_cpu_cache_wiped(int llc)
> 
> that would be very nice for this; the menuidle side knows this
> for some cases and thus can just call it. This would be a very
> small and minimal change

What do you mean by "menuidle side knows this for some cases"?
You mean you know that some C-state entries imply llc clean/invalidate?

Thanks,
Lorenzo

Lorenzo Pieralisi Feb. 12, 2014, 5:37 p.m. UTC | #20
On Wed, Feb 12, 2014 at 04:14:38PM +0000, Arjan van de Ven wrote:
> 
> >> sched_cpu_cache_wiped(int llc)
> >>
> >> that would be very nice for this; the menuidle side knows this
> >> for some cases and thus can just call it. This would be a very
> >> small and minimal change
> >
> > What do you mean by "menuidle side knows this for some cases" ?
> > You mean you know that some C-state entries imply llc clean/invalidate ?
> 
> in the architectural idle code we can know if the llc got flushed
> there's also the per core flags where we know with reasonable certainty
> that the per core caches got flushed.

Ok, but that's arch-specific, not something we can detect from the menu
governor in generic code; that's what I wanted to ask, because it was not
clear.

Thanks,
Lorenzo

Nicolas Pitre Feb. 12, 2014, 7:05 p.m. UTC | #21
On Wed, 12 Feb 2014, Lorenzo Pieralisi wrote:

> On Wed, Feb 12, 2014 at 04:14:38PM +0000, Arjan van de Ven wrote:
> > 
> > >> sched_cpu_cache_wiped(int llc)
> > >>
> > >> that would be very nice for this; the menuidle side knows this
> > >> for some cases and thus can just call it. This would be a very
> > >> small and minimal change
> > >
> > > What do you mean by "menuidle side knows this for some cases" ?
> > > You mean you know that some C-state entries imply llc clean/invalidate ?
> > 
> > in the architectural idle code we can know if the llc got flushed
> > there's also the per core flags where we know with reasonable certainty
> > that the per core caches got flushed.
> 
> Ok, but that's arch specific, not something we can detect from the menu
> governor in generic code , that's what I wanted to ask because it was not
> clear.

I would think this is meant to be called from architecture specific 
backend code.
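
Purely as an illustration of Arjan's proposal (the hook name comes from 
his suggestion above; the rq fields are hypothetical):

/* Hypothetical hook, called from architecture specific backend code
 * when it knows the caches did not survive the idle state.
 */
void sched_cpu_cache_wiped(int llc)
{
	struct rq *rq = this_rq();

	/* record when the caches were last known to be wiped, so that
	 * task_hot() could discount cache hotness accordingly
	 */
	if (llc)
		rq->llc_wiped_stamp = rq->clock;	/* hypothetical field */
	else
		rq->l1_wiped_stamp = rq->clock;		/* hypothetical field */
}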


Nicolas
Patch

diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 3e85d38..eeb9f60 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -12,6 +12,8 @@ 
 
 #include <trace/events/power.h>
 
+#include "sched.h"
+
 static int __read_mostly cpu_idle_force_poll;
 
 void cpu_idle_poll_ctrl(bool enable)
@@ -73,6 +75,7 @@  static int cpuidle_idle_call(void)
 {
 	struct cpuidle_device *dev = __this_cpu_read(cpuidle_devices);
 	struct cpuidle_driver *drv = cpuidle_get_cpu_driver(dev);
+	struct rq *rq = this_rq();
 	int next_state, entered_state;
 
 	/* ask the governor for the next state */
@@ -80,6 +83,9 @@  static int cpuidle_idle_call(void)
 	if (next_state < 0)
 		return next_state;
 
+	/* let know the scheduler in which idle state the cpu will be */
+	rq->idle_index = next_state;
+
 	if (need_resched()) {
 		dev->last_residency = 0;
 		/* give the governor an opportunity to reflect on the outcome */
@@ -97,6 +103,9 @@  static int cpuidle_idle_call(void)
 	/* give the governor an opportunity to reflect on the outcome */
 	cpuidle_reflect(dev, entered_state);
 
+	/* we exited the idle state, let the scheduler know */
+	rq->idle_index = -1;
+
 	return 0;
 }
 #else
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 90aef084..130debf 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -654,6 +654,9 @@  struct rq {
 #endif
 
 	struct sched_avg avg;
+#ifdef CONFIG_CPU_IDLE
+	int idle_index;
+#endif
 };
 
 static inline int cpu_of(struct rq *rq)