[V3,2/6] sched: idle: cpuidle: Check the latency req before idle

Message ID 1415370687-18688-3-git-send-email-daniel.lezcano@linaro.org
State New
Headers show

Commit Message

Daniel Lezcano Nov. 7, 2014, 2:31 p.m.
When the pmqos latency requirement is set to zero that means "poll in all the
cases".

That is correctly implemented on x86 but not on the other archs.

As how is written the code, if the latency request is zero, the governor will
return zero, so corresponding, for x86, to the poll function, but for the
others arch the default idle function. For example, on ARM this is wait-for-
interrupt with a latency of '1', so violating the constraint.

In order to fix that, do the latency requirement check *before* calling the
cpuidle framework in order to jump to the poll function without entering
cpuidle. That has several benefits:

 1. It clarifies and unifies the code
 2. It fixes x86 vs other archs behavior
 3. Factors out the call to the same function
 4. Prevent to enter the cpuidle framework with its expensive cost in
    calculation

As the latency_req is needed in all the cases, change the select API to take
the latency_req as parameter in case it is not equal to zero.

As a positive side effect, it introduces the latency constraint specified
externally, so one more step to the cpuidle/scheduler integration.

Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Acked-by: Nicolas Pitre <nico@linaro.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Len Brown <len.brown@intel.com>
---
 drivers/cpuidle/cpuidle.c          |  6 ++++--
 drivers/cpuidle/governors/ladder.c |  9 +--------
 drivers/cpuidle/governors/menu.c   |  8 ++------
 include/linux/cpuidle.h            |  7 ++++---
 kernel/sched/idle.c                | 18 ++++++++++++++----
 5 files changed, 25 insertions(+), 23 deletions(-)

Comments

Preeti U Murthy Nov. 8, 2014, 10:40 a.m. | #1
On 11/07/2014 08:01 PM, Daniel Lezcano wrote:
> When the pmqos latency requirement is set to zero that means "poll in all the
> cases".
> 
> That is correctly implemented on x86 but not on the other archs.
> 
> As how is written the code, if the latency request is zero, the governor will
> return zero, so corresponding, for x86, to the poll function, but for the
> others arch the default idle function. For example, on ARM this is wait-for-
> interrupt with a latency of '1', so violating the constraint.
> 
> In order to fix that, do the latency requirement check *before* calling the
> cpuidle framework in order to jump to the poll function without entering
> cpuidle. That has several benefits:
> 
>  1. It clarifies and unifies the code
>  2. It fixes x86 vs other archs behavior
>  3. Factors out the call to the same function
>  4. Prevent to enter the cpuidle framework with its expensive cost in
>     calculation
> 
> As the latency_req is needed in all the cases, change the select API to take
> the latency_req as parameter in case it is not equal to zero.
> 
> As a positive side effect, it introduces the latency constraint specified
> externally, so one more step to the cpuidle/scheduler integration.
> 
> Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
> Acked-by: Nicolas Pitre <nico@linaro.org>
> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Reviewed-by: Len Brown <len.brown@intel.com>
> ---

Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>

Regards
Preeti U Murthy
Peter Zijlstra Nov. 10, 2014, 12:41 p.m. | #2
On Fri, Nov 07, 2014 at 03:31:23PM +0100, Daniel Lezcano wrote:
> @@ -216,19 +219,26 @@ static void cpu_idle_loop(void)
>  			local_irq_disable();
>  			arch_cpu_idle_enter();
>  
> +			latency_req = pm_qos_request(PM_QOS_CPU_DMA_LATENCY);
> +
>  			/*
>  			 * In poll mode we reenable interrupts and spin.
>  			 *
> +			 * If the latency req is zero, we don't want to
> +			 * enter any idle state and we jump to the poll
> +			 * function directly
> +			 *
>  			 * Also if we detected in the wakeup from idle
>  			 * path that the tick broadcast device expired
>  			 * for us, we don't want to go deep idle as we
>  			 * know that the IPI is going to arrive right
>  			 * away
>  			 */
> -			if (cpu_idle_force_poll || tick_check_broadcast_expired())
> +			if (!latency_req || cpu_idle_force_poll ||
> +			    tick_check_broadcast_expired())
>  				cpu_idle_poll();

Is this why you wanted that weak poll function?

Should we not instead allow an arch to deal with !latency_req and only
fall back to this polling if there is no actual way for it to implement
this better?
Daniel Lezcano Nov. 10, 2014, 3:12 p.m. | #3
On 11/10/2014 01:41 PM, Peter Zijlstra wrote:
> On Fri, Nov 07, 2014 at 03:31:23PM +0100, Daniel Lezcano wrote:
>> @@ -216,19 +219,26 @@ static void cpu_idle_loop(void)
>>   			local_irq_disable();
>>   			arch_cpu_idle_enter();
>>
>> +			latency_req = pm_qos_request(PM_QOS_CPU_DMA_LATENCY);
>> +
>>   			/*
>>   			 * In poll mode we reenable interrupts and spin.
>>   			 *
>> +			 * If the latency req is zero, we don't want to
>> +			 * enter any idle state and we jump to the poll
>> +			 * function directly
>> +			 *
>>   			 * Also if we detected in the wakeup from idle
>>   			 * path that the tick broadcast device expired
>>   			 * for us, we don't want to go deep idle as we
>>   			 * know that the IPI is going to arrive right
>>   			 * away
>>   			 */
>> -			if (cpu_idle_force_poll || tick_check_broadcast_expired())
>> +			if (!latency_req || cpu_idle_force_poll ||
>> +			    tick_check_broadcast_expired())
>>   				cpu_idle_poll();
>
> Is this why you wanted that weak poll function?

Not only there, it will be added in the next patchset in the 
cpu_idle_call function.

> Should we not instead allow an arch to deal with !latency_req and only
> fall back to this polling if there is no actual way for it to implement
> this better?

All this is to remove the poll idle state from the x86 cpuidle driver in 
order to remove the CPUIDLE_DRIVER_STATE_START (this one forces to write 
always ugly code in the cpuidle framework).

This poll state introduces the CPUIDLE_DRIVER_STATE_START macro. If you 
look at the different governors and the code, you can checkout what kind 
of tricks this macro introduces and how that makes the code ugly.

For the sake of what ? Just a small optimization in the menu governor.

I suppose that has been introduce and then evolved on a wrong basis. So 
now we have to deal with that.

This patchset provides a first round of cleanup around the poll 
function, the next patchset will move the 5us timer optimization from 
the menu governor and the last patchset will remove the 
CPUIDLE_DRIVER_STATE_START ugly macro.
Peter Zijlstra Nov. 10, 2014, 3:28 p.m. | #4
On Mon, Nov 10, 2014 at 04:12:47PM +0100, Daniel Lezcano wrote:
> All this is to remove the poll idle state from the x86 cpuidle driver in
> order to remove the CPUIDLE_DRIVER_STATE_START (this one forces to write
> always ugly code in the cpuidle framework).
> 
> This poll state introduces the CPUIDLE_DRIVER_STATE_START macro. If you look
> at the different governors and the code, you can checkout what kind of
> tricks this macro introduces and how that makes the code ugly.
> 
> For the sake of what ? Just a small optimization in the menu governor.
> 
> I suppose that has been introduce and then evolved on a wrong basis. So now
> we have to deal with that.
> 
> This patchset provides a first round of cleanup around the poll function,
> the next patchset will move the 5us timer optimization from the menu
> governor and the last patchset will remove the CPUIDLE_DRIVER_STATE_START
> ugly macro.

I don't get it, I've clearly not stared at it long enough, but why is
that STATE_START crap needed anywhere?

To me it appears 'natural' to have a latency_req==0 state, why does it
need so much special casing?
Daniel Lezcano Nov. 10, 2014, 3:58 p.m. | #5
On 11/10/2014 04:28 PM, Peter Zijlstra wrote:
> On Mon, Nov 10, 2014 at 04:12:47PM +0100, Daniel Lezcano wrote:
>> All this is to remove the poll idle state from the x86 cpuidle driver in
>> order to remove the CPUIDLE_DRIVER_STATE_START (this one forces to write
>> always ugly code in the cpuidle framework).
>>
>> This poll state introduces the CPUIDLE_DRIVER_STATE_START macro. If you look
>> at the different governors and the code, you can checkout what kind of
>> tricks this macro introduces and how that makes the code ugly.
>>
>> For the sake of what ? Just a small optimization in the menu governor.
>>
>> I suppose that has been introduce and then evolved on a wrong basis. So now
>> we have to deal with that.
>>
>> This patchset provides a first round of cleanup around the poll function,
>> the next patchset will move the 5us timer optimization from the menu
>> governor and the last patchset will remove the CPUIDLE_DRIVER_STATE_START
>> ugly macro.
>
> I don't get it, I've clearly not stared at it long enough, but why is
> that STATE_START crap needed anywhere?

Excellent question :)

On x86, the config option ARCH_HAS_CPU_RELAX is set (x86 is the only 
one). That leads to the macro CPUIDLE_DRIVER_STATE_START equal 1.

https://git.kernel.org/cgit/linux/kernel/git/rafael/linux-pm.git/tree/include/linux/cpuidle.h#n221

Then the acpi cpuidle driver and the intel_driver begin to fill the idle 
state at index == CPUIDLE_DRIVER_STATE_START, so leaving the 0th idle 
state empty.

https://git.kernel.org/cgit/linux/kernel/git/rafael/linux-pm.git/tree/drivers/idle/intel_idle.c#n848

https://git.kernel.org/cgit/linux/kernel/git/rafael/linux-pm.git/tree/drivers/acpi/processor_idle.c#n953

Then when the driver is registered and if ARCH_HAS_CPU_RELAX is set, the 
cpuidle framework insert the 0th with the poll state (reminder : only 
for x86).

https://git.kernel.org/cgit/linux/kernel/git/rafael/linux-pm.git/tree/drivers/cpuidle/driver.c#n195

If you look at the ladder governor (which I believe nobody is still 
using it), or at the menu governor, all the indexes begin at 
CPUIDLE_DRIVER_STATE_START, so all the code is preventing to use the 0th 
state ... :)

So why is needed the poll state ?

1. When the latency_req is 0 (it returns 0, so the poll state)

2. When the select's menu governor fails to find a state *and* if the 
next timer is before 5us

https://git.kernel.org/cgit/linux/kernel/git/rafael/linux-pm.git/tree/drivers/cpuidle/driver.c#n195

And when we investigate the same code but on the other archs, the 
CPUIDLE_DRIVER_STATE_START dance makes things slightly different.

So the conclusion is, we are inserting a state in the idle state array 
but we do everything to prevent to use it :)

For me it appears logical to just kill this state from the x86 idle 
drivers and move it in the idle_mainloop in case an idle state selection 
fails.



> To me it appears 'natural' to have a latency_req==0 state, why does it
> need so much special casing?
Peter Zijlstra Nov. 10, 2014, 4:15 p.m. | #6
On Mon, Nov 10, 2014 at 04:58:29PM +0100, Daniel Lezcano wrote:
> >I don't get it, I've clearly not stared at it long enough, but why is
> >that STATE_START crap needed anywhere?
> 
> Excellent question :)
> 
> On x86, the config option ARCH_HAS_CPU_RELAX is set (x86 is the only one).
> That leads to the macro CPUIDLE_DRIVER_STATE_START equal 1.
> 
> https://git.kernel.org/cgit/linux/kernel/git/rafael/linux-pm.git/tree/include/linux/cpuidle.h#n221
> 
> Then the acpi cpuidle driver and the intel_driver begin to fill the idle
> state at index == CPUIDLE_DRIVER_STATE_START, so leaving the 0th idle state
> empty.
> 
> https://git.kernel.org/cgit/linux/kernel/git/rafael/linux-pm.git/tree/drivers/idle/intel_idle.c#n848
> 
> https://git.kernel.org/cgit/linux/kernel/git/rafael/linux-pm.git/tree/drivers/acpi/processor_idle.c#n953
> 
> Then when the driver is registered and if ARCH_HAS_CPU_RELAX is set, the
> cpuidle framework insert the 0th with the poll state (reminder : only for
> x86).

Appears to me part of the problem is right there, the intel_idle and
proessor_idle should register the poll state themselves. That
immediately makes this weirdness go away.

Registering states from two places is not something that's sane or
desired I think.

> https://git.kernel.org/cgit/linux/kernel/git/rafael/linux-pm.git/tree/drivers/cpuidle/driver.c#n195
> 
> If you look at the ladder governor (which I believe nobody is still using
> it), or at the menu governor, all the indexes begin at
> CPUIDLE_DRIVER_STATE_START, so all the code is preventing to use the 0th
> state ... :)
> 
> So why is needed the poll state ?
> 
> 1. When the latency_req is 0 (it returns 0, so the poll state)

Right, that makes sense.

> 2. When the select's menu governor fails to find a state *and* if the next
> timer is before 5us

That seems rather arbitrary. Why would it fail to find a state?

> https://git.kernel.org/cgit/linux/kernel/git/rafael/linux-pm.git/tree/drivers/cpuidle/driver.c#n195
> 
> And when we investigate the same code but on the other archs, the
> CPUIDLE_DRIVER_STATE_START dance makes things slightly different.
> 
> So the conclusion is, we are inserting a state in the idle state array but
> we do everything to prevent to use it :)
> 
> For me it appears logical to just kill this state from the x86 idle drivers
> and move it in the idle_mainloop in case an idle state selection fails.

But why, ppc has a latency_req==0 state too, right?

I agree that we should shoot STATE_START in the head, but I feel we
should do it by fixing the state registration.

I really don't get why the governors should know about this though, its
just another state, they should iterate all states and pick the best,
given the power usage this state should really never be eligible unless
we're QoS forced or whatnot.
Daniel Lezcano Nov. 10, 2014, 5:19 p.m. | #7
On 11/10/2014 05:15 PM, Peter Zijlstra wrote:
> On Mon, Nov 10, 2014 at 04:58:29PM +0100, Daniel Lezcano wrote:
>>> I don't get it, I've clearly not stared at it long enough, but why is
>>> that STATE_START crap needed anywhere?
>>
>> Excellent question :)
>>
>> On x86, the config option ARCH_HAS_CPU_RELAX is set (x86 is the only one).
>> That leads to the macro CPUIDLE_DRIVER_STATE_START equal 1.
>>
>> https://git.kernel.org/cgit/linux/kernel/git/rafael/linux-pm.git/tree/include/linux/cpuidle.h#n221
>>
>> Then the acpi cpuidle driver and the intel_driver begin to fill the idle
>> state at index == CPUIDLE_DRIVER_STATE_START, so leaving the 0th idle state
>> empty.
>>
>> https://git.kernel.org/cgit/linux/kernel/git/rafael/linux-pm.git/tree/drivers/idle/intel_idle.c#n848
>>
>> https://git.kernel.org/cgit/linux/kernel/git/rafael/linux-pm.git/tree/drivers/acpi/processor_idle.c#n953
>>
>> Then when the driver is registered and if ARCH_HAS_CPU_RELAX is set, the
>> cpuidle framework insert the 0th with the poll state (reminder : only for
>> x86).
>
> Appears to me part of the problem is right there, the intel_idle and
> proessor_idle should register the poll state themselves. That
> immediately makes this weirdness go away.
>
> Registering states from two places is not something that's sane or
> desired I think.

I fully agree. I did a patchset in this direction but I realized the 
poll state most of the cases was not used.

>> https://git.kernel.org/cgit/linux/kernel/git/rafael/linux-pm.git/tree/drivers/cpuidle/driver.c#n195
>>
>> If you look at the ladder governor (which I believe nobody is still using
>> it), or at the menu governor, all the indexes begin at
>> CPUIDLE_DRIVER_STATE_START, so all the code is preventing to use the 0th
>> state ... :)
>>
>> So why is needed the poll state ?
>>
>> 1. When the latency_req is 0 (it returns 0, so the poll state)
>
> Right, that makes sense.
>
>> 2. When the select's menu governor fails to find a state *and* if the next
>> timer is before 5us
>
> That seems rather arbitrary.

Yeah, and I am wondering if this very particular optimization couldn't 
be just removed. Is there still a benefit ?

> Why would it fail to find a state?

To do short: it could fail to find a state fulfilling the constraints 
(some states could be disabled or the ).

>> https://git.kernel.org/cgit/linux/kernel/git/rafael/linux-pm.git/tree/drivers/cpuidle/driver.c#n195
>>
>> And when we investigate the same code but on the other archs, the
>> CPUIDLE_DRIVER_STATE_START dance makes things slightly different.
>>
>> So the conclusion is, we are inserting a state in the idle state array but
>> we do everything to prevent to use it :)
>>
>> For me it appears logical to just kill this state from the x86 idle drivers
>> and move it in the idle_mainloop in case an idle state selection fails.
>
> But why, ppc has a latency_req==0 state too, right?

Yes, this is why the arch_cpu_poll_idle, so ppc can override it with its 
optimized pooling loop (rep(); nop(); is x86).

> I agree that we should shoot STATE_START in the head, but I feel we
> should do it by fixing the state registration.

You can fix the state registration but that won't fix the STATE_START 
usage in the governors.

> I really don't get why the governors should know about this though, its
> just another state, they should iterate all states and pick the best,
> given the power usage this state should really never be eligible unless
> we're QoS forced or whatnot.

The governors just don't use the poll state at all, except for a couple 
of cases in menu.c defined above in the previous email. What is the 
rational of adding a state in the cpuidle driver and do everything we 
can to avoid using it ? From my POV, the poll state is a special state, 
we should remove from the driver's idle states like the arch_cpu_idle() 
is a specific idle state only used in idle.c (but which may overlap with 
an idle state in different archs eg. cpu_do_idle() and the 0th idle state).
Peter Zijlstra Nov. 10, 2014, 7:48 p.m. | #8
On Mon, Nov 10, 2014 at 06:19:02PM +0100, Daniel Lezcano wrote:
> >I really don't get why the governors should know about this though, its
> >just another state, they should iterate all states and pick the best,
> >given the power usage this state should really never be eligible unless
> >we're QoS forced or whatnot.
> 
> The governors just don't use the poll state at all, except for a couple of
> cases in menu.c defined above in the previous email. What is the rational of
> adding a state in the cpuidle driver and do everything we can to avoid using
> it ? From my POV, the poll state is a special state, we should remove from
> the driver's idle states like the arch_cpu_idle() is a specific idle state
> only used in idle.c (but which may overlap with an idle state in different
> archs eg. cpu_do_idle() and the 0th idle state).

So I disagree, I think poll-idle is an idle mode just like all the
others. It should be an available state to the governor and it should
treat it like any other.

I don't tihnk the whole ARCH_HAS_CPU_RELAX thing makes any kind of
sense, _every_ arch has some definition of it, the generic polling loop
is always a valid idle implementation.

What we can do is always populate the idle state table with it before
calling the regular drivers.

If the arch drivers have a 'better' latency_req==0 idle routine -- note
my argument on the ppc issue, I think its wrong -- it can replace the
existing one.

We should further remove all the special casing in the governors, its
always a valid state, but it should hardly ever be the most desirable
state.

I think the whole arch specific idle loop is a mistake, we already have
an (arch) interface into the idle routines, we don't need yet another.
Daniel Lezcano Nov. 10, 2014, 10:21 p.m. | #9
On 11/10/2014 08:48 PM, Peter Zijlstra wrote:
> On Mon, Nov 10, 2014 at 06:19:02PM +0100, Daniel Lezcano wrote:
>>> I really don't get why the governors should know about this though, its
>>> just another state, they should iterate all states and pick the best,
>>> given the power usage this state should really never be eligible unless
>>> we're QoS forced or whatnot.
>>
>> The governors just don't use the poll state at all, except for a couple of
>> cases in menu.c defined above in the previous email. What is the rational of
>> adding a state in the cpuidle driver and do everything we can to avoid using
>> it ? From my POV, the poll state is a special state, we should remove from
>> the driver's idle states like the arch_cpu_idle() is a specific idle state
>> only used in idle.c (but which may overlap with an idle state in different
>> archs eg. cpu_do_idle() and the 0th idle state).
>
> So I disagree, I think poll-idle is an idle mode just like all the
> others. It should be an available state to the governor and it should
> treat it like any other.

The governors are just ignoring it, except for a small timer 
optimization in menu.c (and I am still not convinced it is worth to have 
it). I don't see the point to add a state we don't want to use.

Eg. on my server it was called 2 times over 1313856 times.

> I don't tihnk the whole ARCH_HAS_CPU_RELAX thing makes any kind of
> sense, _every_ arch has some definition of it, the generic polling loop
> is always a valid idle implementation.
>
> What we can do is always populate the idle state table with it before
> calling the regular drivers.

I am not sure to understand. You want to add the poll idle loop in all 
the drivers ?

What about "safe_halt()" ? (arch_cpu_idle() for x86). It is also an idle 
state. Why not add it in the idle state table also ?

> If the arch drivers have a 'better' latency_req==0 idle routine -- note
> my argument on the ppc issue, I think its wrong -- it can replace the
> existing one.
>
> We should further remove all the special casing in the governors, its
> always a valid state, but it should hardly ever be the most desirable
> state.
>
> I think the whole arch specific idle loop is a mistake, we already have
> an (arch) interface into the idle routines, we don't need yet another.
Peter Zijlstra Nov. 11, 2014, 10:20 a.m. | #10
On Mon, Nov 10, 2014 at 11:21:19PM +0100, Daniel Lezcano wrote:
> On 11/10/2014 08:48 PM, Peter Zijlstra wrote:
> >On Mon, Nov 10, 2014 at 06:19:02PM +0100, Daniel Lezcano wrote:
> >>>I really don't get why the governors should know about this though, its
> >>>just another state, they should iterate all states and pick the best,
> >>>given the power usage this state should really never be eligible unless
> >>>we're QoS forced or whatnot.
> >>
> >>The governors just don't use the poll state at all, except for a couple of
> >>cases in menu.c defined above in the previous email. What is the rational of
> >>adding a state in the cpuidle driver and do everything we can to avoid using
> >>it ? From my POV, the poll state is a special state, we should remove from
> >>the driver's idle states like the arch_cpu_idle() is a specific idle state
> >>only used in idle.c (but which may overlap with an idle state in different
> >>archs eg. cpu_do_idle() and the 0th idle state).
> >
> >So I disagree, I think poll-idle is an idle mode just like all the
> >others. It should be an available state to the governor and it should
> >treat it like any other.
> 
> The governors are just ignoring it, except for a small timer optimization in
> menu.c (and I am still not convinced it is worth to have it). I don't see
> the point to add a state we don't want to use.

The ignoring it is _wrong_. Make that go away and you'll get rid of most
of the STATE_START crap.

The governors are the place where we combine the QoS constraints with
idle predictors and pick an idle state, polling is a valid state to
pick, and given QoS constraints it might be the only state to pick.

> Eg. on my server it was called 2 times over 1313856 times.
> 
> >I don't tihnk the whole ARCH_HAS_CPU_RELAX thing makes any kind of
> >sense, _every_ arch has some definition of it, the generic polling loop
> >is always a valid idle implementation.
> >
> >What we can do is always populate the idle state table with it before
> >calling the regular drivers.
> 
> I am not sure to understand. You want to add the poll idle loop in all the
> drivers ?
> 
> What about "safe_halt()" ? (arch_cpu_idle() for x86). It is also an idle
> state. Why not add it in the idle state table also ?

Because the latter is actually arch specific, whereas the idle polling
thing is not. You can _always_ poll idle, its generic, its valid, and
its guaranteed the most responsive method.

The arch drivers get to add arch specific idle states; if a x86 cpuidle
driver wants to add hlt they can.
Daniel Lezcano Nov. 12, 2014, 1:53 p.m. | #11
On 11/11/2014 11:20 AM, Peter Zijlstra wrote:
> On Mon, Nov 10, 2014 at 11:21:19PM +0100, Daniel Lezcano wrote:
>> On 11/10/2014 08:48 PM, Peter Zijlstra wrote:
>>> On Mon, Nov 10, 2014 at 06:19:02PM +0100, Daniel Lezcano wrote:
>>>>> I really don't get why the governors should know about this though, its
>>>>> just another state, they should iterate all states and pick the best,
>>>>> given the power usage this state should really never be eligible unless
>>>>> we're QoS forced or whatnot.
>>>>
>>>> The governors just don't use the poll state at all, except for a couple of
>>>> cases in menu.c defined above in the previous email. What is the rational of
>>>> adding a state in the cpuidle driver and do everything we can to avoid using
>>>> it ? From my POV, the poll state is a special state, we should remove from
>>>> the driver's idle states like the arch_cpu_idle() is a specific idle state
>>>> only used in idle.c (but which may overlap with an idle state in different
>>>> archs eg. cpu_do_idle() and the 0th idle state).
>>>
>>> So I disagree, I think poll-idle is an idle mode just like all the
>>> others. It should be an available state to the governor and it should
>>> treat it like any other.
>>
>> The governors are just ignoring it, except for a small timer optimization in
>> menu.c (and I am still not convinced it is worth to have it). I don't see
>> the point to add a state we don't want to use.
>
> The ignoring it is _wrong_. Make that go away and you'll get rid of most
> of the STATE_START crap.
>
> The governors are the place where we combine the QoS constraints with
> idle predictors and pick an idle state, polling is a valid state to
> pick, and given QoS constraints it might be the only state to pick.

Well, I understand your point of view but I still disagree. IMO, the 
poll loop shouldn't be considered as a valid idle state for different 
reasons:

0. thermal issue if wrongly selected from any of the governor

1. a polling loop does not have a valid time measurements even if the 
TIME_VALID flag has been added

2. entering the idle governors is not free, especially the menu 
governor, which is contradictory with zero latency req

3. what is the meaning of having a zero latency (target + exit) idle 
state ? We know it will always succeed if the other fails

4. IIUC, you are suggesting to add the poll loop for all the cpuidle 
drivers. This is a *lot* of changes, I am not afraid about the work to 
do but there is a significant code impact and the entire behavior of the 
cpuidle framework for all the arch will be changed.

So given the points above, why not have one poll function, generic, and 
if we fail to find an idle state or if the req is zero, then fallback to 
the poll loop ?


>> Eg. on my server it was called 2 times over 1313856 times.
>>
>>> I don't tihnk the whole ARCH_HAS_CPU_RELAX thing makes any kind of
>>> sense, _every_ arch has some definition of it, the generic polling loop
>>> is always a valid idle implementation.
>>>
>>> What we can do is always populate the idle state table with it before
>>> calling the regular drivers.
>>
>> I am not sure to understand. You want to add the poll idle loop in all the
>> drivers ?
>>
>> What about "safe_halt()" ? (arch_cpu_idle() for x86). It is also an idle
>> state. Why not add it in the idle state table also ?
>
> Because the latter is actually arch specific, whereas the idle polling
> thing is not.
> You can _always_ poll idle, its generic, its valid, and
> its guaranteed the most responsive method.

I agree with this point but this kind of loop is hardware optimized for 
x86. On the other arch, where today this loop is never used, if we 
change the behavior of the cpuidle framework and introduces this loop, 
it may be selected and stay on this state for a long time (resulting 
from a bad prediction), I am afraid that can lead to some thermal issues.

> The arch drivers get to add arch specific idle states; if a x86 cpuidle
> driver wants to add hlt they can.
Daniel Lezcano Nov. 12, 2014, 5:52 p.m. | #12
On 11/12/2014 04:02 PM, Peter Zijlstra wrote:
> On Wed, Nov 12, 2014 at 02:53:10PM +0100, Daniel Lezcano wrote:
>
>>>> The governors are just ignoring it, except for a small timer optimization in
>>>> menu.c (and I am still not convinced it is worth to have it). I don't see
>>>> the point to add a state we don't want to use.
>>>
>>> The ignoring it is _wrong_. Make that go away and you'll get rid of most
>>> of the STATE_START crap.
>>>
>>> The governors are the place where we combine the QoS constraints with
>>> idle predictors and pick an idle state, polling is a valid state to
>>> pick, and given QoS constraints it might be the only state to pick.
>>
>> Well, I understand your point of view but I still disagree. IMO, the poll
>> loop shouldn't be considered as a valid idle state for different reasons:
>>
>> 0. thermal issue if wrongly selected from any of the governor
>
> That seems like a QoS issue and should be fixed there, no?

It could be seen as a QoS issue if we stick the poll loop with the zero 
latency req. But as soon as we introduce the poll loop in the cpuidle 
framework, the issue could have multiple sources (see below at the end 
of the message).

>> 1. a polling loop does not have a valid time measurements even if the
>> TIME_VALID flag has been added
>
> Ah, right you are. It does not. We _could_ fix that, not sure its worth
> the hassle but see below.
>
>> 2. entering the idle governors is not free, especially the menu governor,
>> which is contradictory with zero latency req
>
> Well, you could add this 'fast' path to the cpuidle code (before calling
> into the actual governors) too. Also, since the governors actually use
> this state it makes sense for it to be available.

Actually the governors do not use this state or do whatever they can to 
prevent to use it. The ladder governor just ignore it, and the menu 
governor will default to C1 (not poll) if no state is found. It will use 
the poll loop only if a timer is about to expire within 5us and all this 
dance is around this micro optimization.

You are suggesting to add the fast in the cpuidle framework. IIUC we 
should have:

int cpuidle_select(drv, dev)
{
	...
	if (latency_req == 0)
		return ????;
}

The '????' suggests we duplicate everywhere an 0th idle state (more than 
17 drivers and that breaks the idle states DT binding.

>> 3. what is the meaning of having a zero latency (target + exit) idle state ?
>> We know it will always succeed if the other fails
>
> Not quite sure I follow; you seem to have answered your own question?

Yeah, right :)

>> 4. IIUC, you are suggesting to add the poll loop for all the cpuidle
>> drivers. This is a *lot* of changes, I am not afraid about the work to do
>> but there is a significant code impact and the entire behavior of the
>> cpuidle framework for all the arch will be changed.
>
> I'm not sure it would be a lot of work, add it in the common cpuidle
> code before calling the driver init?



>> So given the points above, why not have one poll function, generic, and if
>> we fail to find an idle state or if the req is zero, then fallback to the
>> poll loop ?
>
> I could agree but only if we keep the poll loop generic, we cannot go
> add yet more arch hooks there.

I understand. The arch hook just give an opportunity to specify an 
arch's specific poll loop, that does not imply it has to.

>>> Because the latter is actually arch specific, whereas the idle
>>> polling thing is not.  You can _always_ poll idle, its generic, its
>>> valid, and its guaranteed the most responsive method.
>>
>> I agree with this point but this kind of loop is hardware optimized for x86.
>
> well 'optimized' is a strong word there, the rep nop, or pause
> instruction isn't really much at all and is mostly a SMT hint afaik.
> ia64, sparc64 and ppc64 have similar magic and s390 and hexagon use a vm
> yield like construct.



>> On the other arch, where today this loop is never used, if we change the
>> behavior of the cpuidle framework and introduces this loop, it may be
>> selected and stay on this state for a long time (resulting from a bad
>> prediction), I am afraid that can lead to some thermal issues.
>
> Because of 1), right? Yes that's a problem.

Well, 1) won't help, but also for different reasons:

By removing the STATE_START macro and adding the poll as the 0th idle state:

ladder governor: it was never, ever, using the poll loop, just ignoring 
the 0th idle state by using the STATE_START macro to stop demoting. That 
won't be the case anymore, and we may fall in the poll loop

menu governor: it was almost never using the poll state. With the change 
it will select the state. If the prediction is bad and the poll idle 
loop is selected but the cpu stays idle longer (that could be several 
seconds). That can happens with a bad predictions or task migration when 
this one is doing a lot of IO.

On the ARM embedded systems where the poll loop is not used, except when 
specified in the kernel command line. That may lead to some thermal 
issues and big troubles.

I like your patch below that will help to re-evaluate the idle state but 
I am not sure it will solve the issue introduced with the generic idle 
loop as the 0th idle state.

Thanks Peter by taking the time to review this patch.

   -- Daniel

> ---
>   kernel/sched/idle.c | 6 +++++-
>   kernel/softirq.c    | 7 +++++++
>   2 files changed, 12 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index c47fce75e666..9c76ae92650f 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -42,12 +42,16 @@ static int __init cpu_idle_nopoll_setup(char *__unused)
>   __setup("hlt", cpu_idle_nopoll_setup);
>   #endif
>
> +DEFINE_PER_CPU(unsigned int, int_seq);
> +
>   static inline int cpu_idle_poll(void)
>   {
> +	unsigned int seq = this_cpu_read(int_seq);
> +
>   	rcu_idle_enter();
>   	trace_cpu_idle_rcuidle(0, smp_processor_id());
>   	local_irq_enable();
> -	while (!tif_need_resched())
> +	while (!tif_need_resched() && seq == this_cpu_read(int_seq))
>   		cpu_relax();
>   	trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, smp_processor_id());
>   	rcu_idle_exit();
> diff --git a/kernel/softirq.c b/kernel/softirq.c
> index 0699add19164..bc8d43545964 100644
> --- a/kernel/softirq.c
> +++ b/kernel/softirq.c
> @@ -370,6 +370,8 @@ static inline void tick_irq_exit(void)
>   #endif
>   }
>
> +DECLARE_PER_CPU(unsigned int, int_seq);
> +
>   /*
>    * Exit an interrupt context. Process softirqs if needed and possible:
>    */
> @@ -386,6 +388,11 @@ void irq_exit(void)
>   	if (!in_interrupt() && local_softirq_pending())
>   		invoke_softirq();
>
> +#ifdef TIG_POLLING_NRFLAG
> +	if (test_thread_flag(TIF_POLLING_NRFLAG))
> +#endif
> +		this_cpu_inc(int_seq);
> +
>   	tick_irq_exit();
>   	rcu_irq_exit();
>   	trace_hardirq_exit(); /* must be last! */
>

Patch

diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index 125150d..86f6cb8 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -155,10 +155,12 @@  int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv,
  *
  * @drv: the cpuidle driver
  * @dev: the cpuidle device
+ * @latency_req: the latency constraint when choosing an idle state
  *
  * Returns the index of the idle state.
  */
-int cpuidle_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
+int cpuidle_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
+		   int latency_req)
 {
 	if (off || !initialized)
 		return -ENODEV;
@@ -169,7 +171,7 @@  int cpuidle_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
 	if (unlikely(use_deepest_state))
 		return cpuidle_find_deepest_state(drv, dev);
 
-	return cpuidle_curr_governor->select(drv, dev);
+	return cpuidle_curr_governor->select(drv, dev, latency_req);
 }
 
 /**
diff --git a/drivers/cpuidle/governors/ladder.c b/drivers/cpuidle/governors/ladder.c
index 06b57c4..53113c2 100644
--- a/drivers/cpuidle/governors/ladder.c
+++ b/drivers/cpuidle/governors/ladder.c
@@ -64,18 +64,11 @@  static inline void ladder_do_selection(struct ladder_device *ldev,
  * @dev: the CPU
  */
 static int ladder_select_state(struct cpuidle_driver *drv,
-				struct cpuidle_device *dev)
+			       struct cpuidle_device *dev, int latency_req)
 {
 	struct ladder_device *ldev = this_cpu_ptr(&ladder_devices);
 	struct ladder_device_state *last_state;
 	int last_residency, last_idx = ldev->last_state_idx;
-	int latency_req = pm_qos_request(PM_QOS_CPU_DMA_LATENCY);
-
-	/* Special case when user has set very strict latency requirement */
-	if (unlikely(latency_req == 0)) {
-		ladder_do_selection(ldev, last_idx, 0);
-		return 0;
-	}
 
 	last_state = &ldev->states[last_idx];
 
diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
index 710a233..9b7c0b9 100644
--- a/drivers/cpuidle/governors/menu.c
+++ b/drivers/cpuidle/governors/menu.c
@@ -287,10 +287,10 @@  again:
  * @drv: cpuidle driver containing state data
  * @dev: the CPU
  */
-static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
+static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
+		       int latency_req)
 {
 	struct menu_device *data = this_cpu_ptr(&menu_devices);
-	int latency_req = pm_qos_request(PM_QOS_CPU_DMA_LATENCY);
 	int i;
 	unsigned int interactivity_req;
 	unsigned long nr_iowaiters, cpu_load;
@@ -302,10 +302,6 @@  static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
 
 	data->last_state_idx = CPUIDLE_DRIVER_STATE_START - 1;
 
-	/* Special case when user has set very strict latency requirement */
-	if (unlikely(latency_req == 0))
-		return 0;
-
 	/* determine the expected residency time, round up */
 	data->next_timer_us = ktime_to_us(tick_nohz_get_sleep_length());
 
diff --git a/include/linux/cpuidle.h b/include/linux/cpuidle.h
index 25e0df6..fb465c1 100644
--- a/include/linux/cpuidle.h
+++ b/include/linux/cpuidle.h
@@ -122,7 +122,7 @@  struct cpuidle_driver {
 extern void disable_cpuidle(void);
 
 extern int cpuidle_select(struct cpuidle_driver *drv,
-			  struct cpuidle_device *dev);
+			  struct cpuidle_device *dev, int latency_req);
 extern int cpuidle_enter(struct cpuidle_driver *drv,
 			 struct cpuidle_device *dev, int index);
 extern void cpuidle_reflect(struct cpuidle_device *dev, int index);
@@ -150,7 +150,7 @@  extern struct cpuidle_driver *cpuidle_get_cpu_driver(struct cpuidle_device *dev)
 #else
 static inline void disable_cpuidle(void) { }
 static inline int cpuidle_select(struct cpuidle_driver *drv,
-				 struct cpuidle_device *dev)
+				 struct cpuidle_device *dev, int latency_req)
 {return -ENODEV; }
 static inline int cpuidle_enter(struct cpuidle_driver *drv,
 				struct cpuidle_device *dev, int index)
@@ -205,7 +205,8 @@  struct cpuidle_governor {
 					struct cpuidle_device *dev);
 
 	int  (*select)		(struct cpuidle_driver *drv,
-					struct cpuidle_device *dev);
+				 struct cpuidle_device *dev,
+				 int latency_req);
 	void (*reflect)		(struct cpuidle_device *dev, int index);
 
 	struct module 		*owner;
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index ea65bbae..189e80a 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -5,6 +5,7 @@ 
 #include <linux/cpu.h>
 #include <linux/cpuidle.h>
 #include <linux/tick.h>
+#include <linux/pm_qos.h>
 #include <linux/mm.h>
 #include <linux/stackprotector.h>
 
@@ -79,7 +80,7 @@  static inline int cpu_idle_poll(void)
  * set, and it returns with polling set.  If it ever stops polling, it
  * must clear the polling bit.
  */
-static void cpuidle_idle_call(void)
+static void cpuidle_idle_call(unsigned int latency_req)
 {
 	struct cpuidle_device *dev = __this_cpu_read(cpuidle_devices);
 	struct cpuidle_driver *drv = cpuidle_get_cpu_driver(dev);
@@ -112,7 +113,7 @@  static void cpuidle_idle_call(void)
 	 * Ask the cpuidle framework to choose a convenient idle state.
 	 * Fall back to the default arch idle method on errors.
 	 */
-	next_state = cpuidle_select(drv, dev);
+	next_state = cpuidle_select(drv, dev, latency_req);
 	if (next_state < 0) {
 use_default:
 		/*
@@ -193,6 +194,8 @@  exit_idle:
  */
 static void cpu_idle_loop(void)
 {
+	unsigned int latency_req;
+
 	while (1) {
 		/*
 		 * If the arch has a polling bit, we maintain an invariant:
@@ -216,19 +219,26 @@  static void cpu_idle_loop(void)
 			local_irq_disable();
 			arch_cpu_idle_enter();
 
+			latency_req = pm_qos_request(PM_QOS_CPU_DMA_LATENCY);
+
 			/*
 			 * In poll mode we reenable interrupts and spin.
 			 *
+			 * If the latency req is zero, we don't want to
+			 * enter any idle state and we jump to the poll
+			 * function directly
+			 *
 			 * Also if we detected in the wakeup from idle
 			 * path that the tick broadcast device expired
 			 * for us, we don't want to go deep idle as we
 			 * know that the IPI is going to arrive right
 			 * away
 			 */
-			if (cpu_idle_force_poll || tick_check_broadcast_expired())
+			if (!latency_req || cpu_idle_force_poll ||
+			    tick_check_broadcast_expired())
 				cpu_idle_poll();
 			else
-				cpuidle_idle_call();
+				cpuidle_idle_call(latency_req);
 
 			arch_cpu_idle_exit();
 		}