[6/7] Cpuidle: Deal with timer expiring in the past

Message ID 1393223377-5744-7-git-send-email-tuukka.tikkanen@linaro.org
State New
Headers show

Commit Message

Tuukka Tikkanen Feb. 24, 2014, 6:29 a.m.
Sometimes (fairly often) when the cpuidle menu governor is making a decision
about idle state to enter the next timer for the cpu appears to expire in
the past. The menu governor expects the expiry to always be in the future
and in fact stores the time delta in an unsigned variable. However, when
the expiry is in the past, the value returned by tick_nohz_get_sleep_length
can be negative. This patch prevents using negative values, instead making
the governor return immediately similar to having latency requirement set
to 0.

Note: As with latency == 0, the return value is 0 with no check to see if
the state 0 has been disabled or not.

Signed-off-by: Tuukka Tikkanen <tuukka.tikkanen@linaro.org>
---
 drivers/cpuidle/governors/menu.c |   10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

Comments

Nicolas Pitre Feb. 24, 2014, 5:05 p.m. | #1
On Mon, 24 Feb 2014, Tuukka Tikkanen wrote:

> Sometimes (fairly often) when the cpuidle menu governor is making a decision
> about idle state to enter the next timer for the cpu appears to expire in
> the past. The menu governor expects the expiry to always be in the future
> and in fact stores the time delta in an unsigned variable. However, when
> the expiry is in the past, the value returned by tick_nohz_get_sleep_length
> can be negative. This patch prevents using negative values, instead making
> the governor return immediately similar to having latency requirement set
> to 0.
> 
> Note: As with latency == 0, the return value is 0 with no check to see if
> the state 0 has been disabled or not.

In your cover letter you mention some occurrences of the negative result 
being observed on x86.  That information is worth capturing in the 
commit log as well to distinguish between theoretical problems from 
actual observations.

> 
> Signed-off-by: Tuukka Tikkanen <tuukka.tikkanen@linaro.org>
> ---
>  drivers/cpuidle/governors/menu.c |   10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
> index 71b5232..c414468 100644
> --- a/drivers/cpuidle/governors/menu.c
> +++ b/drivers/cpuidle/governors/menu.c
> @@ -302,8 +302,16 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
>  	if (unlikely(latency_req == 0))
>  		return 0;
>  
> -	/* determine the expected residency time, round up */
> +	/*
> +	 * Determine the expected residency time. If the time is negative,
> +	 * a timer interrupt has probably just expired after disabling
> +	 * interrupts. Return as quickly as possible in the most shallow
> +	 * state possible. tv_nsec is always positive, so only check the
> +	 * seconds.
> +	 */
>  	t = ktime_to_timespec(tick_nohz_get_sleep_length());
> +	if (t.tv_sec < 0)
> +		return 0;
>  	data->next_timer_us =
>  		t.tv_sec * USEC_PER_SEC + t.tv_nsec / NSEC_PER_USEC;
>  
> -- 
> 1.7.9.5
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Len Brown March 6, 2014, 7:41 a.m. | #2
On Mon, Feb 24, 2014 at 1:29 AM, Tuukka Tikkanen
<tuukka.tikkanen@linaro.org> wrote:
> Sometimes (fairly often) when the cpuidle menu governor is making a decision
> about idle state to enter the next timer for the cpu appears to expire in
> the past. The menu governor expects the expiry to always be in the future
> and in fact stores the time delta in an unsigned variable. However, when
> the expiry is in the past, the value returned by tick_nohz_get_sleep_length
> can be negative. This patch prevents using negative values, instead making
> the governor return immediately similar to having latency requirement set
> to 0.
>
> Note: As with latency == 0, the return value is 0 with no check to see if
> the state 0 has been disabled or not.
>
> Signed-off-by: Tuukka Tikkanen <tuukka.tikkanen@linaro.org>
> ---
>  drivers/cpuidle/governors/menu.c |   10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
> index 71b5232..c414468 100644
> --- a/drivers/cpuidle/governors/menu.c
> +++ b/drivers/cpuidle/governors/menu.c
> @@ -302,8 +302,16 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
>         if (unlikely(latency_req == 0))
>                 return 0;
>
> -       /* determine the expected residency time, round up */
> +       /*
> +        * Determine the expected residency time. If the time is negative,
> +        * a timer interrupt has probably just expired after disabling
> +        * interrupts. Return as quickly as possible in the most shallow
> +        * state possible. tv_nsec is always positive, so only check the
> +        * seconds.
> +        */
>         t = ktime_to_timespec(tick_nohz_get_sleep_length());
> +       if (t.tv_sec < 0)
> +               return 0;
>         data->next_timer_us =
>                 t.tv_sec * USEC_PER_SEC + t.tv_nsec / NSEC_PER_USEC;
>

Are there special conditions that are necessary to provoke a negative
return value?
I've traced this code on several systems, and never seen a negative
return value.

However...
I do see values up to 300.2 seconds, and those large values seem to decay
at the rate of real-time so that after 5 minutes they are small, and then
jump back up to 300 seconds.

Some folks at Oracle debugged it down to use of NEXT_TIMER_MAX_DELTA
when there is _no_ timer currently pending on that CPU.  It seems this is easier
to observe, the more CPUs a system has -- though I've been able to reproduce
it on a system as small as a single-package 8-cpu systems.

One proposed way to address this is to cap large values at 1 second.
However, that will not recognize that for the period when the large
value decays to under 1 second, all of those are fiction.

Also, if we could identify the case where there is no future timer,
it seems that re-using dev->last_residency would probably be
a more useful guess than pretending we'll have a timer expire in 1 second.

thanks,
Len Brown, Intel Open Source Technology Cente
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Len Brown March 7, 2014, 3:09 a.m. | #3
NAK

tick_nohz_get_sleep_length() has exactly 1 consumer -- cpuidle.

Rather than have cpuidle second guess what is returned,
tick_nohz_get_sleep_length() should be updated to return a value
that makes sense to cpuidle.

thanks,
Len Brown, Intel Open Source Technology Cent
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Tuukka Tikkanen March 10, 2014, 10:54 a.m. | #4
Hi,

On 6 March 2014 09:41, Len Brown <lenb@kernel.org> wrote:
>
> On Mon, Feb 24, 2014 at 1:29 AM, Tuukka Tikkanen
> <tuukka.tikkanen@linaro.org> wrote:
> > Sometimes (fairly often) when the cpuidle menu governor is making a decision
> > about idle state to enter the next timer for the cpu appears to expire in
> > the past. The menu governor expects the expiry to always be in the future
> > and in fact stores the time delta in an unsigned variable. However, when
> > the expiry is in the past, the value returned by tick_nohz_get_sleep_length
> > can be negative. This patch prevents using negative values, instead making
> > the governor return immediately similar to having latency requirement set
> > to 0.
> >
> > Note: As with latency == 0, the return value is 0 with no check to see if
> > the state 0 has been disabled or not.
> >
> > Signed-off-by: Tuukka Tikkanen <tuukka.tikkanen@linaro.org>
> > ---
> >  drivers/cpuidle/governors/menu.c |   10 +++++++++-
> >  1 file changed, 9 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
> > index 71b5232..c414468 100644
> > --- a/drivers/cpuidle/governors/menu.c
> > +++ b/drivers/cpuidle/governors/menu.c
> > @@ -302,8 +302,16 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
> >         if (unlikely(latency_req == 0))
> >                 return 0;
> >
> > -       /* determine the expected residency time, round up */
> > +       /*
> > +        * Determine the expected residency time. If the time is negative,
> > +        * a timer interrupt has probably just expired after disabling
> > +        * interrupts. Return as quickly as possible in the most shallow
> > +        * state possible. tv_nsec is always positive, so only check the
> > +        * seconds.
> > +        */
> >         t = ktime_to_timespec(tick_nohz_get_sleep_length());
> > +       if (t.tv_sec < 0)
> > +               return 0;
> >         data->next_timer_us =
> >                 t.tv_sec * USEC_PER_SEC + t.tv_nsec / NSEC_PER_USEC;
> >
>
> Are there special conditions that are necessary to provoke a negative
> return value?
> I've traced this code on several systems, and never seen a negative
> return value.
>

By changing the if statement to
if (WARN_ONCE(t.tv_sec < 0, "Next timer in the past. tv_sec: %ld,
tv_usec: %lu", t.tv_sec, t.tv_usec)) return 0;

I eventually got
[ 3092.355144] WARNING: CPU: 4 PID: 0 at
drivers/cpuidle/governors/menu.c:315 menu_select+0x410/0x420()
[ 3092.355145] Next timer in the past. tv_sec: -1, tv_usec: 999999645
...
[ 3092.355169] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 3.14.0-rc3+ #3

so the problem is still lurking there. However, as you can see from
the timestamp, either something has changed recently and made it much
much less frequent or I did not use the use case / hardware that
triggers it more often. It has been quite a long time since I
discovered and fixed this in my private trees, so I have forgotten the
best way to reproduce this, sorry. In any case, existence is still
confirmed.

This was on a i7-3770K with 64 bit kernel, running make clean && make
-j8 && make clean && make for kernel source tree. If I happen to
remember the use case + hardware combination that triggers this
rapidly, I'll let you know.

That said, I agree dropping this patch and fixing the called function
is a better idea.

Tuukka

>
> However...
> I do see values up to 300.2 seconds, and those large values seem to decay
> at the rate of real-time so that after 5 minutes they are small, and then
> jump back up to 300 seconds.
>
> Some folks at Oracle debugged it down to use of NEXT_TIMER_MAX_DELTA
> when there is _no_ timer currently pending on that CPU.  It seems this is easier
> to observe, the more CPUs a system has -- though I've been able to reproduce
> it on a system as small as a single-package 8-cpu systems.
>
> One proposed way to address this is to cap large values at 1 second.
> However, that will not recognize that for the period when the large
> value decays to under 1 second, all of those are fiction.
>
> Also, if we could identify the case where there is no future timer,
> it seems that re-using dev->last_residency would probably be
> a more useful guess than pretending we'll have a timer expire in 1 second.
>
> thanks,
> Len Brown, Intel Open Source Technology Cente
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Len Brown March 21, 2014, 11:10 p.m. | #5
Tuukka,
I've reproduced this negative on a 48 thread 2-socket Xeon during boot
(seen it only once, so far).

expected_us gets calculated to be -1, which is truthful, since the
next timer return value was about 500ns in the past
and our math truncates.  This, in turn, confuses the heck out of
menu's prediction of the next event to be
over 500 sec in the future.

I suspect more breakage is lurking, so I'll poke some more...

Len Brown, Intel Open Source Technology Center
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Patch

diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
index 71b5232..c414468 100644
--- a/drivers/cpuidle/governors/menu.c
+++ b/drivers/cpuidle/governors/menu.c
@@ -302,8 +302,16 @@  static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
 	if (unlikely(latency_req == 0))
 		return 0;
 
-	/* determine the expected residency time, round up */
+	/*
+	 * Determine the expected residency time. If the time is negative,
+	 * a timer interrupt has probably just expired after disabling
+	 * interrupts. Return as quickly as possible in the most shallow
+	 * state possible. tv_nsec is always positive, so only check the
+	 * seconds.
+	 */
 	t = ktime_to_timespec(tick_nohz_get_sleep_length());
+	if (t.tv_sec < 0)
+		return 0;
 	data->next_timer_us =
 		t.tv_sec * USEC_PER_SEC + t.tv_nsec / NSEC_PER_USEC;