ARM: Don't ever downscale loops_per_jiffy in SMP systems

Message ID alpine.LFD.2.11.1405091640090.980@knanqh.ubzr
State New

Commit Message

Nicolas Pitre May 9, 2014, 9:05 p.m.
On Fri, 9 May 2014, Russell King - ARM Linux wrote:

> I'd much prefer just printing a warning at kernel boot time to report
> that the kernel is running with features which would make udelay() less
> than accurate.

What if there is simply no timer to rely upon, as in those cases where 
interrupts are needed for time keeping to make progress?  We should do 
better than simply saying "sorry your kernel should eradicate every
udelay() usage to be reliable".

And I mean "reliable" which is not exactly the same as "accurate".  
Reliable means "never *significantly* shorter".

> Remember, it should be usable for _short_ delays on slow machines as
> well as other stuff, and if we're going to start throwing stuff like
> the above at it, it's going to become very inefficient.

You said that udelay can be much longer than expected due to various 
reasons.

You also said that the IRQ handler overhead during udelay calibration 
makes actual delays slightly shorter than expected.

I'm suggesting the addition of a slight overhead that is much smaller
than the IRQ handler here.  That shouldn't impact things measurably.
I'd certainly like Doug to run his udelay timing test with the following 
patch to see if it solves the problem.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Comments

Douglas Anderson May 12, 2014, 11:51 p.m. | #1
Hi,

On Fri, May 9, 2014 at 2:05 PM, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
> On Fri, 9 May 2014, Russell King - ARM Linux wrote:
>
>> I'd much prefer just printing a warning at kernel boot time to report
>> that the kernel is running with features which would make udelay() less
>> than accurate.
>
> What if there is simply no timer to rely upon, as in those cases where
> interrupts are needed for time keeping to make progress?  We should do
> better than simply saying "sorry your kernel should eradicate every
> udelay() usage to be reliable".
>
> And I mean "reliable" which is not exactly the same as "accurate".
> Reliable means "never *significantly* shorter".
>
>> Remember, it should be usable for _short_ delays on slow machines as
>> well as other stuff, and if we're going to start throwing stuff like
>> the above at it, it's going to become very inefficient.
>
> You said that udelay can be much longer than expected due to various
> reasons.
>
> You also said that the IRQ handler overhead during udelay calibration
> makes actual delays slightly shorter than expected.
>
> I'm suggesting the addition of a slight overhead that is much smaller
> than the IRQ handler here.  That shouldn't impact things measurably.
> I'd certainly like Doug to run his udelay timing test with the following
> patch to see if it solves the problem.

...so I spent a whole chunk of time debugging this problem today.  I'm
out of time today (more tomorrow), but it looks like the theory I
proposed about why udelay() is giving bad results _might_ have more to
do with bugs in the exynos cpufreq driver and less to do with the
theoretical race we've been talking about.  It looks possible that the
driver is not setting the "old" frequency properly, which would
certainly cause problems.

I'll post more when I figure this out for sure.

-Doug
Douglas Anderson May 13, 2014, 9:50 p.m. | #2
Hi,

On Mon, May 12, 2014 at 4:51 PM, Doug Anderson <dianders@chromium.org> wrote:
> Hi,
>
> On Fri, May 9, 2014 at 2:05 PM, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
>> On Fri, 9 May 2014, Russell King - ARM Linux wrote:
>>
>>> I'd much prefer just printing a warning at kernel boot time to report
>>> that the kernel is running with features which would make udelay() less
>>> than accurate.
>>
>> What if there is simply no timer to rely upon, as in those cases where
>> interrupts are needed for time keeping to make progress?  We should do
>> better than simply saying "sorry your kernel should eradicate every
>> udelay() usage to be reliable".
>>
>> And I mean "reliable" which is not exactly the same as "accurate".
>> Reliable means "never *significantly* shorter".
>>
>>> Remember, it should be usable for _short_ delays on slow machines as
>>> well as other stuff, and if we're going to start throwing stuff like
>>> the above at it, it's going to become very inefficient.
>>
>> You said that udelay can be much longer than expected due to various
>> reasons.
>>
>> You also said that the IRQ handler overhead during udelay calibration
>> makes actual delays slightly shorter than expected.
>>
>> I'm suggesting the addition of a slight overhead that is much smaller
>> than the IRQ handler here.  That shouldn't impact things measurably.
>> I'd certainly like Doug to run his udelay timing test with the following
>> patch to see if it solves the problem.
>
> ...so I spent a whole chunk of time debugging this problem today.  I'm
> out of time today (more tomorrow), but it looks like the theory I
> proposed about why udelay() is giving bad results _might_ have more to
> do with bugs in the exynos cpufreq driver and less to do with the
> theoretical race we've been talking about.  It looks possible that the
> driver is not setting the "old" frequency properly, which would
> certainly cause problems.

Argh.  It turns out that I spent a whole lot of time tracking down the
fact that cpufreq_out_of_sync() was running.  As part of debugging this
problem I added a cpufreq_get(0).  That would periodically notice that
the driver's reported frequency didn't match "policy->cur" and call
cpufreq_out_of_sync().  cpufreq_out_of_sync() would "thoughtfully"
send out its own CPUFREQ_PRECHANGE / CPUFREQ_POSTCHANGE but without
any sort of mutexes (at least in our tree).  Ugh.

Overall cpufreq_out_of_sync() seems incredibly racy since there will
inevitably be some period of time where the cpufreq driver has changed
the real CPU frequency but hasn't yet sent out the
cpufreq_notify_transition().  ...and there is no locking between the
two that I see.  ...but that's getting pretty far afield from my
original bug and it's been that way forever, so I guess I'll ignore
it.

--

...but then I found the true problem shows up when we transition
between very low frequencies on exynos, like between 200MHz and
300MHz.  While transitioning between frequencies the system
temporarily bumps over to the "switcher" PLL running at 800MHz while
waiting for the main PLL to stabilize.  No CPUFREQ notification is
sent for that.  That means there's a period of time when we're running
at 800MHz but loops_per_jiffy is calibrated at between 200MHz and
300MHz.
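To put rough numbers on that window, here is a minimal sketch. It assumes
the delay loop's duration is inversely proportional to the CPU clock and
that the loop count is fixed when the delay starts. The helper name
effective_delay_us is hypothetical and is not part of the kernel.

```c
#include <stdint.h>

/*
 * Simplified model (an assumption, not kernel code): a loop-based delay
 * is sized for calibrated_khz but executes while the CPU actually runs
 * at actual_khz.  The loop count does not change once the delay has
 * started, so the real duration scales by calibrated / actual.
 */
static uint64_t effective_delay_us(uint64_t requested_us,
				   uint32_t calibrated_khz,
				   uint32_t actual_khz)
{
	return requested_us * calibrated_khz / actual_khz;
}
```

Under this model, a udelay(100) issued while loops_per_jiffy is still
calibrated for 200MHz but the CPU is on the 800MHz PLL lasts about 25us,
a quarter of what was requested.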


I'd welcome any suggestions for how to address this.  It sorta
feels like it would be a common thing to have a temporary PLL during
the transition, so my inclination would be to add a "temp" field to
"struct cpufreq_freqs".  Anyone who cared about the fact that cpufreq
might transition through a different frequency on its way from old to
new could look at this field.
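One way such a field might be consumed is sketched below, purely as an
illustration.  The struct freq_transition and the helper
lpj_during_transition are hypothetical stand-ins and do not mirror the
real "struct cpufreq_freqs" or any existing notifier.  The assumption is
that a delay-loop user would want loops_per_jiffy scaled to the highest
frequency the CPU can pass through while the transition is in flight.

```c
/* Hypothetical stand-in for the idea; not the real struct cpufreq_freqs. */
struct freq_transition {
	unsigned int old_khz;
	unsigned int new_khz;
	unsigned int temp_khz;	/* 0 when no intermediate frequency is used */
};

/*
 * Scale a reference loops_per_jiffy value (valid at ref_khz) to the
 * highest of old, new and temp, so that delays started during the
 * transition can only come out longer than requested, never shorter.
 */
static unsigned long lpj_during_transition(unsigned long lpj_ref,
					   unsigned int ref_khz,
					   const struct freq_transition *t)
{
	unsigned int peak = t->old_khz;

	if (t->new_khz > peak)
		peak = t->new_khz;
	if (t->temp_khz > peak)
		peak = t->temp_khz;

	return (unsigned long)((unsigned long long)lpj_ref * peak / ref_khz);
}
```

For a 200MHz to 300MHz change that goes through 800MHz, this picks the
800MHz value for the duration of the transition and lets the normal
POSTCHANGE handling settle on the 300MHz value afterwards.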


What do people think?


-Doug
Stephen Warren May 13, 2014, 10:15 p.m. | #3
On 05/13/2014 03:50 PM, Doug Anderson wrote:
...
> ...but then I found the true problem shows up when we transition
> between very low frequencies on exynos, like between 200MHz and
> 300MHz.  While transitioning between frequencies the system
> temporarily bumps over to the "switcher" PLL running at 800MHz while
> waiting for the main PLL to stabilize.  No CPUFREQ notification is
> sent for that.  That means there's a period of time when we're running
> at 800MHz but loops_per_jiffy is calibrated at between 200MHz and
> 300MHz.
> 
> 
> I'd welcome any suggestions for how to address this.  It sorta
> feels like it would be a common thing to have a temporary PLL during
> the transition, ...

We definitely do that on Tegra for some cpufreq transitions.
Nicolas Pitre May 13, 2014, 11:15 p.m. | #4
On Tue, 13 May 2014, Stephen Warren wrote:

> On 05/13/2014 03:50 PM, Doug Anderson wrote:
> ...
> > ...but then I found the true problem shows up when we transition
> > between very low frequencies on exynos, like between 200MHz and
> > 300MHz.  While transitioning between frequencies the system
> > temporarily bumps over to the "switcher" PLL running at 800MHz while
> > waiting for the main PLL to stabilize.  No CPUFREQ notification is
> > sent for that.  That means there's a period of time when we're running
> > at 800MHz but loops_per_jiffy is calibrated at between 200MHz and
> > 300MHz.
> > 
> > 
> > I'd welcome any suggestions for how to address this.  It sorta
> > feels like it would be a common thing to have a temporary PLL during
> > the transition, ...
> 
> We definitely do that on Tegra for some cpufreq transitions.

Ouch...  If this is a common strategy to use a third frequency during a 
transition phase, especially if that frequency is way off (800MHz vs 
200-300MHz) then it is something the cpufreq layer must capture and 
advertise.


Nicolas
Nicolas Pitre May 13, 2014, 11:29 p.m. | #5
On Tue, 13 May 2014, Nicolas Pitre wrote:

> On Tue, 13 May 2014, Stephen Warren wrote:
> 
> > On 05/13/2014 03:50 PM, Doug Anderson wrote:
> > ...
> > > ...but then I found the true problem shows up when we transition
> > > between very low frequencies on exynos, like between 200MHz and
> > > 300MHz.  While transitioning between frequencies the system
> > > temporarily bumps over to the "switcher" PLL running at 800MHz while
> > > waiting for the main PLL to stabilize.  No CPUFREQ notification is
> > > sent for that.  That means there's a period of time when we're running
> > > at 800MHz but loops_per_jiffy is calibrated at between 200MHz and
> > > 300MHz.
> > > 
> > > 
> > > I'd welcome any suggestions for how to address this.  It sorta
> > > feels like it would be a common thing to have a temporary PLL during
> > > the transition, ...
> > 
> > We definitely do that on Tegra for some cpufreq transitions.
> 
> Ouch...  If this is a common strategy to use a third frequency during a 
> transition phase, especially if that frequency is way off (800MHz vs 
> 200-300MHz) then it is something the cpufreq layer must capture and 
> advertise.

Of course, if the loops_per_jiffy scaling is the only thing that still
cares about frequency changes these days, and if in those cases udelay()
can instead be moved to a timer source on those hiccup-prone platforms,
then all this is fairly theoretical and may not be worth pursuing.


Nicolas
Russell King - ARM Linux May 13, 2014, 11:36 p.m. | #6
On Tue, May 13, 2014 at 07:29:52PM -0400, Nicolas Pitre wrote:
> On Tue, 13 May 2014, Nicolas Pitre wrote:
> 
> > On Tue, 13 May 2014, Stephen Warren wrote:
> > 
> > > On 05/13/2014 03:50 PM, Doug Anderson wrote:
> > > ...
> > > > ...but then I found the true problem shows up when we transition
> > > > between very low frequencies on exynos, like between 200MHz and
> > > > 300MHz.  While transitioning between frequencies the system
> > > > temporarily bumps over to the "switcher" PLL running at 800MHz while
> > > > waiting for the main PLL to stabilize.  No CPUFREQ notification is
> > > > sent for that.  That means there's a period of time when we're running
> > > > at 800MHz but loops_per_jiffy is calibrated at between 200MHz and
> > > > 300MHz.
> > > > 
> > > > 
> > > > I'd welcome any suggestions for how to address this.  It sorta
> > > > feels like it would be a common thing to have a temporary PLL during
> > > > the transition, ...
> > > 
> > > We definitely do that on Tegra for some cpufreq transitions.
> > 
> > Ouch...  If this is a common strategy to use a third frequency during a 
> > transition phase, especially if that frequency is way off (800MHz vs 
> > 200-300MHz) then it is something the cpufreq layer must capture and 
> > advertise.
> 
> Of course, if the loops_per_jiffy scaling is the only thing that still
> cares about frequency changes these days, and if in those cases udelay()
> can instead be moved to a timer source on those hiccup-prone platforms,
> then all this is fairly theoretical and may not be worth pursuing.

As I've been saying... use a bloody timer. :)
Viresh Kumar May 15, 2014, 6:12 a.m. | #7
On 14 May 2014 03:20, Doug Anderson <dianders@chromium.org> wrote:
> ...but then I found the true problem shows up when we transition
> between very low frequencies on exynos, like between 200MHz and
> 300MHz.  While transitioning between frequencies the system
> temporarily bumps over to the "switcher" PLL running at 800MHz while
> waiting for the main PLL to stabilize.  No CPUFREQ notification is
> sent for that.  That means there's a period of time when we're running
> at 800MHz but loops_per_jiffy is calibrated at between 200MHz and
> 300MHz.
>
>
> I'd welcome any suggestions for how to address this.

I have attempted to fix this in a generic way and sent an RFC patch for
this. I have cc'd only a few people from this list who I thought would be
interested in cpufreq stuff, sorry if I missed anyone.

https://lkml.org/lkml/2014/5/15/40

Patch

diff --git a/arch/arm/include/asm/delay.h b/arch/arm/include/asm/delay.h
index dff714d886..74fb571a55 100644
--- a/arch/arm/include/asm/delay.h
+++ b/arch/arm/include/asm/delay.h
@@ -57,11 +57,6 @@  extern void __bad_udelay(void);
 			__const_udelay((n) * UDELAY_MULT)) :		\
 	  __udelay(n))
 
-/* Loop-based definitions for assembly code. */
-extern void __loop_delay(unsigned long loops);
-extern void __loop_udelay(unsigned long usecs);
-extern void __loop_const_udelay(unsigned long);
-
 /* Delay-loop timer registration. */
 #define ARCH_HAS_READ_CURRENT_TIMER
 extern void register_current_timer_delay(const struct delay_timer *timer);
diff --git a/arch/arm/lib/delay.c b/arch/arm/lib/delay.c
index 5306de3501..9150d31c2d 100644
--- a/arch/arm/lib/delay.c
+++ b/arch/arm/lib/delay.c
@@ -25,6 +25,11 @@ 
 #include <linux/module.h>
 #include <linux/timex.h>
 
+/* Loop-based definitions for assembly code. */
+extern void __loop_delay(unsigned long loops);
+extern void __loop_udelay(unsigned long usecs);
+extern void __loop_const_udelay(unsigned long);
+
 /*
  * Default to the loop-based delay implementation.
  */
@@ -34,6 +39,85 @@  struct arm_delay_ops arm_delay_ops = {
 	.udelay		= __loop_udelay,
 };
 
+#if defined(CONFIG_CPU_FREQ) && (defined(CONFIG_SMP) || defined(CONFIG_PREEMPT))
+
+#include <linux/cpufreq.h>
+
+/*
+ * Another CPU/thread might increase the CPU clock in the middle of
+ * the loop based delay routine and the newly scaled LPJ value won't be
+ * accounted for, resulting in a possibly significantly shorter delay than
+ * expected.  Let's make sure this occurrence is trapped and compensated.
+ */
+
+static int __loop_seq;
+static unsigned int __loop_security_factor;
+
+#define __safe_loop_(type) \
+static void __safe_loop_##type(unsigned long val) \
+{ \
+	int seq_count = __loop_seq; \
+	__loop_##type(val); \
+	if (seq_count != __loop_seq) \
+		__loop_##type(val * __loop_security_factor); \
+}
+
+__safe_loop_(delay)
+__safe_loop_(const_udelay)
+__safe_loop_(udelay)
+
+static int cpufreq_callback(struct notifier_block *nb,
+			    unsigned long val, void *data)
+{
+	struct cpufreq_freqs *freq = data;
+	unsigned int f;
+
+	if ((freq->flags & CPUFREQ_CONST_LOOPS) ||
+	    freq->old >= freq->new)
+		return NOTIFY_OK;
+
+	switch (val) {
+	case CPUFREQ_PRECHANGE:
+		/* Remember the largest security factor ever needed */
+		f = DIV_ROUND_UP(freq->new, freq->old) - 1;
+		if (__loop_security_factor < f)
+			__loop_security_factor = f;
+		/* fallthrough */
+	case CPUFREQ_POSTCHANGE:
+		__loop_seq++;
+	}
+	return NOTIFY_OK;
+}
+
+static struct notifier_block cpufreq_notifier = {
+	.notifier_call  = cpufreq_callback,
+};
+
+static int __init init_safe_loop_delays(void)
+{
+	int err;
+
+	/*
+	 * Bail out if the default loop based implementation has
+	 * already been replaced by something better.
+	 */
+	if (arm_delay_ops.udelay != __loop_udelay)
+		return 0;
+
+	__loop_security_factor = 1;
+	err = cpufreq_register_notifier(&cpufreq_notifier,
+					CPUFREQ_TRANSITION_NOTIFIER);
+	if (!err) {
+		arm_delay_ops.delay		= __safe_loop_delay;
+		arm_delay_ops.const_udelay	= __safe_loop_const_udelay;
+		arm_delay_ops.udelay		= __safe_loop_udelay;
+	}
+	return err;
+}
+core_initcall(init_safe_loop_delays);
+
+#endif
+
 static const struct delay_timer *delay_timer;
 static bool delay_calibrated;
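
The compensation arithmetic in the patch above can be checked with a small
userspace model.  This is a sketch under stated assumptions, not kernel
code: the helper names security_factor and worst_case_delay_us are
hypothetical, only a single upward transition is modelled (the patch
itself remembers the largest factor ever seen), and loop duration is
assumed to be inversely proportional to the CPU clock.

```c
#include <stdint.h>

#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/*
 * Multiplier for the compensating pass, computed the same way the patch
 * does at CPUFREQ_PRECHANGE.  It starts at 1, matching the value set in
 * init_safe_loop_delays(), and only upward transitions are considered.
 */
static unsigned int security_factor(unsigned int old_khz, unsigned int new_khz)
{
	unsigned int f = 1;

	if (new_khz > old_khz) {
		unsigned int needed = DIV_ROUND_UP(new_khz, old_khz) - 1;

		if (needed > f)
			f = needed;
	}
	return f;
}

/*
 * Worst case modelled here: the clock jumps from old_khz to new_khz just
 * as the first pass starts, so the whole pass runs at the higher clock
 * with a loop count sized for the lower one.  The sequence counter has
 * changed, so a second pass of (loops * factor) runs, also at the higher
 * clock.
 */
static uint64_t worst_case_delay_us(uint64_t requested_us,
				    unsigned int old_khz,
				    unsigned int new_khz)
{
	uint64_t first = requested_us * old_khz / new_khz;
	uint64_t extra = requested_us * security_factor(old_khz, new_khz) *
			 old_khz / new_khz;

	return first + extra;
}
```

In this model a 100us delay caught by a 200MHz to 800MHz jump runs for
25us plus a 75us compensating pass, which restores the requested 100us.
A 200MHz to 300MHz jump gives roughly 66us twice, so the delay errs on
the long side, in line with the "never significantly shorter" goal.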