Message ID | 5375079E.3060305@ti.com |
---|---|
State | Accepted |
Commit | 4b353a706a86598ba47307c47301c3c428b79e09 |
Headers | show |
On 05/16/2014 02:29 AM, Santosh Shilimkar wrote: > Alex, > Please give a try with your test-case and see if you still see the hang. > Am just curious about your issue and hence the request.. there is no test case. I just patched your patches, then boot new kernel, system quickly to hang without any serial port oops, I am using pandaES/ubuntu.
Tony, On Thursday 15 May 2014 02:29 PM, Santosh Shilimkar wrote: > On Thursday 15 May 2014 01:54 PM, Santosh Shilimkar wrote: >> On Thursday 15 May 2014 01:50 PM, Daniel Lezcano wrote: >>> On 05/15/2014 07:03 PM, Santosh Shilimkar wrote: > > [..] > >>>>> With above mentioned change, it should work. Other alternatives is OMAP4 driver does >>>>> its won registration where it can start the timer. The way it was before the >>>>> consolidation. >>>>> >>>>> Ofcourse if you have better fix, then great. >>>>> >>>> What is your suggestion. We *must* fix the regression asap. I think >>>> $subject patch with an update to bctimer start under CPUIDLE_FLAG_COUPLED >>>> seems a good way forward. >>>> >>>> Do let me know. >>> >>> Did you see Alex Shi's email [cc'ed] ? Reverting this change makes the panda ES to hang. >>> >> The hang is definitely due to the bctimer not started. As I said, I assumed it was and >> then you corrected saying it is under the flag. >> >>> I am not convinced the culprit is this code you are trying to revert. >>> >> fair enough. Thats why I said if you have an alternative fix thats great. >> > For record, below is updated patch with bctimer started which > was missed in earlier version. I haven't tested it though. > > Alex, > Please give a try with your test-case and see if you still see the hang. > Am just curious about your issue and hence the request.. > Alex tested below patch and he don't see the hang so the patch is addressing the issue. If Daniel works out an alternate fix to avoid reverts, that will be great but if not, we should merge the below patch. I let you take call on it. > > > From bb3b82cc5645b83bedf1343d03cc956f27f6fc83 Mon Sep 17 00:00:00 2001 > From: Santosh Shilimkar <santosh.shilimkar@ti.com> > Date: Mon, 12 May 2014 17:37:59 -0400 > Subject: [PATCH] ARM: OMAP4: Fix the boot regression with CPU_IDLE enabled > > On OMAP4 panda board, there have been several bug reports about boot > hang and lock-ups with CPU_IDLE enabled. The root cause of the issue > is missing interrupts while in idle state. Commit cb7094e8 {cpuidle / omap4 : > use CPUIDLE_FLAG_TIMER_STOP flag} moved the broadcast notifiers to common > code for right reasons but on OMAP4 which suffers from a nasty ROM code > bug with GIC, commit ff999b8a {ARM: OMAP4460: Workaround for ROM bug ..}, > we loose interrupts which leads to issues like lock-up, hangs etc. > > Patch reverts commit cb7094 {cpuidle / omap4 : use CPUIDLE_FLAG_TIMER_STOP > flag} and 54769d6 {cpuidle: OMAP4: remove timer broadcast initialization} to > avoid the issue. With this change, OMAP4 panda boards, the mentioned > issues are getting fixed. We no longer loose interrupts which was the cause > of the regression. > > Cc: Roger Quadros <rogerq@ti.com> > Cc: Kevin Hilman <khilman@linaro.org> > Cc: Tony Lindgren <tony@atomide.com> > Cc: Daniel Lezcano <daniel.lezcano@linaro.org> > Reported-tested-by: Roger Quadros <rogerq@ti.com> > Reported-tested-by: Kevin Hilman <khilman@linaro.org> > Tested-by: Tony Lindgren <tony@atomide.com> > Signed-off-by: Santosh Shilimkar <santosh.shilimkar@ti.com> > --- > arch/arm/mach-omap2/cpuidle44xx.c | 25 +++++++++++++++++++++---- > 1 file changed, 21 insertions(+), 4 deletions(-) > > diff --git a/arch/arm/mach-omap2/cpuidle44xx.c b/arch/arm/mach-omap2/cpuidle44xx.c > index 01fc710..2498ab0 100644 > --- a/arch/arm/mach-omap2/cpuidle44xx.c > +++ b/arch/arm/mach-omap2/cpuidle44xx.c > @@ -14,6 +14,7 @@ > #include <linux/cpuidle.h> > #include <linux/cpu_pm.h> > #include <linux/export.h> > +#include <linux/clockchips.h> > > #include <asm/cpuidle.h> > #include <asm/proc-fns.h> > @@ -83,6 +84,7 @@ static int omap_enter_idle_coupled(struct cpuidle_device *dev, > { > struct idle_statedata *cx = state_ptr + index; > u32 mpuss_can_lose_context = 0; > + int cpu_id = smp_processor_id(); > > /* > * CPU0 has to wait and stay ON until CPU1 is OFF state. > @@ -110,6 +112,8 @@ static int omap_enter_idle_coupled(struct cpuidle_device *dev, > mpuss_can_lose_context = (cx->mpu_state == PWRDM_POWER_RET) && > (cx->mpu_logic_state == PWRDM_POWER_OFF); > > + clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu_id); > + > /* > * Call idle CPU PM enter notifier chain so that > * VFP and per CPU interrupt context is saved. > @@ -165,6 +169,8 @@ static int omap_enter_idle_coupled(struct cpuidle_device *dev, > if (dev->cpu == 0 && mpuss_can_lose_context) > cpu_cluster_pm_exit(); > > + clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu_id); > + > fail: > cpuidle_coupled_parallel_barrier(dev, &abort_barrier); > cpu_done[dev->cpu] = false; > @@ -172,6 +178,16 @@ fail: > return index; > } > > +/* > + * For each cpu, setup the broadcast timer because local timers > + * stops for the states above C1. > + */ > +static void omap_setup_broadcast_timer(void *arg) > +{ > + int cpu = smp_processor_id(); > + clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ON, &cpu); > +} > + > static struct cpuidle_driver omap4_idle_driver = { > .name = "omap4_idle", > .owner = THIS_MODULE, > @@ -189,8 +205,7 @@ static struct cpuidle_driver omap4_idle_driver = { > /* C2 - CPU0 OFF + CPU1 OFF + MPU CSWR */ > .exit_latency = 328 + 440, > .target_residency = 960, > - .flags = CPUIDLE_FLAG_TIME_VALID | CPUIDLE_FLAG_COUPLED | > - CPUIDLE_FLAG_TIMER_STOP, > + .flags = CPUIDLE_FLAG_TIME_VALID | CPUIDLE_FLAG_COUPLED, > .enter = omap_enter_idle_coupled, > .name = "C2", > .desc = "CPUx OFF, MPUSS CSWR", > @@ -199,8 +214,7 @@ static struct cpuidle_driver omap4_idle_driver = { > /* C3 - CPU0 OFF + CPU1 OFF + MPU OSWR */ > .exit_latency = 460 + 518, > .target_residency = 1100, > - .flags = CPUIDLE_FLAG_TIME_VALID | CPUIDLE_FLAG_COUPLED | > - CPUIDLE_FLAG_TIMER_STOP, > + .flags = CPUIDLE_FLAG_TIME_VALID | CPUIDLE_FLAG_COUPLED, > .enter = omap_enter_idle_coupled, > .name = "C3", > .desc = "CPUx OFF, MPUSS OSWR", > @@ -231,5 +245,8 @@ int __init omap4_idle_init(void) > if (!cpu_clkdm[0] || !cpu_clkdm[1]) > return -ENODEV; > > + /* Configure the broadcast timer on each cpu */ > + on_each_cpu(omap_setup_broadcast_timer, NULL, 1); > + > return cpuidle_register(&omap4_idle_driver, cpu_online_mask); > } > -- To unsubscribe from this list: send the line "unsubscribe linux-omap" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
* Santosh Shilimkar <santosh.shilimkar@ti.com> [140516 06:43]: > Tony, > > On Thursday 15 May 2014 02:29 PM, Santosh Shilimkar wrote: > > On Thursday 15 May 2014 01:54 PM, Santosh Shilimkar wrote: > >> On Thursday 15 May 2014 01:50 PM, Daniel Lezcano wrote: > >>> On 05/15/2014 07:03 PM, Santosh Shilimkar wrote: > > > > [..] > > > >>>>> With above mentioned change, it should work. Other alternatives is OMAP4 driver does > >>>>> its won registration where it can start the timer. The way it was before the > >>>>> consolidation. > >>>>> > >>>>> Ofcourse if you have better fix, then great. > >>>>> > >>>> What is your suggestion. We *must* fix the regression asap. I think > >>>> $subject patch with an update to bctimer start under CPUIDLE_FLAG_COUPLED > >>>> seems a good way forward. > >>>> > >>>> Do let me know. > >>> > >>> Did you see Alex Shi's email [cc'ed] ? Reverting this change makes the panda ES to hang. > >>> > >> The hang is definitely due to the bctimer not started. As I said, I assumed it was and > >> then you corrected saying it is under the flag. > >> > >>> I am not convinced the culprit is this code you are trying to revert. > >>> > >> fair enough. Thats why I said if you have an alternative fix thats great. > >> > > For record, below is updated patch with bctimer started which > > was missed in earlier version. I haven't tested it though. > > > > Alex, > > Please give a try with your test-case and see if you still see the hang. > > Am just curious about your issue and hence the request.. > > > Alex tested below patch and he don't see the hang so the patch is > addressing the issue. > > If Daniel works out an alternate fix to avoid reverts, that will be great > but if not, we should merge the below patch. I let you take call on it. Daniel any news on this? And just to recap, this problem can be reproduced with current Linux next with omap2plus_defconfig with CONFIG_CPU_IDLE enabled. The system should hang during the boot at some point. And for the record, the omap3 hang fix is now posted to the lists as "[PATCH] ARM: OMAP2+: Fix DMA hang after off-idle". This should not have anything to do with the omap4 cpu_idle hang as omap4 does not currently lose context during idle. Regards, Tony -- To unsubscribe from this list: send the line "unsubscribe linux-omap" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 05/16/2014 11:29 PM, Tony Lindgren wrote: > * Santosh Shilimkar <santosh.shilimkar@ti.com> [140516 06:43]: >> Tony, >> >> On Thursday 15 May 2014 02:29 PM, Santosh Shilimkar wrote: >>> On Thursday 15 May 2014 01:54 PM, Santosh Shilimkar wrote: >>>> On Thursday 15 May 2014 01:50 PM, Daniel Lezcano wrote: >>>>> On 05/15/2014 07:03 PM, Santosh Shilimkar wrote: >>> >>> [..] >>> >>>>>>> With above mentioned change, it should work. Other alternatives is OMAP4 driver does >>>>>>> its won registration where it can start the timer. The way it was before the >>>>>>> consolidation. >>>>>>> >>>>>>> Ofcourse if you have better fix, then great. >>>>>>> >>>>>> What is your suggestion. We *must* fix the regression asap. I think >>>>>> $subject patch with an update to bctimer start under CPUIDLE_FLAG_COUPLED >>>>>> seems a good way forward. >>>>>> >>>>>> Do let me know. >>>>> >>>>> Did you see Alex Shi's email [cc'ed] ? Reverting this change makes the panda ES to hang. >>>>> >>>> The hang is definitely due to the bctimer not started. As I said, I assumed it was and >>>> then you corrected saying it is under the flag. >>>> >>>>> I am not convinced the culprit is this code you are trying to revert. >>>>> >>>> fair enough. Thats why I said if you have an alternative fix thats great. >>>> >>> For record, below is updated patch with bctimer started which >>> was missed in earlier version. I haven't tested it though. >>> >>> Alex, >>> Please give a try with your test-case and see if you still see the hang. >>> Am just curious about your issue and hence the request.. >>> >> Alex tested below patch and he don't see the hang so the patch is >> addressing the issue. >> >> If Daniel works out an alternate fix to avoid reverts, that will be great >> but if not, we should merge the below patch. I let you take call on it. > > Daniel any news on this? > > And just to recap, this problem can be reproduced with current > Linux next with omap2plus_defconfig with CONFIG_CPU_IDLE enabled. The > system should hang during the boot at some point. I can take the time to investigate a bit more but not right now. What is your deadline before committing the reverts ? > And for the record, the omap3 hang fix is now posted to the lists as > "[PATCH] ARM: OMAP2+: Fix DMA hang after off-idle". This should not > have anything to do with the omap4 cpu_idle hang as omap4 does not > currently lose context during idle.
* Daniel Lezcano <daniel.lezcano@linaro.org> [140519 09:46]: > On 05/16/2014 11:29 PM, Tony Lindgren wrote: > > > >And just to recap, this problem can be reproduced with current > >Linux next with omap2plus_defconfig with CONFIG_CPU_IDLE enabled. The > >system should hang during the boot at some point. > > I can take the time to investigate a bit more but not right now. What is > your deadline before committing the reverts ? Well we do have several automated build and boot systems failing because of this with multi_v7_defconfig. And users are complaining, see this report from Tobias Jakobi: https://bugzilla.kernel.org/show_bug.cgi?id=75421 It seems that doing the revert is not enough based on the page above. I'd prefer we'd fix this issue properly for sure, it seems that we're not quite understanding what's going on. And this might hit other platforms too when they start implementing deeper PM idle states in the mainline kernel. Regards, Tony -- To unsubscribe from this list: send the line "unsubscribe linux-omap" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Monday 19 May 2014 01:23 PM, Tony Lindgren wrote: > * Daniel Lezcano <daniel.lezcano@linaro.org> [140519 09:46]: >> On 05/16/2014 11:29 PM, Tony Lindgren wrote: >>> >>> And just to recap, this problem can be reproduced with current >>> Linux next with omap2plus_defconfig with CONFIG_CPU_IDLE enabled. The >>> system should hang during the boot at some point. >> >> I can take the time to investigate a bit more but not right now. What is >> your deadline before committing the reverts ? > > Well we do have several automated build and boot systems failing > because of this with multi_v7_defconfig. And users are complaining, > see this report from Tobias Jakobi: > > https://bugzilla.kernel.org/show_bug.cgi?id=75421 > > It seems that doing the revert is not enough based on the > page above. > Thats not true. The above link used the half patch and not the updated patch. Updated patch worked for Alex also. As you can see they saw RCU stalls and they go away after the updated patch. Can you please point them to try out the updated patch ? > I'd prefer we'd fix this issue properly for sure, it seems that > we're not quite understanding what's going on. And this might > hit other platforms too when they start implementing deeper > PM idle states in the mainline kernel. > I am certain that the updated patch fixed the regression for sure. The issue is really not generic enough since its related an OMAP ROM errata which needs that special handling of interrupt re-trigger etc. You don't need that for other platforms so they are not likely get affected. -- To unsubscribe from this list: send the line "unsubscribe linux-omap" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
* Santosh Shilimkar <santosh.shilimkar@ti.com> [140519 10:35]: > On Monday 19 May 2014 01:23 PM, Tony Lindgren wrote: > > * Daniel Lezcano <daniel.lezcano@linaro.org> [140519 09:46]: > >> On 05/16/2014 11:29 PM, Tony Lindgren wrote: > >>> > >>> And just to recap, this problem can be reproduced with current > >>> Linux next with omap2plus_defconfig with CONFIG_CPU_IDLE enabled. The > >>> system should hang during the boot at some point. > >> > >> I can take the time to investigate a bit more but not right now. What is > >> your deadline before committing the reverts ? > > > > Well we do have several automated build and boot systems failing > > because of this with multi_v7_defconfig. And users are complaining, > > see this report from Tobias Jakobi: > > > > https://bugzilla.kernel.org/show_bug.cgi?id=75421 > > > > It seems that doing the revert is not enough based on the > > page above. > > > Thats not true. The above link used the half patch and not the > updated patch. Updated patch worked for Alex also. As you can > see they saw RCU stalls and they go away after the updated patch. > > Can you please point them to try out the updated patch ? OK good point. I added a link to the updated patch in bugzilla. > > I'd prefer we'd fix this issue properly for sure, it seems that > > we're not quite understanding what's going on. And this might > > hit other platforms too when they start implementing deeper > > PM idle states in the mainline kernel. > > > I am certain that the updated patch fixed the regression > for sure. The issue is really not generic enough since its related > an OMAP ROM errata which needs that special handling of > interrupt re-trigger etc. You don't need that for other platforms > so they are not likely get affected. OK makes sense to me considering the ROM code. Daniel, are you OK with that or do you still want to investigate further? Regards, Tony -- To unsubscribe from this list: send the line "unsubscribe linux-omap" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 05/19/2014 07:51 PM, Tony Lindgren wrote: > * Santosh Shilimkar <santosh.shilimkar@ti.com> [140519 10:35]: >> On Monday 19 May 2014 01:23 PM, Tony Lindgren wrote: >>> * Daniel Lezcano <daniel.lezcano@linaro.org> [140519 09:46]: >>>> On 05/16/2014 11:29 PM, Tony Lindgren wrote: >>>>> >>>>> And just to recap, this problem can be reproduced with current >>>>> Linux next with omap2plus_defconfig with CONFIG_CPU_IDLE enabled. The >>>>> system should hang during the boot at some point. >>>> >>>> I can take the time to investigate a bit more but not right now. What is >>>> your deadline before committing the reverts ? >>> >>> Well we do have several automated build and boot systems failing >>> because of this with multi_v7_defconfig. And users are complaining, >>> see this report from Tobias Jakobi: >>> >>> https://bugzilla.kernel.org/show_bug.cgi?id=75421 >>> >>> It seems that doing the revert is not enough based on the >>> page above. >>> >> Thats not true. The above link used the half patch and not the >> updated patch. Updated patch worked for Alex also. As you can >> see they saw RCU stalls and they go away after the updated patch. >> >> Can you please point them to try out the updated patch ? > > OK good point. I added a link to the updated patch in > bugzilla. > >>> I'd prefer we'd fix this issue properly for sure, it seems that >>> we're not quite understanding what's going on. And this might >>> hit other platforms too when they start implementing deeper >>> PM idle states in the mainline kernel. >>> >> I am certain that the updated patch fixed the regression >> for sure. The issue is really not generic enough since its related >> an OMAP ROM errata which needs that special handling of >> interrupt re-trigger etc. You don't need that for other platforms >> so they are not likely get affected. > > OK makes sense to me considering the ROM code. Daniel, are you OK > with that or do you still want to investigate further? For the moment I am a bit short in time for some other tasks. So feel free to apply the revert and I will look for a proper fix when I will have time. Thanks -- Daniel
* Daniel Lezcano <daniel.lezcano@linaro.org> [140519 11:07]: > On 05/19/2014 07:51 PM, Tony Lindgren wrote: > >* Santosh Shilimkar <santosh.shilimkar@ti.com> [140519 10:35]: > >>On Monday 19 May 2014 01:23 PM, Tony Lindgren wrote: > >>>* Daniel Lezcano <daniel.lezcano@linaro.org> [140519 09:46]: > >>>>On 05/16/2014 11:29 PM, Tony Lindgren wrote: > >>>>> > >>>>>And just to recap, this problem can be reproduced with current > >>>>>Linux next with omap2plus_defconfig with CONFIG_CPU_IDLE enabled. The > >>>>>system should hang during the boot at some point. > >>>> > >>>>I can take the time to investigate a bit more but not right now. What is > >>>>your deadline before committing the reverts ? > >>> > >>>Well we do have several automated build and boot systems failing > >>>because of this with multi_v7_defconfig. And users are complaining, > >>>see this report from Tobias Jakobi: > >>> > >>>https://bugzilla.kernel.org/show_bug.cgi?id=75421 > >>> > >>>It seems that doing the revert is not enough based on the > >>>page above. > >>> > >>Thats not true. The above link used the half patch and not the > >>updated patch. Updated patch worked for Alex also. As you can > >>see they saw RCU stalls and they go away after the updated patch. > >> > >>Can you please point them to try out the updated patch ? > > > >OK good point. I added a link to the updated patch in > >bugzilla. > > > >>>I'd prefer we'd fix this issue properly for sure, it seems that > >>>we're not quite understanding what's going on. And this might > >>>hit other platforms too when they start implementing deeper > >>>PM idle states in the mainline kernel. > >>> > >>I am certain that the updated patch fixed the regression > >>for sure. The issue is really not generic enough since its related > >>an OMAP ROM errata which needs that special handling of > >>interrupt re-trigger etc. You don't need that for other platforms > >>so they are not likely get affected. > > > >OK makes sense to me considering the ROM code. Daniel, are you OK > >with that or do you still want to investigate further? > > For the moment I am a bit short in time for some other tasks. So feel free > to apply the revert and I will look for a proper fix when I will have time. Added Tobias to Cc. At the bugzilla link Tobias is saying he used the right patch from Santosh to test and it still fails. Regards, Tony -- To unsubscribe from this list: send the line "unsubscribe linux-omap" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 05/19/2014 09:36 PM, Tony Lindgren wrote: > * Daniel Lezcano <daniel.lezcano@linaro.org> [140519 11:07]: >> On 05/19/2014 07:51 PM, Tony Lindgren wrote: >>> * Santosh Shilimkar <santosh.shilimkar@ti.com> [140519 10:35]: >>>> On Monday 19 May 2014 01:23 PM, Tony Lindgren wrote: >>>>> * Daniel Lezcano <daniel.lezcano@linaro.org> [140519 09:46]: >>>>>> On 05/16/2014 11:29 PM, Tony Lindgren wrote: >>>>>>> >>>>>>> And just to recap, this problem can be reproduced with current >>>>>>> Linux next with omap2plus_defconfig with CONFIG_CPU_IDLE enabled. The >>>>>>> system should hang during the boot at some point. >>>>>> >>>>>> I can take the time to investigate a bit more but not right now. What is >>>>>> your deadline before committing the reverts ? >>>>> >>>>> Well we do have several automated build and boot systems failing >>>>> because of this with multi_v7_defconfig. And users are complaining, >>>>> see this report from Tobias Jakobi: >>>>> >>>>> https://bugzilla.kernel.org/show_bug.cgi?id=75421 >>>>> >>>>> It seems that doing the revert is not enough based on the >>>>> page above. >>>>> >>>> Thats not true. The above link used the half patch and not the >>>> updated patch. Updated patch worked for Alex also. As you can >>>> see they saw RCU stalls and they go away after the updated patch. >>>> >>>> Can you please point them to try out the updated patch ? >>> >>> OK good point. I added a link to the updated patch in >>> bugzilla. >>> >>>>> I'd prefer we'd fix this issue properly for sure, it seems that >>>>> we're not quite understanding what's going on. And this might >>>>> hit other platforms too when they start implementing deeper >>>>> PM idle states in the mainline kernel. >>>>> >>>> I am certain that the updated patch fixed the regression >>>> for sure. The issue is really not generic enough since its related >>>> an OMAP ROM errata which needs that special handling of >>>> interrupt re-trigger etc. You don't need that for other platforms >>>> so they are not likely get affected. >>> >>> OK makes sense to me considering the ROM code. Daniel, are you OK >>> with that or do you still want to investigate further? >> >> For the moment I am a bit short in time for some other tasks. So feel free >> to apply the revert and I will look for a proper fix when I will have time. > > Added Tobias to Cc. At the bugzilla link Tobias is saying > he used the right patch from Santosh to test and it still > fails. So apparently Santosh, Kevin and Alex say they are not facing the issue anymore with the revert. But Tobias is still facing this issue. There is this simple program [1] which usually makes cpuidle to hang the system more quickly when there is a problem somewhere. May be worth to check with it. Hope that helps. -- Daniel [1] https://git.linaro.org/power/pm-qa.git/blob/HEAD:/cpuidle/cpuidle_killer.c
On Monday 19 May 2014 03:36 PM, Tony Lindgren wrote: > * Daniel Lezcano <daniel.lezcano@linaro.org> [140519 11:07]: >> On 05/19/2014 07:51 PM, Tony Lindgren wrote: >>> * Santosh Shilimkar <santosh.shilimkar@ti.com> [140519 10:35]: >>>> On Monday 19 May 2014 01:23 PM, Tony Lindgren wrote: >>>>> * Daniel Lezcano <daniel.lezcano@linaro.org> [140519 09:46]: >>>>>> On 05/16/2014 11:29 PM, Tony Lindgren wrote: >>>>>>> >>>>>>> And just to recap, this problem can be reproduced with current >>>>>>> Linux next with omap2plus_defconfig with CONFIG_CPU_IDLE enabled. The >>>>>>> system should hang during the boot at some point. >>>>>> >>>>>> I can take the time to investigate a bit more but not right now. What is >>>>>> your deadline before committing the reverts ? >>>>> >>>>> Well we do have several automated build and boot systems failing >>>>> because of this with multi_v7_defconfig. And users are complaining, >>>>> see this report from Tobias Jakobi: >>>>> >>>>> https://bugzilla.kernel.org/show_bug.cgi?id=75421 >>>>> >>>>> It seems that doing the revert is not enough based on the >>>>> page above. >>>>> >>>> Thats not true. The above link used the half patch and not the >>>> updated patch. Updated patch worked for Alex also. As you can >>>> see they saw RCU stalls and they go away after the updated patch. >>>> >>>> Can you please point them to try out the updated patch ? >>> >>> OK good point. I added a link to the updated patch in >>> bugzilla. >>> >>>>> I'd prefer we'd fix this issue properly for sure, it seems that >>>>> we're not quite understanding what's going on. And this might >>>>> hit other platforms too when they start implementing deeper >>>>> PM idle states in the mainline kernel. >>>>> >>>> I am certain that the updated patch fixed the regression >>>> for sure. The issue is really not generic enough since its related >>>> an OMAP ROM errata which needs that special handling of >>>> interrupt re-trigger etc. You don't need that for other platforms >>>> so they are not likely get affected. >>> >>> OK makes sense to me considering the ROM code. Daniel, are you OK >>> with that or do you still want to investigate further? >> >> For the moment I am a bit short in time for some other tasks. So feel free >> to apply the revert and I will look for a proper fix when I will have time. > > Added Tobias to Cc. At the bugzilla link Tobias is saying > he used the right patch from Santosh to test and it still > fails. > The link of the patch in bugzilla shows that it was not the updated patch and that was my reference point. The other bugzilla issue seems to be different and mostly not related to cpuidle. I could be wrong. ---------------------- dmesg | grep idle: cpuidle: using governor ladder cpuidle: using governor menu grep . /sys/devices/system/cpu/cpu*/cpuidle/*/*: (now attached as 'cpuidle states') The 'BUG' looks like this: BUG: scheduling while atomic: swapper/1/0/0xffff0000 Modules linked in: ctr ccm btrfs xor xor_neon zlib_inflate zlib_deflate omapfb cfbfillrect cfbimgblt cfbcopyarea raid6_pq arc4 wl12xx wlcore mac80211 cfg80211 rfkill usb_storage snd_soc_omap_hdmi snd_soc_omap_hdmi_card wlcore_sdio CPU: 1 PID: 0 Comm: swapper/1 Tainted: G W 3.15.0-rc5+ #1 Backtrace: [<c001152c>] (dump_backtrace) from [<c00116c8>] (show_stack+0x18/0x1c) r6:eb894000 r5:00000001 r4:00000000 r3:00000000 [<c00116b0>] (show_stack) from [<c0430a94>] (dump_stack+0x7c/0x98) [<c0430a18>] (dump_stack) from [<c042ec10>] (__schedule_bug+0x48/0x60) r4:00000000 r3:00000153 [<c042ebc8>] (__schedule_bug) from [<c0432248>] (__schedule+0x41c/0x4b4) r4:2b9fe000 r3:00000000 [<c0431e2c>] (__schedule) from [<c04323d4>] (schedule+0x38/0x88) r10:eb894000 r9:eb894000 r8:00000000 r7:eb894000 r6:c05b0450 r5:c05b04a0 r4:eb894000 [<c043239c>] (schedule) from [<c04326fc>] (schedule_preempt_disabled+0x10/0x14) [<c04326ec>] (schedule_preempt_disabled) from [<c006d8f0>] (cpu_startup_entry+0xec/0x22c) [<c006d804>] (cpu_startup_entry) from [<c00131a4>] (secondary_start_kernel+0xe8/0x100) r7:c05de240 [<c00130bc>] (secondary_start_kernel) from [<80008664>] (0x80008664) r4:ab87c06a r3:c000864c ---------------------- -- To unsubscribe from this list: send the line "unsubscribe linux-omap" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Daniel Lezcano wrote: > > So apparently Santosh, Kevin and Alex say they are not facing the > issue anymore with the revert. But Tobias is still facing this issue. > > There is this simple program [1] which usually makes cpuidle to hang > the system more quickly when there is a problem somewhere. May be > worth to check with it. > > Hope that helps. > > -- Daniel > > [1] > https://git.linaro.org/power/pm-qa.git/blob/HEAD:/cpuidle/cpuidle_killer.c > Hello, just to clarify. I never encountered the boot failure problem on my Panda ES in the first place. The issue with CPU_IDLE for me, like I also say in the bugreport, showed itself through instability of the USB. I saw the boot issue on the ml, read that it was caused by CPU_IDLE, and thought I should make Tony aware that I have a similar problem. Now, after applying the latest version of the patch that Santosh proposed, the situation with CPU_IDLE improves. But it is still not stable for me, since after some while, working with an SSH session of WiFi (which IIRC is not connected via USB on the board), this 'BUG' storm hits. This does _not_ happen when I disable CPU_IDLE completly. But even if I don't connect via WiFi at all, just boot and let me system run with serial console connected, after some time I get a kernel 'WARNING': http://www.math.uni-bielefeld.de/~tjakobi/archive/dmesg.1.log So something with CPU_IDLE is not right here. However I can't say if it's exactly this boot issue which this thread was initially about. Maybe seperate issues are at work here, who knows. Anyway, I'm going to have a look at that cpuidle killer thingy later. -- Tobias -- To unsubscribe from this list: send the line "unsubscribe linux-omap" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
* Tobias Jakobi <tjakobi@math.uni-bielefeld.de> [140519 14:19]: > Daniel Lezcano wrote: > > > > So apparently Santosh, Kevin and Alex say they are not facing the > > issue anymore with the revert. But Tobias is still facing this issue. > > > > There is this simple program [1] which usually makes cpuidle to hang > > the system more quickly when there is a problem somewhere. May be > > worth to check with it. > > > > Hope that helps. > > > > -- Daniel > > > > [1] > > https://git.linaro.org/power/pm-qa.git/blob/HEAD:/cpuidle/cpuidle_killer.c > > > Hello, > > just to clarify. I never encountered the boot failure problem on my > Panda ES in the first place. The issue with CPU_IDLE for me, like I also > say in the bugreport, showed itself through instability of the USB. I > saw the boot issue on the ml, read that it was caused by CPU_IDLE, and > thought I should make Tony aware that I have a similar problem. > > Now, after applying the latest version of the patch that Santosh > proposed, the situation with CPU_IDLE improves. But it is still not > stable for me, since after some while, working with an SSH session of > WiFi (which IIRC is not connected via USB on the board), this 'BUG' > storm hits. This does _not_ happen when I disable CPU_IDLE completly. OK that seems like a different problem. > But even if I don't connect via WiFi at all, just boot and let me system > run with serial console connected, after some time I get a kernel 'WARNING': > http://www.math.uni-bielefeld.de/~tjakobi/archive/dmesg.1.log > > So something with CPU_IDLE is not right here. However I can't say if > it's exactly this boot issue which this thread was initially about. > Maybe seperate issues are at work here, who knows. > > Anyway, I'm going to have a look at that cpuidle killer thingy later. OK. I gave it a try, but did not get any errors at least after about 30 minutes of testing on my panda es. I'll apply the patch from Santosh as that gets things booting again with linux next. Regards, Tony -- To unsubscribe from this list: send the line "unsubscribe linux-omap" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
* Tobias Jakobi <tjakobi@math.uni-bielefeld.de> [140519 14:19]: > > But even if I don't connect via WiFi at all, just boot and let me system > run with serial console connected, after some time I get a kernel 'WARNING': > http://www.math.uni-bielefeld.de/~tjakobi/archive/dmesg.1.log BTW, care to update the bugzilla page with the second warning in this log? That's the WARNING: CPU: 1 PID: 0 at kernel/timer.c:1147 that's at 238 seconds. Regards, Tony -- To unsubscribe from this list: send the line "unsubscribe linux-omap" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
* Daniel Lezcano <daniel.lezcano@linaro.org> [140523 13:53]: > On 23 May 2014 20:32, Tony Lindgren <tony@atomide.com> wrote: > > > * Tony Lindgren <tony@atomide.com> [140523 07:45]: > > > * Tobias Jakobi <tjakobi@math.uni-bielefeld.de> [140519 14:19]: > > > > > > > > But even if I don't connect via WiFi at all, just boot and let me > > system > > > > run with serial console connected, after some time I get a kernel > > 'WARNING': > > > > http://www.math.uni-bielefeld.de/~tjakobi/archive/dmesg.1.log > > > > > > BTW, care to update the bugzilla page with the second warning > > > in this log? > > > > > > That's the WARNING: CPU: 1 PID: 0 at kernel/timer.c:1147 that's > > > at 238 seconds. > > > > Also, with Santosh's fix applied, can you also try disabling one > > or more of the idle states for cpuidle and see if that helps? > > > > Something like this patch below. If that helps with the WARNING > > above you're getting it narrows down the problem down quite a bit. > > > > Regards, > > > > Tony > > > > --- a/arch/arm/mach-omap2/cpuidle44xx.c > > +++ b/arch/arm/mach-omap2/cpuidle44xx.c > > @@ -44,11 +44,13 @@ static struct idle_statedata omap4_idle_data[] = { > > .mpu_state = PWRDM_POWER_RET, > > .mpu_logic_state = PWRDM_POWER_RET, > > }, > > +#if 0 > > { > > .cpu_state = PWRDM_POWER_OFF, > > .mpu_state = PWRDM_POWER_RET, > > .mpu_logic_state = PWRDM_POWER_OFF, > > }, > > +#endif > > > > Hmm, I am afraid that will lead to a fault. Safer to set the state_count = > 2 instead. Hmm don't we have state_count = ARRAY_SIZE(omap4_idle_data) or am I missing something? Regards, Tony -- To unsubscribe from this list: send the line "unsubscribe linux-omap" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tuesday 27 May 2014 04:34 PM, Tony Lindgren wrote: > * Daniel Lezcano <daniel.lezcano@linaro.org> [140523 13:53]: >> On 23 May 2014 20:32, Tony Lindgren <tony@atomide.com> wrote: >> >>> * Tony Lindgren <tony@atomide.com> [140523 07:45]: >>>> * Tobias Jakobi <tjakobi@math.uni-bielefeld.de> [140519 14:19]: >>>>> >>>>> But even if I don't connect via WiFi at all, just boot and let me >>> system >>>>> run with serial console connected, after some time I get a kernel >>> 'WARNING': >>>>> http://www.math.uni-bielefeld.de/~tjakobi/archive/dmesg.1.log >>>> >>>> BTW, care to update the bugzilla page with the second warning >>>> in this log? >>>> >>>> That's the WARNING: CPU: 1 PID: 0 at kernel/timer.c:1147 that's >>>> at 238 seconds. >>> >>> Also, with Santosh's fix applied, can you also try disabling one >>> or more of the idle states for cpuidle and see if that helps? >>> >>> Something like this patch below. If that helps with the WARNING >>> above you're getting it narrows down the problem down quite a bit. >>> >>> Regards, >>> >>> Tony >>> >>> --- a/arch/arm/mach-omap2/cpuidle44xx.c >>> +++ b/arch/arm/mach-omap2/cpuidle44xx.c >>> @@ -44,11 +44,13 @@ static struct idle_statedata omap4_idle_data[] = { >>> .mpu_state = PWRDM_POWER_RET, >>> .mpu_logic_state = PWRDM_POWER_RET, >>> }, >>> +#if 0 >>> { >>> .cpu_state = PWRDM_POWER_OFF, >>> .mpu_state = PWRDM_POWER_RET, >>> .mpu_logic_state = PWRDM_POWER_OFF, >>> }, >>> +#endif >>> >> >> Hmm, I am afraid that will lead to a fault. Safer to set the state_count = >> 2 instead. > > Hmm don't we have state_count = ARRAY_SIZE(omap4_idle_data) or am I > missing something? > I don't think you are missing anything. The change should work. -- To unsubscribe from this list: send the line "unsubscribe linux-omap" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 05/27/2014 11:03 PM, Santosh Shilimkar wrote: > On Tuesday 27 May 2014 04:34 PM, Tony Lindgren wrote: >> * Daniel Lezcano <daniel.lezcano@linaro.org> [140523 13:53]: >>> On 23 May 2014 20:32, Tony Lindgren <tony@atomide.com> wrote: >>> >>>> * Tony Lindgren <tony@atomide.com> [140523 07:45]: >>>>> * Tobias Jakobi <tjakobi@math.uni-bielefeld.de> [140519 14:19]: >>>>>> >>>>>> But even if I don't connect via WiFi at all, just boot and let me >>>> system >>>>>> run with serial console connected, after some time I get a kernel >>>> 'WARNING': >>>>>> http://www.math.uni-bielefeld.de/~tjakobi/archive/dmesg.1.log >>>>> >>>>> BTW, care to update the bugzilla page with the second warning >>>>> in this log? >>>>> >>>>> That's the WARNING: CPU: 1 PID: 0 at kernel/timer.c:1147 that's >>>>> at 238 seconds. >>>> >>>> Also, with Santosh's fix applied, can you also try disabling one >>>> or more of the idle states for cpuidle and see if that helps? >>>> >>>> Something like this patch below. If that helps with the WARNING >>>> above you're getting it narrows down the problem down quite a bit. >>>> >>>> Regards, >>>> >>>> Tony >>>> >>>> --- a/arch/arm/mach-omap2/cpuidle44xx.c >>>> +++ b/arch/arm/mach-omap2/cpuidle44xx.c >>>> @@ -44,11 +44,13 @@ static struct idle_statedata omap4_idle_data[] = { >>>> .mpu_state = PWRDM_POWER_RET, >>>> .mpu_logic_state = PWRDM_POWER_RET, >>>> }, >>>> +#if 0 >>>> { >>>> .cpu_state = PWRDM_POWER_OFF, >>>> .mpu_state = PWRDM_POWER_RET, >>>> .mpu_logic_state = PWRDM_POWER_OFF, >>>> }, >>>> +#endif >>>> >>> >>> Hmm, I am afraid that will lead to a fault. Safer to set the state_count = >>> 2 instead. >> >> Hmm don't we have state_count = ARRAY_SIZE(omap4_idle_data) or am I >> missing something? >> > I don't think you are missing anything. The change should work. Oh, yes. Sorry, over-sighted that. I am too used with the other drivers assigning the state_count with a number.
diff --git a/arch/arm/mach-omap2/cpuidle44xx.c b/arch/arm/mach-omap2/cpuidle44xx.c index 01fc710..2498ab0 100644 --- a/arch/arm/mach-omap2/cpuidle44xx.c +++ b/arch/arm/mach-omap2/cpuidle44xx.c @@ -14,6 +14,7 @@ #include <linux/cpuidle.h> #include <linux/cpu_pm.h> #include <linux/export.h> +#include <linux/clockchips.h> #include <asm/cpuidle.h> #include <asm/proc-fns.h> @@ -83,6 +84,7 @@ static int omap_enter_idle_coupled(struct cpuidle_device *dev, { struct idle_statedata *cx = state_ptr + index; u32 mpuss_can_lose_context = 0; + int cpu_id = smp_processor_id(); /* * CPU0 has to wait and stay ON until CPU1 is OFF state. @@ -110,6 +112,8 @@ static int omap_enter_idle_coupled(struct cpuidle_device *dev, mpuss_can_lose_context = (cx->mpu_state == PWRDM_POWER_RET) && (cx->mpu_logic_state == PWRDM_POWER_OFF); + clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu_id); + /* * Call idle CPU PM enter notifier chain so that * VFP and per CPU interrupt context is saved. @@ -165,6 +169,8 @@ static int omap_enter_idle_coupled(struct cpuidle_device *dev, if (dev->cpu == 0 && mpuss_can_lose_context) cpu_cluster_pm_exit(); + clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu_id); + fail: cpuidle_coupled_parallel_barrier(dev, &abort_barrier); cpu_done[dev->cpu] = false; @@ -172,6 +178,16 @@ fail: return index; } +/* + * For each cpu, setup the broadcast timer because local timers + * stops for the states above C1. + */ +static void omap_setup_broadcast_timer(void *arg) +{ + int cpu = smp_processor_id(); + clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ON, &cpu); +} + static struct cpuidle_driver omap4_idle_driver = { .name = "omap4_idle", .owner = THIS_MODULE, @@ -189,8 +205,7 @@ static struct cpuidle_driver omap4_idle_driver = { /* C2 - CPU0 OFF + CPU1 OFF + MPU CSWR */ .exit_latency = 328 + 440, .target_residency = 960, - .flags = CPUIDLE_FLAG_TIME_VALID | CPUIDLE_FLAG_COUPLED | - CPUIDLE_FLAG_TIMER_STOP, + .flags = CPUIDLE_FLAG_TIME_VALID | CPUIDLE_FLAG_COUPLED, .enter = omap_enter_idle_coupled, .name = "C2", .desc = "CPUx OFF, MPUSS CSWR", @@ -199,8 +214,7 @@ static struct cpuidle_driver omap4_idle_driver = { /* C3 - CPU0 OFF + CPU1 OFF + MPU OSWR */ .exit_latency = 460 + 518, .target_residency = 1100, - .flags = CPUIDLE_FLAG_TIME_VALID | CPUIDLE_FLAG_COUPLED | - CPUIDLE_FLAG_TIMER_STOP, + .flags = CPUIDLE_FLAG_TIME_VALID | CPUIDLE_FLAG_COUPLED, .enter = omap_enter_idle_coupled, .name = "C3", .desc = "CPUx OFF, MPUSS OSWR", @@ -231,5 +245,8 @@ int __init omap4_idle_init(void) if (!cpu_clkdm[0] || !cpu_clkdm[1]) return -ENODEV; + /* Configure the broadcast timer on each cpu */ + on_each_cpu(omap_setup_broadcast_timer, NULL, 1); + return cpuidle_register(&omap4_idle_driver, cpu_online_mask); }