irqchip: omap-intc: fix spurious irq handling

Message ID	3d433cfeeb93366cadbb1668ebeac2e8006b0fd5.1445247844.git.nsekhar@ti.com
State	New
From: Sekhar Nori <nsekhar@ti.com>
To: Thomas Gleixner <tglx@linutronix.de>, Tony Lindgren <tony@atomide.com>, Jason Cooper <jason@lakedaemon.net>, Marc Zyngier <marc.zyngier@arm.com>
CC: John Ogness <john.ogness@linutronix.de>, Felipe Balbi <balbi@ti.com>, Linux OMAP Mailing List <linux-omap@vger.kernel.org>, <linux-kernel@vger.kernel.org>
Subject: [PATCH] irqchip: omap-intc: fix spurious irq handling
Date: Mon, 19 Oct 2015 15:16:31 +0530

Sekhar Nori Oct. 19, 2015, 9:46 a.m. UTC

Under some conditions, irq sorting procedure used by INTC can go wrong
resulting in a spurious irq getting reported.

This condition is flagged by INTC by setting "Spurious IRQ Flag" in SIR
register to 0x1ffffff. Section 6.2.5 of AM335x TRM revised Jun 2014
describes this.

Using IRQ number 0 for checking this condition is wrong. 0 is a valid
INTC IRQ. For example, on AM335x, it is the emulation interrupt.

Fix handling of spurious interrupt condition in omap-intc driver by
correctly detecting the spurious interrupt condition.

Since spurious IRQ condition can happen under genuine conditions (see
the section of AM335x TRM for details) and is recoverable, we do not
need a warning splat for users to report. It can however result in
reduced performance so we add a ratelimited debug print to aid
developers.

Signed-off-by: Sekhar Nori <nsekhar@ti.com>
---
 drivers/irqchip/irq-omap-intc.c | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)
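
The patch body is not reproduced on this page, so the following is only a
user-space sketch of the detection the commit message describes, not the
driver code itself. It assumes the SIR register keeps the active IRQ number
in bits 6:0 and the spurious flag in bits 31:7, so that the flag value
0x1ffffff corresponds to a mask of 0xffffff80. ACTIVEIRQ_MASK and the helper
names are chosen for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed SIR layout: active IRQ in bits 6:0, spurious flag in bits 31:7. */
#define ACTIVEIRQ_MASK   0x7fu
#define SPURIOUSIRQ_MASK (0x1ffffffu << 7)

/* True only when every bit of the spurious flag field is set. */
static bool sir_is_spurious(uint32_t sir)
{
	return (sir & SPURIOUSIRQ_MASK) == SPURIOUSIRQ_MASK;
}

/* Extract the IRQ number; 0 is a valid result, not an error marker. */
static uint32_t sir_active_irq(uint32_t sir)
{
	return sir & ACTIVEIRQ_MASK;
}
```

With this check, a SIR value of 0 decodes as valid IRQ 0 rather than being
mistaken for the spurious case.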

Thomas Gleixner Oct. 19, 2015, 10:13 a.m. UTC | #1

On Mon, 19 Oct 2015, Sekhar Nori wrote:
> +	/*
> +	 * A spurious IRQ can result if interrupt that triggered the
> +	 * sorting is no longer active during the sorting (10 INTC
> +	 * functional clock cycles after interrupt assertion). Or a
> +	 * change in interrupt mask affected the result during sorting
> +	 * time. There is no special handling required except ignoring
> +	 * the SIR register value just read and retrying.
> +	 * See section 6.2.5 of AM335x TRM Literature Number: SPRUH73K
> +	 */
> +	if ((irqnr & SPURIOUSIRQ_MASK) == SPURIOUSIRQ_MASK) {
> +		pr_debug_ratelimited("%s: spurious irq!\n", __func__);

I'd prefer that this is a pr_once() and the spurious interrupt counter
is incremented. That's far more useful as it gives you real
information about the frequency of the issue.
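
As a rough user-space model of that suggestion (print once, count every
occurrence), something like the following could work. The names
spurious_count, messages_printed and note_spurious_irq() are hypothetical;
in the kernel the print would presumably be one of the pr_*_once() helpers
and the counter an existing interrupt error counter.

```c
#include <stdbool.h>
#include <stdio.h>

static unsigned long spurious_count;   /* hypothetical: total spurious IRQs seen */
static unsigned int messages_printed;  /* only here so the behaviour can be checked */

/* Count every spurious IRQ, but emit the message only the first time. */
static void note_spurious_irq(void)
{
	static bool warned;

	spurious_count++;
	if (!warned) {
		warned = true;
		messages_printed++;
		fprintf(stderr, "omap-intc: spurious irq detected\n");
	}
}
```

The log stays quiet under load, while the counter still shows how often the
condition occurs.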

Thanks,

	tglx
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Tony Lindgren Oct. 19, 2015, 2:50 p.m. UTC | #2

Hi,

* Sekhar Nori <nsekhar@ti.com> [151019 02:51]:
> Under some conditions, irq sorting procedure used by INTC can go wrong
> resulting in a spurious irq getting reported.
> 
> This condition is flagged by INTC by setting "Spurious IRQ Flag" in SIR
> register to 0x1ffffff. Section 6.2.5 of AM335x TRM revised Jun 2014
> describes this.

OK so we have this finally documented, that's great. It's been bugging
me for years now :) What we used to have for omap3 was 6ccc4c0dedf8
("ARM: OMAP3: Warn about spurious interrupts"). I always thought it's
some undocumented omap3 weirdness but obviously not if you're seeing it
on am335x too.

> Using IRQ number 0 for checking this condition is wrong. 0 is a valid
> INTC IRQ. For example, on AM335x, it is the emulation interrupt.
> 
> Fix handling of spurious interrupt condition in omap-intc driver by
> correctly detecting the spurious interrupt condition.
> 
> Since spurious IRQ condition can happen under genuine conditions (see
> the section of AM335x TRM for details) and is recoverable, we do not
> need a warning splat for users to report. It can however result in
> reduced performance so we add a ratelimited debug print to aid
> developers.

Do you know what really is causing the spurious interrupts in your
case?

In all the cases I've seen, the spurious interrupts were caused by
a missing flush of posted write acking the IRQ at the device driver
for the _previously triggered_ INTC interrupt.

If you have a reproducible case, I suggest you test that by printing
out the previous interrupt to check if that makes sense. And then see
if adding the missing read back to that interrupt handler fixes the
issue.
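
To illustrate the posted-write theory, here is a small user-space model, not
real driver code. struct fake_dev and all function names are made up; the
model assumes that a read from the device cannot complete until earlier
posted writes to it have landed, which is why a read back acts as a flush.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical device model used only for this illustration. */
struct fake_dev {
	bool irq_asserted;   /* line level as seen by the interrupt controller */
	bool ack_in_flight;  /* ack write posted, not yet seen by the device */
};

/* Posted write: returns at once, the device has not processed it yet. */
static void dev_write_ack(struct fake_dev *d)
{
	d->ack_in_flight = true;
}

/* Reading any device register forces the pending ack to land first. */
static uint32_t dev_read_revision(struct fake_dev *d)
{
	if (d->ack_in_flight) {
		d->ack_in_flight = false;
		d->irq_asserted = false;
	}
	return 0; /* the value itself is ignored by the callers */
}

/* Returns the line state at handler exit: still asserted here. */
static bool handler_without_readback(struct fake_dev *d)
{
	dev_write_ack(d);
	return d->irq_asserted;
}

/* Same handler with the read back added: line is low on exit. */
static bool handler_with_readback(struct fake_dev *d)
{
	dev_write_ack(d);
	(void)dev_read_revision(d);
	return d->irq_asserted;
}
```

In the first handler the line is still high when the handler returns, so the
controller can start sorting an interrupt that then disappears mid-sort.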

And if my assumption is correct, you can then update your patch and
actually warn about the real culprit irq number :)

Regards,

Tony

Sekhar Nori Oct. 20, 2015, 6:22 a.m. UTC | #3

On Monday 19 October 2015 08:20 PM, Tony Lindgren wrote:
> Hi,
> 
> * Sekhar Nori <nsekhar@ti.com> [151019 02:51]:
>> Under some conditions, irq sorting procedure used by INTC can go wrong
>> resulting in a spurious irq getting reported.
>>
>> This condition is flagged by INTC by setting "Spurious IRQ Flag" in SIR
>> register to 0x1ffffff. Section 6.2.5 of AM335x TRM revised Jun 2014
>> describes this.
> 
> OK so we have this finally documented, that's great. It's been bugging
> me for years now :) What we used to have for omap3 was 6ccc4c0dedf8
> ("ARM: OMAP3: Warn about spurious interrupts"). I always thought it's
> some undocumented omap3 weirdness but obviously not if you're seeing it
> on am335x too.

BTW, I noticed the AM335x documentation itself is copied from OMAP35x
public TRM: http://www.ti.com/lit/ug/spruf98y/spruf98y.pdf. Surprising
that OMAP34x never had this documented though.

> 
>> Using IRQ number 0 for checking this condition is wrong. 0 is a valid
>> INTC IRQ. For example, on AM335x, it is the emulation interrupt.
>>
>> Fix handling of spurious interrupt condition in omap-intc driver by
>> correctly detecting the spurious interrupt condition.
>>
>> Since spurious IRQ condition can happen under genuine conditions (see
>> the section of AM335x TRM for details) and is recoverable, we do not
>> need a warning splat for users to report. It can however result in
>> reduced performance so we add a ratelimited debug print to aid
>> developers.
> 
> Do you know what really is causing the spurious interrupts in your
> case?

No, not yet.

> 
> In all the cases I've seen, the spurious interrupts were caused by
> a missing flush of posted write acking the IRQ at the device driver
> for the _previously triggered_ INTC interrupt.
> 
> If you have a reproducible case, I suggest you test that by printing
> out the previous interrupt to check if that makes sense. And then see
> if adding the missing read back to that interrupt handler fixes the
> issue.

Okay, that's good to know. Thanks for the hints and history of your debug
on OMAP3. The issue is not easily reproducible in my case. But if I try
hard enough, I can hit it. So I can surely try your hints.

> 
> And if my assumption is correct, you can then update your patch and
> actually warn about the real culprit irq number :)

I am not sure about introducing the prediction of bad IRQ in the same
patch as this. While it's certainly useful to have hints about the
culprit, it's not guaranteed to be true all the time. And if we later
discover the prediction scheme is throwing people off course more often
than not, it will be easy to revert just that prediction part without
affecting basic detection of spurious IRQ itself as documented by TRM.

So I propose this patch goes in with Thomas's comments fixed and I work
on adding some prediction based on your work in 6ccc4c0dedf8 ("ARM:
OMAP3: Warn about spurious interrupts").

Thanks,
Sekhar

John Ogness Oct. 20, 2015, 7:32 a.m. UTC | #4

On 2015-10-20, Sekhar Nori <nsekhar@ti.com> wrote:
>> Do you know what really is causing the spurious interrupts in your
>> case?
>
> No, not yet.

According to the TRM this is normal behavior if conditions that might
affect priority are changed during priority sorting.

    6.2.5 ARM A8 INTC Spurious Interrupt Handling

    The spurious flag indicates whether the result of the sorting (a
    window of 10 INTC functional clock cycles after the interrupt
    assertion) is invalid. The sorting is invalid if:

    - The interrupt that triggered the sorting is no longer active
      during the sorting.

    - A change in the mask has affected the result during the sorting
      time.

>> In all the cases I've seen, the spurious interrupts were caused by a
>> missing flush of posted write acking the IRQ at the device driver
>> for the _previously triggered_ INTC interrupt.
>> 
>> If you have a reproducible case, I suggest you test that by printing
>> out the previous interrupt to check if that makes sense. And then see
>> if adding the missing read back to that interrupt handler fixes the
>> issue.
>
> Okay, that's good to know. Thanks for the hints and history of your debug
> on OMAP3. The issue is not easily reproducible in my case. But if I try
> hard enough, I can hit it. So I can surely try your hints.

I can reproduce the situation very easily. After running a test for a
few minutes and printing out the previous interrupt, I have the
following list. These are the irq numbers seen by the handler before the
spurious interrupt triggered.

    INT12 - EDMACOMPINT - TPCC (EDMA)
    INT41 - 3PGSWRXINT0 - CPSW (Ethernet)
    INT42 - 3PGSWTXINT0 - CPSW (Ethernet)
    INT68 - TINT2       - DMTIMER2
    INT72 - UART0INT    - UART0

From this I do not think we can put the blame on any single driver. I
trigger this situation very easily by putting a load of 7,000+
interrupts per second on the system. This means we have 70,000 INTC
clock cycles per second where a change in the interrupt priority
conditions would cause the priority sorting to become invalid and thus
cause the spurious interrupt.
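
The arithmetic behind that figure can be written down as a short sketch.
SORT_WINDOW_CYCLES comes from the TRM text quoted above; the function names
and the idea of expressing the exposure in parts per million of a given
functional clock rate are additions for illustration, and the clock rate is
a caller-supplied parameter, not a value from this thread.

```c
#include <stdint.h>

/* Sorting window per interrupt, per the quoted TRM section. */
#define SORT_WINDOW_CYCLES 10u

/* INTC cycles per second during which sorting can be invalidated. */
static uint64_t sorting_cycles_per_sec(uint64_t irqs_per_sec)
{
	return irqs_per_sec * SORT_WINDOW_CYCLES;
}

/*
 * Share of all functional clock cycles spent inside a sorting window,
 * in parts per million. fclk_hz is supplied by the caller and must be
 * non-zero.
 */
static uint64_t sorting_exposure_ppm(uint64_t irqs_per_sec, uint64_t fclk_hz)
{
	return sorting_cycles_per_sec(irqs_per_sec) * 1000000u / fclk_hz;
}
```

For 7,000 interrupts per second this gives the 70,000 cycles mentioned
above. With a purely illustrative 100 MHz clock, that is 700 ppm of all
cycles.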

I'm not sure if we can/should do anything more than Sekhar's patch of
acknowledging the spurious interrupt so the priority sorting algorithm
can run again.

John Ogness

Tony Lindgren Oct. 20, 2015, 2:52 p.m. UTC | #5

* John Ogness <john.ogness@linutronix.de> [151020 00:33]:
> On 2015-10-20, Sekhar Nori <nsekhar@ti.com> wrote:
> >> Do you know what really is causing the spurious interrupts in your
> >> case?
> >
> > No, not yet.
> 
> According to the TRM this is normal behavior if conditions that might
> affect priority are changed during priority sorting.
> 
>     6.2.5 ARM A8 INTC Spurious Interrupt Handling
> 
>     The spurious flag indicates whether the result of the sorting (a
>     window of 10 INTC functional clock cycles after the interrupt
>     assertion) is invalid. The sorting is invalid if:
> 
>     - The interrupt that triggered the sorting is no longer active
>       during the sorting.
> 
>     - A change in the mask has affected the result during the sorting
>       time.
> 
> >> In all the cases I've seen, the spurious interrupts were caused by a
> >> missing flush of posted write acking the IRQ at the device driver
> >> for the _previously triggered_ INTC interrupt.
> >> 
> >> If you have a reproducible case, I suggest you test that by printing
> >> out the previous interrupt to check if that makes sense. And then see
> >> if adding the missing read back to that interrupt handler fixes the
> >> issue.
> >
> > Okay, that's good to know. Thanks for the hints and history of your debug
> > on OMAP3. The issue is not easily reproducible in my case. But if I try
> > hard enough, I can hit it. So I can surely try your hints.
> 
> I can reproduce the situation very easily. After running a test for a
> few minutes and printing out the previous interrupt, I have the
> following list. These are the irq numbers seen by the handler before the
> spurious interrupt triggered.
> 
>     INT12 - EDMACOMPINT - TPCC (EDMA)
>     INT41 - 3PGSWRXINT0 - CPSW (Ethernet)
>     INT42 - 3PGSWTXINT0 - CPSW (Ethernet)
>     INT68 - TINT2       - DMTIMER2
>     INT72 - UART0INT    - UART0
> 
> From this I do not think we can put the blame on any single driver. I
> trigger this situation very easily by putting a load of 7,000+
> interrupts per second on the system. This means we have 70,000 INTC
> clock cycles per second where a change in the interrupt priority
> conditions would cause the priority sorting to become invalid and thus
> cause the spurious interrupt.
> 
> I'm not sure if we can/should do anything more than Sekhar's patch of
> acknowledging the spurious interrupt so the priority sorting algorithm
> can run again.

OK thanks for testing. My guess from the above list would be EDMA
or CPSW missing a flush of posted write. Maybe try adding a readback
of the related device revision register after acking the interrupt into
TPCC interrupt handler and CPSW interrupt handler(s)?

The timer2 and uart0 seem to be false positives here naturally.

I would not yet rule out the "previous interrupt" theory until you have
tried that. We really want to know the root cause of the issue, just
printing out spurious interrupt does not fix the problem :)

Regards,

Tony
