
mmc: cqhci: Be more verbose in error irq handler

Message ID 20231016095610.1095084-1-korneld@chromium.org
State New
Series mmc: cqhci: Be more verbose in error irq handler

Commit Message

Kornel Dulęba Oct. 16, 2023, 9:56 a.m. UTC
There are several reasons for the controller to generate an error
interrupt. They include controller<->card timeouts and CRC mismatch
errors.
Right now we only get one line in the logs stating that CQE recovery was
triggered, but with no information about what caused it.
To figure out what happened, be more verbose and dump the registers from
the IRQ error handler logic.
This matches the behaviour of the software timeout logic, see
cqhci_timeout.

Signed-off-by: Kornel Dulęba <korneld@chromium.org>
---
 drivers/mmc/host/cqhci-core.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

Comments

Adrian Hunter Oct. 20, 2023, 7:40 a.m. UTC | #1
On 16/10/23 12:56, Kornel Dulęba wrote:
> There are several reasons for controller to generate an error interrupt.
> They include controller<->card timeout, and CRC mismatch error.
> Right now we only get one line in the logs stating that CQE recovery was
> triggered, but with no information about what caused it.
> To figure out what happened be more verbose and dump the registers from
> irq error handler logic.
> This matches the behaviour of the software timeout logic, see
> cqhci_timeout.
> 
> Signed-off-by: Kornel Dulęba <korneld@chromium.org>
> ---
>  drivers/mmc/host/cqhci-core.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/mmc/host/cqhci-core.c b/drivers/mmc/host/cqhci-core.c
> index b3d7d6d8d654..33abb4bd53b5 100644
> --- a/drivers/mmc/host/cqhci-core.c
> +++ b/drivers/mmc/host/cqhci-core.c
> @@ -700,8 +700,9 @@ static void cqhci_error_irq(struct mmc_host *mmc, u32 status, int cmd_error,
>  
>  	terri = cqhci_readl(cq_host, CQHCI_TERRI);
>  
> -	pr_debug("%s: cqhci: error IRQ status: 0x%08x cmd error %d data error %d TERRI: 0x%08x\n",
> -		 mmc_hostname(mmc), status, cmd_error, data_error, terri);
> +	pr_warn("%s: cqhci: error IRQ status: 0x%08x cmd error %d data error %d\n",
> +		 mmc_hostname(mmc), status, cmd_error, data_error);
> +	cqhci_dumpregs(cq_host);

For debugging, isn't dynamic debug more appropriate?

>  
>  	/* Forget about errors when recovery has already been triggered */
>  	if (cq_host->recovery_halt)
Kornel Dulęba Oct. 20, 2023, 8:53 a.m. UTC | #2
On Fri, Oct 20, 2023 at 9:41 AM Adrian Hunter <adrian.hunter@intel.com> wrote:
>
> On 16/10/23 12:56, Kornel Dulęba wrote:
> > There are several reasons for controller to generate an error interrupt.
> > They include controller<->card timeout, and CRC mismatch error.
> > Right now we only get one line in the logs stating that CQE recovery was
> > triggered, but with no information about what caused it.
> > To figure out what happened be more verbose and dump the registers from
> > irq error handler logic.
> > This matches the behaviour of the software timeout logic, see
> > cqhci_timeout.
> >
> > Signed-off-by: Kornel Dulęba <korneld@chromium.org>
> > ---
> >  drivers/mmc/host/cqhci-core.c | 5 +++--
> >  1 file changed, 3 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/mmc/host/cqhci-core.c b/drivers/mmc/host/cqhci-core.c
> > index b3d7d6d8d654..33abb4bd53b5 100644
> > --- a/drivers/mmc/host/cqhci-core.c
> > +++ b/drivers/mmc/host/cqhci-core.c
> > @@ -700,8 +700,9 @@ static void cqhci_error_irq(struct mmc_host *mmc, u32 status, int cmd_error,
> >
> >       terri = cqhci_readl(cq_host, CQHCI_TERRI);
> >
> > -     pr_debug("%s: cqhci: error IRQ status: 0x%08x cmd error %d data error %d TERRI: 0x%08x\n",
> > -              mmc_hostname(mmc), status, cmd_error, data_error, terri);
> > +     pr_warn("%s: cqhci: error IRQ status: 0x%08x cmd error %d data error %d\n",
> > +              mmc_hostname(mmc), status, cmd_error, data_error);
> > +     cqhci_dumpregs(cq_host);
>
> For debugging, isn't dynamic debug seems more appropriate?

Dynamic debug is an option, but my personal preference would be to
just log more info in the error handler.
To give you some background.
We're seeing some "running CQE recovery" lines in the logs, followed
by a dm_verity mismatch error.
The reports come from the field, with no feasible way to reproduce the
issue locally.

I'd argue that logging only the info that CQE recovery was executed is
not particularly helpful for someone looking into those logs.
Ideally we would have more data about the state the controller was in
when the error happened, or at least what caused the recovery to be
triggered.
The question here is how verbose should we be in this error scenario.
Looking at other error scenarios, in the case of a software timeout
we're dumping the controller registers. (cqhci_timeout)
Hence I thought it'd be appropriate to match that and do the same
in the CQE recovery logic.

>
> >
> >       /* Forget about errors when recovery has already been triggered */
> >       if (cq_host->recovery_halt)
>
Adrian Hunter Oct. 23, 2023, 11:38 a.m. UTC | #3
On 20/10/23 11:53, Kornel Dulęba wrote:
> On Fri, Oct 20, 2023 at 9:41 AM Adrian Hunter <adrian.hunter@intel.com> wrote:
>>
>> On 16/10/23 12:56, Kornel Dulęba wrote:
>>> There are several reasons for controller to generate an error interrupt.
>>> They include controller<->card timeout, and CRC mismatch error.
>>> Right now we only get one line in the logs stating that CQE recovery was
>>> triggered, but with no information about what caused it.
>>> To figure out what happened be more verbose and dump the registers from
>>> irq error handler logic.
>>> This matches the behaviour of the software timeout logic, see
>>> cqhci_timeout.
>>>
>>> Signed-off-by: Kornel Dulęba <korneld@chromium.org>
>>> ---
>>>  drivers/mmc/host/cqhci-core.c | 5 +++--
>>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/mmc/host/cqhci-core.c b/drivers/mmc/host/cqhci-core.c
>>> index b3d7d6d8d654..33abb4bd53b5 100644
>>> --- a/drivers/mmc/host/cqhci-core.c
>>> +++ b/drivers/mmc/host/cqhci-core.c
>>> @@ -700,8 +700,9 @@ static void cqhci_error_irq(struct mmc_host *mmc, u32 status, int cmd_error,
>>>
>>>       terri = cqhci_readl(cq_host, CQHCI_TERRI);
>>>
>>> -     pr_debug("%s: cqhci: error IRQ status: 0x%08x cmd error %d data error %d TERRI: 0x%08x\n",
>>> -              mmc_hostname(mmc), status, cmd_error, data_error, terri);
>>> +     pr_warn("%s: cqhci: error IRQ status: 0x%08x cmd error %d data error %d\n",
>>> +              mmc_hostname(mmc), status, cmd_error, data_error);
>>> +     cqhci_dumpregs(cq_host);
>>
>> For debugging, isn't dynamic debug seems more appropriate?
> 
> Dynamic debug is an option, but my personal preference would be to
> just log more info in the error handler.

Interrupt handlers can get called very rapidly, so some kind of rate
limiting should be used if the message is unconditional.  Also you need
to provide actual reasons for your preference.

For dynamic debug of the register dump, something like below is
possible.

#define cqhci_dynamic_dumpregs(cqhost) \
	_dynamic_func_call_no_desc("cqhci_dynamic_dumpregs", cqhci_dumpregs, cqhost)
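
For illustration, it could be wired up roughly like this (the fallback
definition and the call site below are just a sketch, not something from
the patch):

#ifdef CONFIG_DYNAMIC_DEBUG
#define cqhci_dynamic_dumpregs(cq_host) \
	_dynamic_func_call_no_desc("cqhci_dynamic_dumpregs", cqhci_dumpregs, cq_host)
#else
#define cqhci_dynamic_dumpregs(cq_host) do { } while (0)
#endif

	/* In cqhci_error_irq(): does nothing unless the call site is enabled
	 * at runtime via /sys/kernel/debug/dynamic_debug/control */
	cqhci_dynamic_dumpregs(cq_host);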

> To give you some background.
> We're seeing some "running CQE recovery" lines in the logs, followed
> by a dm_verity mismatch error.
> The reports come from the field, with no feasible way to reproduce the
> issue locally.

If it is a software error, some kind of error injection may well
reproduce it.  Also if it is a hardware error that only happens
during recovery, error injection could increase the likelihood of
reproducing it.
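
As a crude illustration of the software side (just a sketch, with a
made-up hook name standing in for the host driver's real CQE interrupt
handler), the glue driver could occasionally feed a fake CRC error into
cqhci_irq():

	/* Sketch only: pretend every Nth CQE interrupt saw a data CRC error
	 * so that cqhci_error_irq() and the recovery path get exercised. */
	static irqreturn_t my_cqe_irq(struct mmc_host *mmc, u32 intmask)
	{
		static unsigned int count;
		int data_error = 0;

		if (++count % 1000 == 0)
			data_error = -EILSEQ;	/* fake data CRC error */

		return cqhci_irq(mmc, intmask, 0, data_error);
	}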

> 
> I'd argue that logging only the info that CQE recovery was executed is
> not particularly helpful for someone looking into those logs.

As the comment says, that message is there because recovery reduces
performance, it is not to aid debugging per se.

> Ideally we would have more data about the state the controller was in
> when the error happened, or at least what caused the recovery to be
> triggered.
> The question here is how verbose should we be in this error scenario.
> Looking at other error scenarios, in the case of a software timeout
> we're dumping the controller registers. (cqhci_timeout)

Timeout means something is broken - either the driver, the cq engine
or the card.  On the other hand, an error interrupt is most likely a
CRC error which is not unexpected occasionally, due to thermal drift
or perhaps interference.

> Hence I thought that I'd be appropriate to match that and do the same
> in CQE recovery logic.

It needs to be consistent. There are other pr_debugs, such as:

		pr_debug("%s: cqhci: Failed to clear tasks\n",
		pr_debug("%s: cqhci: Failed to halt\n", mmc_hostname(mmc));
		pr_debug("%s: cqhci: disable / re-enable\n", mmc_hostname(mmc));

which should perhaps be treated the same.

And there are no messages for errors from the commands in
mmc_cqe_recovery().
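
For illustration only, reporting a failure of the queue-discard command
there could look roughly like this (surrounding lines paraphrased, not
quoted from core.c):

	memset(&cmd, 0, sizeof(cmd));
	cmd.opcode       = MMC_CMDQ_TASK_MGMT;
	cmd.arg          = 1; /* Discard entire queue */
	cmd.flags        = MMC_RSP_R1B | MMC_CMD_AC;
	cmd.flags       &= ~MMC_RSP_CRC; /* Ignore CRC */
	cmd.busy_timeout = MMC_CQE_RECOVERY_TIMEOUT;
	err = mmc_wait_for_cmd(host, &cmd, 0);
	if (err)
		pr_warn_ratelimited("%s: CQE recovery: discard queue failed: %d\n",
				    mmc_hostname(host), err);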

> 
>>
>>>
>>>       /* Forget about errors when recovery has already been triggered */
>>>       if (cq_host->recovery_halt)
>>
Kornel Dulęba Oct. 25, 2023, 8:01 a.m. UTC | #4
On Mon, Oct 23, 2023 at 1:38 PM Adrian Hunter <adrian.hunter@intel.com> wrote:
>
> On 20/10/23 11:53, Kornel Dulęba wrote:
> > On Fri, Oct 20, 2023 at 9:41 AM Adrian Hunter <adrian.hunter@intel.com> wrote:
> >>
> >> On 16/10/23 12:56, Kornel Dulęba wrote:
> >>> There are several reasons for controller to generate an error interrupt.
> >>> They include controller<->card timeout, and CRC mismatch error.
> >>> Right now we only get one line in the logs stating that CQE recovery was
> >>> triggered, but with no information about what caused it.
> >>> To figure out what happened be more verbose and dump the registers from
> >>> irq error handler logic.
> >>> This matches the behaviour of the software timeout logic, see
> >>> cqhci_timeout.
> >>>
> >>> Signed-off-by: Kornel Dulęba <korneld@chromium.org>
> >>> ---
> >>>  drivers/mmc/host/cqhci-core.c | 5 +++--
> >>>  1 file changed, 3 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/drivers/mmc/host/cqhci-core.c b/drivers/mmc/host/cqhci-core.c
> >>> index b3d7d6d8d654..33abb4bd53b5 100644
> >>> --- a/drivers/mmc/host/cqhci-core.c
> >>> +++ b/drivers/mmc/host/cqhci-core.c
> >>> @@ -700,8 +700,9 @@ static void cqhci_error_irq(struct mmc_host *mmc, u32 status, int cmd_error,
> >>>
> >>>       terri = cqhci_readl(cq_host, CQHCI_TERRI);
> >>>
> >>> -     pr_debug("%s: cqhci: error IRQ status: 0x%08x cmd error %d data error %d TERRI: 0x%08x\n",
> >>> -              mmc_hostname(mmc), status, cmd_error, data_error, terri);
> >>> +     pr_warn("%s: cqhci: error IRQ status: 0x%08x cmd error %d data error %d\n",
> >>> +              mmc_hostname(mmc), status, cmd_error, data_error);
> >>> +     cqhci_dumpregs(cq_host);
> >>
> >> For debugging, isn't dynamic debug seems more appropriate?
> >
> > Dynamic debug is an option, but my personal preference would be to
> > just log more info in the error handler.
>
> Interrupt handlers can get called very rapidly, so some kind of rate
> limiting should be used if the message is unconditional.  Also you need
> to provide actual reasons for your preference.
>
> For dynamic debug of the register dump, something like below is
> possible.
>
> #define cqhci_dynamic_dumpregs(cqhost) \
>         _dynamic_func_call_no_desc("cqhci_dynamic_dumpregs", cqhci_dumpregs, cqhost)
>
Fair point.
The reason I'm not a fan of using dynamic debug for this is that my
goal here is to improve the warning/error logging information that we
get from systems running in production.
I.e. if we get a lot of "running CQE recovery" messages, at the very
least I'd like to know what is causing them.
> > To give you some background.
> > We're seeing some "running CQE recovery" lines in the logs, followed
> > by a dm_verity mismatch error.
> > The reports come from the field, with no feasible way to reproduce the
> > issue locally.
>
> If it is a software error, some kind of error injection may well
> reproduce it.  Also if it is a hardware error that only happens
> during recovery, error injection could increase the likelihood of
> reproducing it.

We tried software injection and it didn't yield any results.
We're currently looking into "injecting" hw errors by using a small
burst field generator to interfere with transfers on the data line
directly.
>
> >
> > I'd argue that logging only the info that CQE recovery was executed is
> > not particularly helpful for someone looking into those logs.
>
> As the comment says, that message is there because recovery reduces
> performance, it is not to aid debugging per se.
>
> > Ideally we would have more data about the state the controller was in
> > when the error happened, or at least what caused the recovery to be
> > triggered.
> > The question here is how verbose should we be in this error scenario.
> > Looking at other error scenarios, in the case of a software timeout
> > we're dumping the controller registers. (cqhci_timeout)
>
> Timeout means something is broken - either the driver, the cq engine
> or the card.  On the other hand, an error interrupt is most likely a
> CRC error which is not unexpected occasionally, due to thermal drift
> or perhaps interference.

Right, but my point is that we don't know what triggered CQE recovery.

>
> > Hence I thought that I'd be appropriate to match that and do the same
> > in CQE recovery logic.
>
> It needs to be consistent. There are other pr_debugs, such as:
>
>                 pr_debug("%s: cqhci: Failed to clear tasks\n",
>                 pr_debug("%s: cqhci: Failed to halt\n", mmc_hostname(mmc));
>                 pr_debug("%s: cqhci: disable / re-enable\n", mmc_hostname(mmc));
>
> which should perhaps be treated the same.
>
> And there are no messages for errors from the commands in
> mmc_cqe_recovery().

How about this: as a compromise, would it be okay to just do a single
pr_warn directly from cqhci_error_irq?
We could simply promote the existing pr_debug to pr_warn at the
beginning of that function.
This would tell us what triggered the recovery. (controller timeout,
CRC mismatch)
We can also consider removing the "running CQE recovery" print for the
sake of brevity.
The only downside of this that I can see is that we'd be running the
logic from the interrupt handler directly, but I can't see an easy way
around that.
What do you think?
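
To make it concrete, the change I have in mind would be roughly the
following (rate limited per your earlier comment; illustrative only, not
an actual v2 hunk):

-	pr_debug("%s: cqhci: error IRQ status: 0x%08x cmd error %d data error %d TERRI: 0x%08x\n",
-		 mmc_hostname(mmc), status, cmd_error, data_error, terri);
+	pr_warn_ratelimited("%s: cqhci: error IRQ status: 0x%08x cmd error %d data error %d TERRI: 0x%08x\n",
+			    mmc_hostname(mmc), status, cmd_error, data_error, terri);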
>
> >
> >>
> >>>
> >>>       /* Forget about errors when recovery has already been triggered */
> >>>       if (cq_host->recovery_halt)
> >>
>
Adrian Hunter Oct. 26, 2023, 6:25 a.m. UTC | #5
On 25/10/23 11:01, Kornel Dulęba wrote:
> On Mon, Oct 23, 2023 at 1:38 PM Adrian Hunter <adrian.hunter@intel.com> wrote:
>>
>> On 20/10/23 11:53, Kornel Dulęba wrote:
>>> On Fri, Oct 20, 2023 at 9:41 AM Adrian Hunter <adrian.hunter@intel.com> wrote:
>>>>
>>>> On 16/10/23 12:56, Kornel Dulęba wrote:
>>>>> There are several reasons for controller to generate an error interrupt.
>>>>> They include controller<->card timeout, and CRC mismatch error.
>>>>> Right now we only get one line in the logs stating that CQE recovery was
>>>>> triggered, but with no information about what caused it.
>>>>> To figure out what happened be more verbose and dump the registers from
>>>>> irq error handler logic.
>>>>> This matches the behaviour of the software timeout logic, see
>>>>> cqhci_timeout.
>>>>>
>>>>> Signed-off-by: Kornel Dulęba <korneld@chromium.org>
>>>>> ---
>>>>>  drivers/mmc/host/cqhci-core.c | 5 +++--
>>>>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/drivers/mmc/host/cqhci-core.c b/drivers/mmc/host/cqhci-core.c
>>>>> index b3d7d6d8d654..33abb4bd53b5 100644
>>>>> --- a/drivers/mmc/host/cqhci-core.c
>>>>> +++ b/drivers/mmc/host/cqhci-core.c
>>>>> @@ -700,8 +700,9 @@ static void cqhci_error_irq(struct mmc_host *mmc, u32 status, int cmd_error,
>>>>>
>>>>>       terri = cqhci_readl(cq_host, CQHCI_TERRI);
>>>>>
>>>>> -     pr_debug("%s: cqhci: error IRQ status: 0x%08x cmd error %d data error %d TERRI: 0x%08x\n",
>>>>> -              mmc_hostname(mmc), status, cmd_error, data_error, terri);
>>>>> +     pr_warn("%s: cqhci: error IRQ status: 0x%08x cmd error %d data error %d\n",
>>>>> +              mmc_hostname(mmc), status, cmd_error, data_error);
>>>>> +     cqhci_dumpregs(cq_host);
>>>>
>>>> For debugging, isn't dynamic debug seems more appropriate?
>>>
>>> Dynamic debug is an option, but my personal preference would be to
>>> just log more info in the error handler.
>>
>> Interrupt handlers can get called very rapidly, so some kind of rate
>> limiting should be used if the message is unconditional.  Also you need
>> to provide actual reasons for your preference.
>>
>> For dynamic debug of the register dump, something like below is
>> possible.
>>
>> #define cqhci_dynamic_dumpregs(cqhost) \
>>         _dynamic_func_call_no_desc("cqhci_dynamic_dumpregs", cqhci_dumpregs, cqhost)
>>
> Fair point.
> The reason I'm not a fan of using dynamic debug for this is that my
> goal here is to improve the warning/error logging information that we
> get from systems running in production.
> I.e. if we get a lot of "running CQE recovery" messages, at the very
> least I'd like to know what is causing them.

So you are saying you want to collect debug information from production
systems, but don't want to use dynamic debug to do it?

>>> To give you some background.
>>> We're seeing some "running CQE recovery" lines in the logs, followed
>>> by a dm_verity mismatch error.
>>> The reports come from the field, with no feasible way to reproduce the
>>> issue locally.
>>
>> If it is a software error, some kind of error injection may well
>> reproduce it.  Also if it is a hardware error that only happens
>> during recovery, error injection could increase the likelihood of
>> reproducing it.
> 
> We tried software injection and it didn't yield any results.
> We're currently looking into "injecting" hw errors by using a small
> burst field generator to interfere with transfers on the data line
> directly.

I just tried instrumenting a driver to inject CRC errors and it
revealed several CQE recovery issues, including spurious TCN for
tag 31.  I will send some patches when they are ready.

>>
>>>
>>> I'd argue that logging only the info that CQE recovery was executed is
>>> not particularly helpful for someone looking into those logs.
>>
>> As the comment says, that message is there because recovery reduces
>> performance, it is not to aid debugging per se.
>>
>>> Ideally we would have more data about the state the controller was in
>>> when the error happened, or at least what caused the recovery to be
>>> triggered.
>>> The question here is how verbose should we be in this error scenario.
>>> Looking at other error scenarios, in the case of a software timeout
>>> we're dumping the controller registers. (cqhci_timeout)
>>
>> Timeout means something is broken - either the driver, the cq engine
>> or the card.  On the other hand, an error interrupt is most likely a
>> CRC error which is not unexpected occasionally, due to thermal drift
>> or perhaps interference.
> 
> Right, but my point is that we don't know what triggered CQE recovery.

True, although probably a CRC error.

> 
>>
>>> Hence I thought that I'd be appropriate to match that and do the same
>>> in CQE recovery logic.
>>
>> It needs to be consistent. There are other pr_debugs, such as:
>>
>>                 pr_debug("%s: cqhci: Failed to clear tasks\n",
>>                 pr_debug("%s: cqhci: Failed to halt\n", mmc_hostname(mmc));
>>                 pr_debug("%s: cqhci: disable / re-enable\n", mmc_hostname(mmc));
>>
>> which should perhaps be treated the same.
>>
>> And there are no messages for errors from the commands in
>> mmc_cqe_recovery().
> 
> How about this.
> As a compromise would it be okay to just do a single pr_warn directly
> from cqhci_error_irq.

Sure, printk_ratelimited() or __ratelimit()
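
E.g. (illustrative only, with a made-up ratelimit state name),
__ratelimit() would also let you gate the register dump, not just the
message:

	#include <linux/ratelimit.h>

	static DEFINE_RATELIMIT_STATE(cqhci_err_rs, DEFAULT_RATELIMIT_INTERVAL,
				      DEFAULT_RATELIMIT_BURST);

	if (__ratelimit(&cqhci_err_rs)) {
		pr_warn("%s: cqhci: error IRQ status: 0x%08x cmd error %d data error %d\n",
			mmc_hostname(mmc), status, cmd_error, data_error);
		cqhci_dumpregs(cq_host);
	}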

> We could simply promote the existing pr_debug to pr_warn at the
> beginning of that function.
> This would tell us what triggered the recovery. (controller timeout,
> CRC mismatch)
> We can also consider removing the "running CQE recovery" print for the
> sake of brevity.

No, that serves a different purpose.

> The only downside of this that I can see is that we'd be running the
> logic from the interrupt handler directly, but I can't see an easy way
> around that.
> What do you think?

Should be OK with rate limiting.

>>
>>>
>>>>
>>>>>
>>>>>       /* Forget about errors when recovery has already been triggered */
>>>>>       if (cq_host->recovery_halt)
>>>>
>>
Kornel Dulęba Oct. 26, 2023, 7:56 a.m. UTC | #6
On Thu, Oct 26, 2023 at 8:25 AM Adrian Hunter <adrian.hunter@intel.com> wrote:
>
> On 25/10/23 11:01, Kornel Dulęba wrote:
> > On Mon, Oct 23, 2023 at 1:38 PM Adrian Hunter <adrian.hunter@intel.com> wrote:
> >>
> >> On 20/10/23 11:53, Kornel Dulęba wrote:
> >>> On Fri, Oct 20, 2023 at 9:41 AM Adrian Hunter <adrian.hunter@intel.com> wrote:
> >>>>
> >>>> On 16/10/23 12:56, Kornel Dulęba wrote:
> >>>>> There are several reasons for controller to generate an error interrupt.
> >>>>> They include controller<->card timeout, and CRC mismatch error.
> >>>>> Right now we only get one line in the logs stating that CQE recovery was
> >>>>> triggered, but with no information about what caused it.
> >>>>> To figure out what happened be more verbose and dump the registers from
> >>>>> irq error handler logic.
> >>>>> This matches the behaviour of the software timeout logic, see
> >>>>> cqhci_timeout.
> >>>>>
> >>>>> Signed-off-by: Kornel Dulęba <korneld@chromium.org>
> >>>>> ---
> >>>>>  drivers/mmc/host/cqhci-core.c | 5 +++--
> >>>>>  1 file changed, 3 insertions(+), 2 deletions(-)
> >>>>>
> >>>>> diff --git a/drivers/mmc/host/cqhci-core.c b/drivers/mmc/host/cqhci-core.c
> >>>>> index b3d7d6d8d654..33abb4bd53b5 100644
> >>>>> --- a/drivers/mmc/host/cqhci-core.c
> >>>>> +++ b/drivers/mmc/host/cqhci-core.c
> >>>>> @@ -700,8 +700,9 @@ static void cqhci_error_irq(struct mmc_host *mmc, u32 status, int cmd_error,
> >>>>>
> >>>>>       terri = cqhci_readl(cq_host, CQHCI_TERRI);
> >>>>>
> >>>>> -     pr_debug("%s: cqhci: error IRQ status: 0x%08x cmd error %d data error %d TERRI: 0x%08x\n",
> >>>>> -              mmc_hostname(mmc), status, cmd_error, data_error, terri);
> >>>>> +     pr_warn("%s: cqhci: error IRQ status: 0x%08x cmd error %d data error %d\n",
> >>>>> +              mmc_hostname(mmc), status, cmd_error, data_error);
> >>>>> +     cqhci_dumpregs(cq_host);
> >>>>
> >>>> For debugging, isn't dynamic debug seems more appropriate?
> >>>
> >>> Dynamic debug is an option, but my personal preference would be to
> >>> just log more info in the error handler.
> >>
> >> Interrupt handlers can get called very rapidly, so some kind of rate
> >> limiting should be used if the message is unconditional.  Also you need
> >> to provide actual reasons for your preference.
> >>
> >> For dynamic debug of the register dump, something like below is
> >> possible.
> >>
> >> #define cqhci_dynamic_dumpregs(cqhost) \
> >>         _dynamic_func_call_no_desc("cqhci_dynamic_dumpregs", cqhci_dumpregs, cqhost)
> >>
> > Fair point.
> > The reason I'm not a fan of using dynamic debug for this is that my
> > goal here is to improve the warning/error logging information that we
> > get from systems running in production.
> > I.e. if we get a lot of "running CQE recovery" messages, at the very
> > least I'd like to know what is causing them.
>
> So you are saying you want to collect debug information from production
> systems, but don't want to use dynamic debug to do it?
>
> >>> To give you some background.
> >>> We're seeing some "running CQE recovery" lines in the logs, followed
> >>> by a dm_verity mismatch error.
> >>> The reports come from the field, with no feasible way to reproduce the
> >>> issue locally.
> >>
> >> If it is a software error, some kind of error injection may well
> >> reproduce it.  Also if it is a hardware error that only happens
> >> during recovery, error injection could increase the likelihood of
> >> reproducing it.
> >
> > We tried software injection and it didn't yield any results.
> > We're currently looking into "injecting" hw errors by using a small
> > burst field generator to interfere with transfers on the data line
> > directly.
>
> I just tried instrumenting a driver to inject CRC errors and it
> revealed several CQE recovery issues, including spurious TCN for
> tag 31.  I will send some patches when they are ready.

Sorry, what I meant by "it didn't yield results" is that it didn't
trigger the dm-verity error that we're seeing in production.
With SW injection there are two potential issues that come to my mind:

1. In cqhci_error_irq, when TERRI is not valid, only a single,
"random" task is marked as bad.
Then in cqhci_recover_mrq we're marking all pending requests as done.
For data transfers this is somewhat benign, as it will return with
bytes_xfered=0.
IIUC this will then cause the upper layer to re-enqueue this request.
The bigger problem is a CMD only mrq, which will be mistakenly marked
as completed successfully.

2. As for the spurious task completion warning.
I initially thought that it was benign.
The check for !mrq is done before checking if we're currently doing recovery.
So if it's called right at the end of recovery, right after
cqhci_recover_mrqs is executed, that would explain it.
That being said, if that irq handler runs right after the recovery is
finished, we'll end up with a race where a new request that was just
enqueued might be mistakenly marked as done.
This would explain the dm-verity errors we're seeing.

>
> >>
> >>>
> >>> I'd argue that logging only the info that CQE recovery was executed is
> >>> not particularly helpful for someone looking into those logs.
> >>
> >> As the comment says, that message is there because recovery reduces
> >> performance, it is not to aid debugging per se.
> >>
> >>> Ideally we would have more data about the state the controller was in
> >>> when the error happened, or at least what caused the recovery to be
> >>> triggered.
> >>> The question here is how verbose should we be in this error scenario.
> >>> Looking at other error scenarios, in the case of a software timeout
> >>> we're dumping the controller registers. (cqhci_timeout)
> >>
> >> Timeout means something is broken - either the driver, the cq engine
> >> or the card.  On the other hand, an error interrupt is most likely a
> >> CRC error which is not unexpected occasionally, due to thermal drift
> >> or perhaps interference.
> >
> > Right, but my point is that we don't know what triggered CQE recovery.
>
> True, although probably a CRC error.
>
> >
> >>
> >>> Hence I thought that I'd be appropriate to match that and do the same
> >>> in CQE recovery logic.
> >>
> >> It needs to be consistent. There are other pr_debugs, such as:
> >>
> >>                 pr_debug("%s: cqhci: Failed to clear tasks\n",
> >>                 pr_debug("%s: cqhci: Failed to halt\n", mmc_hostname(mmc));
> >>                 pr_debug("%s: cqhci: disable / re-enable\n", mmc_hostname(mmc));
> >>
> >> which should perhaps be treated the same.
> >>
> >> And there are no messages for errors from the commands in
> >> mmc_cqe_recovery().
> >
> > How about this.
> > As a compromise would it be okay to just do a single pr_warn directly
> > from cqhci_error_irq.
>
> Sure, printk_ratelimited() or __ratelimit()
>
> > We could simply promote the existing pr_debug to pr_warn at the
> > beginning of that function.
> > This would tell us what triggered the recovery. (controller timeout,
> > CRC mismatch)
> > We can also consider removing the "running CQE recovery" print for the
> > sake of brevity.
>
> No, that serves a different purpose.
>
> > The only downside of this that I can see is that we'd be running the
> > logic from the interrupt handler directly, but I can't see an easy way
> > around that.
> > What do you think?
>
> Should be OK with rate limiting.

OK, I'll look into the rate limiting and will send a v2.

>
> >>
> >>>
> >>>>
> >>>>>
> >>>>>       /* Forget about errors when recovery has already been triggered */
> >>>>>       if (cq_host->recovery_halt)
> >>>>
> >>
>
Adrian Hunter Oct. 26, 2023, 8:14 a.m. UTC | #7
On 26/10/23 10:56, Kornel Dulęba wrote:
> On Thu, Oct 26, 2023 at 8:25 AM Adrian Hunter <adrian.hunter@intel.com> wrote:
>>
>> On 25/10/23 11:01, Kornel Dulęba wrote:
>>> On Mon, Oct 23, 2023 at 1:38 PM Adrian Hunter <adrian.hunter@intel.com> wrote:
>>>>
>>>> On 20/10/23 11:53, Kornel Dulęba wrote:
>>>>> On Fri, Oct 20, 2023 at 9:41 AM Adrian Hunter <adrian.hunter@intel.com> wrote:
>>>>>>
>>>>>> On 16/10/23 12:56, Kornel Dulęba wrote:
>>>>>>> There are several reasons for controller to generate an error interrupt.
>>>>>>> They include controller<->card timeout, and CRC mismatch error.
>>>>>>> Right now we only get one line in the logs stating that CQE recovery was
>>>>>>> triggered, but with no information about what caused it.
>>>>>>> To figure out what happened be more verbose and dump the registers from
>>>>>>> irq error handler logic.
>>>>>>> This matches the behaviour of the software timeout logic, see
>>>>>>> cqhci_timeout.
>>>>>>>
>>>>>>> Signed-off-by: Kornel Dulęba <korneld@chromium.org>
>>>>>>> ---
>>>>>>>  drivers/mmc/host/cqhci-core.c | 5 +++--
>>>>>>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/mmc/host/cqhci-core.c b/drivers/mmc/host/cqhci-core.c
>>>>>>> index b3d7d6d8d654..33abb4bd53b5 100644
>>>>>>> --- a/drivers/mmc/host/cqhci-core.c
>>>>>>> +++ b/drivers/mmc/host/cqhci-core.c
>>>>>>> @@ -700,8 +700,9 @@ static void cqhci_error_irq(struct mmc_host *mmc, u32 status, int cmd_error,
>>>>>>>
>>>>>>>       terri = cqhci_readl(cq_host, CQHCI_TERRI);
>>>>>>>
>>>>>>> -     pr_debug("%s: cqhci: error IRQ status: 0x%08x cmd error %d data error %d TERRI: 0x%08x\n",
>>>>>>> -              mmc_hostname(mmc), status, cmd_error, data_error, terri);
>>>>>>> +     pr_warn("%s: cqhci: error IRQ status: 0x%08x cmd error %d data error %d\n",
>>>>>>> +              mmc_hostname(mmc), status, cmd_error, data_error);
>>>>>>> +     cqhci_dumpregs(cq_host);
>>>>>>
>>>>>> For debugging, isn't dynamic debug seems more appropriate?
>>>>>
>>>>> Dynamic debug is an option, but my personal preference would be to
>>>>> just log more info in the error handler.
>>>>
>>>> Interrupt handlers can get called very rapidly, so some kind of rate
>>>> limiting should be used if the message is unconditional.  Also you need
>>>> to provide actual reasons for your preference.
>>>>
>>>> For dynamic debug of the register dump, something like below is
>>>> possible.
>>>>
>>>> #define cqhci_dynamic_dumpregs(cqhost) \
>>>>         _dynamic_func_call_no_desc("cqhci_dynamic_dumpregs", cqhci_dumpregs, cqhost)
>>>>
>>> Fair point.
>>> The reason I'm not a fan of using dynamic debug for this is that my
>>> goal here is to improve the warning/error logging information that we
>>> get from systems running in production.
>>> I.e. if we get a lot of "running CQE recovery" messages, at the very
>>> least I'd like to know what is causing them.
>>
>> So you are saying you want to collect debug information from production
>> systems, but don't want to use dynamic debug to do it?
>>
>>>>> To give you some background.
>>>>> We're seeing some "running CQE recovery" lines in the logs, followed
>>>>> by a dm_verity mismatch error.
>>>>> The reports come from the field, with no feasible way to reproduce the
>>>>> issue locally.
>>>>
>>>> If it is a software error, some kind of error injection may well
>>>> reproduce it.  Also if it is a hardware error that only happens
>>>> during recovery, error injection could increase the likelihood of
>>>> reproducing it.
>>>
>>> We tried software injection and it didn't yield any results.
>>> We're currently looking into "injecting" hw errors by using a small
>>> burst field generator to interfere with transfers on the data line
>>> directly.
>>
>> I just tried instrumenting a driver to inject CRC errors and it
>> revealed several CQE recovery issues, including spurious TCN for
>> tag 31.  I will send some patches when they are ready.
> 
> Sorry, what I meant by it didn't yield results is that Ii didn't
> trigger the dm-verity error that we're seeing on production.
> With SW injection there are two potential issues that come to my mind:
> 
> 1. In the cqhci_error_irq when TERRI is not valid only a single,
> "random" task is marked as bad.
> Then in cqhci_recover_mrq we're marking all pending requests as done.
> For data transfers this is somewhat bening as it will return with
> bytes_xfered=0.
> IIUC this will then cause the upper layer to re-enqueue this request.

Yes

> The bigger problem is a CMD only mrq, which will be mistakenly marked
> as completed successfully.

I noticed that also.  Notably the only non-data CMD is cache flush.

There are several other issues, but patches will describe
them better.

> 
> 2. As for the spurious task completion warning.
> I initially thought that it was bening.
> The check for !mrq is done before checking if we're currently doing recovery.
> So if it's called just right at the end of recovery, right after the
> cqhci_recover_mrqs is executed that would explain it.
> With that being said if that irq handler is run right after the
> recovery is finished we'll end up with a race where a new request,
> that was just enqueued, might be mistakenly marked as done.
> This would explain the dm-verity errors we're seeing.
> 
>>
>>>>
>>>>>
>>>>> I'd argue that logging only the info that CQE recovery was executed is
>>>>> not particularly helpful for someone looking into those logs.
>>>>
>>>> As the comment says, that message is there because recovery reduces
>>>> performance, it is not to aid debugging per se.
>>>>
>>>>> Ideally we would have more data about the state the controller was in
>>>>> when the error happened, or at least what caused the recovery to be
>>>>> triggered.
>>>>> The question here is how verbose should we be in this error scenario.
>>>>> Looking at other error scenarios, in the case of a software timeout
>>>>> we're dumping the controller registers. (cqhci_timeout)
>>>>
>>>> Timeout means something is broken - either the driver, the cq engine
>>>> or the card.  On the other hand, an error interrupt is most likely a
>>>> CRC error which is not unexpected occasionally, due to thermal drift
>>>> or perhaps interference.
>>>
>>> Right, but my point is that we don't know what triggered CQE recovery.
>>
>> True, although probably a CRC error.
>>
>>>
>>>>
>>>>> Hence I thought that I'd be appropriate to match that and do the same
>>>>> in CQE recovery logic.
>>>>
>>>> It needs to be consistent. There are other pr_debugs, such as:
>>>>
>>>>                 pr_debug("%s: cqhci: Failed to clear tasks\n",
>>>>                 pr_debug("%s: cqhci: Failed to halt\n", mmc_hostname(mmc));
>>>>                 pr_debug("%s: cqhci: disable / re-enable\n", mmc_hostname(mmc));
>>>>
>>>> which should perhaps be treated the same.
>>>>
>>>> And there are no messages for errors from the commands in
>>>> mmc_cqe_recovery().
>>>
>>> How about this.
>>> As a compromise would it be okay to just do a single pr_warn directly
>>> from cqhci_error_irq.
>>
>> Sure, printk_ratelimited() or __ratelimit()
>>
>>> We could simply promote the existing pr_debug to pr_warn at the
>>> beginning of that function.
>>> This would tell us what triggered the recovery. (controller timeout,
>>> CRC mismatch)
>>> We can also consider removing the "running CQE recovery" print for the
>>> sake of brevity.
>>
>> No, that serves a different purpose.
>>
>>> The only downside of this that I can see is that we'd be running the
>>> logic from the interrupt handler directly, but I can't see an easy way
>>> around that.
>>> What do you think?
>>
>> Should be OK with rate limiting.
> 
> OK, I'll look into the rate limiting and will send a v2.
> 
>>
>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>>       /* Forget about errors when recovery has already been triggered */
>>>>>>>       if (cq_host->recovery_halt)
>>>>>>
>>>>
>>
Kornel Dulęba Oct. 26, 2023, 8:29 a.m. UTC | #8
On Thu, Oct 26, 2023 at 10:14 AM Adrian Hunter <adrian.hunter@intel.com> wrote:
>
> On 26/10/23 10:56, Kornel Dulęba wrote:
> > On Thu, Oct 26, 2023 at 8:25 AM Adrian Hunter <adrian.hunter@intel.com> wrote:
> >>
> >> On 25/10/23 11:01, Kornel Dulęba wrote:
> >>> On Mon, Oct 23, 2023 at 1:38 PM Adrian Hunter <adrian.hunter@intel.com> wrote:
> >>>>
> >>>> On 20/10/23 11:53, Kornel Dulęba wrote:
> >>>>> On Fri, Oct 20, 2023 at 9:41 AM Adrian Hunter <adrian.hunter@intel.com> wrote:
> >>>>>>
> >>>>>> On 16/10/23 12:56, Kornel Dulęba wrote:
> >>>>>>> There are several reasons for controller to generate an error interrupt.
> >>>>>>> They include controller<->card timeout, and CRC mismatch error.
> >>>>>>> Right now we only get one line in the logs stating that CQE recovery was
> >>>>>>> triggered, but with no information about what caused it.
> >>>>>>> To figure out what happened be more verbose and dump the registers from
> >>>>>>> irq error handler logic.
> >>>>>>> This matches the behaviour of the software timeout logic, see
> >>>>>>> cqhci_timeout.
> >>>>>>>
> >>>>>>> Signed-off-by: Kornel Dulęba <korneld@chromium.org>
> >>>>>>> ---
> >>>>>>>  drivers/mmc/host/cqhci-core.c | 5 +++--
> >>>>>>>  1 file changed, 3 insertions(+), 2 deletions(-)
> >>>>>>>
> >>>>>>> diff --git a/drivers/mmc/host/cqhci-core.c b/drivers/mmc/host/cqhci-core.c
> >>>>>>> index b3d7d6d8d654..33abb4bd53b5 100644
> >>>>>>> --- a/drivers/mmc/host/cqhci-core.c
> >>>>>>> +++ b/drivers/mmc/host/cqhci-core.c
> >>>>>>> @@ -700,8 +700,9 @@ static void cqhci_error_irq(struct mmc_host *mmc, u32 status, int cmd_error,
> >>>>>>>
> >>>>>>>       terri = cqhci_readl(cq_host, CQHCI_TERRI);
> >>>>>>>
> >>>>>>> -     pr_debug("%s: cqhci: error IRQ status: 0x%08x cmd error %d data error %d TERRI: 0x%08x\n",
> >>>>>>> -              mmc_hostname(mmc), status, cmd_error, data_error, terri);
> >>>>>>> +     pr_warn("%s: cqhci: error IRQ status: 0x%08x cmd error %d data error %d\n",
> >>>>>>> +              mmc_hostname(mmc), status, cmd_error, data_error);
> >>>>>>> +     cqhci_dumpregs(cq_host);
> >>>>>>
> >>>>>> For debugging, isn't dynamic debug seems more appropriate?
> >>>>>
> >>>>> Dynamic debug is an option, but my personal preference would be to
> >>>>> just log more info in the error handler.
> >>>>
> >>>> Interrupt handlers can get called very rapidly, so some kind of rate
> >>>> limiting should be used if the message is unconditional.  Also you need
> >>>> to provide actual reasons for your preference.
> >>>>
> >>>> For dynamic debug of the register dump, something like below is
> >>>> possible.
> >>>>
> >>>> #define cqhci_dynamic_dumpregs(cqhost) \
> >>>>         _dynamic_func_call_no_desc("cqhci_dynamic_dumpregs", cqhci_dumpregs, cqhost)
> >>>>
> >>> Fair point.
> >>> The reason I'm not a fan of using dynamic debug for this is that my
> >>> goal here is to improve the warning/error logging information that we
> >>> get from systems running in production.
> >>> I.e. if we get a lot of "running CQE recovery" messages, at the very
> >>> least I'd like to know what is causing them.
> >>
> >> So you are saying you want to collect debug information from production
> >> systems, but don't want to use dynamic debug to do it?
> >>
> >>>>> To give you some background.
> >>>>> We're seeing some "running CQE recovery" lines in the logs, followed
> >>>>> by a dm_verity mismatch error.
> >>>>> The reports come from the field, with no feasible way to reproduce the
> >>>>> issue locally.
> >>>>
> >>>> If it is a software error, some kind of error injection may well
> >>>> reproduce it.  Also if it is a hardware error that only happens
> >>>> during recovery, error injection could increase the likelihood of
> >>>> reproducing it.
> >>>
> >>> We tried software injection and it didn't yield any results.
> >>> We're currently looking into "injecting" hw errors by using a small
> >>> burst field generator to interfere with transfers on the data line
> >>> directly.
> >>
> >> I just tried instrumenting a driver to inject CRC errors and it
> >> revealed several CQE recovery issues, including spurious TCN for
> >> tag 31.  I will send some patches when they are ready.
> >
> > Sorry, what I meant by it didn't yield results is that Ii didn't
> > trigger the dm-verity error that we're seeing on production.
> > With SW injection there are two potential issues that come to my mind:
> >
> > 1. In the cqhci_error_irq when TERRI is not valid only a single,
> > "random" task is marked as bad.
> > Then in cqhci_recover_mrq we're marking all pending requests as done.
> > For data transfers this is somewhat bening as it will return with
> > bytes_xfered=0.
> > IIUC this will then cause the upper layer to re-enqueue this request.
>
> Yes
>
> > The bigger problem is a CMD only mrq, which will be mistakenly marked
> > as completed successfully.
>
> I noticed that also.  Notably the only non-data CMD is cache flush.
>
> There are several other issues, but patches will describe
> them better.

Sure. :)

jfyi, I've just managed to inject CRC errors by using a burst field generator.
(Langer P23 if you're interested.)
I'm using it by touching the D0 MMC line directly, and it yields
surprisingly good results.
I've changed the spurious TCN WARN_ONCE to pr_warn and got the following:

[   71.885698] mmc1: cqhci: error IRQ status: 0x00000000 cmd error 0 data error -84 TERRI: 0x972e0000
[   71.885730] mmc1: running CQE recovery
[   71.888135] cqhci_recovery_finish: TCN: 0x00000000
[   71.888141] mmc1: cqhci: recovery done
[   71.888223] mmc1: cqhci: spurious TCN for tag 23
(...)
[   95.558736] mmc1: cqhci: error IRQ status: 0x00000000 cmd error 0 data error -84 TERRI: 0x822e0000
[   95.558768] mmc1: running CQE recovery
[   95.561073] cqhci_recovery_finish: TCN: 0x00000000
[   95.561078] mmc1: cqhci: recovery done
[   95.561288] device-mapper: verity: 179:3: data block 712181 is corrupted

Now I get a spurious TCN after every recovery, with the only exception
being the one that ends up with the dm-verity error.
So it'd seem that there's a race in which the "spurious" TCN hits a
pending request enqueued right after recovery was completed.
I'm currently looking into how to fix it, but if you beat me to it I
can also test your patches and see if they fix the dm-verity issue.

>
> >
> > 2. As for the spurious task completion warning.
> > I initially thought that it was bening.
> > The check for !mrq is done before checking if we're currently doing recovery.
> > So if it's called just right at the end of recovery, right after the
> > cqhci_recover_mrqs is executed that would explain it.
> > With that being said if that irq handler is run right after the
> > recovery is finished we'll end up with a race where a new request,
> > that was just enqueued, might be mistakenly marked as done.
> > This would explain the dm-verity errors we're seeing.
> >
> >>
> >>>>
> >>>>>
> >>>>> I'd argue that logging only the info that CQE recovery was executed is
> >>>>> not particularly helpful for someone looking into those logs.
> >>>>
> >>>> As the comment says, that message is there because recovery reduces
> >>>> performance, it is not to aid debugging per se.
> >>>>
> >>>>> Ideally we would have more data about the state the controller was in
> >>>>> when the error happened, or at least what caused the recovery to be
> >>>>> triggered.
> >>>>> The question here is how verbose should we be in this error scenario.
> >>>>> Looking at other error scenarios, in the case of a software timeout
> >>>>> we're dumping the controller registers. (cqhci_timeout)
> >>>>
> >>>> Timeout means something is broken - either the driver, the cq engine
> >>>> or the card.  On the other hand, an error interrupt is most likely a
> >>>> CRC error which is not unexpected occasionally, due to thermal drift
> >>>> or perhaps interference.
> >>>
> >>> Right, but my point is that we don't know what triggered CQE recovery.
> >>
> >> True, although probably a CRC error.
> >>
> >>>
> >>>>
> >>>>> Hence I thought that I'd be appropriate to match that and do the same
> >>>>> in CQE recovery logic.
> >>>>
> >>>> It needs to be consistent. There are other pr_debugs, such as:
> >>>>
> >>>>                 pr_debug("%s: cqhci: Failed to clear tasks\n",
> >>>>                 pr_debug("%s: cqhci: Failed to halt\n", mmc_hostname(mmc));
> >>>>                 pr_debug("%s: cqhci: disable / re-enable\n", mmc_hostname(mmc));
> >>>>
> >>>> which should perhaps be treated the same.
> >>>>
> >>>> And there are no messages for errors from the commands in
> >>>> mmc_cqe_recovery().
> >>>
> >>> How about this.
> >>> As a compromise would it be okay to just do a single pr_warn directly
> >>> from cqhci_error_irq.
> >>
> >> Sure, printk_ratelimited() or __ratelimit()
> >>
> >>> We could simply promote the existing pr_debug to pr_warn at the
> >>> beginning of that function.
> >>> This would tell us what triggered the recovery. (controller timeout,
> >>> CRC mismatch)
> >>> We can also consider removing the "running CQE recovery" print for the
> >>> sake of brevity.
> >>
> >> No, that serves a different purpose.
> >>
> >>> The only downside of this that I can see is that we'd be running the
> >>> logic from the interrupt handler directly, but I can't see an easy way
> >>> around that.
> >>> What do you think?
> >>
> >> Should be OK with rate limiting.
> >
> > OK, I'll look into the rate limiting and will send a v2.
> >
> >>
> >>>>
> >>>>>
> >>>>>>
> >>>>>>>
> >>>>>>>       /* Forget about errors when recovery has already been triggered */
> >>>>>>>       if (cq_host->recovery_halt)
> >>>>>>
> >>>>
> >>
>
Adrian Hunter Oct. 26, 2023, 8:51 a.m. UTC | #9
On 26/10/23 11:29, Kornel Dulęba wrote:
> On Thu, Oct 26, 2023 at 10:14 AM Adrian Hunter <adrian.hunter@intel.com> wrote:
>>
>> On 26/10/23 10:56, Kornel Dulęba wrote:
>>> On Thu, Oct 26, 2023 at 8:25 AM Adrian Hunter <adrian.hunter@intel.com> wrote:
>>>>
>>>> On 25/10/23 11:01, Kornel Dulęba wrote:
>>>>> On Mon, Oct 23, 2023 at 1:38 PM Adrian Hunter <adrian.hunter@intel.com> wrote:
>>>>>>
>>>>>> On 20/10/23 11:53, Kornel Dulęba wrote:
>>>>>>> On Fri, Oct 20, 2023 at 9:41 AM Adrian Hunter <adrian.hunter@intel.com> wrote:
>>>>>>>>
>>>>>>>> On 16/10/23 12:56, Kornel Dulęba wrote:
>>>>>>>>> There are several reasons for controller to generate an error interrupt.
>>>>>>>>> They include controller<->card timeout, and CRC mismatch error.
>>>>>>>>> Right now we only get one line in the logs stating that CQE recovery was
>>>>>>>>> triggered, but with no information about what caused it.
>>>>>>>>> To figure out what happened be more verbose and dump the registers from
>>>>>>>>> irq error handler logic.
>>>>>>>>> This matches the behaviour of the software timeout logic, see
>>>>>>>>> cqhci_timeout.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Kornel Dulęba <korneld@chromium.org>
>>>>>>>>> ---
>>>>>>>>>  drivers/mmc/host/cqhci-core.c | 5 +++--
>>>>>>>>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/mmc/host/cqhci-core.c b/drivers/mmc/host/cqhci-core.c
>>>>>>>>> index b3d7d6d8d654..33abb4bd53b5 100644
>>>>>>>>> --- a/drivers/mmc/host/cqhci-core.c
>>>>>>>>> +++ b/drivers/mmc/host/cqhci-core.c
>>>>>>>>> @@ -700,8 +700,9 @@ static void cqhci_error_irq(struct mmc_host *mmc, u32 status, int cmd_error,
>>>>>>>>>
>>>>>>>>>       terri = cqhci_readl(cq_host, CQHCI_TERRI);
>>>>>>>>>
>>>>>>>>> -     pr_debug("%s: cqhci: error IRQ status: 0x%08x cmd error %d data error %d TERRI: 0x%08x\n",
>>>>>>>>> -              mmc_hostname(mmc), status, cmd_error, data_error, terri);
>>>>>>>>> +     pr_warn("%s: cqhci: error IRQ status: 0x%08x cmd error %d data error %d\n",
>>>>>>>>> +              mmc_hostname(mmc), status, cmd_error, data_error);
>>>>>>>>> +     cqhci_dumpregs(cq_host);
>>>>>>>>
>>>>>>>> For debugging, isn't dynamic debug seems more appropriate?
>>>>>>>
>>>>>>> Dynamic debug is an option, but my personal preference would be to
>>>>>>> just log more info in the error handler.
>>>>>>
>>>>>> Interrupt handlers can get called very rapidly, so some kind of rate
>>>>>> limiting should be used if the message is unconditional.  Also you need
>>>>>> to provide actual reasons for your preference.
>>>>>>
>>>>>> For dynamic debug of the register dump, something like below is
>>>>>> possible.
>>>>>>
>>>>>> #define cqhci_dynamic_dumpregs(cqhost) \
>>>>>>         _dynamic_func_call_no_desc("cqhci_dynamic_dumpregs", cqhci_dumpregs, cqhost)
>>>>>>
>>>>> Fair point.
>>>>> The reason I'm not a fan of using dynamic debug for this is that my
>>>>> goal here is to improve the warning/error logging information that we
>>>>> get from systems running in production.
>>>>> I.e. if we get a lot of "running CQE recovery" messages, at the very
>>>>> least I'd like to know what is causing them.
>>>>
>>>> So you are saying you want to collect debug information from production
>>>> systems, but don't want to use dynamic debug to do it?
>>>>
>>>>>>> To give you some background.
>>>>>>> We're seeing some "running CQE recovery" lines in the logs, followed
>>>>>>> by a dm_verity mismatch error.
>>>>>>> The reports come from the field, with no feasible way to reproduce the
>>>>>>> issue locally.
>>>>>>
>>>>>> If it is a software error, some kind of error injection may well
>>>>>> reproduce it.  Also if it is a hardware error that only happens
>>>>>> during recovery, error injection could increase the likelihood of
>>>>>> reproducing it.
>>>>>
>>>>> We tried software injection and it didn't yield any results.
>>>>> We're currently looking into "injecting" hw errors by using a small
>>>>> burst field generator to interfere with transfers on the data line
>>>>> directly.
>>>>
>>>> I just tried instrumenting a driver to inject CRC errors and it
>>>> revealed several CQE recovery issues, including spurious TCN for
>>>> tag 31.  I will send some patches when they are ready.
>>>
>>> Sorry, what I meant by it didn't yield results is that Ii didn't
>>> trigger the dm-verity error that we're seeing on production.
>>> With SW injection there are two potential issues that come to my mind:
>>>
>>> 1. In the cqhci_error_irq when TERRI is not valid only a single,
>>> "random" task is marked as bad.
>>> Then in cqhci_recover_mrq we're marking all pending requests as done.
>>> For data transfers this is somewhat bening as it will return with
>>> bytes_xfered=0.
>>> IIUC this will then cause the upper layer to re-enqueue this request.
>>
>> Yes
>>
>>> The bigger problem is a CMD only mrq, which will be mistakenly marked
>>> as completed successfully.
>>
>> I noticed that also.  Notably the only non-data CMD is cache flush.
>>
>> There are several other issues, but patches will describe
>> them better.
> 
> Sure. :)
> 
> jfyi, I've just managed to inject CRC errors by using a burst field generator.
> (Langer P23 if you're interested.)
> I'm using it by touching the D0 MMC line directly, and it yields
> surprisingly good results.
> I've changed the spurious TCN WARN_ONCE to pr_warn and got the following:
> 
> [   71.885698] mmc1: cqhci: error IRQ status: 0x00000000 cmd error 0
> data error -84 TERRI: 0x972e0000
> [   71.885730] mmc1: running CQE recovery
> [   71.888135] cqhci_recovery_finish: TCN: 0x00000000
> [   71.888141] mmc1: cqhci: recovery done
> [   71.888223] mmc1: cqhci: spurious TCN for tag 23
> (...)
> [   95.558736] mmc1: cqhci: error IRQ status: 0x00000000 cmd error 0
> data error -84 TERRI: 0x822e0000
> [   95.558768] mmc1: running CQE recovery
> [   95.561073] cqhci_recovery_finish: TCN: 0x00000000
> [   95.561078] mmc1: cqhci: recovery done
> [   95.561288] device-mapper: verity: 179:3: data block 712181 is corrupted
> 
> Now I get a spurious TCN after every recovery, with the only exception
> being the one that ends up with dm-verity error.
> So it'd seem that there's a race in which the "spurious" TCN hits a
> pending request, enqueued right after recovery was completed.
> I'm currently looking into how to fix it, but if you beat me to it I
> can also test your patches and see if it fixes the dm-verity issue.

OK, but here are some hacks to try:

diff --git a/drivers/mmc/core/block.c b/drivers/mmc/core/block.c
index 3a8f27c3e310..5eb98da933d5 100644
--- a/drivers/mmc/core/block.c
+++ b/drivers/mmc/core/block.c
@@ -2379,6 +2379,9 @@ enum mmc_issued mmc_blk_mq_issue_rq(struct mmc_queue *mq, struct request *req)
 				blk_mq_end_request(req, BLK_STS_OK);
 				return MMC_REQ_FINISHED;
 			}
+			ret = mmc_blk_wait_for_idle(mq, host);
+			if (ret)
+				return MMC_REQ_BUSY;
 			ret = mmc_blk_cqe_issue_flush(mq, req);
 			break;
 		case REQ_OP_READ:
diff --git a/drivers/mmc/core/core.c b/drivers/mmc/core/core.c
index 3d3e0ca52614..8ac1efca2be6 100644
--- a/drivers/mmc/core/core.c
+++ b/drivers/mmc/core/core.c
@@ -563,6 +563,16 @@ int mmc_cqe_recovery(struct mmc_host *host)
 
 	host->cqe_ops->cqe_recovery_finish(host);
 
+	if (err) {
+		memset(&cmd, 0, sizeof(cmd));
+		cmd.opcode       = MMC_CMDQ_TASK_MGMT;
+		cmd.arg          = 1; /* Discard entire queue */
+		cmd.flags        = MMC_RSP_R1B | MMC_CMD_AC;
+		cmd.flags       &= ~MMC_RSP_CRC; /* Ignore CRC */
+		cmd.busy_timeout = MMC_CQE_RECOVERY_TIMEOUT;
+		err = mmc_wait_for_cmd(host, &cmd, 3);
+	}
+
 	mmc_retune_release(host);
 
 	return err;
diff --git a/drivers/mmc/host/cqhci-core.c b/drivers/mmc/host/cqhci-core.c
index b3d7d6d8d654..93e1bbdc446f 100644
--- a/drivers/mmc/host/cqhci-core.c
+++ b/drivers/mmc/host/cqhci-core.c
@@ -975,8 +975,13 @@ static bool cqhci_halt(struct mmc_host *mmc, unsigned int timeout)
 
 	ret = cqhci_halted(cq_host);
 
-	if (!ret)
-		pr_debug("%s: cqhci: Failed to halt\n", mmc_hostname(mmc));
+	if (!ret) {
+		u32 cqcfg;
+
+		cqcfg = cqhci_readl(cq_host, CQHCI_CFG);
+		cqcfg &= ~CQHCI_ENABLE;
+		cqhci_writel(cq_host, cqcfg, CQHCI_CFG);
+	}
 
 	return ret;
 }
@@ -987,7 +992,7 @@ static bool cqhci_halt(struct mmc_host *mmc, unsigned int timeout)
  * layers will need to send a STOP command), so we set the timeout based on a
  * generous command timeout.
  */
-#define CQHCI_START_HALT_TIMEOUT	5
+#define CQHCI_START_HALT_TIMEOUT	200
 
 static void cqhci_recovery_start(struct mmc_host *mmc)
 {
@@ -1039,6 +1044,8 @@ static void cqhci_recover_mrq(struct cqhci_host *cq_host, unsigned int tag)
 		data->error = cqhci_error_from_flags(slot->flags);
 	} else {
 		mrq->cmd->error = cqhci_error_from_flags(slot->flags);
+		if (!mrq->cmd->error)
+			mrq->cmd->error = -EIO;
 	}
 
 	mmc_cqe_request_done(cq_host->mmc, mrq);


> 
>>
>>>
>>> 2. As for the spurious task completion warning.
>>> I initially thought that it was bening.
>>> The check for !mrq is done before checking if we're currently doing recovery.
>>> So if it's called just right at the end of recovery, right after the
>>> cqhci_recover_mrqs is executed that would explain it.
>>> With that being said if that irq handler is run right after the
>>> recovery is finished we'll end up with a race where a new request,
>>> that was just enqueued, might be mistakenly marked as done.
>>> This would explain the dm-verity errors we're seeing.
>>>
>>>>
>>>>>>
>>>>>>>
>>>>>>> I'd argue that logging only the info that CQE recovery was executed is
>>>>>>> not particularly helpful for someone looking into those logs.
>>>>>>
>>>>>> As the comment says, that message is there because recovery reduces
>>>>>> performance, it is not to aid debugging per se.
>>>>>>
>>>>>>> Ideally we would have more data about the state the controller was in
>>>>>>> when the error happened, or at least what caused the recovery to be
>>>>>>> triggered.
>>>>>>> The question here is how verbose should we be in this error scenario.
>>>>>>> Looking at other error scenarios, in the case of a software timeout
>>>>>>> we're dumping the controller registers. (cqhci_timeout)
>>>>>>
>>>>>> Timeout means something is broken - either the driver, the cq engine
>>>>>> or the card.  On the other hand, an error interrupt is most likely a
>>>>>> CRC error which is not unexpected occasionally, due to thermal drift
>>>>>> or perhaps interference.
>>>>>
>>>>> Right, but my point is that we don't know what triggered CQE recovery.
>>>>
>>>> True, although probably a CRC error.
>>>>
>>>>>
>>>>>>
>>>>>>> Hence I thought that I'd be appropriate to match that and do the same
>>>>>>> in CQE recovery logic.
>>>>>>
>>>>>> It needs to be consistent. There are other pr_debugs, such as:
>>>>>>
>>>>>>                 pr_debug("%s: cqhci: Failed to clear tasks\n",
>>>>>>                 pr_debug("%s: cqhci: Failed to halt\n", mmc_hostname(mmc));
>>>>>>                 pr_debug("%s: cqhci: disable / re-enable\n", mmc_hostname(mmc));
>>>>>>
>>>>>> which should perhaps be treated the same.
>>>>>>
>>>>>> And there are no messages for errors from the commands in
>>>>>> mmc_cqe_recovery().
>>>>>
>>>>> How about this.
>>>>> As a compromise, would it be okay to just do a single pr_warn directly
>>>>> from cqhci_error_irq?
>>>>
>>>> Sure, printk_ratelimited() or __ratelimit()
>>>>
>>>>> We could simply promote the existing pr_debug to pr_warn at the
>>>>> beginning of that function.
>>>>> This would tell us what triggered the recovery. (controller timeout,
>>>>> CRC mismatch)
>>>>> We can also consider removing the "running CQE recovery" print for the
>>>>> sake of brevity.
>>>>
>>>> No, that serves a different purpose.
>>>>
>>>>> The only downside of this that I can see is that we'd be running the
>>>>> logic from the interrupt handler directly, but I can't see an easy way
>>>>> around that.
>>>>> What do you think?
>>>>
>>>> Should be OK with rate limiting.
>>>
>>> OK, I'll look into the rate limiting and will send a v2.
>>>
>>>>
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>       /* Forget about errors when recovery has already been triggered */
>>>>>>>>>       if (cq_host->recovery_halt)
>>>>>>>>
>>>>>>
>>>>
>>
Kornel Dulęba Oct. 26, 2023, 9:32 a.m. UTC | #10
On Thu, Oct 26, 2023 at 10:52 AM Adrian Hunter <adrian.hunter@intel.com> wrote:
>
> On 26/10/23 11:29, Kornel Dulęba wrote:
> > On Thu, Oct 26, 2023 at 10:14 AM Adrian Hunter <adrian.hunter@intel.com> wrote:
> >>
> >> On 26/10/23 10:56, Kornel Dulęba wrote:
> >>> On Thu, Oct 26, 2023 at 8:25 AM Adrian Hunter <adrian.hunter@intel.com> wrote:
> >>>>
> >>>> On 25/10/23 11:01, Kornel Dulęba wrote:
> >>>>> On Mon, Oct 23, 2023 at 1:38 PM Adrian Hunter <adrian.hunter@intel.com> wrote:
> >>>>>>
> >>>>>> On 20/10/23 11:53, Kornel Dulęba wrote:
> >>>>>>> On Fri, Oct 20, 2023 at 9:41 AM Adrian Hunter <adrian.hunter@intel.com> wrote:
> >>>>>>>>
> >>>>>>>> On 16/10/23 12:56, Kornel Dulęba wrote:
> >>>>>>>>> There are several reasons for controller to generate an error interrupt.
> >>>>>>>>> They include controller<->card timeout, and CRC mismatch error.
> >>>>>>>>> Right now we only get one line in the logs stating that CQE recovery was
> >>>>>>>>> triggered, but with no information about what caused it.
> >>>>>>>>> To figure out what happened be more verbose and dump the registers from
> >>>>>>>>> irq error handler logic.
> >>>>>>>>> This matches the behaviour of the software timeout logic, see
> >>>>>>>>> cqhci_timeout.
> >>>>>>>>>
> >>>>>>>>> Signed-off-by: Kornel Dulęba <korneld@chromium.org>
> >>>>>>>>> ---
> >>>>>>>>>  drivers/mmc/host/cqhci-core.c | 5 +++--
> >>>>>>>>>  1 file changed, 3 insertions(+), 2 deletions(-)
> >>>>>>>>>
> >>>>>>>>> diff --git a/drivers/mmc/host/cqhci-core.c b/drivers/mmc/host/cqhci-core.c
> >>>>>>>>> index b3d7d6d8d654..33abb4bd53b5 100644
> >>>>>>>>> --- a/drivers/mmc/host/cqhci-core.c
> >>>>>>>>> +++ b/drivers/mmc/host/cqhci-core.c
> >>>>>>>>> @@ -700,8 +700,9 @@ static void cqhci_error_irq(struct mmc_host *mmc, u32 status, int cmd_error,
> >>>>>>>>>
> >>>>>>>>>       terri = cqhci_readl(cq_host, CQHCI_TERRI);
> >>>>>>>>>
> >>>>>>>>> -     pr_debug("%s: cqhci: error IRQ status: 0x%08x cmd error %d data error %d TERRI: 0x%08x\n",
> >>>>>>>>> -              mmc_hostname(mmc), status, cmd_error, data_error, terri);
> >>>>>>>>> +     pr_warn("%s: cqhci: error IRQ status: 0x%08x cmd error %d data error %d\n",
> >>>>>>>>> +              mmc_hostname(mmc), status, cmd_error, data_error);
> >>>>>>>>> +     cqhci_dumpregs(cq_host);
> >>>>>>>>
> >>>>>>>> For debugging, isn't dynamic debug seems more appropriate?
> >>>>>>>
> >>>>>>> Dynamic debug is an option, but my personal preference would be to
> >>>>>>> just log more info in the error handler.
> >>>>>>
> >>>>>> Interrupt handlers can get called very rapidly, so some kind of rate
> >>>>>> limiting should be used if the message is unconditional.  Also you need
> >>>>>> to provide actual reasons for your preference.
> >>>>>>
> >>>>>> For dynamic debug of the register dump, something like below is
> >>>>>> possible.
> >>>>>>
> >>>>>> #define cqhci_dynamic_dumpregs(cqhost) \
> >>>>>>         _dynamic_func_call_no_desc("cqhci_dynamic_dumpregs", cqhci_dumpregs, cqhost)
> >>>>>>
> >>>>> Fair point.
> >>>>> The reason I'm not a fan of using dynamic debug for this is that my
> >>>>> goal here is to improve the warning/error logging information that we
> >>>>> get from systems running in production.
> >>>>> I.e. if we get a lot of "running CQE recovery" messages, at the very
> >>>>> least I'd like to know what is causing them.
> >>>>
> >>>> So you are saying you want to collect debug information from production
> >>>> systems, but don't want to use dynamic debug to do it?
> >>>>
> >>>>>>> To give you some background.
> >>>>>>> We're seeing some "running CQE recovery" lines in the logs, followed
> >>>>>>> by a dm_verity mismatch error.
> >>>>>>> The reports come from the field, with no feasible way to reproduce the
> >>>>>>> issue locally.
> >>>>>>
> >>>>>> If it is a software error, some kind of error injection may well
> >>>>>> reproduce it.  Also if it is a hardware error that only happens
> >>>>>> during recovery, error injection could increase the likelihood of
> >>>>>> reproducing it.
> >>>>>
> >>>>> We tried software injection and it didn't yield any results.
> >>>>> We're currently looking into "injecting" hw errors by using a small
> >>>>> burst field generator to interfere with transfers on the data line
> >>>>> directly.
> >>>>
> >>>> I just tried instrumenting a driver to inject CRC errors and it
> >>>> revealed several CQE recovery issues, including spurious TCN for
> >>>> tag 31.  I will send some patches when they are ready.
> >>>
> >>> Sorry, what I meant by it didn't yield results is that it didn't
> >>> trigger the dm-verity error that we're seeing on production.
> >>> With SW injection there are two potential issues that come to my mind:
> >>>
> >>> 1. In cqhci_error_irq, when TERRI is not valid, only a single
> >>> "random" task is marked as bad.
> >>> Then in cqhci_recover_mrq we're marking all pending requests as done.
> >>> For data transfers this is somewhat benign as it will return with
> >>> bytes_xfered=0.
> >>> IIUC this will then cause the upper layer to re-enqueue this request.
> >>
> >> Yes
> >>
> >>> The bigger problem is a CMD only mrq, which will be mistakenly marked
> >>> as completed successfully.
> >>
> >> I noticed that also.  Notably the only non-data CMD is cache flush.
> >>
> >> There are several other issues, but patches will describe
> >> them better.
> >
> > Sure. :)
> >
> > jfyi, I've just managed to inject CRC errors by using a burst field generator.
> > (Langer P23 if you're interested.)
> > I'm using it by touching the D0 MMC line directly, and it yields
> > surprisingly good results.
> > I've changed the spurious TCN WARN_ONCE to pr_warn and got the following:
> >
> > [   71.885698] mmc1: cqhci: error IRQ status: 0x00000000 cmd error 0
> > data error -84 TERRI: 0x972e0000
> > [   71.885730] mmc1: running CQE recovery
> > [   71.888135] cqhci_recovery_finish: TCN: 0x00000000
> > [   71.888141] mmc1: cqhci: recovery done
> > [   71.888223] mmc1: cqhci: spurious TCN for tag 23
> > (...)
> > [   95.558736] mmc1: cqhci: error IRQ status: 0x00000000 cmd error 0
> > data error -84 TERRI: 0x822e0000
> > [   95.558768] mmc1: running CQE recovery
> > [   95.561073] cqhci_recovery_finish: TCN: 0x00000000
> > [   95.561078] mmc1: cqhci: recovery done
> > [   95.561288] device-mapper: verity: 179:3: data block 712181 is corrupted
> >
> > Now I get a spurious TCN after every recovery, with the only exception
> > being the one that ends up with the dm-verity error.
> > So it'd seem that there's a race in which the "spurious" TCN hits a
> > pending request, enqueued right after recovery was completed.
> > I'm currently looking into how to fix it, but if you beat me to it I
> > can also test your patches and see if it fixes the dm-verity issue.
>
> OK, but here are some hacks to try:
(snip)
Thanks. FYI, with them applied I see exactly the same behaviour,
including the same output in the logs.
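
As a side note on the spurious-TCN log lines quoted above: the
WARN_ONCE-to-pr_warn tweak Kornel mentions was not posted, but a minimal
sketch of that kind of local instrumentation (assuming the upstream !mrq
check at the top of cqhci_finish_mrq) could be:

	/* Local debug tweak (sketch only): report every spurious TCN,
	 * not just the first one, so repeated occurrences stay visible
	 * in the logs while reproducing the issue.
	 */
	if (!mrq) {
		pr_warn("%s: cqhci: spurious TCN for tag %d\n",
			mmc_hostname(mmc), tag);
		return;
	}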
>
> >
> >>
> >>>
> >>> 2. As for the spurious task completion warning.
> >>> I initially thought that it was benign.
> >>> The check for !mrq is done before checking if we're currently doing recovery.
> >>> So if it's called just right at the end of recovery, right after the
> >>> cqhci_recover_mrqs is executed that would explain it.
> >>> With that being said if that irq handler is run right after the
> >>> recovery is finished we'll end up with a race where a new request,
> >>> that was just enqueued, might be mistakenly marked as done.
> >>> This would explain the dm-verity errors we're seeing.
> >>>
> >>>>
> >>>>>>
> >>>>>>>
> >>>>>>> I'd argue that logging only the info that CQE recovery was executed is
> >>>>>>> not particularly helpful for someone looking into those logs.
> >>>>>>
> >>>>>> As the comment says, that message is there because recovery reduces
> >>>>>> performance, it is not to aid debugging per se.
> >>>>>>
> >>>>>>> Ideally we would have more data about the state the controller was in
> >>>>>>> when the error happened, or at least what caused the recovery to be
> >>>>>>> triggered.
> >>>>>>> The question here is how verbose should we be in this error scenario.
> >>>>>>> Looking at other error scenarios, in the case of a software timeout
> >>>>>>> we're dumping the controller registers. (cqhci_timeout)
> >>>>>>
> >>>>>> Timeout means something is broken - either the driver, the cq engine
> >>>>>> or the card.  On the other hand, an error interrupt is most likely a
> >>>>>> CRC error which is not unexpected occasionally, due to thermal drift
> >>>>>> or perhaps interference.
> >>>>>
> >>>>> Right, but my point is that we don't know what triggered CQE recovery.
> >>>>
> >>>> True, although probably a CRC error.
> >>>>
> >>>>>
> >>>>>>
> >>>>>>> Hence I thought that it'd be appropriate to match that and do the same
> >>>>>>> in CQE recovery logic.
> >>>>>>
> >>>>>> It needs to be consistent. There are other pr_debugs, such as:
> >>>>>>
> >>>>>>                 pr_debug("%s: cqhci: Failed to clear tasks\n",
> >>>>>>                 pr_debug("%s: cqhci: Failed to halt\n", mmc_hostname(mmc));
> >>>>>>                 pr_debug("%s: cqhci: disable / re-enable\n", mmc_hostname(mmc));
> >>>>>>
> >>>>>> which should perhaps be treated the same.
> >>>>>>
> >>>>>> And there are no messages for errors from the commands in
> >>>>>> mmc_cqe_recovery().
> >>>>>
> >>>>> How about this.
> >>>>> As a compromise, would it be okay to just do a single pr_warn directly
> >>>>> from cqhci_error_irq?
> >>>>
> >>>> Sure, printk_ratelimited() or __ratelimit()
> >>>>
> >>>>> We could simply promote the existing pr_debug to pr_warn at the
> >>>>> beginning of that function.
> >>>>> This would tell us what triggered the recovery. (controller timeout,
> >>>>> CRC mismatch)
> >>>>> We can also consider removing the "running CQE recovery" print for the
> >>>>> sake of brevity.
> >>>>
> >>>> No, that serves a different purpose.
> >>>>
> >>>>> The only downside of this that I can see is that we'd be running the
> >>>>> logic from the interrupt handler directly, but I can't see an easy way
> >>>>> around that.
> >>>>> What do you think?
> >>>>
> >>>> Should be OK with rate limiting.
> >>>
> >>> OK, I'll look into the rate limiting and will send a v2.
> >>>
> >>>>
> >>>>>>
> >>>>>>>
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>>       /* Forget about errors when recovery has already been triggered */
> >>>>>>>>>       if (cq_host->recovery_halt)
> >>>>>>>>
> >>>>>>
> >>>>
> >>
>
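
On the dynamic debug route suggested in the quoted discussion, a minimal
sketch of how the proposed wrapper could be used (the macro itself comes
from the suggestion above; its call site in cqhci_error_irq is an
assumption, not a posted patch):

/* Sketch: put the register dump behind a dynamic debug call site so it
 * can be enabled at runtime on production systems instead of always-on.
 */
#define cqhci_dynamic_dumpregs(cq_host) \
	_dynamic_func_call_no_desc("cqhci_dynamic_dumpregs", cqhci_dumpregs, cq_host)

	/* in cqhci_error_irq(), next to the existing pr_debug() */
	cqhci_dynamic_dumpregs(cq_host);

With CONFIG_DYNAMIC_DEBUG it could then be switched on from the dynamic
debug control file, e.g. by matching on that format string.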
diff mbox series

Patch

diff --git a/drivers/mmc/host/cqhci-core.c b/drivers/mmc/host/cqhci-core.c
index b3d7d6d8d654..33abb4bd53b5 100644
--- a/drivers/mmc/host/cqhci-core.c
+++ b/drivers/mmc/host/cqhci-core.c
@@ -700,8 +700,9 @@  static void cqhci_error_irq(struct mmc_host *mmc, u32 status, int cmd_error,
 
 	terri = cqhci_readl(cq_host, CQHCI_TERRI);
 
-	pr_debug("%s: cqhci: error IRQ status: 0x%08x cmd error %d data error %d TERRI: 0x%08x\n",
-		 mmc_hostname(mmc), status, cmd_error, data_error, terri);
+	pr_warn("%s: cqhci: error IRQ status: 0x%08x cmd error %d data error %d\n",
+		 mmc_hostname(mmc), status, cmd_error, data_error);
+	cqhci_dumpregs(cq_host);
 
 	/* Forget about errors when recovery has already been triggered */
 	if (cq_host->recovery_halt)
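
For reference, a minimal sketch of the rate-limited variant the discussion
converges on (an assumption about what a v2 might look like, not a posted
patch): keep TERRI in the message, promote the print to a rate-limited
warning, and drop the unconditional register dump:

	/* Sketch: rate-limited so a burst of error interrupts cannot flood the log */
	pr_warn_ratelimited("%s: cqhci: error IRQ status: 0x%08x cmd error %d data error %d TERRI: 0x%08x\n",
			    mmc_hostname(mmc), status, cmd_error, data_error, terri);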