[v2,7/8] remoteproc: qcom: q6v5: Add common panic handler

Message ID 20191227053215.423811-8-bjorn.andersson@linaro.org
State New
Headers show
Series
  • remoteproc: qcom: post mortem debug support
Related show

Commit Message

Bjorn Andersson Dec. 27, 2019, 5:32 a.m.
Add a common panic handler that invokes a stop request and sleep enough
to let the remoteproc flush it's caches etc in order to aid post mortem
debugging.

Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>

---

Changes since v1:
- None

 drivers/remoteproc/qcom_q6v5.c | 19 +++++++++++++++++++
 drivers/remoteproc/qcom_q6v5.h |  1 +
 2 files changed, 20 insertions(+)

-- 
2.24.0

Comments

Bjorn Andersson Jan. 22, 2020, 7:39 p.m. | #1
On Fri 10 Jan 13:28 PST 2020, Mathieu Poirier wrote:

> On Thu, Dec 26, 2019 at 09:32:14PM -0800, Bjorn Andersson wrote:
> > Add a common panic handler that invokes a stop request and sleep enough
> > to let the remoteproc flush it's caches etc in order to aid post mortem
> > debugging.
> > 
> > Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
> > ---
> > 
> > Changes since v1:
> > - None
> > 
> >  drivers/remoteproc/qcom_q6v5.c | 19 +++++++++++++++++++
> >  drivers/remoteproc/qcom_q6v5.h |  1 +
> >  2 files changed, 20 insertions(+)
> > 
> > diff --git a/drivers/remoteproc/qcom_q6v5.c b/drivers/remoteproc/qcom_q6v5.c
> > index cb0f4a0be032..17167c980e02 100644
> > --- a/drivers/remoteproc/qcom_q6v5.c
> > +++ b/drivers/remoteproc/qcom_q6v5.c
> > @@ -6,6 +6,7 @@
> >   * Copyright (C) 2014 Sony Mobile Communications AB
> >   * Copyright (c) 2012-2013, The Linux Foundation. All rights reserved.
> >   */
> > +#include <linux/delay.h>
> >  #include <linux/kernel.h>
> >  #include <linux/platform_device.h>
> >  #include <linux/interrupt.h>
> > @@ -15,6 +16,8 @@
> >  #include <linux/remoteproc.h>
> >  #include "qcom_q6v5.h"
> >  
> > +#define Q6V5_PANIC_DELAY_MS	200
> > +
> >  /**
> >   * qcom_q6v5_prepare() - reinitialize the qcom_q6v5 context before start
> >   * @q6v5:	reference to qcom_q6v5 context to be reinitialized
> > @@ -162,6 +165,22 @@ int qcom_q6v5_request_stop(struct qcom_q6v5 *q6v5)
> >  }
> >  EXPORT_SYMBOL_GPL(qcom_q6v5_request_stop);
> >  
> > +/**
> > + * qcom_q6v5_panic() - panic handler to invoke a stop on the remote
> > + * @q6v5:	reference to qcom_q6v5 context
> > + *
> > + * Set the stop bit and sleep in order to allow the remote processor to flush
> > + * its caches etc for post mortem debugging.
> > + */
> > +void qcom_q6v5_panic(struct qcom_q6v5 *q6v5)
> > +{
> > +	qcom_smem_state_update_bits(q6v5->state,
> > +				    BIT(q6v5->stop_bit), BIT(q6v5->stop_bit));
> > +
> > +	mdelay(Q6V5_PANIC_DELAY_MS);
> 
> I really wonder if the delay should be part of the remoteproc core and
> configurable via device tree.  Wanting the remote processor to flush its caches
> is likely something other vendors will want when dealing with a kernel panic.
> It would be nice to see if other people have an opinion on this topic.  If not
> then we can keep the delay here and move it to the core if need be.
> 

I gave this some more thought and what we're trying to achieve is to
signal the remote processors about the panic and then give them time to
react, but per the proposal (and Qualcomm downstream iirc) we will do
this for each remote processor, one by one.

So in the typical case of a Qualcomm platform with 4-5 remoteprocs we'll
end up giving the first one a whole second to react and the last one
"only" 200ms.

Moving the delay to the core by iterating over rproc_list calling
panic() and then delaying would be cleaner imo.

It might be nice to make this configurable in DT, but I agree that it
would be nice to hear from others if this would be useful.

Regards,
Bjorn

> Thanks,
> Mathieu
> 
> > +}
> > +EXPORT_SYMBOL_GPL(qcom_q6v5_panic);
> > +
> >  /**
> >   * qcom_q6v5_init() - initializer of the q6v5 common struct
> >   * @q6v5:	handle to be initialized
> > diff --git a/drivers/remoteproc/qcom_q6v5.h b/drivers/remoteproc/qcom_q6v5.h
> > index 7ac92c1e0f49..c37e6fd063e4 100644
> > --- a/drivers/remoteproc/qcom_q6v5.h
> > +++ b/drivers/remoteproc/qcom_q6v5.h
> > @@ -42,5 +42,6 @@ int qcom_q6v5_prepare(struct qcom_q6v5 *q6v5);
> >  int qcom_q6v5_unprepare(struct qcom_q6v5 *q6v5);
> >  int qcom_q6v5_request_stop(struct qcom_q6v5 *q6v5);
> >  int qcom_q6v5_wait_for_start(struct qcom_q6v5 *q6v5, int timeout);
> > +void qcom_q6v5_panic(struct qcom_q6v5 *q6v5);
> >  
> >  #endif
> > -- 
> > 2.24.0
> >
Bjorn Andersson Jan. 23, 2020, 5:15 p.m. | #2
On Thu 23 Jan 09:01 PST 2020, Mathieu Poirier wrote:

> On Wed, 22 Jan 2020 at 12:40, Bjorn Andersson
> <bjorn.andersson@linaro.org> wrote:
> >
> > On Fri 10 Jan 13:28 PST 2020, Mathieu Poirier wrote:
> >
> > > On Thu, Dec 26, 2019 at 09:32:14PM -0800, Bjorn Andersson wrote:
> > > > Add a common panic handler that invokes a stop request and sleep enough
> > > > to let the remoteproc flush it's caches etc in order to aid post mortem
> > > > debugging.
> > > >
> > > > Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
> > > > ---
> > > >
> > > > Changes since v1:
> > > > - None
> > > >
> > > >  drivers/remoteproc/qcom_q6v5.c | 19 +++++++++++++++++++
> > > >  drivers/remoteproc/qcom_q6v5.h |  1 +
> > > >  2 files changed, 20 insertions(+)
> > > >
> > > > diff --git a/drivers/remoteproc/qcom_q6v5.c b/drivers/remoteproc/qcom_q6v5.c
> > > > index cb0f4a0be032..17167c980e02 100644
> > > > --- a/drivers/remoteproc/qcom_q6v5.c
> > > > +++ b/drivers/remoteproc/qcom_q6v5.c
> > > > @@ -6,6 +6,7 @@
> > > >   * Copyright (C) 2014 Sony Mobile Communications AB
> > > >   * Copyright (c) 2012-2013, The Linux Foundation. All rights reserved.
> > > >   */
> > > > +#include <linux/delay.h>
> > > >  #include <linux/kernel.h>
> > > >  #include <linux/platform_device.h>
> > > >  #include <linux/interrupt.h>
> > > > @@ -15,6 +16,8 @@
> > > >  #include <linux/remoteproc.h>
> > > >  #include "qcom_q6v5.h"
> > > >
> > > > +#define Q6V5_PANIC_DELAY_MS        200
> > > > +
> > > >  /**
> > > >   * qcom_q6v5_prepare() - reinitialize the qcom_q6v5 context before start
> > > >   * @q6v5:  reference to qcom_q6v5 context to be reinitialized
> > > > @@ -162,6 +165,22 @@ int qcom_q6v5_request_stop(struct qcom_q6v5 *q6v5)
> > > >  }
> > > >  EXPORT_SYMBOL_GPL(qcom_q6v5_request_stop);
> > > >
> > > > +/**
> > > > + * qcom_q6v5_panic() - panic handler to invoke a stop on the remote
> > > > + * @q6v5:  reference to qcom_q6v5 context
> > > > + *
> > > > + * Set the stop bit and sleep in order to allow the remote processor to flush
> > > > + * its caches etc for post mortem debugging.
> > > > + */
> > > > +void qcom_q6v5_panic(struct qcom_q6v5 *q6v5)
> > > > +{
> > > > +   qcom_smem_state_update_bits(q6v5->state,
> > > > +                               BIT(q6v5->stop_bit), BIT(q6v5->stop_bit));
> > > > +
> > > > +   mdelay(Q6V5_PANIC_DELAY_MS);
> > >
> > > I really wonder if the delay should be part of the remoteproc core and
> > > configurable via device tree.  Wanting the remote processor to flush its caches
> > > is likely something other vendors will want when dealing with a kernel panic.
> > > It would be nice to see if other people have an opinion on this topic.  If not
> > > then we can keep the delay here and move it to the core if need be.
> > >
> >
> > I gave this some more thought and what we're trying to achieve is to
> > signal the remote processors about the panic and then give them time to
> > react, but per the proposal (and Qualcomm downstream iirc) we will do
> > this for each remote processor, one by one.
> >
> > So in the typical case of a Qualcomm platform with 4-5 remoteprocs we'll
> > end up giving the first one a whole second to react and the last one
> > "only" 200ms.
> >
> > Moving the delay to the core by iterating over rproc_list calling
> > panic() and then delaying would be cleaner imo.
> 
> I agree.
> 
> >
> > It might be nice to make this configurable in DT, but I agree that it
> > would be nice to hear from others if this would be useful.
> 
> I think the delay has to be configurable via DT if we move this to the
> core.  The binding can be optional and default to 200ms if not
> present.
> 

How about I make the panic() return the required delay and then we let
the core sleep for MAX() of the returned durations? That way the default
is still a property of the remoteproc drivers - and 200ms seems rather
arbitrary to put in the core, even as a default.

Regards,
Bjorn
Arnaud POULIQUEN Jan. 27, 2020, 9:46 a.m. | #3
On 1/24/20 7:44 PM, Mathieu Poirier wrote:
> On Thu, 23 Jan 2020 at 10:49, Arnaud POULIQUEN <arnaud.pouliquen@st.com> wrote:
>>
>> Hi Bjorn, Mathieu
>>
>> On 1/23/20 6:15 PM, Bjorn Andersson wrote:
>>> On Thu 23 Jan 09:01 PST 2020, Mathieu Poirier wrote:
>>>
>>>> On Wed, 22 Jan 2020 at 12:40, Bjorn Andersson
>>>> <bjorn.andersson@linaro.org> wrote:
>>>>>
>>>>> On Fri 10 Jan 13:28 PST 2020, Mathieu Poirier wrote:
>>>>>
>>>>>> On Thu, Dec 26, 2019 at 09:32:14PM -0800, Bjorn Andersson wrote:
>>>>>>> Add a common panic handler that invokes a stop request and sleep enough
>>>>>>> to let the remoteproc flush it's caches etc in order to aid post mortem
>>>>>>> debugging.
>>>>>>>
>>>>>>> Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
>>>>>>> ---
>>>>>>>
>>>>>>> Changes since v1:
>>>>>>> - None
>>>>>>>
>>>>>>>  drivers/remoteproc/qcom_q6v5.c | 19 +++++++++++++++++++
>>>>>>>  drivers/remoteproc/qcom_q6v5.h |  1 +
>>>>>>>  2 files changed, 20 insertions(+)
>>>>>>>
>>>>>>> diff --git a/drivers/remoteproc/qcom_q6v5.c b/drivers/remoteproc/qcom_q6v5.c
>>>>>>> index cb0f4a0be032..17167c980e02 100644
>>>>>>> --- a/drivers/remoteproc/qcom_q6v5.c
>>>>>>> +++ b/drivers/remoteproc/qcom_q6v5.c
>>>>>>> @@ -6,6 +6,7 @@
>>>>>>>   * Copyright (C) 2014 Sony Mobile Communications AB
>>>>>>>   * Copyright (c) 2012-2013, The Linux Foundation. All rights reserved.
>>>>>>>   */
>>>>>>> +#include <linux/delay.h>
>>>>>>>  #include <linux/kernel.h>
>>>>>>>  #include <linux/platform_device.h>
>>>>>>>  #include <linux/interrupt.h>
>>>>>>> @@ -15,6 +16,8 @@
>>>>>>>  #include <linux/remoteproc.h>
>>>>>>>  #include "qcom_q6v5.h"
>>>>>>>
>>>>>>> +#define Q6V5_PANIC_DELAY_MS        200
>>>>>>> +
>>>>>>>  /**
>>>>>>>   * qcom_q6v5_prepare() - reinitialize the qcom_q6v5 context before start
>>>>>>>   * @q6v5:  reference to qcom_q6v5 context to be reinitialized
>>>>>>> @@ -162,6 +165,22 @@ int qcom_q6v5_request_stop(struct qcom_q6v5 *q6v5)
>>>>>>>  }
>>>>>>>  EXPORT_SYMBOL_GPL(qcom_q6v5_request_stop);
>>>>>>>
>>>>>>> +/**
>>>>>>> + * qcom_q6v5_panic() - panic handler to invoke a stop on the remote
>>>>>>> + * @q6v5:  reference to qcom_q6v5 context
>>>>>>> + *
>>>>>>> + * Set the stop bit and sleep in order to allow the remote processor to flush
>>>>>>> + * its caches etc for post mortem debugging.
>>>>>>> + */
>>>>>>> +void qcom_q6v5_panic(struct qcom_q6v5 *q6v5)
>>>>>>> +{
>>>>>>> +   qcom_smem_state_update_bits(q6v5->state,
>>>>>>> +                               BIT(q6v5->stop_bit), BIT(q6v5->stop_bit));
>>>>>>> +
>>>>>>> +   mdelay(Q6V5_PANIC_DELAY_MS);
>>>>>>
>>>>>> I really wonder if the delay should be part of the remoteproc core and
>>>>>> configurable via device tree.  Wanting the remote processor to flush its caches
>>>>>> is likely something other vendors will want when dealing with a kernel panic.
>>>>>> It would be nice to see if other people have an opinion on this topic.  If not
>>>>>> then we can keep the delay here and move it to the core if need be.
>>>>>>
>>>>>
>>>>> I gave this some more thought and what we're trying to achieve is to
>>>>> signal the remote processors about the panic and then give them time to
>>>>> react, but per the proposal (and Qualcomm downstream iirc) we will do
>>>>> this for each remote processor, one by one.
>>>>>
>>>>> So in the typical case of a Qualcomm platform with 4-5 remoteprocs we'll
>>>>> end up giving the first one a whole second to react and the last one
>>>>> "only" 200ms.
>>>>>
>>>>> Moving the delay to the core by iterating over rproc_list calling
>>>>> panic() and then delaying would be cleaner imo.
>>>>
>>>> I agree.
>>>>
>>>>>
>>>>> It might be nice to make this configurable in DT, but I agree that it
>>>>> would be nice to hear from others if this would be useful.
>>>>
>>>> I think the delay has to be configurable via DT if we move this to the
>>>> core.  The binding can be optional and default to 200ms if not
>>>> present.
>>>>
>>>
>>> How about I make the panic() return the required delay and then we let
>>> the core sleep for MAX() of the returned durations?
> 
> I like it.
> 
>> That way the default
>>> is still a property of the remoteproc drivers - and 200ms seems rather
>>> arbitrary to put in the core, even as a default.
>>
>> I agree with Bjorn, the delay should be provided by the platform.
>> But in this case i wonder if it is simpler to just let the platform take care it?
> 
> If I understand you correctly, that is what Bjorn's original
> implementation was doing and it had drawbacks.
Yes, 
Please tell me if i missed something, the only drawback seems mentioned is the accumulative delay.
Could you elaborate how to implement the delay in remote proc core for multi rproc instance.
Here is my view:
To optimize the delay it would probably be necessary to compute:
- the delay based on an initial date,
- the delay requested by each rproc instance,
- the delay elapsed in each rproc panic ops.
Feasible but not straight forward... 
So I suppose that you are thinking about a solution based on the store of the max delay that would be applied after last panic() return?
anyway, how do you determine the last rproc instance? seems that a prerequisite would be that the panic ops is mandatory... 

I'm not familiar with panic mechanism, but how panic ops are scheduled in SMP? Does panics ops would be treated in parallel (using msleep instead of mdelay)?
In this case delays could not be cumulative...

> 
>> For instance for stm32mp1 the stop corresponds to the reset on the remote processor core. To inform the coprocessor about an imminent shutdown we use a signal relying on a mailbox (cf. stm32_rproc_stop).
>> In this case we would need a delay between the signal and the reset, but not after (no cache management).
> 
> Here I believe you are referring to the upper limit of 500ms that is
> needed for the mbox_send_message() in stm32_rproc_stop() to complete.
> Since that is a blocking call I think it would fit with Bjorn's
> proposal above if a value of '0' is returned by rproc->ops->panic().
> That would mean no further delays are needed (because the blocking
> mbox_send_message() would have done the job already).  Let me know if
> I'm in the weeds.
Yes you are :), this is what i thought, if delay implemented in core.

Regards,
Arnaud

> 
>>
>> Regards,
>> Arnaud
>>>
>>> Regards,
>>> Bjorn
>>>
Mathieu Poirier Jan. 29, 2020, 8:15 p.m. | #4
On Mon, Jan 27, 2020 at 10:46:05AM +0100, Arnaud POULIQUEN wrote:
> 
> 
> On 1/24/20 7:44 PM, Mathieu Poirier wrote:
> > On Thu, 23 Jan 2020 at 10:49, Arnaud POULIQUEN <arnaud.pouliquen@st.com> wrote:
> >>
> >> Hi Bjorn, Mathieu
> >>
> >> On 1/23/20 6:15 PM, Bjorn Andersson wrote:
> >>> On Thu 23 Jan 09:01 PST 2020, Mathieu Poirier wrote:
> >>>
> >>>> On Wed, 22 Jan 2020 at 12:40, Bjorn Andersson
> >>>> <bjorn.andersson@linaro.org> wrote:
> >>>>>
> >>>>> On Fri 10 Jan 13:28 PST 2020, Mathieu Poirier wrote:
> >>>>>
> >>>>>> On Thu, Dec 26, 2019 at 09:32:14PM -0800, Bjorn Andersson wrote:
> >>>>>>> Add a common panic handler that invokes a stop request and sleep enough
> >>>>>>> to let the remoteproc flush it's caches etc in order to aid post mortem
> >>>>>>> debugging.
> >>>>>>>
> >>>>>>> Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
> >>>>>>> ---
> >>>>>>>
> >>>>>>> Changes since v1:
> >>>>>>> - None
> >>>>>>>
> >>>>>>>  drivers/remoteproc/qcom_q6v5.c | 19 +++++++++++++++++++
> >>>>>>>  drivers/remoteproc/qcom_q6v5.h |  1 +
> >>>>>>>  2 files changed, 20 insertions(+)
> >>>>>>>
> >>>>>>> diff --git a/drivers/remoteproc/qcom_q6v5.c b/drivers/remoteproc/qcom_q6v5.c
> >>>>>>> index cb0f4a0be032..17167c980e02 100644
> >>>>>>> --- a/drivers/remoteproc/qcom_q6v5.c
> >>>>>>> +++ b/drivers/remoteproc/qcom_q6v5.c
> >>>>>>> @@ -6,6 +6,7 @@
> >>>>>>>   * Copyright (C) 2014 Sony Mobile Communications AB
> >>>>>>>   * Copyright (c) 2012-2013, The Linux Foundation. All rights reserved.
> >>>>>>>   */
> >>>>>>> +#include <linux/delay.h>
> >>>>>>>  #include <linux/kernel.h>
> >>>>>>>  #include <linux/platform_device.h>
> >>>>>>>  #include <linux/interrupt.h>
> >>>>>>> @@ -15,6 +16,8 @@
> >>>>>>>  #include <linux/remoteproc.h>
> >>>>>>>  #include "qcom_q6v5.h"
> >>>>>>>
> >>>>>>> +#define Q6V5_PANIC_DELAY_MS        200
> >>>>>>> +
> >>>>>>>  /**
> >>>>>>>   * qcom_q6v5_prepare() - reinitialize the qcom_q6v5 context before start
> >>>>>>>   * @q6v5:  reference to qcom_q6v5 context to be reinitialized
> >>>>>>> @@ -162,6 +165,22 @@ int qcom_q6v5_request_stop(struct qcom_q6v5 *q6v5)
> >>>>>>>  }
> >>>>>>>  EXPORT_SYMBOL_GPL(qcom_q6v5_request_stop);
> >>>>>>>
> >>>>>>> +/**
> >>>>>>> + * qcom_q6v5_panic() - panic handler to invoke a stop on the remote
> >>>>>>> + * @q6v5:  reference to qcom_q6v5 context
> >>>>>>> + *
> >>>>>>> + * Set the stop bit and sleep in order to allow the remote processor to flush
> >>>>>>> + * its caches etc for post mortem debugging.
> >>>>>>> + */
> >>>>>>> +void qcom_q6v5_panic(struct qcom_q6v5 *q6v5)
> >>>>>>> +{
> >>>>>>> +   qcom_smem_state_update_bits(q6v5->state,
> >>>>>>> +                               BIT(q6v5->stop_bit), BIT(q6v5->stop_bit));
> >>>>>>> +
> >>>>>>> +   mdelay(Q6V5_PANIC_DELAY_MS);
> >>>>>>
> >>>>>> I really wonder if the delay should be part of the remoteproc core and
> >>>>>> configurable via device tree.  Wanting the remote processor to flush its caches
> >>>>>> is likely something other vendors will want when dealing with a kernel panic.
> >>>>>> It would be nice to see if other people have an opinion on this topic.  If not
> >>>>>> then we can keep the delay here and move it to the core if need be.
> >>>>>>
> >>>>>
> >>>>> I gave this some more thought and what we're trying to achieve is to
> >>>>> signal the remote processors about the panic and then give them time to
> >>>>> react, but per the proposal (and Qualcomm downstream iirc) we will do
> >>>>> this for each remote processor, one by one.
> >>>>>
> >>>>> So in the typical case of a Qualcomm platform with 4-5 remoteprocs we'll
> >>>>> end up giving the first one a whole second to react and the last one
> >>>>> "only" 200ms.
> >>>>>
> >>>>> Moving the delay to the core by iterating over rproc_list calling
> >>>>> panic() and then delaying would be cleaner imo.
> >>>>
> >>>> I agree.
> >>>>
> >>>>>
> >>>>> It might be nice to make this configurable in DT, but I agree that it
> >>>>> would be nice to hear from others if this would be useful.
> >>>>
> >>>> I think the delay has to be configurable via DT if we move this to the
> >>>> core.  The binding can be optional and default to 200ms if not
> >>>> present.
> >>>>
> >>>
> >>> How about I make the panic() return the required delay and then we let
> >>> the core sleep for MAX() of the returned durations?
> > 
> > I like it.
> > 
> >> That way the default
> >>> is still a property of the remoteproc drivers - and 200ms seems rather
> >>> arbitrary to put in the core, even as a default.
> >>
> >> I agree with Bjorn, the delay should be provided by the platform.
> >> But in this case i wonder if it is simpler to just let the platform take care it?
> > 
> > If I understand you correctly, that is what Bjorn's original
> > implementation was doing and it had drawbacks.
> Yes, 
> Please tell me if i missed something, the only drawback seems mentioned is the accumulative delay.

Yes, that is correct.

> Could you elaborate how to implement the delay in remote proc core for multi rproc instance.
> Here is my view:
> To optimize the delay it would probably be necessary to compute:
> - the delay based on an initial date,
> - the delay requested by each rproc instance,
> - the delay elapsed in each rproc panic ops.
> Feasible but not straight forward... 
> So I suppose that you are thinking about a solution based on the store of the max delay that would be applied after last panic() return?

Yes

> anyway, how do you determine the last rproc instance? seems that a prerequisite would be that the panic ops is mandatory... 

Each ->panic() should return the amount of time to way or 0 if no delay is
required.  If an rpoc doesn't implement ->panic() then it is treated as 0.
Arnaud POULIQUEN Jan. 30, 2020, 10 a.m. | #5
On 1/29/20 9:15 PM, Mathieu Poirier wrote:
> On Mon, Jan 27, 2020 at 10:46:05AM +0100, Arnaud POULIQUEN wrote:
>>
>>
>> On 1/24/20 7:44 PM, Mathieu Poirier wrote:
>>> On Thu, 23 Jan 2020 at 10:49, Arnaud POULIQUEN <arnaud.pouliquen@st.com> wrote:
>>>>
>>>> Hi Bjorn, Mathieu
>>>>
>>>> On 1/23/20 6:15 PM, Bjorn Andersson wrote:
>>>>> On Thu 23 Jan 09:01 PST 2020, Mathieu Poirier wrote:
>>>>>
>>>>>> On Wed, 22 Jan 2020 at 12:40, Bjorn Andersson
>>>>>> <bjorn.andersson@linaro.org> wrote:
>>>>>>>
>>>>>>> On Fri 10 Jan 13:28 PST 2020, Mathieu Poirier wrote:
>>>>>>>
>>>>>>>> On Thu, Dec 26, 2019 at 09:32:14PM -0800, Bjorn Andersson wrote:
>>>>>>>>> Add a common panic handler that invokes a stop request and sleep enough
>>>>>>>>> to let the remoteproc flush it's caches etc in order to aid post mortem
>>>>>>>>> debugging.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
>>>>>>>>> ---
>>>>>>>>>
>>>>>>>>> Changes since v1:
>>>>>>>>> - None
>>>>>>>>>
>>>>>>>>>  drivers/remoteproc/qcom_q6v5.c | 19 +++++++++++++++++++
>>>>>>>>>  drivers/remoteproc/qcom_q6v5.h |  1 +
>>>>>>>>>  2 files changed, 20 insertions(+)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/remoteproc/qcom_q6v5.c b/drivers/remoteproc/qcom_q6v5.c
>>>>>>>>> index cb0f4a0be032..17167c980e02 100644
>>>>>>>>> --- a/drivers/remoteproc/qcom_q6v5.c
>>>>>>>>> +++ b/drivers/remoteproc/qcom_q6v5.c
>>>>>>>>> @@ -6,6 +6,7 @@
>>>>>>>>>   * Copyright (C) 2014 Sony Mobile Communications AB
>>>>>>>>>   * Copyright (c) 2012-2013, The Linux Foundation. All rights reserved.
>>>>>>>>>   */
>>>>>>>>> +#include <linux/delay.h>
>>>>>>>>>  #include <linux/kernel.h>
>>>>>>>>>  #include <linux/platform_device.h>
>>>>>>>>>  #include <linux/interrupt.h>
>>>>>>>>> @@ -15,6 +16,8 @@
>>>>>>>>>  #include <linux/remoteproc.h>
>>>>>>>>>  #include "qcom_q6v5.h"
>>>>>>>>>
>>>>>>>>> +#define Q6V5_PANIC_DELAY_MS        200
>>>>>>>>> +
>>>>>>>>>  /**
>>>>>>>>>   * qcom_q6v5_prepare() - reinitialize the qcom_q6v5 context before start
>>>>>>>>>   * @q6v5:  reference to qcom_q6v5 context to be reinitialized
>>>>>>>>> @@ -162,6 +165,22 @@ int qcom_q6v5_request_stop(struct qcom_q6v5 *q6v5)
>>>>>>>>>  }
>>>>>>>>>  EXPORT_SYMBOL_GPL(qcom_q6v5_request_stop);
>>>>>>>>>
>>>>>>>>> +/**
>>>>>>>>> + * qcom_q6v5_panic() - panic handler to invoke a stop on the remote
>>>>>>>>> + * @q6v5:  reference to qcom_q6v5 context
>>>>>>>>> + *
>>>>>>>>> + * Set the stop bit and sleep in order to allow the remote processor to flush
>>>>>>>>> + * its caches etc for post mortem debugging.
>>>>>>>>> + */
>>>>>>>>> +void qcom_q6v5_panic(struct qcom_q6v5 *q6v5)
>>>>>>>>> +{
>>>>>>>>> +   qcom_smem_state_update_bits(q6v5->state,
>>>>>>>>> +                               BIT(q6v5->stop_bit), BIT(q6v5->stop_bit));
>>>>>>>>> +
>>>>>>>>> +   mdelay(Q6V5_PANIC_DELAY_MS);
>>>>>>>>
>>>>>>>> I really wonder if the delay should be part of the remoteproc core and
>>>>>>>> configurable via device tree.  Wanting the remote processor to flush its caches
>>>>>>>> is likely something other vendors will want when dealing with a kernel panic.
>>>>>>>> It would be nice to see if other people have an opinion on this topic.  If not
>>>>>>>> then we can keep the delay here and move it to the core if need be.
>>>>>>>>
>>>>>>>
>>>>>>> I gave this some more thought and what we're trying to achieve is to
>>>>>>> signal the remote processors about the panic and then give them time to
>>>>>>> react, but per the proposal (and Qualcomm downstream iirc) we will do
>>>>>>> this for each remote processor, one by one.
>>>>>>>
>>>>>>> So in the typical case of a Qualcomm platform with 4-5 remoteprocs we'll
>>>>>>> end up giving the first one a whole second to react and the last one
>>>>>>> "only" 200ms.
>>>>>>>
>>>>>>> Moving the delay to the core by iterating over rproc_list calling
>>>>>>> panic() and then delaying would be cleaner imo.
>>>>>>
>>>>>> I agree.
>>>>>>
>>>>>>>
>>>>>>> It might be nice to make this configurable in DT, but I agree that it
>>>>>>> would be nice to hear from others if this would be useful.
>>>>>>
>>>>>> I think the delay has to be configurable via DT if we move this to the
>>>>>> core.  The binding can be optional and default to 200ms if not
>>>>>> present.
>>>>>>
>>>>>
>>>>> How about I make the panic() return the required delay and then we let
>>>>> the core sleep for MAX() of the returned durations?
>>>
>>> I like it.
>>>
>>>> That way the default
>>>>> is still a property of the remoteproc drivers - and 200ms seems rather
>>>>> arbitrary to put in the core, even as a default.
>>>>
>>>> I agree with Bjorn, the delay should be provided by the platform.
>>>> But in this case i wonder if it is simpler to just let the platform take care it?
>>>
>>> If I understand you correctly, that is what Bjorn's original
>>> implementation was doing and it had drawbacks.
>> Yes, 
>> Please tell me if i missed something, the only drawback seems mentioned is the accumulative delay.
> 
> Yes, that is correct.
> 
>> Could you elaborate how to implement the delay in remote proc core for multi rproc instance.
>> Here is my view:
>> To optimize the delay it would probably be necessary to compute:
>> - the delay based on an initial date,
>> - the delay requested by each rproc instance,
>> - the delay elapsed in each rproc panic ops.
>> Feasible but not straight forward... 
>> So I suppose that you are thinking about a solution based on the store of the max delay that would be applied after last panic() return?
> 
> Yes
> 
>> anyway, how do you determine the last rproc instance? seems that a prerequisite would be that the panic ops is mandatory... 
> 
> Each ->panic() should return the amount of time to way or 0 if no delay is
> required.  If an rpoc doesn't implement ->panic() then it is treated as 0.
> From there wait for the maximum time that was collected.
> 
> It would be possible to do something more complicated like taking timestamps
> everytime a ->panic() returns and optimize the time to wait for but that may be
> for a future set.  The first implementation could go with an simple heuristic as
> detailed above.
Seems reasonable. 
A last point. i don't know if the case is realistic so i prefer to mention it:
we can imagine that a rproc platform driver already manages a panic (for instance a video decoder driver that uses a coprocessor).
In this case there is a risk that the rproc is remove during the panic. 
Depending on the panic sequence ordering this could generate a side effect (no delay as last rproc panic ops could be never called).
But seems not too tricky to take it into account in remoteproc core.
> 
>>
>> I'm not familiar with panic mechanism, but how panic ops are scheduled in SMP? Does panics ops would be treated in parallel (using msleep instead of mdelay)?
>> In this case delays could not be cumulative...
> 
> The processor that triggered the panic sequentially runs the notifier registered
> with the panic_notifier_list.  Other processors are instructed to take
> themselves offline.  As such there won't be multiple ->panic() running
> concurrently. 
Thank you for that explanation!

Regards,
Arnaud
> 
>>
>>>
>>>> For instance for stm32mp1 the stop corresponds to the reset on the remote processor core. To inform the coprocessor about an imminent shutdown we use a signal relying on a mailbox (cf. stm32_rproc_stop).
>>>> In this case we would need a delay between the signal and the reset, but not after (no cache management).
>>>
>>> Here I believe you are referring to the upper limit of 500ms that is
>>> needed for the mbox_send_message() in stm32_rproc_stop() to complete.
>>> Since that is a blocking call I think it would fit with Bjorn's
>>> proposal above if a value of '0' is returned by rproc->ops->panic().
>>> That would mean no further delays are needed (because the blocking
>>> mbox_send_message() would have done the job already).  Let me know if
>>> I'm in the weeds.
>> Yes you are :), this is what i thought, if delay implemented in core.
> 
> Not sure I understand your last reply but I _think_ we are saying the same
> thing.
> 
>>
>> Regards,
>> Arnaud
>>
>>>
>>>>
>>>> Regards,
>>>> Arnaud
>>>>>
>>>>> Regards,
>>>>> Bjorn
>>>>>

Patch

diff --git a/drivers/remoteproc/qcom_q6v5.c b/drivers/remoteproc/qcom_q6v5.c
index cb0f4a0be032..17167c980e02 100644
--- a/drivers/remoteproc/qcom_q6v5.c
+++ b/drivers/remoteproc/qcom_q6v5.c
@@ -6,6 +6,7 @@ 
  * Copyright (C) 2014 Sony Mobile Communications AB
  * Copyright (c) 2012-2013, The Linux Foundation. All rights reserved.
  */
+#include <linux/delay.h>
 #include <linux/kernel.h>
 #include <linux/platform_device.h>
 #include <linux/interrupt.h>
@@ -15,6 +16,8 @@ 
 #include <linux/remoteproc.h>
 #include "qcom_q6v5.h"
 
+#define Q6V5_PANIC_DELAY_MS	200
+
 /**
  * qcom_q6v5_prepare() - reinitialize the qcom_q6v5 context before start
  * @q6v5:	reference to qcom_q6v5 context to be reinitialized
@@ -162,6 +165,22 @@  int qcom_q6v5_request_stop(struct qcom_q6v5 *q6v5)
 }
 EXPORT_SYMBOL_GPL(qcom_q6v5_request_stop);
 
+/**
+ * qcom_q6v5_panic() - panic handler to invoke a stop on the remote
+ * @q6v5:	reference to qcom_q6v5 context
+ *
+ * Set the stop bit and sleep in order to allow the remote processor to flush
+ * its caches etc for post mortem debugging.
+ */
+void qcom_q6v5_panic(struct qcom_q6v5 *q6v5)
+{
+	qcom_smem_state_update_bits(q6v5->state,
+				    BIT(q6v5->stop_bit), BIT(q6v5->stop_bit));
+
+	mdelay(Q6V5_PANIC_DELAY_MS);
+}
+EXPORT_SYMBOL_GPL(qcom_q6v5_panic);
+
 /**
  * qcom_q6v5_init() - initializer of the q6v5 common struct
  * @q6v5:	handle to be initialized
diff --git a/drivers/remoteproc/qcom_q6v5.h b/drivers/remoteproc/qcom_q6v5.h
index 7ac92c1e0f49..c37e6fd063e4 100644
--- a/drivers/remoteproc/qcom_q6v5.h
+++ b/drivers/remoteproc/qcom_q6v5.h
@@ -42,5 +42,6 @@  int qcom_q6v5_prepare(struct qcom_q6v5 *q6v5);
 int qcom_q6v5_unprepare(struct qcom_q6v5 *q6v5);
 int qcom_q6v5_request_stop(struct qcom_q6v5 *q6v5);
 int qcom_q6v5_wait_for_start(struct qcom_q6v5 *q6v5, int timeout);
+void qcom_q6v5_panic(struct qcom_q6v5 *q6v5);
 
 #endif