diff mbox series

[net-next,9/9] net: ipa: don't disable NAPI in suspend

Message ID 20210129202019.2099259-10-elder@linaro.org
State New
Headers show
Series net: ipa: don't disable NAPI in suspend | expand

Commit Message

Alex Elder Jan. 29, 2021, 8:20 p.m. UTC
The channel stop and suspend paths both call __gsi_channel_stop(),
which quiesces channel activity, disables NAPI, and (on other than
SDM845) stops the channel.  Similarly, the start and resume paths
share __gsi_channel_start(), which starts the channel and re-enables
NAPI again.

Disabling NAPI should be done when stopping a channel, but this
should *not* be done when suspending.  It's not necessary in the
suspend path anyway, because the stopped channel (or suspended
endpoint on SDM845) will not cause interrupts to schedule NAPI,
and gsi_channel_trans_quiesce() won't return until there are no
more transactions to process in the NAPI polling loop.

Instead, enable NAPI in gsi_channel_start(), when the completion
interrupt is first enabled.  Disable it again in gsi_channel_stop(),
when finally disabling the interrupt.

Add a call to napi_synchronize() to __gsi_channel_stop(), to ensure
NAPI polling is done before moving on.

Signed-off-by: Alex Elder <elder@linaro.org>

---
 drivers/net/ipa/gsi.c | 35 +++++++++++++++++++++--------------
 1 file changed, 21 insertions(+), 14 deletions(-)

-- 
2.27.0

Comments

Willem de Bruijn Jan. 30, 2021, 3:25 p.m. UTC | #1
On Fri, Jan 29, 2021 at 3:29 PM Alex Elder <elder@linaro.org> wrote:
>

> The channel stop and suspend paths both call __gsi_channel_stop(),

> which quiesces channel activity, disables NAPI, and (on other than

> SDM845) stops the channel.  Similarly, the start and resume paths

> share __gsi_channel_start(), which starts the channel and re-enables

> NAPI again.

>

> Disabling NAPI should be done when stopping a channel, but this

> should *not* be done when suspending.  It's not necessary in the

> suspend path anyway, because the stopped channel (or suspended

> endpoint on SDM845) will not cause interrupts to schedule NAPI,

> and gsi_channel_trans_quiesce() won't return until there are no

> more transactions to process in the NAPI polling loop.


But why is it incorrect to do so?

From a quick look, virtio-net disables on both remove and freeze, for instance.

> Instead, enable NAPI in gsi_channel_start(), when the completion

> interrupt is first enabled.  Disable it again in gsi_channel_stop(),

> when finally disabling the interrupt.

>

> Add a call to napi_synchronize() to __gsi_channel_stop(), to ensure

> NAPI polling is done before moving on.

>

> Signed-off-by: Alex Elder <elder@linaro.org>

> ---

=
> @@ -894,12 +894,16 @@ int gsi_channel_start(struct gsi *gsi, u32 channel_id)

>         struct gsi_channel *channel = &gsi->channel[channel_id];

>         int ret;

>

> -       /* Enable the completion interrupt */

> +       /* Enable NAPI and the completion interrupt */

> +       napi_enable(&channel->napi);

>         gsi_irq_ieob_enable_one(gsi, channel->evt_ring_id);

>

>         ret = __gsi_channel_start(channel, true);

> -       if (ret)

> -               gsi_irq_ieob_disable_one(gsi, channel->evt_ring_id);

> +       if (!ret)

> +               return 0;

> +

> +       gsi_irq_ieob_disable_one(gsi, channel->evt_ring_id);

> +       napi_disable(&channel->napi);

>

>         return ret;

>  }


subjective, but easier to parse when the normal control flow is linear
and the error path takes a branch (or goto, if reused).
Jakub Kicinski Jan. 30, 2021, 7:22 p.m. UTC | #2
On Sat, 30 Jan 2021 10:25:16 -0500 Willem de Bruijn wrote:
> > @@ -894,12 +894,16 @@ int gsi_channel_start(struct gsi *gsi, u32 channel_id)

> >         struct gsi_channel *channel = &gsi->channel[channel_id];

> >         int ret;

> >

> > -       /* Enable the completion interrupt */

> > +       /* Enable NAPI and the completion interrupt */

> > +       napi_enable(&channel->napi);

> >         gsi_irq_ieob_enable_one(gsi, channel->evt_ring_id);

> >

> >         ret = __gsi_channel_start(channel, true);

> > -       if (ret)

> > -               gsi_irq_ieob_disable_one(gsi, channel->evt_ring_id);

> > +       if (!ret)

> > +               return 0;

> > +

> > +       gsi_irq_ieob_disable_one(gsi, channel->evt_ring_id);

> > +       napi_disable(&channel->napi);

> >

> >         return ret;

> >  }  

> 

> subjective, but easier to parse when the normal control flow is linear

> and the error path takes a branch (or goto, if reused).


FWIW - fully agreed, I always thought this was part of the kernel
coding style.
Alex Elder Jan. 31, 2021, 4:29 a.m. UTC | #3
On 1/30/21 9:25 AM, Willem de Bruijn wrote:
> On Fri, Jan 29, 2021 at 3:29 PM Alex Elder <elder@linaro.org> wrote:

>>

>> The channel stop and suspend paths both call __gsi_channel_stop(),

>> which quiesces channel activity, disables NAPI, and (on other than

>> SDM845) stops the channel.  Similarly, the start and resume paths

>> share __gsi_channel_start(), which starts the channel and re-enables

>> NAPI again.

>>

>> Disabling NAPI should be done when stopping a channel, but this

>> should *not* be done when suspending.  It's not necessary in the

>> suspend path anyway, because the stopped channel (or suspended

>> endpoint on SDM845) will not cause interrupts to schedule NAPI,

>> and gsi_channel_trans_quiesce() won't return until there are no

>> more transactions to process in the NAPI polling loop.

> 

> But why is it incorrect to do so?


Maybe it's not; I also thought it was fine before, but...

Someone at Qualcomm asked me why I thought NAPI needed
to be disabled on suspend.  My response was basically
that it was a lightweight operation, and it shouldn't
really be a problem to do so.

Then, when I posted two patches last month, Jakub's
response told me he didn't understand why I was doing
what I was doing, and I stepped back to reconsider
the details of what was happening at suspend time.
 
https://lore.kernel.org/netdev/20210107183803.47308e23@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com/

Four things were happening to suspend a channel:
quiesce activity; disable interrupt; disable NAPI;
and stop the channel.  It occurred to me that a
stopped channel would not generate interrupts, so if
the channel was stopped earlier there would be no need
to disable the interrupt.  Similarly there would be
(essentially) no need to disable NAPI once a channel
was stopped.

Underlying all of this is that I started chasing a
hang that was occurring on suspend over a month ago.
It was hard to reproduce (hundreds or thousands of
suspend/resume cycles without hitting it), and one
of the few times I actually hit the problem it was
stuck in napi_disable(), apparently waiting for
NAPI_STATE_SCHED to get cleared by napi_complete().

My best guess about how this could occur was if there
were a race of some kind between the interrupt handler
(scheduling NAPI) and the poll function (completing
it).  I found a number of problems while looking
at this, and in the past few weeks I've posted some
fixes to improve things.  Still, even with some of
these fixes in place we have seen a hang (but now
even more rarely).

So this grand rework of suspending/stopping channels
is an attempt to resolve this hang on suspend.

The channel is now stopped early, and once stopped,
everything that completed prior to the channel being
stopped is polled before considering the suspend
function done.  A stopped channel won't interrupt,
so we don't bother disabling the completion interrupt,
with no interrupts, NAPI won't be scheduled, so there's
no need to disable NAPI either.

The net result is simpler, and seems logical, and
should preclude any possible race between the interrupt
handler and poll function.  I'm trying to solve the
hang problem analytically, because it takes *so* long
to reproduce.

I'm open to other suggestions.

					-Alex

>  From a quick look, virtio-net disables on both remove and freeze, for instance.

> 

>> Instead, enable NAPI in gsi_channel_start(), when the completion

>> interrupt is first enabled.  Disable it again in gsi_channel_stop(),

>> when finally disabling the interrupt.

>>

>> Add a call to napi_synchronize() to __gsi_channel_stop(), to ensure

>> NAPI polling is done before moving on.

>>

>> Signed-off-by: Alex Elder <elder@linaro.org>

>> ---

> =

>> @@ -894,12 +894,16 @@ int gsi_channel_start(struct gsi *gsi, u32 channel_id)

>>          struct gsi_channel *channel = &gsi->channel[channel_id];

>>          int ret;

>>

>> -       /* Enable the completion interrupt */

>> +       /* Enable NAPI and the completion interrupt */

>> +       napi_enable(&channel->napi);

>>          gsi_irq_ieob_enable_one(gsi, channel->evt_ring_id);

>>

>>          ret = __gsi_channel_start(channel, true);

>> -       if (ret)

>> -               gsi_irq_ieob_disable_one(gsi, channel->evt_ring_id);

>> +       if (!ret)

>> +               return 0;

>> +

>> +       gsi_irq_ieob_disable_one(gsi, channel->evt_ring_id);

>> +       napi_disable(&channel->napi);

>>

>>          return ret;

>>   }

> 

> subjective, but easier to parse when the normal control flow is linear

> and the error path takes a branch (or goto, if reused).

>
Alex Elder Jan. 31, 2021, 4:30 a.m. UTC | #4
On 1/30/21 1:22 PM, Jakub Kicinski wrote:
> On Sat, 30 Jan 2021 10:25:16 -0500 Willem de Bruijn wrote:

>>> @@ -894,12 +894,16 @@ int gsi_channel_start(struct gsi *gsi, u32 channel_id)

>>>          struct gsi_channel *channel = &gsi->channel[channel_id];

>>>          int ret;

>>>

>>> -       /* Enable the completion interrupt */

>>> +       /* Enable NAPI and the completion interrupt */

>>> +       napi_enable(&channel->napi);

>>>          gsi_irq_ieob_enable_one(gsi, channel->evt_ring_id);

>>>

>>>          ret = __gsi_channel_start(channel, true);

>>> -       if (ret)

>>> -               gsi_irq_ieob_disable_one(gsi, channel->evt_ring_id);

>>> +       if (!ret)

>>> +               return 0;

>>> +

>>> +       gsi_irq_ieob_disable_one(gsi, channel->evt_ring_id);

>>> +       napi_disable(&channel->napi);

>>>

>>>          return ret;

>>>   }

>>

>> subjective, but easier to parse when the normal control flow is linear

>> and the error path takes a branch (or goto, if reused).

> 

> FWIW - fully agreed, I always thought this was part of the kernel

> coding style.


That's fine.  I will post a v2 of the series to fix this up.
I'll wait a bit (maybe until Monday morning), in case there's
any other input on the series to address.

Thanks both of you for your comments.

					-Alex
Alex Elder Jan. 31, 2021, 1:11 p.m. UTC | #5
On 1/30/21 10:29 PM, Alex Elder wrote:
> On 1/30/21 9:25 AM, Willem de Bruijn wrote:

>> On Fri, Jan 29, 2021 at 3:29 PM Alex Elder <elder@linaro.org> wrote:

>>>

>>> The channel stop and suspend paths both call __gsi_channel_stop(),

>>> which quiesces channel activity, disables NAPI, and (on other than

>>> SDM845) stops the channel.  Similarly, the start and resume paths

>>> share __gsi_channel_start(), which starts the channel and re-enables

>>> NAPI again.

>>>

>>> Disabling NAPI should be done when stopping a channel, but this

>>> should *not* be done when suspending.  It's not necessary in the

>>> suspend path anyway, because the stopped channel (or suspended

>>> endpoint on SDM845) will not cause interrupts to schedule NAPI,

>>> and gsi_channel_trans_quiesce() won't return until there are no

>>> more transactions to process in the NAPI polling loop.

>>

>> But why is it incorrect to do so?

> 

> Maybe it's not; I also thought it was fine before, but...

> 

> Someone at Qualcomm asked me why I thought NAPI needed

> to be disabled on suspend.  My response was basically

> that it was a lightweight operation, and it shouldn't

> really be a problem to do so.

> 

> Then, when I posted two patches last month, Jakub's

> response told me he didn't understand why I was doing

> what I was doing, and I stepped back to reconsider

> the details of what was happening at suspend time.

> 

> https://lore.kernel.org/netdev/20210107183803.47308e23@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com/


I should have mentioned that *this* response from Jakub
to a question I had also led to my conclusion that NAPI
should not be disabled on suspend--at least for IPA.
  https://lore.kernel.org/netdev/20210105122328.3e5569a4@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com/

The channel is *not* reset on suspend for IPA.

					-Alex

> Four things were happening to suspend a channel:

> quiesce activity; disable interrupt; disable NAPI;

> and stop the channel.  It occurred to me that a

> stopped channel would not generate interrupts, so if

> the channel was stopped earlier there would be no need

> to disable the interrupt.  Similarly there would be

> (essentially) no need to disable NAPI once a channel

> was stopped.

> 

> Underlying all of this is that I started chasing a

> hang that was occurring on suspend over a month ago.

> It was hard to reproduce (hundreds or thousands of

> suspend/resume cycles without hitting it), and one

> of the few times I actually hit the problem it was

> stuck in napi_disable(), apparently waiting for

> NAPI_STATE_SCHED to get cleared by napi_complete().

> 

> My best guess about how this could occur was if there

> were a race of some kind between the interrupt handler

> (scheduling NAPI) and the poll function (completing

> it).  I found a number of problems while looking

> at this, and in the past few weeks I've posted some

> fixes to improve things.  Still, even with some of

> these fixes in place we have seen a hang (but now

> even more rarely).

> 

> So this grand rework of suspending/stopping channels

> is an attempt to resolve this hang on suspend.

> 

> The channel is now stopped early, and once stopped,

> everything that completed prior to the channel being

> stopped is polled before considering the suspend

> function done.  A stopped channel won't interrupt,

> so we don't bother disabling the completion interrupt,

> with no interrupts, NAPI won't be scheduled, so there's

> no need to disable NAPI either.

> 

> The net result is simpler, and seems logical, and

> should preclude any possible race between the interrupt

> handler and poll function.  I'm trying to solve the

> hang problem analytically, because it takes *so* long

> to reproduce.

> 

> I'm open to other suggestions.

> 

>                     -Alex

> 

>>  From a quick look, virtio-net disables on both remove and freeze, for instance.

>>

>>> Instead, enable NAPI in gsi_channel_start(), when the completion

>>> interrupt is first enabled.  Disable it again in gsi_channel_stop(),

>>> when finally disabling the interrupt.

>>>

>>> Add a call to napi_synchronize() to __gsi_channel_stop(), to ensure

>>> NAPI polling is done before moving on.

>>>

>>> Signed-off-by: Alex Elder <elder@linaro.org>

>>> ---

>> =

>>> @@ -894,12 +894,16 @@ int gsi_channel_start(struct gsi *gsi, u32 channel_id)

>>>          struct gsi_channel *channel = &gsi->channel[channel_id];

>>>          int ret;

>>>

>>> -       /* Enable the completion interrupt */

>>> +       /* Enable NAPI and the completion interrupt */

>>> +       napi_enable(&channel->napi);

>>>          gsi_irq_ieob_enable_one(gsi, channel->evt_ring_id);

>>>

>>>          ret = __gsi_channel_start(channel, true);

>>> -       if (ret)

>>> -               gsi_irq_ieob_disable_one(gsi, channel->evt_ring_id);

>>> +       if (!ret)

>>> +               return 0;

>>> +

>>> +       gsi_irq_ieob_disable_one(gsi, channel->evt_ring_id);

>>> +       napi_disable(&channel->napi);

>>>

>>>          return ret;

>>>   }

>>

>> subjective, but easier to parse when the normal control flow is linear

>> and the error path takes a branch (or goto, if reused).

>>

>
Willem de Bruijn Jan. 31, 2021, 2:52 p.m. UTC | #6
On Sat, Jan 30, 2021 at 11:29 PM Alex Elder <elder@linaro.org> wrote:
>

> On 1/30/21 9:25 AM, Willem de Bruijn wrote:

> > On Fri, Jan 29, 2021 at 3:29 PM Alex Elder <elder@linaro.org> wrote:

> >>

> >> The channel stop and suspend paths both call __gsi_channel_stop(),

> >> which quiesces channel activity, disables NAPI, and (on other than

> >> SDM845) stops the channel.  Similarly, the start and resume paths

> >> share __gsi_channel_start(), which starts the channel and re-enables

> >> NAPI again.

> >>

> >> Disabling NAPI should be done when stopping a channel, but this

> >> should *not* be done when suspending.  It's not necessary in the

> >> suspend path anyway, because the stopped channel (or suspended

> >> endpoint on SDM845) will not cause interrupts to schedule NAPI,

> >> and gsi_channel_trans_quiesce() won't return until there are no

> >> more transactions to process in the NAPI polling loop.

> >

> > But why is it incorrect to do so?

>

> Maybe it's not; I also thought it was fine before, but...

>

> Someone at Qualcomm asked me why I thought NAPI needed

> to be disabled on suspend.  My response was basically

> that it was a lightweight operation, and it shouldn't

> really be a problem to do so.

>

> Then, when I posted two patches last month, Jakub's

> response told me he didn't understand why I was doing

> what I was doing, and I stepped back to reconsider

> the details of what was happening at suspend time.

>

> https://lore.kernel.org/netdev/20210107183803.47308e23@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com/

>

> Four things were happening to suspend a channel:

> quiesce activity; disable interrupt; disable NAPI;

> and stop the channel.  It occurred to me that a

> stopped channel would not generate interrupts, so if

> the channel was stopped earlier there would be no need

> to disable the interrupt.  Similarly there would be

> (essentially) no need to disable NAPI once a channel

> was stopped.

>

> Underlying all of this is that I started chasing a

> hang that was occurring on suspend over a month ago.

> It was hard to reproduce (hundreds or thousands of

> suspend/resume cycles without hitting it), and one

> of the few times I actually hit the problem it was

> stuck in napi_disable(), apparently waiting for

> NAPI_STATE_SCHED to get cleared by napi_complete().


This is important information.

What exactly do you mean by hang?

>

> My best guess about how this could occur was if there

> were a race of some kind between the interrupt handler

> (scheduling NAPI) and the poll function (completing

> it).  I found a number of problems while looking

> at this, and in the past few weeks I've posted some

> fixes to improve things.  Still, even with some of

> these fixes in place we have seen a hang (but now

> even more rarely).

>

> So this grand rework of suspending/stopping channels

> is an attempt to resolve this hang on suspend.


Do you have any data that this patchset resolves the issue, or is it
too hard to reproduce to say anything?

> The channel is now stopped early, and once stopped,

> everything that completed prior to the channel being

> stopped is polled before considering the suspend

> function done.


Does the call to gsi_channel_trans_quiesce before
gsi_channel_stop_retry leave a race where new transactions may occur
until state GSI_CHANNEL_STATE_STOPPED is reached? Asking without truly
knowing the details of this device.

> A stopped channel won't interrupt,

> so we don't bother disabling the completion interrupt,

> with no interrupts, NAPI won't be scheduled, so there's

> no need to disable NAPI either.


That sounds plausible. But it doesn't explain why napi_disable "should
*not* be done when suspending" as the commit states.

Arguably, leaving that won't have much effect either way, and is in
line with other drivers.

Your previous patchset mentions "When stopping a channel, the IPA
driver currently disables NAPI before disabling the interrupt." That
would no longer be the case.

> The net result is simpler, and seems logical, and

> should preclude any possible race between the interrupt

> handler and poll function.  I'm trying to solve the

> hang problem analytically, because it takes *so* long

> to reproduce.

>

> I'm open to other suggestions.

>

>                                         -Alex

>

> >  From a quick look, virtio-net disables on both remove and freeze, for instance.

> >

> >> Instead, enable NAPI in gsi_channel_start(), when the completion

> >> interrupt is first enabled.  Disable it again in gsi_channel_stop(),

> >> when finally disabling the interrupt.

> >>

> >> Add a call to napi_synchronize() to __gsi_channel_stop(), to ensure

> >> NAPI polling is done before moving on.

> >>

> >> Signed-off-by: Alex Elder <elder@linaro.org>

> >> ---

> > =

> >> @@ -894,12 +894,16 @@ int gsi_channel_start(struct gsi *gsi, u32 channel_id)

> >>          struct gsi_channel *channel = &gsi->channel[channel_id];

> >>          int ret;

> >>

> >> -       /* Enable the completion interrupt */

> >> +       /* Enable NAPI and the completion interrupt */

> >> +       napi_enable(&channel->napi);

> >>          gsi_irq_ieob_enable_one(gsi, channel->evt_ring_id);

> >>

> >>          ret = __gsi_channel_start(channel, true);

> >> -       if (ret)

> >> -               gsi_irq_ieob_disable_one(gsi, channel->evt_ring_id);

> >> +       if (!ret)

> >> +               return 0;

> >> +

> >> +       gsi_irq_ieob_disable_one(gsi, channel->evt_ring_id);

> >> +       napi_disable(&channel->napi);

> >>

> >>          return ret;

> >>   }

> >

> > subjective, but easier to parse when the normal control flow is linear

> > and the error path takes a branch (or goto, if reused).

> >

>
Alex Elder Jan. 31, 2021, 3:32 p.m. UTC | #7
On 1/31/21 8:52 AM, Willem de Bruijn wrote:
> On Sat, Jan 30, 2021 at 11:29 PM Alex Elder <elder@linaro.org> wrote:

>>

>> On 1/30/21 9:25 AM, Willem de Bruijn wrote:

>>> On Fri, Jan 29, 2021 at 3:29 PM Alex Elder <elder@linaro.org> wrote:

>>>>

>>>> The channel stop and suspend paths both call __gsi_channel_stop(),

>>>> which quiesces channel activity, disables NAPI, and (on other than

>>>> SDM845) stops the channel.  Similarly, the start and resume paths

>>>> share __gsi_channel_start(), which starts the channel and re-enables

>>>> NAPI again.

>>>>

>>>> Disabling NAPI should be done when stopping a channel, but this

>>>> should *not* be done when suspending.  It's not necessary in the

>>>> suspend path anyway, because the stopped channel (or suspended

>>>> endpoint on SDM845) will not cause interrupts to schedule NAPI,

>>>> and gsi_channel_trans_quiesce() won't return until there are no

>>>> more transactions to process in the NAPI polling loop.

>>>

>>> But why is it incorrect to do so?

>>

>> Maybe it's not; I also thought it was fine before, but...

>>

>> Someone at Qualcomm asked me why I thought NAPI needed

>> to be disabled on suspend.  My response was basically

>> that it was a lightweight operation, and it shouldn't

>> really be a problem to do so.

>>

>> Then, when I posted two patches last month, Jakub's

>> response told me he didn't understand why I was doing

>> what I was doing, and I stepped back to reconsider

>> the details of what was happening at suspend time.

>>

>> https://lore.kernel.org/netdev/20210107183803.47308e23@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com/

>>

>> Four things were happening to suspend a channel:

>> quiesce activity; disable interrupt; disable NAPI;

>> and stop the channel.  It occurred to me that a

>> stopped channel would not generate interrupts, so if

>> the channel was stopped earlier there would be no need

>> to disable the interrupt.  Similarly there would be

>> (essentially) no need to disable NAPI once a channel

>> was stopped.

>>

>> Underlying all of this is that I started chasing a

>> hang that was occurring on suspend over a month ago.

>> It was hard to reproduce (hundreds or thousands of

>> suspend/resume cycles without hitting it), and one

>> of the few times I actually hit the problem it was

>> stuck in napi_disable(), apparently waiting for

>> NAPI_STATE_SCHED to get cleared by napi_complete().

> 

> This is important information.

> 

> What exactly do you mean by hang?


Yes it's important!  Unfortunately I was not able to
gather details about the problem in all the cases where
it occurred.  But in at least one case I *did* confirm
it was in the situation described above.

What I mean by "hang" is that the system simply stopped
on its way down, and the IPA ->suspend callback never
completed (stuck in napi_disable).  So I expect that
the SCHED flag was never going to get cleared (because
of a race, presumably).

>> My best guess about how this could occur was if there

>> were a race of some kind between the interrupt handler

>> (scheduling NAPI) and the poll function (completing

>> it).  I found a number of problems while looking

>> at this, and in the past few weeks I've posted some

>> fixes to improve things.  Still, even with some of

>> these fixes in place we have seen a hang (but now

>> even more rarely).

>>

>> So this grand rework of suspending/stopping channels

>> is an attempt to resolve this hang on suspend.

> 

> Do you have any data that this patchset resolves the issue, or is it

> too hard to reproduce to say anything?


The data I have is that I have been running for weeks
with tens of thousands of iterations with this patch
(and the rest of them) without any hang.  Unfortunately
that doesn't guarantee anything.  I contemplated trying
to "catch" the problem and report that it *would have*
occurred had the fix not been in place, but I haven't
tried that (in part because it might not be easy).

So...  Too hard to reproduce, but I have evidence that
my testing so far has never reproduced the hang.

>> The channel is now stopped early, and once stopped,

>> everything that completed prior to the channel being

>> stopped is polled before considering the suspend

>> function done.

> 

> Does the call to gsi_channel_trans_quiesce before

> gsi_channel_stop_retry leave a race where new transactions may occur

> until state GSI_CHANNEL_STATE_STOPPED is reached? Asking without truly

> knowing the details of this device.


It should not.  For TX endpoints that have a net device, new
requests will have been stopped earlier by netif_stop_queue()
(in ipa_modem_suspend()).  For RX endpoints, receive buffers
are replenished to the hardware, but we stop that earlier
as well, in ipa_endpoint_suspend_one().  So the quiesce call
is meant to figure out what the last submitted request was
for an endpoint (channel), and then wait for that to complete.

The "hang" occurs on an RX endpoint, and in particular it
occurs on an endpoint that we *know* will be receiving a
packet as part of the suspend process (when clearing the
hardware pipeline).  I can go into that further but won't'
unless asked.

>> A stopped channel won't interrupt,

>> so we don't bother disabling the completion interrupt,

>> with no interrupts, NAPI won't be scheduled, so there's

>> no need to disable NAPI either.

> 

> That sounds plausible. But it doesn't explain why napi_disable "should

> *not* be done when suspending" as the commit states.

> 

> Arguably, leaving that won't have much effect either way, and is in

> line with other drivers.


Understood and agreed.  In fact, if the hang occurrs in
napi_disable() when waiting for NAPI_STATE_SCHED to clear,
it would occur in napi_synchronize() as well.  At this point
it's more about the whole set of rework here, and keeping
NAPI enabled during suspend seems a little cleaner.

See my followup message, about Jakub's assertion that NAPI
assumes the device will be *reset* when NAPI is disabled.
(I'm not convinced NAPI assumes that, but that doesn't matter.)
In any case, the IPA hardware does *not* reset channels when
suspended.

> Your previous patchset mentions "When stopping a channel, the IPA

> driver currently disables NAPI before disabling the interrupt." That

> would no longer be the case.


I'm not sure which patch you're referring to (and I'm in
a hurry at the moment).  But yes, with this patch we would
only disable NAPI when "really" stopping the channel, not
when suspending it.  And we'd similarly be no longer
disabling the completion interrupt on suspend either.

Thanks a lot, I appreciate the help and input on this.

					-Alex

>> The net result is simpler, and seems logical, and

>> should preclude any possible race between the interrupt

>> handler and poll function.  I'm trying to solve the

>> hang problem analytically, because it takes *so* long

>> to reproduce.

>>

>> I'm open to other suggestions.

>>

>>                                         -Alex

>>

>>>  From a quick look, virtio-net disables on both remove and freeze, for instance.

>>>

>>>> Instead, enable NAPI in gsi_channel_start(), when the completion

>>>> interrupt is first enabled.  Disable it again in gsi_channel_stop(),

>>>> when finally disabling the interrupt.

>>>>

>>>> Add a call to napi_synchronize() to __gsi_channel_stop(), to ensure

>>>> NAPI polling is done before moving on.

>>>>

>>>> Signed-off-by: Alex Elder <elder@linaro.org>

>>>> ---

>>> =

>>>> @@ -894,12 +894,16 @@ int gsi_channel_start(struct gsi *gsi, u32 channel_id)

>>>>          struct gsi_channel *channel = &gsi->channel[channel_id];

>>>>          int ret;

>>>>

>>>> -       /* Enable the completion interrupt */

>>>> +       /* Enable NAPI and the completion interrupt */

>>>> +       napi_enable(&channel->napi);

>>>>          gsi_irq_ieob_enable_one(gsi, channel->evt_ring_id);

>>>>

>>>>          ret = __gsi_channel_start(channel, true);

>>>> -       if (ret)

>>>> -               gsi_irq_ieob_disable_one(gsi, channel->evt_ring_id);

>>>> +       if (!ret)

>>>> +               return 0;

>>>> +

>>>> +       gsi_irq_ieob_disable_one(gsi, channel->evt_ring_id);

>>>> +       napi_disable(&channel->napi);

>>>>

>>>>          return ret;

>>>>   }

>>>

>>> subjective, but easier to parse when the normal control flow is linear

>>> and the error path takes a branch (or goto, if reused).

>>>

>>
Willem de Bruijn Feb. 1, 2021, 1:36 a.m. UTC | #8
On Sun, Jan 31, 2021 at 10:32 AM Alex Elder <elder@linaro.org> wrote:
>

> On 1/31/21 8:52 AM, Willem de Bruijn wrote:

> > On Sat, Jan 30, 2021 at 11:29 PM Alex Elder <elder@linaro.org> wrote:

> >>

> >> On 1/30/21 9:25 AM, Willem de Bruijn wrote:

> >>> On Fri, Jan 29, 2021 at 3:29 PM Alex Elder <elder@linaro.org> wrote:

> >>>>

> >>>> The channel stop and suspend paths both call __gsi_channel_stop(),

> >>>> which quiesces channel activity, disables NAPI, and (on other than

> >>>> SDM845) stops the channel.  Similarly, the start and resume paths

> >>>> share __gsi_channel_start(), which starts the channel and re-enables

> >>>> NAPI again.

> >>>>

> >>>> Disabling NAPI should be done when stopping a channel, but this

> >>>> should *not* be done when suspending.  It's not necessary in the

> >>>> suspend path anyway, because the stopped channel (or suspended

> >>>> endpoint on SDM845) will not cause interrupts to schedule NAPI,

> >>>> and gsi_channel_trans_quiesce() won't return until there are no

> >>>> more transactions to process in the NAPI polling loop.

> >>>

> >>> But why is it incorrect to do so?

> >>

> >> Maybe it's not; I also thought it was fine before, but...

> >>

> >> Someone at Qualcomm asked me why I thought NAPI needed

> >> to be disabled on suspend.  My response was basically

> >> that it was a lightweight operation, and it shouldn't

> >> really be a problem to do so.

> >>

> >> Then, when I posted two patches last month, Jakub's

> >> response told me he didn't understand why I was doing

> >> what I was doing, and I stepped back to reconsider

> >> the details of what was happening at suspend time.

> >>

> >> https://lore.kernel.org/netdev/20210107183803.47308e23@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com/

> >>

> >> Four things were happening to suspend a channel:

> >> quiesce activity; disable interrupt; disable NAPI;

> >> and stop the channel.  It occurred to me that a

> >> stopped channel would not generate interrupts, so if

> >> the channel was stopped earlier there would be no need

> >> to disable the interrupt.  Similarly there would be

> >> (essentially) no need to disable NAPI once a channel

> >> was stopped.

> >>

> >> Underlying all of this is that I started chasing a

> >> hang that was occurring on suspend over a month ago.

> >> It was hard to reproduce (hundreds or thousands of

> >> suspend/resume cycles without hitting it), and one

> >> of the few times I actually hit the problem it was

> >> stuck in napi_disable(), apparently waiting for

> >> NAPI_STATE_SCHED to get cleared by napi_complete().

> >

> > This is important information.

> >

> > What exactly do you mean by hang?

>

> Yes it's important!  Unfortunately I was not able to

> gather details about the problem in all the cases where

> it occurred.  But in at least one case I *did* confirm

> it was in the situation described above.

>

> What I mean by "hang" is that the system simply stopped

> on its way down, and the IPA ->suspend callback never

> completed (stuck in napi_disable).  So I expect that

> the SCHED flag was never going to get cleared (because

> of a race, presumably).

>

> >> My best guess about how this could occur was if there

> >> were a race of some kind between the interrupt handler

> >> (scheduling NAPI) and the poll function (completing

> >> it).  I found a number of problems while looking

> >> at this, and in the past few weeks I've posted some

> >> fixes to improve things.  Still, even with some of

> >> these fixes in place we have seen a hang (but now

> >> even more rarely).

> >>

> >> So this grand rework of suspending/stopping channels

> >> is an attempt to resolve this hang on suspend.

> >

> > Do you have any data that this patchset resolves the issue, or is it

> > too hard to reproduce to say anything?

>

> The data I have is that I have been running for weeks

> with tens of thousands of iterations with this patch

> (and the rest of them) without any hang.  Unfortunately

> that doesn't guarantee anything.  I contemplated trying

> to "catch" the problem and report that it *would have*

> occurred had the fix not been in place, but I haven't

> tried that (in part because it might not be easy).

>

> So...  Too hard to reproduce, but I have evidence that

> my testing so far has never reproduced the hang.

>

> >> The channel is now stopped early, and once stopped,

> >> everything that completed prior to the channel being

> >> stopped is polled before considering the suspend

> >> function done.

> >

> > Does the call to gsi_channel_trans_quiesce before

> > gsi_channel_stop_retry leave a race where new transactions may occur

> > until state GSI_CHANNEL_STATE_STOPPED is reached? Asking without truly

> > knowing the details of this device.

>

> It should not.  For TX endpoints that have a net device, new

> requests will have been stopped earlier by netif_stop_queue()

> (in ipa_modem_suspend()).  For RX endpoints, receive buffers

> are replenished to the hardware, but we stop that earlier

> as well, in ipa_endpoint_suspend_one().  So the quiesce call

> is meant to figure out what the last submitted request was

> for an endpoint (channel), and then wait for that to complete.

>

> The "hang" occurs on an RX endpoint, and in particular it

> occurs on an endpoint that we *know* will be receiving a

> packet as part of the suspend process (when clearing the

> hardware pipeline).  I can go into that further but won't'

> unless asked.

>

> >> A stopped channel won't interrupt,

> >> so we don't bother disabling the completion interrupt,

> >> with no interrupts, NAPI won't be scheduled, so there's

> >> no need to disable NAPI either.

> >

> > That sounds plausible. But it doesn't explain why napi_disable "should

> > *not* be done when suspending" as the commit states.

> >

> > Arguably, leaving that won't have much effect either way, and is in

> > line with other drivers.

>

> Understood and agreed.  In fact, if the hang occurrs in

> napi_disable() when waiting for NAPI_STATE_SCHED to clear,

> it would occur in napi_synchronize() as well.


Agreed.

So you have an environment to test a patch in, it might be worthwhile
to test essentially the same logic reordering as in this patch set,
but while still disabling napi.

The disappearing race may be due to another change rather than
napi_disable vs napi_synchronize. A smaller, more targeted patch could
also be a net (instead of net-next) candidate.

> At this point

> it's more about the whole set of rework here, and keeping

> NAPI enabled during suspend seems a little cleaner.


I'm not sure. I haven't looked if there is a common behavior across
devices. That might be informative. igb, for one, releases all
resources.

> See my followup message, about Jakub's assertion that NAPI

> assumes the device will be *reset* when NAPI is disabled.

> (I'm not convinced NAPI assumes that, but that doesn't matter.)

> In any case, the IPA hardware does *not* reset channels when

> suspended.

>

> > Your previous patchset mentions "When stopping a channel, the IPA

> > driver currently disables NAPI before disabling the interrupt." That

> > would no longer be the case.

>

> I'm not sure which patch you're referring to (and I'm in

> a hurry at the moment).  But yes, with this patch we would

> only disable NAPI when "really" stopping the channel, not

> when suspending it.  And we'd similarly be no longer

> disabling the completion interrupt on suspend either.

>

> Thanks a lot, I appreciate the help and input on this.

>

>                                         -Alex

>

> >> The net result is simpler, and seems logical, and

> >> should preclude any possible race between the interrupt

> >> handler and poll function.  I'm trying to solve the

> >> hang problem analytically, because it takes *so* long

> >> to reproduce.

> >>

> >> I'm open to other suggestions.

> >>

> >>                                         -Alex

> >>

> >>>  From a quick look, virtio-net disables on both remove and freeze, for instance.

> >>>

> >>>> Instead, enable NAPI in gsi_channel_start(), when the completion

> >>>> interrupt is first enabled.  Disable it again in gsi_channel_stop(),

> >>>> when finally disabling the interrupt.

> >>>>

> >>>> Add a call to napi_synchronize() to __gsi_channel_stop(), to ensure

> >>>> NAPI polling is done before moving on.

> >>>>

> >>>> Signed-off-by: Alex Elder <elder@linaro.org>

> >>>> ---

> >>> =

> >>>> @@ -894,12 +894,16 @@ int gsi_channel_start(struct gsi *gsi, u32 channel_id)

> >>>>          struct gsi_channel *channel = &gsi->channel[channel_id];

> >>>>          int ret;

> >>>>

> >>>> -       /* Enable the completion interrupt */

> >>>> +       /* Enable NAPI and the completion interrupt */

> >>>> +       napi_enable(&channel->napi);

> >>>>          gsi_irq_ieob_enable_one(gsi, channel->evt_ring_id);

> >>>>

> >>>>          ret = __gsi_channel_start(channel, true);

> >>>> -       if (ret)

> >>>> -               gsi_irq_ieob_disable_one(gsi, channel->evt_ring_id);

> >>>> +       if (!ret)

> >>>> +               return 0;

> >>>> +

> >>>> +       gsi_irq_ieob_disable_one(gsi, channel->evt_ring_id);

> >>>> +       napi_disable(&channel->napi);

> >>>>

> >>>>          return ret;

> >>>>   }

> >>>

> >>> subjective, but easier to parse when the normal control flow is linear

> >>> and the error path takes a branch (or goto, if reused).

> >>>

> >>

>
Alex Elder Feb. 1, 2021, 2:35 p.m. UTC | #9
On 1/31/21 7:36 PM, Willem de Bruijn wrote:
> On Sun, Jan 31, 2021 at 10:32 AM Alex Elder <elder@linaro.org> wrote:

>>

>> On 1/31/21 8:52 AM, Willem de Bruijn wrote:

>>> On Sat, Jan 30, 2021 at 11:29 PM Alex Elder <elder@linaro.org> wrote:

>>>>

>>>> On 1/30/21 9:25 AM, Willem de Bruijn wrote:

>>>>> On Fri, Jan 29, 2021 at 3:29 PM Alex Elder <elder@linaro.org> wrote:

>>>>>>

>>>>>> The channel stop and suspend paths both call __gsi_channel_stop(),

>>>>>> which quiesces channel activity, disables NAPI, and (on other than

>>>>>> SDM845) stops the channel.  Similarly, the start and resume paths

>>>>>> share __gsi_channel_start(), which starts the channel and re-enables

>>>>>> NAPI again.

>>>>>>

>>>>>> Disabling NAPI should be done when stopping a channel, but this

>>>>>> should *not* be done when suspending.  It's not necessary in the

>>>>>> suspend path anyway, because the stopped channel (or suspended

>>>>>> endpoint on SDM845) will not cause interrupts to schedule NAPI,

>>>>>> and gsi_channel_trans_quiesce() won't return until there are no

>>>>>> more transactions to process in the NAPI polling loop.

>>>>>

>>>>> But why is it incorrect to do so?

>>>>

>>>> Maybe it's not; I also thought it was fine before, but...


. . .

>> The "hang" occurs on an RX endpoint, and in particular it

>> occurs on an endpoint that we *know* will be receiving a

>> packet as part of the suspend process (when clearing the

>> hardware pipeline).  I can go into that further but won't'

>> unless asked.

>>

>>>> A stopped channel won't interrupt,

>>>> so we don't bother disabling the completion interrupt,

>>>> with no interrupts, NAPI won't be scheduled, so there's

>>>> no need to disable NAPI either.

>>>

>>> That sounds plausible. But it doesn't explain why napi_disable "should

>>> *not* be done when suspending" as the commit states.

>>>

>>> Arguably, leaving that won't have much effect either way, and is in

>>> line with other drivers.

>>

>> Understood and agreed.  In fact, if the hang occurrs in

>> napi_disable() when waiting for NAPI_STATE_SCHED to clear,

>> it would occur in napi_synchronize() as well.

> 

> Agreed.

> 

> So you have an environment to test a patch in, it might be worthwhile

> to test essentially the same logic reordering as in this patch set,

> but while still disabling napi.


What is the purpose of this test?  Just to guarantee
that the NAPI hang goes away?  Because you agree that
the napi_schedule() call would *also* hang if that
problem exists, right?

Anyway, what you're suggesting is to simply test with
this last patch removed.  I can do that but I really
don't expect it to change anything.  I will start that
test later today when I'm turning my attention to
something else for a while.

> The disappearing race may be due to another change rather than

> napi_disable vs napi_synchronize. A smaller, more targeted patch could

> also be a net (instead of net-next) candidate.


I am certain it is.

I can tell you that we have seen a hang (after I think 2500+
suspend/resume cycles) with the IPA code that is currently
upstream.

But with this latest series of 9, there is no hang after
10,000+ cycles.  That gives me a bisect window, but I really
don't want to go through a full bisect of even those 9,
because it's 4 tests, each of which takes days to complete.

Looking at the 9 patches, I think this one is the most
likely culprit:
   net: ipa: disable IEOB interrupt after channel stop

I think the race involves the I/O completion handler
interacting with NAPI in an unwanted way, but I have
not come up with the exact sequence that would lead
to getting stuck in napi_disable().

Here are some possible events that could occur on an
RX channel in *some* order, prior to that patch.  And
in the order I show there's at least a problem of a
receive not being processed immediately.

		. . . (suspend initiated)

	replenish_stop()
	quiesce()
			IRQ fires (receive ready)
	napi_disable()
			napi_schedule() (ignored)
	irq_disable()
			IRQ condition; pending
	channel_stop()

		. . . (resume triggered)

	channel_start()
	irq_enable()
			pending IRQ fires
			napi_schedule() (ignored)
	napi_enable()

		. . . (suspend initiated)

>> At this point

>> it's more about the whole set of rework here, and keeping

>> NAPI enabled during suspend seems a little cleaner.

> 

> I'm not sure. I haven't looked if there is a common behavior across

> devices. That might be informative. igb, for one, releases all

> resources.


I tried to do a survey of that too and did not see a
consistent pattern.  I didn't look *that* hard because
doing so would be more involved than I wanted to get.

So in summary:
- I'm putting together version 2 of this series now
- Testing this past week seems to show that this series
   makes the hang in napi_disable() (or synchronize)
   go away.
- I think the most likely patch in this series that
   fixes the problem is the IRQ ordering one I mention
   above, but right now I can't cite a specific sequence
   of events that would prove it.
- I will begin some long testing later today without
   this last patch applied
     --> But I think testing without the IRQ ordering
	patch would be more promising, and I'd like
	to hear your opinion on that

Thanks again for your input and help on this.

					-Alex

. . .
Willem de Bruijn Feb. 1, 2021, 2:47 p.m. UTC | #10
On Mon, Feb 1, 2021 at 9:35 AM Alex Elder <elder@linaro.org> wrote:
>

> On 1/31/21 7:36 PM, Willem de Bruijn wrote:

> > On Sun, Jan 31, 2021 at 10:32 AM Alex Elder <elder@linaro.org> wrote:

> >>

> >> On 1/31/21 8:52 AM, Willem de Bruijn wrote:

> >>> On Sat, Jan 30, 2021 at 11:29 PM Alex Elder <elder@linaro.org> wrote:

> >>>>

> >>>> On 1/30/21 9:25 AM, Willem de Bruijn wrote:

> >>>>> On Fri, Jan 29, 2021 at 3:29 PM Alex Elder <elder@linaro.org> wrote:

> >>>>>>

> >>>>>> The channel stop and suspend paths both call __gsi_channel_stop(),

> >>>>>> which quiesces channel activity, disables NAPI, and (on other than

> >>>>>> SDM845) stops the channel.  Similarly, the start and resume paths

> >>>>>> share __gsi_channel_start(), which starts the channel and re-enables

> >>>>>> NAPI again.

> >>>>>>

> >>>>>> Disabling NAPI should be done when stopping a channel, but this

> >>>>>> should *not* be done when suspending.  It's not necessary in the

> >>>>>> suspend path anyway, because the stopped channel (or suspended

> >>>>>> endpoint on SDM845) will not cause interrupts to schedule NAPI,

> >>>>>> and gsi_channel_trans_quiesce() won't return until there are no

> >>>>>> more transactions to process in the NAPI polling loop.

> >>>>>

> >>>>> But why is it incorrect to do so?

> >>>>

> >>>> Maybe it's not; I also thought it was fine before, but...

>

> . . .

>

> >> The "hang" occurs on an RX endpoint, and in particular it

> >> occurs on an endpoint that we *know* will be receiving a

> >> packet as part of the suspend process (when clearing the

> >> hardware pipeline).  I can go into that further but won't'

> >> unless asked.

> >>

> >>>> A stopped channel won't interrupt,

> >>>> so we don't bother disabling the completion interrupt,

> >>>> with no interrupts, NAPI won't be scheduled, so there's

> >>>> no need to disable NAPI either.

> >>>

> >>> That sounds plausible. But it doesn't explain why napi_disable "should

> >>> *not* be done when suspending" as the commit states.

> >>>

> >>> Arguably, leaving that won't have much effect either way, and is in

> >>> line with other drivers.

> >>

> >> Understood and agreed.  In fact, if the hang occurrs in

> >> napi_disable() when waiting for NAPI_STATE_SCHED to clear,

> >> it would occur in napi_synchronize() as well.

> >

> > Agreed.

> >

> > So you have an environment to test a patch in, it might be worthwhile

> > to test essentially the same logic reordering as in this patch set,

> > but while still disabling napi.

>

> What is the purpose of this test?  Just to guarantee

> that the NAPI hang goes away?  Because you agree that

> the napi_schedule() call would *also* hang if that

> problem exists, right?

>

> Anyway, what you're suggesting is to simply test with

> this last patch removed.  I can do that but I really

> don't expect it to change anything.  I will start that

> test later today when I'm turning my attention to

> something else for a while.

>

> > The disappearing race may be due to another change rather than

> > napi_disable vs napi_synchronize. A smaller, more targeted patch could

> > also be a net (instead of net-next) candidate.

>

> I am certain it is.

>

> I can tell you that we have seen a hang (after I think 2500+

> suspend/resume cycles) with the IPA code that is currently

> upstream.

>

> But with this latest series of 9, there is no hang after

> 10,000+ cycles.  That gives me a bisect window, but I really

> don't want to go through a full bisect of even those 9,

> because it's 4 tests, each of which takes days to complete.

>

> Looking at the 9 patches, I think this one is the most

> likely culprit:

>    net: ipa: disable IEOB interrupt after channel stop

>

> I think the race involves the I/O completion handler

> interacting with NAPI in an unwanted way, but I have

> not come up with the exact sequence that would lead

> to getting stuck in napi_disable().

>

> Here are some possible events that could occur on an

> RX channel in *some* order, prior to that patch.  And

> in the order I show there's at least a problem of a

> receive not being processed immediately.

>

>                 . . . (suspend initiated)

>

>         replenish_stop()

>         quiesce()

>                         IRQ fires (receive ready)

>         napi_disable()

>                         napi_schedule() (ignored)

>         irq_disable()

>                         IRQ condition; pending

>         channel_stop()

>

>                 . . . (resume triggered)

>

>         channel_start()

>         irq_enable()

>                         pending IRQ fires

>                         napi_schedule() (ignored)

>         napi_enable()

>

>                 . . . (suspend initiated)

>

> >> At this point

> >> it's more about the whole set of rework here, and keeping

> >> NAPI enabled during suspend seems a little cleaner.

> >

> > I'm not sure. I haven't looked if there is a common behavior across

> > devices. That might be informative. igb, for one, releases all

> > resources.

>

> I tried to do a survey of that too and did not see a

> consistent pattern.  I didn't look *that* hard because

> doing so would be more involved than I wanted to get.


Okay. If there is no consistent pattern, either approach works.

I'm fine with this patchset as is.

> So in summary:

> - I'm putting together version 2 of this series now

> - Testing this past week seems to show that this series

>    makes the hang in napi_disable() (or synchronize)

>    go away.

> - I think the most likely patch in this series that

>    fixes the problem is the IRQ ordering one I mention

>    above, but right now I can't cite a specific sequence

>    of events that would prove it.

> - I will begin some long testing later today without

>    this last patch applied

>      --> But I think testing without the IRQ ordering

>         patch would be more promising, and I'd like

>         to hear your opinion on that


Either test depends on whether you find it worthwhile to more
specifically identify the root cause. If not, no need to run the tests
on my behalf. I understand that these are time consuming.

>

> Thanks again for your input and help on this.

>

>                                         -Alex

>

> . . .
Alex Elder Feb. 1, 2021, 3:48 p.m. UTC | #11
On 2/1/21 8:47 AM, Willem de Bruijn wrote:
>> I tried to do a survey of that too and did not see a

>> consistent pattern.  I didn't look*that*  hard because

>> doing so would be more involved than I wanted to get.

> Okay. If there is no consistent pattern, either approach works.

> 

> I'm fine with this patchset as is.


OK.  I'm sending version 2 of the series to address the
comment about returning success early--before the end
of the function.  I'm also tweaking one or two patch
descriptions.

As I looked at the series I saw some other things I'd
like to do a little differently.  But to keep things
simple I'm going to do that in a follow-on series
instead.

Thank you very much for your time on this.

					-Alex
Alex Elder Feb. 1, 2021, 6:38 p.m. UTC | #12
On 2/1/21 8:47 AM, Willem de Bruijn wrote:
>> - I will begin some long testing later today without

>>     this last patch applied

>>       --> But I think testing without the IRQ ordering

>>          patch would be more promising, and I'd like

>>          to hear your opinion on that

> Either test depends on whether you find it worthwhile to more

> specifically identify the root cause. If not, no need to run the tests

> on my behalf. I understand that these are time consuming.


I *would* like to really understand the root cause.
And a month ago I would have absolutely gone through
a bisect process so I could narrow it down.

But at this point I'm content to have it fixed (fingers
crossed) and am more interested in putting all this
behind me.  It's taken a lot of time and attention.

And even though this represents a real bug, at least
for now I am content to keep all this work in net-next
rather than isolate the specific problem and try to
back-port it (without all the other 20+ commits that
preceded these) to the net branch.

So although I see the value in identifying the specific
root cause, I'm not going to do that unless/until it
becomes clear it *must* be done (to back-port to net).

					-Alex
diff mbox series

Patch

diff --git a/drivers/net/ipa/gsi.c b/drivers/net/ipa/gsi.c
index 217ca21bfe043..afc5c9ede01af 100644
--- a/drivers/net/ipa/gsi.c
+++ b/drivers/net/ipa/gsi.c
@@ -876,15 +876,15 @@  static int __gsi_channel_start(struct gsi_channel *channel, bool start)
 	struct gsi *gsi = channel->gsi;
 	int ret;
 
+	if (!start)
+		return 0;
+
 	mutex_lock(&gsi->mutex);
 
-	ret = start ? gsi_channel_start_command(channel) : 0;
+	ret = gsi_channel_start_command(channel);
 
 	mutex_unlock(&gsi->mutex);
 
-	if (!ret)
-		napi_enable(&channel->napi);
-
 	return ret;
 }
 
@@ -894,12 +894,16 @@  int gsi_channel_start(struct gsi *gsi, u32 channel_id)
 	struct gsi_channel *channel = &gsi->channel[channel_id];
 	int ret;
 
-	/* Enable the completion interrupt */
+	/* Enable NAPI and the completion interrupt */
+	napi_enable(&channel->napi);
 	gsi_irq_ieob_enable_one(gsi, channel->evt_ring_id);
 
 	ret = __gsi_channel_start(channel, true);
-	if (ret)
-		gsi_irq_ieob_disable_one(gsi, channel->evt_ring_id);
+	if (!ret)
+		return 0;
+
+	gsi_irq_ieob_disable_one(gsi, channel->evt_ring_id);
+	napi_disable(&channel->napi);
 
 	return ret;
 }
@@ -928,13 +932,13 @@  static int __gsi_channel_stop(struct gsi_channel *channel, bool stop)
 {
 	int ret;
 
+	/* Wait for any underway transactions to complete before stopping. */
 	gsi_channel_trans_quiesce(channel);
-	napi_disable(&channel->napi);
 
 	ret = stop ? gsi_channel_stop_retry(channel) : 0;
-
-	if (ret)
-		napi_enable(&channel->napi);
+	/* Finally, ensure NAPI polling has finished. */
+	if (!ret)
+		napi_synchronize(&channel->napi);
 
 	return ret;
 }
@@ -947,10 +951,13 @@  int gsi_channel_stop(struct gsi *gsi, u32 channel_id)
 
 	/* Only disable the completion interrupt if stop is successful */
 	ret = __gsi_channel_stop(channel, true);
-	if (!ret)
-		gsi_irq_ieob_disable_one(gsi, channel->evt_ring_id);
+	if (ret)
+		return ret;
 
-	return ret;
+	gsi_irq_ieob_disable_one(gsi, channel->evt_ring_id);
+	napi_disable(&channel->napi);
+
+	return 0;
 }
 
 /* Reset and reconfigure a channel, (possibly) enabling the doorbell engine */