mbox series

[v2,0/2] USB: Quiesce interrupts across pm freeze

Message ID 20220418210046.2060937-1-evgreen@chromium.org
Headers show
Series USB: Quiesce interrupts across pm freeze | expand

Message

Evan Green April 18, 2022, 9 p.m. UTC
The documentation for the freeze() method says that it "should quiesce
the device so that it doesn't generate IRQs or DMA". The unspoken
consequence of not doing this is that MSIs aimed at non-boot CPUs may
get fully lost if they're sent during the period where the target CPU is
offline.

The current behavior of the USB subsystem still allows interrupts to
come in after freeze, both in terms of remote wakeups and HC events
related to things like root plug port activity. This can get controllers
like XHCI, which is very sensitive to lost interrupts, in a wedged
state. This series attempts to fully quiesce interrupts coming from USB
across in a freeze or quiescent state.

These patches are grouped together because they serve a united purpose,
but are actually independent. They could be merged or reverted
individually.

You may be able to reproduce this issue on your own machine via the
following:
1. Disable runtime PM on your XHCI controller
2. Aim your XHCI IRQ at a non-boot CPU (replace 174): echo 2 >
   /proc/irq/174/smp_affinity
3. Attempt to hibernate (no need to actually go all the way down).

I run 2 and 3 in a loop, and can usually hit a hang or dead XHCI
controller within 1-2 iterations. I happened to notice this on an
Alderlake system where runtime PM is accidentally disabled for one of
the XHCI controllers. Some more discussion and debugging can be found at
[1].

[1] https://lore.kernel.org/linux-pci/CAE=gft4a-QL82iFJE_xRQ3JrMmz-KZKWREtz=MghhjFbJeK=8A@mail.gmail.com/T/#u

Changes in v2:
 - Added the patch modifying the remote wakeup state
 - Removed the change to freeze_noirq/thaw_noirq

Evan Green (2):
  USB: core: Disable remote wakeup for freeze/quiesce
  USB: hcd-pci: Fully suspend across freeze/thaw cycle

 drivers/usb/core/driver.c  | 20 +++++++++-----------
 drivers/usb/core/hcd-pci.c |  4 ++--
 2 files changed, 11 insertions(+), 13 deletions(-)

Comments

Alan Stern April 19, 2022, 2:35 p.m. UTC | #1
On Tue, Apr 19, 2022 at 09:05:53AM +0200, Oliver Neukum wrote:
> 
> 
> On 18.04.22 23:00, Evan Green wrote:
> > The documentation for the freeze() method says that it "should quiesce
> > the device so that it doesn't generate IRQs or DMA". The unspoken
> > consequence of not doing this is that MSIs aimed at non-boot CPUs may
> > get fully lost if they're sent during the period where the target CPU is
> > offline.
> >
> > The current behavior of the USB subsystem still allows interrupts to
> > come in after freeze, both in terms of remote wakeups and HC events
> > related to things like root plug port activity. This can get controllers
> > like XHCI, which is very sensitive to lost interrupts, in a wedged
> > state. This series attempts to fully quiesce interrupts coming from USB
> > across in a freeze or quiescent state.
> >
> > These patches are grouped together because they serve a united purpose,
> > but are actually independent. They could be merged or reverted
> > individually.
> Hi,
> 
> sorry for being a bit late in this discussion. There was something that
> I didn't remember immediately.
> 
> We have a set of quirky devices that need HID_QUIRK_ALWAYS_POLL.
> They have the nasty firmware bug that, if you suspend them without
> remote wakeup, they will crash or reset themselves.
> I am afraid that has an obvious relevance to your cool patches.
> I am not completely sure how to deal with this. It seems to me that the
> quirk will need to be shifted from HID to core USB and thaw() needs to
> be translated into usb_device_reset() + reset_resume() for them,
> but I am not really sure about the optimal mechanism.

We may not need to do anything.  This patch specifically addresses 
hibernation, not system suspend or runtime suspend.  A device crashing 
or resetting during hibernation is not at all unusual; we should be able 
to handle such cases properly.

The THAW part of suspend-to-hibernation is used only for writing the 
memory image to permanent storage.  I doubt that a malfunctioning HID 
device would interfere with this process.

Alan Stern
Evan Green April 20, 2022, 7:30 p.m. UTC | #2
On Tue, Apr 19, 2022 at 7:41 AM Alan Stern <stern@rowland.harvard.edu> wrote:
>
> On Mon, Apr 18, 2022 at 02:00:45PM -0700, Evan Green wrote:
> > The PM_EVENT_FREEZE and PM_EVENT_QUIESCE messages should cause the
> > device to stop generating interrupts. USB core was previously allowing
> > devices that were already runtime suspended to keep remote wakeup
> > enabled if they had gone down that way. This violates the contract with
> > pm, and can potentially cause MSI interrupts to be lost.
> >
> > Change that so that if a device is runtime suspended with remote wakeups
> > enabled, it will be resumed to ensure remote wakeup is always disabled
> > across a freeze.
> >
> > Signed-off-by: Evan Green <evgreen@chromium.org>
> > ---
> >
> > (no changes since v1)
> >
> >  drivers/usb/core/driver.c | 20 +++++++++-----------
> >  1 file changed, 9 insertions(+), 11 deletions(-)
> >
> > diff --git a/drivers/usb/core/driver.c b/drivers/usb/core/driver.c
> > index 355ed33a21792b..93c8cf66adccec 100644
> > --- a/drivers/usb/core/driver.c
> > +++ b/drivers/usb/core/driver.c
> > @@ -1533,20 +1533,18 @@ static void choose_wakeup(struct usb_device *udev, pm_message_t msg)
> >  {
> >       int     w;
> >
> > -     /* Remote wakeup is needed only when we actually go to sleep.
> > -      * For things like FREEZE and QUIESCE, if the device is already
> > -      * autosuspended then its current wakeup setting is okay.
> > +     /* For FREEZE/QUIESCE, disable remote wakeups so no interrupts get generated
> > +      * by the device.
>
> You mean "by the host controller".  USB devices don't generate
> interrupts; they generate wakeup requests (which can cause a host
> controller to generate an interrupt).

Right, I guess I mean "at the behest of the device". I could probably
just delete "by the device", since the goal of the comment is simply
to highlight that we're trying to kill interrupts, and describing
their provenance isn't as relevant.

>
> >        */
> >       if (msg.event == PM_EVENT_FREEZE || msg.event == PM_EVENT_QUIESCE) {
> > -             if (udev->state != USB_STATE_SUSPENDED)
> > -                     udev->do_remote_wakeup = 0;
> > -             return;
> > -     }
> > +             w = 0;
> >
> > -     /* Enable remote wakeup if it is allowed, even if no interface drivers
> > -      * actually want it.
> > -      */
> > -     w = device_may_wakeup(&udev->dev);
> > +     } else {
> > +             /* Enable remote wakeup if it is allowed, even if no interface drivers
> > +              * actually want it.
> > +              */
> > +             w = device_may_wakeup(&udev->dev);
> > +     }
> >
> >       /* If the device is autosuspended with the wrong wakeup setting,
> >        * autoresume now so the setting can be changed.
> > --
>
> I would prefer it if you reformatted the comments to agree with the
> current style:
>
>         /*
>          * Blah blah blah
>          */
>
> and to avoid line wrap beyond 80 columns.  Apart from that:

Sure, I can fix these up, add your tags, and repost.

>
> Acked-by: Alan Stern <stern@rowland.harvard.edu>

Thanks!

>
> Alan Stern