mbox series

[0/2] serial: Fix problems when serial transfer is happening at suspend time

Message ID 20240523232216.3148367-1-dianders@chromium.org
Headers show
Series serial: Fix problems when serial transfer is happening at suspend time | expand

Message

Doug Anderson May 23, 2024, 11:22 p.m. UTC
This is a set of two patches that fix problems related to suspending
while a serial transfer is going on. The two patches are independent
from each other and can land in any order. The only thing tying them
together is that I used the same test to reproduce both of them.
Specifically, I could reproduce my problemes by logging in via an
agetty on the debug serial port (which was _not_ used for kernel
console) and running:
  cat /var/log/messages
...and then (via an SSH session) forcing a few suspend/resume cycles.

The first patch solves a problem that is probably more major. It was
introduced recently and has even shown up in stable trees.
Suspend/resume testing in ChromeOS test labs are hitting the problem
fixed by this patch. The fix hasn't been tested in labs, but when I
reproduced the problem locally I could see that the fix worked. IMO it
should land ASAP.

The second patch fixes an ancient problem that I only found because I
was trying to reproduce the first problem. Given how long it's been
around it's probably not urgent but it would be nice to get fixed.


Douglas Anderson (2):
  serial: port: Don't block system suspend even if bytes are left to
    xmit
  serial: qcom-geni: Fix qcom_geni_serial_stop_tx_fifo() while xfer

 drivers/tty/serial/qcom_geni_serial.c | 45 +++++++++++++++++++++++++--
 drivers/tty/serial/serial_port.c      | 10 ++++++
 2 files changed, 52 insertions(+), 3 deletions(-)

Comments

Doug Anderson May 24, 2024, 3:20 p.m. UTC | #1
Hi,

On Thu, May 23, 2024 at 5:25 PM Stephen Boyd <swboyd@chromium.org> wrote:
>
> Quoting Douglas Anderson (2024-05-23 16:22:12)
> > Recently, suspend testing on sc7180-trogdor based devices has started
> > to sometimes fail with messages like this:
> >
> >   port a88000.serial:0.0: PM: calling pm_runtime_force_suspend+0x0/0xf8 @ 28934, parent: a88000.serial:0
> >   port a88000.serial:0.0: PM: dpm_run_callback(): pm_runtime_force_suspend+0x0/0xf8 returns -16
> >   port a88000.serial:0.0: PM: pm_runtime_force_suspend+0x0/0xf8 returned -16 after 33 usecs
> >   port a88000.serial:0.0: PM: failed to suspend: error -16
> >
> > I could reproduce these problem by logging in via an agetty on the
> > debug serial port (which was _not_ used for kernel console) and
> > running:
> >   cat /var/log/messages
> > ...and then (via an SSH session) forcing a few suspend/resume cycles.
> >
> > Tracing through the code and doing some printf debugging shows that
> > the -16 (-EBUSY) comes from the recently added
> > serial_port_runtime_suspend().
> >
> > The idea of the serial_port_runtime_suspend() function is to prevent
> > the port from being _runtime_ suspended if it still has bytes left to
> > transmit. Having bytes left to transmit isn't a reason to block
> > _system_ suspend, though.
>
> Can you elaborate? I paused to think that maybe we would want to make
> sure that everything that was transmitted had been transmitted but that
> doesn't seem right because it's a problem for higher layers to solve,
> e.g. serdev would want to make sure some sleep command sent over the
> wire actually got sent.

I don't have a ton of intuition about how the new "serial_port" code
is designed to work. The patch that added it mentioned that it was
based on suggestions by a whole bunch of people but there were no
links to the previous discussions so it wasn't easy to get tons of
context. I would certainly love it if people who have been involved
could chime in and say whether my patch is correct. If anyone has any
suggestions for something better to say in the commit message I'm
happy to use different wording as well.

In any case, looking at it I guess a serdev device would use
serdev_device_wait_until_sent() to ensure its command was sent. That
eventually seems to trickle down to the UART function tx_empty(). If a
serdev device needs to block suspend while waiting then that would be
up to the serdev device. This could be implicit if the
serdev_device_wait_until_sent() was being called as part of the serdev
device's suspend() function or perhaps could be through some sort of
locking.

...so really I think the case we're running into is if userspace has a
whole bunch of bytes queued up to write out the UART. That shouldn't
block suspend. If userspace needs to block suspend they should use
another method.


> > The DEFINE_RUNTIME_DEV_PM_OPS() used by the
> > serial_port code means that the system suspend function will be
> > pm_runtime_force_suspend(). In pm_runtime_force_suspend() we can see
> > that before calling the runtime suspend function we'll call
> > pm_runtime_disable(). This should be a reliable way to detect that
> > we're called from system suspend and that we shouldn't look for
> > busyness.
> >
> > Fixes: 43066e32227e ("serial: port: Don't suspend if the port is still busy")
> > Signed-off-by: Douglas Anderson <dianders@chromium.org>
> > ---
> >
> >  drivers/tty/serial/serial_port.c | 10 ++++++++++
> >  1 file changed, 10 insertions(+)
> >
> > diff --git a/drivers/tty/serial/serial_port.c b/drivers/tty/serial/serial_port.c
> > index 91a338d3cb34..b781227cc996 100644
> > --- a/drivers/tty/serial/serial_port.c
> > +++ b/drivers/tty/serial/serial_port.c
> > @@ -64,6 +64,16 @@ static int serial_port_runtime_suspend(struct device *dev)
> >         if (port->flags & UPF_DEAD)
> >                 return 0;
> >
> > +       /*
> > +        * We only want to check the busyness of the port if PM Runtime is
> > +        * enabled. Specifically PM Runtime will be disabled by
> > +        * pm_runtime_force_suspend() during system suspend and we don't want
> > +        * to block system suspend even if there is data still left to
> > +        * transmit. We only want to block regulator PM Runtime transitions.
>
> s/regulator/regular/
>
> Is this a typo?

Yeah, I'll fix that to "regular". I'll wait (probably till Tuesday)
before sending a v2.

> Also, why is "runtime" capitalized?

Somehow I thought it was supposed to be, but you're right that it
looks weird. I guess the docs call it "runtime PM" so I'll change all
instances to that in v2.

-Doug