[v4,0/8] serial: qcom-geni: Overhaul TX handling to fix crashes/hangs

Message ID 20240610222515.3023730-1-dianders@chromium.org

Message

Doug Anderson June 10, 2024, 10:24 p.m. UTC
While trying to reproduce -EBUSY errors that our lab was getting in
suspend/resume testing, I ended up finding a whole pile of problems
with the Qualcomm GENI serial driver. I've posted a fix for the -EBUSY
issue separately [1]. This series is fixing all of the Qualcomm GENI
problems that I found.

As far as I can tell most of the problems have been in the Qualcomm
GENI serial driver since inception, but it can be noted that the
behavior got worse with the new kfifo changes. Previously when the OS
took data out of the circular queue we'd just spit stale data onto the
serial port. Now we'll hard lockup. :-P

I've tried to break this series up as much as possible to make it
easier to understand but the final patch is still a lot of change at
once. Hopefully it's OK.

[1] https://lore.kernel.org/r/20240530084841.v2.1.I2395e66cf70c6e67d774c56943825c289b9c13e4@changeid

Changes in v4:
- Add GP_LENGTH field definition.
- Fix indentation.
- GENMASK(31, 0) -> GP_LENGTH.
- Use uart_fifo_timeout_ms() for timeout.
- tty: serial: Add uart_fifo_timeout_ms()

Changes in v3:
- 0xffffffff => GENMASK(31, 0)
- Reword commit message.
- Use uart_fifo_timeout() for timeout.

Changes in v2:
- Totally rework / rename patch to handle suspend while active xfer
- serial: qcom-geni: Fix arg types for qcom_geni_serial_poll_bit()
- serial: qcom-geni: Fix the timeout in qcom_geni_serial_poll_bit()
- serial: qcom-geni: Introduce qcom_geni_serial_poll_bitfield()
- serial: qcom-geni: Just set the watermark level once
- serial: qcom-geni: Rework TX in FIFO mode to fix hangs/lockups
- soc: qcom: geni-se: Add GP_LENGTH/IRQ_EN_SET/IRQ_EN_CLEAR registers

Douglas Anderson (8):
  soc: qcom: geni-se: Add GP_LENGTH/IRQ_EN_SET/IRQ_EN_CLEAR registers
  tty: serial: Add uart_fifo_timeout_ms()
  serial: qcom-geni: Fix the timeout in qcom_geni_serial_poll_bit()
  serial: qcom-geni: Fix arg types for qcom_geni_serial_poll_bit()
  serial: qcom-geni: Introduce qcom_geni_serial_poll_bitfield()
  serial: qcom-geni: Just set the watermark level once
  serial: qcom-geni: Fix suspend while active UART xfer
  serial: qcom-geni: Rework TX in FIFO mode to fix hangs/lockups

 drivers/tty/serial/qcom_geni_serial.c | 322 +++++++++++++++-----------
 include/linux/serial_core.h           |  15 +-
 include/linux/soc/qcom/geni-se.h      |   9 +
 3 files changed, 206 insertions(+), 140 deletions(-)

Comments

Konrad Dybcio June 17, 2024, 6:46 p.m. UTC | #1
On 6/11/24 00:24, Douglas Anderson wrote:
> The qcom_geni_serial_poll_bit() is supposed to be able to be used to
> poll a bit that will become set when a TX transfer finishes. Because
> of this it tries to set its timeout based on how long the UART will
> take to shift out all of the queued bytes. There are two problems
> here:
> 1. There appears to be a hidden extra word on the firmware side which
>     is the word that the firmware has already taken out of the FIFO and
>     is currently shifting out. We need to account for this.
> 2. The timeout calculation was assuming that it would only need 8 bits
>     on the wire to shift out 1 byte. This isn't true. Typically 10 bits
>     are used (8 data bits, 1 start and 1 stop bit), but as much as 13
>     bits could be used (14 if we allowed 9 bits per byte, which we
>     don't).
> 
> The too-short timeout was seen causing problems in a future patch
> which more properly waited for bytes to transfer out of the UART
> before cancelling.
> 
> Rather than fix the calculation, replace it with the core-provided
> uart_fifo_timeout() function.
> 
> NOTE: during earlycon, uart_fifo_timeout() has the same limitations
> about not being able to figure out the exact timeout that the old
> function did. Luckily uart_fifo_timeout() returns the same default
> timeout of 20ms in this case. We'll add a comment about it, though, to
> make it more obvious what's happening.
> 
> Fixes: c4f528795d1a ("tty: serial: msm_geni_serial: Add serial driver support for GENI based QUP")
> Suggested-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
> Signed-off-by: Douglas Anderson <dianders@chromium.org>
> ---

Acked-by: Konrad Dybcio <konrad.dybcio@linaro.org>

Konrad
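
The timeout arithmetic described in the quoted commit message can be sketched in plain C. This is only an illustration, not the kernel's uart_fifo_timeout() helper: the function name, the 64-byte FIFO and the 4-byte "hidden" word used in the usage note are assumptions made for the example.

```c
#include <assert.h>

/*
 * Milliseconds needed to shift out `fifo_bytes` queued bytes plus
 * `extra_bytes` that the firmware has already taken out of the FIFO
 * (the "hidden extra word" from the commit message).
 *
 * `bits_per_char` is the number of bits on the wire per byte, e.g. 10
 * for 8 data bits + 1 start + 1 stop. The result is rounded up so the
 * timeout is never shorter than the real transfer time.
 */
static unsigned int drain_time_ms(unsigned int fifo_bytes,
				  unsigned int extra_bytes,
				  unsigned int bits_per_char,
				  unsigned int baud)
{
	unsigned long long bits;

	assert(baud > 0);
	bits = (unsigned long long)(fifo_bytes + extra_bytes) * bits_per_char;
	return (unsigned int)((bits * 1000 + baud - 1) / baud);
}
```

For an assumed 64-byte FIFO at 9600 baud, counting only 8 bits per byte gives 54 ms, while 10 bits per byte plus one extra 4-byte word gives 71 ms, which is the kind of gap that makes the old timeout too short.
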
Konrad Dybcio June 17, 2024, 6:53 p.m. UTC | #2
On 6/11/24 00:24, Douglas Anderson wrote:
> With a small modification the qcom_geni_serial_poll_bit() function
> could be used to poll more than just a single bit. Let's generalize
> it. We'll make the qcom_geni_serial_poll_bit() into just a wrapper of
> the general function.
> 
> Signed-off-by: Douglas Anderson <dianders@chromium.org>
> ---

Reviewed-by: Konrad Dybcio <konrad.dybcio@linaro.org>

Konrad
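
The generalization described in the quoted commit message (poll a whole field, keep the single-bit helper as a wrapper) might look roughly like the following userspace sketch. The names, the callback-based register read and the poll-count limit are hypothetical and are not taken from the driver, which reads hardware registers and uses a time-based timeout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register accessor; a real driver would read MMIO instead. */
typedef uint32_t (*read_reg_fn)(void *ctx);

/*
 * Poll until (reg & field) == val, giving up after `max_polls` reads.
 * Returns true if the field reached the expected value.
 */
static bool poll_bitfield(read_reg_fn read_reg, void *ctx,
			  uint32_t field, uint32_t val,
			  unsigned int max_polls)
{
	assert(read_reg != NULL);
	while (max_polls--) {
		if ((read_reg(ctx) & field) == val)
			return true;
		/* A real driver would delay between reads here. */
	}
	return false;
}

/* Single-bit polling becomes a thin wrapper around the general helper. */
static bool poll_bit(read_reg_fn read_reg, void *ctx, uint32_t bit,
		     bool set, unsigned int max_polls)
{
	return poll_bitfield(read_reg, ctx, bit, set ? bit : 0, max_polls);
}

/* Test double: a "register" that counts down by one on every read. */
static uint32_t countdown_read(void *ctx)
{
	uint32_t *reg = ctx;
	uint32_t cur = *reg;

	if (*reg)
		(*reg)--;
	return cur;
}
```

Polling a multi-bit field this way lets a caller wait for a counter-style field to reach a specific value, while existing single-bit callers keep their interface.
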
Konrad Dybcio June 17, 2024, 7:02 p.m. UTC | #3
On 6/11/24 00:24, Douglas Anderson wrote:
> On devices using Qualcomm's GENI UART it is possible to get the UART
> stuck such that it no longer outputs data. Specifically, logging in
> via an agetty on the debug serial port (which was _not_ used for
> kernel console) and running:
>    cat /var/log/messages
> ...and then (via an SSH session) forcing a few suspend/resume cycles
> causes the UART to stop transmitting.
> 
> The root of the problems was with qcom_geni_serial_stop_tx_fifo()
> which is called as part of the suspend process. Specific problems with
> that function:
> - When an in-progress "tx" command is cancelled it doesn't appear to
>    fully drain the FIFO. That meant qcom_geni_serial_tx_empty()
>    continued to report that the FIFO wasn't empty. The
>    qcom_geni_serial_start_tx_fifo() function didn't re-enable
>    interrupts in this case so the driver would never start transferring
>    again.
> - When the driver cancelled the current "tx" command it forgot to
>    zero out "tx_remaining". This confused logic elsewhere in the
>    driver.
> - From experimentation, it appears that cancelling the "tx" command
>    could drop some of the queued up bytes.
> 
> While qcom_geni_serial_stop_tx_fifo() could be fixed to drain the FIFO
> and shut things down properly, stop_tx() isn't supposed to be a slow
> function. It is run with local interrupts off and is documented to
> stop transmitting "as soon as possible". Change the function to just
> stop new bytes from being queued. In order to make this work, change
> qcom_geni_serial_start_tx_fifo() to remove some conditions. It's
> always safe to enable the watermark interrupt and the IRQ handler will
> disable it if it's not needed.
> 
> For system suspend the queue still needs to be drained. Failure to do
> so means that the hardware won't provide new interrupts until a
> "cancel" command is sent. Add draining logic (fixing the issues noted
> above) at suspend time.
> 
> NOTE: It would be ideal if qcom_geni_serial_stop_tx_fifo() could
> "pause" the transmitter right away. There is no obvious way to do this
> in the docs and experimentation didn't find any tricks either, so
> stopping TX "as soon as possible" isn't very soon but is the best
> possible.
> 
> Fixes: c4f528795d1a ("tty: serial: msm_geni_serial: Add serial driver support for GENI based QUP")
> Signed-off-by: Douglas Anderson <dianders@chromium.org>
> ---

This all looks good in my eyes, with the assumption that sending an ABORT
can't somehow be rejected by the hardware. I wouldn't normally think of
that, but GENI is peculiar at times.

Reviewed-by: Konrad Dybcio <konrad.dybcio@linaro.org>

Konrad
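
The start_tx()/stop_tx() split described in the quoted commit message can be modelled with a small toy state machine. Everything here is hypothetical (the struct, the 16-byte FIFO depth, and the choice to model "stop new bytes from being queued" as masking the watermark interrupt); it is a sketch of the idea, not the driver's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_FIFO_DEPTH 16	/* assumed FIFO depth in bytes */

struct toy_uart {
	size_t queued;		/* bytes waiting in the software queue */
	size_t fifo_level;	/* bytes currently in the hardware FIFO */
	bool watermark_irq;	/* is the TX watermark interrupt enabled? */
};

/* start_tx: always safe, so enable the watermark interrupt unconditionally. */
static void toy_start_tx(struct toy_uart *u)
{
	u->watermark_irq = true;
}

/* stop_tx: no cancel and no drain; just stop feeding the FIFO. */
static void toy_stop_tx(struct toy_uart *u)
{
	u->watermark_irq = false;
}

/* Watermark handler: refill the FIFO, or disable itself when idle. */
static void toy_watermark_irq(struct toy_uart *u)
{
	size_t room, chunk;

	assert(u->fifo_level <= TOY_FIFO_DEPTH);
	if (!u->watermark_irq)
		return;
	if (u->queued == 0) {
		/* Nothing left to send: the handler turns itself off. */
		u->watermark_irq = false;
		return;
	}
	room = TOY_FIFO_DEPTH - u->fifo_level;
	chunk = u->queued < room ? u->queued : room;
	u->fifo_level += chunk;
	u->queued -= chunk;
}
```

In this model stop_tx() returns immediately and bytes already in the FIFO still go out, which matches the note that "as soon as possible" is not very soon. Draining is left to a separate, slower path such as suspend.
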
Konrad Dybcio June 18, 2024, 10:19 a.m. UTC | #4
On 6/11/24 00:24, Douglas Anderson wrote:
> 
> While trying to reproduce -EBUSY errors that our lab was getting in
> suspend/resume testing, I ended up finding a whole pile of problems
> with the Qualcomm GENI serial driver. I've posted a fix for the -EBUSY
> issue separately [1]. This series is fixing all of the Qualcomm GENI
> problems that I found.
> 
> As far as I can tell most of the problems have been in the Qualcomm
> GENI serial driver since inception, but it can be noted that the
> behavior got worse with the new kfifo changes. Previously when the OS
> took data out of the circular queue we'd just spit stale data onto the
> serial port. Now we'll hard lockup. :-P
> 
> I've tried to break this series up as much as possible to make it
> easier to understand but the final patch is still a lot of change at
> once. Hopefully it's OK.

Tested-by: Konrad Dybcio <konrad.dybcio@linaro.org>

Konrad
Neil Armstrong June 19, 2024, 8:25 a.m. UTC | #5
On 11/06/2024 00:24, Douglas Anderson wrote:
> 
> While trying to reproduce -EBUSY errors that our lab was getting in
> suspend/resume testing, I ended up finding a whole pile of problems
> with the Qualcomm GENI serial driver. I've posted a fix for the -EBUSY
> issue separately [1]. This series is fixing all of the Qualcomm GENI
> problems that I found.
> 
> As far as I can tell most of the problems have been in the Qualcomm
> GENI serial driver since inception, but it can be noted that the
> behavior got worse with the new kfifo changes. Previously when the OS
> took data out of the circular queue we'd just spit stale data onto the
> serial port. Now we'll hard lockup. :-P
> 
> I've tried to break this series up as much as possible to make it
> easier to understand but the final patch is still a lot of change at
> once. Hopefully it's OK.
> 
> [1] https://lore.kernel.org/r/20240530084841.v2.1.I2395e66cf70c6e67d774c56943825c289b9c13e4@changeid
> 
> Changes in v4:
> - Add GP_LENGTH field definition.
> - Fix indentation.
> - GENMASK(31, 0) -> GP_LENGTH.
> - Use uart_fifo_timeout_ms() for timeout.
> - tty: serial: Add uart_fifo_timeout_ms()
> 
> Changes in v3:
> - 0xffffffff => GENMASK(31, 0)
> - Reword commit message.
> - Use uart_fifo_timeout() for timeout.
> 
> Changes in v2:
> - Totally rework / rename patch to handle suspend while active xfer
> - serial: qcom-geni: Fix arg types for qcom_geni_serial_poll_bit()
> - serial: qcom-geni: Fix the timeout in qcom_geni_serial_poll_bit()
> - serial: qcom-geni: Introduce qcom_geni_serial_poll_bitfield()
> - serial: qcom-geni: Just set the watermark level once
> - serial: qcom-geni: Rework TX in FIFO mode to fix hangs/lockups
> - soc: qcom: geni-se: Add GP_LENGTH/IRQ_EN_SET/IRQ_EN_CLEAR registers
> 
> Douglas Anderson (8):
>    soc: qcom: geni-se: Add GP_LENGTH/IRQ_EN_SET/IRQ_EN_CLEAR registers
>    tty: serial: Add uart_fifo_timeout_ms()
>    serial: qcom-geni: Fix the timeout in qcom_geni_serial_poll_bit()
>    serial: qcom-geni: Fix arg types for qcom_geni_serial_poll_bit()
>    serial: qcom-geni: Introduce qcom_geni_serial_poll_bitfield()
>    serial: qcom-geni: Just set the watermark level once
>    serial: qcom-geni: Fix suspend while active UART xfer
>    serial: qcom-geni: Rework TX in FIFO mode to fix hangs/lockups
> 
>   drivers/tty/serial/qcom_geni_serial.c | 322 +++++++++++++++-----------
>   include/linux/serial_core.h           |  15 +-
>   include/linux/soc/qcom/geni-se.h      |   9 +
>   3 files changed, 206 insertions(+), 140 deletions(-)
> 

Indeed no more lockup when killing a process on the serial debug console

Tested-by: Neil Armstrong <neil.armstrong@linaro.org> # on SM8650-HDK

Thanks !
Neil
Johan Hovold June 19, 2024, 8:50 a.m. UTC | #6
Hi Doug,

and sorry about the late feedback on this (was out of office last
week).

On Mon, Jun 10, 2024 at 03:24:18PM -0700, Douglas Anderson wrote:
> 
> While trying to reproduce -EBUSY errors that our lab was getting in
> suspend/resume testing, I ended up finding a whole pile of problems
> with the Qualcomm GENI serial driver. I've posted a fix for the -EBUSY
> issue separately [1]. This series is fixing all of the Qualcomm GENI
> problems that I found.
> 
> As far as I can tell most of the problems have been in the Qualcomm
> GENI serial driver since inception, but it can be noted that the
> behavior got worse with the new kfifo changes. Previously when the OS
> took data out of the circular queue we'd just spit stale data onto the
> serial port. Now we'll hard lockup. :-P

Thanks for taking a stab at this. This is indeed a known issue that has
been on my ever growing TODO list for over a year now. I worked around a
related regression with:

	9aff74cc4e9e ("serial: qcom-geni: fix console shutdown hang")

but noticed that the underlying bug can still easily be triggered, for
example, using software flow control in a serial console.

With 6.10-rc1 I started hitting this hang on every reboot. I was booting
the new x1e80100 so wasn't sure at first what caused it, but after
triggering the hang by interrupting a dmesg command I remembered the
broken serial driver and indeed your (v2) series fixed the regression
which was also present on sc8280xp.

I did run a quick benchmark this morning to see if there was any
significant performance penalty and I am seeing a 26% slow down (e.g.
catting 544 kB takes 68 instead of 54 seconds at 115200).

I've had a feeling that boot was slower with the series applied, but I
haven't verified that (just printing dmesg takes an extra second,
though).

Correctness first, of course, but perhaps something can be done about
that too.

I'll comment on the individual patches as well, but for now:

Tested-by: Johan Hovold <johan+linaro@kernel.org>

(I did a quick test with Bluetooth / DMA as well.)

Johan
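
The benchmark figures quoted above (54 versus 68 seconds for 544 kB at 115200) can be checked with a little arithmetic. The helper names are made up for this sketch, and the ideal-time calculation assumes 1 kB = 1000 bytes and 10 bits on the wire per byte (8N1); neither assumption is stated in the message.

```c
#include <assert.h>

/* Slowdown of `after_s` relative to `before_s`, in percent, rounded to nearest. */
static unsigned int slowdown_percent(unsigned int before_s,
				     unsigned int after_s)
{
	assert(before_s > 0 && after_s >= before_s);
	return ((after_s - before_s) * 100 + before_s / 2) / before_s;
}

/* Whole seconds (rounded up) to send `bytes` at `baud` with no gaps. */
static unsigned int ideal_seconds(unsigned long long bytes,
				  unsigned int bits_per_char,
				  unsigned int baud)
{
	assert(baud > 0);
	return (unsigned int)((bytes * bits_per_char + baud - 1) / baud);
}
```

slowdown_percent(54, 68) gives 26, matching the reported figure. Under the stated assumptions ideal_seconds(544000, 10, 115200) gives 48, so the 54-second baseline was already somewhat above the line-rate minimum.
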
Johan Hovold June 24, 2024, 12:12 p.m. UTC | #7
On Mon, Jun 10, 2024 at 03:24:25PM -0700, Douglas Anderson wrote:
> On devices using Qualcomm's GENI UART it is possible to get the UART
> stuck such that it no longer outputs data. Specifically, logging in
> via an agetty on the debug serial port (which was _not_ used for
> kernel console) and running:
>   cat /var/log/messages
> ...and then (via an SSH session) forcing a few suspend/resume cycles
> causes the UART to stop transmitting.

An easier way to trigger this old bug is to just run a command like
dmesg and hit ctrl-s in a serial console to stop tx. Interrupting the
command or hitting ctrl-q to restart tx then triggers the soft lockup.

> The root of the problems was with qcom_geni_serial_stop_tx_fifo()
> which is called as part of the suspend process. Specific problems with
> that function:
> - When an in-progress "tx" command is cancelled it doesn't appear to
>   fully drain the FIFO. That meant qcom_geni_serial_tx_empty()
>   continued to report that the FIFO wasn't empty. The
>   qcom_geni_serial_start_tx_fifo() function didn't re-enable
>   interrupts in this case so the driver would never start transferring
>   again.
> - When the driver cancelled the current "tx" command it forgot to
>   zero out "tx_remaining". This confused logic elsewhere in the
>   driver.
> - From experimentation, it appears that cancelling the "tx" command
>   could drop some of the queued up bytes.
> 
> While qcom_geni_serial_stop_tx_fifo() could be fixed to drain the FIFO
> and shut things down properly, stop_tx() isn't supposed to be a slow
> function. It is run with local interrupts off and is documented to
> stop transmitting "as soon as possible". Change the function to just
> stop new bytes from being queued. In order to make this work, change
> qcom_geni_serial_start_tx_fifo() to remove some conditions. It's
> always safe to enable the watermark interrupt and the IRQ handler will
> disable it if it's not needed.
> 
> For system suspend the queue still needs to be drained. Failure to do
> so means that the hardware won't provide new interrupts until a
> "cancel" command is sent. Add draining logic (fixing the issues noted
> above) at suspend time.

So I spent the better part of the weekend looking at this driver and
this is one of the bits I worry about with your approach as relying on
draining anything won't work with hardware flow control.

Cancelling commands can result in stalled TX in a number of ways and
there's still at least one that you don't handle. If you end up with
data in the FIFO, the watermark interrupt may never fire when you try
to restart tx.

I'm leaning towards fixing the immediate hard lockup regression
separately and then we can address the older bugs and rework the driver
without having to rush things.

I've prepared a minimal three patch series which fixes most of the
discussed issues (hard and soft lockup and garbage characters) and that
should be backportable as well.

Currently, the diffstat is just:

	 drivers/tty/serial/qcom_geni_serial.c | 36 +++++++++++++++++++++++++-----------
	 1 file changed, 25 insertions(+), 11 deletions(-)

Fixing the hard lockup 6.10-rc1 regression is just a single line.

Johan
Johan Hovold June 24, 2024, 4:54 p.m. UTC | #8
On Mon, Jun 24, 2024 at 02:12:04PM +0200, Johan Hovold wrote:

> I've prepared a minimal three patch series which fixes most of the
> discussed issues (hard and soft lockup and garbage characters) and that
> should be backportable as well.
> 
> Currently, the diffstat is just:
> 
> 	 drivers/tty/serial/qcom_geni_serial.c | 36 +++++++++++++++++++++++++-----------
> 	 1 file changed, 25 insertions(+), 11 deletions(-)
> 
> Fixing the hard lockup 6.10-rc1 regression is just a single line.

For the record, I've posted the series here:

	https://lore.kernel.org/lkml/20240624133135.7445-1-johan+linaro@kernel.org/
 
Johan
Doug Anderson June 24, 2024, 8:58 p.m. UTC | #9
Hi,

On Mon, Jun 24, 2024 at 5:12 AM Johan Hovold <johan@kernel.org> wrote:
>
> On Mon, Jun 10, 2024 at 03:24:25PM -0700, Douglas Anderson wrote:
> > On devices using Qualcomm's GENI UART it is possible to get the UART
> > stuck such that it no longer outputs data. Specifically, logging in
> > via an agetty on the debug serial port (which was _not_ used for
> > kernel console) and running:
> >   cat /var/log/messages
> > ...and then (via an SSH session) forcing a few suspend/resume cycles
> > causes the UART to stop transmitting.
>
> An easier way to trigger this old bug is to just run a command like
> dmesg and hit ctrl-s in a serial console to stop tx. Interrupting the
> command or hitting ctrl-q to restart tx then triggers the soft lockup.
>
> > The root of the problems was with qcom_geni_serial_stop_tx_fifo()
> > which is called as part of the suspend process. Specific problems with
> > that function:
> > - When an in-progress "tx" command is cancelled it doesn't appear to
> >   fully drain the FIFO. That meant qcom_geni_serial_tx_empty()
> >   continued to report that the FIFO wasn't empty. The
> >   qcom_geni_serial_start_tx_fifo() function didn't re-enable
> >   interrupts in this case so the driver would never start transferring
> >   again.
> > - When the driver cancelled the current "tx" command it forgot to
> >   zero out "tx_remaining". This confused logic elsewhere in the
> >   driver.
> > - From experimentation, it appears that cancelling the "tx" command
> >   could drop some of the queued up bytes.
> >
> > While qcom_geni_serial_stop_tx_fifo() could be fixed to drain the FIFO
> > and shut things down properly, stop_tx() isn't supposed to be a slow
> > function. It is run with local interrupts off and is documented to
> > stop transmitting "as soon as possible". Change the function to just
> > stop new bytes from being queued. In order to make this work, change
> > qcom_geni_serial_start_tx_fifo() to remove some conditions. It's
> > always safe to enable the watermark interrupt and the IRQ handler will
> > disable it if it's not needed.
> >
> > For system suspend the queue still needs to be drained. Failure to do
> > so means that the hardware won't provide new interrupts until a
> > "cancel" command is sent. Add draining logic (fixing the issues noted
> > above) at suspend time.
>
> So I spent the better part of the weekend looking at this driver and
> this is one of the bits I worry about with your approach as relying on
> draining anything won't work with hardware flow control.
>
> Cancelling commands can result in stalled TX in a number of ways and
> there's still at least one that you don't handle. If you end up with
> data in the FIFO, the watermark interrupt may never fire when you try
> to restart tx.

Ah, that's a good call. Right now it doesn't really happen since
people tend to hook up the debug UART without flow control lines (as
far as I've seen), but it's good to make sure it works.


> I'm leaning towards fixing the immediate hard lockup regression
> separately and then we can address the older bugs and rework the driver
> without having to rush things.

Yeah, that's fair. I've responded to your patch with a
counter-proposal to fix the hard lockup regression, but I agree that
should take priority.


> I've prepared a minimal three patch series which fixes most of the
> discussed issues (hard and soft lockup and garbage characters) and that
> should be backportable as well.
>
> Currently, the diffstat is just:
>
>          drivers/tty/serial/qcom_geni_serial.c | 36 +++++++++++++++++++++++++-----------
>          1 file changed, 25 insertions(+), 11 deletions(-)

I'll respond more in depth to your patches, but I suspect that your
patch series won't fix the issues that Nícolas reported [1]. I also
tested and your patch series doesn't fix the kdb issue talked about in
my patch #8. Part of my reworking of stuff also changed the way that
the console and the polling commands worked since they were pretty
broken. Your series doesn't touch them.

We'll probably need something in-between taking advantage of some of
the stuff you figured out with "cancel" but also doing a bigger rework
than you did.

[1] https://lore.kernel.org/r/46f57349-1217-4594-85b2-84fa3a365c0c@notapiano
Johan Hovold June 25, 2024, 8:46 a.m. UTC | #10
On Mon, Jun 24, 2024 at 01:58:34PM -0700, Doug Anderson wrote:
> On Mon, Jun 24, 2024 at 5:12 AM Johan Hovold <johan@kernel.org> wrote:

> > I'm leaning towards fixing the immediate hard lockup regression
> > separately and then we can address the older bugs and rework the driver
> > without having to rush things.
> 
> Yeah, that's fair. I've responded to your patch with a
> counter-proposal to fix the hard lockup regression, but I agree that
> should take priority.
> 
> > I've prepared a minimal three patch series which fixes most of the
> > discussed issues (hard and soft lockup and garbage characters) and that
> > should be backportable as well.
> >
> > Currently, the diffstat is just:
> >
> >          drivers/tty/serial/qcom_geni_serial.c | 36 +++++++++++++++++++++++++-----------
> >          1 file changed, 25 insertions(+), 11 deletions(-)
> 
> I'll respond more in depth to your patches, but I suspect that your
> patch series won't fix the issues that Nícolas reported [1]. I also
> tested and your patch series doesn't fix the kdb issue talked about in
> my patch #8. Part of my reworking of stuff also changed the way that
> the console and the polling commands worked since they were pretty
> broken. Your series doesn't touch them.

Right, I never claimed to fix all the issues, only some of the most
obvious and severe ones. 

> We'll probably need something in-between taking advantage of some of
> the stuff you figured out with "cancel" but also doing a bigger rework
> than you did.

Quite likely. My intention was to try to find minimal fixes for
individual issues, which could also be backported, before doing a larger
rework if that turns out to be necessary (and which can also be done in
more than one way, e.g. using 16-byte fifos).

Johan