Message ID | 20240530154553.v2.6.I0f81a5baa37d368f291c96ee4830abca337e3c87@changeid |
---|---|
State | New |
Series | serial: qcom-geni: Overhaul TX handling to fix crashes/hangs |
On Thu, May 30, 2024 at 03:45:58PM -0700, Douglas Anderson wrote:
> On devices using Qualcomm's GENI UART it is possible to get the UART
> stuck such that it no longer outputs data.

...

> +	qcom_geni_serial_poll_bitfield(uport, SE_GENI_M_GP_LENGTH, 0xffffffff,

It's easy to miscount f:s, GENMASK()?

> +				       port->tx_total - port->tx_remaining);
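
The review comment above can be illustrated with a small userspace sketch. It shows how a GENMASK()-style macro names the bit range instead of spelling out a run of f digits. DEMO_GENMASK32 and demo_field are hypothetical names made up for this illustration; in the kernel the real macro is GENMASK() from <linux/bits.h>, and the exact form the patch should use is left to the author.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Userspace stand-in for the kernel's GENMASK(h, l): a 32-bit value with
 * bits l..h (inclusive) set. DEMO_GENMASK32 is a hypothetical name used
 * only for this illustration; valid for 0 <= l <= h <= 31.
 */
#define DEMO_GENMASK32(h, l) \
	((uint32_t)((UINT32_MAX >> (31 - (h))) & (UINT32_MAX << (l))))

/* Keep only the bits of a raw register value that are selected by mask. */
static inline uint32_t demo_field(uint32_t reg, uint32_t mask)
{
	return reg & mask;
}
```

With this, DEMO_GENMASK32(31, 0) stands for the full 32-bit mask that the patch writes as 0xffffffff, and a narrower field such as bits 15..8 can be requested without counting hex digits.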
On Thu, 30 May 2024, Douglas Anderson wrote:
> On devices using Qualcomm's GENI UART it is possible to get the UART
> stuck such that it no longer outputs data.
> ...

Hi,

This was quite tiring to read. :-) It has lots of useful information but
it could be structured better.

Could you try to rewrite this entire description so that it's easier to
find the problem and final solution information from it? Start with those
two things, and in that part, try to avoid detouring to extra branches you
took while finding and solving the problem.

You can place how the problem can be reproduced after you've described the
root cause & final solution first. Extra information why some other
approaches do not work is also useful information, but please place it
after the final solution has been covered first.

Also, try to avoid I/you/we, use imperative tone.

Hi,

On Fri, May 31, 2024 at 8:13 AM Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> wrote:
>
> On Thu, 30 May 2024, Douglas Anderson wrote:
>
> > On devices using Qualcomm's GENI UART it is possible to get the UART
> > stuck such that it no longer outputs data.
> > ...
>
> Hi,
>
> This was quite tiring to read. :-) It has lots of useful information but
> it could be structured better.
>
> Could you try to rewrite this entire description so that it's easier to
> find the problem and final solution information from it? Start with those
> two things, and in that part, try to avoid detouring to extra branches you
> took while finding and solving the problem.
>
> You can place how the problem can be reproduced after you've described the
> root cause & final solution first. Extra information why some other
> approaches do not work is also useful information, but please place it
> after the final solution has been covered first.
>
> Also, try to avoid I/you/we, use imperative tone.

Sure. I'll try. It's always a tradeoff between providing too much
information and not providing enough. In general I find that providing
the thought process can help someone else who is likely going to go
through the same thing as they're trying to understand the patch, but I
agree it can also be overwhelming.

Sure. I've attempted to use the imperative tone when possible. In
general (unless my understanding is flawed) it's not possible to use
imperative when explaining to the reader how the hardware/driver works
or what the problem is and (IMO) we shouldn't fully remove these types
of explanations from the commit message. When describing what the patch
actually does, though, I've tried to make sure it's in imperative form.
If you have wording changes on v3 then please suggest specific changes.

-Doug

diff --git a/drivers/tty/serial/qcom_geni_serial.c b/drivers/tty/serial/qcom_geni_serial.c
index d7814f9e5c26..10aeb0313f9b 100644
--- a/drivers/tty/serial/qcom_geni_serial.c
+++ b/drivers/tty/serial/qcom_geni_serial.c
@@ -131,6 +131,7 @@ struct qcom_geni_serial_port {
 	bool brk;
 
 	unsigned int tx_remaining;
+	unsigned int tx_total;
 	int wakeup_irq;
 	bool rx_tx_swap;
 	bool cts_rts_swap;
@@ -337,11 +338,14 @@ static bool qcom_geni_serial_poll_bit(struct uart_port *uport,
 
 static void qcom_geni_serial_setup_tx(struct uart_port *uport, u32 xmit_size)
 {
+	struct qcom_geni_serial_port *port = to_dev_port(uport);
 	u32 m_cmd;
 
 	writel(xmit_size, uport->membase + SE_UART_TX_TRANS_LEN);
 	m_cmd = UART_START_TX << M_OPCODE_SHFT;
 	writel(m_cmd, uport->membase + SE_GENI_M_CMD0);
+
+	port->tx_total = xmit_size;
 }
 
 static void qcom_geni_serial_poll_tx_done(struct uart_port *uport)
@@ -361,6 +365,64 @@ static void qcom_geni_serial_poll_tx_done(struct uart_port *uport)
 	writel(irq_clear, uport->membase + SE_GENI_M_IRQ_CLEAR);
 }
 
+static void qcom_geni_serial_drain_tx_fifo(struct uart_port *uport)
+{
+	struct qcom_geni_serial_port *port = to_dev_port(uport);
+
+	/*
+	 * If the main sequencer is inactive it means that the TX command has
+	 * been completed and all bytes have been sent. Nothing to do in that
+	 * case.
+	 */
+	if (!qcom_geni_serial_main_active(uport))
+		return;
+
+	/*
+	 * Wait until the FIFO has been drained. We've already taken bytes out
+	 * of the higher level queue in qcom_geni_serial_send_chunk_fifo() so
+	 * if we don't drain the FIFO but send the "cancel" below they seem to
+	 * get lost.
+	 */
+	qcom_geni_serial_poll_bitfield(uport, SE_GENI_M_GP_LENGTH, 0xffffffff,
+				       port->tx_total - port->tx_remaining);
+
+	/*
+	 * If clearing the FIFO made us inactive then we're done--no need for
+	 * a cancel.
+	 */
+	if (!qcom_geni_serial_main_active(uport))
+		return;
+
+	/*
+	 * Cancel the current command. After this the main sequencer will
+	 * stop reporting that it's active and we'll have to start a new
+	 * transfer command.
+	 *
+	 * If we skip doing this cancel and then continue with a system
+	 * suspend while there's an active command in the main sequencer
+	 * then after resume time we won't get any more interrupts on the
+	 * main sequencer until we send the cancel.
+	 */
+	geni_se_cancel_m_cmd(&port->se);
+	if (!qcom_geni_serial_poll_bit(uport, SE_GENI_M_IRQ_STATUS,
+				       M_CMD_CANCEL_EN, true)) {
+		/* The cancel failed; try an abort as a fallback. */
+		geni_se_abort_m_cmd(&port->se);
+		qcom_geni_serial_poll_bit(uport, SE_GENI_M_IRQ_STATUS,
+					  M_CMD_ABORT_EN, true);
+		writel(M_CMD_ABORT_EN, uport->membase + SE_GENI_M_IRQ_CLEAR);
+	}
+	writel(M_CMD_CANCEL_EN, uport->membase + SE_GENI_M_IRQ_CLEAR);
+
+	/*
+	 * We've cancelled the current command. "tx_remaining" stores how
+	 * many bytes are left to finish in the current command so we know
+	 * when to start a new command. Since the command was cancelled we
+	 * need to zero "tx_remaining".
+	 */
+	port->tx_remaining = 0;
+}
+
 static void qcom_geni_serial_abort_rx(struct uart_port *uport)
 {
 	u32 irq_clear = S_CMD_DONE_EN | S_CMD_ABORT_EN;
@@ -681,37 +743,18 @@ static void qcom_geni_serial_start_tx_fifo(struct uart_port *uport)
 {
 	u32 irq_en;
 
-	if (qcom_geni_serial_main_active(uport) ||
-	    !qcom_geni_serial_tx_empty(uport))
-		return;
-
 	irq_en = readl(uport->membase + SE_GENI_M_IRQ_EN);
 	irq_en |= M_TX_FIFO_WATERMARK_EN | M_CMD_DONE_EN;
-
 	writel(irq_en, uport->membase + SE_GENI_M_IRQ_EN);
 }
 
 static void qcom_geni_serial_stop_tx_fifo(struct uart_port *uport)
 {
 	u32 irq_en;
-	struct qcom_geni_serial_port *port = to_dev_port(uport);
 
 	irq_en = readl(uport->membase + SE_GENI_M_IRQ_EN);
 	irq_en &= ~(M_CMD_DONE_EN | M_TX_FIFO_WATERMARK_EN);
 	writel(irq_en, uport->membase + SE_GENI_M_IRQ_EN);
-	/* Possible stop tx is called multiple times. */
-	if (!qcom_geni_serial_main_active(uport))
-		return;
-
-	geni_se_cancel_m_cmd(&port->se);
-	if (!qcom_geni_serial_poll_bit(uport, SE_GENI_M_IRQ_STATUS,
-				       M_CMD_CANCEL_EN, true)) {
-		geni_se_abort_m_cmd(&port->se);
-		qcom_geni_serial_poll_bit(uport, SE_GENI_M_IRQ_STATUS,
-					  M_CMD_ABORT_EN, true);
-		writel(M_CMD_ABORT_EN, uport->membase + SE_GENI_M_IRQ_CLEAR);
-	}
-	writel(M_CMD_CANCEL_EN, uport->membase + SE_GENI_M_IRQ_CLEAR);
 }
 
 static void qcom_geni_serial_handle_rx_fifo(struct uart_port *uport, bool drop)
@@ -1093,7 +1136,15 @@ static int setup_fifos(struct qcom_geni_serial_port *port)
 }
 
 
-static void qcom_geni_serial_shutdown(struct uart_port *uport)
+static void qcom_geni_serial_shutdown_dma(struct uart_port *uport)
+{
+	disable_irq(uport->irq);
+
+	qcom_geni_serial_stop_tx(uport);
+	qcom_geni_serial_stop_rx(uport);
+}
+
+static void qcom_geni_serial_shutdown_fifo(struct uart_port *uport)
 {
 	disable_irq(uport->irq);
 
@@ -1102,6 +1153,8 @@ static void qcom_geni_serial_shutdown(struct uart_port *uport)
 
 	qcom_geni_serial_stop_tx(uport);
 	qcom_geni_serial_stop_rx(uport);
+
+	qcom_geni_serial_drain_tx_fifo(uport);
 }
 
 static int qcom_geni_serial_port_setup(struct uart_port *uport)
@@ -1560,7 +1613,7 @@ static const struct uart_ops qcom_geni_console_pops = {
 	.startup = qcom_geni_serial_startup,
 	.request_port = qcom_geni_serial_request_port,
 	.config_port = qcom_geni_serial_config_port,
-	.shutdown = qcom_geni_serial_shutdown,
+	.shutdown = qcom_geni_serial_shutdown_fifo,
 	.type = qcom_geni_serial_get_type,
 	.set_mctrl = qcom_geni_serial_set_mctrl,
 	.get_mctrl = qcom_geni_serial_get_mctrl,
@@ -1582,7 +1635,7 @@ static const struct uart_ops qcom_geni_uart_pops = {
 	.startup = qcom_geni_serial_startup,
 	.request_port = qcom_geni_serial_request_port,
 	.config_port = qcom_geni_serial_config_port,
-	.shutdown = qcom_geni_serial_shutdown,
+	.shutdown = qcom_geni_serial_shutdown_dma,
 	.type = qcom_geni_serial_get_type,
 	.set_mctrl = qcom_geni_serial_set_mctrl,
 	.get_mctrl = qcom_geni_serial_get_mctrl,

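
The drain step in the diff polls a register field until it reaches an expected byte count (tx_total - tx_remaining) and gives up if that never happens. The following is only a userspace model of that bounded-poll pattern, not the driver's qcom_geni_serial_poll_bitfield() helper, whose implementation is not shown in this patch. struct demo_reg, demo_read() and demo_poll_field() are hypothetical names, the "hardware" is a mock counter that advances on each read, and the delay between reads that a real driver would presumably need is omitted.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mock of a hardware counter that advances on every read. */
struct demo_reg {
	uint32_t value;	/* current register contents */
	uint32_t step;	/* how much the value grows after each read */
	uint32_t limit;	/* value at which the counter stops growing */
};

/* Return the current value, then advance the mock counter towards limit. */
static uint32_t demo_read(struct demo_reg *reg)
{
	uint32_t cur = reg->value;

	if (reg->value < reg->limit) {
		reg->value += reg->step;
		if (reg->value > reg->limit)
			reg->value = reg->limit;
	}
	return cur;
}

/*
 * Poll until (register & mask) == expected. Give up after max_reads reads
 * so that a stuck transmitter cannot hang the caller forever.
 */
static bool demo_poll_field(struct demo_reg *reg, uint32_t mask,
			    uint32_t expected, unsigned int max_reads)
{
	while (max_reads--) {
		if ((demo_read(reg) & mask) == expected)
			return true;
	}
	return false;
}
```

The bounded loop is the design point worth noting: the caller still has to handle a false return, which is why the patch follows the poll with a second "main active" check and a cancel/abort fallback rather than assuming the FIFO drained.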
On devices using Qualcomm's GENI UART it is possible to get the UART
stuck such that it no longer outputs data. Specifically, I could
reproduce this problem by logging in via an agetty on the debug serial
port (which was _not_ used for kernel console) and running:
  cat /var/log/messages
...and then (via an SSH session) forcing a few suspend/resume cycles.

Digging into this showed a number of problems that are all related.

The root of the problems was with qcom_geni_serial_stop_tx_fifo()
which is called as part of the suspend process. Specific problems with
that function:
- When we cancel an in-progress "tx" command it doesn't appear to
  fully drain the FIFO. That meant qcom_geni_serial_tx_empty()
  continued to report that the FIFO wasn't empty. The
  qcom_geni_serial_start_tx_fifo() function didn't re-enable
  interrupts in this case so we'd never start transferring again.
- We cancelled the current "tx" command but we forgot to zero out
  "tx_remaining". This confused logic elsewhere in the driver.
- From experimentation, it appears that cancelling the "tx" command
  could drop some of the queued up bytes. While maybe not the end of
  the world, it doesn't seem like we should be dropping bytes when
  stopping the FIFO, which is defined more as a "pause".

One idea to fix the above would be to add FIFO draining to
qcom_geni_serial_stop_tx_fifo(). However, digging into the
documentation in serial_core.h for stop_tx() makes this seem like the
wrong choice. Specifically stop_tx() is called with local interrupts
disabled. Waiting for a FIFO (which might be 64 bytes big) to drain at
115.2 kbps doesn't seem like a wise move.

Ideally qcom_geni_serial_stop_tx_fifo() would be able to pause the
transmitter, but nothing in the documentation for the GENI UART makes
me believe that is possible.

Given the lack of better choices, we'll change
qcom_geni_serial_stop_tx_fifo() to simply disable the
TX_FIFO_WATERMARK interrupt and call it a day. This seems OK as per
the serial core docs since stop_tx() is supposed to stop transferring
bytes "as soon as possible" and there doesn't seem to be any possible
way to stop transferring sooner. As part of this, get rid of some of
the extra conditions on qcom_geni_serial_start_tx_fifo() which simply
weren't needed and are now getting in the way. It's always fine to
turn the interrupts on if we want to receive and it'll be up to the
IRQ handler to turn them back off if somehow they're not needed. This
works fine.

Unfortunately, doing just the above change causes new/different
problems with suspend/resume. Now if you suspend while an active
transfer is happening you can find that after resume time you're no
longer receiving UART interrupts at all. It appears to be important to
drain the FIFO and send a "cancel" command if the UART is active to
avoid this. Since we've already decided that
qcom_geni_serial_stop_tx_fifo() shouldn't be doing this, let's add the
draining / cancelling logic to the shutdown() call where it should be
OK to delay a bit. This is called as part of the suspend process via
uart_suspend_port().

Finally, with all of the above, the test case where we're spamming the
UART with data and going through suspend/resume cycles doesn't kill
the UART and doesn't drop bytes.

NOTE: though I haven't gone back and validated on ancient code, it
appears from code inspection that many of these problems have existed
since the start of the driver. At the very least, I could reproduce
the problems on vanilla v5.15. The problems don't seem to reproduce
when using the serial port for kernel console output and also don't
seem to reproduce if nothing is being printed to the console at
suspend time, so this is presumably why they were not noticed until
now.

Fixes: c4f528795d1a ("tty: serial: msm_geni_serial: Add serial driver support for GENI based QUP")
Signed-off-by: Douglas Anderson <dianders@chromium.org>
---
There are still a number of problems with GENI UART after this but
I've kept this change separate to make it easier to understand.
Specifically on mainline just hitting "Ctrl-C" after dumping
/var/log/messages to the serial port hangs things after the kfifo
changes. Those issues will be addressed in future patches.

Changes in v2:
- Totally rework / rename patch to handle suspend while active xfer

 drivers/tty/serial/qcom_geni_serial.c | 97 +++++++++++++++++++++------
 1 file changed, 75 insertions(+), 22 deletions(-)
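
To put a number on the commit message's concern about draining a 64-byte FIFO at 115.2 kbps with interrupts disabled, here is a small worked calculation. It assumes 8N1 framing, i.e. 10 bits on the wire per byte (start bit, 8 data bits, stop bit); the framing is an assumption, since the message does not state it, and demo_drain_time_us() is a hypothetical helper written for this illustration only.

```c
#include <stdint.h>

/*
 * Worst-case time, in microseconds (rounded down), to shift out
 * fifo_bytes at the given baud rate, assuming bits_per_char bits on the
 * wire per byte (10 for 8N1 framing). Returns 0 for a zero baud rate.
 */
static uint64_t demo_drain_time_us(uint32_t fifo_bytes,
				   uint32_t bits_per_char, uint32_t baud)
{
	if (baud == 0)
		return 0;

	return ((uint64_t)fifo_bytes * bits_per_char * 1000000u) / baud;
}
```

Under those assumptions demo_drain_time_us(64, 10, 115200) evaluates to 5555, so a full FIFO takes about 5.6 ms to drain. That is a long time to spin with local interrupts disabled, which supports moving the drain out of stop_tx() and into shutdown().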