diff mbox series

[v2,3/4] spi: spi-geni-qcom: Don't try to set CS if an xfer is pending

Message ID 20201216144114.v2.3.I07afdedcc49655c5d26880f8df9170aac5792378@changeid
State Superseded
Headers show
Series [v2,1/4] spi: spi-geni-qcom: Fix geni_spi_isr() NULL dereference in timeout case | expand

Commit Message

Doug Anderson Dec. 16, 2020, 10:41 p.m. UTC
If we get a timeout sending then this happens:
* spi_transfer_wait() will get a timeout.
* We'll set the chip select
* We'll call handle_err() => handle_fifo_timeout().

Unfortunately that won't work so well on geni.  If we got a timeout
transferring then it's likely that our interrupt handler is blocked,
but we need that same interrupt handler to adjust the chip select.
Trying to set the chip select doesn't crash us but ends up confusing
our state machine and leads to messages like:
  Premature done. rx_rem = 32 bpw8

Let's just drop the chip select request in this case.  Sure, we might
leave the chip select in the wrong state but it's likely it was going
to fail anyway and this avoids getting the driver even more confused
about what it's doing.

The SPI core in general assumes that setting chip select is a simple
operation that doesn't fail.  Yet another reason to just reconfigure
the chip select line as GPIOs.

Signed-off-by: Douglas Anderson <dianders@chromium.org>
---

Changes in v2:
- ("spi: spi-geni-qcom: Don't try to set CS if an xfer is pending") new for v2.

 drivers/spi/spi-geni-qcom.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

Comments

Stephen Boyd Dec. 17, 2020, 4:25 a.m. UTC | #1
Quoting Douglas Anderson (2020-12-16 14:41:51)
> If we get a timeout sending then this happens:

> * spi_transfer_wait() will get a timeout.

> * We'll set the chip select

> * We'll call handle_err() => handle_fifo_timeout().

> 

> Unfortunately that won't work so well on geni.  If we got a timeout

> transferring then it's likely that our interrupt handler is blocked,

> but we need that same interrupt handler to adjust the chip select.

> Trying to set the chip select doesn't crash us but ends up confusing

> our state machine and leads to messages like:

>   Premature done. rx_rem = 32 bpw8

> 

> Let's just drop the chip select request in this case.  Sure, we might

> leave the chip select in the wrong state but it's likely it was going

> to fail anyway and this avoids getting the driver even more confused

> about what it's doing.

> 

> The SPI core in general assumes that setting chip select is a simple

> operation that doesn't fail.  Yet another reason to just reconfigure

> the chip select line as GPIOs.


Indeed.

> 

> Signed-off-by: Douglas Anderson <dianders@chromium.org>

> ---

> 

> Changes in v2:

> - ("spi: spi-geni-qcom: Don't try to set CS if an xfer is pending") new for v2.

> 

>  drivers/spi/spi-geni-qcom.c | 9 +++++++--

>  1 file changed, 7 insertions(+), 2 deletions(-)

> 

> diff --git a/drivers/spi/spi-geni-qcom.c b/drivers/spi/spi-geni-qcom.c

> index d988463e606f..0e4fa52ac017 100644

> --- a/drivers/spi/spi-geni-qcom.c

> +++ b/drivers/spi/spi-geni-qcom.c

> @@ -204,9 +204,14 @@ static void spi_geni_set_cs(struct spi_device *slv, bool set_flag)

>                 goto exit;

>         }

>  

> -       mas->cs_flag = set_flag;

> -

>         spin_lock_irq(&mas->lock);

> +       if (mas->cur_xfer) {


How is it possible that cs change happens when cur_xfer is non-NULL?

> +               dev_err(mas->dev, "Can't set CS when prev xfter running\n");


xfer? or xfter?

> +               spin_unlock_irq(&mas->lock);

> +               goto exit;

> +       }

> +

> +       mas->cs_flag = set_flag;

>         reinit_completion(&mas->cs_done);
Stephen Boyd Dec. 17, 2020, 4:29 a.m. UTC | #2
Quoting Douglas Anderson (2020-12-16 14:41:51)
> If we get a timeout sending then this happens:
> * spi_transfer_wait() will get a timeout.
> * We'll set the chip select
> * We'll call handle_err() => handle_fifo_timeout().
> 
> Unfortunately that won't work so well on geni.  If we got a timeout
> transferring then it's likely that our interrupt handler is blocked,
> but we need that same interrupt handler to adjust the chip select.
> Trying to set the chip select doesn't crash us but ends up confusing
> our state machine and leads to messages like:
>   Premature done. rx_rem = 32 bpw8
> 
> Let's just drop the chip select request in this case.  Sure, we might
> leave the chip select in the wrong state but it's likely it was going
> to fail anyway and this avoids getting the driver even more confused
> about what it's doing.
> 
> The SPI core in general assumes that setting chip select is a simple
> operation that doesn't fail.  Yet another reason to just reconfigure
> the chip select line as GPIOs.

BTW, we could peek at the irq bit for the CS change and ignore the irq
handler entirely. That would be one way to make sure the cs change went
through, and would avoid an irq delay/scheduling problem for this simple
operation. Maybe using the irq path is worse in general here?
Mark Brown Dec. 17, 2020, 1:25 p.m. UTC | #3
On Wed, Dec 16, 2020 at 08:25:18PM -0800, Stephen Boyd wrote:

> >         spin_lock_irq(&mas->lock);
> > +       if (mas->cur_xfer) {
> 
> How is it possible that cs change happens when cur_xfer is non-NULL?

We will set the initial chip select state during controller setup.
Doug Anderson Dec. 17, 2020, 9:21 p.m. UTC | #4
Hi,

On Wed, Dec 16, 2020 at 8:25 PM Stephen Boyd <swboyd@chromium.org> wrote:
>

> Quoting Douglas Anderson (2020-12-16 14:41:51)

> > If we get a timeout sending then this happens:

> > * spi_transfer_wait() will get a timeout.

> > * We'll set the chip select

> > * We'll call handle_err() => handle_fifo_timeout().

> >

> > Unfortunately that won't work so well on geni.  If we got a timeout

> > transferring then it's likely that our interrupt handler is blocked,

> > but we need that same interrupt handler to adjust the chip select.

> > Trying to set the chip select doesn't crash us but ends up confusing

> > our state machine and leads to messages like:

> >   Premature done. rx_rem = 32 bpw8

> >

> > Let's just drop the chip select request in this case.  Sure, we might

> > leave the chip select in the wrong state but it's likely it was going

> > to fail anyway and this avoids getting the driver even more confused

> > about what it's doing.

> >

> > The SPI core in general assumes that setting chip select is a simple

> > operation that doesn't fail.  Yet another reason to just reconfigure

> > the chip select line as GPIOs.

>

> Indeed.

>

> >

> > Signed-off-by: Douglas Anderson <dianders@chromium.org>

> > ---

> >

> > Changes in v2:

> > - ("spi: spi-geni-qcom: Don't try to set CS if an xfer is pending") new for v2.

> >

> >  drivers/spi/spi-geni-qcom.c | 9 +++++++--

> >  1 file changed, 7 insertions(+), 2 deletions(-)

> >

> > diff --git a/drivers/spi/spi-geni-qcom.c b/drivers/spi/spi-geni-qcom.c

> > index d988463e606f..0e4fa52ac017 100644

> > --- a/drivers/spi/spi-geni-qcom.c

> > +++ b/drivers/spi/spi-geni-qcom.c

> > @@ -204,9 +204,14 @@ static void spi_geni_set_cs(struct spi_device *slv, bool set_flag)

> >                 goto exit;

> >         }

> >

> > -       mas->cs_flag = set_flag;

> > -

> >         spin_lock_irq(&mas->lock);

> > +       if (mas->cur_xfer) {

>

> How is it possible that cs change happens when cur_xfer is non-NULL?


I'll add this to the commit message:

spi_transfer_one_message()
 ->transfer_one() AKA spi_geni_transfer_one()
  setup_fifo_xfer()
   mas->cur_xfer = non-NULL
 spi_transfer_wait() => TIMES OUT
 msg->status != -EINPROGRESS => goto out
 if (ret != 0 ...)
  spi_set_cs()
   ->set_cs AKA spi_geni_set_cs()
    # mas->cur_xfer is non-NULL

Specifically the place where cur_xfer is usually made NULL is in the
interrupt handler.  If that doesn't run then it will be non-NULL.


> > +               dev_err(mas->dev, "Can't set CS when prev xfter running\n");

>

> xfer? or xfter?


Will fix.


-Doug
Doug Anderson Dec. 17, 2020, 9:35 p.m. UTC | #5
Hi,

On Wed, Dec 16, 2020 at 8:29 PM Stephen Boyd <swboyd@chromium.org> wrote:
>

> > The SPI core in general assumes that setting chip select is a simple

> > operation that doesn't fail.  Yet another reason to just reconfigure

> > the chip select line as GPIOs.

>

> BTW, we could peek at the irq bit for the CS change and ignore the irq

> handler entirely. That would be one way to make sure the cs change went

> through, and would avoid an irq delay/scheduling problem for this simple

> operation. Maybe using the irq path is worse in general here?


Yes, when I was in my SPI optimization phase I actually coded this up
and thought about it.  It can be made to work and is probably
marginally faster at the expense of more cpu cycles to poll the
interrupt line.  I also don't think it fixes this issue nor does it
simplify things...  :(

1. If there are already commands pending we still have to do something
about them.  Said another way: there's not a separate channel just for
setting the chip select, so if the single command channel is gummed up
then using polling mode to handle the chip select won't really un-gum
it up.

2. In order to use polling mode to set the chip select we have to do
_something_ to temporarily disable our interrupt handler.  If we don't
do that then the interrupt handler will fire for the "DONE" when we
send the chip select command.


If we wanted to truly make this driver super robust against ridiculous
interrupt latencies then, presumably, we could handle the SPI timeout
ourselves but before timing out we could check to see if the
interrupts were pending.  Then we could disable our interrupts,
synchronize our interrupt handler, handle the interrupt directly, and
then re-enable interrupts.  If we did this then transfers could
continue to eek their way through even if interrupts were completely
blocked.  IMO, it's not worth it.  I'm satisfied with not crashing and
not getting the state machine too out-of-whack.

-Doug
Stephen Boyd Dec. 18, 2020, 2:51 a.m. UTC | #6
Quoting Doug Anderson (2020-12-17 13:35:08)
> 

> If we wanted to truly make this driver super robust against ridiculous

> interrupt latencies then, presumably, we could handle the SPI timeout

> ourselves but before timing out we could check to see if the

> interrupts were pending.  Then we could disable our interrupts,

> synchronize our interrupt handler, handle the interrupt directly, and

> then re-enable interrupts.  If we did this then transfers could

> continue to eek their way through even if interrupts were completely

> blocked.  IMO, it's not worth it.  I'm satisfied with not crashing and

> not getting the state machine too out-of-whack.

> 


Ok that's fair. If it's not worth the effort then let's drop this idea.
diff mbox series

Patch

diff --git a/drivers/spi/spi-geni-qcom.c b/drivers/spi/spi-geni-qcom.c
index d988463e606f..0e4fa52ac017 100644
--- a/drivers/spi/spi-geni-qcom.c
+++ b/drivers/spi/spi-geni-qcom.c
@@ -204,9 +204,14 @@  static void spi_geni_set_cs(struct spi_device *slv, bool set_flag)
 		goto exit;
 	}
 
-	mas->cs_flag = set_flag;
-
 	spin_lock_irq(&mas->lock);
+	if (mas->cur_xfer) {
+		dev_err(mas->dev, "Can't set CS when prev xfter running\n");
+		spin_unlock_irq(&mas->lock);
+		goto exit;
+	}
+
+	mas->cs_flag = set_flag;
 	reinit_completion(&mas->cs_done);
 	if (set_flag)
 		geni_se_setup_m_cmd(se, SPI_CS_ASSERT, 0);