Message ID | 20201203074459.13078-1-rojay@codeaurora.org |
---|---|
State | New |
Headers | show |
Series | spi: spi-geni-qcom: Fix NULL pointer access in geni_spi_isr | expand |
Quoting Doug Anderson (2020-12-03 08:40:46) > I would guess that if "mas->cur_xfer" is NULL then > geni_spi_handle_rx() should read all data in the FIFO and throw it > away and geni_spi_handle_tx() should set SE_GENI_TX_WATERMARK_REG to > 0. NOTE: I _think_ that with the synchronize_irq() I'm suggesting > above we'll avoid this case, but it never hurts to be defensive. > > > Does that all make sense? So the summary is that instead of your patch: Can we get a CPU diagram describing the race and scenario where this happens? Something like: CPU0 CPU1 ---- ---- setup_fifo_xfer() spin_lock_irq(&mas->lock); spin_unlock_irq(&mas->lock); mas->cur_xfer = xfer ... <IRQ> geni_spi_isr() geni_spi_handle_rx() <NULL deref boom explosion!> But obviously this example diagram is incorrect and some timeout happens instead? Sorry, I'm super lazy and don't want to read many paragraphs of text. :) I'd rather have a diagram like above that clearly points out the steps taken to the NULL pointer deref. > > 1. Add synchronize_irq() at the start and end of > handle_fifo_timeout(). Not under lock. > > 2. In geni_spi_handle_rx(), check for NULL "mas->cur_xfer". Read all > data in the FIFO (don't cap at rx_rem_bytes), but throw it away. > > 3. In geni_spi_handle_tx(), check for NULL "mas->cur_xfer". Don't > write any data. Just write 0 to SE_GENI_TX_WATERMARK_REG. > > I think #1 is the real fix, but #2 and #3 will avoid crashes in case > there's another bug somewhere. > Aren't 2 and 3 papering over some weird problem though where irqs are coming in unexpectedly?
Hi, On Wed, Dec 9, 2020 at 7:17 PM Stephen Boyd <swboyd@chromium.org> wrote: > > Quoting Doug Anderson (2020-12-03 08:40:46) > > > I would guess that if "mas->cur_xfer" is NULL then > > geni_spi_handle_rx() should read all data in the FIFO and throw it > > away and geni_spi_handle_tx() should set SE_GENI_TX_WATERMARK_REG to > > 0. NOTE: I _think_ that with the synchronize_irq() I'm suggesting > > above we'll avoid this case, but it never hurts to be defensive. > > > > > > Does that all make sense? So the summary is that instead of your patch: > > Can we get a CPU diagram describing the race and scenario where this > happens? Something like: > > CPU0 CPU1 > ---- ---- > setup_fifo_xfer() > spin_lock_irq(&mas->lock); > spin_unlock_irq(&mas->lock); > mas->cur_xfer = xfer > ... > <IRQ> > geni_spi_isr() > geni_spi_handle_rx() > <NULL deref boom explosion!> > > But obviously this example diagram is incorrect and some timeout happens > instead? Sorry, I'm super lazy and don't want to read many paragraphs of > text. :) I'd rather have a diagram like above that clearly points out > the steps taken to the NULL pointer deref. This is my untested belief of what's happening CPU0 CPU1 ---- ---- setup_fifo_xfer() ... geni_se_setup_m_cmd() <hardware starts transfer> <unrelated interrupt storm> spin_unlock_irq() <continued interrupt storm> <time passes> <continued interrupt storm> <transfer complets in hardware> <continued interrupt storm> <hardware sets M_RX_FIFO_WATERMARK_EN> <continued interrupt storm> <time passes> <continued interrupt storm> handle_fifo_timeout() <continued interrupt storm> spin_lock_irq() <continued interrupt storm> mas->cur_xfer = NULL <continued interrupt storm> geni_se_cancel_m_cmd() <continued interrupt storm> spin_unlock_irq() <continued interrupt storm> wait_for_completion_timeout() => timeout <continued interrupt storm> spin_lock_irq() <continued interrupt storm> geni_se_abort_m_cmd() <continued interrupt storm> spin_unlock_irq() <continued interrupt storm> wait_for_completion_timeout() => timeout <interrupt storm ends> geni_spi_isr() spin_lock() if (m_irq & M_RX_FIFO_WATERMARK_EN) geni_spi_handle_rx() mas->cur_xfer NULL derefrence With my proposed fix, I believe that would transform into: CPU0 CPU1 ---- ---- setup_fifo_xfer() ... geni_se_setup_m_cmd() <hardware starts transfer> <unrelated interrupt storm> spin_unlock_irq() <continued interrupt storm> <time passes> <continued interrupt storm> <transfer complets in hardware> <continued interrupt storm> <hardware sets M_RX_FIFO_WATERMARK_EN> <continued interrupt storm> <time passes> <continued interrupt storm> handle_fifo_timeout() <continued interrupt storm> synchronize_irq() <continued interrupt storm> <time passes> <interrupt storm ends> geni_spi_isr() ... <synchronize_irq() finishes> spin_lock_irq() mas->cur_xfer = NULL geni_se_cancel_m_cmd() spin_unlock_irq() geni_spi_isr() ... wait_for_completion_timeout() => success The extra synchronize_irq() I was suggesting at the end of the function would be an extra bit of paranoia. Maybe a new storm showed up while we were processing the timeout? > > 1. Add synchronize_irq() at the start and end of > > handle_fifo_timeout(). Not under lock. > > > > 2. In geni_spi_handle_rx(), check for NULL "mas->cur_xfer". Read all > > data in the FIFO (don't cap at rx_rem_bytes), but throw it away. > > > > 3. In geni_spi_handle_tx(), check for NULL "mas->cur_xfer". Don't > > write any data. Just write 0 to SE_GENI_TX_WATERMARK_REG. > > > > I think #1 is the real fix, but #2 and #3 will avoid crashes in case > > there's another bug somewhere. > > Aren't 2 and 3 papering over some weird problem though where irqs are > coming in unexpectedly? I think that's what I said but in different words? #1 is the real fix but #2 and #3 will keep us from crashing (AKA paper over) if we have some other (unexpected) bug. We'll already have an error in the log in this case "Failed to cancel/abort m_cmd" so it doesn't feel necessary to crash with a NULL dereference... -Doug
Hi, On Thu, Dec 10, 2020 at 2:58 PM Stephen Boyd <swboyd@chromium.org> wrote: > > Quoting Doug Anderson (2020-12-10 09:14:15) > > > > This is my untested belief of what's happening > > > > CPU0 CPU1 > > ---- ---- > > setup_fifo_xfer() > > ... > > geni_se_setup_m_cmd() > > <hardware starts transfer> > > <unrelated interrupt storm> spin_unlock_irq() > > <continued interrupt storm> <time passes> > > <continued interrupt storm> <transfer complets in hardware> > > <continued interrupt storm> <hardware sets M_RX_FIFO_WATERMARK_EN> > > <continued interrupt storm> <time passes> > > <continued interrupt storm> handle_fifo_timeout() > > <continued interrupt storm> spin_lock_irq() > > <continued interrupt storm> mas->cur_xfer = NULL > > <continued interrupt storm> geni_se_cancel_m_cmd() > > <continued interrupt storm> spin_unlock_irq() > > <continued interrupt storm> wait_for_completion_timeout() => timeout > > <continued interrupt storm> spin_lock_irq() > > <continued interrupt storm> geni_se_abort_m_cmd() > > <continued interrupt storm> spin_unlock_irq() > > <continued interrupt storm> wait_for_completion_timeout() => timeout > > <interrupt storm ends> > > geni_spi_isr() > > spin_lock() > > if (m_irq & M_RX_FIFO_WATERMARK_EN) > > geni_spi_handle_rx() > > mas->cur_xfer NULL derefrence > > Ok so the one line summary is "If geni_spi_isr() is sufficiently delayed > then we may deref NULL in the irq handler because the handler tries to > deref mas->cur_xfer after the timeout handling code has set it to NULL". Sure. > CPU0 CPU1 > ---- ---- > setup_fifo_xfer() > ... > geni_se_setup_m_cmd() > <hardware starts transfer> > unrelated_irq_handler() spin_unlock_irq() > ... > <transfer completes in hardware> > <hardware sets M_RX_FIFO_WATERMARK_EN> > > handle_fifo_timeout() > spin_lock_irq() > mas->cur_xfer = NULL > geni_se_cancel_m_cmd() > spin_unlock_irq() > wait_for_completion_timeout() => timeout > spin_lock_irq() > geni_se_abort_m_cmd() > spin_unlock_irq() > wait_for_completion_timeout() => timeout > return IRQ_HANDLED; > gic_handle_irq() > geni_spi_isr() > spin_lock() > if (m_irq & M_RX_FIFO_WATERMARK_EN) > geni_spi_handle_rx() > rx_buf = mas->cur_xfer->rx_buf <--- OOPS! > > > With my proposed fix, I believe that would transform into: > > > > CPU0 CPU1 > > ---- ---- > > setup_fifo_xfer() > > ... > > geni_se_setup_m_cmd() > > <hardware starts transfer> > > <unrelated interrupt storm> spin_unlock_irq() > > <continued interrupt storm> <time passes> > > <continued interrupt storm> <transfer complets in hardware> > > <continued interrupt storm> <hardware sets M_RX_FIFO_WATERMARK_EN> > > <continued interrupt storm> <time passes> > > <continued interrupt storm> handle_fifo_timeout() > > <continued interrupt storm> synchronize_irq() > > <continued interrupt storm> <time passes> > > <interrupt storm ends> > > geni_spi_isr() > > ... > > <synchronize_irq() finishes> > > spin_lock_irq() > > mas->cur_xfer = NULL > > geni_se_cancel_m_cmd() > > spin_unlock_irq() > > geni_spi_isr() > > ... > > wait_for_completion_timeout() => success > > > > The extra synchronize_irq() I was suggesting at the end of the > > function would be an extra bit of paranoia. Maybe a new storm showed > > up while we were processing the timeout? > > Shouldn't we check in the timeout logic to see if m_irq has > M_RX_FIFO_WATERMARK_EN or M_TX_FIFO_WATERMARK_EN set instead? Similarly > for the CS assert/deassert stuff. If the timeout hits but one of those > bits are set then it seems we've run into some poor irqsoff section but > the hardware is still working. Calling synchronize_irq() wouldn't help > if the CPU handling the irqs (i.e. CPU0) had irqs off for a long time, > right? It will only ensure that other irq handlers have completed, which > may be a problem, but not the only one. > > TL;DR: Peek at the irq status register in the timeout logic and skip it > if the irq is pending? I don't have tons of experience with synchronize_irq(), but the function comment seems to indicate that as long as the interrupt is pending synchronize_irq() will do what we want even if the CPU that should handle the interrupt is in an irqsoff section. Digging a little bit I guess it relies upon the interrupt controller being able to read this state, but (hopefully) the GIC can? If it doesn't work like I think it does, I'd be OK with peeking in the IRQ status register, but we shouldn't _skip_ the logic IMO. As long as we believe that an interrupt could happen in the future we shouldn't return from handle_fifo_timeout(). It's impossible to reason about how future transfers would work if the pending interrupt from the previous transfer could fire at any point. -Doug
Hi, On Thu, Dec 10, 2020 at 3:32 PM Stephen Boyd <swboyd@chromium.org> wrote: > > Quoting Doug Anderson (2020-12-10 15:07:39) > > Hi, > > > > On Thu, Dec 10, 2020 at 2:58 PM Stephen Boyd <swboyd@chromium.org> wrote: > > > right? It will only ensure that other irq handlers have completed, which > > > may be a problem, but not the only one. > > > > > > TL;DR: Peek at the irq status register in the timeout logic and skip it > > > if the irq is pending? > > > > I don't have tons of experience with synchronize_irq(), but the > > function comment seems to indicate that as long as the interrupt is > > pending synchronize_irq() will do what we want even if the CPU that > > should handle the interrupt is in an irqsoff section. Digging a > > little bit I guess it relies upon the interrupt controller being able > > to read this state, but (hopefully) the GIC can? > > I didn't read synchronize_irq() more than the single line summary. I > thought it would only make sure other irq handlers have finished, which > is beside the point of some long section of code that has disabled irqs > on CPU0 with local_irq_disable(). And further more, presumably the irq > handler could be threaded, and then we could put a sufficiently large > msleep() at the start of geni_spi_isr() and see the same problem? As I understand it synchronize_irq(): 1. If the interrupt is not running but is pending at a hardware level, it'll wait. 2. If the interrupt is currently running it waits for it to finish. That should handle all the cases you're talking about including waiting for the threaded IRQ handler. There's an explicit comment about the threaded IRQ being accounted for in synchronize_irq(): https://elixir.bootlin.com/linux/v5.9/source/kernel/irq/manage.c#L134 > > If it doesn't work like I think it does, I'd be OK with peeking in the > > IRQ status register, but we shouldn't _skip_ the logic IMO. As long > > as we believe that an interrupt could happen in the future we > > shouldn't return from handle_fifo_timeout(). It's impossible to > > reason about how future transfers would work if the pending interrupt > > from the previous transfer could fire at any point. > > Right. I just meant skip the timeout handling logic. We'd have to go > back to the timeout and keep waiting until the irq handler can run and > complete the completion variable. > > I forgot that this is half handled in the spi core though. Peeking at > m_irq doesn't look very easy to implement. It certainly seems like this > means the timeout handler is busted and the diagram earlier could > indicate that spi core is driving this logic from > spi_transfer_one_message(). My assumption was that it was still OK (even if not perfect) to still process it as a timeout. I just want to really make sure a future interrupt isn't going to show up. If we want to try to do better, we can do timeout handling ourselves. The SPI core allows for that. > So why don't we check for cur_xfer being NULL in the rx/tx handling > paths too and bail out there? Does the FIFO need to be cleared out in > such a situation that spi core thinks a timeout happened but there's RX > data according to m_irq? Do we need to read it all and throw it away? Or > does the abort/cancel clear out the RX fifo? I don't know for sure, but IMO it's safest to read anything that's in the FIFO. It's also important to adjust the watermark in the TX case. The suggestions I provided in my original reply (#2 and #3) handle this and are plenty simple. As per my original reply, though, anything we do in the ISR doesn't replace the changes we need to make to handle_fifo_timeout(). It is very important that when handle_fifo_timeout() finishes that no future interrupts for old transfers will fire. -Doug
Hi, On Thu, Dec 10, 2020 at 4:51 PM Stephen Boyd <swboyd@chromium.org> wrote: > > Quoting Doug Anderson (2020-12-10 15:50:04) > > Hi, > > > > On Thu, Dec 10, 2020 at 3:32 PM Stephen Boyd <swboyd@chromium.org> wrote: > > > > > > Quoting Doug Anderson (2020-12-10 15:07:39) > > > > Hi, > > > > > > > > On Thu, Dec 10, 2020 at 2:58 PM Stephen Boyd <swboyd@chromium.org> wrote: > > > > > right? It will only ensure that other irq handlers have completed, which > > > > > may be a problem, but not the only one. > > > > > > > > > > TL;DR: Peek at the irq status register in the timeout logic and skip it > > > > > if the irq is pending? > > > > > > > > I don't have tons of experience with synchronize_irq(), but the > > > > function comment seems to indicate that as long as the interrupt is > > > > pending synchronize_irq() will do what we want even if the CPU that > > > > should handle the interrupt is in an irqsoff section. Digging a > > > > little bit I guess it relies upon the interrupt controller being able > > > > to read this state, but (hopefully) the GIC can? > > > > > > I didn't read synchronize_irq() more than the single line summary. I > > > thought it would only make sure other irq handlers have finished, which > > > is beside the point of some long section of code that has disabled irqs > > > on CPU0 with local_irq_disable(). And further more, presumably the irq > > > handler could be threaded, and then we could put a sufficiently large > > > msleep() at the start of geni_spi_isr() and see the same problem? > > > > As I understand it synchronize_irq(): > > 1. If the interrupt is not running but is pending at a hardware level, > > it'll wait. > > 2. If the interrupt is currently running it waits for it to finish. > > > > That should handle all the cases you're talking about including > > waiting for the threaded IRQ handler. There's an explicit comment > > about the threaded IRQ being accounted for in synchronize_irq(): > > > > https://elixir.bootlin.com/linux/v5.9/source/kernel/irq/manage.c#L134 > > Ok cool sounds like it would work then. Thanks for reading the code for > me! :) > > > > > > > > > If it doesn't work like I think it does, I'd be OK with peeking in the > > > > IRQ status register, but we shouldn't _skip_ the logic IMO. As long > > > > as we believe that an interrupt could happen in the future we > > > > shouldn't return from handle_fifo_timeout(). It's impossible to > > > > reason about how future transfers would work if the pending interrupt > > > > from the previous transfer could fire at any point. > > > > > > Right. I just meant skip the timeout handling logic. We'd have to go > > > back to the timeout and keep waiting until the irq handler can run and > > > complete the completion variable. > > > > > > I forgot that this is half handled in the spi core though. Peeking at > > > m_irq doesn't look very easy to implement. It certainly seems like this > > > means the timeout handler is busted and the diagram earlier could > > > indicate that spi core is driving this logic from > > > spi_transfer_one_message(). > > > > My assumption was that it was still OK (even if not perfect) to still > > process it as a timeout. I just want to really make sure a future > > interrupt isn't going to show up. > > I'm worried about the buffer disappearing if spi core calls handle_err() > but the geni_spi_isr() handler runs both an rx and a cancel/abort > routine. That doesn't seem to be the case though so it looks all fine. It would be pretty racy if that was the case. Until it calls handle_timeout() we're still free to write to that buffer, right? > > If we want to try to do better, we can do timeout handling ourselves. > > The SPI core allows for that. > > > > > > > So why don't we check for cur_xfer being NULL in the rx/tx handling > > > paths too and bail out there? Does the FIFO need to be cleared out in > > > such a situation that spi core thinks a timeout happened but there's RX > > > data according to m_irq? Do we need to read it all and throw it away? Or > > > does the abort/cancel clear out the RX fifo? > > > > I don't know for sure, but IMO it's safest to read anything that's in > > the FIFO. It's also important to adjust the watermark in the TX case. > > The suggestions I provided in my original reply (#2 and #3) handle > > this and are plenty simple. > > > > As per my original reply, though, anything we do in the ISR doesn't > > replace the changes we need to make to handle_fifo_timeout(). It is > > very important that when handle_fifo_timeout() finishes that no future > > interrupts for old transfers will fire. > > > > Alright. With a proper diagram in the commit text I think doing all of > the points, 1 through 3, would be good and required to leave the > hardware in a sane state for all the scenarios. Why do we need to call > synchronize_irq() at the start and end of handle_fifo_timeout() though? > Presumably having it at the start would make sure the long delayed irq > runs and handles any rx/tx by throwing it away. Sounds great, but having > it at the end leaves me confused. We want to make sure the cancel really > went through? Don't we know that because the completion variable for > cancel succeeded? I want it to handle the case where the "abort" command timed out! :-) If the "abort" command timed out then it's still pending and we could get an interrupt for it at some future point in time. > I guess I'm not convinced that the hardware is so bad that it cancels > and aborts the sequencer, raises an irq for that, and then raises an irq > for the earlier rx/tx that the sequencer canceled out. Is that > happening? It's called a sequencer because presumably it runs a sequence > of operations like tx, rx, cs change, cancel, and abort. Hopefully it > doesn't run them out of order. If they run at the same time it's fine, > the irq handler will see all of them and throw away reads, etc. Maybe answered by me explaining that I'm worried about the case where "abort" times out (and thus the "done" from the abort is still pending). NOTE: I will also assert that if we send the "abort" then it seems like it has a high likelihood of timing out. Why do I say that? In order to even get to sending the "abort", it means: a) The original transfer timed out b) The "cancel" timed out. As you can see, if the "cancel" doesn't time out we don't even send the "abort" ...so two things timed out, one of which we _just_ sent. The "abort" feels like a last ditch effort. Might as well try it, but things were in pretty sorry shape to start with by the time we tried it. -Doug
Quoting Doug Anderson (2020-12-10 17:04:06) > On Thu, Dec 10, 2020 at 4:51 PM Stephen Boyd <swboyd@chromium.org> wrote: > > > > I'm worried about the buffer disappearing if spi core calls handle_err() > > but the geni_spi_isr() handler runs both an rx and a cancel/abort > > routine. That doesn't seem to be the case though so it looks all fine. > > It would be pretty racy if that was the case. Until it calls > handle_timeout() we're still free to write to that buffer, right? Right I don't see any badness here. > > > > > If we want to try to do better, we can do timeout handling ourselves. > > > The SPI core allows for that. > > > > > > > > > > So why don't we check for cur_xfer being NULL in the rx/tx handling > > > > paths too and bail out there? Does the FIFO need to be cleared out in > > > > such a situation that spi core thinks a timeout happened but there's RX > > > > data according to m_irq? Do we need to read it all and throw it away? Or > > > > does the abort/cancel clear out the RX fifo? > > > > > > I don't know for sure, but IMO it's safest to read anything that's in > > > the FIFO. It's also important to adjust the watermark in the TX case. > > > The suggestions I provided in my original reply (#2 and #3) handle > > > this and are plenty simple. > > > > > > As per my original reply, though, anything we do in the ISR doesn't > > > replace the changes we need to make to handle_fifo_timeout(). It is > > > very important that when handle_fifo_timeout() finishes that no future > > > interrupts for old transfers will fire. > > > > > > > Alright. With a proper diagram in the commit text I think doing all of > > the points, 1 through 3, would be good and required to leave the > > hardware in a sane state for all the scenarios. Why do we need to call > > synchronize_irq() at the start and end of handle_fifo_timeout() though? > > Presumably having it at the start would make sure the long delayed irq > > runs and handles any rx/tx by throwing it away. Sounds great, but having > > it at the end leaves me confused. We want to make sure the cancel really > > went through? Don't we know that because the completion variable for > > cancel succeeded? > > I want it to handle the case where the "abort" command timed out! :-) > If the "abort" command timed out then it's still pending and we could > get an interrupt for it at some future point in time. Sure but who cares? We set a completion variable if abort comes in later. We'll put a message in the log indicating that it "Failed" but otherwise handle_fifo_timeout() can't return an error so we have to give up eventually. > > > > I guess I'm not convinced that the hardware is so bad that it cancels > > and aborts the sequencer, raises an irq for that, and then raises an irq > > for the earlier rx/tx that the sequencer canceled out. Is that > > happening? It's called a sequencer because presumably it runs a sequence > > of operations like tx, rx, cs change, cancel, and abort. Hopefully it > > doesn't run them out of order. If they run at the same time it's fine, > > the irq handler will see all of them and throw away reads, etc. > > Maybe answered by me explaining that I'm worried about the case where > "abort" times out (and thus the "done" from the abort is still > pending). > > NOTE: I will also assert that if we send the "abort" then it seems > like it has a high likelihood of timing out. Why do I say that? In > order to even get to sending the "abort", it means: > > a) The original transfer timed out > > b) The "cancel" timed out. As you can see, if the "cancel" doesn't > time out we don't even send the "abort" > > ...so two things timed out, one of which we _just_ sent. The "abort" > feels like a last ditch effort. Might as well try it, but things were > in pretty sorry shape to start with by the time we tried it. > Yeah and so if it comes way later because it timed out then what's the point of calling synchronize_irq() again? To make the completion variable set when it won't be tested again until it is reinitialized?
Quoting Doug Anderson (2020-12-10 17:30:17) > On Thu, Dec 10, 2020 at 5:21 PM Stephen Boyd <swboyd@chromium.org> wrote: > > > > Yeah and so if it comes way later because it timed out then what's the > > point of calling synchronize_irq() again? To make the completion > > variable set when it won't be tested again until it is reinitialized? > > Presumably the idea is to try to recover to a somewhat usable state > again? We're not rebooting the machine so, even though this transfer > failed, we will undoubtedly do another transfer later. If that > "abort" interrupt comes way later while we're setting up the next > transfer we'll really confuse ourselves. The interrupt handler just sets a completion variable. What does that confuse? > > I guess you could go the route of adding a synchronize_irq() at the > start of the next transfer, but I'd rather add the overhead in the > exceptional case (the timeout) than the normal case. In the normal > case we don't need to worry about random IRQs from the past transfer > suddenly showing up. > How does adding synchronize_irq() at the end guarantee that the abort is cleared out of the hardware though? It seems to assume that the abort is pending at the GIC when it could still be running through the hardware and not executed yet. It seems like a synchronize_irq() for that is wishful thinking that the irq is merely pending even though it timed out and possibly never ran. Maybe it's stuck in a write buffer in the CPU?
Quoting Doug Anderson (2020-12-10 17:51:53) > Hi, > > On Thu, Dec 10, 2020 at 5:39 PM Stephen Boyd <swboyd@chromium.org> wrote: > > > > Quoting Doug Anderson (2020-12-10 17:30:17) > > > On Thu, Dec 10, 2020 at 5:21 PM Stephen Boyd <swboyd@chromium.org> wrote: > > > > > > > > Yeah and so if it comes way later because it timed out then what's the > > > > point of calling synchronize_irq() again? To make the completion > > > > variable set when it won't be tested again until it is reinitialized? > > > > > > Presumably the idea is to try to recover to a somewhat usable state > > > again? We're not rebooting the machine so, even though this transfer > > > failed, we will undoubtedly do another transfer later. If that > > > "abort" interrupt comes way later while we're setting up the next > > > transfer we'll really confuse ourselves. > > > > The interrupt handler just sets a completion variable. What does that > > confuse? > > The interrupt handler sees a "DONE" interrupt. If we've made it far > enough into setting up the next transfer that "cur_xfer" has been set > then it might do more, no? I thought it saw a cancel/abort EN bit? if (m_irq & M_CMD_CANCEL_EN) complete(&mas->cancel_done); if (m_irq & M_CMD_ABORT_EN) complete(&mas->abort_done) and only a DONE bit if a transfer happened. > > > > > I guess you could go the route of adding a synchronize_irq() at the > > > start of the next transfer, but I'd rather add the overhead in the > > > exceptional case (the timeout) than the normal case. In the normal > > > case we don't need to worry about random IRQs from the past transfer > > > suddenly showing up. > > > > > > > How does adding synchronize_irq() at the end guarantee that the abort is > > cleared out of the hardware though? It seems to assume that the abort is > > pending at the GIC when it could still be running through the hardware > > and not executed yet. It seems like a synchronize_irq() for that is > > wishful thinking that the irq is merely pending even though it timed > > out and possibly never ran. Maybe it's stuck in a write buffer in the > > CPU? > > I guess I'm asserting that if a full second passed (because we timed > out) and after that full second no interrupts are pending then the > interrupt will never come. That seems a reasonable assumption to me. > It seems hard to believe it'd be stuck in a write buffer for a full > second? > Ok, so if we don't expect an irq to come in why are we calling synchronize_irq()? I'm lost.
Hi, On Fri, Dec 11, 2020 at 5:32 PM Stephen Boyd <swboyd@chromium.org> wrote: > > Quoting Doug Anderson (2020-12-10 17:51:53) > > Hi, > > > > On Thu, Dec 10, 2020 at 5:39 PM Stephen Boyd <swboyd@chromium.org> wrote: > > > > > > Quoting Doug Anderson (2020-12-10 17:30:17) > > > > On Thu, Dec 10, 2020 at 5:21 PM Stephen Boyd <swboyd@chromium.org> wrote: > > > > > > > > > > Yeah and so if it comes way later because it timed out then what's the > > > > > point of calling synchronize_irq() again? To make the completion > > > > > variable set when it won't be tested again until it is reinitialized? > > > > > > > > Presumably the idea is to try to recover to a somewhat usable state > > > > again? We're not rebooting the machine so, even though this transfer > > > > failed, we will undoubtedly do another transfer later. If that > > > > "abort" interrupt comes way later while we're setting up the next > > > > transfer we'll really confuse ourselves. > > > > > > The interrupt handler just sets a completion variable. What does that > > > confuse? > > > > The interrupt handler sees a "DONE" interrupt. If we've made it far > > enough into setting up the next transfer that "cur_xfer" has been set > > then it might do more, no? > > I thought it saw a cancel/abort EN bit? > > if (m_irq & M_CMD_CANCEL_EN) > complete(&mas->cancel_done); > if (m_irq & M_CMD_ABORT_EN) > complete(&mas->abort_done) > > and only a DONE bit if a transfer happened. Ah, true. The crazy thing is that since we do abort / cancel with commands we get them together with "done". That "done" could potentially confuse the next transfer... In theory we could ignore DONE if we see ABORT / CANCEL, but I've now spent a bunch of time on this and I think the best thing is to just make sure we won't start the next transfer if any IRQs are pending. I'll post patches... > > > > I guess you could go the route of adding a synchronize_irq() at the > > > > start of the next transfer, but I'd rather add the overhead in the > > > > exceptional case (the timeout) than the normal case. In the normal > > > > case we don't need to worry about random IRQs from the past transfer > > > > suddenly showing up. > > > > > > > > > > How does adding synchronize_irq() at the end guarantee that the abort is > > > cleared out of the hardware though? It seems to assume that the abort is > > > pending at the GIC when it could still be running through the hardware > > > and not executed yet. It seems like a synchronize_irq() for that is > > > wishful thinking that the irq is merely pending even though it timed > > > out and possibly never ran. Maybe it's stuck in a write buffer in the > > > CPU? > > > > I guess I'm asserting that if a full second passed (because we timed > > out) and after that full second no interrupts are pending then the > > interrupt will never come. That seems a reasonable assumption to me. > > It seems hard to believe it'd be stuck in a write buffer for a full > > second? > > > > Ok, so if we don't expect an irq to come in why are we calling > synchronize_irq()? I'm lost. It turns out that synchronize_irq() doesn't do what I thought it did, actually. :( Despite __synchronize_hardirq() talking about waiting for "pending" interrupts, it actually passes in "IRQCHIP_STATE_ACTIVE" and not "IRQCHIP_STATE_PENDING". So much for that. ...but, if it did, I guess my point (which no longer matters) was: a) If you wait a second but don't wait for pending interrupts to be done, interrupts might still come later if the CPU servicing interrupts was blocked. b) If you don't wait a second but wait for pending interrupts to be done, interrupts might still come later because maybe the transaction wasn't finished first. c) If you wait a second (enough for the transaction to finish) and then wait for pending interrupts (to handle ISR being blocked) then you're good. --- So I got tired of all this conjecture and decided to write some code. I reproduced the problem with some test code that let me call local_irq_disable() for a set amount of time based on sysfs. In terminal 1: while true; do ectool version > /dev/null; done In terminal 2, disable interrupts on cpu0 for 2000 ms: taskset -c 0 echo 2000 > /sys/module/spi_geni_qcom/parameters/doug_test Of course, I got the timeout and the NULL dereference. Then I could poke at all the corner cases. Posting patches for what I think is the best solution... -Doug
diff --git a/drivers/spi/spi-geni-qcom.c b/drivers/spi/spi-geni-qcom.c index 25810a7eef10..e65d6676602b 100644 --- a/drivers/spi/spi-geni-qcom.c +++ b/drivers/spi/spi-geni-qcom.c @@ -457,7 +457,6 @@ static void setup_fifo_xfer(struct spi_transfer *xfer, len = xfer->len / (mas->cur_bits_per_word / BITS_PER_BYTE + 1); len &= TRANS_LEN_MSK; - mas->cur_xfer = xfer; if (xfer->tx_buf) { m_cmd |= SPI_TX_ONLY; mas->tx_rem_bytes = xfer->len; @@ -475,6 +474,7 @@ static void setup_fifo_xfer(struct spi_transfer *xfer, * interrupt could come in at any time now. */ spin_lock_irq(&mas->lock); + mas->cur_xfer = xfer; geni_se_setup_m_cmd(se, m_cmd, FRAGMENTATION); /*
Here, there is a chance of race condition occurrence which leads to NULL pointer dereference with struct spi_geni_master member 'cur_xfer' between setup_fifo_xfer() and handle_fifo_timeout() functions. Fix this race condition with guarding the 'cur_xfer' where it gets updated, with spin_lock_irq/spin_unlock_irq in setup_fifo_xfer() as we do in handle_fifo_timeout() function. Call trace: geni_spi_isr+0x114/0x34c __handle_irq_event_percpu+0xe0/0x23c handle_irq_event_percpu+0x34/0x8c handle_irq_event+0x48/0x94 handle_fasteoi_irq+0xd0/0x140 __handle_domain_irq+0x8c/0xcc gic_handle_irq+0x114/0x1dc el1_irq+0xcc/0x180 geni_spi a80000.spi: Failed to cancel/abort m_cmd dev_watchdog+0x348/0x354 call_timer_fn+0xc4/0x220 __run_timers+0x228/0x2d4 spi_master spi6: failed to transfer one message from queue run_timer_softirq+0x24/0x44 __do_softirq+0x16c/0x344 irq_exit+0xa8/0xac __handle_domain_irq+0x94/0xcc gic_handle_irq+0x114/0x1dc el1_irq+0xcc/0x180 cpuidle_enter_state+0xf8/0x204 cpuidle_enter+0x38/0x4c cros-ec-spi spi6.0: spi transfer failed: -110 ... Fixes: 2ee471a1e28e ("spi: spi-geni-qcom: Mo' betta locking") Signed-off-by: Roja Rani Yarubandi <rojay@codeaurora.org> --- drivers/spi/spi-geni-qcom.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)