Message ID | 20210716122504.22976-1-pali@kernel.org |
---|---|
State | New |
Headers | show |
Series | [stable-4.19] PCI: aardvark: Fix kernel panic during PIO transfer | expand |
On Fri, Jul 16, 2021 at 02:25:04PM +0200, Pali Rohár wrote: > commit f18139966d072dab8e4398c95ce955a9742e04f7 upstream. > > Trying to start a new PIO transfer by writing value 0 in PIO_START register > when previous transfer has not yet completed (which is indicated by value 1 > in PIO_START) causes an External Abort on CPU, which results in kernel > panic: > > SError Interrupt on CPU0, code 0xbf000002 -- SError > Kernel panic - not syncing: Asynchronous SError Interrupt > > To prevent kernel panic, it is required to reject a new PIO transfer when > previous one has not finished yet. > > If previous PIO transfer is not finished yet, the kernel may issue a new > PIO request only if the previous PIO transfer timed out. > > In the past the root cause of this issue was incorrectly identified (as it > often happens during link retraining or after link down event) and special > hack was implemented in Trusted Firmware to catch all SError events in EL3, > to ignore errors with code 0xbf000002 and not forwarding any other errors > to kernel and instead throw panic from EL3 Trusted Firmware handler. > > Links to discussion and patches about this issue: > https://git.trustedfirmware.org/TF-A/trusted-firmware-a.git/commit/?id=3c7dcdac5c50 > https://lore.kernel.org/linux-pci/20190316161243.29517-1-repk@triplefau.lt/ > https://lore.kernel.org/linux-pci/971be151d24312cc533989a64bd454b4@www.loen.fr/ > https://review.trustedfirmware.org/c/TF-A/trusted-firmware-a/+/1541 > > But the real cause was the fact that during link retraining or after link > down event the PIO transfer may take longer time, up to the 1.44s until it > times out. This increased probability that a new PIO transfer would be > issued by kernel while previous one has not finished yet. > > After applying this change into the kernel, it is possible to revert the > mentioned TF-A hack and SError events do not have to be caught in TF-A EL3. > > Link: https://lore.kernel.org/r/20210608203655.31228-1-pali@kernel.org > Signed-off-by: Pali Rohár <pali@kernel.org> > Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> > Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> > Reviewed-by: Marek Behún <kabel@kernel.org> > Cc: stable@vger.kernel.org # 7fbcb5da811b ("PCI: aardvark: Don't rely on jiffies while holding spinlock") > [pali: Backported to 4.19 version] > --- > This patch is backported to 4.19 version. It depends on commit > 7fbcb5da811b as presented on Cc: stable line. > --- > drivers/pci/controller/pci-aardvark.c | 49 ++++++++++++++++++++++----- > 1 file changed, 40 insertions(+), 9 deletions(-) Now queued up, thanks. greg k-h
diff --git a/drivers/pci/controller/pci-aardvark.c b/drivers/pci/controller/pci-aardvark.c index 524e0fb3b062..947f60ba5b75 100644 --- a/drivers/pci/controller/pci-aardvark.c +++ b/drivers/pci/controller/pci-aardvark.c @@ -382,7 +382,7 @@ static int advk_pcie_wait_pio(struct advk_pcie *pcie) udelay(PIO_RETRY_DELAY); } - dev_err(dev, "config read/write timed out\n"); + dev_err(dev, "PIO read/write transfer time out\n"); return -ETIMEDOUT; } @@ -395,6 +395,35 @@ static bool advk_pcie_valid_device(struct advk_pcie *pcie, struct pci_bus *bus, return true; } +static bool advk_pcie_pio_is_running(struct advk_pcie *pcie) +{ + struct device *dev = &pcie->pdev->dev; + + /* + * Trying to start a new PIO transfer when previous has not completed + * cause External Abort on CPU which results in kernel panic: + * + * SError Interrupt on CPU0, code 0xbf000002 -- SError + * Kernel panic - not syncing: Asynchronous SError Interrupt + * + * Functions advk_pcie_rd_conf() and advk_pcie_wr_conf() are protected + * by raw_spin_lock_irqsave() at pci_lock_config() level to prevent + * concurrent calls at the same time. But because PIO transfer may take + * about 1.5s when link is down or card is disconnected, it means that + * advk_pcie_wait_pio() does not always have to wait for completion. + * + * Some versions of ARM Trusted Firmware handles this External Abort at + * EL3 level and mask it to prevent kernel panic. Relevant TF-A commit: + * https://git.trustedfirmware.org/TF-A/trusted-firmware-a.git/commit/?id=3c7dcdac5c50 + */ + if (advk_readl(pcie, PIO_START)) { + dev_err(dev, "Previous PIO read/write transfer is still running\n"); + return true; + } + + return false; +} + static int advk_pcie_rd_conf(struct pci_bus *bus, u32 devfn, int where, int size, u32 *val) { @@ -407,9 +436,10 @@ static int advk_pcie_rd_conf(struct pci_bus *bus, u32 devfn, return PCIBIOS_DEVICE_NOT_FOUND; } - /* Start PIO */ - advk_writel(pcie, 0, PIO_START); - advk_writel(pcie, 1, PIO_ISR); + if (advk_pcie_pio_is_running(pcie)) { + *val = 0xffffffff; + return PCIBIOS_SET_FAILED; + } /* Program the control register */ reg = advk_readl(pcie, PIO_CTRL); @@ -428,7 +458,8 @@ static int advk_pcie_rd_conf(struct pci_bus *bus, u32 devfn, /* Program the data strobe */ advk_writel(pcie, 0xf, PIO_WR_DATA_STRB); - /* Start the transfer */ + /* Clear PIO DONE ISR and start the transfer */ + advk_writel(pcie, 1, PIO_ISR); advk_writel(pcie, 1, PIO_START); ret = advk_pcie_wait_pio(pcie); @@ -462,9 +493,8 @@ static int advk_pcie_wr_conf(struct pci_bus *bus, u32 devfn, if (where % size) return PCIBIOS_SET_FAILED; - /* Start PIO */ - advk_writel(pcie, 0, PIO_START); - advk_writel(pcie, 1, PIO_ISR); + if (advk_pcie_pio_is_running(pcie)) + return PCIBIOS_SET_FAILED; /* Program the control register */ reg = advk_readl(pcie, PIO_CTRL); @@ -491,7 +521,8 @@ static int advk_pcie_wr_conf(struct pci_bus *bus, u32 devfn, /* Program the data strobe */ advk_writel(pcie, data_strobe, PIO_WR_DATA_STRB); - /* Start the transfer */ + /* Clear PIO DONE ISR and start the transfer */ + advk_writel(pcie, 1, PIO_ISR); advk_writel(pcie, 1, PIO_START); ret = advk_pcie_wait_pio(pcie);