Message ID | BN0PR08MB695133000AF116F04C3A9FFE83212@BN0PR08MB6951.namprd08.prod.outlook.com |
---|---|
State | New |
Headers | show |
Series | MMC card detection may trip watchdog if card powered off | expand |
On Wed, 20 Nov 2024 at 17:21, Anthony Pighin (Nokia) <anthony.pighin@nokia.com> wrote: > > If card detection is done via polling, due to broken-cd (Freescale LX2160, etc.), or for other reasons, then the card will be polled asynchronously and periodically. > > If that polling happens after the card has been put in powered off state (i.e. during system shutdown/reboot), then the polling times out. That timeout is of a long duration (10s). And it is repeated multiple times (x3). And that is all done after the watchdogd has been disabled, meaning that system watchdogs are not being kicked. > > If the MMC polling exceeds the watchdog trip time, then the system will be ungraciously reset. Or in the case of a pretimeout capable watchdog, the pretimeout will trip unnecessarily. > > [ 46.872767] mmc_mrq_pr_debug:274: mmc1: starting CMD6 arg 03220301 flags 0000049d > [ 46.880258] sdhci_irq:3558: mmc1: sdhci: IRQ status 0x00000001 > [ 46.886082] sdhci_irq:3558: mmc1: sdhci: IRQ status 0x00000002 > [ 46.891906] mmc_request_done:187: mmc1: req done (CMD6): 0: 00000800 00000000 00000000 00000000 > [ 46.900606] mmc_set_ios:892: mmc1: clock 0Hz busmode 2 powermode 0 cs 0 Vdd 0 width 1 timing 0 > [ 46.914934] mmc_mrq_pr_debug:274: mmc1: starting CMD13 arg 00010000 flags 00000195 > [ 57.433351] mmc1: Timeout waiting for hardware cmd interrupt. Hmm. How is the polling being done? By using MMC_CAP_NEEDS_POLL? The above certainly looks weird. The mmc_rescan work should not be allowed to run at this point, as the work is getting punted to the system_freezable_wq via mmc_schedule_delayed_work(). > ... > [ 71.031911] [Redacted] 2030000.i2c:[Redacted]@41:watchdog: Watchdog interrupt received! > [ 71.039737] Kernel panic - not syncing: watchdog pretimeout event > [ 71.045820] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G O 6.6.59 #1 > [ 71.053207] Hardware name: [Redacted] > [ 71.059897] Call trace: > [ 71.062332] dump_backtrace+0x9c/0x128 > ... > > CMD6 is SWITCH_FUNC and arg 03220301 is POWER_OFF_NOTIFICATION (bits 16:23 = 0x22 = 34). > CMD13 is SEND_STATUS, and when it occurs after the POWER_OFF_NOTIFICATION (as above) bad things happen. > > I have made the following change to attempt to work around the issue, and it seems to hold up, but is also quite brute force: > > --- a/drivers/mmc/core/mmc.c > +++ b/drivers/mmc/core/mmc.c > @@ -2046,6 +2046,11 @@ static void mmc_remove(struct mmc_host *host) > */ > static int mmc_alive(struct mmc_host *host) > { > + if (host->card && mmc_card_suspended(host->card)) { > + pr_err("%s: Skip card detection: Card suspended\n", > + mmc_hostname(host)); > + return -ENOMEDIUM; > + } > return mmc_send_status(host->card, NULL); > } Yeah, the above isn't really the correct solution in my opinion. We need to prevent the mmc_rescan work from running, which I thought we already did... > > Anthony > > Kind regards Uffe
> > > > If card detection is done via polling, due to broken-cd (Freescale LX2160, > etc.), or for other reasons, then the card will be polled asynchronously and > periodically. > > > > If that polling happens after the card has been put in powered off state (i.e. > during system shutdown/reboot), then the polling times out. That timeout is > of a long duration (10s). And it is repeated multiple times (x3). And that is all > done after the watchdogd has been disabled, meaning that system watchdogs > are not being kicked. > > > > If the MMC polling exceeds the watchdog trip time, then the system will be > ungraciously reset. Or in the case of a pretimeout capable watchdog, the > pretimeout will trip unnecessarily. > > > > [ 46.872767] mmc_mrq_pr_debug:274: mmc1: starting CMD6 arg > 03220301 flags 0000049d > > [ 46.880258] sdhci_irq:3558: mmc1: sdhci: IRQ status 0x00000001 > > [ 46.886082] sdhci_irq:3558: mmc1: sdhci: IRQ status 0x00000002 > > [ 46.891906] mmc_request_done:187: mmc1: req done (CMD6): 0: > 00000800 00000000 00000000 00000000 > > [ 46.900606] mmc_set_ios:892: mmc1: clock 0Hz busmode 2 > powermode 0 cs 0 Vdd 0 width 1 timing 0 > > [ 46.914934] mmc_mrq_pr_debug:274: mmc1: starting CMD13 arg > 00010000 flags 00000195 > > [ 57.433351] mmc1: Timeout waiting for hardware cmd interrupt. > > Hmm. How is the polling being done? By using MMC_CAP_NEEDS_POLL? > Correct. (At least in my understanding.) 'broken-cd' in the device tree will trigger MMC_CAP_NEEDS_POLL to be set. > The above certainly looks weird. The mmc_rescan work should not be > allowed to run at this point, as the work is getting punted to the > system_freezable_wq via mmc_schedule_delayed_work(). > This is the backtrace I get when the timeout occurs: [ 46.154348] mmc_mrq_pr_debug:274: mmc1: starting CMD13 arg 00010000 flags 00000195 [ 46.161917] sdhci_irq:3546: mmc1: sdhci: IRQ status 0x00000001 [ 46.167743] mmc_request_done:187: mmc1: req done (CMD13): 0: 00000900 00000000 00000000 00000000 [ 46.176526] mmc_mrq_pr_debug:274: mmc1: starting CMD6 arg 03220301 flags 0000049d [ 46.184016] sdhci_irq:3546: mmc1: sdhci: IRQ status 0x00000001 [ 46.189840] sdhci_irq:3546: mmc1: sdhci: IRQ status 0x00000002 [ 46.195665] mmc_request_done:187: mmc1: req done (CMD6): 0: 00000800 00000000 00000000 00000000 [ 46.204362] mmc_set_ios:892: mmc1: clock 0Hz busmode 2 powermode 0 cs 0 Vdd 0 width 1 timing 0 [ 46.219307] CPU: 6 PID: 153 Comm: kworker/6:1 Tainted: G O 6.6.59 #1 [ 46.231988] Hardware name: [Redacted] [ 46.238678] Workqueue: events_freezable mmc_rescan [ 46.243466] Call trace: [ 46.245901] dump_backtrace+0x9c/0x128 [ 46.249643] show_stack+0x20/0x38 [ 46.252948] dump_stack_lvl+0x48/0x60 [ 46.256603] dump_stack+0x18/0x28 [ 46.259909] mmc_alive+0x74/0x88 [ 46.263128] _mmc_detect_card_removed+0x3c/0x158 [ 46.267735] mmc_detect+0x30/0x98 [ 46.271040] mmc_rescan+0x94/0x238 [ 46.274432] process_one_work+0x178/0x3d8 [ 46.278432] worker_thread+0x2bc/0x3e0 [ 46.282171] kthread+0xf8/0x110 [ 46.285303] ret_from_fork+0x10/0x20 [ 46.288878] mmc_mrq_pr_debug:274: mmc1: starting CMD13 arg 00010000 flags 00000195 [ 56.793379] mmc1: Timeout waiting for hardware cmd interrupt. [ 56.799116] mmc1: sdhci: ============ SDHCI REGISTER DUMP =========== [ 56.805545] mmc1: sdhci: Sys addr: 0x00000000 | Version: 0x00002202 ... > > ... > > [ 71.031911] [Redacted] 2030000.i2c:[Redacted]@41:watchdog: > Watchdog interrupt received! > > [ 71.039737] Kernel panic - not syncing: watchdog pretimeout event > > [ 71.045820] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G O > 6.6.59 #1 > > [ 71.053207] Hardware name: [Redacted] > > [ 71.059897] Call trace: > > [ 71.062332] dump_backtrace+0x9c/0x128 > > ... > > > > CMD6 is SWITCH_FUNC and arg 03220301 is POWER_OFF_NOTIFICATION > (bits 16:23 = 0x22 = 34). > > CMD13 is SEND_STATUS, and when it occurs after the > POWER_OFF_NOTIFICATION (as above) bad things happen. > > > > I have made the following change to attempt to work around the issue, and > it seems to hold up, but is also quite brute force: > > > > --- a/drivers/mmc/core/mmc.c > > +++ b/drivers/mmc/core/mmc.c > > @@ -2046,6 +2046,11 @@ static void mmc_remove(struct mmc_host *host) > > */ > > static int mmc_alive(struct mmc_host *host) { > > + if (host->card && mmc_card_suspended(host->card)) { > > + pr_err("%s: Skip card detection: Card suspended\n", > > + mmc_hostname(host)); > > + return -ENOMEDIUM; > > + } > > return mmc_send_status(host->card, NULL); } > > Yeah, the above isn't really the correct solution in my opinion. > > We need to prevent the mmc_rescan work from running, which I thought we > already did... > > > > > Anthony > > > > > > Kind regards > Uffe
On Thu, 21 Nov 2024 at 14:23, Anthony Pighin (Nokia) <anthony.pighin@nokia.com> wrote: > > > > > > > If card detection is done via polling, due to broken-cd (Freescale LX2160, > > etc.), or for other reasons, then the card will be polled asynchronously and > > periodically. > > > > > > If that polling happens after the card has been put in powered off state (i.e. > > during system shutdown/reboot), then the polling times out. That timeout is > > of a long duration (10s). And it is repeated multiple times (x3). And that is all > > done after the watchdogd has been disabled, meaning that system watchdogs > > are not being kicked. > > > > > > If the MMC polling exceeds the watchdog trip time, then the system will be > > ungraciously reset. Or in the case of a pretimeout capable watchdog, the > > pretimeout will trip unnecessarily. > > > > > > [ 46.872767] mmc_mrq_pr_debug:274: mmc1: starting CMD6 arg > > 03220301 flags 0000049d > > > [ 46.880258] sdhci_irq:3558: mmc1: sdhci: IRQ status 0x00000001 > > > [ 46.886082] sdhci_irq:3558: mmc1: sdhci: IRQ status 0x00000002 > > > [ 46.891906] mmc_request_done:187: mmc1: req done (CMD6): 0: > > 00000800 00000000 00000000 00000000 > > > [ 46.900606] mmc_set_ios:892: mmc1: clock 0Hz busmode 2 > > powermode 0 cs 0 Vdd 0 width 1 timing 0 > > > [ 46.914934] mmc_mrq_pr_debug:274: mmc1: starting CMD13 arg > > 00010000 flags 00000195 > > > [ 57.433351] mmc1: Timeout waiting for hardware cmd interrupt. > > > > Hmm. How is the polling being done? By using MMC_CAP_NEEDS_POLL? > > > > Correct. (At least in my understanding.) 'broken-cd' in the device tree will trigger > MMC_CAP_NEEDS_POLL to be set. > > > The above certainly looks weird. The mmc_rescan work should not be > > allowed to run at this point, as the work is getting punted to the > > system_freezable_wq via mmc_schedule_delayed_work(). > > > > This is the backtrace I get when the timeout occurs: > > [ 46.154348] mmc_mrq_pr_debug:274: mmc1: starting CMD13 arg 00010000 flags 00000195 > [ 46.161917] sdhci_irq:3546: mmc1: sdhci: IRQ status 0x00000001 > [ 46.167743] mmc_request_done:187: mmc1: req done (CMD13): 0: 00000900 00000000 00000000 00000000 > [ 46.176526] mmc_mrq_pr_debug:274: mmc1: starting CMD6 arg 03220301 flags 0000049d > [ 46.184016] sdhci_irq:3546: mmc1: sdhci: IRQ status 0x00000001 > [ 46.189840] sdhci_irq:3546: mmc1: sdhci: IRQ status 0x00000002 > [ 46.195665] mmc_request_done:187: mmc1: req done (CMD6): 0: 00000800 00000000 00000000 00000000 > [ 46.204362] mmc_set_ios:892: mmc1: clock 0Hz busmode 2 powermode 0 cs 0 Vdd 0 width 1 timing 0 > [ 46.219307] CPU: 6 PID: 153 Comm: kworker/6:1 Tainted: G O 6.6.59 #1 > [ 46.231988] Hardware name: [Redacted] > [ 46.238678] Workqueue: events_freezable mmc_rescan > [ 46.243466] Call trace: > [ 46.245901] dump_backtrace+0x9c/0x128 > [ 46.249643] show_stack+0x20/0x38 > [ 46.252948] dump_stack_lvl+0x48/0x60 > [ 46.256603] dump_stack+0x18/0x28 > [ 46.259909] mmc_alive+0x74/0x88 > [ 46.263128] _mmc_detect_card_removed+0x3c/0x158 > [ 46.267735] mmc_detect+0x30/0x98 > [ 46.271040] mmc_rescan+0x94/0x238 > [ 46.274432] process_one_work+0x178/0x3d8 > [ 46.278432] worker_thread+0x2bc/0x3e0 > [ 46.282171] kthread+0xf8/0x110 > [ 46.285303] ret_from_fork+0x10/0x20 > [ 46.288878] mmc_mrq_pr_debug:274: mmc1: starting CMD13 arg 00010000 flags 00000195 > [ 56.793379] mmc1: Timeout waiting for hardware cmd interrupt. > [ 56.799116] mmc1: sdhci: ============ SDHCI REGISTER DUMP =========== > [ 56.805545] mmc1: sdhci: Sys addr: 0x00000000 | Version: 0x00002202 > ... > Okay. If this is system suspend, it looks like the workqueue didn't become frozen as exepcted. I don't know why, but this needs to be investigated. If this is a shutdown, mmc_host_classdev_shutdown() should be called to "disable" the mmc_rescan work from running the code causing the above. What am I missing? Kind regards Uffe > > > ... > > > [ 71.031911] [Redacted] 2030000.i2c:[Redacted]@41:watchdog: > > Watchdog interrupt received! > > > [ 71.039737] Kernel panic - not syncing: watchdog pretimeout event > > > [ 71.045820] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G O > > 6.6.59 #1 > > > [ 71.053207] Hardware name: [Redacted] > > > [ 71.059897] Call trace: > > > [ 71.062332] dump_backtrace+0x9c/0x128 > > > ... > > > > > > CMD6 is SWITCH_FUNC and arg 03220301 is POWER_OFF_NOTIFICATION > > (bits 16:23 = 0x22 = 34). > > > CMD13 is SEND_STATUS, and when it occurs after the > > POWER_OFF_NOTIFICATION (as above) bad things happen. > > > > > > I have made the following change to attempt to work around the issue, and > > it seems to hold up, but is also quite brute force: > > > > > > --- a/drivers/mmc/core/mmc.c > > > +++ b/drivers/mmc/core/mmc.c > > > @@ -2046,6 +2046,11 @@ static void mmc_remove(struct mmc_host *host) > > > */ > > > static int mmc_alive(struct mmc_host *host) { > > > + if (host->card && mmc_card_suspended(host->card)) { > > > + pr_err("%s: Skip card detection: Card suspended\n", > > > + mmc_hostname(host)); > > > + return -ENOMEDIUM; > > > + } > > > return mmc_send_status(host->card, NULL); } > > > > Yeah, the above isn't really the correct solution in my opinion. > > > > We need to prevent the mmc_rescan work from running, which I thought we > > already did... > > > > > > > > Anthony > > > > > > > > > > Kind regards > > Uffe
> > > > > > > > If card detection is done via polling, due to broken-cd (Freescale > > > > LX2160, > > > etc.), or for other reasons, then the card will be polled > > > asynchronously and periodically. > > > > > > > > If that polling happens after the card has been put in powered off state > (i.e. > > > during system shutdown/reboot), then the polling times out. That > > > timeout is of a long duration (10s). And it is repeated multiple > > > times (x3). And that is all done after the watchdogd has been > > > disabled, meaning that system watchdogs are not being kicked. > > > > > > > > If the MMC polling exceeds the watchdog trip time, then the system > > > > will be > > > ungraciously reset. Or in the case of a pretimeout capable watchdog, > > > the pretimeout will trip unnecessarily. > > > > > > > > [ 46.872767] mmc_mrq_pr_debug:274: mmc1: starting CMD6 arg > > > 03220301 flags 0000049d > > > > [ 46.880258] sdhci_irq:3558: mmc1: sdhci: IRQ status 0x00000001 > > > > [ 46.886082] sdhci_irq:3558: mmc1: sdhci: IRQ status 0x00000002 > > > > [ 46.891906] mmc_request_done:187: mmc1: req done (CMD6): 0: > > > 00000800 00000000 00000000 00000000 > > > > [ 46.900606] mmc_set_ios:892: mmc1: clock 0Hz busmode 2 > > > powermode 0 cs 0 Vdd 0 width 1 timing 0 > > > > [ 46.914934] mmc_mrq_pr_debug:274: mmc1: starting CMD13 arg > > > 00010000 flags 00000195 > > > > [ 57.433351] mmc1: Timeout waiting for hardware cmd interrupt. > > > > > > Hmm. How is the polling being done? By using MMC_CAP_NEEDS_POLL? > > > > > > > Correct. (At least in my understanding.) 'broken-cd' in the device > > tree will trigger MMC_CAP_NEEDS_POLL to be set. > > > > > The above certainly looks weird. The mmc_rescan work should not be > > > allowed to run at this point, as the work is getting punted to the > > > system_freezable_wq via mmc_schedule_delayed_work(). > > > > > > > This is the backtrace I get when the timeout occurs: > > > > [ 46.154348] mmc_mrq_pr_debug:274: mmc1: starting CMD13 arg > 00010000 flags 00000195 > > [ 46.161917] sdhci_irq:3546: mmc1: sdhci: IRQ status 0x00000001 > > [ 46.167743] mmc_request_done:187: mmc1: req done (CMD13): 0: > 00000900 00000000 00000000 00000000 > > [ 46.176526] mmc_mrq_pr_debug:274: mmc1: starting CMD6 arg > 03220301 flags 0000049d > > [ 46.184016] sdhci_irq:3546: mmc1: sdhci: IRQ status 0x00000001 > > [ 46.189840] sdhci_irq:3546: mmc1: sdhci: IRQ status 0x00000002 > > [ 46.195665] mmc_request_done:187: mmc1: req done (CMD6): 0: > 00000800 00000000 00000000 00000000 > > [ 46.204362] mmc_set_ios:892: mmc1: clock 0Hz busmode 2 powermode 0 > cs 0 Vdd 0 width 1 timing 0 > > [ 46.219307] CPU: 6 PID: 153 Comm: kworker/6:1 Tainted: G O > 6.6.59 #1 > > [ 46.231988] Hardware name: [Redacted] > > [ 46.238678] Workqueue: events_freezable mmc_rescan > > [ 46.243466] Call trace: > > [ 46.245901] dump_backtrace+0x9c/0x128 > > [ 46.249643] show_stack+0x20/0x38 > > [ 46.252948] dump_stack_lvl+0x48/0x60 > > [ 46.256603] dump_stack+0x18/0x28 > > [ 46.259909] mmc_alive+0x74/0x88 > > [ 46.263128] _mmc_detect_card_removed+0x3c/0x158 > > [ 46.267735] mmc_detect+0x30/0x98 > > [ 46.271040] mmc_rescan+0x94/0x238 > > [ 46.274432] process_one_work+0x178/0x3d8 > > [ 46.278432] worker_thread+0x2bc/0x3e0 > > [ 46.282171] kthread+0xf8/0x110 > > [ 46.285303] ret_from_fork+0x10/0x20 > > [ 46.288878] mmc_mrq_pr_debug:274: mmc1: starting CMD13 arg > 00010000 flags 00000195 > > [ 56.793379] mmc1: Timeout waiting for hardware cmd interrupt. > > [ 56.799116] mmc1: sdhci: ============ SDHCI REGISTER DUMP > =========== > > [ 56.805545] mmc1: sdhci: Sys addr: 0x00000000 | Version: 0x00002202 > > ... > > > > Okay. If this is system suspend, it looks like the workqueue didn't become > frozen as exepcted. I don't know why, but this needs to be investigated. > > If this is a shutdown, mmc_host_classdev_shutdown() should be called to > "disable" the mmc_rescan work from running the code causing the above. > > What am I missing? > > Kind regards > Uffe > This occurs in response to a 'reboot' command. Here is an instance of the failure, with initcall_debug set to 'Y': Rebooting... [ 44.500858] caam_jr 8030000.jr: shutdown [ 44.504805] caam_jr 8020000.jr: shutdown [ 44.508748] caam_jr 8010000.jr: shutdown [ 44.515349] tpm tpm0: shutdown_pre [ 44.523165] igb 0000:01:00.0: shutdown [ 44.558695] usb usb2-port1: shutdown [ 44.562268] usb usb1-port1: shutdown [ 44.565864] xhci-hcd xhci-hcd.0.auto: shutdown [ 44.570306] dwc3 3100000.usb: shutdown [ 44.574052] [Redacted] 2030000.i2c:cpld@41:watchdog: shutdown [ 44.580137] tpm_tis_spi spi2.0: shutdown [ 44.584054] spi-nor spi0.1: shutdown [ 44.588066] spi-nor spi0.0: shutdown [ 44.592112] mmcblk mmc0:0001: shutdown [ 44.595902] mmc_mrq_pr_debug:274: mmc0: starting CMD7 arg 00000000 flags 00000000 [ 44.603413] sdhci_irq:3546: mmc0: sdhci: IRQ status 0x00000001 [ 44.609240] mmc_request_done:187: mmc0: req done (CMD7): 0: 00000000 00000000 00000000 00000000 [ 44.617939] mmc_set_ios:892: mmc0: clock 0Hz busmode 2 powermode 0 cs 0 Vdd 0 width 1 timing 0 [ 44.627877] mmcblk mmc1:0001: shutdown [ 44.631709] mmc_mrq_pr_debug:274: mmc1: starting CMD6 arg 03200101 flags 00000095 [ 44.639200] sdhci_irq:3546: mmc1: sdhci: IRQ status 0x00000001 [ 44.645027] mmc_request_done:187: mmc1: req done (CMD6): 0: 00000800 00000000 00000000 00000000 [ 44.653727] mmc_mrq_pr_debug:274: mmc1: starting CMD13 arg 00010000 flags 00000195 [ 44.661296] sdhci_irq:3546: mmc1: sdhci: IRQ status 0x00000001 [ 44.667123] mmc_request_done:187: mmc1: req done (CMD13): 0: 00000900 00000000 00000000 00000000 [ 44.675907] mmc_mrq_pr_debug:274: mmc1: starting CMD6 arg 03220301 flags 0000049d [ 44.683399] sdhci_irq:3546: mmc1: sdhci: IRQ status 0x00000001 [ 44.689223] sdhci_irq:3546: mmc1: sdhci: IRQ status 0x00000002 [ 44.695047] mmc_request_done:187: mmc1: req done (CMD6): 0: 00000800 00000000 00000000 00000000 [ 44.703746] mmc_set_ios:892: mmc1: clock 0Hz busmode 2 powermode 0 cs 0 Vdd 0 width 1 timing 0 [ 44.713667] aer 0000:00:00.0:pcie002: shutdown [ 44.713673] mmc_mrq_pr_debug:274: mmc1: starting CMD13 arg 00010000 flags 00000195 [ 44.718102] pcie_pme 0000:00:00.0:pcie001: shutdown [ 44.730534] pcieport 0000:00:00.0: shutdown [ 44.734898] fsl_mc_dprc dprc.1: shutdown [ 44.740863] fsl_mc_dprc dprc.1: DPRC device shutdown [ 44.745830] mmc_host mmc1: shutdown_pre [ 54.873286] mmc1: Timeout waiting for hardware cmd interrupt. [ 54.879023] mmc1: sdhci: ============ SDHCI REGISTER DUMP =========== [ 54.885453] mmc1: sdhci: Sys addr: 0x00000000 | Version: 0x00002202 The trigger CMD13 is in-flight at time 44.713673. The mmc_host mmc1: shutdown_pre is at time 44.745830. > > > > ... > > > > [ 71.031911] [Redacted] 2030000.i2c:[Redacted]@41:watchdog: > > > Watchdog interrupt received! > > > > [ 71.039737] Kernel panic - not syncing: watchdog pretimeout event > > > > [ 71.045820] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G O > > > 6.6.59 #1 > > > > [ 71.053207] Hardware name: [Redacted] > > > > [ 71.059897] Call trace: > > > > [ 71.062332] dump_backtrace+0x9c/0x128 > > > > ... > > > > > > > > CMD6 is SWITCH_FUNC and arg 03220301 is > POWER_OFF_NOTIFICATION > > > (bits 16:23 = 0x22 = 34). > > > > CMD13 is SEND_STATUS, and when it occurs after the > > > POWER_OFF_NOTIFICATION (as above) bad things happen. > > > > > > > > I have made the following change to attempt to work around the > > > > issue, and > > > it seems to hold up, but is also quite brute force: > > > > > > > > --- a/drivers/mmc/core/mmc.c > > > > +++ b/drivers/mmc/core/mmc.c > > > > @@ -2046,6 +2046,11 @@ static void mmc_remove(struct mmc_host > *host) > > > > */ > > > > static int mmc_alive(struct mmc_host *host) { > > > > + if (host->card && mmc_card_suspended(host->card)) { > > > > + pr_err("%s: Skip card detection: Card suspended\n", > > > > + mmc_hostname(host)); > > > > + return -ENOMEDIUM; > > > > + } > > > > return mmc_send_status(host->card, NULL); } > > > > > > Yeah, the above isn't really the correct solution in my opinion. > > > > > > We need to prevent the mmc_rescan work from running, which I thought > > > we already did... > > > > > > > > > > > Anthony > > > > > > > > > > > > > > Kind regards > > > Uffe
On Fri, 22 Nov 2024 at 18:47, Anthony Pighin (Nokia) <anthony.pighin@nokia.com> wrote: > > > > > > > > > > > If card detection is done via polling, due to broken-cd (Freescale > > > > > LX2160, > > > > etc.), or for other reasons, then the card will be polled > > > > asynchronously and periodically. > > > > > > > > > > If that polling happens after the card has been put in powered off state > > (i.e. > > > > during system shutdown/reboot), then the polling times out. That > > > > timeout is of a long duration (10s). And it is repeated multiple > > > > times (x3). And that is all done after the watchdogd has been > > > > disabled, meaning that system watchdogs are not being kicked. > > > > > > > > > > If the MMC polling exceeds the watchdog trip time, then the system > > > > > will be > > > > ungraciously reset. Or in the case of a pretimeout capable watchdog, > > > > the pretimeout will trip unnecessarily. > > > > > > > > > > [ 46.872767] mmc_mrq_pr_debug:274: mmc1: starting CMD6 arg > > > > 03220301 flags 0000049d > > > > > [ 46.880258] sdhci_irq:3558: mmc1: sdhci: IRQ status 0x00000001 > > > > > [ 46.886082] sdhci_irq:3558: mmc1: sdhci: IRQ status 0x00000002 > > > > > [ 46.891906] mmc_request_done:187: mmc1: req done (CMD6): 0: > > > > 00000800 00000000 00000000 00000000 > > > > > [ 46.900606] mmc_set_ios:892: mmc1: clock 0Hz busmode 2 > > > > powermode 0 cs 0 Vdd 0 width 1 timing 0 > > > > > [ 46.914934] mmc_mrq_pr_debug:274: mmc1: starting CMD13 arg > > > > 00010000 flags 00000195 > > > > > [ 57.433351] mmc1: Timeout waiting for hardware cmd interrupt. > > > > > > > > Hmm. How is the polling being done? By using MMC_CAP_NEEDS_POLL? > > > > > > > > > > Correct. (At least in my understanding.) 'broken-cd' in the device > > > tree will trigger MMC_CAP_NEEDS_POLL to be set. > > > > > > > The above certainly looks weird. The mmc_rescan work should not be > > > > allowed to run at this point, as the work is getting punted to the > > > > system_freezable_wq via mmc_schedule_delayed_work(). > > > > > > > > > > This is the backtrace I get when the timeout occurs: > > > > > > [ 46.154348] mmc_mrq_pr_debug:274: mmc1: starting CMD13 arg > > 00010000 flags 00000195 > > > [ 46.161917] sdhci_irq:3546: mmc1: sdhci: IRQ status 0x00000001 > > > [ 46.167743] mmc_request_done:187: mmc1: req done (CMD13): 0: > > 00000900 00000000 00000000 00000000 > > > [ 46.176526] mmc_mrq_pr_debug:274: mmc1: starting CMD6 arg > > 03220301 flags 0000049d > > > [ 46.184016] sdhci_irq:3546: mmc1: sdhci: IRQ status 0x00000001 > > > [ 46.189840] sdhci_irq:3546: mmc1: sdhci: IRQ status 0x00000002 > > > [ 46.195665] mmc_request_done:187: mmc1: req done (CMD6): 0: > > 00000800 00000000 00000000 00000000 > > > [ 46.204362] mmc_set_ios:892: mmc1: clock 0Hz busmode 2 powermode 0 > > cs 0 Vdd 0 width 1 timing 0 > > > [ 46.219307] CPU: 6 PID: 153 Comm: kworker/6:1 Tainted: G O > > 6.6.59 #1 > > > [ 46.231988] Hardware name: [Redacted] > > > [ 46.238678] Workqueue: events_freezable mmc_rescan > > > [ 46.243466] Call trace: > > > [ 46.245901] dump_backtrace+0x9c/0x128 > > > [ 46.249643] show_stack+0x20/0x38 > > > [ 46.252948] dump_stack_lvl+0x48/0x60 > > > [ 46.256603] dump_stack+0x18/0x28 > > > [ 46.259909] mmc_alive+0x74/0x88 > > > [ 46.263128] _mmc_detect_card_removed+0x3c/0x158 > > > [ 46.267735] mmc_detect+0x30/0x98 > > > [ 46.271040] mmc_rescan+0x94/0x238 > > > [ 46.274432] process_one_work+0x178/0x3d8 > > > [ 46.278432] worker_thread+0x2bc/0x3e0 > > > [ 46.282171] kthread+0xf8/0x110 > > > [ 46.285303] ret_from_fork+0x10/0x20 > > > [ 46.288878] mmc_mrq_pr_debug:274: mmc1: starting CMD13 arg > > 00010000 flags 00000195 > > > [ 56.793379] mmc1: Timeout waiting for hardware cmd interrupt. > > > [ 56.799116] mmc1: sdhci: ============ SDHCI REGISTER DUMP > > =========== > > > [ 56.805545] mmc1: sdhci: Sys addr: 0x00000000 | Version: 0x00002202 > > > ... > > > > > > > Okay. If this is system suspend, it looks like the workqueue didn't become > > frozen as exepcted. I don't know why, but this needs to be investigated. > > > > If this is a shutdown, mmc_host_classdev_shutdown() should be called to > > "disable" the mmc_rescan work from running the code causing the above. > > > > What am I missing? > > > > Kind regards > > Uffe > > > > This occurs in response to a 'reboot' command. > > Here is an instance of the failure, with initcall_debug set to 'Y': > > Rebooting... > [ 44.500858] caam_jr 8030000.jr: shutdown > [ 44.504805] caam_jr 8020000.jr: shutdown > [ 44.508748] caam_jr 8010000.jr: shutdown > [ 44.515349] tpm tpm0: shutdown_pre > [ 44.523165] igb 0000:01:00.0: shutdown > [ 44.558695] usb usb2-port1: shutdown > [ 44.562268] usb usb1-port1: shutdown > [ 44.565864] xhci-hcd xhci-hcd.0.auto: shutdown > [ 44.570306] dwc3 3100000.usb: shutdown > [ 44.574052] [Redacted] 2030000.i2c:cpld@41:watchdog: shutdown > [ 44.580137] tpm_tis_spi spi2.0: shutdown > [ 44.584054] spi-nor spi0.1: shutdown > [ 44.588066] spi-nor spi0.0: shutdown > [ 44.592112] mmcblk mmc0:0001: shutdown > [ 44.595902] mmc_mrq_pr_debug:274: mmc0: starting CMD7 arg 00000000 flags 00000000 > [ 44.603413] sdhci_irq:3546: mmc0: sdhci: IRQ status 0x00000001 > [ 44.609240] mmc_request_done:187: mmc0: req done (CMD7): 0: 00000000 00000000 00000000 00000000 > [ 44.617939] mmc_set_ios:892: mmc0: clock 0Hz busmode 2 powermode 0 cs 0 Vdd 0 width 1 timing 0 > [ 44.627877] mmcblk mmc1:0001: shutdown > [ 44.631709] mmc_mrq_pr_debug:274: mmc1: starting CMD6 arg 03200101 flags 00000095 > [ 44.639200] sdhci_irq:3546: mmc1: sdhci: IRQ status 0x00000001 > [ 44.645027] mmc_request_done:187: mmc1: req done (CMD6): 0: 00000800 00000000 00000000 00000000 > [ 44.653727] mmc_mrq_pr_debug:274: mmc1: starting CMD13 arg 00010000 flags 00000195 > [ 44.661296] sdhci_irq:3546: mmc1: sdhci: IRQ status 0x00000001 > [ 44.667123] mmc_request_done:187: mmc1: req done (CMD13): 0: 00000900 00000000 00000000 00000000 > [ 44.675907] mmc_mrq_pr_debug:274: mmc1: starting CMD6 arg 03220301 flags 0000049d > [ 44.683399] sdhci_irq:3546: mmc1: sdhci: IRQ status 0x00000001 > [ 44.689223] sdhci_irq:3546: mmc1: sdhci: IRQ status 0x00000002 > [ 44.695047] mmc_request_done:187: mmc1: req done (CMD6): 0: 00000800 00000000 00000000 00000000 > [ 44.703746] mmc_set_ios:892: mmc1: clock 0Hz busmode 2 powermode 0 cs 0 Vdd 0 width 1 timing 0 > [ 44.713667] aer 0000:00:00.0:pcie002: shutdown > [ 44.713673] mmc_mrq_pr_debug:274: mmc1: starting CMD13 arg 00010000 flags 00000195 > [ 44.718102] pcie_pme 0000:00:00.0:pcie001: shutdown > [ 44.730534] pcieport 0000:00:00.0: shutdown > [ 44.734898] fsl_mc_dprc dprc.1: shutdown > [ 44.740863] fsl_mc_dprc dprc.1: DPRC device shutdown > [ 44.745830] mmc_host mmc1: shutdown_pre > [ 54.873286] mmc1: Timeout waiting for hardware cmd interrupt. > [ 54.879023] mmc1: sdhci: ============ SDHCI REGISTER DUMP =========== > [ 54.885453] mmc1: sdhci: Sys addr: 0x00000000 | Version: 0x00002202 > > The trigger CMD13 is in-flight at time 44.713673. > The mmc_host mmc1: shutdown_pre is at time 44.745830. Thanks a lot for providing the details! Yes, this certainly points out the problem during shutdown/reboot. I have sent a patch [1] that I think should address the problem. Can you please test and verify it? [...] Kind regards Uffe [1] https://lore.kernel.org/all/20241125122446.18684-1-ulf.hansson@linaro.org/
> > > > > > > > > > > > If card detection is done via polling, due to broken-cd > > > > > > (Freescale LX2160, > > > > > etc.), or for other reasons, then the card will be polled > > > > > asynchronously and periodically. > > > > > > > > > > > > If that polling happens after the card has been put in powered > > > > > > off state > > > (i.e. > > > > > during system shutdown/reboot), then the polling times out. That > > > > > timeout is of a long duration (10s). And it is repeated multiple > > > > > times (x3). And that is all done after the watchdogd has been > > > > > disabled, meaning that system watchdogs are not being kicked. > > > > > > > > > > > > If the MMC polling exceeds the watchdog trip time, then the > > > > > > system will be > > > > > ungraciously reset. Or in the case of a pretimeout capable > > > > > watchdog, the pretimeout will trip unnecessarily. > > > > > > > > > > > > [ 46.872767] mmc_mrq_pr_debug:274: mmc1: starting CMD6 arg > > > > > 03220301 flags 0000049d > > > > > > [ 46.880258] sdhci_irq:3558: mmc1: sdhci: IRQ status 0x00000001 > > > > > > [ 46.886082] sdhci_irq:3558: mmc1: sdhci: IRQ status 0x00000002 > > > > > > [ 46.891906] mmc_request_done:187: mmc1: req done (CMD6): > 0: > > > > > 00000800 00000000 00000000 00000000 > > > > > > [ 46.900606] mmc_set_ios:892: mmc1: clock 0Hz busmode 2 > > > > > powermode 0 cs 0 Vdd 0 width 1 timing 0 > > > > > > [ 46.914934] mmc_mrq_pr_debug:274: mmc1: starting CMD13 > arg > > > > > 00010000 flags 00000195 > > > > > > [ 57.433351] mmc1: Timeout waiting for hardware cmd interrupt. > > > > > > > > > > Hmm. How is the polling being done? By using > MMC_CAP_NEEDS_POLL? > > > > > > > > > > > > > Correct. (At least in my understanding.) 'broken-cd' in the device > > > > tree will trigger MMC_CAP_NEEDS_POLL to be set. > > > > > > > > > The above certainly looks weird. The mmc_rescan work should not > > > > > be allowed to run at this point, as the work is getting punted > > > > > to the system_freezable_wq via mmc_schedule_delayed_work(). > > > > > > > > > > > > > This is the backtrace I get when the timeout occurs: > > > > > > > > [ 46.154348] mmc_mrq_pr_debug:274: mmc1: starting CMD13 arg > > > 00010000 flags 00000195 > > > > [ 46.161917] sdhci_irq:3546: mmc1: sdhci: IRQ status 0x00000001 > > > > [ 46.167743] mmc_request_done:187: mmc1: req done (CMD13): 0: > > > 00000900 00000000 00000000 00000000 > > > > [ 46.176526] mmc_mrq_pr_debug:274: mmc1: starting CMD6 arg > > > 03220301 flags 0000049d > > > > [ 46.184016] sdhci_irq:3546: mmc1: sdhci: IRQ status 0x00000001 > > > > [ 46.189840] sdhci_irq:3546: mmc1: sdhci: IRQ status 0x00000002 > > > > [ 46.195665] mmc_request_done:187: mmc1: req done (CMD6): 0: > > > 00000800 00000000 00000000 00000000 > > > > [ 46.204362] mmc_set_ios:892: mmc1: clock 0Hz busmode 2 > powermode 0 > > > cs 0 Vdd 0 width 1 timing 0 > > > > [ 46.219307] CPU: 6 PID: 153 Comm: kworker/6:1 Tainted: G O > > > 6.6.59 #1 > > > > [ 46.231988] Hardware name: [Redacted] > > > > [ 46.238678] Workqueue: events_freezable mmc_rescan > > > > [ 46.243466] Call trace: > > > > [ 46.245901] dump_backtrace+0x9c/0x128 > > > > [ 46.249643] show_stack+0x20/0x38 > > > > [ 46.252948] dump_stack_lvl+0x48/0x60 > > > > [ 46.256603] dump_stack+0x18/0x28 > > > > [ 46.259909] mmc_alive+0x74/0x88 > > > > [ 46.263128] _mmc_detect_card_removed+0x3c/0x158 > > > > [ 46.267735] mmc_detect+0x30/0x98 > > > > [ 46.271040] mmc_rescan+0x94/0x238 > > > > [ 46.274432] process_one_work+0x178/0x3d8 > > > > [ 46.278432] worker_thread+0x2bc/0x3e0 > > > > [ 46.282171] kthread+0xf8/0x110 > > > > [ 46.285303] ret_from_fork+0x10/0x20 > > > > [ 46.288878] mmc_mrq_pr_debug:274: mmc1: starting CMD13 arg > > > 00010000 flags 00000195 > > > > [ 56.793379] mmc1: Timeout waiting for hardware cmd interrupt. > > > > [ 56.799116] mmc1: sdhci: ============ SDHCI REGISTER DUMP > > > =========== > > > > [ 56.805545] mmc1: sdhci: Sys addr: 0x00000000 | Version: > 0x00002202 > > > > ... > > > > > > > > > > Okay. If this is system suspend, it looks like the workqueue didn't > > > become frozen as exepcted. I don't know why, but this needs to be > investigated. > > > > > > If this is a shutdown, mmc_host_classdev_shutdown() should be called > > > to "disable" the mmc_rescan work from running the code causing the > above. > > > > > > What am I missing? > > > > > > Kind regards > > > Uffe > > > > > > > This occurs in response to a 'reboot' command. > > > > Here is an instance of the failure, with initcall_debug set to 'Y': > > > > Rebooting... > > [ 44.500858] caam_jr 8030000.jr: shutdown > > [ 44.504805] caam_jr 8020000.jr: shutdown > > [ 44.508748] caam_jr 8010000.jr: shutdown > > [ 44.515349] tpm tpm0: shutdown_pre > > [ 44.523165] igb 0000:01:00.0: shutdown > > [ 44.558695] usb usb2-port1: shutdown > > [ 44.562268] usb usb1-port1: shutdown > > [ 44.565864] xhci-hcd xhci-hcd.0.auto: shutdown > > [ 44.570306] dwc3 3100000.usb: shutdown > > [ 44.574052] [Redacted] 2030000.i2c:cpld@41:watchdog: shutdown > > [ 44.580137] tpm_tis_spi spi2.0: shutdown > > [ 44.584054] spi-nor spi0.1: shutdown > > [ 44.588066] spi-nor spi0.0: shutdown > > [ 44.592112] mmcblk mmc0:0001: shutdown > > [ 44.595902] mmc_mrq_pr_debug:274: mmc0: starting CMD7 arg > 00000000 flags 00000000 > > [ 44.603413] sdhci_irq:3546: mmc0: sdhci: IRQ status 0x00000001 > > [ 44.609240] mmc_request_done:187: mmc0: req done (CMD7): 0: > 00000000 00000000 00000000 00000000 > > [ 44.617939] mmc_set_ios:892: mmc0: clock 0Hz busmode 2 powermode 0 > cs 0 Vdd 0 width 1 timing 0 > > [ 44.627877] mmcblk mmc1:0001: shutdown > > [ 44.631709] mmc_mrq_pr_debug:274: mmc1: starting CMD6 arg > 03200101 flags 00000095 > > [ 44.639200] sdhci_irq:3546: mmc1: sdhci: IRQ status 0x00000001 > > [ 44.645027] mmc_request_done:187: mmc1: req done (CMD6): 0: > 00000800 00000000 00000000 00000000 > > [ 44.653727] mmc_mrq_pr_debug:274: mmc1: starting CMD13 arg > 00010000 flags 00000195 > > [ 44.661296] sdhci_irq:3546: mmc1: sdhci: IRQ status 0x00000001 > > [ 44.667123] mmc_request_done:187: mmc1: req done (CMD13): 0: > 00000900 00000000 00000000 00000000 > > [ 44.675907] mmc_mrq_pr_debug:274: mmc1: starting CMD6 arg > 03220301 flags 0000049d > > [ 44.683399] sdhci_irq:3546: mmc1: sdhci: IRQ status 0x00000001 > > [ 44.689223] sdhci_irq:3546: mmc1: sdhci: IRQ status 0x00000002 > > [ 44.695047] mmc_request_done:187: mmc1: req done (CMD6): 0: > 00000800 00000000 00000000 00000000 > > [ 44.703746] mmc_set_ios:892: mmc1: clock 0Hz busmode 2 powermode 0 > cs 0 Vdd 0 width 1 timing 0 > > [ 44.713667] aer 0000:00:00.0:pcie002: shutdown > > [ 44.713673] mmc_mrq_pr_debug:274: mmc1: starting CMD13 arg > 00010000 flags 00000195 > > [ 44.718102] pcie_pme 0000:00:00.0:pcie001: shutdown > > [ 44.730534] pcieport 0000:00:00.0: shutdown > > [ 44.734898] fsl_mc_dprc dprc.1: shutdown > > [ 44.740863] fsl_mc_dprc dprc.1: DPRC device shutdown > > [ 44.745830] mmc_host mmc1: shutdown_pre > > [ 54.873286] mmc1: Timeout waiting for hardware cmd interrupt. > > [ 54.879023] mmc1: sdhci: ============ SDHCI REGISTER DUMP > =========== > > [ 54.885453] mmc1: sdhci: Sys addr: 0x00000000 | Version: 0x00002202 > > > > The trigger CMD13 is in-flight at time 44.713673. > > The mmc_host mmc1: shutdown_pre is at time 44.745830. > > Thanks a lot for providing the details! Yes, this certainly points out the > problem during shutdown/reboot. > > I have sent a patch [1] that I think should address the problem. Can you > please test and verify it? > > [...] > > Kind regards > Uffe > > [1] > https://lore.ke/ > rnel.org%2Fall%2F20241125122446.18684-1- > ulf.hansson%40linaro.org%2F&data=05%7C02%7Canthony.pighin%40nokia.co > m%7C2b25985f3d974df1d15e08dd0d4cf275%7C5d4717519675428d917b70f4 > 4f9630b0%7C0%7C0%7C638681346376209654%7CUnknown%7CTWFpbGZsb > 3d8eyJFbXB0eU1hcGkiOnRydWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkF > OIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C0%7C%7C%7C&sdata=XQ%2B3SRi8%2B > F9NZLEuhmekipj5YrjxVhTnlXIP2RUokmg%3D&reserved=0 I confirm that the patch corrects the behaviour. Thank you! Anthony
--- a/drivers/mmc/core/mmc.c +++ b/drivers/mmc/core/mmc.c @@ -2046,6 +2046,11 @@ static void mmc_remove(struct mmc_host *host) */ static int mmc_alive(struct mmc_host *host) { + if (host->card && mmc_card_suspended(host->card)) { + pr_err("%s: Skip card detection: Card suspended\n", + mmc_hostname(host)); + return -ENOMEDIUM; + } return mmc_send_status(host->card, NULL); }