[15/16] wifi: mt76: mt7915: reset the device after MCU timeout

Message ID 20240816173529.17873-15-nbd@nbd.name
State Superseded
Series [01/16] mt76: mt7603: fix mixed declarations and code

Commit Message

Felix Fietkau Aug. 16, 2024, 5:35 p.m. UTC
On MT7915, MCU hangs do not trigger watchdog interrupts, so they can only
be detected through MCU message timeouts. Ensure that the hardware gets
restarted when that happens in order to prevent a permanent stuck state.

Signed-off-by: Felix Fietkau <nbd@nbd.name>
---
 drivers/net/wireless/mediatek/mt76/mt7915/mcu.c | 7 +++++++
 1 file changed, 7 insertions(+)
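
The recovery path described in the commit message can be illustrated outside the kernel. The following is a minimal user-space sketch, not the driver code: `struct sketch_dev`, `sketch_queue_reset()` and `sketch_parse_response()` are hypothetical stand-ins invented for this illustration, and the real patch additionally wakes the MCU and reset wait queues, which are not modeled here. It only shows the idea that a missing response is treated as a timeout that flags a restart and schedules reset work.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bit index; stands in for MT76_MCU_RESET. */
enum { SKETCH_MCU_RESET = 0 };

/* Hypothetical stand-in for the driver state; not the mt76 structs. */
struct sketch_dev {
	bool recovery_restart;    /* plays the role of dev->recovery.restart */
	unsigned long phy_state;  /* plays the role of dev->mphy.state */
	int reset_work_queued;    /* counts requests for reset work */
};

/* Stand-in for queue_work(): only records that a reset was requested. */
static void sketch_queue_reset(struct sketch_dev *dev)
{
	dev->reset_work_queued++;
}

/*
 * Stand-in for the response parser. A NULL response means the MCU
 * message timed out, so request a full restart before reporting
 * the timeout to the caller.
 */
static int sketch_parse_response(struct sketch_dev *dev, const void *resp)
{
	assert(dev != NULL);

	if (!resp) {
		dev->recovery_restart = true;
		dev->phy_state |= 1UL << SKETCH_MCU_RESET;
		sketch_queue_reset(dev);
		return -ETIMEDOUT;
	}

	return 0;
}
```

In this sketch a normal response leaves the state untouched, while a NULL response returns -ETIMEDOUT with the restart flag and reset bit set and exactly one reset request recorded.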

Comments

Ben Greear Aug. 23, 2024, 6:32 p.m. UTC | #1
On 8/16/24 10:35, Felix Fietkau wrote:
> On MT7915, MCU hangs do not trigger watchdog interrupts, so they can only
> be detected through MCU message timeouts. Ensure that the hardware gets
> restarted when that happens in order to prevent a permanent stuck state.

We applied this to our hacked upon 6.10 kernel, and this patch appears
to cause NPE down in debugfs file removal during radio restart.  We didn't investigate this
closely, but removing this patch fixes the problem.

Also of note, we see the radio have a timeout, but then recover, often
(without this patch).

Did you force/fake this situation to happen and see it actually work?

Thanks,
Ben

Felix Fietkau Aug. 23, 2024, 6:35 p.m. UTC | #2
On 23.08.24 20:32, Ben Greear wrote:
> On 8/16/24 10:35, Felix Fietkau wrote:
>> On MT7915, MCU hangs do not trigger watchdog interrupts, so they can only
>> be detected through MCU message timeouts. Ensure that the hardware gets
>> restarted when that happens in order to prevent a permanent stuck state.
> 
> We applied this to our hacked upon 6.10 kernel, and this patch appears
> to cause NPE down in debugfs file removal during radio restart.  We didn't investigate this
> closely, but removing this patch fixes the problem.
> 
> Also of note, we see the radio have a timeout, but then recover, often
> (without this patch).
> 
> Did you force/fake this situation to happen and see it actually work?

I found some issues in a few patches of this series in the last few days 
and will send v2 soon.

- Felix

Patch

diff --git a/drivers/net/wireless/mediatek/mt76/mt7915/mcu.c b/drivers/net/wireless/mediatek/mt76/mt7915/mcu.c
index 068523561f5e..7c98d9ba9152 100644
--- a/drivers/net/wireless/mediatek/mt76/mt7915/mcu.c
+++ b/drivers/net/wireless/mediatek/mt76/mt7915/mcu.c
@@ -157,12 +157,19 @@  static int
 mt7915_mcu_parse_response(struct mt76_dev *mdev, int cmd,
 			  struct sk_buff *skb, int seq)
 {
+	struct mt7915_dev *dev = container_of(mdev, struct mt7915_dev, mt76);
 	struct mt76_connac2_mcu_rxd *rxd;
 	int ret = 0;
 
 	if (!skb) {
 		dev_err(mdev->dev, "Message %08x (seq %d) timeout\n",
 			cmd, seq);
+		dev->recovery.restart = true;
+		set_bit(MT76_MCU_RESET, &dev->mphy.state);
+		wake_up(&dev->mt76.mcu.wait);
+		queue_work(dev->mt76.wq, &dev->reset_work);
+		wake_up(&dev->reset_wait);
+
 		return -ETIMEDOUT;
 	}