Message ID | 20250620025535.3425049-3-superm1@kernel.org |
---|---|
State | New |
Headers | show |
Series | Don't make noise about disconnected USB4 devices | expand |
On 6/21/2025 11:43 PM, Lukas Wunner wrote: > On Sat, Jun 21, 2025 at 02:56:08PM -0500, Mario Limonciello wrote: >> On 6/21/25 2:05 PM, Lukas Wunner wrote: >>> In the dmesg output attached to... >>> >>> https://bugzilla.kernel.org/show_bug.cgi?id=220216 >>> >>> ... the device exhibiting the refcount underflow is a PCIe port. >>> Are you also seeing this on a PCIe port or is it a different device? >> >> The device with the underflow is the disconnected PCIe bridge. >> >> In my case it was this bridge that was downstream. > > Okay, in both cases the refcount underflow occurs on a PCIe port. > So it seems likely the gratuitous refcount decrement is in portdrv.c > or one of the port services drivers. > > Your patch changes the code path for *all* PCI devices. > Not just PCIe ports. Hence it's likely not the right fix. > > It may fix the issue on this particular PCIe port but > I strongly suspect it'll leak a runtime PM ref on all other devices. > Thanks, I see your point. > >>> So the refcount decrement happens in pcie_portdrv_probe() and >>> the refcount increment happens in pcie_portdrv_remove(). >>> Both times it's conditional on pci_bridge_d3_possible(). >>> Does that return a different value on probe versus remove? > > Could you please answer this? I did this check and yes specifically on this PCIe port with the underflow the d3 possible lookup returns false during pcie_portdrv_remove(). It returns true during pcie_portdrv_probe(). > > >>> Does any of the port service drivers decrement the refcount >>> once too often? I've just looked through pciehp but cannot >>> find anything out of the ordinary. >>> >>> Looking through recent changes, 002bf2fbc00e and bca84a7b93fd >>> look like potential candidates causing a regression, but the >>> former is for AER (which isn't used in the dmesg attached to >>> the bugzilla) and the latter touches suspend on system sleep, >>> not runtime suspend. >>> >>> Can you maybe instrument the pm_runtime_{get,put}*() functions >>> with a printk() and/or dump_stack() to see where a gratuitous >>> refcount decrement occurs? >> >> That's exactly what I did to conclude this call was an extra one. >> >> Here's the drop to 0: > > The drop to 0 is uninteresting. You need to record *all* > refcount increments/decrements so that we can see where the > gratuitous one occurs. It happens earlier than the drop to 0. > > However, please first check whether the pci_bridge_d3_possible() > return value changes on probe versus remove of the PCIe port > in question. If it does, then that's the root cause and there's > no need to look any further. > That was a great hypothesis that's spot on. Just for posterity this was all the increment/decrement calls that I saw happen. pci 0000:02:04.0: inc usage cnt from 0, caller: pci_pm_init+0x84/0x2d0 pci 0000:02:04.0: inc usage cnt from 1, caller: pci_scan_bridge_extend+0x6d/0x710 pci 0000:02:04.0: dec usage cnt from 2, caller: pci_scan_bridge_extend+0x19e/0x710 pci 0000:02:04.0: inc usage cnt from 1, caller: pci_scan_bridge_extend+0x6d/0x710 pci 0000:02:04.0: dec usage cnt from 2, caller: pci_scan_bridge_extend+0x19e/0x710 pcieport 0000:02:04.0: inc usage cnt from 1, caller: local_pci_probe+0x2d/0xa0 pcieport 0000:02:04.0: inc usage cnt from 2, caller: __device_attach+0x9c/0x1b0 pcieport 0000:02:04.0: inc usage cnt from 3, caller: __driver_probe_device+0x5c/0x150 pcieport 0000:02:04.0: dec usage cnt from 4, caller: __driver_probe_device+0x9a/0x150 pcieport 0000:02:04.0: dec usage cnt from 3, caller: __device_attach+0x145/0x1b0 pcieport 0000:02:04.0: dec usage cnt from 2, caller: pcie_portdrv_probe+0x19d/0x6d0 pcieport 0000:02:04.0: dec usage cnt from 1, caller: pcie_portdrv_probe+0x1a5/0x6d0 pcieport 0000:02:04.0: inc usage cnt from 0, caller: device_release_driver_internal+0xac/0x200 pcieport 0000:02:04.0: dec usage cnt from 1, caller: device_release_driver_internal+0x197/0x200 pcieport 0000:02:04.0: inc usage cnt from 0, caller: pci_device_remove+0x2d/0xb0 pcieport 0000:02:04.0: dec usage cnt from 0, caller: pci_device_remove+0x7e/0xb0 pcieport 0000:02:04.0: Runtime PM usage cnt underflow! What's your suggestion on what to actually do here then?
diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c index b78b98133e7df..63f1cb11906ad 100644 --- a/drivers/pci/pci-driver.c +++ b/drivers/pci/pci-driver.c @@ -478,9 +478,6 @@ static void pci_device_remove(struct device *dev) pci_dev->driver = NULL; pci_iov_remove(pci_dev); - /* Undo the runtime PM settings in local_pci_probe() */ - pm_runtime_put_sync(dev); - /* * If the device is still on, set the power state as "unknown", * since it might change by the next time we load the driver.