diff mbox series

[v2,2/2] mwifiex: pcie: add reset_d3cold quirk for Surface gen4+ devices

Message ID 20210709145831.6123-3-verdre@v0yd.nl
State New
Headers show
Series mwifiex: Add quirks for MS Surface devices | expand

Commit Message

Jonas Dreßler July 9, 2021, 2:58 p.m. UTC
From: Tsuchiya Yuto <kitakar@gmail.com>

To reset mwifiex on Surface gen4+ (Pro 4 or later gen) devices, it
seems that putting the wifi device into D3cold is required according
to errata.inf file on Windows installation (Windows/INF/errata.inf).

This patch adds a function that performs power-cycle (put into D3cold
then D0) and call the function at the end of reset_prepare().

Note: Need to also reset the parent device (bridge) of wifi on SB1;
it might be because the bridge of wifi always reports it's in D3hot.
When I tried to reset only the wifi device (not touching parent), it gave
the following error and the reset failed:

    acpi device:4b: Cannot transition to power state D0 for parent in D3hot
    mwifiex_pcie 0000:03:00.0: can't change power state from D3cold to D0 (config space inaccessible)

Signed-off-by: Tsuchiya Yuto <kitakar@gmail.com>
Signed-off-by: Jonas Dreßler <verdre@v0yd.nl>
---
 drivers/net/wireless/marvell/mwifiex/pcie.c   |   7 +
 .../wireless/marvell/mwifiex/pcie_quirks.c    | 123 ++++++++++++++++++
 .../wireless/marvell/mwifiex/pcie_quirks.h    |   3 +
 3 files changed, 133 insertions(+)

Comments

Jonas Dreßler July 9, 2021, 3:33 p.m. UTC | #1
On 7/9/21 5:18 PM, Pali Rohár wrote:
> On Friday 09 July 2021 16:58:31 Jonas Dreßler wrote:
>> From: Tsuchiya Yuto <kitakar@gmail.com>
>>
>> To reset mwifiex on Surface gen4+ (Pro 4 or later gen) devices, it
>> seems that putting the wifi device into D3cold is required according
>> to errata.inf file on Windows installation (Windows/INF/errata.inf).
>>
>> This patch adds a function that performs power-cycle (put into D3cold
>> then D0) and call the function at the end of reset_prepare().
>>
>> Note: Need to also reset the parent device (bridge) of wifi on SB1;
>> it might be because the bridge of wifi always reports it's in D3hot.
>> When I tried to reset only the wifi device (not touching parent), it gave
>> the following error and the reset failed:
>>
>>      acpi device:4b: Cannot transition to power state D0 for parent in D3hot
>>      mwifiex_pcie 0000:03:00.0: can't change power state from D3cold to D0 (config space inaccessible)
>>
>> Signed-off-by: Tsuchiya Yuto <kitakar@gmail.com>
>> Signed-off-by: Jonas Dreßler <verdre@v0yd.nl>
>> ---
>>   drivers/net/wireless/marvell/mwifiex/pcie.c   |   7 +
>>   .../wireless/marvell/mwifiex/pcie_quirks.c    | 123 ++++++++++++++++++
>>   .../wireless/marvell/mwifiex/pcie_quirks.h    |   3 +
>>   3 files changed, 133 insertions(+)
>>
>> diff --git a/drivers/net/wireless/marvell/mwifiex/pcie.c b/drivers/net/wireless/marvell/mwifiex/pcie.c
>> index a530832c9421..c6ccce426b49 100644
>> --- a/drivers/net/wireless/marvell/mwifiex/pcie.c
>> +++ b/drivers/net/wireless/marvell/mwifiex/pcie.c
>> @@ -528,6 +528,13 @@ static void mwifiex_pcie_reset_prepare(struct pci_dev *pdev)
>>   	mwifiex_shutdown_sw(adapter);
>>   	clear_bit(MWIFIEX_IFACE_WORK_DEVICE_DUMP, &card->work_flags);
>>   	clear_bit(MWIFIEX_IFACE_WORK_CARD_RESET, &card->work_flags);
>> +
>> +	/* On MS Surface gen4+ devices FLR isn't effective to recover from
>> +	 * hangups, so we power-cycle the card instead.
>> +	 */
>> +	if (card->quirks & QUIRK_FW_RST_D3COLD)
>> +		mwifiex_pcie_reset_d3cold_quirk(pdev);
>> +
> 
> Hello! Now I'm thinking loudly about this patch. Why this kind of reset
> is needed only for Surface devices? AFAIK these 88W8897 chips are same
> in all cards. Chip itself implements PCIe interface (and also SDIO) so
> for me looks very strange if this 88W8897 PCIe device needs DMI specific
> quirks. I cannot believe that Microsoft got some special version of
> these chips from Marvell which are different than version uses on cards
> in mPCIe form factor.
> 
> And now when I'm reading comment below about PCIe bridge to which is
> this 88W8897 PCIe chip connected, is not this rather an issue in that
> PCIe bridge (instead of mwifiex/88W8897) or in ACPI firmware which
> controls this bridge?
> 
> Or are having other people same issues on mPCIe form factor wifi cards
> with 88W8897 chips and then this quirk should not DMI dependent?
> 
> Note that I'm seeing issues with reset and other things also on chip
> 88W8997 when is connected to system via SDIO. These chips have both PCIe
> and SDIO buses, it just depends which pins are used.
> 

Hi and thanks for the quick reply! Honestly I've no idea, this is just 
the first method we found that allows for a proper reset of the chip. 
What I know is that some Surface devices need that ACPI DSM call (the 
one that was done in the commit I dropped in this version of the 
patchset) to reset the chip instead of this method.

Afaik other devices with this chip don't need this resetting method, at 
least Marvell employees couldn't reproduce the issues on their testing 
devices.

So would you suggest we just try to match for the pci chip 88W8897 
instead? Then we'd probably have to check if there are any laptops where 
multiple devices are connected to the pci bridge as Amey suggested in a 
review before.

>>   	mwifiex_dbg(adapter, INFO, "%s, successful\n", __func__);
>>   
>>   	card->pci_reset_ongoing = true;
>> diff --git a/drivers/net/wireless/marvell/mwifiex/pcie_quirks.c b/drivers/net/wireless/marvell/mwifiex/pcie_quirks.c
>> index 4064f99b36ba..b5f214fc1212 100644
>> --- a/drivers/net/wireless/marvell/mwifiex/pcie_quirks.c
>> +++ b/drivers/net/wireless/marvell/mwifiex/pcie_quirks.c
>> @@ -15,6 +15,72 @@
>>   
>>   /* quirk table based on DMI matching */
>>   static const struct dmi_system_id mwifiex_quirk_table[] = {
>> +	{
>> +		.ident = "Surface Pro 4",
>> +		.matches = {
>> +			DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"),
>> +			DMI_EXACT_MATCH(DMI_PRODUCT_NAME, "Surface Pro 4"),
>> +		},
>> +		.driver_data = (void *)QUIRK_FW_RST_D3COLD,
>> +	},
>> +	{
>> +		.ident = "Surface Pro 5",
>> +		.matches = {
>> +			/* match for SKU here due to generic product name "Surface Pro" */
>> +			DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"),
>> +			DMI_EXACT_MATCH(DMI_PRODUCT_SKU, "Surface_Pro_1796"),
>> +		},
>> +		.driver_data = (void *)QUIRK_FW_RST_D3COLD,
>> +	},
>> +	{
>> +		.ident = "Surface Pro 5 (LTE)",
>> +		.matches = {
>> +			/* match for SKU here due to generic product name "Surface Pro" */
>> +			DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"),
>> +			DMI_EXACT_MATCH(DMI_PRODUCT_SKU, "Surface_Pro_1807"),
>> +		},
>> +		.driver_data = (void *)QUIRK_FW_RST_D3COLD,
>> +	},
>> +	{
>> +		.ident = "Surface Pro 6",
>> +		.matches = {
>> +			DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"),
>> +			DMI_EXACT_MATCH(DMI_PRODUCT_NAME, "Surface Pro 6"),
>> +		},
>> +		.driver_data = (void *)QUIRK_FW_RST_D3COLD,
>> +	},
>> +	{
>> +		.ident = "Surface Book 1",
>> +		.matches = {
>> +			DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"),
>> +			DMI_EXACT_MATCH(DMI_PRODUCT_NAME, "Surface Book"),
>> +		},
>> +		.driver_data = (void *)QUIRK_FW_RST_D3COLD,
>> +	},
>> +	{
>> +		.ident = "Surface Book 2",
>> +		.matches = {
>> +			DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"),
>> +			DMI_EXACT_MATCH(DMI_PRODUCT_NAME, "Surface Book 2"),
>> +		},
>> +		.driver_data = (void *)QUIRK_FW_RST_D3COLD,
>> +	},
>> +	{
>> +		.ident = "Surface Laptop 1",
>> +		.matches = {
>> +			DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"),
>> +			DMI_EXACT_MATCH(DMI_PRODUCT_NAME, "Surface Laptop"),
>> +		},
>> +		.driver_data = (void *)QUIRK_FW_RST_D3COLD,
>> +	},
>> +	{
>> +		.ident = "Surface Laptop 2",
>> +		.matches = {
>> +			DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"),
>> +			DMI_EXACT_MATCH(DMI_PRODUCT_NAME, "Surface Laptop 2"),
>> +		},
>> +		.driver_data = (void *)QUIRK_FW_RST_D3COLD,
>> +	},
>>   	{}
>>   };
>>   
>> @@ -29,4 +95,61 @@ void mwifiex_initialize_quirks(struct pcie_service_card *card)
>>   
>>   	if (!card->quirks)
>>   		dev_info(&pdev->dev, "no quirks enabled\n");
>> +	if (card->quirks & QUIRK_FW_RST_D3COLD)
>> +		dev_info(&pdev->dev, "quirk reset_d3cold enabled\n");
>> +}
>> +
>> +static void mwifiex_pcie_set_power_d3cold(struct pci_dev *pdev)
>> +{
>> +	dev_info(&pdev->dev, "putting into D3cold...\n");
>> +
>> +	pci_save_state(pdev);
>> +	if (pci_is_enabled(pdev))
>> +		pci_disable_device(pdev);
>> +	pci_set_power_state(pdev, PCI_D3cold);
>> +}
>> +
>> +static int mwifiex_pcie_set_power_d0(struct pci_dev *pdev)
>> +{
>> +	int ret;
>> +
>> +	dev_info(&pdev->dev, "putting into D0...\n");
>> +
>> +	pci_set_power_state(pdev, PCI_D0);
>> +	ret = pci_enable_device(pdev);
>> +	if (ret) {
>> +		dev_err(&pdev->dev, "pci_enable_device failed\n");
>> +		return ret;
>> +	}
>> +	pci_restore_state(pdev);
>> +
>> +	return 0;
>> +}
>> +
>> +int mwifiex_pcie_reset_d3cold_quirk(struct pci_dev *pdev)
>> +{
>> +	struct pci_dev *parent_pdev = pci_upstream_bridge(pdev);
>> +	int ret;
>> +
>> +	/* Power-cycle (put into D3cold then D0) */
>> +	dev_info(&pdev->dev, "Using reset_d3cold quirk to perform FW reset\n");
>> +
>> +	/* We need to perform power-cycle also for bridge of wifi because
>> +	 * on some devices (e.g. Surface Book 1), the OS for some reasons
>> +	 * can't know the real power state of the bridge.
>> +	 * When tried to power-cycle only wifi, the reset failed with the
>> +	 * following dmesg log:
>> +	 * "Cannot transition to power state D0 for parent in D3hot".
>> +	 */
>> +	mwifiex_pcie_set_power_d3cold(pdev);
>> +	mwifiex_pcie_set_power_d3cold(parent_pdev);
>> +
>> +	ret = mwifiex_pcie_set_power_d0(parent_pdev);
>> +	if (ret)
>> +		return ret;
>> +	ret = mwifiex_pcie_set_power_d0(pdev);
>> +	if (ret)
>> +		return ret;
>> +
>> +	return 0;
>>   }
>> diff --git a/drivers/net/wireless/marvell/mwifiex/pcie_quirks.h b/drivers/net/wireless/marvell/mwifiex/pcie_quirks.h
>> index 7a1fe3b3a61a..549093067813 100644
>> --- a/drivers/net/wireless/marvell/mwifiex/pcie_quirks.h
>> +++ b/drivers/net/wireless/marvell/mwifiex/pcie_quirks.h
>> @@ -5,4 +5,7 @@
>>   
>>   #include "pcie.h"
>>   
>> +#define QUIRK_FW_RST_D3COLD	BIT(0)
>> +
>>   void mwifiex_initialize_quirks(struct pcie_service_card *card);
>> +int mwifiex_pcie_reset_d3cold_quirk(struct pci_dev *pdev);
>> -- 
>> 2.31.1
>>
Pali Rohár July 9, 2021, 4:12 p.m. UTC | #2
On Friday 09 July 2021 17:33:34 Jonas Dreßler wrote:
> On 7/9/21 5:18 PM, Pali Rohár wrote:
> > On Friday 09 July 2021 16:58:31 Jonas Dreßler wrote:
> > > From: Tsuchiya Yuto <kitakar@gmail.com>
> > > 
> > > To reset mwifiex on Surface gen4+ (Pro 4 or later gen) devices, it
> > > seems that putting the wifi device into D3cold is required according
> > > to errata.inf file on Windows installation (Windows/INF/errata.inf).
> > > 
> > > This patch adds a function that performs power-cycle (put into D3cold
> > > then D0) and call the function at the end of reset_prepare().
> > > 
> > > Note: Need to also reset the parent device (bridge) of wifi on SB1;
> > > it might be because the bridge of wifi always reports it's in D3hot.
> > > When I tried to reset only the wifi device (not touching parent), it gave
> > > the following error and the reset failed:
> > > 
> > >      acpi device:4b: Cannot transition to power state D0 for parent in D3hot
> > >      mwifiex_pcie 0000:03:00.0: can't change power state from D3cold to D0 (config space inaccessible)
> > > 
> > > Signed-off-by: Tsuchiya Yuto <kitakar@gmail.com>
> > > Signed-off-by: Jonas Dreßler <verdre@v0yd.nl>
> > > ---
> > >   drivers/net/wireless/marvell/mwifiex/pcie.c   |   7 +
> > >   .../wireless/marvell/mwifiex/pcie_quirks.c    | 123 ++++++++++++++++++
> > >   .../wireless/marvell/mwifiex/pcie_quirks.h    |   3 +
> > >   3 files changed, 133 insertions(+)
> > > 
> > > diff --git a/drivers/net/wireless/marvell/mwifiex/pcie.c b/drivers/net/wireless/marvell/mwifiex/pcie.c
> > > index a530832c9421..c6ccce426b49 100644
> > > --- a/drivers/net/wireless/marvell/mwifiex/pcie.c
> > > +++ b/drivers/net/wireless/marvell/mwifiex/pcie.c
> > > @@ -528,6 +528,13 @@ static void mwifiex_pcie_reset_prepare(struct pci_dev *pdev)
> > >   	mwifiex_shutdown_sw(adapter);
> > >   	clear_bit(MWIFIEX_IFACE_WORK_DEVICE_DUMP, &card->work_flags);
> > >   	clear_bit(MWIFIEX_IFACE_WORK_CARD_RESET, &card->work_flags);
> > > +
> > > +	/* On MS Surface gen4+ devices FLR isn't effective to recover from
> > > +	 * hangups, so we power-cycle the card instead.
> > > +	 */
> > > +	if (card->quirks & QUIRK_FW_RST_D3COLD)
> > > +		mwifiex_pcie_reset_d3cold_quirk(pdev);
> > > +
> > 
> > Hello! Now I'm thinking loudly about this patch. Why this kind of reset
> > is needed only for Surface devices? AFAIK these 88W8897 chips are same
> > in all cards. Chip itself implements PCIe interface (and also SDIO) so
> > for me looks very strange if this 88W8897 PCIe device needs DMI specific
> > quirks. I cannot believe that Microsoft got some special version of
> > these chips from Marvell which are different than version uses on cards
> > in mPCIe form factor.
> > 
> > And now when I'm reading comment below about PCIe bridge to which is
> > this 88W8897 PCIe chip connected, is not this rather an issue in that
> > PCIe bridge (instead of mwifiex/88W8897) or in ACPI firmware which
> > controls this bridge?
> > 
> > Or are having other people same issues on mPCIe form factor wifi cards
> > with 88W8897 chips and then this quirk should not DMI dependent?
> > 
> > Note that I'm seeing issues with reset and other things also on chip
> > 88W8997 when is connected to system via SDIO. These chips have both PCIe
> > and SDIO buses, it just depends which pins are used.
> > 
> 
> Hi and thanks for the quick reply! Honestly I've no idea, this is just the
> first method we found that allows for a proper reset of the chip. What I
> know is that some Surface devices need that ACPI DSM call (the one that was
> done in the commit I dropped in this version of the patchset) to reset the
> chip instead of this method.
> 
> Afaik other devices with this chip don't need this resetting method, at
> least Marvell employees couldn't reproduce the issues on their testing
> devices.
> 
> So would you suggest we just try to match for the pci chip 88W8897 instead?

Hello! Such suggestion makes sense when we know that it is 88W8897
issue. But if you got information that issue cannot be reproduced on
other 88W8897 cards then matching 88W8897 is not correct.
Maximilian Luz July 9, 2021, 5:03 p.m. UTC | #3
On 7/9/21 6:12 PM, Pali Rohár wrote:

[...]

>>> Hello! Now I'm thinking loudly about this patch. Why this kind of reset
>>> is needed only for Surface devices? AFAIK these 88W8897 chips are same
>>> in all cards. Chip itself implements PCIe interface (and also SDIO) so
>>> for me looks very strange if this 88W8897 PCIe device needs DMI specific
>>> quirks. I cannot believe that Microsoft got some special version of
>>> these chips from Marvell which are different than version uses on cards
>>> in mPCIe form factor.
>>>
>>> And now when I'm reading comment below about PCIe bridge to which is
>>> this 88W8897 PCIe chip connected, is not this rather an issue in that
>>> PCIe bridge (instead of mwifiex/88W8897) or in ACPI firmware which
>>> controls this bridge?
>>>
>>> Or are having other people same issues on mPCIe form factor wifi cards
>>> with 88W8897 chips and then this quirk should not DMI dependent?
>>>
>>> Note that I'm seeing issues with reset and other things also on chip
>>> 88W8997 when is connected to system via SDIO. These chips have both PCIe
>>> and SDIO buses, it just depends which pins are used.
>>>
>>
>> Hi and thanks for the quick reply! Honestly I've no idea, this is just the
>> first method we found that allows for a proper reset of the chip. What I
>> know is that some Surface devices need that ACPI DSM call (the one that was
>> done in the commit I dropped in this version of the patchset) to reset the
>> chip instead of this method.
>>
>> Afaik other devices with this chip don't need this resetting method, at
>> least Marvell employees couldn't reproduce the issues on their testing
>> devices.
>>
>> So would you suggest we just try to match for the pci chip 88W8897 instead?
> 
> Hello! Such suggestion makes sense when we know that it is 88W8897
> issue. But if you got information that issue cannot be reproduced on
> other 88W8897 cards then matching 88W8897 is not correct.
> 
>  From all this information looks like that it is problem in (Microsoft?)
> PCIe bridge to which is card connected. Otherwise I do not reason how it
> can be 88W8897 affected. Either it is reproducible on 88W8897 cards also
> in other devices or issue is not on 88W8897 card.

I doubt that it's an issue with the PCIe bridge (itself at least). The
same type of bridge is used for both dGPU and NVME SSD on my device (see
lspci output below) and those work fine. Also if I'm seeing that right
it's from the Intel CPU, so my guess is that a lot more people would
have issues with that then.

I don't know about the hardware side, so it might be possible that it's
an issue with integrating both bridge and wifi chip, in which case it's
still probably best handled via DMI quirks unless we know more.

Also as Tsuchiya mentioned in his original submission, on Windows the
device is reset via this D3cold method. I've only skimmed that
errata.inf file mentioned, but I think this is what he's referring to:

   Controls whether ACPIDeviceEnableD3ColdOnSurpriseRemoval rule will be
   evaluated or not on a given platform. Currently
   ACPIDeviceEnableD3ColdOnSurpriseRemoval rule only needs to be
   evaluated on Surface platforms which contain the Marvell WiFi
   controller which depends on device going through D3Cold as part of
   surprise-removal.

and

   Starting with Windows releases *after* Blue, ACPI will not put
   surprise-removed devices into D3Cold automatically. Some known
   scenarios (viz. WiFi reset/recovery) rely on the device cycling
   through D3Cold on surprise-removal. This hack allows surprise-removed
   devices to be put into D3Cold (if supported by the stack).

So, as far as I can tell, the chip doesn't like to be surprise-removed
(which seems to happen during reset) and then needs to be power-cycled,
which I think is likely due to some issue with firmware state.

So the quirk on Windows seems very Surface specific.

There also seem a bunch of revisions of these chips around, for example
my SB2 is affected by a bug that we've tied to the specific hardware
revision which causes some issues with host-sleep (IIRC chip switches
rapidly between wake and sleep states without any external influence,
which is not how it should behave and how it does behave on a later
hardware revision).

>> Then we'd probably have to check if there are any laptops where multiple
>> devices are connected to the pci bridge as Amey suggested in a review
>> before.
> 
> Well, I do not know... But if this is issue with PCIe bridge then
> similar issue could be observed also for other PCIe devices with this
> PCIe bridge. But question is if there are other laptops with this PCIe
> bridge. And also it can be a problem in ACPI firmware on those Surface
> devices, which implements some PCIe bridge functionality. So it is
> possible that issue is with PCIe bridge, not in HW, but in SW/firmware
> part which can be Microsoft specific... So too many questions to which
> we do not know answers.
> 
> Could you provide output of 'lspci -nn -vv' and 'lspci -tvnn' on
> affected machines? If you have already sent it in some previous email,
> just send a link. At least I'm not able to find it right now and output
> may contain something useful...

 From my Surface Book 2 (with the same issue):

  - lspci -tvnn: https://paste.ubuntu.com/p/mm3YpcZJ8N/
  - lspci -vv -nn: https://paste.ubuntu.com/p/dctTDP738N/

Regards,
Max
Pali Rohár July 9, 2021, 5:30 p.m. UTC | #4
On Friday 09 July 2021 19:03:37 Maximilian Luz wrote:
> On 7/9/21 6:12 PM, Pali Rohár wrote:
> 
> [...]
> 
> > > > Hello! Now I'm thinking loudly about this patch. Why this kind of reset
> > > > is needed only for Surface devices? AFAIK these 88W8897 chips are same
> > > > in all cards. Chip itself implements PCIe interface (and also SDIO) so
> > > > for me looks very strange if this 88W8897 PCIe device needs DMI specific
> > > > quirks. I cannot believe that Microsoft got some special version of
> > > > these chips from Marvell which are different than version uses on cards
> > > > in mPCIe form factor.
> > > > 
> > > > And now when I'm reading comment below about PCIe bridge to which is
> > > > this 88W8897 PCIe chip connected, is not this rather an issue in that
> > > > PCIe bridge (instead of mwifiex/88W8897) or in ACPI firmware which
> > > > controls this bridge?
> > > > 
> > > > Or are having other people same issues on mPCIe form factor wifi cards
> > > > with 88W8897 chips and then this quirk should not DMI dependent?
> > > > 
> > > > Note that I'm seeing issues with reset and other things also on chip
> > > > 88W8997 when is connected to system via SDIO. These chips have both PCIe
> > > > and SDIO buses, it just depends which pins are used.
> > > > 
> > > 
> > > Hi and thanks for the quick reply! Honestly I've no idea, this is just the
> > > first method we found that allows for a proper reset of the chip. What I
> > > know is that some Surface devices need that ACPI DSM call (the one that was
> > > done in the commit I dropped in this version of the patchset) to reset the
> > > chip instead of this method.
> > > 
> > > Afaik other devices with this chip don't need this resetting method, at
> > > least Marvell employees couldn't reproduce the issues on their testing
> > > devices.
> > > 
> > > So would you suggest we just try to match for the pci chip 88W8897 instead?
> > 
> > Hello! Such suggestion makes sense when we know that it is 88W8897
> > issue. But if you got information that issue cannot be reproduced on
> > other 88W8897 cards then matching 88W8897 is not correct.
> > 
> >  From all this information looks like that it is problem in (Microsoft?)
> > PCIe bridge to which is card connected. Otherwise I do not reason how it
> > can be 88W8897 affected. Either it is reproducible on 88W8897 cards also
> > in other devices or issue is not on 88W8897 card.
> 
> I doubt that it's an issue with the PCIe bridge (itself at least). The
> same type of bridge is used for both dGPU and NVME SSD on my device (see
> lspci output below) and those work fine. Also if I'm seeing that right
> it's from the Intel CPU, so my guess is that a lot more people would
> have issues with that then.
Pali Rohár July 9, 2021, 6:44 p.m. UTC | #5
On Friday 09 July 2021 20:16:49 Maximilian Luz wrote:
> On 7/9/21 7:30 PM, Pali Rohár wrote:
> > On Friday 09 July 2021 19:03:37 Maximilian Luz wrote:
> > > On 7/9/21 6:12 PM, Pali Rohár wrote:
> > > 
> > > [...]
> > > 
> > > > > > Hello! Now I'm thinking loudly about this patch. Why this kind of reset
> > > > > > is needed only for Surface devices? AFAIK these 88W8897 chips are same
> > > > > > in all cards. Chip itself implements PCIe interface (and also SDIO) so
> > > > > > for me looks very strange if this 88W8897 PCIe device needs DMI specific
> > > > > > quirks. I cannot believe that Microsoft got some special version of
> > > > > > these chips from Marvell which are different than version uses on cards
> > > > > > in mPCIe form factor.
> > > > > > 
> > > > > > And now when I'm reading comment below about PCIe bridge to which is
> > > > > > this 88W8897 PCIe chip connected, is not this rather an issue in that
> > > > > > PCIe bridge (instead of mwifiex/88W8897) or in ACPI firmware which
> > > > > > controls this bridge?
> > > > > > 
> > > > > > Or are having other people same issues on mPCIe form factor wifi cards
> > > > > > with 88W8897 chips and then this quirk should not DMI dependent?
> > > > > > 
> > > > > > Note that I'm seeing issues with reset and other things also on chip
> > > > > > 88W8997 when is connected to system via SDIO. These chips have both PCIe
> > > > > > and SDIO buses, it just depends which pins are used.
> > > > > > 
> > > > > 
> > > > > Hi and thanks for the quick reply! Honestly I've no idea, this is just the
> > > > > first method we found that allows for a proper reset of the chip. What I
> > > > > know is that some Surface devices need that ACPI DSM call (the one that was
> > > > > done in the commit I dropped in this version of the patchset) to reset the
> > > > > chip instead of this method.
> > > > > 
> > > > > Afaik other devices with this chip don't need this resetting method, at
> > > > > least Marvell employees couldn't reproduce the issues on their testing
> > > > > devices.
> > > > > 
> > > > > So would you suggest we just try to match for the pci chip 88W8897 instead?
> > > > 
> > > > Hello! Such suggestion makes sense when we know that it is 88W8897
> > > > issue. But if you got information that issue cannot be reproduced on
> > > > other 88W8897 cards then matching 88W8897 is not correct.
> > > > 
> > > >   From all this information looks like that it is problem in (Microsoft?)
> > > > PCIe bridge to which is card connected. Otherwise I do not reason how it
> > > > can be 88W8897 affected. Either it is reproducible on 88W8897 cards also
> > > > in other devices or issue is not on 88W8897 card.
> > > 
> > > I doubt that it's an issue with the PCIe bridge (itself at least). The
> > > same type of bridge is used for both dGPU and NVME SSD on my device (see
> > > lspci output below) and those work fine. Also if I'm seeing that right
> > > it's from the Intel CPU, so my guess is that a lot more people would
> > > have issues with that then.
> > 
> >  From information below it seems to be related to surprise removal.
> > Therefore is surprise removal working without issue for dGPU or NVME
> > SSD? Not all PCIe bridges support surprise removal...
> 
> The dGPU on the Surface Book 2 is detachable (the whole base where that
> is placed can be removed). As far as I can tell surprise removal works
> perfectly fine for that one. The only thing that it needs is a driver for
> out-of-band hot-plug signalling if the device is in D3cold while removed
> as hotplug/removal notifications via PCI don't work in D3cold (this
> works via ACPI, there is as far as I can tell no such mechanism for
> WiFi, probably since it's not intended to be hot-unplugged).

Ok. Thank you for confirmation.

> > > I don't know about the hardware side, so it might be possible that it's
> > > an issue with integrating both bridge and wifi chip, in which case it's
> > > still probably best handled via DMI quirks unless we know more.
> > > 
> > > Also as Tsuchiya mentioned in his original submission, on Windows the
> > > device is reset via this D3cold method. I've only skimmed that
> > > errata.inf file mentioned, but I think this is what he's referring to:
> > > 
> > >    Controls whether ACPIDeviceEnableD3ColdOnSurpriseRemoval rule will be
> > >    evaluated or not on a given platform. Currently
> > >    ACPIDeviceEnableD3ColdOnSurpriseRemoval rule only needs to be
> > >    evaluated on Surface platforms which contain the Marvell WiFi
> > >    controller which depends on device going through D3Cold as part of
> > >    surprise-removal.
> > > 
> > > and
> > > 
> > >    Starting with Windows releases *after* Blue, ACPI will not put
> > >    surprise-removed devices into D3Cold automatically. Some known
> > >    scenarios (viz. WiFi reset/recovery) rely on the device cycling
> > >    through D3Cold on surprise-removal. This hack allows surprise-removed
> > >    devices to be put into D3Cold (if supported by the stack).
> > > 
> > > So, as far as I can tell, the chip doesn't like to be surprise-removed
> > > (which seems to happen during reset) and then needs to be power-cycled,
> > > which I think is likely due to some issue with firmware state.
> > 
> > Thanks for information. This really does not look like PCIe bridge
> > specific if bridge itself can handle surprise-removed devices. lspci can
> > tell us if bridge supports it or not (see below).
> > 
> > > So the quirk on Windows seems very Surface specific.
> > > 
> > > There also seem a bunch of revisions of these chips around, for example
> > > my SB2 is affected by a bug that we've tied to the specific hardware
> > > revision which causes some issues with host-sleep (IIRC chip switches
> > > rapidly between wake and sleep states without any external influence,
> > > which is not how it should behave and how it does behave on a later
> > > hardware revision).
> > 
> > Interesting... This looks like the issue can be in 88W8897 chip and
> > needs some special conditions to trigger? And Surface is triggering it
> > always?
> 
> Not always. It's been a while since I've been actively looking at this
> and I'm not sure we ever had a good way to reproduce this. Also, I've
> never really dealt with it as in-depth as Tsuchiya and Jonas have.
> 
> My (very) quick attempt ('echo 1 > /sys/bus/pci/.../reset) at
> reproducing this didn't work, so I think at very least a network
> connection needs to be active.

This is doing PCIe function level reset. Maybe you can get more luck
with PCIe Hot Reset. See following link how to trigger PCIe Hot Reset
from userspace: https://alexforencich.com/wiki/en/pcie/hot-reset-linux

> Unfortunately I can't test that with a
> network connection (and without compiling a custom kernel for which I
> don't have the time right now) because there's currently another bug
> deadlocking on device removal if there's an active connection during
> removal (which also seems to trigger on reset). That one ill be fixed
> by
> 
>   https://lore.kernel.org/linux-wireless/20210515024227.2159311-1-briannorris@chromium.org/
> 
> Jonas might know more.
> 
> > > > > Then we'd probably have to check if there are any laptops where multiple
> > > > > devices are connected to the pci bridge as Amey suggested in a review
> > > > > before.
> > > > 
> > > > Well, I do not know... But if this is issue with PCIe bridge then
> > > > similar issue could be observed also for other PCIe devices with this
> > > > PCIe bridge. But question is if there are other laptops with this PCIe
> > > > bridge. And also it can be a problem in ACPI firmware on those Surface
> > > > devices, which implements some PCIe bridge functionality. So it is
> > > > possible that issue is with PCIe bridge, not in HW, but in SW/firmware
> > > > part which can be Microsoft specific... So too many questions to which
> > > > we do not know answers.
> > > > 
> > > > Could you provide output of 'lspci -nn -vv' and 'lspci -tvnn' on
> > > > affected machines? If you have already sent it in some previous email,
> > > > just send a link. At least I'm not able to find it right now and output
> > > > may contain something useful...
> > > 
> > >  From my Surface Book 2 (with the same issue):
> > > 
> > >   - lspci -tvnn: https://paste.ubuntu.com/p/mm3YpcZJ8N/
> > >   - lspci -vv -nn: https://paste.ubuntu.com/p/dctTDP738N/
> > 
> > Could you re-run lspci under root account? There are missing important
> > parts like "Capabilities: <access denied>" where is information if
> > bridge supports surprise removal or not.
> 
> Ah sorry, sure thing. Here's the updated lspci -nn -vv log:
> 
>   https://paste.ubuntu.com/p/fzsmCvm86Y/
> 
> The log for lspci -tvnn is the same.

Ok. So bridge for wifi card (00:1c.0) indicates:

    SltCap: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surprise-
            Slot #0, PowerLimit 10.000W; Interlock- NoCompl+

No support for surprise removal, nor for hotplug interrupt.

But bridge for nvidia card (00:1c.4) indicates:

    SltCap: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug+ Surprise+
            Slot #4, PowerLimit 25.000W; Interlock- NoCompl+

And interesting, it supports hotplug interrupt and also surprise
removal. Which matches what you wrote above about dGPU.

So another idea: maybe problem is really in 88W8897 and recovering is
working only via bridge which supports surprise removal? Just guessing.
Or kernel PCIe hotplug driver is doing something which is needed for
recovering 88W8897 and because this bridge does not support surprise
removal, it behaves differently?

> 
> > > Regards,
> > > Max
Maximilian Luz July 9, 2021, 7:27 p.m. UTC | #6
On 7/9/21 8:44 PM, Pali Rohár wrote:

[...]

>> My (very) quick attempt ('echo 1 > /sys/bus/pci/.../reset) at
>> reproducing this didn't work, so I think at very least a network
>> connection needs to be active.
> 
> This is doing PCIe function level reset. Maybe you can get more luck
> with PCIe Hot Reset. See following link how to trigger PCIe Hot Reset
> from userspace: https://alexforencich.com/wiki/en/pcie/hot-reset-linux

Thanks for that link! That does indeed do something which breaks the
adapter. Running the script produces

   [  178.388414] mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed
   [  178.389128] mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed
   [  178.461365] mwifiex_pcie 0000:01:00.0: performing cancel_work_sync()...
   [  178.461373] mwifiex_pcie 0000:01:00.0: cancel_work_sync() done
   [  178.984106] pci 0000:01:00.0: [11ab:2b38] type 00 class 0x020000
   [  178.984161] pci 0000:01:00.0: reg 0x10: [mem 0x00000000-0x000fffff 64bit pref]
   [  178.984193] pci 0000:01:00.0: reg 0x18: [mem 0x00000000-0x000fffff 64bit pref]
   [  178.984430] pci 0000:01:00.0: supports D1 D2
   [  178.984434] pci 0000:01:00.0: PME# supported from D0 D1 D3hot D3cold
   [  178.984871] pcieport 0000:00:1c.0: ASPM: current common clock configuration is inconsistent, reconfiguring
   [  179.297919] pci 0000:01:00.0: BAR 0: assigned [mem 0xd4400000-0xd44fffff 64bit pref]
   [  179.297961] pci 0000:01:00.0: BAR 2: assigned [mem 0xd4500000-0xd45fffff 64bit pref]
   [  179.298316] mwifiex_pcie 0000:01:00.0: enabling device (0000 -> 0002)
   [  179.298752] mwifiex_pcie: PCI memory map Virt0: 00000000c4593df1 PCI memory map Virt2: 0000000039d67daf
   [  179.300522] mwifiex_pcie 0000:01:00.0: WLAN read winner status failed!
   [  179.300552] mwifiex_pcie 0000:01:00.0: info: _mwifiex_fw_dpc: unregister device
   [  179.300622] mwifiex_pcie 0000:01:00.0: Read register failed
   [  179.300912] mwifiex_pcie 0000:01:00.0: performing cancel_work_sync()...
   [  179.300928] mwifiex_pcie 0000:01:00.0: cancel_work_sync() done

after which the card is unusable (there is no WiFi interface availabel
any more). Reloading the driver module doesn't help and produces

   [  376.906833] mwifiex_pcie: PCI memory map Virt0: 0000000025149d28 PCI memory map Virt2: 00000000c4593df1
   [  376.907278] mwifiex_pcie 0000:01:00.0: WLAN read winner status failed!
   [  376.907281] mwifiex_pcie 0000:01:00.0: info: _mwifiex_fw_dpc: unregister device
   [  376.907293] mwifiex_pcie 0000:01:00.0: Read register failed
   [  376.907404] mwifiex_pcie 0000:01:00.0: performing cancel_work_sync()...
   [  376.907406] mwifiex_pcie 0000:01:00.0: cancel_work_sync() done

again. Performing a function level reset produces

   [  402.489572] mwifiex_pcie 0000:01:00.0: mwifiex_pcie_reset_prepare: adapter structure is not valid
   [  403.514219] mwifiex_pcie 0000:01:00.0: mwifiex_pcie_reset_done: adapter structure is not valid

and doesn't help either.

Running the same command on a kernel with (among other) this patch
unfortunately also breaks the adapter in the same way. As far as I can
tell though, it doesn't run through the reset code added by this patch
(as indicated by the log message when performing FLR), which I assume
in a non-forced scenario, e.g. firmware issues (which IIRC is why this
patch exists), it would?

>> Unfortunately I can't test that with a
>> network connection (and without compiling a custom kernel for which I
>> don't have the time right now) because there's currently another bug
>> deadlocking on device removal if there's an active connection during
>> removal (which also seems to trigger on reset). That one ill be fixed
>> by
>>
>>    https://lore.kernel.org/linux-wireless/20210515024227.2159311-1-briannorris@chromium.org/
>>
>> Jonas might know more.

[...]
Pali Rohár July 9, 2021, 7:44 p.m. UTC | #7
On Friday 09 July 2021 21:27:51 Maximilian Luz wrote:
> On 7/9/21 8:44 PM, Pali Rohár wrote:
> 
> [...]
> 
> > > My (very) quick attempt ('echo 1 > /sys/bus/pci/.../reset) at
> > > reproducing this didn't work, so I think at very least a network
> > > connection needs to be active.
> > 
> > This is doing PCIe function level reset. Maybe you can get more luck
> > with PCIe Hot Reset. See following link how to trigger PCIe Hot Reset
> > from userspace: https://alexforencich.com/wiki/en/pcie/hot-reset-linux
> 
> Thanks for that link! That does indeed do something which breaks the
> adapter. Running the script produces
> 
>   [  178.388414] mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed
>   [  178.389128] mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed
>   [  178.461365] mwifiex_pcie 0000:01:00.0: performing cancel_work_sync()...
>   [  178.461373] mwifiex_pcie 0000:01:00.0: cancel_work_sync() done
>   [  178.984106] pci 0000:01:00.0: [11ab:2b38] type 00 class 0x020000
>   [  178.984161] pci 0000:01:00.0: reg 0x10: [mem 0x00000000-0x000fffff 64bit pref]
>   [  178.984193] pci 0000:01:00.0: reg 0x18: [mem 0x00000000-0x000fffff 64bit pref]
>   [  178.984430] pci 0000:01:00.0: supports D1 D2
>   [  178.984434] pci 0000:01:00.0: PME# supported from D0 D1 D3hot D3cold
>   [  178.984871] pcieport 0000:00:1c.0: ASPM: current common clock configuration is inconsistent, reconfiguring
>   [  179.297919] pci 0000:01:00.0: BAR 0: assigned [mem 0xd4400000-0xd44fffff 64bit pref]
>   [  179.297961] pci 0000:01:00.0: BAR 2: assigned [mem 0xd4500000-0xd45fffff 64bit pref]
>   [  179.298316] mwifiex_pcie 0000:01:00.0: enabling device (0000 -> 0002)
>   [  179.298752] mwifiex_pcie: PCI memory map Virt0: 00000000c4593df1 PCI memory map Virt2: 0000000039d67daf
>   [  179.300522] mwifiex_pcie 0000:01:00.0: WLAN read winner status failed!
>   [  179.300552] mwifiex_pcie 0000:01:00.0: info: _mwifiex_fw_dpc: unregister device
>   [  179.300622] mwifiex_pcie 0000:01:00.0: Read register failed
>   [  179.300912] mwifiex_pcie 0000:01:00.0: performing cancel_work_sync()...
>   [  179.300928] mwifiex_pcie 0000:01:00.0: cancel_work_sync() done
> 
> after which the card is unusable (there is no WiFi interface availabel
> any more). Reloading the driver module doesn't help and produces
> 
>   [  376.906833] mwifiex_pcie: PCI memory map Virt0: 0000000025149d28 PCI memory map Virt2: 00000000c4593df1
>   [  376.907278] mwifiex_pcie 0000:01:00.0: WLAN read winner status failed!
>   [  376.907281] mwifiex_pcie 0000:01:00.0: info: _mwifiex_fw_dpc: unregister device
>   [  376.907293] mwifiex_pcie 0000:01:00.0: Read register failed
>   [  376.907404] mwifiex_pcie 0000:01:00.0: performing cancel_work_sync()...
>   [  376.907406] mwifiex_pcie 0000:01:00.0: cancel_work_sync() done
> 
> again. Performing a function level reset produces
> 
>   [  402.489572] mwifiex_pcie 0000:01:00.0: mwifiex_pcie_reset_prepare: adapter structure is not valid
>   [  403.514219] mwifiex_pcie 0000:01:00.0: mwifiex_pcie_reset_done: adapter structure is not valid
> 
> and doesn't help either.

More Qualcomm/Atheros wifi cards are broken in a way that they stop
responding after PCIe Hot Reset and completely disappear from the PCIe
bus. It is possible that similar issue have also these Marvell cards?

As now we know that bride does not support hotplug it cannot inform
system when card disconnect from the bus. The one possible way how to
detect if PCIe card is available at specific address is trying to read
its device and vendor id. Kernel caches device/vendor id, so from
userspace is needed to call lspci with -b argument (to ignore kernel's
reported cached value). Could you provide output of following command
after you do PCIe Hot Reset?

    lspci -s 01:00.0 -b -vv

(and compare with output which you have already provided if there are
any differences)

If PCIe Hot Reset is breaking the card then the only option how to
"reset" card into working state again is PCIe Warm Reset. Unfortunately
there is no common mechanism how to do it from system. PCIe Warm Reset
is done by asserting PERST# signal on card itself, in mPCIe form factor
it is pin 22. In most cases pin 22 is connected to some GPIO so via GPIO
subsystem it could be controlled.

> Running the same command on a kernel with (among other) this patch
> unfortunately also breaks the adapter in the same way. As far as I can
> tell though, it doesn't run through the reset code added by this patch
> (as indicated by the log message when performing FLR), which I assume
> in a non-forced scenario, e.g. firmware issues (which IIRC is why this
> patch exists), it would?

Err... I have caught this part. Is proposed patch able to recover also
from state which happens after PCIe Hot Reset?

> > > Unfortunately I can't test that with a
> > > network connection (and without compiling a custom kernel for which I
> > > don't have the time right now) because there's currently another bug
> > > deadlocking on device removal if there's an active connection during
> > > removal (which also seems to trigger on reset). That one ill be fixed
> > > by
> > > 
> > >    https://lore.kernel.org/linux-wireless/20210515024227.2159311-1-briannorris@chromium.org/
> > > 
> > > Jonas might know more.
> 
> [...]
Maximilian Luz July 9, 2021, 8:54 p.m. UTC | #8
On 7/9/21 9:44 PM, Pali Rohár wrote:
> On Friday 09 July 2021 21:27:51 Maximilian Luz wrote:
>> On 7/9/21 8:44 PM, Pali Rohár wrote:
>>
>> [...]
>>
>>>> My (very) quick attempt ('echo 1 > /sys/bus/pci/.../reset) at
>>>> reproducing this didn't work, so I think at very least a network
>>>> connection needs to be active.
>>>
>>> This is doing PCIe function level reset. Maybe you can get more luck
>>> with PCIe Hot Reset. See following link how to trigger PCIe Hot Reset
>>> from userspace: https://alexforencich.com/wiki/en/pcie/hot-reset-linux
>>
>> Thanks for that link! That does indeed do something which breaks the
>> adapter. Running the script produces
>>
>>    [  178.388414] mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed
>>    [  178.389128] mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed
>>    [  178.461365] mwifiex_pcie 0000:01:00.0: performing cancel_work_sync()...
>>    [  178.461373] mwifiex_pcie 0000:01:00.0: cancel_work_sync() done
>>    [  178.984106] pci 0000:01:00.0: [11ab:2b38] type 00 class 0x020000
>>    [  178.984161] pci 0000:01:00.0: reg 0x10: [mem 0x00000000-0x000fffff 64bit pref]
>>    [  178.984193] pci 0000:01:00.0: reg 0x18: [mem 0x00000000-0x000fffff 64bit pref]
>>    [  178.984430] pci 0000:01:00.0: supports D1 D2
>>    [  178.984434] pci 0000:01:00.0: PME# supported from D0 D1 D3hot D3cold
>>    [  178.984871] pcieport 0000:00:1c.0: ASPM: current common clock configuration is inconsistent, reconfiguring
>>    [  179.297919] pci 0000:01:00.0: BAR 0: assigned [mem 0xd4400000-0xd44fffff 64bit pref]
>>    [  179.297961] pci 0000:01:00.0: BAR 2: assigned [mem 0xd4500000-0xd45fffff 64bit pref]
>>    [  179.298316] mwifiex_pcie 0000:01:00.0: enabling device (0000 -> 0002)
>>    [  179.298752] mwifiex_pcie: PCI memory map Virt0: 00000000c4593df1 PCI memory map Virt2: 0000000039d67daf
>>    [  179.300522] mwifiex_pcie 0000:01:00.0: WLAN read winner status failed!
>>    [  179.300552] mwifiex_pcie 0000:01:00.0: info: _mwifiex_fw_dpc: unregister device
>>    [  179.300622] mwifiex_pcie 0000:01:00.0: Read register failed
>>    [  179.300912] mwifiex_pcie 0000:01:00.0: performing cancel_work_sync()...
>>    [  179.300928] mwifiex_pcie 0000:01:00.0: cancel_work_sync() done
>>
>> after which the card is unusable (there is no WiFi interface availabel
>> any more). Reloading the driver module doesn't help and produces
>>
>>    [  376.906833] mwifiex_pcie: PCI memory map Virt0: 0000000025149d28 PCI memory map Virt2: 00000000c4593df1
>>    [  376.907278] mwifiex_pcie 0000:01:00.0: WLAN read winner status failed!
>>    [  376.907281] mwifiex_pcie 0000:01:00.0: info: _mwifiex_fw_dpc: unregister device
>>    [  376.907293] mwifiex_pcie 0000:01:00.0: Read register failed
>>    [  376.907404] mwifiex_pcie 0000:01:00.0: performing cancel_work_sync()...
>>    [  376.907406] mwifiex_pcie 0000:01:00.0: cancel_work_sync() done
>>
>> again. Performing a function level reset produces
>>
>>    [  402.489572] mwifiex_pcie 0000:01:00.0: mwifiex_pcie_reset_prepare: adapter structure is not valid
>>    [  403.514219] mwifiex_pcie 0000:01:00.0: mwifiex_pcie_reset_done: adapter structure is not valid
>>
>> and doesn't help either.
> 
> More Qualcomm/Atheros wifi cards are broken in a way that they stop
> responding after PCIe Hot Reset and completely disappear from the PCIe
> bus. It is possible that similar issue have also these Marvell cards?
> 
> As now we know that bride does not support hotplug it cannot inform
> system when card disconnect from the bus. The one possible way how to
> detect if PCIe card is available at specific address is trying to read
> its device and vendor id. Kernel caches device/vendor id, so from
> userspace is needed to call lspci with -b argument (to ignore kernel's
> reported cached value). Could you provide output of following command
> after you do PCIe Hot Reset?
> 
>      lspci -s 01:00.0 -b -vv
> 
> (and compare with output which you have already provided if there are
> any differences)

There do seem to be some differences, specifically regarding memory.
See https://paste.ubuntu.com/p/Rz2CDjwkCv/ for the full output.

> If PCIe Hot Reset is breaking the card then the only option how to
> "reset" card into working state again is PCIe Warm Reset. Unfortunately
> there is no common mechanism how to do it from system. PCIe Warm Reset
> is done by asserting PERST# signal on card itself, in mPCIe form factor
> it is pin 22. In most cases pin 22 is connected to some GPIO so via GPIO
> subsystem it could be controlled.
> 
>> Running the same command on a kernel with (among other) this patch
>> unfortunately also breaks the adapter in the same way. As far as I can
>> tell though, it doesn't run through the reset code added by this patch
>> (as indicated by the log message when performing FLR), which I assume
>> in a non-forced scenario, e.g. firmware issues (which IIRC is why this
>> patch exists), it would?
> 
> Err... I have caught this part. Is proposed patch able to recover also
> from state which happens after PCIe Hot Reset?

I'm not sure at this point if the power-cycle through D3cold would fix
this (I think it might, but have no evidence for that). This patch alone
isn't able to recover the device, as, when triggering the hot-reset via
that script, the code never seems to run mwifiex_pcie_reset_d3cold_quirk().

If I remember correctly, the main issue was that the firmware state
doesn't get reset completely. This can be somewhat observed when doing
'echo 1 > /sys/bus/pci/devices/.../reset' via the difference in log
messages:

For an unpatched kernel:

   [   64.159509] mwifiex_pcie 0000:01:00.0: info: shutdown mwifiex...
   [   64.159546] mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed
   [   64.159922] mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed
   [   65.240272] mwifiex_pcie 0000:01:00.0: WLAN FW already running! Skip FW dnld
   [   65.240285] mwifiex_pcie 0000:01:00.0: WLAN FW is active
   [   65.327359] mwifiex_pcie 0000:01:00.0: info: MWIFIEX VERSION: mwifiex 1.0 (15.68.19.p21)
   [   65.327370] mwifiex_pcie 0000:01:00.0: driver_version = mwifiex 1.0 (15.68.19.p21)

For a patched kernel:

   [   41.966094] mwifiex_pcie 0000:01:00.0: info: shutdown mwifiex...
   [   41.966451] mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed
   [   41.967227] mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed
   [   42.063543] mwifiex_pcie 0000:01:00.0: Using reset_d3cold quirk to perform FW reset
   [   42.063558] mwifiex_pcie 0000:01:00.0: putting into D3cold...
   [   42.081010] usb 1-6: USB disconnect, device number 9
   [   42.339922] pcieport 0000:00:1c.0: putting into D3cold...
   [   42.425766] pcieport 0000:00:1c.0: putting into D0...
   [   42.695987] mwifiex_pcie 0000:01:00.0: putting into D0...
   [   42.956673] mwifiex_pcie 0000:01:00.0: enabling device (0000 -> 0002)
   [   45.012736] mwifiex_pcie 0000:01:00.0: info: FW download over, size 723540 bytes
   [   45.740882] usb 1-6: new high-speed USB device number 10 using xhci_hcd
   [   45.881294] usb 1-6: New USB device found, idVendor=1286, idProduct=204c, bcdDevice=32.01
   [   45.881308] usb 1-6: New USB device strings: Mfr=1, Product=2, SerialNumber=3
   [   45.881313] usb 1-6: Product: Bluetooth and Wireless LAN Composite Device
   [   45.881318] usb 1-6: Manufacturer: Marvell
   [   45.881322] usb 1-6: SerialNumber: 0000000000000000
   [   45.884882] Bluetooth: hci0: unexpected event for opcode 0x0000
   [   45.885128] Bluetooth: hci0: unexpected event for opcode 0x0000
   [   45.966218] mwifiex_pcie 0000:01:00.0: WLAN FW is active
   [   46.157474] mwifiex_pcie 0000:01:00.0: info: MWIFIEX VERSION: mwifiex 1.0 (15.68.19.p21)
   [   46.157485] mwifiex_pcie 0000:01:00.0: driver_version = mwifiex 1.0 (15.68.19.p21)

Note the absence of "WLAN FW already running! Skip FW dnld" on the
second log. Due to the power cycle we're essentially forcing the device
to re-download and re-initialize its firmware. On an unpatched kernel,
it looks like the firmware itself is kept and state may not be cleared
properly. So in other words it looks like the firmware, while being
prompted to do a reset, doesn't do that properly (at least when it has
crashed before).

IIRC this then allowed us to recover from firmware issues that the
"normal" firmware reset that mwifiex supposedly performs on a FLR
(or reloading driver modules) didn't help with.

I'm not so sure any more if resetting actively caused issues or if it
just showed different symptoms of some firmware issue that prompted us
to do the reset in the first place (again, been quite a while since I
last dealt with this stuff, sorry). All I know is that this patched
reset gets the card going again.

As a side note: There are also more patches by Jonas (and Tsuchiya?)
building on top of the quirk implementation introduced here which
significantly reduce the need for doing resets in the first place
(nevertheless having a reset that actually does properly reset the
device is a good thing IMHO). Those patches can for example be found
here:

   https://github.com/linux-surface/kernel/compare/eaaf96ba58a5fe5999b89fe3afaded74caa96480...989c8725a6d4e62db6370dd0fefe45498274d3ce

>>>> Unfortunately I can't test that with a
>>>> network connection (and without compiling a custom kernel for which I
>>>> don't have the time right now) because there's currently another bug
>>>> deadlocking on device removal if there's an active connection during
>>>> removal (which also seems to trigger on reset). That one ill be fixed
>>>> by
>>>>
>>>>     https://lore.kernel.org/linux-wireless/20210515024227.2159311-1-briannorris@chromium.org/
>>>>
>>>> Jonas might know more.
>>
>> [...]
Pali Rohár July 10, 2021, 12:07 a.m. UTC | #9
On Saturday 10 July 2021 02:00:08 Maximilian Luz wrote:
> On 7/10/21 12:54 AM, Pali Rohár wrote:
> 
> [...]
> 
> > > Also not sure if this is just my bias, but it feels like the Surface
> > > line always had more problems with that driver (and firmware) than
> > > others.
> > 
> > Ehm, really? I see reports also from non-Surface users about bad quality
> > of these 88W[89]xxx cards and repeating firmware issues. I have bad
> > personal experience with 88W8997 SDIO firmware and lot of times I get
> > advice about ex-Marvell/NXP wifi cards "do not touch and run away
> > quickly".
> 
> Yeah, then I'm probably biased since I'm mostly dealing with Surface
> stuff.
> 
> > I think that more people if they get mPCIe/M.2 wifi card in laptop which
> > does not work, they just replace it with some working one. And not
> > spending infinite time in trying to fix it... So this may explain why
> > there are more Surface users with these issues...
> 
> That might be an explanation. If it wouldn't need a heat-gun to open it
> up, I'd probably have done that at some point in the past (there were
> times when WiFi at my Uni was pretty much unusable with this device...
> and I'm still not sure what fixed that or even if it's fixed completely).
> 
> > > I'm honestly a bit surprised that MS stuck with them for this
> > > long (they decided to go with Intel for 7th gen devices). AFAICT they
> > > initially chose Marvell due to connected standby support, so maybe that
> > > causes issue for us and others simply aren't using that feature? Just
> > > guessing though.
> > 
> > In my opinion that "Connected Standby" is just MS marketing term.
> 
> I can only really repeat what I've been told: Apparently when they
> started designing those devices, the only option with "Connected
> standby" (or probably rather that feature set that MS wanted) was,
> unfortunately for us, Marvell.
> 
> > 88W[89]xxx chips using full-mac firmware and drivers [*]. Full-mac lot
> > of times causing more issues than soft-mac solution. Moreover this
> > Marvell firmware implements also other "application" layers in firmware
> > which OS drivers can use, e.g. there is fully working "wpa-supplicant"
> > replacement and also AP part. Maybe windows drivers are using it and it
> > cause less problems? Duno. mwifiex uses only "low level" commands and
> > WPA state machine is implemented in userspace wpa-supplicant daemon.
> > 
> > [*] - Small note: There are also soft-mac firmwares and drivers but
> > apparently Marvell has never finished linux driver and firmware was not
> > released to public...
> > 
> > And there is also Laird Connectivity which offers their own proprietary
> > linux kernel drivers with their own firmware for these 88W[89]xxx chips.
> > Last time I checked it they released some parts of driver on github.
> > Maybe somebody could contact Laird or check if their driver can be
> > replaced by mwifiex? Or just replacing ex-Marvell/NXP firmware by their?
> > But I'm not sure if they have something for 88W8897.
> 
> Interesting, I was not aware of this. IIRC we've been experimenting with
> the mwlwifi driver (which that lrdmwl driver seems to be based on?), but
> couldn't get that to work with the firmware we have.

mwlwifi is that soft-mac driver and uses completely different firmware.
For sure it would not work with current full-mac firmware.

> IIRC it also didn't
> work with the Windows firmware (which seems to be significantly
> different from the one we have for Linux and seems to use or be modeled
> after some special Windows WiFi driver interface).

So... Microsoft has different firmware for this chip? And it is working
with mwifiex driver?
Pali Rohár July 10, 2021, 12:38 a.m. UTC | #10
On Saturday 10 July 2021 02:18:12 Maximilian Luz wrote:
> On 7/10/21 2:07 AM, Pali Rohár wrote:
> 
> [...]
> 
> > > Interesting, I was not aware of this. IIRC we've been experimenting with
> > > the mwlwifi driver (which that lrdmwl driver seems to be based on?), but
> > > couldn't get that to work with the firmware we have.
> > 
> > mwlwifi is that soft-mac driver and uses completely different firmware.
> > For sure it would not work with current full-mac firmware.
> > 
> > > IIRC it also didn't
> > > work with the Windows firmware (which seems to be significantly
> > > different from the one we have for Linux and seems to use or be modeled
> > > after some special Windows WiFi driver interface).
> > 
> > So... Microsoft has different firmware for this chip? And it is working
> > with mwifiex driver?
> 
> I'm not sure how special that firmware really is (i.e. if it is Surface
> specific or just what Marvell uses on Windows), only that it doesn't
> look like the firmware included in the linux-firmware repo. The Windows
> firmware doesn't work with either mwlwifi or mwifiex drivers (IIRC) and
> on Linux we use the official firmware from the linux-firmware repo.

Version available in the linux-firmware repo is also what big companies
(like google) receive for their systems... sometimes just only older
version as Marvell/NXP is slow in updating files in linux-firmware.
Seems that it is also same what receive customers under NDA as more
companies dropped "proprietary" ex-Marvell/NXP driver on internet and it
contained this firmware with some sources of driver which looks like a
fork of mwifiex (or maybe mwifiex is "cleaned fork" of that driver :D)

There is old firmware documentation which describe RPC communication
between OS and firmware:
http://wiki.laptop.org/images/f/f3/Firmware-Spec-v5.1-MV-S103752-00.pdf

It is really old for very old wifi chips and when I checked it, it still
matches what mwifiex is doing with new chips. Just there are new and
more commands. And documentation is OS-neutral.

So if Microsoft has some "incompatible" firmware with this, it could
mean that they got something special which nobody else have? Maybe it
can explain that "connected standby" and maybe also better stability?

Or just windows distribute firmware in different format and needs to
"unpack" or "preprocess" prior downloading it to device?
Jonas Dreßler July 11, 2021, 4:31 p.m. UTC | #11
On 7/9/21 7:30 PM, Pali Rohár wrote:
> On Friday 09 July 2021 19:03:37 Maximilian Luz wrote:
>> On 7/9/21 6:12 PM, Pali Rohár wrote:
>>
>> [...]
>>
>>>>> Hello! Now I'm thinking loudly about this patch. Why this kind of reset
>>>>> is needed only for Surface devices? AFAIK these 88W8897 chips are same
>>>>> in all cards. Chip itself implements PCIe interface (and also SDIO) so
>>>>> for me looks very strange if this 88W8897 PCIe device needs DMI specific
>>>>> quirks. I cannot believe that Microsoft got some special version of
>>>>> these chips from Marvell which are different than version uses on cards
>>>>> in mPCIe form factor.
>>>>>
>>>>> And now when I'm reading comment below about PCIe bridge to which is
>>>>> this 88W8897 PCIe chip connected, is not this rather an issue in that
>>>>> PCIe bridge (instead of mwifiex/88W8897) or in ACPI firmware which
>>>>> controls this bridge?
>>>>>
>>>>> Or are having other people same issues on mPCIe form factor wifi cards
>>>>> with 88W8897 chips and then this quirk should not DMI dependent?
>>>>>
>>>>> Note that I'm seeing issues with reset and other things also on chip
>>>>> 88W8997 when is connected to system via SDIO. These chips have both PCIe
>>>>> and SDIO buses, it just depends which pins are used.
>>>>>
>>>>
>>>> Hi and thanks for the quick reply! Honestly I've no idea, this is just the
>>>> first method we found that allows for a proper reset of the chip. What I
>>>> know is that some Surface devices need that ACPI DSM call (the one that was
>>>> done in the commit I dropped in this version of the patchset) to reset the
>>>> chip instead of this method.
>>>>
>>>> Afaik other devices with this chip don't need this resetting method, at
>>>> least Marvell employees couldn't reproduce the issues on their testing
>>>> devices.
>>>>
>>>> So would you suggest we just try to match for the pci chip 88W8897 instead?
>>>
>>> Hello! Such suggestion makes sense when we know that it is 88W8897
>>> issue. But if you got information that issue cannot be reproduced on
>>> other 88W8897 cards then matching 88W8897 is not correct.
>>>
>>>   From all this information looks like that it is problem in (Microsoft?)
>>> PCIe bridge to which is card connected. Otherwise I do not reason how it
>>> can be 88W8897 affected. Either it is reproducible on 88W8897 cards also
>>> in other devices or issue is not on 88W8897 card.
>>
>> I doubt that it's an issue with the PCIe bridge (itself at least). The
>> same type of bridge is used for both dGPU and NVME SSD on my device (see
>> lspci output below) and those work fine. Also if I'm seeing that right
>> it's from the Intel CPU, so my guess is that a lot more people would
>> have issues with that then.
> 
>  From information below it seems to be related to surprise removal.
> Therefore is surprise removal working without issue for dGPU or NVME
> SSD? Not all PCIe bridges support surprise removal...
> 
>>
>> I don't know about the hardware side, so it might be possible that it's
>> an issue with integrating both bridge and wifi chip, in which case it's
>> still probably best handled via DMI quirks unless we know more.
>>
>> Also as Tsuchiya mentioned in his original submission, on Windows the
>> device is reset via this D3cold method. I've only skimmed that
>> errata.inf file mentioned, but I think this is what he's referring to:
>>
>>    Controls whether ACPIDeviceEnableD3ColdOnSurpriseRemoval rule will be
>>    evaluated or not on a given platform. Currently
>>    ACPIDeviceEnableD3ColdOnSurpriseRemoval rule only needs to be
>>    evaluated on Surface platforms which contain the Marvell WiFi
>>    controller which depends on device going through D3Cold as part of
>>    surprise-removal.
>>
>> and
>>
>>    Starting with Windows releases *after* Blue, ACPI will not put
>>    surprise-removed devices into D3Cold automatically. Some known
>>    scenarios (viz. WiFi reset/recovery) rely on the device cycling
>>    through D3Cold on surprise-removal. This hack allows surprise-removed
>>    devices to be put into D3Cold (if supported by the stack).
>>
>> So, as far as I can tell, the chip doesn't like to be surprise-removed
>> (which seems to happen during reset) and then needs to be power-cycled,
>> which I think is likely due to some issue with firmware state.
> 
> Thanks for information. This really does not look like PCIe bridge
> specific if bridge itself can handle surprise-removed devices. lspci can
> tell us if bridge supports it or not (see below).
> 
>> So the quirk on Windows seems very Surface specific.
>>
>> There also seem a bunch of revisions of these chips around, for example
>> my SB2 is affected by a bug that we've tied to the specific hardware
>> revision which causes some issues with host-sleep (IIRC chip switches
>> rapidly between wake and sleep states without any external influence,
>> which is not how it should behave and how it does behave on a later
>> hardware revision).
> 
> Interesting... This looks like the issue can be in 88W8897 chip and
> needs some special conditions to trigger? And Surface is triggering it
> always?

The specific issue was that the card wakes up very quickly after going 
into deep sleep (deep sleep is a state the firmware enters when idle and 
not connected to an AP). Now this in turn messed with the host 
suspending, because deep sleep is involved there and the card is not 
expected to wake up that quickly again (I'm oversimplifying here, it's 
also been some time since we looked into it).

Anyway, in the end those wakeups from deep sleep only happened with one 
hardware revision of the card (I guess it's caused by a hardware design 
issue like a floating gpio or something), and we managed to fix the 
problems by disabling deep sleep on that hardware revision, but as Max 
mentioned that problem is completely unrelated from this patch.

> 
>>>> Then we'd probably have to check if there are any laptops where multiple
>>>> devices are connected to the pci bridge as Amey suggested in a review
>>>> before.
>>>
>>> Well, I do not know... But if this is issue with PCIe bridge then
>>> similar issue could be observed also for other PCIe devices with this
>>> PCIe bridge. But question is if there are other laptops with this PCIe
>>> bridge. And also it can be a problem in ACPI firmware on those Surface
>>> devices, which implements some PCIe bridge functionality. So it is
>>> possible that issue is with PCIe bridge, not in HW, but in SW/firmware
>>> part which can be Microsoft specific... So too many questions to which
>>> we do not know answers.
>>>
>>> Could you provide output of 'lspci -nn -vv' and 'lspci -tvnn' on
>>> affected machines? If you have already sent it in some previous email,
>>> just send a link. At least I'm not able to find it right now and output
>>> may contain something useful...
>>
>>  From my Surface Book 2 (with the same issue):
>>
>>   - lspci -tvnn: https://paste.ubuntu.com/p/mm3YpcZJ8N/
>>   - lspci -vv -nn: https://paste.ubuntu.com/p/dctTDP738N/
> 
> Could you re-run lspci under root account? There are missing important
> parts like "Capabilities: <access denied>" where is information if
> bridge supports surprise removal or not.
> 
>> Regards,
>> Max
Jonas Dreßler July 11, 2021, 4:53 p.m. UTC | #12
On 7/10/21 3:07 AM, Maximilian Luz wrote:
> On 7/10/21 2:38 AM, Pali Rohár wrote:

>> On Saturday 10 July 2021 02:18:12 Maximilian Luz wrote:

>>> On 7/10/21 2:07 AM, Pali Rohár wrote:

>>>

>>> [...]

>>>

>>>>> Interesting, I was not aware of this. IIRC we've been experimenting 

>>>>> with

>>>>> the mwlwifi driver (which that lrdmwl driver seems to be based 

>>>>> on?), but

>>>>> couldn't get that to work with the firmware we have.

>>>>

>>>> mwlwifi is that soft-mac driver and uses completely different firmware.

>>>> For sure it would not work with current full-mac firmware.

>>>>

>>>>> IIRC it also didn't

>>>>> work with the Windows firmware (which seems to be significantly

>>>>> different from the one we have for Linux and seems to use or be 

>>>>> modeled

>>>>> after some special Windows WiFi driver interface).

>>>>

>>>> So... Microsoft has different firmware for this chip? And it is working

>>>> with mwifiex driver?

>>>

>>> I'm not sure how special that firmware really is (i.e. if it is Surface

>>> specific or just what Marvell uses on Windows), only that it doesn't

>>> look like the firmware included in the linux-firmware repo. The Windows

>>> firmware doesn't work with either mwlwifi or mwifiex drivers (IIRC) and

>>> on Linux we use the official firmware from the linux-firmware repo.

>>

>> Version available in the linux-firmware repo is also what big companies

>> (like google) receive for their systems... sometimes just only older

>> version as Marvell/NXP is slow in updating files in linux-firmware.

>> Seems that it is also same what receive customers under NDA as more

>> companies dropped "proprietary" ex-Marvell/NXP driver on internet and it

>> contained this firmware with some sources of driver which looks like a

>> fork of mwifiex (or maybe mwifiex is "cleaned fork" of that driver :D)

>>

>> There is old firmware documentation which describe RPC communication

>> between OS and firmware:

>> http://wiki.laptop.org/images/f/f3/Firmware-Spec-v5.1-MV-S103752-00.pdf

>>

>> It is really old for very old wifi chips and when I checked it, it still

>> matches what mwifiex is doing with new chips. Just there are new and

>> more commands. And documentation is OS-neutral.

>>

>> So if Microsoft has some "incompatible" firmware with this, it could

>> mean that they got something special which nobody else have? Maybe it

>> can explain that "connected standby" and maybe also better stability?

>>

>> Or just windows distribute firmware in different format and needs to

>> "unpack" or "preprocess" prior downloading it to device?

> 

> If memory serves me right, Jonas did some reverse engineering on the

> Windows driver and found that it uses the "new" WDI Miniport API: It

> seems that originally both Windows and Linux drivers (and firmware)

> were pretty much the same (he mentioned there were similarities in

> terminology), but then they switched to that new API on Windows and

> changed the firmware with it, so that the driver now essentially only

> forwards the commands from that API to the firmware and the firmware

> handles the rest.

> 

> By reading the Windows docs on that API, that change might have been

> forced on them as some Windows 10 features apparently only work via

> that API.

> 

> He'll probably know more about that than I do.


Not much I can add there, it seemed a lot like both mwifiex and the 
Windows 10 WDI miniport driver were both derived from the same codebase 
originally, but in order to be compatible with the WDI miniport API and 
other stuff Windows requires from wifi devices (I recall there was some 
SAR-value control/reporting stuff too), some parts of the firmware had 
to be rewritten.

In the end, the Windows firmware is updated a lot more often and likely 
includes a bunch of bugfixes the linux firmware doesn't have, but it 
can't be used on linux without a ton of work that would probably include 
rebuilding proprietary APIs from Windows.

Also, from my testing with custom APs and sniffing packets with 
Wireshark, the functionality, limitations and weird 
"semi-spec-compliant" behaviors were exactly the same with the Windows 
firmware: It doesn't support WPA3, it can't connect to fast transition 
APs (funnily enough that's opposed to what MS claims) and it also can't 
spawn an AP with WPA-PSK-SHA256 AKM ciphers. So not sure there's a lot 
of sense in spending more time trying to go down that path.
Pali Rohár July 11, 2021, 5:01 p.m. UTC | #13
On Sunday 11 July 2021 18:53:32 Jonas Dreßler wrote:
> On 7/10/21 3:07 AM, Maximilian Luz wrote:

> > On 7/10/21 2:38 AM, Pali Rohár wrote:

> > > On Saturday 10 July 2021 02:18:12 Maximilian Luz wrote:

> > > > On 7/10/21 2:07 AM, Pali Rohár wrote:

> > > > 

> > > > [...]

> > > > 

> > > > > > Interesting, I was not aware of this. IIRC we've been

> > > > > > experimenting with

> > > > > > the mwlwifi driver (which that lrdmwl driver seems to be

> > > > > > based on?), but

> > > > > > couldn't get that to work with the firmware we have.

> > > > > 

> > > > > mwlwifi is that soft-mac driver and uses completely different firmware.

> > > > > For sure it would not work with current full-mac firmware.

> > > > > 

> > > > > > IIRC it also didn't

> > > > > > work with the Windows firmware (which seems to be significantly

> > > > > > different from the one we have for Linux and seems to

> > > > > > use or be modeled

> > > > > > after some special Windows WiFi driver interface).

> > > > > 

> > > > > So... Microsoft has different firmware for this chip? And it is working

> > > > > with mwifiex driver?

> > > > 

> > > > I'm not sure how special that firmware really is (i.e. if it is Surface

> > > > specific or just what Marvell uses on Windows), only that it doesn't

> > > > look like the firmware included in the linux-firmware repo. The Windows

> > > > firmware doesn't work with either mwlwifi or mwifiex drivers (IIRC) and

> > > > on Linux we use the official firmware from the linux-firmware repo.

> > > 

> > > Version available in the linux-firmware repo is also what big companies

> > > (like google) receive for their systems... sometimes just only older

> > > version as Marvell/NXP is slow in updating files in linux-firmware.

> > > Seems that it is also same what receive customers under NDA as more

> > > companies dropped "proprietary" ex-Marvell/NXP driver on internet and it

> > > contained this firmware with some sources of driver which looks like a

> > > fork of mwifiex (or maybe mwifiex is "cleaned fork" of that driver :D)

> > > 

> > > There is old firmware documentation which describe RPC communication

> > > between OS and firmware:

> > > http://wiki.laptop.org/images/f/f3/Firmware-Spec-v5.1-MV-S103752-00.pdf

> > > 

> > > It is really old for very old wifi chips and when I checked it, it still

> > > matches what mwifiex is doing with new chips. Just there are new and

> > > more commands. And documentation is OS-neutral.

> > > 

> > > So if Microsoft has some "incompatible" firmware with this, it could

> > > mean that they got something special which nobody else have? Maybe it

> > > can explain that "connected standby" and maybe also better stability?

> > > 

> > > Or just windows distribute firmware in different format and needs to

> > > "unpack" or "preprocess" prior downloading it to device?

> > 

> > If memory serves me right, Jonas did some reverse engineering on the

> > Windows driver and found that it uses the "new" WDI Miniport API: It

> > seems that originally both Windows and Linux drivers (and firmware)

> > were pretty much the same (he mentioned there were similarities in

> > terminology), but then they switched to that new API on Windows and

> > changed the firmware with it, so that the driver now essentially only

> > forwards the commands from that API to the firmware and the firmware

> > handles the rest.

> > 

> > By reading the Windows docs on that API, that change might have been

> > forced on them as some Windows 10 features apparently only work via

> > that API.

> > 

> > He'll probably know more about that than I do.

> 

> Not much I can add there, it seemed a lot like both mwifiex and the Windows

> 10 WDI miniport driver were both derived from the same codebase originally,

> but in order to be compatible with the WDI miniport API and other stuff

> Windows requires from wifi devices (I recall there was some SAR-value

> control/reporting stuff too), some parts of the firmware had to be

> rewritten.

> 

> In the end, the Windows firmware is updated a lot more often and likely

> includes a bunch of bugfixes the linux firmware doesn't have, but it can't

> be used on linux without a ton of work that would probably include

> rebuilding proprietary APIs from Windows.

> 

> Also, from my testing with custom APs and sniffing packets with Wireshark,

> the functionality, limitations and weird "semi-spec-compliant" behaviors

> were exactly the same with the Windows firmware: It doesn't support WPA3, it

> can't connect to fast transition APs (funnily enough that's opposed to what

> MS claims) and it also can't spawn an AP with WPA-PSK-SHA256 AKM ciphers. So

> not sure there's a lot of sense in spending more time trying to go down that

> path.


New version of firmware files are available on NXP portal, where are
updated more frequently, but only for companies which have NXP accounts
and signed NDA with NXP. Not for end users.

If you want these new firmware files, you need to ask NXP developers as
only they can ask for non-NDA distribution and include new version into
linux-firmware repository. Like in this pull request where is new SDIO
firmware for 88W8897:
https://lore.kernel.org/linux-firmware/DB7PR04MB453855B0D6C41923BCB0922EFC1C9@DB7PR04MB4538.eurprd04.prod.outlook.com/
diff mbox series

Patch

diff --git a/drivers/net/wireless/marvell/mwifiex/pcie.c b/drivers/net/wireless/marvell/mwifiex/pcie.c
index a530832c9421..c6ccce426b49 100644
--- a/drivers/net/wireless/marvell/mwifiex/pcie.c
+++ b/drivers/net/wireless/marvell/mwifiex/pcie.c
@@ -528,6 +528,13 @@  static void mwifiex_pcie_reset_prepare(struct pci_dev *pdev)
 	mwifiex_shutdown_sw(adapter);
 	clear_bit(MWIFIEX_IFACE_WORK_DEVICE_DUMP, &card->work_flags);
 	clear_bit(MWIFIEX_IFACE_WORK_CARD_RESET, &card->work_flags);
+
+	/* On MS Surface gen4+ devices FLR isn't effective to recover from
+	 * hangups, so we power-cycle the card instead.
+	 */
+	if (card->quirks & QUIRK_FW_RST_D3COLD)
+		mwifiex_pcie_reset_d3cold_quirk(pdev);
+
 	mwifiex_dbg(adapter, INFO, "%s, successful\n", __func__);
 
 	card->pci_reset_ongoing = true;
diff --git a/drivers/net/wireless/marvell/mwifiex/pcie_quirks.c b/drivers/net/wireless/marvell/mwifiex/pcie_quirks.c
index 4064f99b36ba..b5f214fc1212 100644
--- a/drivers/net/wireless/marvell/mwifiex/pcie_quirks.c
+++ b/drivers/net/wireless/marvell/mwifiex/pcie_quirks.c
@@ -15,6 +15,72 @@ 
 
 /* quirk table based on DMI matching */
 static const struct dmi_system_id mwifiex_quirk_table[] = {
+	{
+		.ident = "Surface Pro 4",
+		.matches = {
+			DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"),
+			DMI_EXACT_MATCH(DMI_PRODUCT_NAME, "Surface Pro 4"),
+		},
+		.driver_data = (void *)QUIRK_FW_RST_D3COLD,
+	},
+	{
+		.ident = "Surface Pro 5",
+		.matches = {
+			/* match for SKU here due to generic product name "Surface Pro" */
+			DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"),
+			DMI_EXACT_MATCH(DMI_PRODUCT_SKU, "Surface_Pro_1796"),
+		},
+		.driver_data = (void *)QUIRK_FW_RST_D3COLD,
+	},
+	{
+		.ident = "Surface Pro 5 (LTE)",
+		.matches = {
+			/* match for SKU here due to generic product name "Surface Pro" */
+			DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"),
+			DMI_EXACT_MATCH(DMI_PRODUCT_SKU, "Surface_Pro_1807"),
+		},
+		.driver_data = (void *)QUIRK_FW_RST_D3COLD,
+	},
+	{
+		.ident = "Surface Pro 6",
+		.matches = {
+			DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"),
+			DMI_EXACT_MATCH(DMI_PRODUCT_NAME, "Surface Pro 6"),
+		},
+		.driver_data = (void *)QUIRK_FW_RST_D3COLD,
+	},
+	{
+		.ident = "Surface Book 1",
+		.matches = {
+			DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"),
+			DMI_EXACT_MATCH(DMI_PRODUCT_NAME, "Surface Book"),
+		},
+		.driver_data = (void *)QUIRK_FW_RST_D3COLD,
+	},
+	{
+		.ident = "Surface Book 2",
+		.matches = {
+			DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"),
+			DMI_EXACT_MATCH(DMI_PRODUCT_NAME, "Surface Book 2"),
+		},
+		.driver_data = (void *)QUIRK_FW_RST_D3COLD,
+	},
+	{
+		.ident = "Surface Laptop 1",
+		.matches = {
+			DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"),
+			DMI_EXACT_MATCH(DMI_PRODUCT_NAME, "Surface Laptop"),
+		},
+		.driver_data = (void *)QUIRK_FW_RST_D3COLD,
+	},
+	{
+		.ident = "Surface Laptop 2",
+		.matches = {
+			DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"),
+			DMI_EXACT_MATCH(DMI_PRODUCT_NAME, "Surface Laptop 2"),
+		},
+		.driver_data = (void *)QUIRK_FW_RST_D3COLD,
+	},
 	{}
 };
 
@@ -29,4 +95,61 @@  void mwifiex_initialize_quirks(struct pcie_service_card *card)
 
 	if (!card->quirks)
 		dev_info(&pdev->dev, "no quirks enabled\n");
+	if (card->quirks & QUIRK_FW_RST_D3COLD)
+		dev_info(&pdev->dev, "quirk reset_d3cold enabled\n");
+}
+
+static void mwifiex_pcie_set_power_d3cold(struct pci_dev *pdev)
+{
+	dev_info(&pdev->dev, "putting into D3cold...\n");
+
+	pci_save_state(pdev);
+	if (pci_is_enabled(pdev))
+		pci_disable_device(pdev);
+	pci_set_power_state(pdev, PCI_D3cold);
+}
+
+static int mwifiex_pcie_set_power_d0(struct pci_dev *pdev)
+{
+	int ret;
+
+	dev_info(&pdev->dev, "putting into D0...\n");
+
+	pci_set_power_state(pdev, PCI_D0);
+	ret = pci_enable_device(pdev);
+	if (ret) {
+		dev_err(&pdev->dev, "pci_enable_device failed\n");
+		return ret;
+	}
+	pci_restore_state(pdev);
+
+	return 0;
+}
+
+int mwifiex_pcie_reset_d3cold_quirk(struct pci_dev *pdev)
+{
+	struct pci_dev *parent_pdev = pci_upstream_bridge(pdev);
+	int ret;
+
+	/* Power-cycle (put into D3cold then D0) */
+	dev_info(&pdev->dev, "Using reset_d3cold quirk to perform FW reset\n");
+
+	/* We need to perform power-cycle also for bridge of wifi because
+	 * on some devices (e.g. Surface Book 1), the OS for some reasons
+	 * can't know the real power state of the bridge.
+	 * When tried to power-cycle only wifi, the reset failed with the
+	 * following dmesg log:
+	 * "Cannot transition to power state D0 for parent in D3hot".
+	 */
+	mwifiex_pcie_set_power_d3cold(pdev);
+	mwifiex_pcie_set_power_d3cold(parent_pdev);
+
+	ret = mwifiex_pcie_set_power_d0(parent_pdev);
+	if (ret)
+		return ret;
+	ret = mwifiex_pcie_set_power_d0(pdev);
+	if (ret)
+		return ret;
+
+	return 0;
 }
diff --git a/drivers/net/wireless/marvell/mwifiex/pcie_quirks.h b/drivers/net/wireless/marvell/mwifiex/pcie_quirks.h
index 7a1fe3b3a61a..549093067813 100644
--- a/drivers/net/wireless/marvell/mwifiex/pcie_quirks.h
+++ b/drivers/net/wireless/marvell/mwifiex/pcie_quirks.h
@@ -5,4 +5,7 @@ 
 
 #include "pcie.h"
 
+#define QUIRK_FW_RST_D3COLD	BIT(0)
+
 void mwifiex_initialize_quirks(struct pcie_service_card *card);
+int mwifiex_pcie_reset_d3cold_quirk(struct pci_dev *pdev);