e1000e: Power cycle phy on PM resume

Message ID	20200923074751.10527-1-kai.heng.feng@canonical.com
State	New
Headers	Return-Path: <SRS0=b0vd=DA=vger.kernel.org=netdev-owner@kernel.org>
	From: Kai-Heng Feng <kai.heng.feng@canonical.com>
	To: jeffrey.t.kirsher@intel.com
	Cc: Kai-Heng Feng <kai.heng.feng@canonical.com>, "David S. Miller" <davem@davemloft.net>, Jakub Kicinski <kuba@kernel.org>, intel-wired-lan@lists.osuosl.org (moderated list:INTEL ETHERNET DRIVERS), netdev@vger.kernel.org (open list:NETWORKING DRIVERS), linux-kernel@vger.kernel.org (open list)
	Subject: [PATCH] e1000e: Power cycle phy on PM resume
	Date: Wed, 23 Sep 2020 15:47:51 +0800
	Message-Id: <20200923074751.10527-1-kai.heng.feng@canonical.com>
	Precedence: bulk
Series	e1000e: Power cycle phy on PM resume

Kai-Heng Feng Sept. 23, 2020, 7:47 a.m. UTC

We are seeing the following error after S3 resume:
[  704.746874] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
[  704.844232] e1000e 0000:00:1f.6 eno1: MDI Write did not complete
[  704.902817] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
[  704.903075] e1000e 0000:00:1f.6 eno1: reading PHY page 769 (or 0x6020 shifted) reg 0x17
[  704.903281] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
[  704.903486] e1000e 0000:00:1f.6 eno1: writing PHY page 769 (or 0x6020 shifted) reg 0x17
[  704.943155] e1000e 0000:00:1f.6 eno1: MDI Error
...
[  705.108161] e1000e 0000:00:1f.6 eno1: Hardware Error
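
The two numbers in the "PHY page 769 (or 0x6020 shifted)" lines above describe the same page in two encodings. A minimal sketch of the conversion, assuming a 5-bit page shift; the macro and function names are illustrative and not taken from this thread, so check them against the driver source:

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: the page number is shifted left by 5 bits so that the low
 * 5 bits can carry the register offset. The name is hypothetical. */
#define PAGE_SHIFT_ASSUMED 5

/* Convert a PHY page number to the shifted form printed in the log. */
static uint32_t phy_page_to_shifted(uint32_t page)
{
	/* Guard against shifting significant bits out of the 32-bit value. */
	assert(page <= (UINT32_MAX >> PAGE_SHIFT_ASSUMED));
	return page << PAGE_SHIFT_ASSUMED;
}

/* Recover the page number from the shifted form. */
static uint32_t phy_shifted_to_page(uint32_t shifted)
{
	return shifted >> PAGE_SHIFT_ASSUMED;
}
```

With these helpers, `phy_page_to_shifted(769)` yields `0x6020`, which matches the "Setting page 0x6020" lines.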

Since we don't know what platform firmware may do to the phy, so let's
power cycle the phy upon system resume to resolve the issue.

Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
---
 drivers/net/ethernet/intel/e1000e/netdev.c | 2 ++
 1 file changed, 2 insertions(+)

Andrew Lunn Sept. 23, 2020, 12:17 p.m. UTC | #1

On Wed, Sep 23, 2020 at 03:47:51PM +0800, Kai-Heng Feng wrote:
> We are seeing the following error after S3 resume:
> [  704.746874] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
> [  704.844232] e1000e 0000:00:1f.6 eno1: MDI Write did not complete
> [  704.902817] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
> [  704.903075] e1000e 0000:00:1f.6 eno1: reading PHY page 769 (or 0x6020 shifted) reg 0x17
> [  704.903281] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
> [  704.903486] e1000e 0000:00:1f.6 eno1: writing PHY page 769 (or 0x6020 shifted) reg 0x17
> [  704.943155] e1000e 0000:00:1f.6 eno1: MDI Error
> ...
> [  705.108161] e1000e 0000:00:1f.6 eno1: Hardware Error
> 
> Since we don't know what platform firmware may do to the phy, so let's
> power cycle the phy upon system resume to resolve the issue.
> 
> Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
> ---
>  drivers/net/ethernet/intel/e1000e/netdev.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
> index 664e8ccc88d2..c2a87a408102 100644
> --- a/drivers/net/ethernet/intel/e1000e/netdev.c
> +++ b/drivers/net/ethernet/intel/e1000e/netdev.c
> @@ -6968,6 +6968,8 @@ static __maybe_unused int e1000e_pm_resume(struct device *dev)
>  	    !e1000e_check_me(hw->adapter->pdev->device))
>  		e1000e_s0ix_exit_flow(adapter);
>  
> +	e1000_power_down_phy(adapter);
> +

static void e1000_power_down_phy(struct e1000_adapter *adapter)
{
	struct e1000_hw *hw = &adapter->hw;

	/* Power down the PHY so no link is implied when interface is down *
	 * The PHY cannot be powered down if any of the following is true *
	 * (a) WoL is enabled
	 * (b) AMT is active
	 * (c) SoL/IDER session is active
	 */
	if (!adapter->wol && hw->mac_type >= e1000_82540 &&
	   hw->media_type == e1000_media_type_copper) {

Could it be coming out of S3 because it just received a WoL?

It seems unlikely that it is the MII_CR_POWER_DOWN which is helping,
since that is an MDIO write itself. Do you actually know how this call
to e1000_power_down_phy() fixes the issues?

   Andrew

Paul Menzel Sept. 23, 2020, 1:28 p.m. UTC | #2

Dear Kai-Heng,


On 23.09.20 at 09:47, Kai-Heng Feng wrote:
> We are seeing the following error after S3 resume:
> [  704.746874] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
> [  704.844232] e1000e 0000:00:1f.6 eno1: MDI Write did not complete
> [  704.902817] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
> [  704.903075] e1000e 0000:00:1f.6 eno1: reading PHY page 769 (or 0x6020 shifted) reg 0x17
> [  704.903281] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
> [  704.903486] e1000e 0000:00:1f.6 eno1: writing PHY page 769 (or 0x6020 shifted) reg 0x17
> [  704.943155] e1000e 0000:00:1f.6 eno1: MDI Error
> ...
> [  705.108161] e1000e 0000:00:1f.6 eno1: Hardware Error
> 
> Since we don't know what platform firmware may do to the phy, so let's
> power cycle the phy upon system resume to resolve the issue.

Is there a bug report or list thread for this issue?

> Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
> ---
>   drivers/net/ethernet/intel/e1000e/netdev.c | 2 ++
>   1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
> index 664e8ccc88d2..c2a87a408102 100644
> --- a/drivers/net/ethernet/intel/e1000e/netdev.c
> +++ b/drivers/net/ethernet/intel/e1000e/netdev.c
> @@ -6968,6 +6968,8 @@ static __maybe_unused int e1000e_pm_resume(struct device *dev)
>   	    !e1000e_check_me(hw->adapter->pdev->device))
>   		e1000e_s0ix_exit_flow(adapter);
>   
> +	e1000_power_down_phy(adapter);
> +
>   	rc = __e1000_resume(pdev);
>   	if (rc)
>   		return rc;

How much does this increase the resume time?


Kind regards,

Paul

Kai-Heng Feng Sept. 23, 2020, 2:44 p.m. UTC | #3

Hi Andrew,

> On Sep 23, 2020, at 20:17, Andrew Lunn <andrew@lunn.ch> wrote:
> 
> On Wed, Sep 23, 2020 at 03:47:51PM +0800, Kai-Heng Feng wrote:
>> We are seeing the following error after S3 resume:
>> [  704.746874] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
>> [  704.844232] e1000e 0000:00:1f.6 eno1: MDI Write did not complete
>> [  704.902817] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
>> [  704.903075] e1000e 0000:00:1f.6 eno1: reading PHY page 769 (or 0x6020 shifted) reg 0x17
>> [  704.903281] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
>> [  704.903486] e1000e 0000:00:1f.6 eno1: writing PHY page 769 (or 0x6020 shifted) reg 0x17
>> [  704.943155] e1000e 0000:00:1f.6 eno1: MDI Error
>> ...
>> [  705.108161] e1000e 0000:00:1f.6 eno1: Hardware Error
>> 
>> Since we don't know what platform firmware may do to the phy, so let's
>> power cycle the phy upon system resume to resolve the issue.
>> 
>> Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
>> ---
>> drivers/net/ethernet/intel/e1000e/netdev.c | 2 ++
>> 1 file changed, 2 insertions(+)
>> 
>> diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
>> index 664e8ccc88d2..c2a87a408102 100644
>> --- a/drivers/net/ethernet/intel/e1000e/netdev.c
>> +++ b/drivers/net/ethernet/intel/e1000e/netdev.c
>> @@ -6968,6 +6968,8 @@ static __maybe_unused int e1000e_pm_resume(struct device *dev)
>> 	    !e1000e_check_me(hw->adapter->pdev->device))
>> 		e1000e_s0ix_exit_flow(adapter);
>> 
>> +	e1000_power_down_phy(adapter);
>> +
> 
> static void e1000_power_down_phy(struct e1000_adapter *adapter)
> {
> 	struct e1000_hw *hw = &adapter->hw;
> 
> 	/* Power down the PHY so no link is implied when interface is down *
> 	 * The PHY cannot be powered down if any of the following is true *
> 	 * (a) WoL is enabled
> 	 * (b) AMT is active
> 	 * (c) SoL/IDER session is active
> 	 */
> 	if (!adapter->wol && hw->mac_type >= e1000_82540 &&
> 	   hw->media_type == e1000_media_type_copper) {

Looks like the function comes from e1000, drivers/net/ethernet/intel/e1000/e1000_main.c.
However, this patch is for e1000e, so the function with the same name is different.

> 
> Could it be coming out of S3 because it just received a WoL?

No, the issue can be reproduced by pressing keyboard or rtcwake.

> 
> It seems unlikely that it is the MII_CR_POWER_DOWN which is helping,
> since that is an MDIO write itself. Do you actually know how this call
> to e1000_power_down_phy() fixes the issues?

I don't know from hardware's perspective, but I think the comment on e1000_power_down_phy_copper() can give us some insight:

/**
 * e1000_power_down_phy_copper - Restore copper link in case of PHY power down
 * @hw: pointer to the HW structure
 *
 * In the case of a PHY power down to save power, or to turn off link during a
 * driver unload, or wake on lan is not enabled, restore the link to previous
 * settings.                       
 **/
void e1000_power_down_phy_copper(struct e1000_hw *hw)
{
        u16 mii_reg = 0;

        /* The PHY will retain its settings across a power down/up cycle */
        e1e_rphy(hw, MII_BMCR, &mii_reg);
        mii_reg |= BMCR_PDOWN;
        e1e_wphy(hw, MII_BMCR, mii_reg);
        usleep_range(1000, 2000);
}

Kai-Heng

> 
>   Andrew

Kai-Heng Feng Sept. 23, 2020, 2:46 p.m. UTC | #4

Hi Paul,

> On Sep 23, 2020, at 21:28, Paul Menzel <pmenzel@molgen.mpg.de> wrote:
> 
> Dear Kai-Heng,
> 
> 
> On 23.09.20 at 09:47, Kai-Heng Feng wrote:
>> We are seeing the following error after S3 resume:
>> [  704.746874] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
>> [  704.844232] e1000e 0000:00:1f.6 eno1: MDI Write did not complete
>> [  704.902817] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
>> [  704.903075] e1000e 0000:00:1f.6 eno1: reading PHY page 769 (or 0x6020 shifted) reg 0x17
>> [  704.903281] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
>> [  704.903486] e1000e 0000:00:1f.6 eno1: writing PHY page 769 (or 0x6020 shifted) reg 0x17
>> [  704.943155] e1000e 0000:00:1f.6 eno1: MDI Error
>> ...
>> [  705.108161] e1000e 0000:00:1f.6 eno1: Hardware Error
>> Since we don't know what platform firmware may do to the phy, so let's
>> power cycle the phy upon system resume to resolve the issue.
> 
> Is there a bug report or list thread for this issue?


No. That's why I sent a patch to start discussion :)

> 
>> Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
>> ---
>>  drivers/net/ethernet/intel/e1000e/netdev.c | 2 ++
>>  1 file changed, 2 insertions(+)
>> diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
>> index 664e8ccc88d2..c2a87a408102 100644
>> --- a/drivers/net/ethernet/intel/e1000e/netdev.c
>> +++ b/drivers/net/ethernet/intel/e1000e/netdev.c
>> @@ -6968,6 +6968,8 @@ static __maybe_unused int e1000e_pm_resume(struct device *dev)
>>  	    !e1000e_check_me(hw->adapter->pdev->device))
>>  		e1000e_s0ix_exit_flow(adapter);
>>  
>> +	e1000_power_down_phy(adapter);
>> +
>>  	rc = __e1000_resume(pdev);
>>  	if (rc)
>>  		return rc;
> 
> How much does this increase the resume time?


I didn't measure it. Because for me it's more important to have a working device.

Does it have a noticeable impact on your system's resume time?

Kai-Heng

> 
> 
> Kind regards,
> 
> Paul
> 

Paul Menzel Sept. 23, 2020, 3:02 p.m. UTC | #5

Dear Kai-Heng,


On 23.09.20 at 16:46, Kai-Heng Feng wrote:
>> On Sep 23, 2020, at 21:28, Paul Menzel wrote:
>> On 23.09.20 at 09:47, Kai-Heng Feng wrote:
>>> We are seeing the following error after S3 resume:
>>> [  704.746874] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
>>> [  704.844232] e1000e 0000:00:1f.6 eno1: MDI Write did not complete
>>> [  704.902817] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
>>> [  704.903075] e1000e 0000:00:1f.6 eno1: reading PHY page 769 (or 0x6020 shifted) reg 0x17
>>> [  704.903281] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
>>> [  704.903486] e1000e 0000:00:1f.6 eno1: writing PHY page 769 (or 0x6020 shifted) reg 0x17
>>> [  704.943155] e1000e 0000:00:1f.6 eno1: MDI Error
>>> ...
>>> [  705.108161] e1000e 0000:00:1f.6 eno1: Hardware Error
>>> Since we don't know what platform firmware may do to the phy, so let's
>>> power cycle the phy upon system resume to resolve the issue.
>>
>> Is there a bug report or list thread for this issue?
> 
> No. That's why I sent a patch to start discussion :)


Then please state in the commit message which systems are affected.

>>> Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
>>> ---
>>>   drivers/net/ethernet/intel/e1000e/netdev.c | 2 ++
>>>   1 file changed, 2 insertions(+)
>>> diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
>>> index 664e8ccc88d2..c2a87a408102 100644
>>> --- a/drivers/net/ethernet/intel/e1000e/netdev.c
>>> +++ b/drivers/net/ethernet/intel/e1000e/netdev.c
>>> @@ -6968,6 +6968,8 @@ static __maybe_unused int e1000e_pm_resume(struct device *dev)
>>>   	    !e1000e_check_me(hw->adapter->pdev->device))
>>>   		e1000e_s0ix_exit_flow(adapter);
>>>   
>>> +	e1000_power_down_phy(adapter);
>>> +
>>>   	rc = __e1000_resume(pdev);
>>>   	if (rc)
>>>   		return rc;
>>
>> How much does this increase the resume time?
> 
> I didn't measure it. Because for me it's more important to have a working device.
> 
> Does it have a noticeable impact on your system's resume time?


I am not able to test the patch right now. But resume time is important 
to me. As I do not have the problem, nothing should be changed for my 
system (Dell Latitude E7250).

     00:19.0 Ethernet controller [0200]: Intel Corporation Ethernet Connection (3) I218-LM [8086:15a2] (rev 03)


Kind regards,

Paul

Andrew Lunn Sept. 23, 2020, 3:37 p.m. UTC | #6

On Wed, Sep 23, 2020 at 10:44:10PM +0800, Kai-Heng Feng wrote:
> Hi Andrew,
> 
> > On Sep 23, 2020, at 20:17, Andrew Lunn <andrew@lunn.ch> wrote:
> > 
> > On Wed, Sep 23, 2020 at 03:47:51PM +0800, Kai-Heng Feng wrote:
> >> We are seeing the following error after S3 resume:
> >> [  704.746874] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
> >> [  704.844232] e1000e 0000:00:1f.6 eno1: MDI Write did not complete
> >> [  704.902817] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
> >> [  704.903075] e1000e 0000:00:1f.6 eno1: reading PHY page 769 (or 0x6020 shifted) reg 0x17
> >> [  704.903281] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
> >> [  704.903486] e1000e 0000:00:1f.6 eno1: writing PHY page 769 (or 0x6020 shifted) reg 0x17
> >> [  704.943155] e1000e 0000:00:1f.6 eno1: MDI Error
> >> ...
> >> [  705.108161] e1000e 0000:00:1f.6 eno1: Hardware Error
> >> 
> >> Since we don't know what platform firmware may do to the phy, so let's
> >> power cycle the phy upon system resume to resolve the issue.
> >> 
> >> Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
> >> ---
> >> drivers/net/ethernet/intel/e1000e/netdev.c | 2 ++
> >> 1 file changed, 2 insertions(+)
> >> 
> >> diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
> >> index 664e8ccc88d2..c2a87a408102 100644
> >> --- a/drivers/net/ethernet/intel/e1000e/netdev.c
> >> +++ b/drivers/net/ethernet/intel/e1000e/netdev.c
> >> @@ -6968,6 +6968,8 @@ static __maybe_unused int e1000e_pm_resume(struct device *dev)
> >> 	    !e1000e_check_me(hw->adapter->pdev->device))
> >> 		e1000e_s0ix_exit_flow(adapter);
> >> 
> >> +	e1000_power_down_phy(adapter);
> >> +
> > 
> > static void e1000_power_down_phy(struct e1000_adapter *adapter)
> > {
> > 	struct e1000_hw *hw = &adapter->hw;
> > 
> > 	/* Power down the PHY so no link is implied when interface is down *
> > 	 * The PHY cannot be powered down if any of the following is true *
> > 	 * (a) WoL is enabled
> > 	 * (b) AMT is active
> > 	 * (c) SoL/IDER session is active
> > 	 */
> > 	if (!adapter->wol && hw->mac_type >= e1000_82540 &&
> > 	   hw->media_type == e1000_media_type_copper) {
> 
> Looks like the function comes from e1000, drivers/net/ethernet/intel/e1000/e1000_main.c.
> However, this patch is for e1000e, so the function with same name is different.


Ah! Sorry. Missed that. Also it is not nice there are two functions in
the kernel with the same name.

> > Could it be coming out of S3 because it just received a WoL?
> 
> No, the issue can be reproduced by pressing keyboard or rtcwake.

 
Not relevant now, since i was looking at the wrong function. But i was
meaning the call is a NOP in the case WoL caused the wake up. So if
the issues can also happen after WoL, your fix is not going to fix it.

> > It seems unlikely that it is the MII_CR_POWER_DOWN which is helping,
> > since that is an MDIO write itself. Do you actually know how this call
> > to e1000_power_down_phy() fixes the issues?
> 
> I don't know from hardware's perspective, but I think the comment on
> e1000_power_down_phy_copper() can give us some insight:


And there is only one function called e1000_power_down_phy_copper()
:-)

> 
> /**
>  * e1000_power_down_phy_copper - Restore copper link in case of PHY power down
>  * @hw: pointer to the HW structure
>  *
>  * In the case of a PHY power down to save power, or to turn off link during a
>  * driver unload, or wake on lan is not enabled, restore the link to previous
>  * settings.
>  **/
> void e1000_power_down_phy_copper(struct e1000_hw *hw)
> {
>         u16 mii_reg = 0;
> 
>         /* The PHY will retain its settings across a power down/up cycle */
>         e1e_rphy(hw, MII_BMCR, &mii_reg);
>         mii_reg |= BMCR_PDOWN;
>         e1e_wphy(hw, MII_BMCR, mii_reg);
>         usleep_range(1000, 2000);
> }


I don't really see how this explains this:

> >> [  704.746874] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
> >> [  704.844232] e1000e 0000:00:1f.6 eno1: MDI Write did not complete


https://elixir.bootlin.com/linux/latest/source/drivers/net/ethernet/intel/e1000e/phy.c#L181

So first off, the comments are all cut/paste from
e1000e_read_phy_reg_mdic(). It would be nice to s/read/write/g in that
function.

So it sets up the transaction and starts it. MDIO is a serial bus with
no acknowledgements. You clock out around 64 bits, and hope the PHY
receives it. The time it takes to send those 64 bits is fixed by the
bus speed, typically 2.5MHz.
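
As a rough worked example of the timing argument above (a sketch, not driver code; the function name is hypothetical): the time to shift out a frame depends only on the bit count and the bus clock.

```c
#include <assert.h>
#include <stdint.h>

/* Time in nanoseconds needed to clock `bits` bits out at `clock_hz`.
 * Illustrative helper; nothing like it exists in the thread or driver. */
static uint64_t mdio_frame_time_ns(uint32_t bits, uint32_t clock_hz)
{
	assert(clock_hz != 0);
	return ((uint64_t)bits * 1000000000ULL) / clock_hz;
}
```

For the figures quoted here, `mdio_frame_time_ns(64, 2500000)` gives 25600 ns, about 25.6 microseconds. The "Setting page" and "MDI Write did not complete" log lines are roughly 97 ms apart, far longer than a single frame would need.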

So the driver polls waiting for the hardware to say the bits have been
sent. And this is timing out. How long that takes has nothing to do
with the PHY, or what state it is in. Powering down the PHY has no
effect on the MDIO bus master, and how long it takes to shift those
bits out. Which is why i don't think this patch is correct. This is
probably an MDIO bus issue, not a PHY issue.
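
The polling pattern described above can be sketched as follows. This is a simplified stand-alone model under stated assumptions: `struct fake_mdic`, `fake_mdic_read()` and `wait_for_ready()` are hypothetical stand-ins for the real register access, and the ready-bit position is assumed rather than taken from this thread.

```c
#include <stdint.h>

/* Assumed position of the "transaction done" flag in the control register. */
#define READY_BIT 0x10000000u

/* Hypothetical model of the bus master: it reports ready only after
 * `polls_until_ready` reads, independent of any PHY state. */
struct fake_mdic {
	unsigned int polls_until_ready;
};

static uint32_t fake_mdic_read(struct fake_mdic *reg)
{
	if (reg->polls_until_ready == 0)
		return READY_BIT;
	reg->polls_until_ready--;
	return 0;
}

/* Poll up to max_polls times. Returns 0 once the ready flag is seen,
 * or -1 on timeout (the "did not complete" case). */
static int wait_for_ready(struct fake_mdic *reg, unsigned int max_polls)
{
	for (unsigned int i = 0; i < max_polls; i++) {
		if (fake_mdic_read(reg) & READY_BIT)
			return 0;
		/* A real driver would delay between reads here. */
	}
	return -1;
}
```

In this model the timeout depends only on whether the bus master sets the ready flag within the polling budget, which is the point being made about the PHY state not mattering.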

Try dumping the value of MDIC in the good/bad case before the
transaction starts.
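
To make such a dump easier to compare, the raw value could be split into fields. A minimal sketch, assuming the MDIC bit layout below; the masks are believed to mirror the e1000e definitions but are reproduced from memory, so verify them against the driver headers before relying on them. The struct and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed MDIC layout; check against the driver's register definitions. */
#define MDIC_DATA_MASK 0x0000FFFFu
#define MDIC_REG_MASK  0x001F0000u
#define MDIC_REG_SHIFT 16
#define MDIC_PHY_MASK  0x03E00000u
#define MDIC_PHY_SHIFT 21
#define MDIC_OP_WRITE  0x04000000u
#define MDIC_OP_READ   0x08000000u
#define MDIC_READY     0x10000000u
#define MDIC_ERROR     0x40000000u

struct mdic_fields {
	uint16_t data;   /* 16-bit payload */
	uint8_t reg;     /* PHY register offset */
	uint8_t phy;     /* PHY address */
	bool op_write;
	bool op_read;
	bool ready;
	bool error;
};

/* Split a raw MDIC value into its fields for logging or comparison. */
static struct mdic_fields mdic_decode(uint32_t mdic)
{
	struct mdic_fields f;

	f.data = (uint16_t)(mdic & MDIC_DATA_MASK);
	f.reg = (uint8_t)((mdic & MDIC_REG_MASK) >> MDIC_REG_SHIFT);
	f.phy = (uint8_t)((mdic & MDIC_PHY_MASK) >> MDIC_PHY_SHIFT);
	f.op_write = (mdic & MDIC_OP_WRITE) != 0;
	f.op_read = (mdic & MDIC_OP_READ) != 0;
	f.ready = (mdic & MDIC_READY) != 0;
	f.error = (mdic & MDIC_ERROR) != 0;
	return f;
}
```

In the driver, the raw value would presumably be read from the register just before the transaction is set up and printed in both the working and the failing case, so that a stale ready or error flag left over from before suspend would show up in the comparison.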

	 Andrew

Andrew Lunn Sept. 23, 2020, 7:28 p.m. UTC | #7

> > > How much does this increase the resume time?


Define resume time? Until you get the display manager unlock screen?
Or do you need working networking?

It takes around 1.5 seconds for auto negotiation to get a link. I know
it takes me longer than that to move my fingers to the keyboard and
type in my password to unlock the screen. So by the time you actually
get to see your desktop, you should have link.

I've no idea about how the e1000e driver does link negotiation. But
powering the PHY off means there is going to be a negotiation sometime
later. But if you don't turn it off, the driver might be able to avoid
doing an autoneg if the PHY has already done one when it got powered
up.

      Andrew

Kai-Heng Feng Sept. 24, 2020, 12:50 p.m. UTC | #8

Hi Andrew,

> On Sep 23, 2020, at 23:37, Andrew Lunn <andrew@lunn.ch> wrote:
> 
> On Wed, Sep 23, 2020 at 10:44:10PM +0800, Kai-Heng Feng wrote:
>> Hi Andrew,
>> 
>>> On Sep 23, 2020, at 20:17, Andrew Lunn <andrew@lunn.ch> wrote:
>>> 
>>> On Wed, Sep 23, 2020 at 03:47:51PM +0800, Kai-Heng Feng wrote:
>>>> We are seeing the following error after S3 resume:
>>>> [  704.746874] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
>>>> [  704.844232] e1000e 0000:00:1f.6 eno1: MDI Write did not complete
>>>> [  704.902817] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
>>>> [  704.903075] e1000e 0000:00:1f.6 eno1: reading PHY page 769 (or 0x6020 shifted) reg 0x17
>>>> [  704.903281] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
>>>> [  704.903486] e1000e 0000:00:1f.6 eno1: writing PHY page 769 (or 0x6020 shifted) reg 0x17
>>>> [  704.943155] e1000e 0000:00:1f.6 eno1: MDI Error
>>>> ...
>>>> [  705.108161] e1000e 0000:00:1f.6 eno1: Hardware Error
>>>> 
>>>> Since we don't know what platform firmware may do to the phy, so let's
>>>> power cycle the phy upon system resume to resolve the issue.
>>>> 
>>>> Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
>>>> ---
>>>> drivers/net/ethernet/intel/e1000e/netdev.c | 2 ++
>>>> 1 file changed, 2 insertions(+)
>>>> 
>>>> diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
>>>> index 664e8ccc88d2..c2a87a408102 100644
>>>> --- a/drivers/net/ethernet/intel/e1000e/netdev.c
>>>> +++ b/drivers/net/ethernet/intel/e1000e/netdev.c
>>>> @@ -6968,6 +6968,8 @@ static __maybe_unused int e1000e_pm_resume(struct device *dev)
>>>> 	    !e1000e_check_me(hw->adapter->pdev->device))
>>>> 		e1000e_s0ix_exit_flow(adapter);
>>>> 
>>>> +	e1000_power_down_phy(adapter);
>>>> +
>>> 
>>> static void e1000_power_down_phy(struct e1000_adapter *adapter)
>>> {
>>> 	struct e1000_hw *hw = &adapter->hw;
>>> 
>>> 	/* Power down the PHY so no link is implied when interface is down *
>>> 	 * The PHY cannot be powered down if any of the following is true *
>>> 	 * (a) WoL is enabled
>>> 	 * (b) AMT is active
>>> 	 * (c) SoL/IDER session is active
>>> 	 */
>>> 	if (!adapter->wol && hw->mac_type >= e1000_82540 &&
>>> 	   hw->media_type == e1000_media_type_copper) {
>> 
>> Looks like the function comes from e1000, drivers/net/ethernet/intel/e1000/e1000_main.c.
>> However, this patch is for e1000e, so the function with same name is different.
> 
> Ah! Sorry. Missed that. Also it is not nice there are two functions in
> the kernel with the same name.
> 
>>> Could it be coming out of S3 because it just received a WoL?
>> 
>> No, the issue can be reproduced by pressing keyboard or rtcwake.
> 
> Not relevant now, since i was looking at the wrong function. But i was
> meaning the call is a NOP in the case WoL caused the wake up. So if
> the issues can also happen after WoL, your fix is not going to fix it.
> 
>>> It seems unlikely that it is the MII_CR_POWER_DOWN which is helping,
>>> since that is an MDIO write itself. Do you actually know how this call
>>> to e1000_power_down_phy() fixes the issues?
>> 
> 
>> I don't know from hardware's perspective, but I think the comment on
>> e1000_power_down_phy_copper() can give us some insight:
> 
> And there is only one function called e1000_power_down_phy_copper()
> :-)
> 
>> 
>> /**
>> * e1000_power_down_phy_copper - Restore copper link in case of PHY power down
>> * @hw: pointer to the HW structure
>> *
>> * In the case of a PHY power down to save power, or to turn off link during a
>> * driver unload, or wake on lan is not enabled, restore the link to previous
>> * settings.
>> **/
>> void e1000_power_down_phy_copper(struct e1000_hw *hw)
>> {
>>        u16 mii_reg = 0;
>> 
>>        /* The PHY will retain its settings across a power down/up cycle */
>>        e1e_rphy(hw, MII_BMCR, &mii_reg);
>>        mii_reg |= BMCR_PDOWN;
>>        e1e_wphy(hw, MII_BMCR, mii_reg);
>>        usleep_range(1000, 2000);
>> }
> 
> I don't really see how this explains this:
> 
>>>> [  704.746874] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
>>>> [  704.844232] e1000e 0000:00:1f.6 eno1: MDI Write did not complete
> 
> https://elixir.bootlin.com/linux/latest/source/drivers/net/ethernet/intel/e1000e/phy.c#L181
> 
> So first off, the comments are all cut/paste from
> e1000e_read_phy_reg_mdic(). It would be nice to s/read/write/g in that
> function.


Ah yes...

> 
> So it sets up the transaction and starts it. MDIO is a serial bus with
> no acknowledgements. You clock out around 64 bits, and hope the PHY
> receives it. The time it takes to send those 64 bits is fixed by the
> bus speed, typically 2.5MHz.


Thanks for the info.

> 
> So the driver polls waiting for the hardware to say the bits have been
> sent. And this is timing out. How long that takes has nothing to do
> with the PHY, or what state it is in. Powering down the PHY has no
> effect on the MDIO bus master, and how long it takes to shift those
> bits out. Which is why i don't think this patch is correct. This is
> probably an MDIO bus issue, not a PHY issue.


Thanks for pointing out the possible root cause.
Indeed this looks like an MDIO issue so this patch is completely wrong.

I'll send a V2, thanks.

Kai-Heng

> 
> Try dumping the value of MDIC in the good/bad case before the
> transaction starts.
> 
> 	 Andrew

Paul Menzel Sept. 24, 2020, 1:02 p.m. UTC | #9

Dear Andrew,


On 23.09.20 at 21:28, Andrew Lunn wrote:
>>>> How much does this increase the resume time?
> 
> Define resume time? Until you get the display manager unlock screen?
> Or do you need working networking?

Until network is functional again. Currently, the speed negotiation 
alone takes three(?) seconds, so making it even longer is unacceptable. 
(You wrote it below.)

> It takes around 1.5 seconds for auto negotiation to get a link. I know
> it takes me longer than that to move my fingers to the keyboard and
> type in my password to unlock the screen. So by the time you actually
> get to see your desktop, you should have link.

Not here.

> I've no idea about how the e1000e driver does link negotiation. But
> powering the PHY off means there is going to be a negotiation sometime
> later. But if you don't turn it off, the driver might be able to avoid
> doing an autoneg if the PHY has already done one when it got powered
> up.

Indeed.


Kind regards,

Paul
