diff mbox series

[RFC,net-next] net: stmmac: should not modify RX descriptor when STMMAC resume

Message ID 20210419115921.19219-1-qiangqing.zhang@nxp.com
State New
Headers show
Series [RFC,net-next] net: stmmac: should not modify RX descriptor when STMMAC resume | expand

Commit Message

Joakim Zhang April 19, 2021, 11:59 a.m. UTC
When system resume back, STMMAC will clear RX descriptors:
stmmac_resume()
	->stmmac_clear_descriptors()
		->stmmac_clear_rx_descriptors()
			->stmmac_init_rx_desc()
				->dwmac4_set_rx_owner()
				//p->des3 |= cpu_to_le32(RDES3_OWN | RDES3_BUFFER1_VALID_ADDR);
It only assets OWN and BUF1V bits in desc3 field, doesn't clear desc0/1/2 fields.

Let's take a case into account, when system suspend, it is possible that
there are packets have not received yet, so the RX descriptors are wrote
back by DMA, e.g.
008 [0x00000000c4310080]: 0x0 0x40 0x0 0x34010040

When system resume back, after above process, it became a broken
descriptor:
008 [0x00000000c4310080]: 0x0 0x40 0x0 0xb5010040

The issue is that it only changes the owner of this descriptor, but do nothing
about desc0/1/2 fields. The descriptor of STMMAC a bit special, applicaton
prepares RX descriptors for DMA, after DMA recevie the packets, it will write
back the descriptors, so the same field of a descriptor have different
meanings to application and DMA. It should be a software bug there, and may
not easy to reproduce, but there is a certain probability that it will
occur.

Commit 9c63faaa931e ("net: stmmac: re-init rx buffers when mac resume back") tried
to re-init desc0/desc1 (buffer address fields) to fix this issue, but it
is not a proper solution, and made regression on Jetson TX2 boards.

It is unreasonable to modify RX descriptors outside of stmmac_rx_refill() function,
where it will clear all desc0/desc1/desc2/desc3 fields together.

This patch removes RX descriptors modification when STMMAC resume.

Signed-off-by: Joakim Zhang <qiangqing.zhang@nxp.com>
---
 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Comments

Jon Hunter April 20, 2021, 1:33 p.m. UTC | #1
On 20/04/2021 02:49, Joakim Zhang wrote:

...

>> I have tested this patch, but unfortunately the board still fails to resume
>> correctly. So it appears to suffer with the same issue we saw on the previous
>> implementation.
> 
> Could I double check with you? Have you reverted Commit 9c63faaa931e ("net: stmmac: re-init rx buffers when mac resume back") and then apply above patch to do the test?
> 
> If yes, you still saw the same issue with Commit 9c63faaa931e? Let's recall the problem, system suspended, but system hang when STMMAC resume back, right?


I tested your patch on top of next-20210419 which has Thierry's revert
of 9c63faaa931e. So yes this is reverted. Unfortunately, with this
change resuming from suspend still does not work.

Jon
Joakim Zhang April 22, 2021, 4:53 a.m. UTC | #2
> -----Original Message-----

> From: Jon Hunter <jonathanh@nvidia.com>

> Sent: 2021年4月20日 21:34

> To: Joakim Zhang <qiangqing.zhang@nxp.com>; peppe.cavallaro@st.com;

> alexandre.torgue@foss.st.com; joabreu@synopsys.com;

> davem@davemloft.net; kuba@kernel.org; mcoquelin.stm32@gmail.com;

> andrew@lunn.ch; f.fainelli@gmail.com

> Cc: dl-linux-imx <linux-imx@nxp.com>; treding@nvidia.com;

> netdev@vger.kernel.org

> Subject: Re: [RFC net-next] net: stmmac: should not modify RX descriptor when

> STMMAC resume

> 

> 

> 

> On 20/04/2021 02:49, Joakim Zhang wrote:

> 

> ...

> 

> >> I have tested this patch, but unfortunately the board still fails to

> >> resume correctly. So it appears to suffer with the same issue we saw

> >> on the previous implementation.

> >

> > Could I double check with you? Have you reverted Commit 9c63faaa931e

> ("net: stmmac: re-init rx buffers when mac resume back") and then apply above

> patch to do the test?

> >

> > If yes, you still saw the same issue with Commit 9c63faaa931e? Let's recall

> the problem, system suspended, but system hang when STMMAC resume back,

> right?

>

> 

> I tested your patch on top of next-20210419 which has Thierry's revert of

> 9c63faaa931e. So yes this is reverted. Unfortunately, with this change

> resuming from suspend still does not work.



Hi Jakub, Andrew,

Could you please help review this patch? It's really beyond my comprehension, why this patch would affect Tegra186 Jetson TX2 board?
Thanks a lot!

Best Regards,
Joakim Zhang
> Jon

> 

> --

> nvpublic
Jakub Kicinski April 22, 2021, 3:56 p.m. UTC | #3
On Thu, 22 Apr 2021 04:53:08 +0000 Joakim Zhang wrote:
> Could you please help review this patch? It's really beyond my

> comprehension, why this patch would affect Tegra186 Jetson TX2 board?


Looks okay, please repost as non-RFC.
Florian Fainelli April 22, 2021, 4:12 p.m. UTC | #4
On 4/21/2021 9:53 PM, Joakim Zhang wrote:
> 

>> -----Original Message-----

>> From: Jon Hunter <jonathanh@nvidia.com>

>> Sent: 2021年4月20日 21:34

>> To: Joakim Zhang <qiangqing.zhang@nxp.com>; peppe.cavallaro@st.com;

>> alexandre.torgue@foss.st.com; joabreu@synopsys.com;

>> davem@davemloft.net; kuba@kernel.org; mcoquelin.stm32@gmail.com;

>> andrew@lunn.ch; f.fainelli@gmail.com

>> Cc: dl-linux-imx <linux-imx@nxp.com>; treding@nvidia.com;

>> netdev@vger.kernel.org

>> Subject: Re: [RFC net-next] net: stmmac: should not modify RX descriptor when

>> STMMAC resume

>>

>>

>>

>> On 20/04/2021 02:49, Joakim Zhang wrote:

>>

>> ...

>>

>>>> I have tested this patch, but unfortunately the board still fails to

>>>> resume correctly. So it appears to suffer with the same issue we saw

>>>> on the previous implementation.

>>>

>>> Could I double check with you? Have you reverted Commit 9c63faaa931e

>> ("net: stmmac: re-init rx buffers when mac resume back") and then apply above

>> patch to do the test?

>>>

>>> If yes, you still saw the same issue with Commit 9c63faaa931e? Let's recall

>> the problem, system suspended, but system hang when STMMAC resume back,

>> right?

>>

>>

>> I tested your patch on top of next-20210419 which has Thierry's revert of

>> 9c63faaa931e. So yes this is reverted. Unfortunately, with this change

>> resuming from suspend still does not work.

> 

> 

> Hi Jakub, Andrew,

> 

> Could you please help review this patch? It's really beyond my comprehension, why this patch would affect Tegra186 Jetson TX2 board?

> Thanks a lot!


What does the resumption failure looks like? Does the stmmac driver
successfully resume from your suspend state, but there is no network
traffic? Do you have a log by any chance?

Is power to the Ethernet MAC turned off in this suspend state, in which
case could we be missing an essential register programming stage?
-- 
Florian
Florian Fainelli April 22, 2021, 4:31 p.m. UTC | #5
On 4/19/2021 4:59 AM, Joakim Zhang wrote:
> When system resume back, STMMAC will clear RX descriptors:

> stmmac_resume()

> 	->stmmac_clear_descriptors()

> 		->stmmac_clear_rx_descriptors()

> 			->stmmac_init_rx_desc()

> 				->dwmac4_set_rx_owner()

> 				//p->des3 |= cpu_to_le32(RDES3_OWN | RDES3_BUFFER1_VALID_ADDR);

> It only assets OWN and BUF1V bits in desc3 field, doesn't clear desc0/1/2 fields.

> 

> Let's take a case into account, when system suspend, it is possible that

> there are packets have not received yet, so the RX descriptors are wrote

> back by DMA, e.g.

> 008 [0x00000000c4310080]: 0x0 0x40 0x0 0x34010040

> 

> When system resume back, after above process, it became a broken

> descriptor:

> 008 [0x00000000c4310080]: 0x0 0x40 0x0 0xb5010040

> 

> The issue is that it only changes the owner of this descriptor, but do nothing

> about desc0/1/2 fields. The descriptor of STMMAC a bit special, applicaton

> prepares RX descriptors for DMA, after DMA recevie the packets, it will write

> back the descriptors, so the same field of a descriptor have different

> meanings to application and DMA. It should be a software bug there, and may

> not easy to reproduce, but there is a certain probability that it will

> occur.

> 

> Commit 9c63faaa931e ("net: stmmac: re-init rx buffers when mac resume back") tried

> to re-init desc0/desc1 (buffer address fields) to fix this issue, but it

> is not a proper solution, and made regression on Jetson TX2 boards.

> 

> It is unreasonable to modify RX descriptors outside of stmmac_rx_refill() function,

> where it will clear all desc0/desc1/desc2/desc3 fields together.

> 

> This patch removes RX descriptors modification when STMMAC resume.


Your patch makes sense to me, however the explanation seems to highlight
that you may have a few cases to consider while you suspend.

Usually you will turn off the RX DMA such that DMA into DRAM stops
there, this may not an entirely atomic operation as the MAC may have to
wait for a certain packet boundary to be crossed, that could leave you
with descriptors in 3 states I believe:

- descriptor is ready for RX DMA to process and is owned by RX DMA, no
need to do anything

- descriptor has been fully consumed by the CPU and is owned by the CPU,
CPU should be putting the descriptor back on the ring and relinquish
ownership

- descriptor has been written to DRAM but not processed by CPU, and it
should be put back on the ring for RX DMA to use it

Out of suspend, don't you need to deal with descriptors in cases 2 and 3
somehow? Does the DMA skip over descriptors that are still marked as
owned by the CPU or does it stop/stall?
-- 
Florian
Jon Hunter April 22, 2021, 5 p.m. UTC | #6
On 22/04/2021 17:12, Florian Fainelli wrote:

...

> What does the resumption failure looks like? Does the stmmac driver

> successfully resume from your suspend state, but there is no network

> traffic? Do you have a log by any chance?


The board fails to resume and appears to hang. With regard to the
original patch I did find that moving the code to re-init the RX buffers
to before the PHY is enabled did work [0].

> Is power to the Ethernet MAC turned off in this suspend state, in which

> case could we be missing an essential register programming stage?


It seems to be more of a sequencing issue rather than a power issue.

I have also ran 2000 suspend cycles on our Tegra platform without
Joakim's patch to see how stable suspend is on this platform. I did not
see any failures in 2000 cycles and so it is not evident to me that the
problem that Joakim is trying to fix is seen on devices such as Tegra.
Admittedly, if it is hard to reproduce, then it is possible we have not
seen it yet.

Jon

[0]
https://lore.kernel.org/netdev/e4864046-e52f-63b6-a490-74c3cd8045f4@nvidia.com/

-- 
nvpublic
Florian Fainelli April 22, 2021, 5:32 p.m. UTC | #7
On 4/22/2021 10:00 AM, Jon Hunter wrote:
> 

> On 22/04/2021 17:12, Florian Fainelli wrote:

> 

> ...

> 

>> What does the resumption failure looks like? Does the stmmac driver

>> successfully resume from your suspend state, but there is no network

>> traffic? Do you have a log by any chance?

> 

> The board fails to resume and appears to hang. With regard to the

> original patch I did find that moving the code to re-init the RX buffers

> to before the PHY is enabled did work [0].


You indicated that you are using a Broadcom PHY, which specific PHY are
you using?

I suspect that the stmmac is somehow relying on the PHY to provide its
125MHz RXC clock back to you in order to have its RX logic work correctly.

One difference between using the Broadcom PHY and the Generic PHY
drivers could be whether your Broadcom PHY driver entry has a
.suspend/.resume callback implemented or not.

> 

>> Is power to the Ethernet MAC turned off in this suspend state, in which

>> case could we be missing an essential register programming stage?

> 

> It seems to be more of a sequencing issue rather than a power issue.

> 

> I have also ran 2000 suspend cycles on our Tegra platform without

> Joakim's patch to see how stable suspend is on this platform. I did not

> see any failures in 2000 cycles and so it is not evident to me that the

> problem that Joakim is trying to fix is seen on devices such as Tegra.

> Admittedly, if it is hard to reproduce, then it is possible we have not

> seen it yet.

> 

> Jon

> 

> [0]

> https://lore.kernel.org/netdev/e4864046-e52f-63b6-a490-74c3cd8045f4@nvidia.com/

> 


-- 
Florian
Jon Hunter April 22, 2021, 6:18 p.m. UTC | #8
On 22/04/2021 18:32, Florian Fainelli wrote:
> 

> 

> On 4/22/2021 10:00 AM, Jon Hunter wrote:

>>

>> On 22/04/2021 17:12, Florian Fainelli wrote:

>>

>> ...

>>

>>> What does the resumption failure looks like? Does the stmmac driver

>>> successfully resume from your suspend state, but there is no network

>>> traffic? Do you have a log by any chance?

>>

>> The board fails to resume and appears to hang. With regard to the

>> original patch I did find that moving the code to re-init the RX buffers

>> to before the PHY is enabled did work [0].

> 

> You indicated that you are using a Broadcom PHY, which specific PHY are

> you using?

> 

> I suspect that the stmmac is somehow relying on the PHY to provide its

> 125MHz RXC clock back to you in order to have its RX logic work correctly.

> 

> One difference between using the Broadcom PHY and the Generic PHY

> drivers could be whether your Broadcom PHY driver entry has a

> .suspend/.resume callback implemented or not.



This board has a BCM89610 and uses the drivers/net/phy/broadcom.c
driver. Interestingly I don't see any suspend/resume handlers for this phy.

Cheers
Jon

-- 
nvpublic
Florian Fainelli April 22, 2021, 6:20 p.m. UTC | #9
On 4/22/2021 11:18 AM, Jon Hunter wrote:
> 

> On 22/04/2021 18:32, Florian Fainelli wrote:

>>

>>

>> On 4/22/2021 10:00 AM, Jon Hunter wrote:

>>>

>>> On 22/04/2021 17:12, Florian Fainelli wrote:

>>>

>>> ...

>>>

>>>> What does the resumption failure looks like? Does the stmmac driver

>>>> successfully resume from your suspend state, but there is no network

>>>> traffic? Do you have a log by any chance?

>>>

>>> The board fails to resume and appears to hang. With regard to the

>>> original patch I did find that moving the code to re-init the RX buffers

>>> to before the PHY is enabled did work [0].

>>

>> You indicated that you are using a Broadcom PHY, which specific PHY are

>> you using?

>>

>> I suspect that the stmmac is somehow relying on the PHY to provide its

>> 125MHz RXC clock back to you in order to have its RX logic work correctly.

>>

>> One difference between using the Broadcom PHY and the Generic PHY

>> drivers could be whether your Broadcom PHY driver entry has a

>> .suspend/.resume callback implemented or not.

> 

> 

> This board has a BCM89610 and uses the drivers/net/phy/broadcom.c

> driver. Interestingly I don't see any suspend/resume handlers for this phy.


Now if you do add .suspend = genphy_suspend and .resume = bcm54xx_resume
for this PHY entry does it work better?
-- 
Florian
Joakim Zhang April 23, 2021, 2:17 a.m. UTC | #10
Hi Florian,

Thanks for your comments.

> -----Original Message-----

> From: Florian Fainelli <f.fainelli@gmail.com>

> Sent: 2021年4月23日 0:32

> To: Joakim Zhang <qiangqing.zhang@nxp.com>; peppe.cavallaro@st.com;

> alexandre.torgue@foss.st.com; joabreu@synopsys.com;

> davem@davemloft.net; kuba@kernel.org; mcoquelin.stm32@gmail.com;

> andrew@lunn.ch

> Cc: dl-linux-imx <linux-imx@nxp.com>; jonathanh@nvidia.com;

> treding@nvidia.com; netdev@vger.kernel.org

> Subject: Re: [RFC net-next] net: stmmac: should not modify RX descriptor when

> STMMAC resume

> 

> 

> 

> On 4/19/2021 4:59 AM, Joakim Zhang wrote:

> > When system resume back, STMMAC will clear RX descriptors:

> > stmmac_resume()

> > 	->stmmac_clear_descriptors()

> > 		->stmmac_clear_rx_descriptors()

> > 			->stmmac_init_rx_desc()

> > 				->dwmac4_set_rx_owner()

> > 				//p->des3 |= cpu_to_le32(RDES3_OWN |

> RDES3_BUFFER1_VALID_ADDR); It

> > only assets OWN and BUF1V bits in desc3 field, doesn't clear desc0/1/2 fields.

> >

> > Let's take a case into account, when system suspend, it is possible

> > that there are packets have not received yet, so the RX descriptors

> > are wrote back by DMA, e.g.

> > 008 [0x00000000c4310080]: 0x0 0x40 0x0 0x34010040

> >

> > When system resume back, after above process, it became a broken

> > descriptor:

> > 008 [0x00000000c4310080]: 0x0 0x40 0x0 0xb5010040

> >

> > The issue is that it only changes the owner of this descriptor, but do

> > nothing about desc0/1/2 fields. The descriptor of STMMAC a bit

> > special, applicaton prepares RX descriptors for DMA, after DMA recevie

> > the packets, it will write back the descriptors, so the same field of

> > a descriptor have different meanings to application and DMA. It should

> > be a software bug there, and may not easy to reproduce, but there is a

> > certain probability that it will occur.

> >

> > Commit 9c63faaa931e ("net: stmmac: re-init rx buffers when mac resume

> > back") tried to re-init desc0/desc1 (buffer address fields) to fix

> > this issue, but it is not a proper solution, and made regression on Jetson TX2

> boards.

> >

> > It is unreasonable to modify RX descriptors outside of

> > stmmac_rx_refill() function, where it will clear all desc0/desc1/desc2/desc3

> fields together.

> >

> > This patch removes RX descriptors modification when STMMAC resume.

> 

> Your patch makes sense to me, however the explanation seems to highlight

> that you may have a few cases to consider while you suspend.

> 

> Usually you will turn off the RX DMA such that DMA into DRAM stops there,

> this may not an entirely atomic operation as the MAC may have to wait for a

> certain packet boundary to be crossed, that could leave you with descriptors in

> 3 states I believe:

> 

> - descriptor is ready for RX DMA to process and is owned by RX DMA, no need

> to do anything

Agree.

> - descriptor has been fully consumed by the CPU and is owned by the CPU, CPU

> should be putting the descriptor back on the ring and relinquish ownership

Yes, at this case, after stmmac resume, this descriptor would have chance to be refilled in stmmac_rx_refill() function.

> - descriptor has been written to DRAM but not processed by CPU, and it should

> be put back on the ring for RX DMA to use it

This case is descriptor is owned by CPU and rx buffer have not been consumed yet, I think it still has chance to be received from stmmac_rx() after stmmac resume.

> 

> Out of suspend, don't you need to deal with descriptors in cases 2 and 3

> somehow? Does the DMA skip over descriptors that are still marked as owned

> by the CPU or does it stop/stall?

Yes, AFAIK, DMA would skip over descriptors that are owned by CPU, only use the descriptors that owned by DMA. This is my understanding.

Best Regards,
Joakim Zhang
> --

> Florian
Joakim Zhang April 23, 2021, 2:37 a.m. UTC | #11
Hi Florian, John,

> -----Original Message-----

> From: Florian Fainelli <f.fainelli@gmail.com>

> Sent: 2021年4月23日 1:33

> To: Jon Hunter <jonathanh@nvidia.com>; Joakim Zhang

> <qiangqing.zhang@nxp.com>; peppe.cavallaro@st.com;

> alexandre.torgue@foss.st.com; joabreu@synopsys.com;

> davem@davemloft.net; kuba@kernel.org; mcoquelin.stm32@gmail.com;

> andrew@lunn.ch

> Cc: dl-linux-imx <linux-imx@nxp.com>; treding@nvidia.com;

> netdev@vger.kernel.org

> Subject: Re: [RFC net-next] net: stmmac: should not modify RX descriptor when

> STMMAC resume

> 

> 

> 

> On 4/22/2021 10:00 AM, Jon Hunter wrote:

> >

> > On 22/04/2021 17:12, Florian Fainelli wrote:

> >

> > ...

> >

> >> What does the resumption failure looks like? Does the stmmac driver

> >> successfully resume from your suspend state, but there is no network

> >> traffic? Do you have a log by any chance?

> >

> > The board fails to resume and appears to hang. With regard to the

> > original patch I did find that moving the code to re-init the RX

> > buffers to before the PHY is enabled did work [0].

> 

> You indicated that you are using a Broadcom PHY, which specific PHY are you

> using?

> 

> I suspect that the stmmac is somehow relying on the PHY to provide its

> 125MHz RXC clock back to you in order to have its RX logic work correctly.

Yes, for i.MX platforms, we need PHY feeds its RXC clock back to STMMAC for RX logic, not sure if it is need for NVIDIA platforms.
And now stmmac resume PHY(phylink_start) before initialize MAC.

> One difference between using the Broadcom PHY and the Generic PHY drivers

> could be whether your Broadcom PHY driver entry has a .suspend/.resume

> callback implemented or not.


If there is no .suspend/.resume callback, the PHY should always work. I have a concern, does Boardcom always can feed continuous RX clock? Does it support EEE?

Best Regards,
Joakim Zhang
> >

> >> Is power to the Ethernet MAC turned off in this suspend state, in

> >> which case could we be missing an essential register programming stage?

> >

> > It seems to be more of a sequencing issue rather than a power issue.

> >

> > I have also ran 2000 suspend cycles on our Tegra platform without

> > Joakim's patch to see how stable suspend is on this platform. I did

> > not see any failures in 2000 cycles and so it is not evident to me

> > that the problem that Joakim is trying to fix is seen on devices such as Tegra.

> > Admittedly, if it is hard to reproduce, then it is possible we have

> > not seen it yet.

> >

> > Jon

> >

> > [0]

> > https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore

> > .kernel.org%2Fnetdev%2Fe4864046-e52f-63b6-a490-74c3cd8045f4%40nvidia

> .c

> >

> om%2F&amp;data=04%7C01%7Cqiangqing.zhang%40nxp.com%7Ca4ce9a89b

> 92c4f600

> >

> 84608d905b4a285%7C686ea1d3bc2b4c6fa92cd99c5c301635%7C0%7C0%7C63

> 7547095

> >

> 639789970%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoi

> V2luMzIi

> >

> LCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=xLgYkYit0EfOXbbM

> lz78ws

> > b4w19W36A0FvPvkaPzl7o%3D&amp;reserved=0

> >

> 

> --

> Florian
Jon Hunter April 23, 2021, 1:46 p.m. UTC | #12
On 22/04/2021 19:20, Florian Fainelli wrote:
> 

> 

> On 4/22/2021 11:18 AM, Jon Hunter wrote:

>>

>> On 22/04/2021 18:32, Florian Fainelli wrote:

>>>

>>>

>>> On 4/22/2021 10:00 AM, Jon Hunter wrote:

>>>>

>>>> On 22/04/2021 17:12, Florian Fainelli wrote:

>>>>

>>>> ...

>>>>

>>>>> What does the resumption failure looks like? Does the stmmac driver

>>>>> successfully resume from your suspend state, but there is no network

>>>>> traffic? Do you have a log by any chance?

>>>>

>>>> The board fails to resume and appears to hang. With regard to the

>>>> original patch I did find that moving the code to re-init the RX buffers

>>>> to before the PHY is enabled did work [0].

>>>

>>> You indicated that you are using a Broadcom PHY, which specific PHY are

>>> you using?

>>>

>>> I suspect that the stmmac is somehow relying on the PHY to provide its

>>> 125MHz RXC clock back to you in order to have its RX logic work correctly.

>>>

>>> One difference between using the Broadcom PHY and the Generic PHY

>>> drivers could be whether your Broadcom PHY driver entry has a

>>> .suspend/.resume callback implemented or not.

>>

>>

>> This board has a BCM89610 and uses the drivers/net/phy/broadcom.c

>> driver. Interestingly I don't see any suspend/resume handlers for this phy.

> 

> Now if you do add .suspend = genphy_suspend and .resume = bcm54xx_resume

> for this PHY entry does it work better?


Thanks for the suggestion. I tried this and this appears to breaking
networking altogether. In other words, the board was not longer able to
request an IP address. I also tried the genphy_resume and the board was
able to get an IP address, but it is still unable to resume from suspend.

Thanks
Jon

-- 
nvpublic
Jon Hunter April 23, 2021, 1:48 p.m. UTC | #13
On 22/04/2021 16:56, Jakub Kicinski wrote:
> On Thu, 22 Apr 2021 04:53:08 +0000 Joakim Zhang wrote:

>> Could you please help review this patch? It's really beyond my

>> comprehension, why this patch would affect Tegra186 Jetson TX2 board?

> 

> Looks okay, please repost as non-RFC.



I still have an issue with a board not being able to resume from suspend
with this patch. Shouldn't we try to resolve that first?

Jon
Heiner Kallweit April 23, 2021, 3:38 p.m. UTC | #14
On 23.04.2021 15:46, Jon Hunter wrote:
> 

> On 22/04/2021 19:20, Florian Fainelli wrote:

>>

>>

>> On 4/22/2021 11:18 AM, Jon Hunter wrote:

>>>

>>> On 22/04/2021 18:32, Florian Fainelli wrote:

>>>>

>>>>

>>>> On 4/22/2021 10:00 AM, Jon Hunter wrote:

>>>>>

>>>>> On 22/04/2021 17:12, Florian Fainelli wrote:

>>>>>

>>>>> ...

>>>>>

>>>>>> What does the resumption failure looks like? Does the stmmac driver

>>>>>> successfully resume from your suspend state, but there is no network

>>>>>> traffic? Do you have a log by any chance?

>>>>>

>>>>> The board fails to resume and appears to hang. With regard to the

>>>>> original patch I did find that moving the code to re-init the RX buffers

>>>>> to before the PHY is enabled did work [0].

>>>>

>>>> You indicated that you are using a Broadcom PHY, which specific PHY are

>>>> you using?

>>>>

>>>> I suspect that the stmmac is somehow relying on the PHY to provide its

>>>> 125MHz RXC clock back to you in order to have its RX logic work correctly.

>>>>

>>>> One difference between using the Broadcom PHY and the Generic PHY

>>>> drivers could be whether your Broadcom PHY driver entry has a

>>>> .suspend/.resume callback implemented or not.

>>>

>>>

>>> This board has a BCM89610 and uses the drivers/net/phy/broadcom.c

>>> driver. Interestingly I don't see any suspend/resume handlers for this phy.

>>

>> Now if you do add .suspend = genphy_suspend and .resume = bcm54xx_resume

>> for this PHY entry does it work better?

> 

> Thanks for the suggestion. I tried this and this appears to breaking

> networking altogether. In other words, the board was not longer able to

> request an IP address. I also tried the genphy_resume and the board was

> able to get an IP address, but it is still unable to resume from suspend.

> 

Maybe MDIO bus PM and MAC driver PM interfere. Following changes dealt
with such an issue with the fec driver and a specific PHY model.

fba863b81604 ("net: phy: make PHY PM ops a no-op if MAC driver manages PHY PM")
557d5dc83f68 ("net: fec: use mac-managed PHY PM")

If stmmac suspend/resume take care also of PHY pm, you may try a similar
change in stmmac.


> Thanks

> Jon

> 

Heiner
Joakim Zhang May 6, 2021, 6:33 a.m. UTC | #15
> -----Original Message-----

> From: Jon Hunter <jonathanh@nvidia.com>

> Sent: 2021年4月23日 21:48

> To: Jakub Kicinski <kuba@kernel.org>; Joakim Zhang

> <qiangqing.zhang@nxp.com>

> Cc: peppe.cavallaro@st.com; alexandre.torgue@foss.st.com;

> joabreu@synopsys.com; davem@davemloft.net;

> mcoquelin.stm32@gmail.com; andrew@lunn.ch; f.fainelli@gmail.com;

> dl-linux-imx <linux-imx@nxp.com>; treding@nvidia.com;

> netdev@vger.kernel.org

> Subject: Re: [RFC net-next] net: stmmac: should not modify RX descriptor when

> STMMAC resume

> 

> 

> On 22/04/2021 16:56, Jakub Kicinski wrote:

> > On Thu, 22 Apr 2021 04:53:08 +0000 Joakim Zhang wrote:

> >> Could you please help review this patch? It's really beyond my

> >> comprehension, why this patch would affect Tegra186 Jetson TX2 board?

> >

> > Looks okay, please repost as non-RFC.

> 

> 

> I still have an issue with a board not being able to resume from suspend with

> this patch. Shouldn't we try to resolve that first?


Hi Jon,

Any updates about this? Could I repost as non-RFC?

Best Regards,
Joakim Zhang
> Jon
Jon Hunter May 7, 2021, 2:22 p.m. UTC | #16
Hi Joakim,

On 06/05/2021 07:33, Joakim Zhang wrote:
> 

>> -----Original Message-----

>> From: Jon Hunter <jonathanh@nvidia.com>

>> Sent: 2021年4月23日 21:48

>> To: Jakub Kicinski <kuba@kernel.org>; Joakim Zhang

>> <qiangqing.zhang@nxp.com>

>> Cc: peppe.cavallaro@st.com; alexandre.torgue@foss.st.com;

>> joabreu@synopsys.com; davem@davemloft.net;

>> mcoquelin.stm32@gmail.com; andrew@lunn.ch; f.fainelli@gmail.com;

>> dl-linux-imx <linux-imx@nxp.com>; treding@nvidia.com;

>> netdev@vger.kernel.org

>> Subject: Re: [RFC net-next] net: stmmac: should not modify RX descriptor when

>> STMMAC resume

>>

>>

>> On 22/04/2021 16:56, Jakub Kicinski wrote:

>>> On Thu, 22 Apr 2021 04:53:08 +0000 Joakim Zhang wrote:

>>>> Could you please help review this patch? It's really beyond my

>>>> comprehension, why this patch would affect Tegra186 Jetson TX2 board?

>>>

>>> Looks okay, please repost as non-RFC.

>>

>>

>> I still have an issue with a board not being able to resume from suspend with

>> this patch. Shouldn't we try to resolve that first?

> 

> Hi Jon,

> 

> Any updates about this? Could I repost as non-RFC?



Sorry no updates from my end. Again, I don't see how we can post this as
it introduces a regression for us. I am sorry that I am not able to help
more here, but we have done some extensive testing on the current
mainline without your change and I don't see any issues with regard to
suspend/resume. Hence, this does not appear to fix any pre-existing
issues. It is possible that we are not seeing them.

At this point I think that we really need someone from Synopsys to help
us understand that exact problem that you are experiencing so that we
can ensure we have the necessary fix in place and if this is something
that is applicable to all devices or not.

Jon

-- 
nvpublic
Joakim Zhang May 8, 2021, 11:20 a.m. UTC | #17
Hi Jakub,

> -----Original Message-----

> From: Jon Hunter <jonathanh@nvidia.com>

> Sent: 2021年5月7日 22:22

> To: Joakim Zhang <qiangqing.zhang@nxp.com>; Jakub Kicinski

> <kuba@kernel.org>

> Cc: peppe.cavallaro@st.com; alexandre.torgue@foss.st.com;

> joabreu@synopsys.com; davem@davemloft.net;

> mcoquelin.stm32@gmail.com; andrew@lunn.ch; f.fainelli@gmail.com;

> dl-linux-imx <linux-imx@nxp.com>; treding@nvidia.com;

> netdev@vger.kernel.org

> Subject: Re: [RFC net-next] net: stmmac: should not modify RX descriptor when

> STMMAC resume

> 

> Hi Joakim,

> 

> On 06/05/2021 07:33, Joakim Zhang wrote:

> >

> >> -----Original Message-----

> >> From: Jon Hunter <jonathanh@nvidia.com>

> >> Sent: 2021年4月23日 21:48

> >> To: Jakub Kicinski <kuba@kernel.org>; Joakim Zhang

> >> <qiangqing.zhang@nxp.com>

> >> Cc: peppe.cavallaro@st.com; alexandre.torgue@foss.st.com;

> >> joabreu@synopsys.com; davem@davemloft.net;

> mcoquelin.stm32@gmail.com;

> >> andrew@lunn.ch; f.fainelli@gmail.com; dl-linux-imx

> >> <linux-imx@nxp.com>; treding@nvidia.com; netdev@vger.kernel.org

> >> Subject: Re: [RFC net-next] net: stmmac: should not modify RX

> >> descriptor when STMMAC resume

> >>

> >>

> >> On 22/04/2021 16:56, Jakub Kicinski wrote:

> >>> On Thu, 22 Apr 2021 04:53:08 +0000 Joakim Zhang wrote:

> >>>> Could you please help review this patch? It's really beyond my

> >>>> comprehension, why this patch would affect Tegra186 Jetson TX2 board?

> >>>

> >>> Looks okay, please repost as non-RFC.

> >>

> >>

> >> I still have an issue with a board not being able to resume from

> >> suspend with this patch. Shouldn't we try to resolve that first?

> >

> > Hi Jon,

> >

> > Any updates about this? Could I repost as non-RFC?

> 

> 

> Sorry no updates from my end. Again, I don't see how we can post this as it

> introduces a regression for us. I am sorry that I am not able to help more here,

> but we have done some extensive testing on the current mainline without your

> change and I don't see any issues with regard to suspend/resume. Hence, this

> does not appear to fix any pre-existing issues. It is possible that we are not

> seeing them.

> 

> At this point I think that we really need someone from Synopsys to help us

> understand that exact problem that you are experiencing so that we can

> ensure we have the necessary fix in place and if this is something that is

> applicable to all devices or not.


This patch only removes modification of Rx descriptors when STMMAC resume back, IMHO, it should not affect system suspend/resume function.
Do you have any idea about Joh's issue or any acceptable solution to fix the issue I met? Thanks a lot!

Best Regards,
Joakim Zhang
> Jon

> 

> --

> nvpublic
Florian Fainelli May 8, 2021, 3:42 p.m. UTC | #18
On 5/8/2021 4:20 AM, Joakim Zhang wrote:
> 

> Hi Jakub,

> 

>> -----Original Message-----

>> From: Jon Hunter <jonathanh@nvidia.com>

>> Sent: 2021年5月7日 22:22

>> To: Joakim Zhang <qiangqing.zhang@nxp.com>; Jakub Kicinski

>> <kuba@kernel.org>

>> Cc: peppe.cavallaro@st.com; alexandre.torgue@foss.st.com;

>> joabreu@synopsys.com; davem@davemloft.net;

>> mcoquelin.stm32@gmail.com; andrew@lunn.ch; f.fainelli@gmail.com;

>> dl-linux-imx <linux-imx@nxp.com>; treding@nvidia.com;

>> netdev@vger.kernel.org

>> Subject: Re: [RFC net-next] net: stmmac: should not modify RX descriptor when

>> STMMAC resume

>>

>> Hi Joakim,

>>

>> On 06/05/2021 07:33, Joakim Zhang wrote:

>>>

>>>> -----Original Message-----

>>>> From: Jon Hunter <jonathanh@nvidia.com>

>>>> Sent: 2021年4月23日 21:48

>>>> To: Jakub Kicinski <kuba@kernel.org>; Joakim Zhang

>>>> <qiangqing.zhang@nxp.com>

>>>> Cc: peppe.cavallaro@st.com; alexandre.torgue@foss.st.com;

>>>> joabreu@synopsys.com; davem@davemloft.net;

>> mcoquelin.stm32@gmail.com;

>>>> andrew@lunn.ch; f.fainelli@gmail.com; dl-linux-imx

>>>> <linux-imx@nxp.com>; treding@nvidia.com; netdev@vger.kernel.org

>>>> Subject: Re: [RFC net-next] net: stmmac: should not modify RX

>>>> descriptor when STMMAC resume

>>>>

>>>>

>>>> On 22/04/2021 16:56, Jakub Kicinski wrote:

>>>>> On Thu, 22 Apr 2021 04:53:08 +0000 Joakim Zhang wrote:

>>>>>> Could you please help review this patch? It's really beyond my

>>>>>> comprehension, why this patch would affect Tegra186 Jetson TX2 board?

>>>>>

>>>>> Looks okay, please repost as non-RFC.

>>>>

>>>>

>>>> I still have an issue with a board not being able to resume from

>>>> suspend with this patch. Shouldn't we try to resolve that first?

>>>

>>> Hi Jon,

>>>

>>> Any updates about this? Could I repost as non-RFC?

>>

>>

>> Sorry no updates from my end. Again, I don't see how we can post this as it

>> introduces a regression for us. I am sorry that I am not able to help more here,

>> but we have done some extensive testing on the current mainline without your

>> change and I don't see any issues with regard to suspend/resume. Hence, this

>> does not appear to fix any pre-existing issues. It is possible that we are not

>> seeing them.

>>

>> At this point I think that we really need someone from Synopsys to help us

>> understand that exact problem that you are experiencing so that we can

>> ensure we have the necessary fix in place and if this is something that is

>> applicable to all devices or not.

> 

> This patch only removes modification of Rx descriptors when STMMAC resume back, IMHO, it should not affect system suspend/resume function.

> Do you have any idea about Joh's issue or any acceptable solution to fix the issue I met? Thanks a lot!


Joakim, don't you have a support contact at Synopsys who would be able
to help or someone at NXP who was responsible for the MAC integration?
We also have Synopsys engineers copied so presumably they could shed
some light.
-- 
Florian
Joakim Zhang May 10, 2021, 2:10 a.m. UTC | #19
Hi Florian,

> -----Original Message-----

> From: Florian Fainelli <f.fainelli@gmail.com>

> Sent: 2021年5月8日 23:42

> To: Joakim Zhang <qiangqing.zhang@nxp.com>; Jon Hunter

> <jonathanh@nvidia.com>; Jakub Kicinski <kuba@kernel.org>

> Cc: peppe.cavallaro@st.com; alexandre.torgue@foss.st.com;

> joabreu@synopsys.com; davem@davemloft.net;

> mcoquelin.stm32@gmail.com; andrew@lunn.ch; dl-linux-imx

> <linux-imx@nxp.com>; treding@nvidia.com; netdev@vger.kernel.org

> Subject: Re: [RFC net-next] net: stmmac: should not modify RX descriptor when

> STMMAC resume

> 

> 

> 

> On 5/8/2021 4:20 AM, Joakim Zhang wrote:

> >

> > Hi Jakub,

> >

> >> -----Original Message-----

> >> From: Jon Hunter <jonathanh@nvidia.com>

> >> Sent: 2021年5月7日 22:22

> >> To: Joakim Zhang <qiangqing.zhang@nxp.com>; Jakub Kicinski

> >> <kuba@kernel.org>

> >> Cc: peppe.cavallaro@st.com; alexandre.torgue@foss.st.com;

> >> joabreu@synopsys.com; davem@davemloft.net;

> mcoquelin.stm32@gmail.com;

> >> andrew@lunn.ch; f.fainelli@gmail.com; dl-linux-imx

> >> <linux-imx@nxp.com>; treding@nvidia.com; netdev@vger.kernel.org

> >> Subject: Re: [RFC net-next] net: stmmac: should not modify RX

> >> descriptor when STMMAC resume

> >>

> >> Hi Joakim,

> >>

> >> On 06/05/2021 07:33, Joakim Zhang wrote:

> >>>

> >>>> -----Original Message-----

> >>>> From: Jon Hunter <jonathanh@nvidia.com>

> >>>> Sent: 2021年4月23日 21:48

> >>>> To: Jakub Kicinski <kuba@kernel.org>; Joakim Zhang

> >>>> <qiangqing.zhang@nxp.com>

> >>>> Cc: peppe.cavallaro@st.com; alexandre.torgue@foss.st.com;

> >>>> joabreu@synopsys.com; davem@davemloft.net;

> >> mcoquelin.stm32@gmail.com;

> >>>> andrew@lunn.ch; f.fainelli@gmail.com; dl-linux-imx

> >>>> <linux-imx@nxp.com>; treding@nvidia.com; netdev@vger.kernel.org

> >>>> Subject: Re: [RFC net-next] net: stmmac: should not modify RX

> >>>> descriptor when STMMAC resume

> >>>>

> >>>>

> >>>> On 22/04/2021 16:56, Jakub Kicinski wrote:

> >>>>> On Thu, 22 Apr 2021 04:53:08 +0000 Joakim Zhang wrote:

> >>>>>> Could you please help review this patch? It's really beyond my

> >>>>>> comprehension, why this patch would affect Tegra186 Jetson TX2

> board?

> >>>>>

> >>>>> Looks okay, please repost as non-RFC.

> >>>>

> >>>>

> >>>> I still have an issue with a board not being able to resume from

> >>>> suspend with this patch. Shouldn't we try to resolve that first?

> >>>

> >>> Hi Jon,

> >>>

> >>> Any updates about this? Could I repost as non-RFC?

> >>

> >>

> >> Sorry no updates from my end. Again, I don't see how we can post this

> >> as it introduces a regression for us. I am sorry that I am not able

> >> to help more here, but we have done some extensive testing on the

> >> current mainline without your change and I don't see any issues with

> >> regard to suspend/resume. Hence, this does not appear to fix any

> >> pre-existing issues. It is possible that we are not seeing them.

> >>

> >> At this point I think that we really need someone from Synopsys to

> >> help us understand that exact problem that you are experiencing so

> >> that we can ensure we have the necessary fix in place and if this is

> >> something that is applicable to all devices or not.

> >

> > This patch only removes modification of Rx descriptors when STMMAC

> resume back, IMHO, it should not affect system suspend/resume function.

> > Do you have any idea about Joh's issue or any acceptable solution to fix the

> issue I met? Thanks a lot!

> 

> Joakim, don't you have a support contact at Synopsys who would be able to

> help or someone at NXP who was responsible for the MAC integration?

> We also have Synopsys engineers copied so presumably they could shed some

> light.


I contacted Synopsys no substantive help was received, and integration guys from NXP is unavailable now.

But, some hints has came out, seems a bit help. I found that the DMA width is 34 bits on i.MX8MP, this may different from many existing SoCs which integrated STMMAC.

As I described in the commit message:
When system suspend: the rx descriptor is 008 [0x00000000c4310080]: 0x0 0x40 0x0 0x34010040
When system resume: the rx descriptor modified to 008 [0x00000000c4310080]: 0x0 0x40 0x0 0xb5010040
Since the DMA is 34 bits width, so desc0/desc1 indicates the buffer address, after system resume, the buffer address changed to 0x4000000000.
And the correct rx descriptor is 008 [0x00000000c4310080]: 0x6511000 0x1 0x0 0x81000000, the valid buffer address is 0x16511000.
So when DMA tried to access 0x4000000000, this valid address, would generate fatal bus error.

But for other 32 bits width DMA, DMA seems still can work when this issue happened, only desc0 indicates buffer address, so the buffer address is 0x0 when system resume.
And there is a NOTE in the guide:
In the Receive Descriptor (Read Format), if the Buffer Address
field is all 0s, the module does not transfer data to that buffer
and skips to the next buffer or next descriptor.
For this note, I don't know what could IP actually do, when detect all zeros buffer address, it will change the descriptor to application own? If not, STMMAC driver seems can't handle this case.
I will contact Synopsys guys for more details.

It now appears that this issue seems only can be reproduced on DMA width more than 32 bits, this may be why other SoCs(e.g. i.MX8DXL) which integrated the same STMMAC IP can't reproduce it.

Best Regards,
Joakim Zhang
> --

> Florian
Thierry Reding May 17, 2021, 7:53 a.m. UTC | #20
On Mon, May 10, 2021 at 02:10:21AM +0000, Joakim Zhang wrote:
> 

> Hi Florian,

> 

> > -----Original Message-----

> > From: Florian Fainelli <f.fainelli@gmail.com>

> > Sent: 2021年5月8日 23:42

> > To: Joakim Zhang <qiangqing.zhang@nxp.com>; Jon Hunter

> > <jonathanh@nvidia.com>; Jakub Kicinski <kuba@kernel.org>

> > Cc: peppe.cavallaro@st.com; alexandre.torgue@foss.st.com;

> > joabreu@synopsys.com; davem@davemloft.net;

> > mcoquelin.stm32@gmail.com; andrew@lunn.ch; dl-linux-imx

> > <linux-imx@nxp.com>; treding@nvidia.com; netdev@vger.kernel.org

> > Subject: Re: [RFC net-next] net: stmmac: should not modify RX descriptor when

> > STMMAC resume

> > 

> > 

> > 

> > On 5/8/2021 4:20 AM, Joakim Zhang wrote:

> > >

> > > Hi Jakub,

> > >

> > >> -----Original Message-----

> > >> From: Jon Hunter <jonathanh@nvidia.com>

> > >> Sent: 2021年5月7日 22:22

> > >> To: Joakim Zhang <qiangqing.zhang@nxp.com>; Jakub Kicinski

> > >> <kuba@kernel.org>

> > >> Cc: peppe.cavallaro@st.com; alexandre.torgue@foss.st.com;

> > >> joabreu@synopsys.com; davem@davemloft.net;

> > mcoquelin.stm32@gmail.com;

> > >> andrew@lunn.ch; f.fainelli@gmail.com; dl-linux-imx

> > >> <linux-imx@nxp.com>; treding@nvidia.com; netdev@vger.kernel.org

> > >> Subject: Re: [RFC net-next] net: stmmac: should not modify RX

> > >> descriptor when STMMAC resume

> > >>

> > >> Hi Joakim,

> > >>

> > >> On 06/05/2021 07:33, Joakim Zhang wrote:

> > >>>

> > >>>> -----Original Message-----

> > >>>> From: Jon Hunter <jonathanh@nvidia.com>

> > >>>> Sent: 2021年4月23日 21:48

> > >>>> To: Jakub Kicinski <kuba@kernel.org>; Joakim Zhang

> > >>>> <qiangqing.zhang@nxp.com>

> > >>>> Cc: peppe.cavallaro@st.com; alexandre.torgue@foss.st.com;

> > >>>> joabreu@synopsys.com; davem@davemloft.net;

> > >> mcoquelin.stm32@gmail.com;

> > >>>> andrew@lunn.ch; f.fainelli@gmail.com; dl-linux-imx

> > >>>> <linux-imx@nxp.com>; treding@nvidia.com; netdev@vger.kernel.org

> > >>>> Subject: Re: [RFC net-next] net: stmmac: should not modify RX

> > >>>> descriptor when STMMAC resume

> > >>>>

> > >>>>

> > >>>> On 22/04/2021 16:56, Jakub Kicinski wrote:

> > >>>>> On Thu, 22 Apr 2021 04:53:08 +0000 Joakim Zhang wrote:

> > >>>>>> Could you please help review this patch? It's really beyond my

> > >>>>>> comprehension, why this patch would affect Tegra186 Jetson TX2

> > board?

> > >>>>>

> > >>>>> Looks okay, please repost as non-RFC.

> > >>>>

> > >>>>

> > >>>> I still have an issue with a board not being able to resume from

> > >>>> suspend with this patch. Shouldn't we try to resolve that first?

> > >>>

> > >>> Hi Jon,

> > >>>

> > >>> Any updates about this? Could I repost as non-RFC?

> > >>

> > >>

> > >> Sorry no updates from my end. Again, I don't see how we can post this

> > >> as it introduces a regression for us. I am sorry that I am not able

> > >> to help more here, but we have done some extensive testing on the

> > >> current mainline without your change and I don't see any issues with

> > >> regard to suspend/resume. Hence, this does not appear to fix any

> > >> pre-existing issues. It is possible that we are not seeing them.

> > >>

> > >> At this point I think that we really need someone from Synopsys to

> > >> help us understand that exact problem that you are experiencing so

> > >> that we can ensure we have the necessary fix in place and if this is

> > >> something that is applicable to all devices or not.

> > >

> > > This patch only removes modification of Rx descriptors when STMMAC

> > resume back, IMHO, it should not affect system suspend/resume function.

> > > Do you have any idea about Joh's issue or any acceptable solution to fix the

> > issue I met? Thanks a lot!

> > 

> > Joakim, don't you have a support contact at Synopsys who would be able to

> > help or someone at NXP who was responsible for the MAC integration?

> > We also have Synopsys engineers copied so presumably they could shed some

> > light.

> 

> I contacted Synopsys no substantive help was received, and integration guys from NXP is unavailable now.

> 

> But, some hints has came out, seems a bit help. I found that the DMA width is 34 bits on i.MX8MP, this may different from many existing SoCs which integrated STMMAC.

> 

> As I described in the commit message:

> When system suspend: the rx descriptor is 008 [0x00000000c4310080]: 0x0 0x40 0x0 0x34010040

> When system resume: the rx descriptor modified to 008 [0x00000000c4310080]: 0x0 0x40 0x0 0xb5010040

> Since the DMA is 34 bits width, so desc0/desc1 indicates the buffer address, after system resume, the buffer address changed to 0x4000000000.

> And the correct rx descriptor is 008 [0x00000000c4310080]: 0x6511000 0x1 0x0 0x81000000, the valid buffer address is 0x16511000.

> So when DMA tried to access 0x4000000000, this valid address, would generate fatal bus error.


Okay, that's interesting. If i.MX8MP supports only 34 address bits but
the driver tries to set a DMA address of 0x4000000000, that's way out of
the valid range.

I suspect what might be happening is that the DMA mask isn't properly
set for your device. There's in fact some code in the driver that deals
with this. If you look at the implementation of stmmac_dvr_probe() in
drivers/net/ethernet/stmicro/stmmac/stmmac_main.c around line 4980,
there's a comment that actually mentions i.MX8MP and the 34 address bit
limitation. Can you find out what that priv->plat->addr64 is set to on
your system?

Or alternatively find out what priv->dma_cap.addr64 ends up being set a
few lines further down? That value is effectively used to set the DMA
mask and if that's wrong it might explain why the driver is setting a
bad DMA address.

In fact, maybe that information is already in the kernel log. There's a
dev_info() there that should print out something like:

	Using 34 bits DMA width

in your case. If that says something other than 34 in there, it's very
likely that this needs to be correctly set somewhere. Looking at the
code in dwmac-imx.c, I see that that's already set to 34, so this looks
like it should be setting things correctly, but better make sure.

> But for other 32 bits width DMA, DMA seems still can work when this issue happened, only desc0 indicates buffer address, so the buffer address is 0x0 when system resume.

> And there is a NOTE in the guide:

> In the Receive Descriptor (Read Format), if the Buffer Address

> field is all 0s, the module does not transfer data to that buffer

> and skips to the next buffer or next descriptor.

> For this note, I don't know what could IP actually do, when detect all zeros buffer address, it will change the descriptor to application own? If not, STMMAC driver seems can't handle this case.

> I will contact Synopsys guys for more details.

> 

> It now appears that this issue seems only can be reproduced on DMA width more than 32 bits, this may be why other SoCs(e.g. i.MX8DXL) which integrated the same STMMAC IP can't reproduce it.


On Tegra186 and later we support up to 40 address bits. The newer
Tegra194 has a special quirk where bit 39 has special meaning, so we
have to override the DMA mask as well. I recall that this was causing
issues at some point, which is why I suspect something like this could
be happening in your case as well.

Thierry
Joakim Zhang May 18, 2021, 7:52 a.m. UTC | #21
Hi Thierry,

> -----Original Message-----

> From: Thierry Reding <treding@nvidia.com>

> Sent: 2021年5月17日 15:53

> To: Joakim Zhang <qiangqing.zhang@nxp.com>

> Cc: Florian Fainelli <f.fainelli@gmail.com>; Jon Hunter

> <jonathanh@nvidia.com>; Jakub Kicinski <kuba@kernel.org>;

> peppe.cavallaro@st.com; alexandre.torgue@foss.st.com;

> joabreu@synopsys.com; davem@davemloft.net;

> mcoquelin.stm32@gmail.com; andrew@lunn.ch; dl-linux-imx

> <linux-imx@nxp.com>; netdev@vger.kernel.org

> Subject: Re: [RFC net-next] net: stmmac: should not modify RX descriptor when

> STMMAC resume

> 

> On Mon, May 10, 2021 at 02:10:21AM +0000, Joakim Zhang wrote:

> >

> > Hi Florian,

> >

> > > -----Original Message-----

> > > From: Florian Fainelli <f.fainelli@gmail.com>

> > > Sent: 2021年5月8日 23:42

> > > To: Joakim Zhang <qiangqing.zhang@nxp.com>; Jon Hunter

> > > <jonathanh@nvidia.com>; Jakub Kicinski <kuba@kernel.org>

> > > Cc: peppe.cavallaro@st.com; alexandre.torgue@foss.st.com;

> > > joabreu@synopsys.com; davem@davemloft.net;

> > > mcoquelin.stm32@gmail.com; andrew@lunn.ch; dl-linux-imx

> > > <linux-imx@nxp.com>; treding@nvidia.com; netdev@vger.kernel.org

> > > Subject: Re: [RFC net-next] net: stmmac: should not modify RX

> > > descriptor when STMMAC resume

> > >

> > >

> > >

> > > On 5/8/2021 4:20 AM, Joakim Zhang wrote:

> > > >

> > > > Hi Jakub,

> > > >

> > > >> -----Original Message-----

> > > >> From: Jon Hunter <jonathanh@nvidia.com>

> > > >> Sent: 2021年5月7日 22:22

> > > >> To: Joakim Zhang <qiangqing.zhang@nxp.com>; Jakub Kicinski

> > > >> <kuba@kernel.org>

> > > >> Cc: peppe.cavallaro@st.com; alexandre.torgue@foss.st.com;

> > > >> joabreu@synopsys.com; davem@davemloft.net;

> > > mcoquelin.stm32@gmail.com;

> > > >> andrew@lunn.ch; f.fainelli@gmail.com; dl-linux-imx

> > > >> <linux-imx@nxp.com>; treding@nvidia.com; netdev@vger.kernel.org

> > > >> Subject: Re: [RFC net-next] net: stmmac: should not modify RX

> > > >> descriptor when STMMAC resume

> > > >>

> > > >> Hi Joakim,

> > > >>

> > > >> On 06/05/2021 07:33, Joakim Zhang wrote:

> > > >>>

> > > >>>> -----Original Message-----

> > > >>>> From: Jon Hunter <jonathanh@nvidia.com>

> > > >>>> Sent: 2021年4月23日 21:48

> > > >>>> To: Jakub Kicinski <kuba@kernel.org>; Joakim Zhang

> > > >>>> <qiangqing.zhang@nxp.com>

> > > >>>> Cc: peppe.cavallaro@st.com; alexandre.torgue@foss.st.com;

> > > >>>> joabreu@synopsys.com; davem@davemloft.net;

> > > >> mcoquelin.stm32@gmail.com;

> > > >>>> andrew@lunn.ch; f.fainelli@gmail.com; dl-linux-imx

> > > >>>> <linux-imx@nxp.com>; treding@nvidia.com; netdev@vger.kernel.org

> > > >>>> Subject: Re: [RFC net-next] net: stmmac: should not modify RX

> > > >>>> descriptor when STMMAC resume

> > > >>>>

> > > >>>>

> > > >>>> On 22/04/2021 16:56, Jakub Kicinski wrote:

> > > >>>>> On Thu, 22 Apr 2021 04:53:08 +0000 Joakim Zhang wrote:

> > > >>>>>> Could you please help review this patch? It's really beyond

> > > >>>>>> my comprehension, why this patch would affect Tegra186 Jetson

> > > >>>>>> TX2

> > > board?

> > > >>>>>

> > > >>>>> Looks okay, please repost as non-RFC.

> > > >>>>

> > > >>>>

> > > >>>> I still have an issue with a board not being able to resume

> > > >>>> from suspend with this patch. Shouldn't we try to resolve that first?

> > > >>>

> > > >>> Hi Jon,

> > > >>>

> > > >>> Any updates about this? Could I repost as non-RFC?

> > > >>

> > > >>

> > > >> Sorry no updates from my end. Again, I don't see how we can post

> > > >> this as it introduces a regression for us. I am sorry that I am

> > > >> not able to help more here, but we have done some extensive

> > > >> testing on the current mainline without your change and I don't

> > > >> see any issues with regard to suspend/resume. Hence, this does

> > > >> not appear to fix any pre-existing issues. It is possible that we are not

> seeing them.

> > > >>

> > > >> At this point I think that we really need someone from Synopsys

> > > >> to help us understand that exact problem that you are

> > > >> experiencing so that we can ensure we have the necessary fix in

> > > >> place and if this is something that is applicable to all devices or not.

> > > >

> > > > This patch only removes modification of Rx descriptors when STMMAC

> > > resume back, IMHO, it should not affect system suspend/resume function.

> > > > Do you have any idea about Joh's issue or any acceptable solution

> > > > to fix the

> > > issue I met? Thanks a lot!

> > >

> > > Joakim, don't you have a support contact at Synopsys who would be

> > > able to help or someone at NXP who was responsible for the MAC

> integration?

> > > We also have Synopsys engineers copied so presumably they could shed

> > > some light.

> >

> > I contacted Synopsys no substantive help was received, and integration guys

> from NXP is unavailable now.

> >

> > But, some hints has came out, seems a bit help. I found that the DMA width

> is 34 bits on i.MX8MP, this may different from many existing SoCs which

> integrated STMMAC.

> >

> > As I described in the commit message:

> > When system suspend: the rx descriptor is 008 [0x00000000c4310080]:

> > 0x0 0x40 0x0 0x34010040 When system resume: the rx descriptor modified

> > to 008 [0x00000000c4310080]: 0x0 0x40 0x0 0xb5010040 Since the DMA is 34

> bits width, so desc0/desc1 indicates the buffer address, after system resume,

> the buffer address changed to 0x4000000000.

> > And the correct rx descriptor is 008 [0x00000000c4310080]: 0x6511000 0x1

> 0x0 0x81000000, the valid buffer address is 0x16511000.

> > So when DMA tried to access 0x4000000000, this valid address, would

> generate fatal bus error.

> 

> Okay, that's interesting. If i.MX8MP supports only 34 address bits but the

> driver tries to set a DMA address of 0x4000000000, that's way out of the valid

> range.

> 

> I suspect what might be happening is that the DMA mask isn't properly set for

> your device. There's in fact some code in the driver that deals with this. If you

> look at the implementation of stmmac_dvr_probe() in

> drivers/net/ethernet/stmicro/stmmac/stmmac_main.c around line 4980,

> there's a comment that actually mentions i.MX8MP and the 34 address bit

> limitation. Can you find out what that priv->plat->addr64 is set to on your

> system?

Yes, I know STMMAC driver has taken DMA width into account, and you can see that DMA width is set
to 34 bits in dwmac-imx.c (.addr_width = 34,).

You also can see that the valid DMA address is 0x106511000, which is within 34bits. But the valid DMA
address is 0x4000000000 after system resume, which is out of 34 bits range. The reason here is that we
modify the rx descriptor when system resume.

> Or alternatively find out what priv->dma_cap.addr64 ends up being set a few

> lines further down? That value is effectively used to set the DMA mask and if

> that's wrong it might explain why the driver is setting a bad DMA address.

As I described above, the DMA address allocated when initialized is large than 32bits, MAC can play
well in normal case. So this should be impossible.

> In fact, maybe that information is already in the kernel log. There's a

> dev_info() there that should print out something like:

> 

> 	Using 34 bits DMA width

Yes, we can this log:
[    2.376903] imx-dwmac 30bf0000.ethernet: Using 34 bits DMA width

> in your case. If that says something other than 34 in there, it's very likely that

> this needs to be correctly set somewhere. Looking at the code in dwmac-imx.c,

> I see that that's already set to 34, so this looks like it should be setting things

> correctly, but better make sure.

Yes, now we make sure the DMA mask is correct.

> > But for other 32 bits width DMA, DMA seems still can work when this issue

> happened, only desc0 indicates buffer address, so the buffer address is 0x0

> when system resume.

> > And there is a NOTE in the guide:

> > In the Receive Descriptor (Read Format), if the Buffer Address field

> > is all 0s, the module does not transfer data to that buffer and skips

> > to the next buffer or next descriptor.

> > For this note, I don't know what could IP actually do, when detect all zeros

> buffer address, it will change the descriptor to application own? If not,

> STMMAC driver seems can't handle this case.

> > I will contact Synopsys guys for more details.

> >

> > It now appears that this issue seems only can be reproduced on DMA width

> more than 32 bits, this may be why other SoCs(e.g. i.MX8DXL) which integrated

> the same STMMAC IP can't reproduce it.

> 

> On Tegra186 and later we support up to 40 address bits. The newer

> Tegra194 has a special quirk where bit 39 has special meaning, so we have to

> override the DMA mask as well. I recall that this was causing issues at some

> point, which is why I suspect something like this could be happening in your

> case as well.

I am not quite understand what you means? Do you mean that our 34bits DMA width also
has a special meaning?

Thanks much Thierry for helping analyzing this issue. As I described in the commit message, we should
not _only_ change the rx descriptors to DMA own and let other parts of rx descriptors not updated.
So could you please help check why this RFC would make regression at you side? Why system can't resume back? 

Best Regards,
Joakim Zhang
> Thierry
diff mbox series

Patch

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index 9f396648d76f..b784304a22e8 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -7186,6 +7186,8 @@  static void stmmac_reset_queues_param(struct stmmac_priv *priv)
 		tx_q->mss = 0;
 
 		netdev_tx_reset_queue(netdev_get_tx_queue(priv->dev, queue));
+
+		stmmac_clear_tx_descriptors(priv, queue);
 	}
 }
 
@@ -7250,7 +7252,6 @@  int stmmac_resume(struct device *dev)
 	stmmac_reset_queues_param(priv);
 
 	stmmac_free_tx_skbufs(priv);
-	stmmac_clear_descriptors(priv);
 
 	stmmac_hw_setup(ndev, false);
 	stmmac_init_coalesce(priv);