mbox series

[v2,0/5] r8152: Avoid writing garbage to the adapter's registers

Message ID 20231004192622.1093964-1-dianders@chromium.org
Headers show
Series r8152: Avoid writing garbage to the adapter's registers | expand

Message

Doug Anderson Oct. 4, 2023, 7:24 p.m. UTC
This series is the result of a cooperative debug effort between
Realtek and the ChromeOS team. On ChromeOS, we've noticed that Realtek
Ethernet adapters can sometimes get so wedged that even a reboot of
the host can't get them to enumerate again, assuming that the adapter
was on a powered hub and din't lose power when the host rebooted. This
is sometimes seen in the ChromeOS automated testing lab. The only way
to recover adapters in this state is to manually power cycle them.

I managed to reproduce one instance of this wedging (unknown if this
is truly related to what the test lab sees) by doing this:
1. Start a flood ping from a host to the device.
2. Drop the device into kdb.
3. Wait 90 seconds.
4. Resume from kdb (the "g" command).
5. Wait another 45 seconds.

Upon analysis, Realtek realized this was happening:

1. The Linux driver was getting a "Tx timeout" after resuming from kdb
   and then trying to reset itself.
2. As part of the reset, the Linux driver was attempting to do a
   read-modify-write of the adapter's registers.
3. The read would fail (due to a timeout) and the driver pretended
   that the register contained all 0xFFs. See commit f53a7ad18959
   ("r8152: Set memory to all 0xFFs on failed reg reads")
4. The driver would take this value of all 0xFFs, modify it, and
   attempt to write it back to the adapter.
5. By this time the USB channel seemed to recover and thus we'd
   successfully write a value that was mostly 0xFFs to the adpater.
6. The adapter didn't like this and would wedge itself.

Another Engineer also managed to reproduce wedging of the Realtek
Ethernet adpater during a reboot test on an AMD Chromebook. In that
case he was sometimes seeing -EPIPE returned from the control
transfers.

This patch series fixes both issues.

Changes in v2:
- ("Check for unplug in rtl_phy_patch_request()") new for v2.
- ("Check for unplug in r8153b_ups_en() / r8153c_ups_en()") new for v2.
- ("Rename RTL8152_UNPLUG to RTL8152_INACCESSIBLE") new for v2.

Douglas Anderson (5):
  r8152: Increase USB control msg timeout to 5000ms as per spec
  r8152: Check for unplug in rtl_phy_patch_request()
  r8152: Check for unplug in r8153b_ups_en() / r8153c_ups_en()
  r8152: Rename RTL8152_UNPLUG to RTL8152_INACCESSIBLE
  r8152: Block future register access if register access fails

 drivers/net/usb/r8152.c | 268 +++++++++++++++++++++++++++++++---------
 1 file changed, 209 insertions(+), 59 deletions(-)

Comments

Simon Horman Oct. 5, 2023, 12:17 p.m. UTC | #1
On Wed, Oct 04, 2023 at 12:24:42PM -0700, Douglas Anderson wrote:

...

> @@ -9784,7 +9904,29 @@ static int rtl8152_probe(struct usb_interface *intf,
>  	else
>  		device_set_wakeup_enable(&udev->dev, false);
>  
> -	netif_info(tp, probe, netdev, "%s\n", DRIVER_VERSION);
> +	mutex_lock(&tp->control);
> +	if (test_bit(RTL8152_INACCESSIBLE, &tp->flags)) {
> +		/* If the device is marked inaccessible before probe even
> +		 * finished then one of two things happened. Either we got a
> +		 * USB error during probe or the user already unplugged the
> +		 * device.
> +		 *
> +		 * If we got a USB error during probe then we skipped doing a
> +		 * reset in r8152_control_msg() and deferred it to here. This
> +		 * is because the queued reset will give up after 1 second
> +		 * (see usb_lock_device_for_reset()) and we want to make sure
> +		 * that we queue things up right before probe finishes.
> +		 *
> +		 * If the user already unplugged the device then the USB
> +		 * farmework will call unbind right away for us. The extra

Hi Douglas,

As you are planning to re-spin anyway: farmework -> framework

> +		 * reset we queue up here will be harmless.
> +		 */
> +		usb_queue_reset_device(tp->intf);
> +	} else {
> +		set_bit(PROBED_WITH_NO_ERRORS, &tp->flags);
> +		netif_info(tp, probe, netdev, "%s\n", DRIVER_VERSION);
> +	}
> +	mutex_unlock(&tp->control);
>  
>  	return 0;
>  
> -- 
> 2.42.0.582.g8ccd20d70d-goog
>