| Message ID | 20250506142117.1883598-1-cratiu@nvidia.com |
|---|---|
| State | New |
| Series | [net] net: Lock lower level devices when updating features |
On Wed, 2025-05-07 at 14:35 +0000, Cosmin Ratiu wrote:
> On Tue, 2025-05-06 at 11:13 -0700, Stanislav Fomichev wrote:
> > On 05/06, Cosmin Ratiu wrote:
> > > [...]
> >
> > Right, but netdev_sync_lower_features calls lower's
> > __netdev_update_features only for NETIF_F_UPPER_DISABLES. So it
> > doesn't propagate all features, only LRO AFAICT.
>
> Got it, I didn't look into what netdev_sync_lower_features actually
> does besides noticing it can call __netdev_update_features...
>
> In any case, please hold off with picking this patch up, it seems
> there's a possibility of a real deadlock. Here's the scenario:
>
> ============================================
> WARNING: possible recursive locking detected
> --------------------------------------------
> ethtool/44150 is trying to acquire lock:
> ffff8881364e8c80 (&dev_instance_lock_key#7){+.+.}-{4:4}, at:
> __netdev_update_features+0x31e/0xe20
>
> but task is already holding lock:
> ffff8881364e8c80 (&dev_instance_lock_key#7){+.+.}-{4:4}, at:
> ethnl_set_features+0xbc/0x4b0
> and the lock comparison function returns 0:
>
> other info that might help us debug this:
>  Possible unsafe locking scenario:
>
>        CPU0
>        ----
>   lock(&dev_instance_lock_key#7);
>   lock(&dev_instance_lock_key#7);
>
>  *** DEADLOCK ***
>
>  May be due to missing lock nesting notation
>
> 3 locks held by ethtool/44150:
>  #0: ffffffff830e5a50 (cb_lock){++++}-{4:4}, at: genl_rcv+0x15/0x40
>  #1: ffffffff830cf708 (rtnl_mutex){+.+.}-{4:4}, at:
> ethnl_set_features+0x88/0x4b0
>  #2: ffff8881364e8c80 (&dev_instance_lock_key#7){+.+.}-{4:4}, at:
> ethnl_set_features+0xbc/0x4b0
>
> stack backtrace:
> Call Trace:
>  <TASK>
>  dump_stack_lvl+0x69/0xa0
>  print_deadlock_bug.cold+0xbd/0xca
>  __lock_acquire+0x163c/0x2f00
>  lock_acquire+0xd3/0x2e0
>  __mutex_lock+0x98/0xf10
>  __netdev_update_features+0x31e/0xe20
>  netdev_update_features+0x1f/0x60
>  vlan_device_event+0x57d/0x930 [8021q]
>  notifier_call_chain+0x3d/0x100
>  netdev_features_change+0x32/0x50
>  ethnl_set_features+0x17e/0x4b0
>  genl_family_rcv_msg_doit+0xe0/0x130
>  genl_rcv_msg+0x188/0x290
>  [...]
>
> I'm not sure how to solve this yet...
> Cosmin.

If it's not clear, the problem is that:
1. The lower device is already ops-locked.
2. netdev_features_change gets called.
3. __netdev_update_features gets called for the vlan (upper) dev.
4. It tries to acquire the same lock instance as in step 1 (this patch).
5. Deadlock.

One solution I can think of would be to run device notifiers for
changing features outside the lock; it doesn't seem like the netdev
lock has anything to do with what other devices might do with this
information.

This can be triggered from many scenarios; I have another similar
stack involving bonding.

What do you think?
Cosmin.
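[The failure mode described above is plain same-thread re-acquisition of a
non-recursive mutex. A minimal userspace analogy, using nothing
kernel-specific (the function names mirror the call chain in the splat but
are otherwise illustrative); built with `cc -pthread`, it blocks forever on
the second lock, which is exactly what lockdep reports before the kernel
would hang:]

#include <pthread.h>
#include <stdio.h>

/* Stand-in for one netdev instance lock (non-recursive by default). */
static pthread_mutex_t dev_instance_lock = PTHREAD_MUTEX_INITIALIZER;

/* Stand-in for the notifier chain: vlan_device_event() ends up in
 * __netdev_update_features() for the upper dev, which tries to lock
 * the lower dev again.
 */
static void feat_change_notifier(void)
{
        pthread_mutex_lock(&dev_instance_lock);  /* second acquisition: hangs */
        pthread_mutex_unlock(&dev_instance_lock);
}

int main(void)
{
        pthread_mutex_lock(&dev_instance_lock);  /* ethnl_set_features() */
        feat_change_notifier();                  /* never returns */
        pthread_mutex_unlock(&dev_instance_lock);
        puts("unreachable");
        return 0;
}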
On Wed, 2025-05-07 at 15:13 +0000, Cosmin Ratiu wrote:
> > In any case, please hold off with picking this patch up, it seems
> > there's a possibility of a real deadlock. Here's the scenario:
> >
> > ============================================
> > WARNING: possible recursive locking detected
> > --------------------------------------------
> > ethtool/44150 is trying to acquire lock:
> > ffff8881364e8c80 (&dev_instance_lock_key#7){+.+.}-{4:4}, at:
> > __netdev_update_features+0x31e/0xe20
> >
> > but task is already holding lock:
> > ffff8881364e8c80 (&dev_instance_lock_key#7){+.+.}-{4:4}, at:
> > ethnl_set_features+0xbc/0x4b0
> > and the lock comparison function returns 0:
> >
> > other info that might help us debug this:
> >  Possible unsafe locking scenario:
> >
> >        CPU0
> >        ----
> >   lock(&dev_instance_lock_key#7);
> >   lock(&dev_instance_lock_key#7);
> >
> >  *** DEADLOCK ***
> >
> >  May be due to missing lock nesting notation
> >
> > 3 locks held by ethtool/44150:
> >  #0: ffffffff830e5a50 (cb_lock){++++}-{4:4}, at: genl_rcv+0x15/0x40
> >  #1: ffffffff830cf708 (rtnl_mutex){+.+.}-{4:4}, at:
> > ethnl_set_features+0x88/0x4b0
> >  #2: ffff8881364e8c80 (&dev_instance_lock_key#7){+.+.}-{4:4}, at:
> > ethnl_set_features+0xbc/0x4b0
> >
> > stack backtrace:
> > Call Trace:
> >  <TASK>
> >  dump_stack_lvl+0x69/0xa0
> >  print_deadlock_bug.cold+0xbd/0xca
> >  __lock_acquire+0x163c/0x2f00
> >  lock_acquire+0xd3/0x2e0
> >  __mutex_lock+0x98/0xf10
> >  __netdev_update_features+0x31e/0xe20
> >  netdev_update_features+0x1f/0x60
> >  vlan_device_event+0x57d/0x930 [8021q]
> >  notifier_call_chain+0x3d/0x100
> >  netdev_features_change+0x32/0x50
> >  ethnl_set_features+0x17e/0x4b0
> >  genl_family_rcv_msg_doit+0xe0/0x130
> >  genl_rcv_msg+0x188/0x290
> >  [...]
> >
> > I'm not sure how to solve this yet...
> > Cosmin.
>
> If it's not clear, the problem is that:
> 1. The lower device is already ops-locked.
> 2. netdev_features_change gets called.
> 3. __netdev_update_features gets called for the vlan (upper) dev.
> 4. It tries to acquire the same lock instance as in step 1 (this patch).
> 5. Deadlock.
>
> One solution I can think of would be to run device notifiers for
> changing features outside the lock; it doesn't seem like the netdev
> lock has anything to do with what other devices might do with this
> information.
>
> This can be triggered from many scenarios; I have another similar
> stack involving bonding.
>
> What do you think?

All I could think of was to drop the lock during the
netdev_features_change notifier calls, like in the following hunk. I'm
running this through regressions; let's see whether it's a good idea
or not.

diff --git a/net/core/dev.c b/net/core/dev.c
index 1be7cb73a602..817fd5bc21b1 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1514,7 +1514,12 @@ int dev_get_alias(const struct net_device *dev, char *name, size_t len)
  */
 void netdev_features_change(struct net_device *dev)
 {
+	/* Drop the lock to avoid potential deadlocks from e.g. upper dev
+	 * notifiers altering features of 'dev' and acquiring dev lock again.
+	 */
+	netdev_unlock_ops(dev);
 	call_netdevice_notifiers(NETDEV_FEAT_CHANGE, dev);
+	netdev_lock_ops(dev);
 }
 EXPORT_SYMBOL(netdev_features_change);
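[A note on the splat's "May be due to missing lock nesting notation" line:
that hint covers the case where two *different* lock instances of the same
lockdep class are nested deliberately, and the inner acquisition is
annotated, e.g. with mutex_lock_nested(). A fragment for illustration only;
the helper is hypothetical and is not part of any patch in this thread:]

/* Illustration only: annotating a deliberate upper -> lower nesting of
 * two different mutex instances of the same class, so lockdep can tell
 * them apart. This does NOT apply to the report above: there, one and
 * the same instance (ffff8881364e8c80) is acquired twice, which is a
 * real deadlock rather than a missing annotation.
 */
static void nested_lock_example(struct net_device *upper,
                                struct net_device *lower)
{
        mutex_lock(&upper->lock);
        mutex_lock_nested(&lower->lock, SINGLE_DEPTH_NESTING);
        /* ... work on both devices ... */
        mutex_unlock(&lower->lock);
        mutex_unlock(&upper->lock);
}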
On Wed, 2025-05-07 at 14:20 -0700, Stanislav Fomichev wrote:
> On 05/07, Cosmin Ratiu wrote:
> > On Wed, 2025-05-07 at 15:13 +0000, Cosmin Ratiu wrote:
> > > > In any case, please hold off with picking this patch up, it
> > > > seems there's a possibility of a real deadlock. Here's the
> > > > scenario:
>
> Hmm, are you sure you're calling __netdev_update_features on the
> upper? I don't see how the lower would be locked in that case. From
> my POV, this is what happens:
>
> 1. your dev (lower) has a vlan on it (upper)
> 2. you call lro=off on the _lower_
> 3. this triggers the FEAT_CHANGE notifier and vlan_device_event
>    catches it
> 4. since the lower has a vlan device (dev->vlan_info != NULL), it
>    goes over every other vlan in the group and triggers
>    netdev_update_features for the upper (netdev_update_features(vlandev))
> 5. the upper tries to sync the features into the lower (including the
>    one that triggered FEAT_CHANGE) and that's where the deadlock
>    happens
>
> But I think (5) should be largely a no-op for the device triggering
> the notification, because the features have been already applied in
> ethnl_set_features.
> I'd move the lock into netdev_sync_lower_features, and only after
> checking the features (and making sure that we are going to change
> them). The feature check might be racy, but I think it should still
> work?

You are right: if I restrict the lower dev critical section to only
the call to __netdev_update_features for the lower dev, there's no
deadlock any more, because the device with the lock held already had
its features updated.

I will send a new version of this patch soon, after the full
regression suite finishes and I convince myself there are no more
issues related to this that we can encounter.

> Can you also share the bonding stacktrace as well to confirm it's
> the same issue?

Sure, here it is; it's the same scenario. bond_netdev_event gets
called on a slave dev, it recomputes features and updates all slaves
(bond_compute_features), and then the same lock is reacquired. But
this is also fixed with your suggestion above.
============================================
WARNING: possible recursive locking detected
--------------------------------------------
devlink/14341 is trying to acquire lock:
ffff88810ebd8c80 (&dev_instance_lock_key#9){+.+.}-{4:4}, at:
__netdev_update_features+0x31e/0xe20

but task is already holding lock:
ffff88810ebd8c80 (&dev_instance_lock_key#9){+.+.}-{4:4}, at:
mlx5e_attach_netdev+0x31f/0x360 [mlx5_core]
and the lock comparison function returns 0:

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(&dev_instance_lock_key#9);
  lock(&dev_instance_lock_key#9);

 *** DEADLOCK ***

 May be due to missing lock nesting notation

4 locks held by devlink/14341:
 #0: ffffffff830e5a50 (cb_lock){++++}-{4:4}, at: genl_rcv+0x15/0x40
 #1: ffff888164a5c250 (&devlink->lock_key){+.+.}-{4:4}, at:
devlink_get_from_attrs_lock+0xbc/0x180
 #2: ffffffff830cf708 (rtnl_mutex){+.+.}-{4:4}, at:
mlx5e_attach_netdev+0x30d/0x360 [mlx5_core]
 #3: ffff88810ebd8c80 (&dev_instance_lock_key#9){+.+.}-{4:4}, at:
mlx5e_attach_netdev+0x31f/0x360 [mlx5_core]

Call Trace:
 <TASK>
 dump_stack_lvl+0x69/0xa0
 print_deadlock_bug.cold+0xbd/0xca
 __lock_acquire+0x163c/0x2f00
 lock_acquire+0xd3/0x2e0
 __mutex_lock+0x98/0xf10
 __netdev_update_features+0x31e/0xe20
 netdev_change_features+0x1f/0x60
 bond_compute_features+0x24e/0x300 [bonding]
 bond_netdev_event+0x2e0/0x400 [bonding]
 notifier_call_chain+0x3d/0x100
 netdev_update_features+0x52/0x60
 mlx5e_attach_netdev+0x32f/0x360 [mlx5_core]
 mlx5e_netdev_attach_profile+0x48/0x90 [mlx5_core]
 mlx5e_netdev_change_profile+0x90/0xf0 [mlx5_core]
 mlx5e_vport_rep_load+0x414/0x490 [mlx5_core]
 __esw_offloads_load_rep+0x87/0xd0 [mlx5_core]
 mlx5_esw_offloads_rep_load+0x45/0xe0 [mlx5_core]
 esw_offloads_enable+0xb7b/0xca0 [mlx5_core]
 mlx5_eswitch_enable_locked+0x293/0x430 [mlx5_core]
 mlx5_devlink_eswitch_mode_set+0x229/0x620 [mlx5_core]
 devlink_nl_eswitch_set_doit+0x60/0xd0
 genl_family_rcv_msg_doit+0xe0/0x130
 genl_rcv_msg+0x188/0x290
 netlink_rcv_skb+0x4b/0xf0
 genl_rcv+0x24/0x40
 netlink_unicast+0x1e1/0x2c0
 netlink_sendmsg+0x210/0x450

Cosmin.
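[To make the agreed-upon direction concrete: a sketch of where the narrower
critical section would sit, loosely modeled on netdev_sync_lower_features()
in net/core/dev.c. Only the lock placement is the point here; the rest is
reconstructed from context and may differ from the final patch:]

/* Sketch only; details may differ from the revised patch. The lower's
 * ops lock is taken only once we know a feature bit actually has to be
 * cleared on it.
 */
static void netdev_sync_lower_features(struct net_device *upper,
                                       struct net_device *lower,
                                       netdev_features_t features)
{
        netdev_features_t upper_disables = NETIF_F_UPPER_DISABLES;
        netdev_features_t feature;
        int feature_bit;

        for_each_netdev_feature(upper_disables, feature_bit) {
                feature = __NETIF_F_BIT(feature_bit);
                if (!(features & feature) && (lower->features & feature)) {
                        netdev_dbg(upper, "Disabling feature %pNF on lower dev %s.\n",
                                   &feature, lower->name);
                        lower->wanted_features &= ~feature;
                        /* Lock only when a change is really needed. The
                         * device that triggered the notifier already had
                         * its features applied, fails the check above,
                         * and so never has its (held) lock re-acquired.
                         */
                        netdev_lock_ops(lower);
                        __netdev_update_features(lower);
                        netdev_unlock_ops(lower);
                }
        }
}

[Because the feature check runs before the lock is taken, a lower device
whose features were already applied, and whose lock is held further up the
call chain, no longer gets re-locked; as noted above, the unlocked check may
be racy in principle but is expected to still work.]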
diff --git a/net/core/dev.c b/net/core/dev.c
index 1be7cb73a602..77472364225c 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -10620,8 +10620,11 @@ int __netdev_update_features(struct net_device *dev)
 	/* some features must be disabled on lower devices when disabled
 	 * on an upper device (think: bonding master or bridge)
 	 */
-	netdev_for_each_lower_dev(dev, lower, iter)
+	netdev_for_each_lower_dev(dev, lower, iter) {
+		netdev_lock_ops(lower);
 		netdev_sync_lower_features(dev, lower, features);
+		netdev_unlock_ops(lower);
+	}
 
 	if (!err) {
 		netdev_features_t diff = features ^ dev->features;
__netdev_update_features() expects the netdevice to be ops-locked, but
it gets called recursively on the lower-level netdevices to sync their
features, and nothing locks those.

This commit fixes that, with the assumption that it shouldn't be
possible for both higher-level and lower-level netdevices to require
the instance lock, because that would lead to lock dependency warnings.

Without this, playing with higher-level (e.g. vxlan) netdevices on top
of netdevices with instance locking enabled can run into issues:

WARNING: CPU: 59 PID: 206496 at ./include/net/netdev_lock.h:17 netif_napi_add_weight_locked+0x753/0xa60
[...]
Call Trace:
 <TASK>
 mlx5e_open_channel+0xc09/0x3740 [mlx5_core]
 mlx5e_open_channels+0x1f0/0x770 [mlx5_core]
 mlx5e_safe_switch_params+0x1b5/0x2e0 [mlx5_core]
 set_feature_lro+0x1c2/0x330 [mlx5_core]
 mlx5e_handle_feature+0xc8/0x140 [mlx5_core]
 mlx5e_set_features+0x233/0x2e0 [mlx5_core]
 __netdev_update_features+0x5be/0x1670
 __netdev_update_features+0x71f/0x1670
 dev_ethtool+0x21c5/0x4aa0
 dev_ioctl+0x438/0xae0
 sock_ioctl+0x2ba/0x690
 __x64_sys_ioctl+0xa78/0x1700
 do_syscall_64+0x6d/0x140
 entry_SYSCALL_64_after_hwframe+0x4b/0x53
 </TASK>

Fixes: 7e4d784f5810 ("net: hold netdev instance lock during rtnetlink operations")
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
---
 net/core/dev.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)
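[For context on the WARNING above: include/net/netdev_lock.h is where the
instance-lock assertions live, and netif_napi_add_weight_locked() fires one
of them when called without the instance lock held. The assert is presumably
along these lines; this is a sketch, not the verbatim upstream definition:]

/* Sketch of the instance-lock assert in include/net/netdev_lock.h;
 * the exact upstream definition and line number may differ.
 */
static inline void netdev_assert_locked(struct net_device *dev)
{
        lockdep_assert_held(&dev->lock);
}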