diff mbox series

[net] team: postpone features update to avoid deadlock

Message ID 20210120122354.3687556-1-ivecera@redhat.com
State New
Series [net] team: postpone features update to avoid deadlock

Commit Message

Ivan Vecera Jan. 20, 2021, 12:23 p.m. UTC
The team driver protects port list traversal with its team->lock mutex
in functions like team_change_mtu(), team_set_rx_mode(),
team_vlan_rx_{add,del}_vid() etc.
These functions call the appropriate callbacks of all enslaved
devices. Some drivers need to update their features under
certain conditions (e.g. TSO is broken for jumbo frames) so
they call netdev_update_features(). This causes a deadlock because
netdev_update_features() calls the netdevice notifiers, and one of them
is team_device_event(), which in the NETDEV_FEAT_CHANGE case tries to take
the team->lock mutex again.

Example (r8169 case):
...
[ 6391.348202]  __mutex_lock.isra.6+0x2d0/0x4a0
[ 6391.358602]  team_device_event+0x9d/0x160 [team]
[ 6391.363756]  notifier_call_chain+0x47/0x70
[ 6391.368329]  netdev_update_features+0x56/0x60
[ 6391.373207]  rtl8169_change_mtu+0x14/0x50 [r8169]
[ 6391.378457]  dev_set_mtu_ext+0xe1/0x1d0
[ 6391.387022]  dev_set_mtu+0x52/0x90
[ 6391.390820]  team_change_mtu+0x64/0xf0 [team]
[ 6391.395683]  dev_set_mtu_ext+0xe1/0x1d0
[ 6391.399963]  do_setlink+0x231/0xf50
...
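
The re-entrancy in the trace above can be modelled in user space. The
following is only a sketch under stated assumptions: the lock_model_*
names are hypothetical stand-ins for the team functions, and a pthread
error-checking mutex replaces the kernel mutex so that the nested lock
attempt reports EDEADLK instead of hanging.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700	/* for PTHREAD_MUTEX_ERRORCHECK */
#endif
#include <errno.h>
#include <pthread.h>

/* Hypothetical user-space stand-in for struct team; not kernel code. */
struct lock_model {
	pthread_mutex_t lock;	/* plays the role of team->lock */
	int relock_err;		/* result of the nested lock attempt */
};

/* Set up an error-checking mutex so a self-deadlock is reported, not hit. */
static int lock_model_init(struct lock_model *t)
{
	pthread_mutexattr_t attr;
	int err;

	t->relock_err = 0;
	err = pthread_mutexattr_init(&attr);
	if (err)
		return err;
	err = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
	if (!err)
		err = pthread_mutex_init(&t->lock, &attr);
	pthread_mutexattr_destroy(&attr);
	return err;
}

/* Models team_device_event() handling NETDEV_FEAT_CHANGE. */
static void lock_model_device_event(struct lock_model *t)
{
	t->relock_err = pthread_mutex_lock(&t->lock);
	if (t->relock_err == 0)
		pthread_mutex_unlock(&t->lock);
}

/* Models a port driver's MTU callback that ends up firing the notifier. */
static void lock_model_port_change_mtu(struct lock_model *t)
{
	lock_model_device_event(t);
}

/* Models team_change_mtu(): calls into the port with the lock held. */
static int lock_model_team_change_mtu(struct lock_model *t)
{
	int err = pthread_mutex_lock(&t->lock);

	if (err)
		return err;
	lock_model_port_change_mtu(t);
	pthread_mutex_unlock(&t->lock);
	return t->relock_err;
}
```

Calling lock_model_init() and then lock_model_team_change_mtu() returns
EDEADLK from the nested lock attempt. With a plain (non error-checking)
mutex the same call chain would block forever, which is the kernel case.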

To fix the problem, postpone __team_compute_features() in these cases:
if team->lock is already held, schedule the update as a work item
instead of blocking on the mutex.

Fixes: 3d249d4ca7d0 ("net: introduce ethernet teaming device")
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
---
 drivers/net/team/team.c | 36 +++++++++++++++++++++++++++++++++++-
 include/linux/if_team.h |  1 +
 2 files changed, 36 insertions(+), 1 deletion(-)
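
The trylock-or-defer approach taken by this patch can be illustrated
with a small user-space model. This is a sketch, not the kernel code:
the defer_model_* names are hypothetical, a pthread mutex stands in for
team->lock, and a plain work_pending flag stands in for the queued
work item (the patch itself uses schedule_work()). The model is
single-threaded, so the flag is not protected against races.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for struct team; not kernel code. */
struct defer_model {
	pthread_mutex_t lock;	/* plays the role of team->lock */
	int port_features;	/* what the ports currently offer */
	int features;		/* what the team device advertises */
	bool work_pending;	/* stand-in for a queued work item */
};

static int defer_model_init(struct defer_model *t)
{
	t->port_features = 0;
	t->features = 0;
	t->work_pending = false;
	return pthread_mutex_init(&t->lock, NULL);
}

/* Models __team_compute_features(): must run with the lock held. */
static void defer_model_compute_locked(struct defer_model *t)
{
	t->features = t->port_features;
}

/*
 * Models the patched team_compute_features(): compute now if the lock
 * is free, otherwise mark the work as pending.
 * Returns true if the features were recomputed immediately.
 */
static bool defer_model_compute(struct defer_model *t)
{
	if (pthread_mutex_trylock(&t->lock) == 0) {
		defer_model_compute_locked(t);
		pthread_mutex_unlock(&t->lock);
		return true;
	}
	t->work_pending = true;
	return false;
}

/* Models the work handler running later, outside the original call. */
static void defer_model_run_pending(struct defer_model *t)
{
	if (!t->work_pending)
		return;
	t->work_pending = false;
	pthread_mutex_lock(&t->lock);
	defer_model_compute_locked(t);
	pthread_mutex_unlock(&t->lock);
}
```

In this model, a defer_model_compute() call made while the lock is held
returns false and leaves features unchanged until
defer_model_run_pending() runs. That window is the one the review
comments below ask about.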

Comments

Cong Wang Jan. 20, 2021, 11:18 p.m. UTC | #1
On Wed, Jan 20, 2021 at 4:56 AM Ivan Vecera <ivecera@redhat.com> wrote:
>
> To fix the problem __team_compute_features() needs to be postponed
> for these cases.

Is there any user-visible effect after deferring this feature change?

Thanks.
Ivan Vecera Jan. 21, 2021, 10:29 a.m. UTC | #2
On Wed, 20 Jan 2021 15:18:20 -0800
Cong Wang <xiyou.wangcong@gmail.com> wrote:

> On Wed, Jan 20, 2021 at 4:56 AM Ivan Vecera <ivecera@redhat.com> wrote:
> >
> > To fix the problem __team_compute_features() needs to be postponed
> > for these cases.
>
> Is there any user-visible effect after deferring this feature change?
>

A user should not notice this change.

I.
Jakub Kicinski Jan. 22, 2021, 2:34 a.m. UTC | #3
On Thu, 21 Jan 2021 11:29:37 +0100 Ivan Vecera wrote:
> On Wed, 20 Jan 2021 15:18:20 -0800
> Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > On Wed, Jan 20, 2021 at 4:56 AM Ivan Vecera <ivecera@redhat.com> wrote:
> > > Team driver protects port list traversal by its team->lock mutex
> > > in functions like team_change_mtu(), team_set_rx_mode(),

The set_rx_mode part can't be true, set_rx_mode can't sleep and
team->lock is a mutex.

> > > To fix the problem __team_compute_features() needs to be postponed
> > > for these cases.
> >
> > Is there any user-visible effect after deferring this feature change?
>
> An user should not notice this change.

I think Cong is right, can you expand a little on your assertion?
A user should be able to assume that the features have settled by the
time the syscall returns.

What does the team->lock mutex actually protect in team_compute_features()?
At a quick glance, all callers seem to hold RTNL. This is a bit of a long
shot, but isn't it just trying to protect the iteration over ports, which
could be done under RCU instead?

A cruder idea would be to wrap mutex_unlock(&team->lock) in a helper that
checks whether something tried to change features while the lock was held,
rtnl_unlock()-style.
Ivan Vecera Jan. 22, 2021, 8:30 a.m. UTC | #4
On Thu, 21 Jan 2021 18:34:52 -0800
Jakub Kicinski <kuba@kernel.org> wrote:

> On Thu, 21 Jan 2021 11:29:37 +0100 Ivan Vecera wrote:
> > On Wed, 20 Jan 2021 15:18:20 -0800
> > Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > > On Wed, Jan 20, 2021 at 4:56 AM Ivan Vecera <ivecera@redhat.com> wrote:
> > > > Team driver protects port list traversal by its team->lock mutex
> > > > in functions like team_change_mtu(), team_set_rx_mode(),
>
> The set_rx_mode part can't be true, set_rx_mode can't sleep and
> team->lock is a mutex.
>
> > > > To fix the problem __team_compute_features() needs to be postponed
> > > > for these cases.
> > >
> > > Is there any user-visible effect after deferring this feature change?
> >
> > An user should not notice this change.
>
> I think Cong is right, can you expand a little on your assertion?
> User should be able to assume that the moment syscall returns the
> features had settled.
>
> What does team->mutex actually protect in team_compute_features()?
> All callers seem to hold RTNL at a quick glance. This is a bit of
> a long shot but isn't it just tryin to protect the iteration over
> ports which could be under RCU?

In fact the mutex could be removed altogether, because all port-list
writers run under rtnl_lock, as do some readers such as team_change_mtu()
and team_device_event() [notifier], and hot-path readers are
protected by RCU.
I have discussed this with Jiri, but he doesn't want to introduce any
dependency on RTNL into team, as it was designed to be RTNL-independent
from the beginning.

Anyway, your idea to run team_compute_features() under RCU could be fine,
as the subsequent __team_compute_features() cannot sleep...

Do you mean something like this?

diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
index c19dac21c468..dd7917cab2b1 100644
--- a/drivers/net/team/team.c
+++ b/drivers/net/team/team.c
@@ -992,7 +992,8 @@ static void __team_compute_features(struct team *team)
        unsigned int dst_release_flag = IFF_XMIT_DST_RELEASE |
                                        IFF_XMIT_DST_RELEASE_PERM;
 
-       list_for_each_entry(port, &team->port_list, list) {
+       rcu_read_lock();
+       list_for_each_entry_rcu(port, &team->port_list, list) {
                vlan_features = netdev_increment_features(vlan_features,
                                        port->dev->vlan_features,
                                        TEAM_VLAN_FEATURES);
@@ -1006,6 +1007,7 @@ static void __team_compute_features(struct team *team)
                if (port->dev->hard_header_len > max_hard_header_len)
                        max_hard_header_len = port->dev->hard_header_len;
        }
+       rcu_read_unlock();
 
        team->dev->vlan_features = vlan_features;
        team->dev->hw_enc_features = enc_features | NETIF_F_GSO_ENCAP_ALL |
@@ -1020,9 +1022,7 @@ static void __team_compute_features(struct team *team)
 
 static void team_compute_features(struct team *team)
 {
-       mutex_lock(&team->lock);
        __team_compute_features(team);
-       mutex_unlock(&team->lock);
        netdev_change_features(team->dev);
 }

Thanks for comments,

Ivan
Jakub Kicinski Jan. 22, 2021, 6:03 p.m. UTC | #5
On Fri, 22 Jan 2021 09:30:27 +0100 Ivan Vecera wrote:
> On Thu, 21 Jan 2021 18:34:52 -0800
> Jakub Kicinski <kuba@kernel.org> wrote:
>
> > On Thu, 21 Jan 2021 11:29:37 +0100 Ivan Vecera wrote:
> > > On Wed, 20 Jan 2021 15:18:20 -0800
> > > Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > > > On Wed, Jan 20, 2021 at 4:56 AM Ivan Vecera <ivecera@redhat.com> wrote:
> > > > > Team driver protects port list traversal by its team->lock mutex
> > > > > in functions like team_change_mtu(), team_set_rx_mode(),
> >
> > The set_rx_mode part can't be true, set_rx_mode can't sleep and
> > team->lock is a mutex.
> >
> > > > > To fix the problem __team_compute_features() needs to be postponed
> > > > > for these cases.
> > > >
> > > > Is there any user-visible effect after deferring this feature change?
> > >
> > > An user should not notice this change.
> >
> > I think Cong is right, can you expand a little on your assertion?
> > User should be able to assume that the moment syscall returns the
> > features had settled.
> >
> > What does team->mutex actually protect in team_compute_features()?
> > All callers seem to hold RTNL at a quick glance. This is a bit of
> > a long shot but isn't it just tryin to protect the iteration over
> > ports which could be under RCU?
>
> In fact the mutex could be removed at all because all port-list
> writers are running under rtnl_lock, some readers like team_change_mtu()
> or team_device_event() [notifier] as well and hot path readers are
> protected by RCU.
> I have discussed this with Jiri but he don't want to introduce any dependency
> on RTNL to team as it was designed as RTNL-independent from beginning.
>
> Anyway your idea to run team_compute_features under RCU could be fine
> as subsequent __team_compute_features() cannot sleep...
>
> Do you mean something like this?
>
> diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
> index c19dac21c468..dd7917cab2b1 100644
> --- a/drivers/net/team/team.c
> +++ b/drivers/net/team/team.c
> @@ -992,7 +992,8 @@ static void __team_compute_features(struct team *team)
>         unsigned int dst_release_flag = IFF_XMIT_DST_RELEASE |
>                                         IFF_XMIT_DST_RELEASE_PERM;
>
> -       list_for_each_entry(port, &team->port_list, list) {
> +       rcu_read_lock();
> +       list_for_each_entry_rcu(port, &team->port_list, list) {
>                 vlan_features = netdev_increment_features(vlan_features,
>                                         port->dev->vlan_features,
>                                         TEAM_VLAN_FEATURES);
> @@ -1006,6 +1007,7 @@ static void __team_compute_features(struct team *team)
>                 if (port->dev->hard_header_len > max_hard_header_len)
>                         max_hard_header_len = port->dev->hard_header_len;
>         }
> +       rcu_read_unlock();
>
>         team->dev->vlan_features = vlan_features;
>         team->dev->hw_enc_features = enc_features | NETIF_F_GSO_ENCAP_ALL |
> @@ -1020,9 +1022,7 @@ static void __team_compute_features(struct team *team)
>
>  static void team_compute_features(struct team *team)
>  {
> -       mutex_lock(&team->lock);
>         __team_compute_features(team);
> -       mutex_unlock(&team->lock);
>         netdev_change_features(team->dev);
>  }

Yup, like this, but if Jiri doesn't like it then I guess we need to
come up with something else?

How about doing the work on unlock? Have some bit set when we had to
defer and then run __team_compute_features() before releasing the lock
for real?
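
One way to picture the "do the work on unlock" idea is the user-space
model below. It is a sketch under stated assumptions, not a proposed
patch: the unlock_model_* names and the features_dirty flag are
hypothetical, a pthread mutex stands in for team->lock, and all callers
are assumed to be serialized (as RTNL would do in the kernel), so a
failed trylock means the request is nested under the current lock
holder.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for struct team; not kernel code. */
struct unlock_model {
	pthread_mutex_t lock;	/* plays the role of team->lock */
	bool features_dirty;	/* the "bit set when we had to defer" */
	int port_features;	/* what the ports currently offer */
	int features;		/* what the team device advertises */
};

static int unlock_model_init(struct unlock_model *t)
{
	t->features_dirty = false;
	t->port_features = 0;
	t->features = 0;
	return pthread_mutex_init(&t->lock, NULL);
}

/* Models __team_compute_features(): must run with the lock held. */
static void unlock_model_compute_locked(struct unlock_model *t)
{
	t->features = t->port_features;
}

static void unlock_model_lock(struct unlock_model *t)
{
	pthread_mutex_lock(&t->lock);
}

/*
 * Models the notifier path: compute now if the lock is free, otherwise
 * only record that a recompute is owed to the current lock holder.
 */
static void unlock_model_request_compute(struct unlock_model *t)
{
	if (pthread_mutex_trylock(&t->lock) == 0) {
		unlock_model_compute_locked(t);
		pthread_mutex_unlock(&t->lock);
	} else {
		t->features_dirty = true;
	}
}

/*
 * Unlock helper: run any deferred recompute before really releasing
 * the lock. Returns true if a recompute happened, so the caller could
 * send its own change notification after the lock is dropped.
 */
static bool unlock_model_unlock(struct unlock_model *t)
{
	bool recomputed = t->features_dirty;

	if (recomputed) {
		t->features_dirty = false;
		unlock_model_compute_locked(t);
	}
	pthread_mutex_unlock(&t->lock);
	return recomputed;
}
```

In this model the features are up to date by the time the outermost
caller drops the lock, so nothing is left pending when the operation
returns.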

Patch

diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
index c19dac21c468..f66d38b0e70a 100644
--- a/drivers/net/team/team.c
+++ b/drivers/net/team/team.c
@@ -975,6 +975,10 @@  static void team_port_disable(struct team *team,
 	team_lower_state_changed(port);
 }
 
+/*******************
+ * Compute features
+ *******************/
+
 #define TEAM_VLAN_FEATURES (NETIF_F_HW_CSUM | NETIF_F_SG | \
 			    NETIF_F_FRAGLIST | NETIF_F_GSO_SOFTWARE | \
 			    NETIF_F_HIGHDMA | NETIF_F_LRO)
@@ -1018,12 +1022,39 @@  static void __team_compute_features(struct team *team)
 		team->dev->priv_flags |= IFF_XMIT_DST_RELEASE;
 }
 
-static void team_compute_features(struct team *team)
+static void team_compute_features_work(struct work_struct *work)
 {
+	struct team *team;
+
+	team = container_of(work, struct team, compute_features_task);
 	mutex_lock(&team->lock);
 	__team_compute_features(team);
 	mutex_unlock(&team->lock);
+
+	rtnl_lock();
 	netdev_change_features(team->dev);
+	rtnl_unlock();
+}
+
+static void team_compute_features(struct team *team)
+{
+	if (mutex_trylock(&team->lock)) {
+		__team_compute_features(team);
+		mutex_unlock(&team->lock);
+		netdev_change_features(team->dev);
+	} else {
+		schedule_work(&team->compute_features_task);
+	}
+}
+
+static void team_compute_features_init(struct team *team)
+{
+	INIT_WORK(&team->compute_features_task, team_compute_features_work);
+}
+
+static void team_compute_features_fini(struct team *team)
+{
+	cancel_work_sync(&team->compute_features_task);
 }
 
 static int team_port_enter(struct team *team, struct team_port *port)
@@ -1639,6 +1670,7 @@  static int team_init(struct net_device *dev)
 
 	team_notify_peers_init(team);
 	team_mcast_rejoin_init(team);
+	team_compute_features_init(team);
 
 	err = team_options_register(team, team_options, ARRAY_SIZE(team_options));
 	if (err)
@@ -1652,6 +1684,7 @@  static int team_init(struct net_device *dev)
 	return 0;
 
 err_options_register:
+	team_compute_features_fini(team);
 	team_mcast_rejoin_fini(team);
 	team_notify_peers_fini(team);
 	team_queue_override_fini(team);
@@ -1673,6 +1706,7 @@  static void team_uninit(struct net_device *dev)
 
 	__team_change_mode(team, NULL); /* cleanup */
 	__team_options_unregister(team, team_options, ARRAY_SIZE(team_options));
+	team_compute_features_fini(team);
 	team_mcast_rejoin_fini(team);
 	team_notify_peers_fini(team);
 	team_queue_override_fini(team);
diff --git a/include/linux/if_team.h b/include/linux/if_team.h
index add607943c95..581d79552bbd 100644
--- a/include/linux/if_team.h
+++ b/include/linux/if_team.h
@@ -208,6 +208,7 @@  struct team {
 	bool queue_override_enabled;
 	struct list_head *qom_lists; /* array of queue override mapping lists */
 	bool port_mtu_change_allowed;
+	struct work_struct compute_features_task;
 	struct {
 		unsigned int count;
 		unsigned int interval; /* in ms */