diff mbox series

[RFC,net-next,05/13] net: bonding: hold the netdev lists lock when retrieving device statistics

Message ID 20201206235919.393158-6-vladimir.oltean@nxp.com
State New
Series [RFC,net-next,01/13] RDMA/mlx4: remove bogus dev_base_lock usage | expand

Commit Message

Vladimir Oltean Dec. 6, 2020, 11:59 p.m. UTC
In the effort to make .ndo_get_stats64 able to sleep, we need to
ensure that callers of dev_get_stats do not run in atomic context.

The bonding driver uses an RCU read-side critical section to ensure the
integrity of the list of network interfaces, because the driver iterates
through all net devices in the netns to find the ones which are its
configured slaves. We still need some protection against an interface
registering or deregistering, and the writer-side lock, the netns mutex,
is fine for that, because it offers sleepable context.

The netns mutex now serves double duty: it also provides the
serialization that bond->stats_lock used to provide. So stats_lock no
longer serves any purpose; let's remove it.

Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 drivers/net/bonding/bond_main.c | 16 +++++-----------
 include/net/bonding.h           |  1 -
 2 files changed, 5 insertions(+), 12 deletions(-)

Comments

Vladimir Oltean Dec. 7, 2020, 1 a.m. UTC | #1
On Mon, Dec 07, 2020 at 01:59:11AM +0200, Vladimir Oltean wrote:
> In the effort to make .ndo_get_stats64 able to sleep, we need to
> ensure that callers of dev_get_stats do not run in atomic context.
>
> The bonding driver uses an RCU read-side critical section to ensure the
> integrity of the list of network interfaces, because the driver iterates
> through all net devices in the netns to find the ones which are its
> configured slaves. We still need some protection against an interface
> registering or deregistering, and the writer-side lock, the netns mutex,
> is fine for that, because it offers sleepable context.
>
> The netns mutex now serves double duty: it also provides the
> serialization that bond->stats_lock used to provide. So stats_lock no
> longer serves any purpose; let's remove it.
>
> Cc: Jay Vosburgh <j.vosburgh@gmail.com>
> Cc: Veaceslav Falico <vfalico@gmail.com>
> Cc: Andy Gospodarek <andy@greyhouse.net>
> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
> ---

There is a very obvious deadlock here which happens when we have
bond-over-bond and the upper calls dev_get_stats from the lower.

Conceptually, the same can happen even in any number of stacking
combinations between bonding, net_failover, [ insert any other driver
that takes net->netdev_lists_lock here ].

There are two approaches that could solve this issue:
- using mutex_lock_nested where we aren't sure that we are top level
- ensuring through convention that user space always takes
  net->netdev_lists_lock when calling dev_get_stats, and documenting
  that, and therefore making it unnecessary to lock in bonding.

I took neither of the two approaches (I don't really like either one too
much), hence [ one of ] the reasons for the RFC. Comments?
Vladimir Oltean Dec. 7, 2020, 3:22 p.m. UTC | #2
On Mon, Dec 07, 2020 at 03:00:40AM +0200, Vladimir Oltean wrote:
> There is a very obvious deadlock here which happens when we have
> bond-over-bond and the upper calls dev_get_stats from the lower.
>
> Conceptually, the same can happen even in any number of stacking
> combinations between bonding, net_failover, [ insert any other driver
> that takes net->netdev_lists_lock here ].
>
> There are two approaches that could solve this issue:
> - using mutex_lock_nested where we aren't sure that we are top level
> - ensuring through convention that user space always takes
>   net->netdev_lists_lock when calling dev_get_stats, and documenting
>   that, and therefore making it unnecessary to lock in bonding.
>
> I took neither of the two approaches (I don't really like either one too
> much), hence [ one of ] the reasons for the RFC. Comments?

And there are also issues which are more subtle (or maybe they were
just subtle to me at the time I wrote the patch). Like the fact that the
netdev adjacency lists are not protected by net->netdev_lists_lock, but
still by the RTNL mutex and RCU. I think that in order for the iteration
through lower interfaces to capture a consistent state of the adjacency
lists of all interfaces, the __netdev_adjacent_dev_link_lists and
__netdev_adjacent_dev_unlink_lists functions would need to run under
net->netdev_lists_lock, and not just under some per-netdev lock. But
this raises the locking domain covered by net->netdev_lists_lock to more
than I initially intended. I'll try to do this and see how it works.
Jakub Kicinski Dec. 8, 2020, 11:57 p.m. UTC | #3
On Mon, 7 Dec 2020 01:00:40 +0000 Vladimir Oltean wrote:
> - ensuring through convention that user space always takes
>   net->netdev_lists_lock when calling dev_get_stats, and documenting
>   that, and therefore making it unnecessary to lock in bonding.

This seems like the better option to me. Makes the locking rules pretty
clear.
Vladimir Oltean Dec. 9, 2020, 12:03 a.m. UTC | #4
On Tue, Dec 08, 2020 at 03:57:44PM -0800, Jakub Kicinski wrote:
> On Mon, 7 Dec 2020 01:00:40 +0000 Vladimir Oltean wrote:
> > - ensuring through convention that user space always takes
> >   net->netdev_lists_lock when calling dev_get_stats, and documenting
> >   that, and therefore making it unnecessary to lock in bonding.
>
> This seems like the better option to me. Makes the locking rules pretty
> clear.

It is non-obvious to me that top-level callers of dev_get_stats should
hold a lock as specific as the one protecting the lists of network
interfaces. In the vast majority of implementations of dev_get_stats,
that lock would not actually protect anything, which would lead into
just one more lock that is used for more than it should be. In my tree I
had actually already switched over to mutex_lock_nested. Nonetheless, I
am still open if you want to make the case that simplicity should prevail
over specificity. But in that case, maybe we should just keep on using
the RTNL mutex.
Jakub Kicinski Dec. 9, 2020, 12:17 a.m. UTC | #5
On Wed, 9 Dec 2020 00:03:56 +0000 Vladimir Oltean wrote:
> On Tue, Dec 08, 2020 at 03:57:44PM -0800, Jakub Kicinski wrote:
> > On Mon, 7 Dec 2020 01:00:40 +0000 Vladimir Oltean wrote:
> > > - ensuring through convention that user space always takes
> > >   net->netdev_lists_lock when calling dev_get_stats, and documenting
> > >   that, and therefore making it unnecessary to lock in bonding.
> >
> > This seems like the better option to me. Makes the locking rules pretty
> > clear.
>
> It is non-obvious to me that top-level callers of dev_get_stats should
> hold a lock as specific as the one protecting the lists of network
> interfaces. In the vast majority of implementations of dev_get_stats,
> that lock would not actually protect anything, which would lead into
> just one more lock that is used for more than it should be. In my tree I
> had actually already switched over to mutex_lock_nested. Nonetheless, I
> am still open if you want to make the case that simplicity should prevail
> over specificity.

What are the locking rules you have in mind then? Caller may hold RTNL
or ifc mutex?

> But in that case, maybe we should just keep on using the RTNL mutex.

That's a wasted opportunity, RTNL lock is pretty busy.
Vladimir Oltean Dec. 9, 2020, 1:14 a.m. UTC | #6
On Tue, Dec 08, 2020 at 04:17:37PM -0800, Jakub Kicinski wrote:
> On Wed, 9 Dec 2020 00:03:56 +0000 Vladimir Oltean wrote:
> > On Tue, Dec 08, 2020 at 03:57:44PM -0800, Jakub Kicinski wrote:
> > > On Mon, 7 Dec 2020 01:00:40 +0000 Vladimir Oltean wrote:
> > > > - ensuring through convention that user space always takes
> > > >   net->netdev_lists_lock when calling dev_get_stats, and documenting
> > > >   that, and therefore making it unnecessary to lock in bonding.
> > >
> > > This seems like the better option to me. Makes the locking rules pretty
> > > clear.
> >
> > It is non-obvious to me that top-level callers of dev_get_stats should
> > hold a lock as specific as the one protecting the lists of network
> > interfaces. In the vast majority of implementations of dev_get_stats,
> > that lock would not actually protect anything, which would lead into
> > just one more lock that is used for more than it should be. In my tree I
> > had actually already switched over to mutex_lock_nested. Nonetheless, I
> > am still open if you want to make the case that simplicity should prevail
> > over specificity.
>
> What are the locking rules you have in mind then? Caller may hold RTNL
> or ifc mutex?

Well, it's clear that calling mutex_lock_nested would only silence
lockdep; there would still be this non-reentrant mutex being held from a
potentially recursive code path. So it is a non-solution, and even worse
than using plain mutex_lock, because at least plain mutex_lock is
detectable when it locks up the system.

net_failover and bonding are the only drivers that are creating this
recursivity requirement in dev_get_stats. Other one-over-many stackable
interfaces, like the bridge, just use dev_get_tstats64. I'm almost
thinking that it would be cleaner to convert these two to dev_get_tstats64
too, that would simplify things enormously. Even team uses something
that is based on software counters, something reminiscent of
dev_get_tstats64, definitely not counters retrieved from the underlying
device. Of course, the disadvantage with doing that is going to be that
virtual interfaces cannot retrieve statistics recursively from their
lower interface. I'm trying to think how much of a real disadvantage
that will be. For offloaded interfaces they will be completely off,
that's for sure. And this is one of the reasons that mandated the DSA
rework to expose MAC-based counters in dev_get_stats in the first place.
But thinking in the larger sense: take an offloading interface that
supports IP forwarding, with 50 VLAN uppers. How could the statistics
counters of those VLAN uppers ever be correct? It's not realistic to
expect the underlying hardware to keep separate statistics for each
upper, that the upper could then go ahead and just query. Sure, in the
cases where a lower can have only one upper at a time (bridge, bonding)
what I said is not applicable, but the general goal of having accurate
counters for offloading interfaces everywhere does not really seem
achievable.

In case this doesn't work out, I guess I'll have to document that
dev_get_stats is recursive, and all mutual exclusion based locks need to
be taken upfront, which is the only strategy that works with recursion
that I know of.

> > But in that case, maybe we should just keep on using the RTNL mutex.
>
> That's a wasted opportunity, RTNL lock is pretty busy.

It is certainly easy to use and easy to enforce though. It also seems to
protect everything, which is generally the reason why you tend to not
think a lot when using it.

To be clear, I am not removing the RTNL mutex from any place where it is
held today. Letting me do that would be like letting a bull in a china
shop. I am only creating a sub-lock of it, which is protecting a subset
of what the RTNL mutex is protecting. By sub-lock I mean that code paths
that currently hold the RTNL mutex, like list_netdevice(), will also
take net->netdev_lists_lock - never the other way around, that would
create lock inversion. The plan is to then require new code that
iterates through network interface lists to use the new locking scheme
and not RTNL mutex, and in parallel to migrate, on a best-effort basis,
code that runs under ASSERT_RTNL + net->netdev_lists_lock to only take
the net->netdev_lists_lock and be RTNL-free.

But is that any better at runtime than just taking the RTNL mutex, will
it make it less contended? Nope, due to the indirect serialization
effect that is created by the fact that net->netdev_lists_lock is a
sub-lock of the RTNL mutex. Say net/core/net-procfs.c holds
net->netdev_lists_lock while dumping the statistics from a firmware and
sleeping while doing so. All is rosy in its RTNL mutex free code path.
Then there comes along another thread which calls for_each_netdev,
something else which today requires the RTNL mutex but could be reworked
to use the net->netdev_lists_lock. Let's assume we are at a stage where
the gazillion places that call for_each_netdev() have been converted to
also take the net->netdev_lists_lock. But not all of them could be
converted to also _not_ take the RTNL mutex. This means that
there will be code paths that try to take the RTNL mutex and end up
waiting for our net/core/net-procfs.c to complete dumping the firmware.
Vladimir Oltean Dec. 9, 2020, 1:28 a.m. UTC | #7
On Wed, Dec 09, 2020 at 03:14:04AM +0200, Vladimir Oltean wrote:
> net_failover and bonding are the only drivers that are creating this
> recursivity requirement in dev_get_stats. Other one-over-many stackable
> interfaces, like the bridge, just use dev_get_tstats64. I'm almost
> thinking that it would be cleaner to convert these two to dev_get_tstats64
> too, that would simplify things enormously. Even team uses something
> that is based on software counters, something reminiscent of
> dev_get_tstats64, definitely not counters retrieved from the underlying
> device. Of course, the disadvantage with doing that is going to be that
> virtual interfaces cannot retrieve statistics recursively from their
> lower interface. I'm trying to think how much of a real disadvantage
> that will be. For offloaded interfaces they will be completely off,
> that's for sure. And this is one of the reasons that mandated the DSA
> rework to expose MAC-based counters in dev_get_stats in the first place.
> But thinking in the larger sense: take an offloading interface that
> supports IP forwarding, with 50 VLAN uppers. How could the statistics
> counters of those VLAN uppers ever be correct? It's not realistic to
> expect the underlying hardware to keep separate statistics for each
> upper, that the upper could then go ahead and just query. Sure, in the
> cases where a lower can have only one upper at a time (bridge, bonding)
> what I said is not applicable, but the general goal of having accurate
> counters for offloading interfaces everywhere does not really seem
> achievable.

Ok, putting some more thought into it after sending the email, maybe it
isn't unreasonable for devices with IP fwd offload to keep statistics
for each upper. They need to be aware of those uppers for a plethora of
other reasons, after all.

So, here's another idea for eliminating the recursion. Maybe we just add
a new netdevice feature, NETIF_F_FOLDING_STATS, or something like that,
which bonding and failover will use. Then dev_get_stats sees that flag,
and instead of calling ->ndo_get_stats64 for it, it just iterates
through its bottom-most level of lower interfaces and aggregates the
stats by itself.

Patch

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index e0880a3840d7..f788f9fa1858 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -3738,21 +3738,16 @@  static void bond_get_stats(struct net_device *bond_dev,
 			   struct rtnl_link_stats64 *stats)
 {
 	struct bonding *bond = netdev_priv(bond_dev);
+	struct net *net = dev_net(bond_dev);
 	struct rtnl_link_stats64 temp;
 	struct list_head *iter;
 	struct slave *slave;
-	int nest_level = 0;
 
+	mutex_lock(&net->netdev_lists_lock);
 
-	rcu_read_lock();
-#ifdef CONFIG_LOCKDEP
-	nest_level = bond_get_lowest_level_rcu(bond_dev);
-#endif
-
-	spin_lock_nested(&bond->stats_lock, nest_level);
 	memcpy(stats, &bond->bond_stats, sizeof(*stats));
 
-	bond_for_each_slave_rcu(bond, slave, iter) {
+	bond_for_each_slave(bond, slave, iter) {
 		const struct rtnl_link_stats64 *new =
 			dev_get_stats(slave->dev, &temp);
 
@@ -3763,8 +3758,8 @@  static void bond_get_stats(struct net_device *bond_dev,
 	}
 
 	memcpy(&bond->bond_stats, stats, sizeof(*stats));
-	spin_unlock(&bond->stats_lock);
-	rcu_read_unlock();
+
+	mutex_unlock(&net->netdev_lists_lock);
 }
 
 static int bond_do_ioctl(struct net_device *bond_dev, struct ifreq *ifr, int cmd)
@@ -5192,7 +5187,6 @@  static int bond_init(struct net_device *bond_dev)
 	if (!bond->wq)
 		return -ENOMEM;
 
-	spin_lock_init(&bond->stats_lock);
 	netdev_lockdep_set_classes(bond_dev);
 
 	list_add_tail(&bond->bond_list, &bn->dev_list);
diff --git a/include/net/bonding.h b/include/net/bonding.h
index d9d0ff3b0ad3..6fbde9713424 100644
--- a/include/net/bonding.h
+++ b/include/net/bonding.h
@@ -224,7 +224,6 @@  struct bonding {
 	 * ALB mode (6) - to sync the use and modifications of its hash table
 	 */
 	spinlock_t mode_lock;
-	spinlock_t stats_lock;
 	u8	 send_peer_notif;
 	u8       igmp_retrans;
 #ifdef CONFIG_PROC_FS