mbox series

[v3,net-next,00/12] Better support for sandwiched LAGs with bridge and DSA

Message ID 20210320223448.2452869-1-olteanv@gmail.com
Headers show
Series Better support for sandwiched LAGs with bridge and DSA | expand

Message

Vladimir Oltean March 20, 2021, 10:34 p.m. UTC
From: Vladimir Oltean <vladimir.oltean@nxp.com>

The objective of this series is to make LAG uppers on top of switchdev
ports work regardless of which order we link interfaces to their masters
(first make the port join the LAG, then the LAG join the bridge, or the
other way around).

There was a design decision to be made in patches 2-4 on whether we
should adopt the "push" model (which attempts to solve the problem
centrally, in the bridge layer) where the driver just calls:

  switchdev_bridge_port_offloaded(brport_dev,
                                  &atomic_notifier_block,
                                  &blocking_notifier_block,
                                  extack);

and the bridge just replays the entire collection of switchdev port
attributes and objects that it has, in some predefined order and with
some predefined error handling logic;


or the "pull" model (which attempts to solve the problem by giving the
driver the rope to hang itself), where the driver, apart from calling:

  switchdev_bridge_port_offloaded(brport_dev, extack);

has the task of "dumpster diving" (as Tobias puts it) through the bridge
attributes and objects by itself, by calling:

  - br_vlan_replay
  - br_fdb_replay
  - br_mdb_replay
  - br_vlan_enabled
  - br_port_flag_is_set
  - br_port_get_stp_state
  - br_multicast_router
  - br_get_ageing_time

(not necessarily all of them, and not necessarily in this order, and
with driver-defined error handling).

Even though I'm not in love myself with the "pull" model, I chose it
because there is a fundamental trick with replaying switchdev events
like this:

ip link add br0 type bridge
ip link add bond0 type bond
ip link set bond0 master br0
ip link set swp0 master bond0 <- this will replay the objects once for
                                 the bond0 bridge port, and the swp0
                                 switchdev port will process them
ip link set swp1 master bond0 <- this will replay the objects again for
                                 the bond0 bridge port, and the swp1
                                 switchdev port will see them, but swp0
                                 will see them for the second time now

Basically I believe that it is implementation defined whether the driver
wants to error out on switchdev objects seen twice on a port, and the
bridge should not enforce a certain model for that. For example, for FDB
entries added to a bonding interface, the underling switchdev driver
might have an abstraction for just that: an FDB entry pointing towards a
logical (as opposed to physical) port. So when the second port joins the
bridge, it doesn't realy need to replay FDB entries, since there is
already at least one hardware port which has been receiving those
events, and the FDB entries don't need to be added a second time to the
same logical port.
In the other corner, we have the drivers that handle switchdev port
attributes on a LAG as individual switchdev port attributes on physical
ports (example: VLAN filtering). In fact, the switchdev_handle_port_attr_set
helper facilitates this: it is a fan-out from a single orig_dev towards
multiple lowers that pass the check_cb().
But that's the point: switchdev_handle_port_attr_set is just a helper
which the driver _opts_ to use. The bridge can't enforce the "push"
model, because that would assume that all drivers handle port attributes
in the same way, which is probably false.

For this reason, I preferred to go with the "pull" mode for this patch
set. Just to see how bad it is for other switchdev drivers to copy-paste
this logic, I added the pull support to ocelot too, and I think it's
pretty manageable.

Vladimir Oltean (12):
  net: dsa: call dsa_port_bridge_join when joining a LAG that is already
    in a bridge
  net: dsa: pass extack to dsa_port_{bridge,lag}_join
  net: dsa: inherit the actual bridge port flags at join time
  net: dsa: sync up with bridge port's STP state when joining
  net: dsa: sync up VLAN filtering state when joining the bridge
  net: dsa: sync multicast router state when joining the bridge
  net: dsa: sync ageing time when joining the bridge
  net: dsa: replay port and host-joined mdb entries when joining the
    bridge
  net: dsa: replay port and local fdb entries when joining the bridge
  net: dsa: replay VLANs installed on port when joining the bridge
  net: ocelot: call ocelot_netdevice_bridge_join when joining a bridged
    LAG
  net: ocelot: replay switchdev events when joining bridge

 drivers/net/dsa/ocelot/felix.c         |   4 +-
 drivers/net/ethernet/mscc/ocelot.c     |  18 +--
 drivers/net/ethernet/mscc/ocelot_net.c | 208 +++++++++++++++++++++----
 include/linux/if_bridge.h              |  40 +++++
 include/net/switchdev.h                |   1 +
 include/soc/mscc/ocelot.h              |   6 +-
 net/bridge/br_fdb.c                    |  52 +++++++
 net/bridge/br_mdb.c                    |  84 ++++++++++
 net/bridge/br_stp.c                    |  27 ++++
 net/bridge/br_vlan.c                   |  71 +++++++++
 net/dsa/dsa_priv.h                     |   9 +-
 net/dsa/port.c                         | 203 ++++++++++++++++++------
 net/dsa/slave.c                        |  11 +-
 13 files changed, 631 insertions(+), 103 deletions(-)

Comments

Florian Fainelli March 22, 2021, 4:02 p.m. UTC | #1
On 3/20/2021 3:34 PM, Vladimir Oltean wrote:
> From: Vladimir Oltean <vladimir.oltean@nxp.com>

> 

> The SWITCHDEV_ATTR_ID_BRIDGE_AGEING_TIME attribute is only emitted from:

> 

> sysfs/ioctl/netlink

> -> br_set_ageing_time

>    -> __set_ageing_time

> 

> therefore not at bridge port creation time, so:

> (a) drivers had to hardcode the initial value for the address ageing time,

>     because they didn't get any notification

> (b) that hardcoded value can be out of sync, if the user changes the

>     ageing time before enslaving the port to the bridge

> 

> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>


Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>

-- 
Florian
Nikolay Aleksandrov March 22, 2021, 4:27 p.m. UTC | #2
On 21/03/2021 00:34, Vladimir Oltean wrote:
> From: Vladimir Oltean <vladimir.oltean@nxp.com>
> 
> The objective of this series is to make LAG uppers on top of switchdev
> ports work regardless of which order we link interfaces to their masters
> (first make the port join the LAG, then the LAG join the bridge, or the
> other way around).
> 
> There was a design decision to be made in patches 2-4 on whether we
> should adopt the "push" model (which attempts to solve the problem
> centrally, in the bridge layer) where the driver just calls:
> 
>   switchdev_bridge_port_offloaded(brport_dev,
>                                   &atomic_notifier_block,
>                                   &blocking_notifier_block,
>                                   extack);
> 
> and the bridge just replays the entire collection of switchdev port
> attributes and objects that it has, in some predefined order and with
> some predefined error handling logic;
> 
> 
> or the "pull" model (which attempts to solve the problem by giving the
> driver the rope to hang itself), where the driver, apart from calling:
> 
>   switchdev_bridge_port_offloaded(brport_dev, extack);
> 
> has the task of "dumpster diving" (as Tobias puts it) through the bridge
> attributes and objects by itself, by calling:
> 
>   - br_vlan_replay
>   - br_fdb_replay
>   - br_mdb_replay
>   - br_vlan_enabled
>   - br_port_flag_is_set
>   - br_port_get_stp_state
>   - br_multicast_router
>   - br_get_ageing_time
> 
> (not necessarily all of them, and not necessarily in this order, and
> with driver-defined error handling).
> 
> Even though I'm not in love myself with the "pull" model, I chose it
> because there is a fundamental trick with replaying switchdev events
> like this:
> 
> ip link add br0 type bridge
> ip link add bond0 type bond
> ip link set bond0 master br0
> ip link set swp0 master bond0 <- this will replay the objects once for
>                                  the bond0 bridge port, and the swp0
>                                  switchdev port will process them
> ip link set swp1 master bond0 <- this will replay the objects again for
>                                  the bond0 bridge port, and the swp1
>                                  switchdev port will see them, but swp0
>                                  will see them for the second time now
> 
> Basically I believe that it is implementation defined whether the driver
> wants to error out on switchdev objects seen twice on a port, and the
> bridge should not enforce a certain model for that. For example, for FDB
> entries added to a bonding interface, the underling switchdev driver
> might have an abstraction for just that: an FDB entry pointing towards a
> logical (as opposed to physical) port. So when the second port joins the
> bridge, it doesn't realy need to replay FDB entries, since there is
> already at least one hardware port which has been receiving those
> events, and the FDB entries don't need to be added a second time to the
> same logical port.
> In the other corner, we have the drivers that handle switchdev port
> attributes on a LAG as individual switchdev port attributes on physical
> ports (example: VLAN filtering). In fact, the switchdev_handle_port_attr_set
> helper facilitates this: it is a fan-out from a single orig_dev towards
> multiple lowers that pass the check_cb().
> But that's the point: switchdev_handle_port_attr_set is just a helper
> which the driver _opts_ to use. The bridge can't enforce the "push"
> model, because that would assume that all drivers handle port attributes
> in the same way, which is probably false.
> 
> For this reason, I preferred to go with the "pull" mode for this patch
> set. Just to see how bad it is for other switchdev drivers to copy-paste
> this logic, I added the pull support to ocelot too, and I think it's
> pretty manageable.
> 
> Vladimir Oltean (12):
>   net: dsa: call dsa_port_bridge_join when joining a LAG that is already
>     in a bridge
>   net: dsa: pass extack to dsa_port_{bridge,lag}_join
>   net: dsa: inherit the actual bridge port flags at join time
>   net: dsa: sync up with bridge port's STP state when joining
>   net: dsa: sync up VLAN filtering state when joining the bridge
>   net: dsa: sync multicast router state when joining the bridge
>   net: dsa: sync ageing time when joining the bridge
>   net: dsa: replay port and host-joined mdb entries when joining the
>     bridge
>   net: dsa: replay port and local fdb entries when joining the bridge
>   net: dsa: replay VLANs installed on port when joining the bridge
>   net: ocelot: call ocelot_netdevice_bridge_join when joining a bridged
>     LAG
>   net: ocelot: replay switchdev events when joining bridge
> 
>  drivers/net/dsa/ocelot/felix.c         |   4 +-
>  drivers/net/ethernet/mscc/ocelot.c     |  18 +--
>  drivers/net/ethernet/mscc/ocelot_net.c | 208 +++++++++++++++++++++----
>  include/linux/if_bridge.h              |  40 +++++
>  include/net/switchdev.h                |   1 +
>  include/soc/mscc/ocelot.h              |   6 +-
>  net/bridge/br_fdb.c                    |  52 +++++++
>  net/bridge/br_mdb.c                    |  84 ++++++++++
>  net/bridge/br_stp.c                    |  27 ++++
>  net/bridge/br_vlan.c                   |  71 +++++++++
>  net/dsa/dsa_priv.h                     |   9 +-
>  net/dsa/port.c                         | 203 ++++++++++++++++++------
>  net/dsa/slave.c                        |  11 +-
>  13 files changed, 631 insertions(+), 103 deletions(-)
> 

Hi Vladimir,
Please pull all of the new bridge code into separate patches with the proper
bridge subsystems tagged in the subject.
I'll review the bridge changes in a minute.

Thanks,
 Nik
Nikolay Aleksandrov March 22, 2021, 4:35 p.m. UTC | #3
On 21/03/2021 00:34, Vladimir Oltean wrote:
> From: Vladimir Oltean <vladimir.oltean@nxp.com>

> 

> I have udhcpcd in my system and this is configured to bring interfaces

> up as soon as they are created.

> 

> I create a bridge as follows:

> 

> ip link add br0 type bridge

> 

> As soon as I create the bridge and udhcpcd brings it up, I also have

> avahi which automatically starts sending IPv6 packets to advertise some

> local services, and because of that, the br0 bridge joins the following

> IPv6 groups due to the code path detailed below:

> 

> 33:33:ff:6d:c1:9c vid 0

> 33:33:00:00:00:6a vid 0

> 33:33:00:00:00:fb vid 0

> 

> br_dev_xmit

> -> br_multicast_rcv

>    -> br_ip6_multicast_add_group

>       -> __br_multicast_add_group

>          -> br_multicast_host_join

>             -> br_mdb_notify

> 

> This is all fine, but inside br_mdb_notify we have br_mdb_switchdev_host

> hooked up, and switchdev will attempt to offload the host joined groups

> to an empty list of ports. Of course nobody offloads them.

> 

> Then when we add a port to br0:

> 

> ip link set swp0 master br0

> 

> the bridge doesn't replay the host-joined MDB entries from br_add_if,

> and eventually the host joined addresses expire, and a switchdev

> notification for deleting it is emitted, but surprise, the original

> addition was already completely missed.

> 

> The strategy to address this problem is to replay the MDB entries (both

> the port ones and the host joined ones) when the new port joins the

> bridge, similar to what vxlan_fdb_replay does (in that case, its FDB can

> be populated and only then attached to a bridge that you offload).

> However there are 2 possibilities: the addresses can be 'pushed' by the

> bridge into the port, or the port can 'pull' them from the bridge.

> 

> Considering that in the general case, the new port can be really late to

> the party, and there may have been many other switchdev ports that

> already received the initial notification, we would like to avoid

> delivering duplicate events to them, since they might misbehave. And

> currently, the bridge calls the entire switchdev notifier chain, whereas

> for replaying it should just call the notifier block of the new guy.

> But the bridge doesn't know what is the new guy's notifier block, it

> just knows where the switchdev notifier chain is. So for simplification,

> we make this a driver-initiated pull for now, and the notifier block is

> passed as an argument.

> 

> To emulate the calling context for mdb objects (deferred and put on the

> blocking notifier chain), we must iterate under RCU protection through

> the bridge's mdb entries, queue them, and only call them once we're out

> of the RCU read-side critical section.

> 

> Suggested-by: Ido Schimmel <idosch@idosch.org>

> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>

> ---

> Changes in v3:

> - Removed the implication that avahi is crap from the commit message.

> - Made the br_mdb_replay shim return -EOPNOTSUPP.

> 

>  include/linux/if_bridge.h |  9 +++++

>  net/bridge/br_mdb.c       | 84 +++++++++++++++++++++++++++++++++++++++

>  net/dsa/dsa_priv.h        |  2 +

>  net/dsa/port.c            |  6 +++

>  net/dsa/slave.c           |  2 +-

>  5 files changed, 102 insertions(+), 1 deletion(-)

> 

> diff --git a/include/linux/if_bridge.h b/include/linux/if_bridge.h

> index ebd16495459c..f6472969bb44 100644

> --- a/include/linux/if_bridge.h

> +++ b/include/linux/if_bridge.h

> @@ -69,6 +69,8 @@ bool br_multicast_has_querier_anywhere(struct net_device *dev, int proto);

>  bool br_multicast_has_querier_adjacent(struct net_device *dev, int proto);

>  bool br_multicast_enabled(const struct net_device *dev);

>  bool br_multicast_router(const struct net_device *dev);

> +int br_mdb_replay(struct net_device *br_dev, struct net_device *dev,

> +		  struct notifier_block *nb, struct netlink_ext_ack *extack);

>  #else

>  static inline int br_multicast_list_adjacent(struct net_device *dev,

>  					     struct list_head *br_ip_list)

> @@ -93,6 +95,13 @@ static inline bool br_multicast_router(const struct net_device *dev)

>  {

>  	return false;

>  }

> +static inline int br_mdb_replay(struct net_device *br_dev,

> +				struct net_device *dev,

> +				struct notifier_block *nb,

> +				struct netlink_ext_ack *extack)

> +{

> +	return -EOPNOTSUPP;

> +}

>  #endif

>  

>  #if IS_ENABLED(CONFIG_BRIDGE) && IS_ENABLED(CONFIG_BRIDGE_VLAN_FILTERING)

> diff --git a/net/bridge/br_mdb.c b/net/bridge/br_mdb.c

> index 8846c5bcd075..23973186094c 100644

> --- a/net/bridge/br_mdb.c

> +++ b/net/bridge/br_mdb.c

> @@ -506,6 +506,90 @@ static void br_mdb_complete(struct net_device *dev, int err, void *priv)

>  	kfree(priv);

>  }

>  

> +static int br_mdb_replay_one(struct notifier_block *nb, struct net_device *dev,

> +			     struct net_bridge_mdb_entry *mp, int obj_id,

> +			     struct net_device *orig_dev,

> +			     struct netlink_ext_ack *extack)

> +{

> +	struct switchdev_notifier_port_obj_info obj_info = {

> +		.info = {

> +			.dev = dev,

> +			.extack = extack,

> +		},

> +	};

> +	struct switchdev_obj_port_mdb mdb = {

> +		.obj = {

> +			.orig_dev = orig_dev,

> +			.id = obj_id,

> +		},

> +		.vid = mp->addr.vid,

> +	};

> +	int err;

> +

> +	if (mp->addr.proto == htons(ETH_P_IP))

> +		ip_eth_mc_map(mp->addr.dst.ip4, mdb.addr);

> +#if IS_ENABLED(CONFIG_IPV6)

> +	else if (mp->addr.proto == htons(ETH_P_IPV6))

> +		ipv6_eth_mc_map(&mp->addr.dst.ip6, mdb.addr);

> +#endif

> +	else

> +		ether_addr_copy(mdb.addr, mp->addr.dst.mac_addr);

> +

> +	obj_info.obj = &mdb.obj;

> +

> +	err = nb->notifier_call(nb, SWITCHDEV_PORT_OBJ_ADD, &obj_info);

> +	return notifier_to_errno(err);

> +}

> +

> +int br_mdb_replay(struct net_device *br_dev, struct net_device *dev,

> +		  struct notifier_block *nb, struct netlink_ext_ack *extack)

> +{

> +	struct net_bridge_mdb_entry *mp;

> +	struct list_head mdb_list;


If you use LIST_HEAD(mdb_list)...

> +	struct net_bridge *br;

> +	int err = 0;

> +

> +	ASSERT_RTNL();

> +

> +	INIT_LIST_HEAD(&mdb_list);


... you can drop this one.

> +

> +	if (!netif_is_bridge_master(br_dev) || !netif_is_bridge_port(dev))

> +		return -EINVAL;

> +

> +	br = netdev_priv(br_dev);

> +

> +	if (!br_opt_get(br, BROPT_MULTICAST_ENABLED))

> +		return 0;

> +

> +	hlist_for_each_entry(mp, &br->mdb_list, mdb_node) {


You cannot walk over these lists without the multicast lock or RCU. RTNL is not
enough because of various timers and leave messages that can alter both the mdb_list
and the port group lists. I'd prefer RCU to avoid blocking the bridge mcast.

> +		struct net_bridge_port_group __rcu **pp;

> +		struct net_bridge_port_group *p;

> +

> +		if (mp->host_joined) {

> +			err = br_mdb_replay_one(nb, dev, mp,

> +						SWITCHDEV_OBJ_ID_HOST_MDB,

> +						br_dev, extack);

> +			if (err)

> +				return err;

> +		}

> +

> +		for (pp = &mp->ports; (p = rtnl_dereference(*pp)) != NULL;

> +		     pp = &p->next) {

> +			if (p->key.port->dev != dev)

> +				continue;

> +

> +			err = br_mdb_replay_one(nb, dev, mp,

> +						SWITCHDEV_OBJ_ID_PORT_MDB,

> +						dev, extack);

> +			if (err)

> +				return err;> +		}

> +	}

> +

> +	return 0;

> +}

> +EXPORT_SYMBOL(br_mdb_replay);


EXPORT_SYMBOL_GPL

> +

>  static void br_mdb_switchdev_host_port(struct net_device *dev,

>  				       struct net_device *lower_dev,

>  				       struct net_bridge_mdb_entry *mp,

> diff --git a/net/dsa/dsa_priv.h b/net/dsa/dsa_priv.h

> index b8778c5d8529..b14c43cb88bb 100644

> --- a/net/dsa/dsa_priv.h

> +++ b/net/dsa/dsa_priv.h

> @@ -262,6 +262,8 @@ static inline bool dsa_tree_offloads_bridge_port(struct dsa_switch_tree *dst,

>  

>  /* slave.c */

>  extern const struct dsa_device_ops notag_netdev_ops;

> +extern struct notifier_block dsa_slave_switchdev_blocking_notifier;

> +

>  void dsa_slave_mii_bus_init(struct dsa_switch *ds);

>  int dsa_slave_create(struct dsa_port *dp);

>  void dsa_slave_destroy(struct net_device *slave_dev);

> diff --git a/net/dsa/port.c b/net/dsa/port.c

> index 95e6f2861290..3e61e9e6675c 100644

> --- a/net/dsa/port.c

> +++ b/net/dsa/port.c

> @@ -199,6 +199,12 @@ static int dsa_port_switchdev_sync(struct dsa_port *dp,

>  	if (err && err != -EOPNOTSUPP)

>  		return err;

>  

> +	err = br_mdb_replay(br, brport_dev,

> +			    &dsa_slave_switchdev_blocking_notifier,

> +			    extack);

> +	if (err && err != -EOPNOTSUPP)

> +		return err;

> +

>  	return 0;

>  }

>  

> diff --git a/net/dsa/slave.c b/net/dsa/slave.c

> index 1ff48be476bb..b974d8f84a2e 100644

> --- a/net/dsa/slave.c

> +++ b/net/dsa/slave.c

> @@ -2396,7 +2396,7 @@ static struct notifier_block dsa_slave_switchdev_notifier = {

>  	.notifier_call = dsa_slave_switchdev_event,

>  };

>  

> -static struct notifier_block dsa_slave_switchdev_blocking_notifier = {

> +struct notifier_block dsa_slave_switchdev_blocking_notifier = {

>  	.notifier_call = dsa_slave_switchdev_blocking_event,

>  };

>  

>
Nikolay Aleksandrov March 22, 2021, 4:48 p.m. UTC | #4
On 21/03/2021 00:34, Vladimir Oltean wrote:
> From: Vladimir Oltean <vladimir.oltean@nxp.com>

> 

> Currently this simple setup:

> 

> ip link add br0 type bridge vlan_filtering 1

> ip link add bond0 type bond

> ip link set bond0 master br0

> ip link set swp0 master bond0

> 

> will not work because the bridge has created the PVID in br_add_if ->

> nbp_vlan_init, and it has notified switchdev of the existence of VLAN 1,

> but that was too early, since swp0 was not yet a lower of bond0, so it

> had no reason to act upon that notification.

> 

> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>

> ---

> Changes in v3:

> Made the br_vlan_replay shim return -EOPNOTSUPP.

> 

>  include/linux/if_bridge.h | 10 ++++++

>  net/bridge/br_vlan.c      | 71 +++++++++++++++++++++++++++++++++++++++

>  net/dsa/port.c            |  6 ++++

>  3 files changed, 87 insertions(+)

[snip]
> +int br_vlan_replay(struct net_device *br_dev, struct net_device *dev,

> +		   struct notifier_block *nb, struct netlink_ext_ack *extack)

> +{

> +	struct net_bridge_vlan_group *vg;

> +	struct net_bridge_vlan *v;

> +	struct net_bridge_port *p;

> +	struct net_bridge *br;

> +	int err = 0;

> +	u16 pvid;

> +

> +	ASSERT_RTNL();

> +

> +	if (!netif_is_bridge_master(br_dev))

> +		return -EINVAL;

> +

> +	if (!netif_is_bridge_master(dev) && !netif_is_bridge_port(dev))

> +		return -EINVAL;

> +

> +	if (netif_is_bridge_master(dev)) {

> +		br = netdev_priv(dev);

> +		vg = br_vlan_group(br);

> +		p = NULL;

> +	} else {

> +		p = br_port_get_rtnl(dev);

> +		if (WARN_ON(!p))

> +			return -EINVAL;

> +		vg = nbp_vlan_group(p);

> +		br = p->br;

> +	}

> +

> +	if (!vg)

> +		return 0;

> +

> +	pvid = br_get_pvid(vg);

> +

> +	list_for_each_entry(v, &vg->vlan_list, vlist) {

> +		struct switchdev_obj_port_vlan vlan = {

> +			.obj.orig_dev = dev,

> +			.obj.id = SWITCHDEV_OBJ_ID_PORT_VLAN,

> +			.flags = br_vlan_flags(v, pvid),

> +			.vid = v->vid,

> +		};

> +

> +		if (!br_vlan_should_use(v))

> +			continue;

> +

> +		br_vlan_replay_one(nb, dev, &vlan, extack);

> +		if (err)

> +			return err;

> +	}

> +

> +	return err;

> +}


EXPORT_SYMBOL_GPL ?

>  /* check if v_curr can enter a range ending in range_end */

>  bool br_vlan_can_enter_range(const struct net_bridge_vlan *v_curr,

>  			     const struct net_bridge_vlan *range_end)

> diff --git a/net/dsa/port.c b/net/dsa/port.c

> index d21a511f1e16..84775e253ee8 100644

> --- a/net/dsa/port.c

> +++ b/net/dsa/port.c

> @@ -209,6 +209,12 @@ static int dsa_port_switchdev_sync(struct dsa_port *dp,

>  	if (err && err != -EOPNOTSUPP)

>  		return err;

>  

> +	err = br_vlan_replay(br, brport_dev,

> +			     &dsa_slave_switchdev_blocking_notifier,

> +			     extack);

> +	if (err && err != -EOPNOTSUPP)

> +		return err;

> +

>  	return 0;

>  }

>  

>
Vladimir Oltean March 22, 2021, 4:56 p.m. UTC | #5
On Mon, Mar 22, 2021 at 06:35:10PM +0200, Nikolay Aleksandrov wrote:
> > +	hlist_for_each_entry(mp, &br->mdb_list, mdb_node) {

> 

> You cannot walk over these lists without the multicast lock or RCU. RTNL is not

> enough because of various timers and leave messages that can alter both the mdb_list

> and the port group lists. I'd prefer RCU to avoid blocking the bridge mcast.


The trouble is that I need to emulate the calling context that is
provided to SWITCHDEV_OBJ_ID_HOST_MDB and SWITCHDEV_OBJ_ID_PORT_MDB, and
that means blocking context.

So if I hold rcu_read_lock(), I need to queue up the mdb entries, and
notify the driver only after I leave the RCU critical section. The
memory footprint may temporarily blow up.

In fact this is what I did in v1:
https://patchwork.kernel.org/project/netdevbpf/patch/20210224114350.2791260-15-olteanv@gmail.com/

I just figured I could get away with rtnl_mutex protection, but it looks
like I can't. So I guess you prefer my v1?
Nikolay Aleksandrov March 22, 2021, 5 p.m. UTC | #6
On 22/03/2021 18:56, Vladimir Oltean wrote:
> On Mon, Mar 22, 2021 at 06:35:10PM +0200, Nikolay Aleksandrov wrote:

>>> +	hlist_for_each_entry(mp, &br->mdb_list, mdb_node) {

>>

>> You cannot walk over these lists without the multicast lock or RCU. RTNL is not

>> enough because of various timers and leave messages that can alter both the mdb_list

>> and the port group lists. I'd prefer RCU to avoid blocking the bridge mcast.

> 

> The trouble is that I need to emulate the calling context that is

> provided to SWITCHDEV_OBJ_ID_HOST_MDB and SWITCHDEV_OBJ_ID_PORT_MDB, and

> that means blocking context.

> 

> So if I hold rcu_read_lock(), I need to queue up the mdb entries, and

> notify the driver only after I leave the RCU critical section. The

> memory footprint may temporarily blow up.

> 

> In fact this is what I did in v1:

> https://patchwork.kernel.org/project/netdevbpf/patch/20210224114350.2791260-15-olteanv@gmail.com/

> 

> I just figured I could get away with rtnl_mutex protection, but it looks

> like I can't. So I guess you prefer my v1?

> 


Indeed, if you need a blocking context then you'd have to go with v1.