[v2,net-next,0/4] net: dsa: Link aggregation support

Message ID	20201130140610.4018-1-tobias@waldekranz.com
Headers	show Return-Path: <netdev-owner@kernel.org> From: Tobias Waldekranz <tobias@waldekranz.com> To: davem@davemloft.net, kuba@kernel.org Cc: andrew@lunn.ch, vivien.didelot@gmail.com, f.fainelli@gmail.com, olteanv@gmail.com, j.vosburgh@gmail.com, vfalico@gmail.com, andy@greyhouse.net, netdev@vger.kernel.org Subject: [PATCH v2 net-next 0/4] net: dsa: Link aggregation support Date: Mon, 30 Nov 2020 15:06:06 +0100 Message-Id: <20201130140610.4018-1-tobias@waldekranz.com> Organization: Westermo Precedence: bulk
Series	net: dsa: Link aggregation support \| expand [v2,net-next,0/4] net: dsa: Link aggregation support [v2,net-next,1/4] net: bonding: Notify ports about their initial state [v2,net-next,2/4] net: dsa: Link aggregation support [v2,net-next,3/4] net: dsa: mv88e6xxx: Link aggregation support [v2,net-next,4/4] net: dsa: tag_dsa: Support reception of packets from LAG devices

Tobias Waldekranz Nov. 30, 2020, 2:06 p.m. UTC

Start of by adding an extra notification when adding a port to a bond,
this allows static LAGs to be offloaded using the bonding driver.

Then add the generic support required to offload link aggregates to
drivers built on top of the DSA subsystem.

Finally, implement offloading for the mv88e6xxx driver, i.e. Marvell's
LinkStreet family.

Supported LAG implementations:
- Bonding
- Team

Supported modes:
- Isolated. The LAG may be used as a regular interface outside of any
  bridge.
- Bridged. The LAG may be added to a bridge, in which case switching
  is offloaded between the LAG and any other switch ports. I.e. the
  LAG behaves just like a port from this perspective.

In bridged mode, the following is supported:
- STP filtering.
- VLAN filtering.
- Multicast filtering. The bridge correctly snoops IGMP and configures
  the proper groups if snooping is enabled. Static groups can also be
  configured. MLD seems to work, but has not been extensively tested.
- Unicast filtering. Automatic learning works. Static entries are
  _not_ supported. This will be added in a later series as it requires
  some more general refactoring in mv88e6xxx before I can test it.

v1 -> v2:
- Allocate LAGs from a static pool to avoid late errors under memory
  pressure, as suggested by Andrew.

RFC -> v1:
- Properly propagate MDB operations.
- Support for bonding in addition to team.
- Fixed a locking bug in mv88e6xxx.
- Make sure ports are disabled-by-default in mv88e6xxx.
- Support for both DSA and EDSA tagging.

Tobias Waldekranz (4):
  net: bonding: Notify ports about their initial state
  net: dsa: Link aggregation support
  net: dsa: mv88e6xxx: Link aggregation support
  net: dsa: tag_dsa: Support reception of packets from LAG devices

 drivers/net/bonding/bond_main.c     |   2 +
 drivers/net/dsa/mv88e6xxx/chip.c    | 234 +++++++++++++++++++++++++++-
 drivers/net/dsa/mv88e6xxx/chip.h    |   4 +
 drivers/net/dsa/mv88e6xxx/global2.c |   8 +-
 drivers/net/dsa/mv88e6xxx/global2.h |   5 +
 drivers/net/dsa/mv88e6xxx/port.c    |  21 +++
 drivers/net/dsa/mv88e6xxx/port.h    |   5 +
 include/net/dsa.h                   |  97 ++++++++++++
 net/dsa/dsa.c                       |  12 +-
 net/dsa/dsa2.c                      |  51 ++++++
 net/dsa/dsa_priv.h                  |  31 ++++
 net/dsa/port.c                      | 141 +++++++++++++++++
 net/dsa/slave.c                     |  83 +++++++++-
 net/dsa/switch.c                    |  49 ++++++
 net/dsa/tag_dsa.c                   |  17 +-
 15 files changed, 744 insertions(+), 16 deletions(-)

Vladimir Oltean Dec. 1, 2020, 1:37 a.m. UTC | #1

Hi Tobias,

On Mon, Nov 30, 2020 at 03:06:08PM +0100, Tobias Waldekranz wrote:
> Monitor the following events and notify the driver when:

> 

> - A DSA port joins/leaves a LAG.

> - A LAG, made up of DSA ports, joins/leaves a bridge.

> - A DSA port in a LAG is enabled/disabled (enabled meaning

>   "distributing" in 802.3ad LACP terms).

> 

> Each LAG interface to which a DSA port is attached is represented by a

> `struct dsa_lag` which is globally reachable from the switch tree and

> from each associated port.

> 

> When a LAG joins a bridge, the DSA subsystem will treat that as each

> individual port joining the bridge. The driver may look at the port's

> LAG pointer to see if it is associated with any LAG, if that is

> required. This is analogue to how switchdev events are replicated out

> to all lower devices when reaching e.g. a LAG.

> 

> Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>

> ---

>  include/net/dsa.h  |  97 +++++++++++++++++++++++++++++++

>  net/dsa/dsa2.c     |  51 ++++++++++++++++

>  net/dsa/dsa_priv.h |  31 ++++++++++

>  net/dsa/port.c     | 141 +++++++++++++++++++++++++++++++++++++++++++++

>  net/dsa/slave.c    |  83 ++++++++++++++++++++++++--

>  net/dsa/switch.c   |  49 ++++++++++++++++

>  6 files changed, 446 insertions(+), 6 deletions(-)

> 

> +static inline struct dsa_lag *dsa_lag_by_id(struct dsa_switch_tree *dst, int id)

> +{

> +	if (!test_bit(id, dst->lags.busy))

> +		return NULL;

> +

> +	return &dst->lags.pool[id];

> +}

> +

> +static inline struct net_device *dsa_lag_dev_by_id(struct dsa_switch_tree *dst,

> +						   int id)

> +{

> +	struct dsa_lag *lag = dsa_lag_by_id(dst, id);

> +

> +	return lag ? rcu_dereference(lag->dev) : NULL;

> +}

> +

> +static inline struct dsa_lag *dsa_lag_by_dev(struct dsa_switch_tree *dst,

> +					     struct net_device *dev)

> +{

> +	struct dsa_lag *lag;

> +	int id;

> +

> +	dsa_lag_foreach(id, dst) {

> +		lag = dsa_lag_by_id(dst, id);

> +

> +		if (rtnl_dereference(lag->dev) == dev)

> +			return lag;

> +	}

> +

> +	return NULL;

> +}

>  

> diff --git a/net/dsa/port.c b/net/dsa/port.c

> index 73569c9af3cc..c2332ee5f5c7 100644

> --- a/net/dsa/port.c

> +++ b/net/dsa/port.c

> @@ -193,6 +193,147 @@ void dsa_port_bridge_leave(struct dsa_port *dp, struct net_device *br)

>  	dsa_port_set_state_now(dp, BR_STATE_FORWARDING);

>  }

>  

> +static struct dsa_lag *dsa_lag_get(struct dsa_switch_tree *dst,

> +				   struct net_device *dev)

> +{

> +	struct dsa_lag *lag;

> +	int id;

> +

> +	lag = dsa_lag_by_dev(dst, dev);

> +	if (lag) {

> +		kref_get(&lag->refcount);

> +		return lag;

> +	}

> +

> +	id = find_first_zero_bit(dst->lags.busy, dst->lags.num);

> +	if (id >= dst->lags.num) {

> +		WARN(1, "No LAGs available");

> +		return NULL;

> +	}

> +

> +	set_bit(id, dst->lags.busy);

> +

> +	lag = &dst->lags.pool[id];

> +	kref_init(&lag->refcount);

> +	lag->id = id;

> +	INIT_LIST_HEAD(&lag->ports);

> +	INIT_LIST_HEAD(&lag->tx_ports);

> +

> +	rcu_assign_pointer(lag->dev, dev);

> +	return lag;

> +}

> +

> +static void dsa_lag_release(struct kref *refcount)

> +{

> +	struct dsa_lag *lag = container_of(refcount, struct dsa_lag, refcount);

> +

> +	rcu_assign_pointer(lag->dev, NULL);

> +	synchronize_rcu();

> +	memset(lag, 0, sizeof(*lag));

> +}


What difference does it make if lag->dev is set to NULL right away or
after a grace period? Squeezing one last packet from that bonding interface?
Pointer updates are atomic operations on all architectures that the
kernel supports, and, as long as you use WRITE_ONCE and READ_ONCE memory
barriers, there should be no reason for RCU protection that I can see.
And unlike typical uses of RCU, you do not free lag->dev, because you do
not own lag->dev. Instead, the bonding interface pointed to by lag->dev
is going to be freed (in case of a deletion using ip link) after an RCU
grace period anyway. And the receive data path is under an RCU read-side
critical section anyway. So even if you set lag->dev to NULL using
WRITE_ONCE, the existing in-flight readers from the RX data path that
had called dsa_lag_dev_by_id() will still hold a reference to a valid
bonding interface.

Tobias Waldekranz Dec. 1, 2020, 8:13 a.m. UTC | #2

On Tue, Dec 01, 2020 at 03:37, Vladimir Oltean <olteanv@gmail.com> wrote:
> On Mon, Nov 30, 2020 at 03:06:08PM +0100, Tobias Waldekranz wrote:

>> +static void dsa_lag_release(struct kref *refcount)

>> +{

>> +	struct dsa_lag *lag = container_of(refcount, struct dsa_lag, refcount);

>> +

>> +	rcu_assign_pointer(lag->dev, NULL);

>> +	synchronize_rcu();

>> +	memset(lag, 0, sizeof(*lag));

>> +}

>

> What difference does it make if lag->dev is set to NULL right away or

> after a grace period? Squeezing one last packet from that bonding interface?

> Pointer updates are atomic operations on all architectures that the

> kernel supports, and, as long as you use WRITE_ONCE and READ_ONCE memory

> barriers, there should be no reason for RCU protection that I can see.

> And unlike typical uses of RCU, you do not free lag->dev, because you do

> not own lag->dev. Instead, the bonding interface pointed to by lag->dev

> is going to be freed (in case of a deletion using ip link) after an RCU

> grace period anyway. And the receive data path is under an RCU read-side

> critical section anyway. So even if you set lag->dev to NULL using

> WRITE_ONCE, the existing in-flight readers from the RX data path that

> had called dsa_lag_dev_by_id() will still hold a reference to a valid

> bonding interface.


I completely agree with your analysis. I will remove all the RCU
primitives in v3. Thank you.

Vladimir Oltean Dec. 1, 2020, 1:29 p.m. UTC | #3

On Tue, Dec 01, 2020 at 09:13:57AM +0100, Tobias Waldekranz wrote:
> I completely agree with your analysis. I will remove all the RCU

> primitives in v3. Thank you.


I expect that this also gives us a simple refcount_t instead of the
struct kref?

Vladimir Oltean Dec. 1, 2020, 2:03 p.m. UTC | #4

On Mon, Nov 30, 2020 at 03:06:08PM +0100, Tobias Waldekranz wrote:
> When a LAG joins a bridge, the DSA subsystem will treat that as each

> individual port joining the bridge. The driver may look at the port's

> LAG pointer to see if it is associated with any LAG, if that is

> required. This is analogue to how switchdev events are replicated out

> to all lower devices when reaching e.g. a LAG.


Agree with the principle. But doesn't that mean that this code:

static int dsa_slave_switchdev_blocking_event(struct notifier_block *unused,
					      unsigned long event, void *ptr)
{
	struct net_device *dev = switchdev_notifier_info_to_dev(ptr);
	int err;

	switch (event) {
	case SWITCHDEV_PORT_OBJ_ADD:
		err = switchdev_handle_port_obj_add(dev, ptr,
						    dsa_slave_dev_check,
						    dsa_slave_port_obj_add);
		return notifier_from_errno(err);
	case SWITCHDEV_PORT_OBJ_DEL:
		err = switchdev_handle_port_obj_del(dev, ptr,
						    dsa_slave_dev_check,
						    dsa_slave_port_obj_del);
		return notifier_from_errno(err);
	case SWITCHDEV_PORT_ATTR_SET:
		err = switchdev_handle_port_attr_set(dev, ptr,
						     dsa_slave_dev_check,
						     dsa_slave_port_attr_set);
		return notifier_from_errno(err);
	}

	return NOTIFY_DONE;
}

should be replaced with something that also reacts to the case where
"dev" is a LAG? Like, for example, I imagine that a VLAN installed on a
bridge port that is a LAG should be propagated to the switch ports
beneath that LAG. Similarly for all bridge attributes.

As for FDB and MDB addresses, I think they should be propagated towards
a "logical port" corresponding to the LAG upper. I don't know how the
mv88e6xxx handles this.

Tobias Waldekranz Dec. 1, 2020, 2:22 p.m. UTC | #5

On Tue, Dec 01, 2020 at 15:29, Vladimir Oltean <olteanv@gmail.com> wrote:
> On Tue, Dec 01, 2020 at 09:13:57AM +0100, Tobias Waldekranz wrote:

>> I completely agree with your analysis. I will remove all the RCU

>> primitives in v3. Thank you.

>

> I expect that this also gives us a simple refcount_t instead of the

> struct kref?


Yeah sure, I was just trying to be consistent with what was being used
in other dsa-related structs. I will change it.

Tobias Waldekranz Dec. 1, 2020, 2:29 p.m. UTC | #6

On Tue, Dec 01, 2020 at 16:03, Vladimir Oltean <olteanv@gmail.com> wrote:
> On Mon, Nov 30, 2020 at 03:06:08PM +0100, Tobias Waldekranz wrote:

>> When a LAG joins a bridge, the DSA subsystem will treat that as each

>> individual port joining the bridge. The driver may look at the port's

>> LAG pointer to see if it is associated with any LAG, if that is

>> required. This is analogue to how switchdev events are replicated out

>> to all lower devices when reaching e.g. a LAG.

>

> Agree with the principle. But doesn't that mean that this code:

>

> static int dsa_slave_switchdev_blocking_event(struct notifier_block *unused,

> 					      unsigned long event, void *ptr)

> {

> 	struct net_device *dev = switchdev_notifier_info_to_dev(ptr);

> 	int err;

>

> 	switch (event) {

> 	case SWITCHDEV_PORT_OBJ_ADD:

> 		err = switchdev_handle_port_obj_add(dev, ptr,

> 						    dsa_slave_dev_check,

> 						    dsa_slave_port_obj_add);

> 		return notifier_from_errno(err);

> 	case SWITCHDEV_PORT_OBJ_DEL:

> 		err = switchdev_handle_port_obj_del(dev, ptr,

> 						    dsa_slave_dev_check,

> 						    dsa_slave_port_obj_del);

> 		return notifier_from_errno(err);

> 	case SWITCHDEV_PORT_ATTR_SET:

> 		err = switchdev_handle_port_attr_set(dev, ptr,

> 						     dsa_slave_dev_check,

> 						     dsa_slave_port_attr_set);

> 		return notifier_from_errno(err);

> 	}

>

> 	return NOTIFY_DONE;

> }

>

> should be replaced with something that also reacts to the case where

> "dev" is a LAG? Like, for example, I imagine that a VLAN installed on a

> bridge port that is a LAG should be propagated to the switch ports

> beneath that LAG. Similarly for all bridge attributes.


That is exactly what switchdev_handle_* does, no? It is this exact
behavior that my statement about switchdev event replication references.

> As for FDB and MDB addresses, I think they should be propagated towards

> a "logical port" corresponding to the LAG upper. I don't know how the

> mv88e6xxx handles this.


mv88e6xxx differentiates between multicast and unicast entries. So MDB
entries fit very well with the obj_add/del+replication. Unicast entries
will have use "lagX" as its destination, so in that case we need a new
dsa op along the lines of "lag_fdb_add/del".

Vladimir Oltean Dec. 1, 2020, 8:04 p.m. UTC | #7

On Tue, Dec 01, 2020 at 03:29:53PM +0100, Tobias Waldekranz wrote:
> On Tue, Dec 01, 2020 at 16:03, Vladimir Oltean <olteanv@gmail.com> wrote:

> > On Mon, Nov 30, 2020 at 03:06:08PM +0100, Tobias Waldekranz wrote:

> >> When a LAG joins a bridge, the DSA subsystem will treat that as each

> >> individual port joining the bridge. The driver may look at the port's

> >> LAG pointer to see if it is associated with any LAG, if that is

> >> required. This is analogue to how switchdev events are replicated out

> >> to all lower devices when reaching e.g. a LAG.

> >

> > Agree with the principle. But doesn't that mean that this code:

> >

> > static int dsa_slave_switchdev_blocking_event(struct notifier_block *unused,

> > 					      unsigned long event, void *ptr)

> > {

> > 	struct net_device *dev = switchdev_notifier_info_to_dev(ptr);

> > 	int err;

> >

> > 	switch (event) {

> > 	case SWITCHDEV_PORT_OBJ_ADD:

> > 		err = switchdev_handle_port_obj_add(dev, ptr,

> > 						    dsa_slave_dev_check,

> > 						    dsa_slave_port_obj_add);

> > 		return notifier_from_errno(err);

> > 	case SWITCHDEV_PORT_OBJ_DEL:

> > 		err = switchdev_handle_port_obj_del(dev, ptr,

> > 						    dsa_slave_dev_check,

> > 						    dsa_slave_port_obj_del);

> > 		return notifier_from_errno(err);

> > 	case SWITCHDEV_PORT_ATTR_SET:

> > 		err = switchdev_handle_port_attr_set(dev, ptr,

> > 						     dsa_slave_dev_check,

> > 						     dsa_slave_port_attr_set);

> > 		return notifier_from_errno(err);

> > 	}

> >

> > 	return NOTIFY_DONE;

> > }

> >

> > should be replaced with something that also reacts to the case where

> > "dev" is a LAG? Like, for example, I imagine that a VLAN installed on a

> > bridge port that is a LAG should be propagated to the switch ports

> > beneath that LAG. Similarly for all bridge attributes.

>

> That is exactly what switchdev_handle_* does, no? It is this exact

> behavior that my statement about switchdev event replication references.


I'm sorry, I don't mean to be overly obtuse, but _how_ does the current
code propagate a VLAN to a physical port located below a bond? Through
magic? The dsa_slave_dev_check is passed as a parameter to
switchdev_handle_port_obj_add _exactly_ because the code has needed so
far to match only on DSA interfaces and not on bonding interfaces. So
the code does not react to VLANs added on a bonding interface. Hence my
question.

ip link del bond0
ip link add bond0 type bond mode 802.3ad
ip link set swp1 down && ip link set swp1 master bond0 && ip link set swp1 up
ip link set swp2 down && ip link set swp2 master bond0 && ip link set swp2 up
ip link del br0
ip link add br0 type bridge
ip link set bond0 master br0
ip link set swp0 master br0

This should propagate the VLANs to swp1 and swp2 but doesn't:
bridge vlan add dev bond0 vid 100

It's perfectly acceptable to say that this patch set doesn't deal with
that. But your commit message seems to suggest that it's me who's
misunderstanding something.

Tobias Waldekranz Dec. 1, 2020, 9:48 p.m. UTC | #8

On Tue, Dec 01, 2020 at 22:04, Vladimir Oltean <olteanv@gmail.com> wrote:
> On Tue, Dec 01, 2020 at 03:29:53PM +0100, Tobias Waldekranz wrote:

>> On Tue, Dec 01, 2020 at 16:03, Vladimir Oltean <olteanv@gmail.com> wrote:

>> > On Mon, Nov 30, 2020 at 03:06:08PM +0100, Tobias Waldekranz wrote:

>> >> When a LAG joins a bridge, the DSA subsystem will treat that as each

>> >> individual port joining the bridge. The driver may look at the port's

>> >> LAG pointer to see if it is associated with any LAG, if that is

>> >> required. This is analogue to how switchdev events are replicated out

>> >> to all lower devices when reaching e.g. a LAG.

>> >

>> > Agree with the principle. But doesn't that mean that this code:

>> >

>> > static int dsa_slave_switchdev_blocking_event(struct notifier_block *unused,

>> > 					      unsigned long event, void *ptr)

>> > {

>> > 	struct net_device *dev = switchdev_notifier_info_to_dev(ptr);

>> > 	int err;

>> >

>> > 	switch (event) {

>> > 	case SWITCHDEV_PORT_OBJ_ADD:

>> > 		err = switchdev_handle_port_obj_add(dev, ptr,

>> > 						    dsa_slave_dev_check,

>> > 						    dsa_slave_port_obj_add);

>> > 		return notifier_from_errno(err);

>> > 	case SWITCHDEV_PORT_OBJ_DEL:

>> > 		err = switchdev_handle_port_obj_del(dev, ptr,

>> > 						    dsa_slave_dev_check,

>> > 						    dsa_slave_port_obj_del);

>> > 		return notifier_from_errno(err);

>> > 	case SWITCHDEV_PORT_ATTR_SET:

>> > 		err = switchdev_handle_port_attr_set(dev, ptr,

>> > 						     dsa_slave_dev_check,

>> > 						     dsa_slave_port_attr_set);

>> > 		return notifier_from_errno(err);

>> > 	}

>> >

>> > 	return NOTIFY_DONE;

>> > }

>> >

>> > should be replaced with something that also reacts to the case where

>> > "dev" is a LAG? Like, for example, I imagine that a VLAN installed on a

>> > bridge port that is a LAG should be propagated to the switch ports

>> > beneath that LAG. Similarly for all bridge attributes.

>>

>> That is exactly what switchdev_handle_* does, no? It is this exact

>> behavior that my statement about switchdev event replication references.

>

> I'm sorry, I don't mean to be overly obtuse, but _how_ does the current

> code propagate a VLAN to a physical port located below a bond? Through

> magic? The dsa_slave_dev_check is passed as a parameter to

> switchdev_handle_port_obj_add _exactly_ because the code has needed so

> far to match only on DSA interfaces and not on bonding interfaces. So

> the code does not react to VLANs added on a bonding interface. Hence my

> question.


There is no magic involved, here is the relevant snippet from
__switchdev_handle_port_obj_add:

	/* Switch ports might be stacked under e.g. a LAG. Ignore the
	 * unsupported devices, another driver might be able to handle them. But
	 * propagate to the callers any hard errors.
	 *
	 * If the driver does its own bookkeeping of stacked ports, it's not
	 * necessary to go through this helper.
	 */
	netdev_for_each_lower_dev(dev, lower_dev, iter) {
		if (netif_is_bridge_master(lower_dev))
			continue;

		err = __switchdev_handle_port_obj_add(lower_dev, port_obj_info,
						      check_cb, add_cb);
		if (err && err != -EOPNOTSUPP)
			return err;
	}


> ip link del bond0

> ip link add bond0 type bond mode 802.3ad

> ip link set swp1 down && ip link set swp1 master bond0 && ip link set swp1 up

> ip link set swp2 down && ip link set swp2 master bond0 && ip link set swp2 up

> ip link del br0

> ip link add br0 type bridge

> ip link set bond0 master br0

> ip link set swp0 master br0

>

> This should propagate the VLANs to swp1 and swp2 but doesn't:

> bridge vlan add dev bond0 vid 100


I ran through this on my setup and it is indeed propagated to all ports.

Just a thought, when you rebased the ocelot specific stuff to v2, did
you add the number of supported LAGs to ds->num_lags? If not, DSA will
assume that the hardware does not support offloading.

> It's perfectly acceptable to say that this patch set doesn't deal with

> that. But your commit message seems to suggest that it's me who's

> misunderstanding something.


I understand, that is why I explicitly mentioned the lack of static FDB
support for example. But it absolutely should deal with the full list I
specified, so thanks for testing it.

Vladimir Oltean Dec. 1, 2020, 10:23 p.m. UTC | #9

On Tue, Dec 01, 2020 at 10:48:34PM +0100, Tobias Waldekranz wrote:
> On Tue, Dec 01, 2020 at 22:04, Vladimir Oltean <olteanv@gmail.com> wrote:

> > On Tue, Dec 01, 2020 at 03:29:53PM +0100, Tobias Waldekranz wrote:

> >> On Tue, Dec 01, 2020 at 16:03, Vladimir Oltean <olteanv@gmail.com> wrote:

> >> > On Mon, Nov 30, 2020 at 03:06:08PM +0100, Tobias Waldekranz wrote:

> >> >> When a LAG joins a bridge, the DSA subsystem will treat that as each

> >> >> individual port joining the bridge. The driver may look at the port's

> >> >> LAG pointer to see if it is associated with any LAG, if that is

> >> >> required. This is analogue to how switchdev events are replicated out

> >> >> to all lower devices when reaching e.g. a LAG.

> >> >

> >> > Agree with the principle. But doesn't that mean that this code:

> >> >

> >> > static int dsa_slave_switchdev_blocking_event(struct notifier_block *unused,

> >> > 					      unsigned long event, void *ptr)

> >> > {

> >> > 	struct net_device *dev = switchdev_notifier_info_to_dev(ptr);

> >> > 	int err;

> >> >

> >> > 	switch (event) {

> >> > 	case SWITCHDEV_PORT_OBJ_ADD:

> >> > 		err = switchdev_handle_port_obj_add(dev, ptr,

> >> > 						    dsa_slave_dev_check,

> >> > 						    dsa_slave_port_obj_add);

> >> > 		return notifier_from_errno(err);

> >> > 	case SWITCHDEV_PORT_OBJ_DEL:

> >> > 		err = switchdev_handle_port_obj_del(dev, ptr,

> >> > 						    dsa_slave_dev_check,

> >> > 						    dsa_slave_port_obj_del);

> >> > 		return notifier_from_errno(err);

> >> > 	case SWITCHDEV_PORT_ATTR_SET:

> >> > 		err = switchdev_handle_port_attr_set(dev, ptr,

> >> > 						     dsa_slave_dev_check,

> >> > 						     dsa_slave_port_attr_set);

> >> > 		return notifier_from_errno(err);

> >> > 	}

> >> >

> >> > 	return NOTIFY_DONE;

> >> > }

> >> >

> >> > should be replaced with something that also reacts to the case where

> >> > "dev" is a LAG? Like, for example, I imagine that a VLAN installed on a

> >> > bridge port that is a LAG should be propagated to the switch ports

> >> > beneath that LAG. Similarly for all bridge attributes.

> >>

> >> That is exactly what switchdev_handle_* does, no? It is this exact

> >> behavior that my statement about switchdev event replication references.

> >

> > I'm sorry, I don't mean to be overly obtuse, but _how_ does the current

> > code propagate a VLAN to a physical port located below a bond? Through

> > magic? The dsa_slave_dev_check is passed as a parameter to

> > switchdev_handle_port_obj_add _exactly_ because the code has needed so

> > far to match only on DSA interfaces and not on bonding interfaces. So

> > the code does not react to VLANs added on a bonding interface. Hence my

> > question.

>

> There is no magic involved, here is the relevant snippet from

> __switchdev_handle_port_obj_add:

>

> 	/* Switch ports might be stacked under e.g. a LAG. Ignore the

> 	 * unsupported devices, another driver might be able to handle them. But

> 	 * propagate to the callers any hard errors.

> 	 *

> 	 * If the driver does its own bookkeeping of stacked ports, it's not

> 	 * necessary to go through this helper.

> 	 */

> 	netdev_for_each_lower_dev(dev, lower_dev, iter) {

> 		if (netif_is_bridge_master(lower_dev))

> 			continue;

>

> 		err = __switchdev_handle_port_obj_add(lower_dev, port_obj_info,

> 						      check_cb, add_cb);

> 		if (err && err != -EOPNOTSUPP)

> 			return err;

> 	}

>


Oh wow, such an odd place to put that. Especially since the entire
reason why switchdev uses notifiers is that you as a switchdev driver
can now explicitly intercept and offload switchdev objects that the
bridge emitted towards a driver that was "not you", such as a vxlan
interface. I guess that's still what's happening now, just that it's
completely non-obvious since it's hidden behind an opaque function.

Very interesting, thanks, I didn't know that.

> > ip link del bond0

> > ip link add bond0 type bond mode 802.3ad

> > ip link set swp1 down && ip link set swp1 master bond0 && ip link set swp1 up

> > ip link set swp2 down && ip link set swp2 master bond0 && ip link set swp2 up

> > ip link del br0

> > ip link add br0 type bridge

> > ip link set bond0 master br0

> > ip link set swp0 master br0

> >

> > This should propagate the VLANs to swp1 and swp2 but doesn't:

> > bridge vlan add dev bond0 vid 100

>

> I ran through this on my setup and it is indeed propagated to all ports.

>

> Just a thought, when you rebased the ocelot specific stuff to v2, did

> you add the number of supported LAGs to ds->num_lags? If not, DSA will

> assume that the hardware does not support offloading.


Ah, yes, that makes sense and that's what was happening. So DSA does the
right thing and does not offload bridge attributes to these ports,
because bonding needs to be done in software, and therefore even
bridging on swp1 and swp2 needs to be done in software. So as far as DSA
is concerned, swp1 and swp2 are standalone ports. This reminds me that I
need to do more testing for switches that can't offload bonding, to make
sure that they do the right thing.

[v2,net-next,0/4] net: dsa: Link aggregation support

Message

Comments