Message ID | 20201130140610.4018-1-tobias@waldekranz.com |
---|---|
Headers | show |
Series | net: dsa: Link aggregation support | expand |
Hi Tobias, On Mon, Nov 30, 2020 at 03:06:08PM +0100, Tobias Waldekranz wrote: > Monitor the following events and notify the driver when: > > - A DSA port joins/leaves a LAG. > - A LAG, made up of DSA ports, joins/leaves a bridge. > - A DSA port in a LAG is enabled/disabled (enabled meaning > "distributing" in 802.3ad LACP terms). > > Each LAG interface to which a DSA port is attached is represented by a > `struct dsa_lag` which is globally reachable from the switch tree and > from each associated port. > > When a LAG joins a bridge, the DSA subsystem will treat that as each > individual port joining the bridge. The driver may look at the port's > LAG pointer to see if it is associated with any LAG, if that is > required. This is analogue to how switchdev events are replicated out > to all lower devices when reaching e.g. a LAG. > > Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com> > --- > include/net/dsa.h | 97 +++++++++++++++++++++++++++++++ > net/dsa/dsa2.c | 51 ++++++++++++++++ > net/dsa/dsa_priv.h | 31 ++++++++++ > net/dsa/port.c | 141 +++++++++++++++++++++++++++++++++++++++++++++ > net/dsa/slave.c | 83 ++++++++++++++++++++++++-- > net/dsa/switch.c | 49 ++++++++++++++++ > 6 files changed, 446 insertions(+), 6 deletions(-) > > +static inline struct dsa_lag *dsa_lag_by_id(struct dsa_switch_tree *dst, int id) > +{ > + if (!test_bit(id, dst->lags.busy)) > + return NULL; > + > + return &dst->lags.pool[id]; > +} > + > +static inline struct net_device *dsa_lag_dev_by_id(struct dsa_switch_tree *dst, > + int id) > +{ > + struct dsa_lag *lag = dsa_lag_by_id(dst, id); > + > + return lag ? rcu_dereference(lag->dev) : NULL; > +} > + > +static inline struct dsa_lag *dsa_lag_by_dev(struct dsa_switch_tree *dst, > + struct net_device *dev) > +{ > + struct dsa_lag *lag; > + int id; > + > + dsa_lag_foreach(id, dst) { > + lag = dsa_lag_by_id(dst, id); > + > + if (rtnl_dereference(lag->dev) == dev) > + return lag; > + } > + > + return NULL; > +} > > diff --git a/net/dsa/port.c b/net/dsa/port.c > index 73569c9af3cc..c2332ee5f5c7 100644 > --- a/net/dsa/port.c > +++ b/net/dsa/port.c > @@ -193,6 +193,147 @@ void dsa_port_bridge_leave(struct dsa_port *dp, struct net_device *br) > dsa_port_set_state_now(dp, BR_STATE_FORWARDING); > } > > +static struct dsa_lag *dsa_lag_get(struct dsa_switch_tree *dst, > + struct net_device *dev) > +{ > + struct dsa_lag *lag; > + int id; > + > + lag = dsa_lag_by_dev(dst, dev); > + if (lag) { > + kref_get(&lag->refcount); > + return lag; > + } > + > + id = find_first_zero_bit(dst->lags.busy, dst->lags.num); > + if (id >= dst->lags.num) { > + WARN(1, "No LAGs available"); > + return NULL; > + } > + > + set_bit(id, dst->lags.busy); > + > + lag = &dst->lags.pool[id]; > + kref_init(&lag->refcount); > + lag->id = id; > + INIT_LIST_HEAD(&lag->ports); > + INIT_LIST_HEAD(&lag->tx_ports); > + > + rcu_assign_pointer(lag->dev, dev); > + return lag; > +} > + > +static void dsa_lag_release(struct kref *refcount) > +{ > + struct dsa_lag *lag = container_of(refcount, struct dsa_lag, refcount); > + > + rcu_assign_pointer(lag->dev, NULL); > + synchronize_rcu(); > + memset(lag, 0, sizeof(*lag)); > +} What difference does it make if lag->dev is set to NULL right away or after a grace period? Squeezing one last packet from that bonding interface? Pointer updates are atomic operations on all architectures that the kernel supports, and, as long as you use WRITE_ONCE and READ_ONCE memory barriers, there should be no reason for RCU protection that I can see. And unlike typical uses of RCU, you do not free lag->dev, because you do not own lag->dev. Instead, the bonding interface pointed to by lag->dev is going to be freed (in case of a deletion using ip link) after an RCU grace period anyway. And the receive data path is under an RCU read-side critical section anyway. So even if you set lag->dev to NULL using WRITE_ONCE, the existing in-flight readers from the RX data path that had called dsa_lag_dev_by_id() will still hold a reference to a valid bonding interface.
On Tue, Dec 01, 2020 at 03:37, Vladimir Oltean <olteanv@gmail.com> wrote: > On Mon, Nov 30, 2020 at 03:06:08PM +0100, Tobias Waldekranz wrote: >> +static void dsa_lag_release(struct kref *refcount) >> +{ >> + struct dsa_lag *lag = container_of(refcount, struct dsa_lag, refcount); >> + >> + rcu_assign_pointer(lag->dev, NULL); >> + synchronize_rcu(); >> + memset(lag, 0, sizeof(*lag)); >> +} > > What difference does it make if lag->dev is set to NULL right away or > after a grace period? Squeezing one last packet from that bonding interface? > Pointer updates are atomic operations on all architectures that the > kernel supports, and, as long as you use WRITE_ONCE and READ_ONCE memory > barriers, there should be no reason for RCU protection that I can see. > And unlike typical uses of RCU, you do not free lag->dev, because you do > not own lag->dev. Instead, the bonding interface pointed to by lag->dev > is going to be freed (in case of a deletion using ip link) after an RCU > grace period anyway. And the receive data path is under an RCU read-side > critical section anyway. So even if you set lag->dev to NULL using > WRITE_ONCE, the existing in-flight readers from the RX data path that > had called dsa_lag_dev_by_id() will still hold a reference to a valid > bonding interface. I completely agree with your analysis. I will remove all the RCU primitives in v3. Thank you.
On Tue, Dec 01, 2020 at 09:13:57AM +0100, Tobias Waldekranz wrote: > I completely agree with your analysis. I will remove all the RCU > primitives in v3. Thank you. I expect that this also gives us a simple refcount_t instead of the struct kref?
On Mon, Nov 30, 2020 at 03:06:08PM +0100, Tobias Waldekranz wrote: > When a LAG joins a bridge, the DSA subsystem will treat that as each > individual port joining the bridge. The driver may look at the port's > LAG pointer to see if it is associated with any LAG, if that is > required. This is analogue to how switchdev events are replicated out > to all lower devices when reaching e.g. a LAG. Agree with the principle. But doesn't that mean that this code: static int dsa_slave_switchdev_blocking_event(struct notifier_block *unused, unsigned long event, void *ptr) { struct net_device *dev = switchdev_notifier_info_to_dev(ptr); int err; switch (event) { case SWITCHDEV_PORT_OBJ_ADD: err = switchdev_handle_port_obj_add(dev, ptr, dsa_slave_dev_check, dsa_slave_port_obj_add); return notifier_from_errno(err); case SWITCHDEV_PORT_OBJ_DEL: err = switchdev_handle_port_obj_del(dev, ptr, dsa_slave_dev_check, dsa_slave_port_obj_del); return notifier_from_errno(err); case SWITCHDEV_PORT_ATTR_SET: err = switchdev_handle_port_attr_set(dev, ptr, dsa_slave_dev_check, dsa_slave_port_attr_set); return notifier_from_errno(err); } return NOTIFY_DONE; } should be replaced with something that also reacts to the case where "dev" is a LAG? Like, for example, I imagine that a VLAN installed on a bridge port that is a LAG should be propagated to the switch ports beneath that LAG. Similarly for all bridge attributes. As for FDB and MDB addresses, I think they should be propagated towards a "logical port" corresponding to the LAG upper. I don't know how the mv88e6xxx handles this.
On Tue, Dec 01, 2020 at 15:29, Vladimir Oltean <olteanv@gmail.com> wrote: > On Tue, Dec 01, 2020 at 09:13:57AM +0100, Tobias Waldekranz wrote: >> I completely agree with your analysis. I will remove all the RCU >> primitives in v3. Thank you. > > I expect that this also gives us a simple refcount_t instead of the > struct kref? Yeah sure, I was just trying to be consistent with what was being used in other dsa-related structs. I will change it.
On Tue, Dec 01, 2020 at 16:03, Vladimir Oltean <olteanv@gmail.com> wrote: > On Mon, Nov 30, 2020 at 03:06:08PM +0100, Tobias Waldekranz wrote: >> When a LAG joins a bridge, the DSA subsystem will treat that as each >> individual port joining the bridge. The driver may look at the port's >> LAG pointer to see if it is associated with any LAG, if that is >> required. This is analogue to how switchdev events are replicated out >> to all lower devices when reaching e.g. a LAG. > > Agree with the principle. But doesn't that mean that this code: > > static int dsa_slave_switchdev_blocking_event(struct notifier_block *unused, > unsigned long event, void *ptr) > { > struct net_device *dev = switchdev_notifier_info_to_dev(ptr); > int err; > > switch (event) { > case SWITCHDEV_PORT_OBJ_ADD: > err = switchdev_handle_port_obj_add(dev, ptr, > dsa_slave_dev_check, > dsa_slave_port_obj_add); > return notifier_from_errno(err); > case SWITCHDEV_PORT_OBJ_DEL: > err = switchdev_handle_port_obj_del(dev, ptr, > dsa_slave_dev_check, > dsa_slave_port_obj_del); > return notifier_from_errno(err); > case SWITCHDEV_PORT_ATTR_SET: > err = switchdev_handle_port_attr_set(dev, ptr, > dsa_slave_dev_check, > dsa_slave_port_attr_set); > return notifier_from_errno(err); > } > > return NOTIFY_DONE; > } > > should be replaced with something that also reacts to the case where > "dev" is a LAG? Like, for example, I imagine that a VLAN installed on a > bridge port that is a LAG should be propagated to the switch ports > beneath that LAG. Similarly for all bridge attributes. That is exactly what switchdev_handle_* does, no? It is this exact behavior that my statement about switchdev event replication references. > As for FDB and MDB addresses, I think they should be propagated towards > a "logical port" corresponding to the LAG upper. I don't know how the > mv88e6xxx handles this. mv88e6xxx differentiates between multicast and unicast entries. So MDB entries fit very well with the obj_add/del+replication. Unicast entries will have use "lagX" as its destination, so in that case we need a new dsa op along the lines of "lag_fdb_add/del".
On Tue, Dec 01, 2020 at 03:29:53PM +0100, Tobias Waldekranz wrote: > On Tue, Dec 01, 2020 at 16:03, Vladimir Oltean <olteanv@gmail.com> wrote: > > On Mon, Nov 30, 2020 at 03:06:08PM +0100, Tobias Waldekranz wrote: > >> When a LAG joins a bridge, the DSA subsystem will treat that as each > >> individual port joining the bridge. The driver may look at the port's > >> LAG pointer to see if it is associated with any LAG, if that is > >> required. This is analogue to how switchdev events are replicated out > >> to all lower devices when reaching e.g. a LAG. > > > > Agree with the principle. But doesn't that mean that this code: > > > > static int dsa_slave_switchdev_blocking_event(struct notifier_block *unused, > > unsigned long event, void *ptr) > > { > > struct net_device *dev = switchdev_notifier_info_to_dev(ptr); > > int err; > > > > switch (event) { > > case SWITCHDEV_PORT_OBJ_ADD: > > err = switchdev_handle_port_obj_add(dev, ptr, > > dsa_slave_dev_check, > > dsa_slave_port_obj_add); > > return notifier_from_errno(err); > > case SWITCHDEV_PORT_OBJ_DEL: > > err = switchdev_handle_port_obj_del(dev, ptr, > > dsa_slave_dev_check, > > dsa_slave_port_obj_del); > > return notifier_from_errno(err); > > case SWITCHDEV_PORT_ATTR_SET: > > err = switchdev_handle_port_attr_set(dev, ptr, > > dsa_slave_dev_check, > > dsa_slave_port_attr_set); > > return notifier_from_errno(err); > > } > > > > return NOTIFY_DONE; > > } > > > > should be replaced with something that also reacts to the case where > > "dev" is a LAG? Like, for example, I imagine that a VLAN installed on a > > bridge port that is a LAG should be propagated to the switch ports > > beneath that LAG. Similarly for all bridge attributes. > > That is exactly what switchdev_handle_* does, no? It is this exact > behavior that my statement about switchdev event replication references. I'm sorry, I don't mean to be overly obtuse, but _how_ does the current code propagate a VLAN to a physical port located below a bond? Through magic? The dsa_slave_dev_check is passed as a parameter to switchdev_handle_port_obj_add _exactly_ because the code has needed so far to match only on DSA interfaces and not on bonding interfaces. So the code does not react to VLANs added on a bonding interface. Hence my question. ip link del bond0 ip link add bond0 type bond mode 802.3ad ip link set swp1 down && ip link set swp1 master bond0 && ip link set swp1 up ip link set swp2 down && ip link set swp2 master bond0 && ip link set swp2 up ip link del br0 ip link add br0 type bridge ip link set bond0 master br0 ip link set swp0 master br0 This should propagate the VLANs to swp1 and swp2 but doesn't: bridge vlan add dev bond0 vid 100 It's perfectly acceptable to say that this patch set doesn't deal with that. But your commit message seems to suggest that it's me who's misunderstanding something.
On Tue, Dec 01, 2020 at 22:04, Vladimir Oltean <olteanv@gmail.com> wrote: > On Tue, Dec 01, 2020 at 03:29:53PM +0100, Tobias Waldekranz wrote: >> On Tue, Dec 01, 2020 at 16:03, Vladimir Oltean <olteanv@gmail.com> wrote: >> > On Mon, Nov 30, 2020 at 03:06:08PM +0100, Tobias Waldekranz wrote: >> >> When a LAG joins a bridge, the DSA subsystem will treat that as each >> >> individual port joining the bridge. The driver may look at the port's >> >> LAG pointer to see if it is associated with any LAG, if that is >> >> required. This is analogue to how switchdev events are replicated out >> >> to all lower devices when reaching e.g. a LAG. >> > >> > Agree with the principle. But doesn't that mean that this code: >> > >> > static int dsa_slave_switchdev_blocking_event(struct notifier_block *unused, >> > unsigned long event, void *ptr) >> > { >> > struct net_device *dev = switchdev_notifier_info_to_dev(ptr); >> > int err; >> > >> > switch (event) { >> > case SWITCHDEV_PORT_OBJ_ADD: >> > err = switchdev_handle_port_obj_add(dev, ptr, >> > dsa_slave_dev_check, >> > dsa_slave_port_obj_add); >> > return notifier_from_errno(err); >> > case SWITCHDEV_PORT_OBJ_DEL: >> > err = switchdev_handle_port_obj_del(dev, ptr, >> > dsa_slave_dev_check, >> > dsa_slave_port_obj_del); >> > return notifier_from_errno(err); >> > case SWITCHDEV_PORT_ATTR_SET: >> > err = switchdev_handle_port_attr_set(dev, ptr, >> > dsa_slave_dev_check, >> > dsa_slave_port_attr_set); >> > return notifier_from_errno(err); >> > } >> > >> > return NOTIFY_DONE; >> > } >> > >> > should be replaced with something that also reacts to the case where >> > "dev" is a LAG? Like, for example, I imagine that a VLAN installed on a >> > bridge port that is a LAG should be propagated to the switch ports >> > beneath that LAG. Similarly for all bridge attributes. >> >> That is exactly what switchdev_handle_* does, no? It is this exact >> behavior that my statement about switchdev event replication references. > > I'm sorry, I don't mean to be overly obtuse, but _how_ does the current > code propagate a VLAN to a physical port located below a bond? Through > magic? The dsa_slave_dev_check is passed as a parameter to > switchdev_handle_port_obj_add _exactly_ because the code has needed so > far to match only on DSA interfaces and not on bonding interfaces. So > the code does not react to VLANs added on a bonding interface. Hence my > question. There is no magic involved, here is the relevant snippet from __switchdev_handle_port_obj_add: /* Switch ports might be stacked under e.g. a LAG. Ignore the * unsupported devices, another driver might be able to handle them. But * propagate to the callers any hard errors. * * If the driver does its own bookkeeping of stacked ports, it's not * necessary to go through this helper. */ netdev_for_each_lower_dev(dev, lower_dev, iter) { if (netif_is_bridge_master(lower_dev)) continue; err = __switchdev_handle_port_obj_add(lower_dev, port_obj_info, check_cb, add_cb); if (err && err != -EOPNOTSUPP) return err; } > ip link del bond0 > ip link add bond0 type bond mode 802.3ad > ip link set swp1 down && ip link set swp1 master bond0 && ip link set swp1 up > ip link set swp2 down && ip link set swp2 master bond0 && ip link set swp2 up > ip link del br0 > ip link add br0 type bridge > ip link set bond0 master br0 > ip link set swp0 master br0 > > This should propagate the VLANs to swp1 and swp2 but doesn't: > bridge vlan add dev bond0 vid 100 I ran through this on my setup and it is indeed propagated to all ports. Just a thought, when you rebased the ocelot specific stuff to v2, did you add the number of supported LAGs to ds->num_lags? If not, DSA will assume that the hardware does not support offloading. > It's perfectly acceptable to say that this patch set doesn't deal with > that. But your commit message seems to suggest that it's me who's > misunderstanding something. I understand, that is why I explicitly mentioned the lack of static FDB support for example. But it absolutely should deal with the full list I specified, so thanks for testing it.
On Tue, Dec 01, 2020 at 10:48:34PM +0100, Tobias Waldekranz wrote: > On Tue, Dec 01, 2020 at 22:04, Vladimir Oltean <olteanv@gmail.com> wrote: > > On Tue, Dec 01, 2020 at 03:29:53PM +0100, Tobias Waldekranz wrote: > >> On Tue, Dec 01, 2020 at 16:03, Vladimir Oltean <olteanv@gmail.com> wrote: > >> > On Mon, Nov 30, 2020 at 03:06:08PM +0100, Tobias Waldekranz wrote: > >> >> When a LAG joins a bridge, the DSA subsystem will treat that as each > >> >> individual port joining the bridge. The driver may look at the port's > >> >> LAG pointer to see if it is associated with any LAG, if that is > >> >> required. This is analogue to how switchdev events are replicated out > >> >> to all lower devices when reaching e.g. a LAG. > >> > > >> > Agree with the principle. But doesn't that mean that this code: > >> > > >> > static int dsa_slave_switchdev_blocking_event(struct notifier_block *unused, > >> > unsigned long event, void *ptr) > >> > { > >> > struct net_device *dev = switchdev_notifier_info_to_dev(ptr); > >> > int err; > >> > > >> > switch (event) { > >> > case SWITCHDEV_PORT_OBJ_ADD: > >> > err = switchdev_handle_port_obj_add(dev, ptr, > >> > dsa_slave_dev_check, > >> > dsa_slave_port_obj_add); > >> > return notifier_from_errno(err); > >> > case SWITCHDEV_PORT_OBJ_DEL: > >> > err = switchdev_handle_port_obj_del(dev, ptr, > >> > dsa_slave_dev_check, > >> > dsa_slave_port_obj_del); > >> > return notifier_from_errno(err); > >> > case SWITCHDEV_PORT_ATTR_SET: > >> > err = switchdev_handle_port_attr_set(dev, ptr, > >> > dsa_slave_dev_check, > >> > dsa_slave_port_attr_set); > >> > return notifier_from_errno(err); > >> > } > >> > > >> > return NOTIFY_DONE; > >> > } > >> > > >> > should be replaced with something that also reacts to the case where > >> > "dev" is a LAG? Like, for example, I imagine that a VLAN installed on a > >> > bridge port that is a LAG should be propagated to the switch ports > >> > beneath that LAG. Similarly for all bridge attributes. > >> > >> That is exactly what switchdev_handle_* does, no? It is this exact > >> behavior that my statement about switchdev event replication references. > > > > I'm sorry, I don't mean to be overly obtuse, but _how_ does the current > > code propagate a VLAN to a physical port located below a bond? Through > > magic? The dsa_slave_dev_check is passed as a parameter to > > switchdev_handle_port_obj_add _exactly_ because the code has needed so > > far to match only on DSA interfaces and not on bonding interfaces. So > > the code does not react to VLANs added on a bonding interface. Hence my > > question. > > There is no magic involved, here is the relevant snippet from > __switchdev_handle_port_obj_add: > > /* Switch ports might be stacked under e.g. a LAG. Ignore the > * unsupported devices, another driver might be able to handle them. But > * propagate to the callers any hard errors. > * > * If the driver does its own bookkeeping of stacked ports, it's not > * necessary to go through this helper. > */ > netdev_for_each_lower_dev(dev, lower_dev, iter) { > if (netif_is_bridge_master(lower_dev)) > continue; > > err = __switchdev_handle_port_obj_add(lower_dev, port_obj_info, > check_cb, add_cb); > if (err && err != -EOPNOTSUPP) > return err; > } > Oh wow, such an odd place to put that. Especially since the entire reason why switchdev uses notifiers is that you as a switchdev driver can now explicitly intercept and offload switchdev objects that the bridge emitted towards a driver that was "not you", such as a vxlan interface. I guess that's still what's happening now, just that it's completely non-obvious since it's hidden behind an opaque function. Very interesting, thanks, I didn't know that. > > ip link del bond0 > > ip link add bond0 type bond mode 802.3ad > > ip link set swp1 down && ip link set swp1 master bond0 && ip link set swp1 up > > ip link set swp2 down && ip link set swp2 master bond0 && ip link set swp2 up > > ip link del br0 > > ip link add br0 type bridge > > ip link set bond0 master br0 > > ip link set swp0 master br0 > > > > This should propagate the VLANs to swp1 and swp2 but doesn't: > > bridge vlan add dev bond0 vid 100 > > I ran through this on my setup and it is indeed propagated to all ports. > > Just a thought, when you rebased the ocelot specific stuff to v2, did > you add the number of supported LAGs to ds->num_lags? If not, DSA will > assume that the hardware does not support offloading. Ah, yes, that makes sense and that's what was happening. So DSA does the right thing and does not offload bridge attributes to these ports, because bonding needs to be done in software, and therefore even bridging on swp1 and swp2 needs to be done in software. So as far as DSA is concerned, swp1 and swp2 are standalone ports. This reminds me that I need to do more testing for switches that can't offload bonding, to make sure that they do the right thing.