diff mbox series

[v2,net-next,04/11] net: bridge: offload initial and final port flags through switchdev

Message ID 20210209151936.97382-5-olteanv@gmail.com
State New
Headers show
Series Cleanup in brport flags switchdev offload for DSA | expand

Commit Message

Vladimir Oltean Feb. 9, 2021, 3:19 p.m. UTC
From: Vladimir Oltean <vladimir.oltean@nxp.com>

It must first be admitted that switchdev device drivers have a life
beyond the bridge, and when they aren't offloading the bridge driver
they are operating with forwarding disabled between ports, emulating as
closely as possible N standalone network interfaces.

Now it must be said that for a switchdev port operating in standalone
mode, address learning doesn't make much sense since that is a bridge
function. In fact, address learning even breaks setups such as this one:

   +---------------------------------------------+
   |                                             |
   | +-------------------+                       |
   | |        br0        |    send      receive  |
   | +--------+-+--------+ +--------+ +--------+ |
   | |        | |        | |        | |        | |
   | |  swp0  | |  swp1  | |  swp2  | |  swp3  | |
   | |        | |        | |        | |        | |
   +-+--------+-+--------+-+--------+-+--------+-+
          |         ^           |          ^
          |         |           |          |
          |         +-----------+          |
          |                                |
          +--------------------------------+

because if the ASIC has a single FDB (can offload a single bridge)
then source address learning on swp3 can "steal" the source MAC address
of swp2 from br0's FDB, because learning frames coming from swp2 will be
done twice: first on the swp1 ingress port, second on the swp3 ingress
port. So the hardware FDB will become out of sync with the software
bridge, and when swp2 tries to send one more packet towards swp1, the
ASIC will attempt to short-circuit the forwarding path and send it
directly to swp3 (since that's the last port it learned that address on),
which it obviously can't, because swp3 operates in standalone mode.

So switchdev drivers operating in standalone mode should disable address
learning. As a matter of practicality, we can reduce code duplication in
drivers by having the bridge notify through switchdev of the initial and
final brport flags. Then, drivers can simply start up hardcoded for no
address learning (similar to how they already start up hardcoded for no
forwarding), then they only need to listen for
SWITCHDEV_ATTR_ID_PORT_BRIDGE_FLAGS and their job is basically done, no
need for special cases when the port joins or leaves the bridge etc.

When a port leaves the bridge (and therefore becomes standalone), we
issue a switchdev attribute that apart from disabling address learning,
enables flooding of all kinds. This is also done for pragmatic reasons,
because even though standalone switchdev ports might not need to have
flooding enabled in order to inject traffic with any MAC DA from the
control interface, it certainly doesn't hurt either, and it even makes
more sense than disabling flooding of unknown traffic towards that port.

Note that the implementation is a bit wacky because the switchdev API
for port attributes is very counterproductive. Instead of issuing a
single switchdev notification with a bitwise OR of all flags that we're
modifying, we need to issue 4 individual notifications, one for each bit.
This is because the SWITCHDEV_ATTR_ID_PORT_PRE_BRIDGE_FLAGS notifier
forces you to refuse the entire operation if there's at least one bit
which you can't offload, and that is currently BR_BCAST_FLOOD which
nobody does. So this change would do nothing for no one if we offloaded
all flags at once, but the idea is to offload as much as possible
instead of all or nothing.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
Changes in v2:
- Renamed nbp_flags_change to nbp_flags_notify.
- Don't return any errors, offload flags one by one as opposed to all at
  once.

 include/linux/if_bridge.h |  3 +++
 net/bridge/br_if.c        | 21 ++++++++++++++++++++-
 net/bridge/br_switchdev.c |  3 +--
 3 files changed, 24 insertions(+), 3 deletions(-)

Comments

Ido Schimmel Feb. 9, 2021, 6:51 p.m. UTC | #1
On Tue, Feb 09, 2021 at 05:19:29PM +0200, Vladimir Oltean wrote:
> So switchdev drivers operating in standalone mode should disable address
> learning. As a matter of practicality, we can reduce code duplication in
> drivers by having the bridge notify through switchdev of the initial and
> final brport flags. Then, drivers can simply start up hardcoded for no
> address learning (similar to how they already start up hardcoded for no
> forwarding), then they only need to listen for
> SWITCHDEV_ATTR_ID_PORT_BRIDGE_FLAGS and their job is basically done, no
> need for special cases when the port joins or leaves the bridge etc.

How are you handling the case where a port leaves a LAG that is linked
to a bridge? In this case the port becomes a standalone port, but will
not get this notification.
Ido Schimmel Feb. 9, 2021, 10:01 p.m. UTC | #2
On Tue, Feb 09, 2021 at 10:20:45PM +0200, Vladimir Oltean wrote:
> On Tue, Feb 09, 2021 at 08:51:00PM +0200, Ido Schimmel wrote:
> > On Tue, Feb 09, 2021 at 05:19:29PM +0200, Vladimir Oltean wrote:
> > > So switchdev drivers operating in standalone mode should disable address
> > > learning. As a matter of practicality, we can reduce code duplication in
> > > drivers by having the bridge notify through switchdev of the initial and
> > > final brport flags. Then, drivers can simply start up hardcoded for no
> > > address learning (similar to how they already start up hardcoded for no
> > > forwarding), then they only need to listen for
> > > SWITCHDEV_ATTR_ID_PORT_BRIDGE_FLAGS and their job is basically done, no
> > > need for special cases when the port joins or leaves the bridge etc.
> > 
> > How are you handling the case where a port leaves a LAG that is linked
> > to a bridge? In this case the port becomes a standalone port, but will
> > not get this notification.
> 
> Apparently the answer to that question is "I delete the code that makes
> this use case work", how smart of me. Thanks.

Not sure how you expect to interpret this.

> 
> Unless you have any idea how I could move the logic into the bridge, I
> guess I'm stuck with DSA and all the other switchdev drivers having this
> forest of corner cases to deal with. At least I can add a comment so I'm
> not tempted to delete it next time.

There are too many moving pieces with stacked devices. It is not only
LAG/bridge. In L3 you have VRFs, SVIs, macvlans etc. It might be better
to gracefully / explicitly not handle a case rather than pretending to
handle it correctly with complex / buggy code.

For example, you should refuse to be enslaved to a LAG that already has
upper devices such as a bridge. You are probably not handling this
correctly / at all. This is easy. Just a call to
netdev_has_any_upper_dev().

The reverse, during unlinking, would be to refuse unlinking if the upper
has uppers of its own. netdev_upper_dev_unlink() needs to learn to
return an error and callers such as team/bond need to learn to handle
it, but it seems patchable.
Ido Schimmel Feb. 10, 2021, 10:59 a.m. UTC | #3
On Wed, Feb 10, 2021 at 12:51:53AM +0200, Vladimir Oltean wrote:
> On Wed, Feb 10, 2021 at 12:01:24AM +0200, Ido Schimmel wrote:

> > On Tue, Feb 09, 2021 at 10:20:45PM +0200, Vladimir Oltean wrote:

> > > On Tue, Feb 09, 2021 at 08:51:00PM +0200, Ido Schimmel wrote:

> > > > On Tue, Feb 09, 2021 at 05:19:29PM +0200, Vladimir Oltean wrote:

> > > > > So switchdev drivers operating in standalone mode should disable address

> > > > > learning. As a matter of practicality, we can reduce code duplication in

> > > > > drivers by having the bridge notify through switchdev of the initial and

> > > > > final brport flags. Then, drivers can simply start up hardcoded for no

> > > > > address learning (similar to how they already start up hardcoded for no

> > > > > forwarding), then they only need to listen for

> > > > > SWITCHDEV_ATTR_ID_PORT_BRIDGE_FLAGS and their job is basically done, no

> > > > > need for special cases when the port joins or leaves the bridge etc.

> > > >

> > > > How are you handling the case where a port leaves a LAG that is linked

> > > > to a bridge? In this case the port becomes a standalone port, but will

> > > > not get this notification.

> > >

> > > Apparently the answer to that question is "I delete the code that makes

> > > this use case work", how smart of me. Thanks.

> >

> > Not sure how you expect to interpret this.

> 

> Next patch (05/11) deletes that explicit notification from dsa_port_bridge_leave,

> function which is called from dsa_port_lag_leave too, apparently with good reason.

> 

> > > Unless you have any idea how I could move the logic into the bridge, I

> > > guess I'm stuck with DSA and all the other switchdev drivers having this

> > > forest of corner cases to deal with. At least I can add a comment so I'm

> > > not tempted to delete it next time.

> >

> > There are too many moving pieces with stacked devices. It is not only

> > LAG/bridge. In L3 you have VRFs, SVIs, macvlans etc. It might be better

> > to gracefully / explicitly not handle a case rather than pretending to

> > handle it correctly with complex / buggy code.

> >

> > For example, you should refuse to be enslaved to a LAG that already has

> > upper devices such as a bridge. You are probably not handling this

> > correctly / at all. This is easy. Just a call to

> > netdev_has_any_upper_dev().

> 

> Correct, good point, in particular this means that joining a bridged LAG

> will not get me any notifications of that LAG's CHANGEUPPER because that

> was consumed a long time ago. An equally valid approach seems to be to

> check for netdev_master_upper_dev_get_rcu in dsa_port_lag_join, and call

> dsa_port_bridge_join on the upper if that is present.


The bridge might already have a state you are not familiar with (e.g.,
FDB entry pointing to the LAG), so best to just forbid this. I think
it's fair to impose such limitations (assuming they are properly
communicated to user space) given it results in a much less
buggy/complex code to maintain.

> 

> > The reverse, during unlinking, would be to refuse unlinking if the upper

> > has uppers of its own. netdev_upper_dev_unlink() needs to learn to

> > return an error and callers such as team/bond need to learn to handle

> > it, but it seems patchable.

> 

> Again, this was treated prior to my deletion in this series and not by

> erroring out, I just really didn't think it through.

> 

> So you're saying that if we impose that all switchdev drivers restrict

> the house of cards to be constructed from the bottom up, and destructed

> from the top down, then the notification of bridge port flags can stay

> in the bridge layer?


I actually don't think it's a good idea to have this in the bridge in
any case. I understand that it makes sense for some devices where
learning, flooding, etc are port attributes, but in other devices these
can be {port,vlan} attributes and then you need to take care of them
when a vlan is added / deleted and not only when a port is removed from
the bridge. So for such devices this really won't save anything. I would
thus leave it to the lower levels to decide.
Vladimir Oltean Feb. 10, 2021, 11:23 p.m. UTC | #4
On Wed, Feb 10, 2021 at 12:59:49PM +0200, Ido Schimmel wrote:
> > > The reverse, during unlinking, would be to refuse unlinking if the upper

> > > has uppers of its own. netdev_upper_dev_unlink() needs to learn to

> > > return an error and callers such as team/bond need to learn to handle

> > > it, but it seems patchable.

> >

> > Again, this was treated prior to my deletion in this series and not by

> > erroring out, I just really didn't think it through.

> >

> > So you're saying that if we impose that all switchdev drivers restrict

> > the house of cards to be constructed from the bottom up, and destructed

> > from the top down, then the notification of bridge port flags can stay

> > in the bridge layer?

>

> I actually don't think it's a good idea to have this in the bridge in

> any case. I understand that it makes sense for some devices where

> learning, flooding, etc are port attributes, but in other devices these

> can be {port,vlan} attributes and then you need to take care of them

> when a vlan is added / deleted and not only when a port is removed from

> the bridge. So for such devices this really won't save anything. I would

> thus leave it to the lower levels to decide.


Just for my understanding, how are per-{port,vlan} attributes such as
learning and flooding managed by the Linux bridge? How can I disable
flooding only in a certain VLAN?
Ido Schimmel Feb. 11, 2021, 7:44 a.m. UTC | #5
On Thu, Feb 11, 2021 at 01:23:52AM +0200, Vladimir Oltean wrote:
> On Wed, Feb 10, 2021 at 12:59:49PM +0200, Ido Schimmel wrote:

> > > > The reverse, during unlinking, would be to refuse unlinking if the upper

> > > > has uppers of its own. netdev_upper_dev_unlink() needs to learn to

> > > > return an error and callers such as team/bond need to learn to handle

> > > > it, but it seems patchable.

> > >

> > > Again, this was treated prior to my deletion in this series and not by

> > > erroring out, I just really didn't think it through.

> > >

> > > So you're saying that if we impose that all switchdev drivers restrict

> > > the house of cards to be constructed from the bottom up, and destructed

> > > from the top down, then the notification of bridge port flags can stay

> > > in the bridge layer?

> >

> > I actually don't think it's a good idea to have this in the bridge in

> > any case. I understand that it makes sense for some devices where

> > learning, flooding, etc are port attributes, but in other devices these

> > can be {port,vlan} attributes and then you need to take care of them

> > when a vlan is added / deleted and not only when a port is removed from

> > the bridge. So for such devices this really won't save anything. I would

> > thus leave it to the lower levels to decide.

> 

> Just for my understanding, how are per-{port,vlan} attributes such as

> learning and flooding managed by the Linux bridge? How can I disable

> flooding only in a certain VLAN?


You can't (currently). But it does not change the fact that in some
devices these are {port,vlan} attributes and we are talking here about
the interface towards these devices. Having these as {port,vlan}
attributes allows you to support use cases such as a port being enslaved
to a VLAN-aware bridge and its VLAN upper(s) enslaved to VLAN unaware
bridge(s). Obviously you need to ensure there is no conflict between the
VLANs used by the VLAN-aware bridge and the VLAN device(s).
Ido Schimmel Feb. 11, 2021, 10:20 p.m. UTC | #6
On Thu, Feb 11, 2021 at 11:35:27AM +0200, Vladimir Oltean wrote:
> On Thu, Feb 11, 2021 at 09:44:43AM +0200, Ido Schimmel wrote:

> > On Thu, Feb 11, 2021 at 01:23:52AM +0200, Vladimir Oltean wrote:

> > > On Wed, Feb 10, 2021 at 12:59:49PM +0200, Ido Schimmel wrote:

> > > > > > The reverse, during unlinking, would be to refuse unlinking if the upper

> > > > > > has uppers of its own. netdev_upper_dev_unlink() needs to learn to

> > > > > > return an error and callers such as team/bond need to learn to handle

> > > > > > it, but it seems patchable.

> > > > >

> > > > > Again, this was treated prior to my deletion in this series and not by

> > > > > erroring out, I just really didn't think it through.

> > > > >

> > > > > So you're saying that if we impose that all switchdev drivers restrict

> > > > > the house of cards to be constructed from the bottom up, and destructed

> > > > > from the top down, then the notification of bridge port flags can stay

> > > > > in the bridge layer?

> > > >

> > > > I actually don't think it's a good idea to have this in the bridge in

> > > > any case. I understand that it makes sense for some devices where

> > > > learning, flooding, etc are port attributes, but in other devices these

> > > > can be {port,vlan} attributes and then you need to take care of them

> > > > when a vlan is added / deleted and not only when a port is removed from

> > > > the bridge. So for such devices this really won't save anything. I would

> > > > thus leave it to the lower levels to decide.

> > >

> > > Just for my understanding, how are per-{port,vlan} attributes such as

> > > learning and flooding managed by the Linux bridge? How can I disable

> > > flooding only in a certain VLAN?

> >

> > You can't (currently). But it does not change the fact that in some

> > devices these are {port,vlan} attributes and we are talking here about

> > the interface towards these devices. Having these as {port,vlan}

> > attributes allows you to support use cases such as a port being enslaved

> > to a VLAN-aware bridge and its VLAN upper(s) enslaved to VLAN unaware

> > bridge(s).

> 

> I don't think I understand the use case really. You mean something like this?

> 

>     br1 (vlan_filtering=0)

>     /           \

>    /             \

>  swp0.100         \

>    |               \

>    |(vlan_filtering \

>    |  br0  =1)       \

>    | /   \            \

>    |/     \            \

>  swp0    swp1         swp2

> 

> A packet received on swp0 with VLAN tag 100 will go to swp0.100 which

> will be forwarded according to the FDB of br1, and will be delivered to

> swp2 as untagged? Respectively in the other direction, a packet received

> on swp2 will have a VLAN 100 tag pushed on egress towards swp0, even if

> it is already VLAN-tagged?

> 

> What do you even use this for?


The more common use case is to have multiple VLAN-unaware bridges
instead of one VLAN-aware bridge. I'm not aware of users that use the
hybrid model (VLAN-aware + VLAN-unaware). But regardless, this entails
treating above mentioned attributes as {port,vlan} attributes. A device
that only supports them as port attributes will have problems supporting
such a model.

> And also: if the {port,vlan} attributes can be simulated by making the

> bridge port be an 8021q upper of a physical interface, then as far as

> the bridge is concerned, they still are per-port attributes, and they

> are per-{port,vlan} only as far as the switch driver is concerned -

> therefore I don't see why it isn't okay for the bridge to notify the

> brport flags in exactly the same way for them too.


Look at this hunk from the patch:

@@ -343,6 +360,8 @@ static void del_nbp(struct net_bridge_port *p)
 		update_headroom(br, get_max_headroom(br));
 	netdev_reset_rx_headroom(dev);
 
+	nbp_flags_notify(p, BR_PORT_DEFAULT_FLAGS & ~BR_LEARNING,
+			 BR_PORT_DEFAULT_FLAGS);
 	nbp_vlan_flush(p);
 	br_fdb_delete_by_port(br, p, 0, 1);
 	switchdev_deferred_process();

Devices that treat these attributes as {port,vlan} attributes will undo
this change upon the call to nbp_vlan_flush() when all the VLANs are
flushed.

> 

> > Obviously you need to ensure there is no conflict between the

> > VLANs used by the VLAN-aware bridge and the VLAN device(s).

> 

> On the other hand I think I have a more real-life use case that I think

> is in conflict with this last phrase.

> I have a VLAN-aware bridge and I want to run PTP in VLAN 7, but I also

> need to add VLAN 7 in the VLAN table of the bridge ports so that it

> doesn't drop traffic. PTP is link-local, so I need to run it on VLAN

> uppers of the switch ports. Like this:

> 

> ip link add br0 type bridge vlan_filtering 1

> ip link set swp0 master br0

> ip link set swp1 master br0

> bridge vlan add dev swp0 vid 7 master

> bridge vlan add dev swp1 vid 7 master

> bridge vlan add dev br0 vid 7 self

> ip link add link swp0 name swp0.7 type vlan id 7

> ip link add link swp1 name swp0.7 type vlan id 7

> ptp4l -i swp0.7 -i swp1.7 -m

> 

> How can I do that considering that you recommend avoiding conflicts

> between the VLAN-aware bridge and 8021q uppers? Or is that true only

> when the 8021q uppers are bridged?


The problem is with the statement "I also need to add VLAN 7 in the VLAN
table of the bridge ports so that it doesn't drop traffic". Packets with
VLAN 7 received by swp0 will be processed by swp0.7. br0 is irrelevant
and configuring swp0.7 should be enough in order to enable the VLAN
filter for VLAN 7 on swp0. I don't know the internals of the HW you are
working with, but I imagine that you would need to create a HW bridge
between {swp0, VLAN 7} and the CPU port so that all the traffic with
VLAN 7 will be sent / flooded to the CPU.
diff mbox series

Patch

diff --git a/include/linux/if_bridge.h b/include/linux/if_bridge.h
index b979005ea39c..36d77fa8f40b 100644
--- a/include/linux/if_bridge.h
+++ b/include/linux/if_bridge.h
@@ -58,6 +58,9 @@  struct br_ip_list {
 #define BR_MRP_LOST_CONT	BIT(18)
 #define BR_MRP_LOST_IN_CONT	BIT(19)
 
+#define BR_PORT_DEFAULT_FLAGS	(BR_FLOOD | BR_MCAST_FLOOD | BR_BCAST_FLOOD | \
+				 BR_LEARNING)
+
 #define BR_DEFAULT_AGEING_TIME	(300 * HZ)
 
 extern void brioctl_set(int (*ioctl_hook)(struct net *, unsigned int, void __user *));
diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c
index f7d2f472ae24..f813eec986ba 100644
--- a/net/bridge/br_if.c
+++ b/net/bridge/br_if.c
@@ -89,6 +89,23 @@  void br_port_carrier_check(struct net_bridge_port *p, bool *notified)
 	spin_unlock_bh(&br->lock);
 }
 
+/* If @mask has multiple bits set at once, offload them one by one to
+ * switchdev, to allow it to reject only what it doesn't support and accept
+ * what it does.
+ */
+static void nbp_flags_notify(struct net_bridge_port *p, unsigned long flags,
+			     unsigned long mask)
+{
+	int flag;
+
+	for_each_set_bit(flag, &mask, 32)
+		br_switchdev_set_port_flag(p, flags & BIT(flag),
+					   BIT(flag), NULL);
+
+	p->flags &= ~mask;
+	p->flags |= flags;
+}
+
 static void br_port_set_promisc(struct net_bridge_port *p)
 {
 	int err = 0;
@@ -343,6 +360,8 @@  static void del_nbp(struct net_bridge_port *p)
 		update_headroom(br, get_max_headroom(br));
 	netdev_reset_rx_headroom(dev);
 
+	nbp_flags_notify(p, BR_PORT_DEFAULT_FLAGS & ~BR_LEARNING,
+			 BR_PORT_DEFAULT_FLAGS);
 	nbp_vlan_flush(p);
 	br_fdb_delete_by_port(br, p, 0, 1);
 	switchdev_deferred_process();
@@ -428,7 +447,7 @@  static struct net_bridge_port *new_nbp(struct net_bridge *br,
 	p->path_cost = port_cost(dev);
 	p->priority = 0x8000 >> BR_PORT_BITS;
 	p->port_no = index;
-	p->flags = BR_LEARNING | BR_FLOOD | BR_MCAST_FLOOD | BR_BCAST_FLOOD;
+	nbp_flags_notify(p, BR_PORT_DEFAULT_FLAGS, BR_PORT_DEFAULT_FLAGS);
 	br_init_port(p);
 	br_set_state(p, BR_STATE_DISABLED);
 	br_stp_port_timer_init(p);
diff --git a/net/bridge/br_switchdev.c b/net/bridge/br_switchdev.c
index ac8dead86bf2..1fae532cfbb1 100644
--- a/net/bridge/br_switchdev.c
+++ b/net/bridge/br_switchdev.c
@@ -55,8 +55,7 @@  bool nbp_switchdev_allowed_egress(const struct net_bridge_port *p,
 }
 
 /* Flags that can be offloaded to hardware */
-#define BR_PORT_FLAGS_HW_OFFLOAD (BR_LEARNING | BR_FLOOD | \
-				  BR_MCAST_FLOOD | BR_BCAST_FLOOD)
+#define BR_PORT_FLAGS_HW_OFFLOAD	BR_PORT_DEFAULT_FLAGS
 
 int br_switchdev_set_port_flag(struct net_bridge_port *p,
 			       unsigned long flags,