[RFC,v2,net-next,00/16] Better support for sandwiched LAGs with bridge and DSA

Message ID	20210318231829.3892920-1-olteanv@gmail.com
Headers	show Return-Path: <linux-omap-owner@kernel.org> From: Vladimir Oltean <olteanv@gmail.com> To: Jakub Kicinski <kuba@kernel.org>, "David S. Miller" <davem@davemloft.net> Cc: Andrew Lunn <andrew@lunn.ch>, Vivien Didelot <vivien.didelot@gmail.com>, Florian Fainelli <f.fainelli@gmail.com>, Tobias Waldekranz <tobias@waldekranz.com>, netdev@vger.kernel.org, linux-kernel@vger.kernel.org, Roopa Prabhu <roopa@nvidia.com>, Nikolay Aleksandrov <nikolay@nvidia.com>, Jiri Pirko <jiri@resnulli.us>, Ido Schimmel <idosch@idosch.org>, Alexandre Belloni <alexandre.belloni@bootlin.com>, UNGLinuxDriver@microchip.com, Vadym Kochan <vkochan@marvell.com>, Taras Chornyi <tchornyi@marvell.com>, Grygorii Strashko <grygorii.strashko@ti.com>, Vignesh Raghavendra <vigneshr@ti.com>, Ioana Ciornei <ioana.ciornei@nxp.com>, Ivan Vecera <ivecera@redhat.com>, linux-omap@vger.kernel.org, Vladimir Oltean <vladimir.oltean@nxp.com> Subject: [RFC PATCH v2 net-next 00/16] Better support for sandwiched LAGs with bridge and DSA Date: Fri, 19 Mar 2021 01:18:13 +0200 Message-Id: <20210318231829.3892920-1-olteanv@gmail.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	Better support for sandwiched LAGs with bridge and DSA \| expand [RFC,v2,net-next,00/16] Better support for sandwiched LAGs with bridge and DSA [RFC,v2,net-next,01/16] net: dsa: call dsa_port_bridge_join when joining a LAG that is already in... [RFC,v2,net-next,02/16] net: dsa: pass extack to dsa_port_{bridge,lag}_join [RFC,v2,net-next,03/16] net: dsa: inherit the actual bridge port flags at join time [RFC,v2,net-next,04/16] net: dsa: sync up with bridge port's STP state when joining [RFC,v2,net-next,05/16] net: dsa: sync up VLAN filtering state when joining the bridge [RFC,v2,net-next,06/16] net: dsa: sync multicast router state when joining the bridge [RFC,v2,net-next,07/16] net: dsa: sync ageing time when joining the bridge [RFC,v2,net-next,08/16] net: dsa: replay port and host-joined mdb entries when joining the bridge [RFC,v2,net-next,09/16] net: dsa: replay port and local fdb entries when joining the bridge [RFC,v2,net-next,10/16] net: dsa: replay VLANs installed on port when joining the bridge [RFC,v2,net-next,11/16] net: ocelot: support multiple bridges [RFC,v2,net-next,12/16] net: ocelot: call ocelot_netdevice_bridge_join when joining a bridged LAG [RFC,v2,net-next,13/16] net: ocelot: replay switchdev events when joining bridge [RFC,v2,net-next,14/16] net: dsa: don't set skb->offload_fwd_mark when not offloading the bridge [RFC,v2,net-next,15/16] net: dsa: return -EOPNOTSUPP when driver does not implement .port_lag_join [RFC,v2,net-next,16/16] net: bridge: switchdev: let drivers inform which bridge ports are offloaded

Vladimir Oltean March 18, 2021, 11:18 p.m. UTC

From: Vladimir Oltean <vladimir.oltean@nxp.com>

This series has two objectives:
- To make LAG uppers on top of DSA ports work regardless of which order
  we link interfaces to their masters (first make the port join the LAG,
  then the LAG join the bridge, or the other way around).
- To make DSA ports support non-offloaded LAG interfaces properly.

There was a design decision to be made in patches 2-4 on whether we
should adopt the "push" model, where the driver just calls:

  switchdev_bridge_port_offloaded(brport_dev,
                                  &atomic_notifier_block,
                                  &blocking_notifier_block,
                                  extack);

and the bridge just replays the entire collection of switchdev port
attributes and objects that it has, in some predefined order and with
some predefined error handling logic;


or the "pull" model, where the driver, apart from calling:

  switchdev_bridge_port_offloaded(brport_dev, extack);

has the task of "dumpster diving" (as Tobias puts it) through the bridge
attributes and objects by itself, by calling:

  - br_vlan_replay
  - br_fdb_replay
  - br_mdb_replay
  - br_vlan_enabled
  - br_port_flag_is_set
  - br_port_get_stp_state
  - br_multicast_router
  - br_get_ageing_time

(not necessarily all of them, and not necessarily in this order, and
with driver-defined error handling).

Even though I'm not in love myself with the "pull" model, I chose it
because there is a fundamental trick with replaying switchdev events
like this:

ip link add br0 type bridge
ip link add bond0 type bond
ip link set bond0 master br0
ip link set swp0 master bond0 <- this will replay the objects once for
                                 the bond0 bridge port, and the swp0
                                 switchdev port will process them
ip link set swp1 master bond0 <- this will replay the objects again for
                                 the bond0 bridge port, and the swp1
                                 switchdev port will see them, but swp0
                                 will see them for the second time now

Basically I believe that it is implementation defined whether the driver
wants to error out on switchdev objects seen twice on a port, and the
bridge should not enforce a certain model for that. For example, for FDB
entries added to a bonding interface, the underling switchdev driver
might have an abstraction for just that: an FDB entry pointing towards a
logical (as opposed to physical) port. So when the second port joins the
bridge, it doesn't realy need to replay FDB entries, since there is
already at least one hardware port which has been receiving those
events, and the FDB entries don't need to be added a second time to the
same logical port.
In the other corner, we have the drivers that handle switchdev port
attributes on a LAG as individual switchdev port attributes on physical
ports (example: VLAN filtering). In fact, the switchdev_handle_port_attr_set
helper facilitates this: it is a fan-out from a single orig_dev towards
multiple lowers that pass the check_cb().
But that's the point: switchdev_handle_port_attr_set is just a helper
which the driver _opts_ to use. The bridge can't enforce the "push"
model, because that would assume that all drivers handle port attributes
in the same way, which is probably false.

For this reason, I preferred to go with the "pull" mode for this patch
set. Just to see how bad it is for other switchdev drivers to copy-paste
this logic, I added the pull support to ocelot too, and I think it's
pretty manageable.

This patch set is RFC because it is minimally tested, and I would like
to get some feedback/agreement regarding the design decisions taken,
before I spend any more time on this.

There are also some things I probably broke, but I couldn't figure any
better. For example, I can't seem to figure out if mlxsw does the right
thing when joining a bonding interface that is already a bridge port.
I think it probably doesn't, so in that case, the placement I found for
the switchdev_bridge_port_offload() probably needs some adjustment when
there exists a LAG upper.

If possible, I would like the maintainers of the switchdev drivers to
tell me if this change introduces any regressions to how packets are
flooded (actually not flooded) in software by the bridge between two
ports belonging to the same ASIC ID.

I should mention that this patch series is written on top of Tobias'
series:
https://patchwork.kernel.org/project/netdevbpf/cover/20210318192540.895062-1-tobias@waldekranz.com/
which should get applied soon.

Vladimir Oltean (16):
  net: dsa: call dsa_port_bridge_join when joining a LAG that is already
    in a bridge
  net: dsa: pass extack to dsa_port_{bridge,lag}_join
  net: dsa: inherit the actual bridge port flags at join time
  net: dsa: sync up with bridge port's STP state when joining
  net: dsa: sync up VLAN filtering state when joining the bridge
  net: dsa: sync multicast router state when joining the bridge
  net: dsa: sync ageing time when joining the bridge
  net: dsa: replay port and host-joined mdb entries when joining the
    bridge
  net: dsa: replay port and local fdb entries when joining the bridge
  net: dsa: replay VLANs installed on port when joining the bridge
  net: ocelot: support multiple bridges
  net: ocelot: call ocelot_netdevice_bridge_join when joining a bridged
    LAG
  net: ocelot: replay switchdev events when joining bridge
  net: dsa: don't set skb->offload_fwd_mark when not offloading the
    bridge
  net: dsa: return -EOPNOTSUPP when driver does not implement
    .port_lag_join
  net: bridge: switchdev: let drivers inform which bridge ports are
    offloaded

 drivers/net/dsa/ocelot/felix.c                |   4 +-
 .../ethernet/freescale/dpaa2/dpaa2-switch.c   |   4 +-
 .../marvell/prestera/prestera_switchdev.c     |   7 +
 .../mellanox/mlxsw/spectrum_switchdev.c       |   4 +-
 drivers/net/ethernet/mscc/ocelot.c            |  90 ++++----
 drivers/net/ethernet/mscc/ocelot_net.c        | 210 +++++++++++++++---
 drivers/net/ethernet/rocker/rocker_ofdpa.c    |   8 +-
 drivers/net/ethernet/ti/am65-cpsw-nuss.c      |   7 +-
 drivers/net/ethernet/ti/cpsw_new.c            |   6 +-
 include/linux/if_bridge.h                     |  56 +++++
 include/net/switchdev.h                       |   1 +
 include/soc/mscc/ocelot.h                     |  13 +-
 net/bridge/br_fdb.c                           |  52 +++++
 net/bridge/br_if.c                            |  11 +-
 net/bridge/br_mdb.c                           |  84 +++++++
 net/bridge/br_private.h                       |   8 +-
 net/bridge/br_stp.c                           |  27 +++
 net/bridge/br_switchdev.c                     |  94 +++++++-
 net/bridge/br_vlan.c                          |  71 ++++++
 net/dsa/dsa_priv.h                            |  23 +-
 net/dsa/port.c                                | 201 +++++++++++++----
 net/dsa/slave.c                               |  11 +-
 net/dsa/switch.c                              |   4 +-
 net/dsa/tag_brcm.c                            |   2 +-
 net/dsa/tag_dsa.c                             |  15 +-
 net/dsa/tag_hellcreek.c                       |   2 +-
 net/dsa/tag_ksz.c                             |   2 +-
 net/dsa/tag_lan9303.c                         |   3 +-
 net/dsa/tag_mtk.c                             |   2 +-
 net/dsa/tag_ocelot.c                          |   2 +-
 net/dsa/tag_ocelot_8021q.c                    |   2 +-
 net/dsa/tag_rtl4_a.c                          |   2 +-
 net/dsa/tag_sja1105.c                         |   4 +-
 net/dsa/tag_xrs700x.c                         |   2 +-
 34 files changed, 845 insertions(+), 189 deletions(-)

Qingfang Deng March 19, 2021, 8:52 a.m. UTC | #1

On Fri, Mar 19, 2021 at 01:18:27AM +0200, Vladimir Oltean wrote:
> From: Vladimir Oltean <vladimir.oltean@nxp.com>

> 

> DSA has gained the recent ability to deal gracefully with upper

> interfaces it cannot offload, such as the bridge, bonding or team

> drivers. When such uppers exist, the ports are still in standalone mode

> as far as the hardware is concerned.

> 

> But when we deliver packets to the software bridge in order for that to

> do the forwarding, there is an unpleasant surprise in that the bridge

> will refuse to forward them. This is because we unconditionally set

> skb->offload_fwd_mark = true, meaning that the bridge thinks the frames

> were already forwarded in hardware by us.

> 

> Since dp->bridge_dev is populated only when there is hardware offload

> for it, but not in the software fallback case, let's introduce a new

> helper that can be called from the tagger data path which sets the

> skb->offload_fwd_mark accordingly to zero when there is no hardware

> offload for bridging. This lets the bridge forward packets back to other

> interfaces of our switch, if needed.

> 

> Without this change, sending a packet to the CPU for an unoffloaded

> interface triggers this WARN_ON:

> 

> void nbp_switchdev_frame_mark(const struct net_bridge_port *p,

> 			      struct sk_buff *skb)

> {

> 	if (skb->offload_fwd_mark && !WARN_ON_ONCE(!p->offload_fwd_mark))

> 		BR_INPUT_SKB_CB(skb)->offload_fwd_mark = p->offload_fwd_mark;

> }

> 

> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>

> Reviewed-by: Tobias Waldekranz <tobias@waldekranz.com>

> ---

>  net/dsa/dsa_priv.h         | 14 ++++++++++++++

>  net/dsa/tag_brcm.c         |  2 +-

>  net/dsa/tag_dsa.c          | 15 +++++++++++----

>  net/dsa/tag_hellcreek.c    |  2 +-

>  net/dsa/tag_ksz.c          |  2 +-

>  net/dsa/tag_lan9303.c      |  3 ++-

>  net/dsa/tag_mtk.c          |  2 +-

>  net/dsa/tag_ocelot.c       |  2 +-

>  net/dsa/tag_ocelot_8021q.c |  2 +-

>  net/dsa/tag_rtl4_a.c       |  2 +-

>  net/dsa/tag_sja1105.c      |  4 ++--

>  net/dsa/tag_xrs700x.c      |  2 +-

>  12 files changed, 37 insertions(+), 15 deletions(-)

> 

> diff --git a/net/dsa/dsa_priv.h b/net/dsa/dsa_priv.h

> index 92282de54230..b61bef79ce84 100644

> --- a/net/dsa/dsa_priv.h

> +++ b/net/dsa/dsa_priv.h

> @@ -349,6 +349,20 @@ static inline struct sk_buff *dsa_untag_bridge_pvid(struct sk_buff *skb)

>  	return skb;

>  }

>  

> +/* If the ingress port offloads the bridge, we mark the frame as autonomously

> + * forwarded by hardware, so the software bridge doesn't forward in twice, back

> + * to us, because we already did. However, if we're in fallback mode and we do

> + * software bridging, we are not offloading it, therefore the dp->bridge_dev

> + * pointer is not populated, and flooding needs to be done by software (we are

> + * effectively operating in standalone ports mode).

> + */

> +static inline void dsa_default_offload_fwd_mark(struct sk_buff *skb)

> +{

> +	struct dsa_port *dp = dsa_slave_to_port(skb->dev);

> +

> +	skb->offload_fwd_mark = !!(dp->bridge_dev);

> +}


So offload_fwd_mark is set iff the ingress port offloads the bridge.
Consider this set up on a switch which does NOT support LAG offload:

        +----- br0 -----+
        |               |
      bond0             |
        |               |         (Linux interfaces)
    +---+---+       +---+---+
    |       |       |       |
+-------+-------+-------+-------+
| sw0p0 | sw0p1 | sw0p2 | sw0p3 |
+-------+-------+-------+-------+
    |       |       |       |
    +---A---+       B       C     (LAN clients)


sw0p0 and sw0p1 should be in standalone mode (offload_fwd_mark = 0),
while sw0p2 and sw0p3 are offloaded (offload_fwd_mark = 1).

When a frame is sent into sw0p2 or sw0p3, can it be forwarded to sw0p0 or
sw0p1?

Setting offload_fwd_mark to 0 could also cause potential packet loss on
switches that perform learning on the CPU port:

When client C is talking to client A, frames from C will:
1. Enter sw0p3, where the switch will learn C is reachable via sw0p3.
2. Be sent to the CPU port and bounced back, where the switch will learn C is
   reachable via the CPU port, overwriting the previous learned FDB entry.
3. Be sent out of either sw0p0 or sw0p1, and reach its destination - A.

During step 2, if client B sends a frame to C, the frame will be forwarded to
the CPU, which will think it is already forwarded by the switch, and refuse to
forward it back, resulting in packet loss.

Many switch TX tags (mtk, qca, rtl) have a bit to disable source address
learning on a per-frame basis. We should utilise that.

> +

>  /* switch.c */

>  int dsa_switch_register_notifier(struct dsa_switch *ds);

>  void dsa_switch_unregister_notifier(struct dsa_switch *ds);

> diff --git a/net/dsa/tag_brcm.c b/net/dsa/tag_brcm.c

> index e2577a7dcbca..a8880b3bb106 100644

> --- a/net/dsa/tag_brcm.c

> +++ b/net/dsa/tag_brcm.c

> @@ -150,7 +150,7 @@ static struct sk_buff *brcm_tag_rcv_ll(struct sk_buff *skb,

>  	/* Remove Broadcom tag and update checksum */

>  	skb_pull_rcsum(skb, BRCM_TAG_LEN);

>  

> -	skb->offload_fwd_mark = 1;

> +	dsa_default_offload_fwd_mark(skb);

>  

>  	return skb;

>  }

> diff --git a/net/dsa/tag_dsa.c b/net/dsa/tag_dsa.c

> index 7e7b7decdf39..09ab9c25e686 100644

> --- a/net/dsa/tag_dsa.c

> +++ b/net/dsa/tag_dsa.c

> @@ -162,8 +162,8 @@ static struct sk_buff *dsa_xmit_ll(struct sk_buff *skb, struct net_device *dev,

>  static struct sk_buff *dsa_rcv_ll(struct sk_buff *skb, struct net_device *dev,

>  				  u8 extra)

>  {

> +	bool trap = false, trunk = false;

>  	int source_device, source_port;

> -	bool trunk = false;

>  	enum dsa_code code;

>  	enum dsa_cmd cmd;

>  	u8 *dsa_header;

> @@ -174,8 +174,6 @@ static struct sk_buff *dsa_rcv_ll(struct sk_buff *skb, struct net_device *dev,

>  	cmd = dsa_header[0] >> 6;

>  	switch (cmd) {

>  	case DSA_CMD_FORWARD:

> -		skb->offload_fwd_mark = 1;

> -

>  		trunk = !!(dsa_header[1] & 7);

>  		break;

>  

> @@ -194,7 +192,6 @@ static struct sk_buff *dsa_rcv_ll(struct sk_buff *skb, struct net_device *dev,

>  			 * device (like a bridge) that forwarding has

>  			 * already been done by hardware.

>  			 */

> -			skb->offload_fwd_mark = 1;

>  			break;

>  		case DSA_CODE_MGMT_TRAP:

>  		case DSA_CODE_IGMP_MLD_TRAP:

> @@ -202,6 +199,7 @@ static struct sk_buff *dsa_rcv_ll(struct sk_buff *skb, struct net_device *dev,

>  			/* Traps have, by definition, not been

>  			 * forwarded by hardware, so don't mark them.

>  			 */

> +			trap = true;

>  			break;

>  		default:

>  			/* Reserved code, this could be anything. Drop

> @@ -235,6 +233,15 @@ static struct sk_buff *dsa_rcv_ll(struct sk_buff *skb, struct net_device *dev,

>  	if (!skb->dev)

>  		return NULL;

>  

> +	/* When using LAG offload, skb->dev is not a DSA slave interface,

> +	 * so we cannot call dsa_default_offload_fwd_mark and we need to

> +	 * special-case it.

> +	 */

> +	if (trunk)

> +		skb->offload_fwd_mark = true;

> +	else if (!trap)

> +		dsa_default_offload_fwd_mark(skb);

> +

>  	/* If the 'tagged' bit is set; convert the DSA tag to a 802.1Q

>  	 * tag, and delete the ethertype (extra) if applicable. If the

>  	 * 'tagged' bit is cleared; delete the DSA tag, and ethertype

> diff --git a/net/dsa/tag_hellcreek.c b/net/dsa/tag_hellcreek.c

> index a09805c8e1ab..c1ee6eefafe4 100644

> --- a/net/dsa/tag_hellcreek.c

> +++ b/net/dsa/tag_hellcreek.c

> @@ -44,7 +44,7 @@ static struct sk_buff *hellcreek_rcv(struct sk_buff *skb,

>  

>  	pskb_trim_rcsum(skb, skb->len - HELLCREEK_TAG_LEN);

>  

> -	skb->offload_fwd_mark = true;

> +	dsa_default_offload_fwd_mark(skb);

>  

>  	return skb;

>  }

> diff --git a/net/dsa/tag_ksz.c b/net/dsa/tag_ksz.c

> index 4820dbcedfa2..8eee63a5b93b 100644

> --- a/net/dsa/tag_ksz.c

> +++ b/net/dsa/tag_ksz.c

> @@ -24,7 +24,7 @@ static struct sk_buff *ksz_common_rcv(struct sk_buff *skb,

>  

>  	pskb_trim_rcsum(skb, skb->len - len);

>  

> -	skb->offload_fwd_mark = true;

> +	dsa_default_offload_fwd_mark(skb);

>  

>  	return skb;

>  }

> diff --git a/net/dsa/tag_lan9303.c b/net/dsa/tag_lan9303.c

> index aa1318dccaf0..3a5494d2f7b1 100644

> --- a/net/dsa/tag_lan9303.c

> +++ b/net/dsa/tag_lan9303.c

> @@ -115,7 +115,8 @@ static struct sk_buff *lan9303_rcv(struct sk_buff *skb, struct net_device *dev,

>  	skb_pull_rcsum(skb, 2 + 2);

>  	memmove(skb->data - ETH_HLEN, skb->data - (ETH_HLEN + LAN9303_TAG_LEN),

>  		2 * ETH_ALEN);

> -	skb->offload_fwd_mark = !(lan9303_tag1 & LAN9303_TAG_RX_TRAPPED_TO_CPU);

> +	if (!(lan9303_tag1 & LAN9303_TAG_RX_TRAPPED_TO_CPU))

> +		dsa_default_offload_fwd_mark(skb);

>  

>  	return skb;

>  }

> diff --git a/net/dsa/tag_mtk.c b/net/dsa/tag_mtk.c

> index f9b2966d1936..92ab21d2ceca 100644

> --- a/net/dsa/tag_mtk.c

> +++ b/net/dsa/tag_mtk.c

> @@ -92,7 +92,7 @@ static struct sk_buff *mtk_tag_rcv(struct sk_buff *skb, struct net_device *dev,

>  	if (!skb->dev)

>  		return NULL;

>  

> -	skb->offload_fwd_mark = 1;

> +	dsa_default_offload_fwd_mark(skb);

>  

>  	return skb;

>  }

> diff --git a/net/dsa/tag_ocelot.c b/net/dsa/tag_ocelot.c

> index f9df9cac81c5..1deba3f1bb82 100644

> --- a/net/dsa/tag_ocelot.c

> +++ b/net/dsa/tag_ocelot.c

> @@ -123,7 +123,7 @@ static struct sk_buff *ocelot_rcv(struct sk_buff *skb,

>  		 */

>  		return NULL;

>  

> -	skb->offload_fwd_mark = 1;

> +	dsa_default_offload_fwd_mark(skb);

>  	skb->priority = qos_class;

>  

>  	/* Ocelot switches copy frames unmodified to the CPU. However, it is

> diff --git a/net/dsa/tag_ocelot_8021q.c b/net/dsa/tag_ocelot_8021q.c

> index 5f3e8e124a82..447e1eeb357c 100644

> --- a/net/dsa/tag_ocelot_8021q.c

> +++ b/net/dsa/tag_ocelot_8021q.c

> @@ -81,7 +81,7 @@ static struct sk_buff *ocelot_rcv(struct sk_buff *skb,

>  	if (!skb->dev)

>  		return NULL;

>  

> -	skb->offload_fwd_mark = 1;

> +	dsa_default_offload_fwd_mark(skb);

>  	skb->priority = qos_class;

>  

>  	return skb;

> diff --git a/net/dsa/tag_rtl4_a.c b/net/dsa/tag_rtl4_a.c

> index e9176475bac8..1864e3a74df8 100644

> --- a/net/dsa/tag_rtl4_a.c

> +++ b/net/dsa/tag_rtl4_a.c

> @@ -114,7 +114,7 @@ static struct sk_buff *rtl4a_tag_rcv(struct sk_buff *skb,

>  		skb->data - ETH_HLEN - RTL4_A_HDR_LEN,

>  		2 * ETH_ALEN);

>  

> -	skb->offload_fwd_mark = 1;

> +	dsa_default_offload_fwd_mark(skb);

>  

>  	return skb;

>  }

> diff --git a/net/dsa/tag_sja1105.c b/net/dsa/tag_sja1105.c

> index 50496013cdb7..45cdf64f0e88 100644

> --- a/net/dsa/tag_sja1105.c

> +++ b/net/dsa/tag_sja1105.c

> @@ -295,8 +295,6 @@ static struct sk_buff *sja1105_rcv(struct sk_buff *skb,

>  	is_link_local = sja1105_is_link_local(skb);

>  	is_meta = sja1105_is_meta_frame(skb);

>  

> -	skb->offload_fwd_mark = 1;

> -

>  	if (is_tagged) {

>  		/* Normal traffic path. */

>  		skb_push_rcsum(skb, ETH_HLEN);

> @@ -339,6 +337,8 @@ static struct sk_buff *sja1105_rcv(struct sk_buff *skb,

>  		return NULL;

>  	}

>  

> +	dsa_default_offload_fwd_mark(skb);

> +

>  	if (subvlan)

>  		sja1105_decode_subvlan(skb, subvlan);

>  

> diff --git a/net/dsa/tag_xrs700x.c b/net/dsa/tag_xrs700x.c

> index 858cdf9d2913..1208549f45c1 100644

> --- a/net/dsa/tag_xrs700x.c

> +++ b/net/dsa/tag_xrs700x.c

> @@ -46,7 +46,7 @@ static struct sk_buff *xrs700x_rcv(struct sk_buff *skb, struct net_device *dev,

>  		return NULL;

>  

>  	/* Frame is forwarded by hardware, don't forward in software. */

> -	skb->offload_fwd_mark = 1;

> +	dsa_default_offload_fwd_mark(skb);

>  

>  	return skb;

>  }

> -- 

> 2.25.1

>

Vladimir Oltean March 19, 2021, 9:06 a.m. UTC | #2

On Fri, Mar 19, 2021 at 04:52:31PM +0800, DENG Qingfang wrote:
> On Fri, Mar 19, 2021 at 01:18:27AM +0200, Vladimir Oltean wrote:

> > From: Vladimir Oltean <vladimir.oltean@nxp.com>

> > 

> > DSA has gained the recent ability to deal gracefully with upper

> > interfaces it cannot offload, such as the bridge, bonding or team

> > drivers. When such uppers exist, the ports are still in standalone mode

> > as far as the hardware is concerned.

> > 

> > But when we deliver packets to the software bridge in order for that to

> > do the forwarding, there is an unpleasant surprise in that the bridge

> > will refuse to forward them. This is because we unconditionally set

> > skb->offload_fwd_mark = true, meaning that the bridge thinks the frames

> > were already forwarded in hardware by us.

> > 

> > Since dp->bridge_dev is populated only when there is hardware offload

> > for it, but not in the software fallback case, let's introduce a new

> > helper that can be called from the tagger data path which sets the

> > skb->offload_fwd_mark accordingly to zero when there is no hardware

> > offload for bridging. This lets the bridge forward packets back to other

> > interfaces of our switch, if needed.

> > 

> > Without this change, sending a packet to the CPU for an unoffloaded

> > interface triggers this WARN_ON:

> > 

> > void nbp_switchdev_frame_mark(const struct net_bridge_port *p,

> > 			      struct sk_buff *skb)

> > {

> > 	if (skb->offload_fwd_mark && !WARN_ON_ONCE(!p->offload_fwd_mark))

> > 		BR_INPUT_SKB_CB(skb)->offload_fwd_mark = p->offload_fwd_mark;

> > }

> > 

> > Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>

> > Reviewed-by: Tobias Waldekranz <tobias@waldekranz.com>

> > ---

> >  net/dsa/dsa_priv.h         | 14 ++++++++++++++

> >  net/dsa/tag_brcm.c         |  2 +-

> >  net/dsa/tag_dsa.c          | 15 +++++++++++----

> >  net/dsa/tag_hellcreek.c    |  2 +-

> >  net/dsa/tag_ksz.c          |  2 +-

> >  net/dsa/tag_lan9303.c      |  3 ++-

> >  net/dsa/tag_mtk.c          |  2 +-

> >  net/dsa/tag_ocelot.c       |  2 +-

> >  net/dsa/tag_ocelot_8021q.c |  2 +-

> >  net/dsa/tag_rtl4_a.c       |  2 +-

> >  net/dsa/tag_sja1105.c      |  4 ++--

> >  net/dsa/tag_xrs700x.c      |  2 +-

> >  12 files changed, 37 insertions(+), 15 deletions(-)

> > 

> > diff --git a/net/dsa/dsa_priv.h b/net/dsa/dsa_priv.h

> > index 92282de54230..b61bef79ce84 100644

> > --- a/net/dsa/dsa_priv.h

> > +++ b/net/dsa/dsa_priv.h

> > @@ -349,6 +349,20 @@ static inline struct sk_buff *dsa_untag_bridge_pvid(struct sk_buff *skb)

> >  	return skb;

> >  }

> >  

> > +/* If the ingress port offloads the bridge, we mark the frame as autonomously

> > + * forwarded by hardware, so the software bridge doesn't forward in twice, back

> > + * to us, because we already did. However, if we're in fallback mode and we do

> > + * software bridging, we are not offloading it, therefore the dp->bridge_dev

> > + * pointer is not populated, and flooding needs to be done by software (we are

> > + * effectively operating in standalone ports mode).

> > + */

> > +static inline void dsa_default_offload_fwd_mark(struct sk_buff *skb)

> > +{

> > +	struct dsa_port *dp = dsa_slave_to_port(skb->dev);

> > +

> > +	skb->offload_fwd_mark = !!(dp->bridge_dev);

> > +}

> 

> So offload_fwd_mark is set iff the ingress port offloads the bridge.

> Consider this set up on a switch which does NOT support LAG offload:

> 

>         +----- br0 -----+

>         |               |

>       bond0             |

>         |               |         (Linux interfaces)

>     +---+---+       +---+---+

>     |       |       |       |

> +-------+-------+-------+-------+

> | sw0p0 | sw0p1 | sw0p2 | sw0p3 |

> +-------+-------+-------+-------+

>     |       |       |       |

>     +---A---+       B       C     (LAN clients)

> 

> 

> sw0p0 and sw0p1 should be in standalone mode (offload_fwd_mark = 0),

> while sw0p2 and sw0p3 are offloaded (offload_fwd_mark = 1).

> 

> When a frame is sent into sw0p2 or sw0p3, can it be forwarded to sw0p0 or

> sw0p1?


bool nbp_switchdev_allowed_egress(const struct net_bridge_port *p,
				  const struct sk_buff *skb)
{
	return !skb->offload_fwd_mark ||
	       BR_INPUT_SKB_CB(skb)->offload_fwd_mark != p->offload_fwd_mark;
}

where p->offload_fwd_mark is the mark of the egress port, and
BR_INPUT_SKB_CB(skb) is the mark of the ingress port, assigned here:

void nbp_switchdev_frame_mark(const struct net_bridge_port *p,
			      struct sk_buff *skb)
{
	if (skb->offload_fwd_mark && !WARN_ON_ONCE(!p->offload_fwd_mark))
		BR_INPUT_SKB_CB(skb)->offload_fwd_mark = p->offload_fwd_mark;
}

Basically, sw0p0 and sw0p1 have a switchdev mark of 0, and sw0p2 and
sw0p3 have a non-zero switchdev mark, so nbp_switchdev_allowed_egress
returns true in both directions, regardless of the value of
skb->offload_fwd_mark.

> Setting offload_fwd_mark to 0 could also cause potential packet loss on

> switches that perform learning on the CPU port:

> 

> When client C is talking to client A, frames from C will:

> 1. Enter sw0p3, where the switch will learn C is reachable via sw0p3.

> 2. Be sent to the CPU port and bounced back, where the switch will learn C is

>    reachable via the CPU port, overwriting the previous learned FDB entry.

> 3. Be sent out of either sw0p0 or sw0p1, and reach its destination - A.

> 

> During step 2, if client B sends a frame to C, the frame will be forwarded to

> the CPU, which will think it is already forwarded by the switch, and refuse to

> forward it back, resulting in packet loss.

> 

> Many switch TX tags (mtk, qca, rtl) have a bit to disable source address

> learning on a per-frame basis. We should utilise that.


This is a good point actually, which I thought about, but did not give a
lot of importance to for the moment. Either we go full steam ahead with
assisted learning on the CPU port for everybody, and we selectively
learn the addresses relevant to the bridging funciton only, or we do
what you say, but then it will be a little bit more complicated IMO, and
have hardware dependencies, which isn't as nice.

Qingfang Deng March 19, 2021, 9:29 a.m. UTC | #3

On Fri, Mar 19, 2021 at 5:06 PM Vladimir Oltean <olteanv@gmail.com> wrote:
>
> This is a good point actually, which I thought about, but did not give a
> lot of importance to for the moment. Either we go full steam ahead with
> assisted learning on the CPU port for everybody, and we selectively
> learn the addresses relevant to the bridging funciton only, or we do
> what you say, but then it will be a little bit more complicated IMO, and
> have hardware dependencies, which isn't as nice.

Are skb->offload_fwd_mark and source DSA switch kept in dsa_slave_xmit?
I think SA learning should be bypassed iff skb->offload_fwd_mark == 1 and
source DSA switch == destination DSA switch.

Vladimir Oltean March 19, 2021, 10:49 a.m. UTC | #4

On Fri, Mar 19, 2021 at 05:29:12PM +0800, DENG Qingfang wrote:
> On Fri, Mar 19, 2021 at 5:06 PM Vladimir Oltean <olteanv@gmail.com> wrote:
> >
> > This is a good point actually, which I thought about, but did not give a
> > lot of importance to for the moment. Either we go full steam ahead with
> > assisted learning on the CPU port for everybody, and we selectively
> > learn the addresses relevant to the bridging funciton only, or we do
> > what you say, but then it will be a little bit more complicated IMO, and
> > have hardware dependencies, which isn't as nice.
> 
> Are skb->offload_fwd_mark and source DSA switch kept in dsa_slave_xmit?
> I think SA learning should be bypassed iff skb->offload_fwd_mark == 1 and
> source DSA switch == destination DSA switch.

Why would you even want to look at the source net device for forwarding?
I'd say that if dp->bridge_dev is NULL in the xmit function, you certainly
want to bypass address learning if you can. Maybe also for link-local traffic.

Florian Fainelli March 19, 2021, 10:04 p.m. UTC | #5

On 3/18/2021 4:18 PM, Vladimir Oltean wrote:
> From: Vladimir Oltean <vladimir.oltean@nxp.com>
> 
> DSA can properly detect and offload this sequence of operations:
> 
> ip link add br0 type bridge
> ip link add bond0 type bond
> ip link set swp0 master bond0
> ip link set bond0 master br0
> 
> But not this one:
> 
> ip link add br0 type bridge
> ip link add bond0 type bond
> ip link set bond0 master br0
> ip link set swp0 master bond0
> 
> Actually the second one is more complicated, due to the elapsed time
> between the enslavement of bond0 and the offloading of it via swp0, a
> lot of things could have happened to the bond0 bridge port in terms of
> switchdev objects (host MDBs, VLANs, altered STP state etc). So this is
> a bit of a can of worms, and making sure that the DSA port's state is in
> sync with this already existing bridge port is handled in the next
> patches.
> 
> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>

Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>

Florian Fainelli March 19, 2021, 10:05 p.m. UTC | #6

On 3/18/2021 4:18 PM, Vladimir Oltean wrote:
> From: Vladimir Oltean <vladimir.oltean@nxp.com>

> 

> This is a pretty noisy change that was broken out of the larger change

> for replaying switchdev attributes and objects at bridge join time,

> which is when these extack objects are actually used.

> 

> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>


Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>

-- 
Florian

Florian Fainelli March 19, 2021, 10:11 p.m. UTC | #7

On 3/18/2021 4:18 PM, Vladimir Oltean wrote:
> From: Vladimir Oltean <vladimir.oltean@nxp.com>

> 

> This is the same situation as for other switchdev port attributes: if we

> join an already-created bridge port, such as a bond master interface,

> then we can miss the initial switchdev notification emitted by the

> bridge for this port.

> 

> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>


Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>

-- 
Florian

Florian Fainelli March 19, 2021, 10:20 p.m. UTC | #8

On 3/18/2021 4:18 PM, Vladimir Oltean wrote:
> From: Vladimir Oltean <vladimir.oltean@nxp.com>

> 

> I have udhcpcd in my system and this is configured to bring interfaces

> up as soon as they are created.

> 

> I create a bridge as follows:

> 

> ip link add br0 type bridge

> 

> As soon as I create the bridge and udhcpcd brings it up, I have some

> other crap (avahi)


How dare you ;)

 that starts sending some random IPv6 packets to
> advertise some local services, and from there, the br0 bridge joins the

> following IPv6 groups:

> 

> 33:33:ff:6d:c1:9c vid 0

> 33:33:00:00:00:6a vid 0

> 33:33:00:00:00:fb vid 0

> 

> br_dev_xmit

> -> br_multicast_rcv

>    -> br_ip6_multicast_add_group

>       -> __br_multicast_add_group

>          -> br_multicast_host_join

>             -> br_mdb_notify

> 

> This is all fine, but inside br_mdb_notify we have br_mdb_switchdev_host

> hooked up, and switchdev will attempt to offload the host joined groups

> to an empty list of ports. Of course nobody offloads them.

> 

> Then when we add a port to br0:

> 

> ip link set swp0 master br0

> 

> the bridge doesn't replay the host-joined MDB entries from br_add_if,

> and eventually the host joined addresses expire, and a switchdev

> notification for deleting it is emitted, but surprise, the original

> addition was already completely missed.

> 

> The strategy to address this problem is to replay the MDB entries (both

> the port ones and the host joined ones) when the new port joins the

> bridge, similar to what vxlan_fdb_replay does (in that case, its FDB can

> be populated and only then attached to a bridge that you offload).

> However there are 2 possibilities: the addresses can be 'pushed' by the

> bridge into the port, or the port can 'pull' them from the bridge.

> 

> Considering that in the general case, the new port can be really late to

> the party, and there may have been many other switchdev ports that

> already received the initial notification, we would like to avoid

> delivering duplicate events to them, since they might misbehave. And

> currently, the bridge calls the entire switchdev notifier chain, whereas

> for replaying it should just call the notifier block of the new guy.

> But the bridge doesn't know what is the new guy's notifier block, it

> just knows where the switchdev notifier chain is. So for simplification,

> we make this a driver-initiated pull for now, and the notifier block is

> passed as an argument.

> 

> To emulate the calling context for mdb objects (deferred and put on the

> blocking notifier chain), we must iterate under RCU protection through

> the bridge's mdb entries, queue them, and only call them once we're out

> of the RCU read-side critical section.

> 

> Suggested-by: Ido Schimmel <idosch@idosch.org>

> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>

> ---

>  include/linux/if_bridge.h |  9 +++++

>  net/bridge/br_mdb.c       | 84 +++++++++++++++++++++++++++++++++++++++

>  net/dsa/dsa_priv.h        |  2 +

>  net/dsa/port.c            |  6 +++

>  net/dsa/slave.c           |  2 +-

>  5 files changed, 102 insertions(+), 1 deletion(-)

> 

> diff --git a/include/linux/if_bridge.h b/include/linux/if_bridge.h

> index ebd16495459c..4c25dafb013d 100644

> --- a/include/linux/if_bridge.h

> +++ b/include/linux/if_bridge.h

> @@ -69,6 +69,8 @@ bool br_multicast_has_querier_anywhere(struct net_device *dev, int proto);

>  bool br_multicast_has_querier_adjacent(struct net_device *dev, int proto);

>  bool br_multicast_enabled(const struct net_device *dev);

>  bool br_multicast_router(const struct net_device *dev);

> +int br_mdb_replay(struct net_device *br_dev, struct net_device *dev,

> +		  struct notifier_block *nb, struct netlink_ext_ack *extack);

>  #else

>  static inline int br_multicast_list_adjacent(struct net_device *dev,

>  					     struct list_head *br_ip_list)

> @@ -93,6 +95,13 @@ static inline bool br_multicast_router(const struct net_device *dev)

>  {

>  	return false;

>  }

> +static inline int br_mdb_replay(struct net_device *br_dev,

> +				struct net_device *dev,

> +				struct notifier_block *nb,

> +				struct netlink_ext_ack *extack)

> +{

> +	return -EINVAL;


Should we return -EOPNOTUSPP such that this is not made fatal for DSA if
someone compiles its kernel with CONFIG_BRIDGE_IGMP_SNOOPING disabled?

> +}

>  #endif

>  

>  #if IS_ENABLED(CONFIG_BRIDGE) && IS_ENABLED(CONFIG_BRIDGE_VLAN_FILTERING)

> diff --git a/net/bridge/br_mdb.c b/net/bridge/br_mdb.c

> index 8846c5bcd075..23973186094c 100644

> --- a/net/bridge/br_mdb.c

> +++ b/net/bridge/br_mdb.c

> @@ -506,6 +506,90 @@ static void br_mdb_complete(struct net_device *dev, int err, void *priv)

>  	kfree(priv);

>  }

>  

> +static int br_mdb_replay_one(struct notifier_block *nb, struct net_device *dev,

> +			     struct net_bridge_mdb_entry *mp, int obj_id,

> +			     struct net_device *orig_dev,

> +			     struct netlink_ext_ack *extack)

> +{

> +	struct switchdev_notifier_port_obj_info obj_info = {

> +		.info = {

> +			.dev = dev,

> +			.extack = extack,

> +		},

> +	};

> +	struct switchdev_obj_port_mdb mdb = {

> +		.obj = {

> +			.orig_dev = orig_dev,

> +			.id = obj_id,

> +		},

> +		.vid = mp->addr.vid,

> +	};

> +	int err;

> +

> +	if (mp->addr.proto == htons(ETH_P_IP))

> +		ip_eth_mc_map(mp->addr.dst.ip4, mdb.addr);

> +#if IS_ENABLED(CONFIG_IPV6)

> +	else if (mp->addr.proto == htons(ETH_P_IPV6))

> +		ipv6_eth_mc_map(&mp->addr.dst.ip6, mdb.addr);

> +#endif

> +	else

> +		ether_addr_copy(mdb.addr, mp->addr.dst.mac_addr);


How you would feel about re-using br_mdb_switchdev_host_port() here and
pass a 'type' value that is neither RTM_NEWDB nor RTM_DELDB just so you
don't have to duplicate that code here and we ensure it is in sync?
-- 
Florian

Florian Fainelli March 19, 2021, 10:24 p.m. UTC | #9

On 3/18/2021 4:18 PM, Vladimir Oltean wrote:
> From: Vladimir Oltean <vladimir.oltean@nxp.com>

> 

> Currently this simple setup:

> 

> ip link add br0 type bridge vlan_filtering 1

> ip link add bond0 type bond

> ip link set bond0 master br0

> ip link set swp0 master bond0

> 

> will not work because the bridge has created the PVID in br_add_if ->

> nbp_vlan_init, and it has notified switchdev of the existence of VLAN 1,

> but that was too early, since swp0 was not yet a lower of bond0, so it

> had no reason to act upon that notification.

> 

> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>

> ---

>  include/linux/if_bridge.h | 10 ++++++

>  net/bridge/br_vlan.c      | 71 +++++++++++++++++++++++++++++++++++++++

>  net/dsa/port.c            |  6 ++++

>  3 files changed, 87 insertions(+)

> 

> diff --git a/include/linux/if_bridge.h b/include/linux/if_bridge.h

> index 89596134e88f..ea176c508c0d 100644

> --- a/include/linux/if_bridge.h

> +++ b/include/linux/if_bridge.h

> @@ -111,6 +111,8 @@ int br_vlan_get_pvid_rcu(const struct net_device *dev, u16 *p_pvid);

>  int br_vlan_get_proto(const struct net_device *dev, u16 *p_proto);

>  int br_vlan_get_info(const struct net_device *dev, u16 vid,

>  		     struct bridge_vlan_info *p_vinfo);

> +int br_vlan_replay(struct net_device *br_dev, struct net_device *dev,

> +		   struct notifier_block *nb, struct netlink_ext_ack *extack);

>  #else

>  static inline bool br_vlan_enabled(const struct net_device *dev)

>  {

> @@ -137,6 +139,14 @@ static inline int br_vlan_get_info(const struct net_device *dev, u16 vid,

>  {

>  	return -EINVAL;

>  }

> +

> +static inline int br_vlan_replay(struct net_device *br_dev,

> +				 struct net_device *dev,

> +				 struct notifier_block *nb,

> +				 struct netlink_ext_ack *extack)

> +{

> +	return -EINVAL;


Same comment as patch 8, CONFIG_BRIDGE_VLAN_FILTERING can be turned off
even if this does not really make practical sense with a hardware
switch. Should we return -EOPNOTSUPP instead?
-- 
Florian

Vladimir Oltean March 20, 2021, 9:53 a.m. UTC | #10

On Fri, Mar 19, 2021 at 03:20:38PM -0700, Florian Fainelli wrote:
>

>

> On 3/18/2021 4:18 PM, Vladimir Oltean wrote:

> > From: Vladimir Oltean <vladimir.oltean@nxp.com>

> >

> > I have udhcpcd in my system and this is configured to bring interfaces

> > up as soon as they are created.

> >

> > I create a bridge as follows:

> >

> > ip link add br0 type bridge

> >

> > As soon as I create the bridge and udhcpcd brings it up, I have some

> > other crap (avahi)

>

> How dare you ;)


Well, it comes preinstalled on my system, I don't need it, and it has
caused me nothing but trouble. So I think it has earned its title :D

> > that starts sending some random IPv6 packets to

> > advertise some local services, and from there, the br0 bridge joins the

> > following IPv6 groups:

> >

> > 33:33:ff:6d:c1:9c vid 0

> > 33:33:00:00:00:6a vid 0

> > 33:33:00:00:00:fb vid 0

> >

> > br_dev_xmit

> > -> br_multicast_rcv

> >    -> br_ip6_multicast_add_group

> >       -> __br_multicast_add_group

> >          -> br_multicast_host_join

> >             -> br_mdb_notify

> >

> > This is all fine, but inside br_mdb_notify we have br_mdb_switchdev_host

> > hooked up, and switchdev will attempt to offload the host joined groups

> > to an empty list of ports. Of course nobody offloads them.

> >

> > Then when we add a port to br0:

> >

> > ip link set swp0 master br0

> >

> > the bridge doesn't replay the host-joined MDB entries from br_add_if,

> > and eventually the host joined addresses expire, and a switchdev

> > notification for deleting it is emitted, but surprise, the original

> > addition was already completely missed.

> >

> > The strategy to address this problem is to replay the MDB entries (both

> > the port ones and the host joined ones) when the new port joins the

> > bridge, similar to what vxlan_fdb_replay does (in that case, its FDB can

> > be populated and only then attached to a bridge that you offload).

> > However there are 2 possibilities: the addresses can be 'pushed' by the

> > bridge into the port, or the port can 'pull' them from the bridge.

> >

> > Considering that in the general case, the new port can be really late to

> > the party, and there may have been many other switchdev ports that

> > already received the initial notification, we would like to avoid

> > delivering duplicate events to them, since they might misbehave. And

> > currently, the bridge calls the entire switchdev notifier chain, whereas

> > for replaying it should just call the notifier block of the new guy.

> > But the bridge doesn't know what is the new guy's notifier block, it

> > just knows where the switchdev notifier chain is. So for simplification,

> > we make this a driver-initiated pull for now, and the notifier block is

> > passed as an argument.

> >

> > To emulate the calling context for mdb objects (deferred and put on the

> > blocking notifier chain), we must iterate under RCU protection through

> > the bridge's mdb entries, queue them, and only call them once we're out

> > of the RCU read-side critical section.

> >

> > Suggested-by: Ido Schimmel <idosch@idosch.org>

> > Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>

> > ---

> >  include/linux/if_bridge.h |  9 +++++

> >  net/bridge/br_mdb.c       | 84 +++++++++++++++++++++++++++++++++++++++

> >  net/dsa/dsa_priv.h        |  2 +

> >  net/dsa/port.c            |  6 +++

> >  net/dsa/slave.c           |  2 +-

> >  5 files changed, 102 insertions(+), 1 deletion(-)

> >

> > diff --git a/include/linux/if_bridge.h b/include/linux/if_bridge.h

> > index ebd16495459c..4c25dafb013d 100644

> > --- a/include/linux/if_bridge.h

> > +++ b/include/linux/if_bridge.h

> > @@ -69,6 +69,8 @@ bool br_multicast_has_querier_anywhere(struct net_device *dev, int proto);

> >  bool br_multicast_has_querier_adjacent(struct net_device *dev, int proto);

> >  bool br_multicast_enabled(const struct net_device *dev);

> >  bool br_multicast_router(const struct net_device *dev);

> > +int br_mdb_replay(struct net_device *br_dev, struct net_device *dev,

> > +		  struct notifier_block *nb, struct netlink_ext_ack *extack);

> >  #else

> >  static inline int br_multicast_list_adjacent(struct net_device *dev,

> >  					     struct list_head *br_ip_list)

> > @@ -93,6 +95,13 @@ static inline bool br_multicast_router(const struct net_device *dev)

> >  {

> >  	return false;

> >  }

> > +static inline int br_mdb_replay(struct net_device *br_dev,

> > +				struct net_device *dev,

> > +				struct notifier_block *nb,

> > +				struct netlink_ext_ack *extack)

> > +{

> > +	return -EINVAL;

>

> Should we return -EOPNOTUSPP such that this is not made fatal for DSA if

> someone compiles its kernel with CONFIG_BRIDGE_IGMP_SNOOPING disabled?


Sure, I'll change the return values of the shims everywhere.

> > +}

> >  #endif

> >

> >  #if IS_ENABLED(CONFIG_BRIDGE) && IS_ENABLED(CONFIG_BRIDGE_VLAN_FILTERING)

> > diff --git a/net/bridge/br_mdb.c b/net/bridge/br_mdb.c

> > index 8846c5bcd075..23973186094c 100644

> > --- a/net/bridge/br_mdb.c

> > +++ b/net/bridge/br_mdb.c

> > @@ -506,6 +506,90 @@ static void br_mdb_complete(struct net_device *dev, int err, void *priv)

> >  	kfree(priv);

> >  }

> >

> > +static int br_mdb_replay_one(struct notifier_block *nb, struct net_device *dev,

> > +			     struct net_bridge_mdb_entry *mp, int obj_id,

> > +			     struct net_device *orig_dev,

> > +			     struct netlink_ext_ack *extack)

> > +{

> > +	struct switchdev_notifier_port_obj_info obj_info = {

> > +		.info = {

> > +			.dev = dev,

> > +			.extack = extack,

> > +		},

> > +	};

> > +	struct switchdev_obj_port_mdb mdb = {

> > +		.obj = {

> > +			.orig_dev = orig_dev,

> > +			.id = obj_id,

> > +		},

> > +		.vid = mp->addr.vid,

> > +	};

> > +	int err;

> > +

> > +	if (mp->addr.proto == htons(ETH_P_IP))

> > +		ip_eth_mc_map(mp->addr.dst.ip4, mdb.addr);

> > +#if IS_ENABLED(CONFIG_IPV6)

> > +	else if (mp->addr.proto == htons(ETH_P_IPV6))

> > +		ipv6_eth_mc_map(&mp->addr.dst.ip6, mdb.addr);

> > +#endif

> > +	else

> > +		ether_addr_copy(mdb.addr, mp->addr.dst.mac_addr);

>

> How you would feel about re-using br_mdb_switchdev_host_port() here and

> pass a 'type' value that is neither RTM_NEWDB nor RTM_DELDB just so you

> don't have to duplicate that code here and we ensure it is in sync?


The trouble is that br_mdb_switchdev_host calls switchdev_port_obj_add,
and I think the agreement was that replayed events should be a silent,
one-to-one conversation via a direct call to the notifier block of the
interested driver, as opposed to a call to the entire notifier chain
which would make everybody else in the system see duplicates. This is
the reason why I duplicated mostly everything.

Qingfang Deng March 22, 2021, 8:04 a.m. UTC | #11

On Fri, Mar 19, 2021 at 6:49 PM Vladimir Oltean <olteanv@gmail.com> wrote:
> Why would you even want to look at the source net device for forwarding?

> I'd say that if dp->bridge_dev is NULL in the xmit function, you certainly

> want to bypass address learning if you can. Maybe also for link-local traffic.


Also for trapped traffic (snooping, tc-flower trap action) if the CPU
sends them back.

Florian Fainelli March 22, 2021, 3:56 p.m. UTC | #12

On 3/20/2021 2:53 AM, Vladimir Oltean wrote:
> On Fri, Mar 19, 2021 at 03:20:38PM -0700, Florian Fainelli wrote:

>>

>>

>> On 3/18/2021 4:18 PM, Vladimir Oltean wrote:

>>> From: Vladimir Oltean <vladimir.oltean@nxp.com>

>>>

>>> I have udhcpcd in my system and this is configured to bring interfaces

>>> up as soon as they are created.

>>>

>>> I create a bridge as follows:

>>>

>>> ip link add br0 type bridge

>>>

>>> As soon as I create the bridge and udhcpcd brings it up, I have some

>>> other crap (avahi)

>>

>> How dare you ;)

> 

> Well, it comes preinstalled on my system, I don't need it, and it has

> caused me nothing but trouble. So I think it has earned its title :D

> 

>>> that starts sending some random IPv6 packets to

>>> advertise some local services, and from there, the br0 bridge joins the

>>> following IPv6 groups:

>>>

>>> 33:33:ff:6d:c1:9c vid 0

>>> 33:33:00:00:00:6a vid 0

>>> 33:33:00:00:00:fb vid 0

>>>

>>> br_dev_xmit

>>> -> br_multicast_rcv

>>>    -> br_ip6_multicast_add_group

>>>       -> __br_multicast_add_group

>>>          -> br_multicast_host_join

>>>             -> br_mdb_notify

>>>

>>> This is all fine, but inside br_mdb_notify we have br_mdb_switchdev_host

>>> hooked up, and switchdev will attempt to offload the host joined groups

>>> to an empty list of ports. Of course nobody offloads them.

>>>

>>> Then when we add a port to br0:

>>>

>>> ip link set swp0 master br0

>>>

>>> the bridge doesn't replay the host-joined MDB entries from br_add_if,

>>> and eventually the host joined addresses expire, and a switchdev

>>> notification for deleting it is emitted, but surprise, the original

>>> addition was already completely missed.

>>>

>>> The strategy to address this problem is to replay the MDB entries (both

>>> the port ones and the host joined ones) when the new port joins the

>>> bridge, similar to what vxlan_fdb_replay does (in that case, its FDB can

>>> be populated and only then attached to a bridge that you offload).

>>> However there are 2 possibilities: the addresses can be 'pushed' by the

>>> bridge into the port, or the port can 'pull' them from the bridge.

>>>

>>> Considering that in the general case, the new port can be really late to

>>> the party, and there may have been many other switchdev ports that

>>> already received the initial notification, we would like to avoid

>>> delivering duplicate events to them, since they might misbehave. And

>>> currently, the bridge calls the entire switchdev notifier chain, whereas

>>> for replaying it should just call the notifier block of the new guy.

>>> But the bridge doesn't know what is the new guy's notifier block, it

>>> just knows where the switchdev notifier chain is. So for simplification,

>>> we make this a driver-initiated pull for now, and the notifier block is

>>> passed as an argument.

>>>

>>> To emulate the calling context for mdb objects (deferred and put on the

>>> blocking notifier chain), we must iterate under RCU protection through

>>> the bridge's mdb entries, queue them, and only call them once we're out

>>> of the RCU read-side critical section.

>>>

>>> Suggested-by: Ido Schimmel <idosch@idosch.org>

>>> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>

>>> ---

>>>  include/linux/if_bridge.h |  9 +++++

>>>  net/bridge/br_mdb.c       | 84 +++++++++++++++++++++++++++++++++++++++

>>>  net/dsa/dsa_priv.h        |  2 +

>>>  net/dsa/port.c            |  6 +++

>>>  net/dsa/slave.c           |  2 +-

>>>  5 files changed, 102 insertions(+), 1 deletion(-)

>>>

>>> diff --git a/include/linux/if_bridge.h b/include/linux/if_bridge.h

>>> index ebd16495459c..4c25dafb013d 100644

>>> --- a/include/linux/if_bridge.h

>>> +++ b/include/linux/if_bridge.h

>>> @@ -69,6 +69,8 @@ bool br_multicast_has_querier_anywhere(struct net_device *dev, int proto);

>>>  bool br_multicast_has_querier_adjacent(struct net_device *dev, int proto);

>>>  bool br_multicast_enabled(const struct net_device *dev);

>>>  bool br_multicast_router(const struct net_device *dev);

>>> +int br_mdb_replay(struct net_device *br_dev, struct net_device *dev,

>>> +		  struct notifier_block *nb, struct netlink_ext_ack *extack);

>>>  #else

>>>  static inline int br_multicast_list_adjacent(struct net_device *dev,

>>>  					     struct list_head *br_ip_list)

>>> @@ -93,6 +95,13 @@ static inline bool br_multicast_router(const struct net_device *dev)

>>>  {

>>>  	return false;

>>>  }

>>> +static inline int br_mdb_replay(struct net_device *br_dev,

>>> +				struct net_device *dev,

>>> +				struct notifier_block *nb,

>>> +				struct netlink_ext_ack *extack)

>>> +{

>>> +	return -EINVAL;

>>

>> Should we return -EOPNOTUSPP such that this is not made fatal for DSA if

>> someone compiles its kernel with CONFIG_BRIDGE_IGMP_SNOOPING disabled?

> 

> Sure, I'll change the return values of the shims everywhere.

> 

>>> +}

>>>  #endif

>>>

>>>  #if IS_ENABLED(CONFIG_BRIDGE) && IS_ENABLED(CONFIG_BRIDGE_VLAN_FILTERING)

>>> diff --git a/net/bridge/br_mdb.c b/net/bridge/br_mdb.c

>>> index 8846c5bcd075..23973186094c 100644

>>> --- a/net/bridge/br_mdb.c

>>> +++ b/net/bridge/br_mdb.c

>>> @@ -506,6 +506,90 @@ static void br_mdb_complete(struct net_device *dev, int err, void *priv)

>>>  	kfree(priv);

>>>  }

>>>

>>> +static int br_mdb_replay_one(struct notifier_block *nb, struct net_device *dev,

>>> +			     struct net_bridge_mdb_entry *mp, int obj_id,

>>> +			     struct net_device *orig_dev,

>>> +			     struct netlink_ext_ack *extack)

>>> +{

>>> +	struct switchdev_notifier_port_obj_info obj_info = {

>>> +		.info = {

>>> +			.dev = dev,

>>> +			.extack = extack,

>>> +		},

>>> +	};

>>> +	struct switchdev_obj_port_mdb mdb = {

>>> +		.obj = {

>>> +			.orig_dev = orig_dev,

>>> +			.id = obj_id,

>>> +		},

>>> +		.vid = mp->addr.vid,

>>> +	};

>>> +	int err;

>>> +

>>> +	if (mp->addr.proto == htons(ETH_P_IP))

>>> +		ip_eth_mc_map(mp->addr.dst.ip4, mdb.addr);

>>> +#if IS_ENABLED(CONFIG_IPV6)

>>> +	else if (mp->addr.proto == htons(ETH_P_IPV6))

>>> +		ipv6_eth_mc_map(&mp->addr.dst.ip6, mdb.addr);

>>> +#endif

>>> +	else

>>> +		ether_addr_copy(mdb.addr, mp->addr.dst.mac_addr);

>>

>> How you would feel about re-using br_mdb_switchdev_host_port() here and

>> pass a 'type' value that is neither RTM_NEWDB nor RTM_DELDB just so you

>> don't have to duplicate that code here and we ensure it is in sync?

> 

> The trouble is that br_mdb_switchdev_host calls switchdev_port_obj_add,

> and I think the agreement was that replayed events should be a silent,

> one-to-one conversation via a direct call to the notifier block of the

> interested driver, as opposed to a call to the entire notifier chain

> which would make everybody else in the system see duplicates. This is

> the reason why I duplicated mostly everything.


It's not a whole lot of notification but if you passed a type argument
that is neither of the two supported value (say -1),
br_mdb_switchdev_host_port() would end its execution there, and that
would avoid the duplication altogether. I am not stuck on that idea and
can hardly think for now of why this function would change, or why the
switchdev_obj_port_mdb would change, too.
-- 
Florian

Vladimir Oltean March 22, 2021, 10:23 p.m. UTC | #13

On Mon, Mar 22, 2021 at 04:04:01PM +0800, DENG Qingfang wrote:
> On Fri, Mar 19, 2021 at 6:49 PM Vladimir Oltean <olteanv@gmail.com> wrote:

> > Why would you even want to look at the source net device for forwarding?

> > I'd say that if dp->bridge_dev is NULL in the xmit function, you certainly

> > want to bypass address learning if you can. Maybe also for link-local traffic.

> 

> Also for trapped traffic (snooping, tc-flower trap action) if the CPU

> sends them back.


This sounds line an interesting use case, please tell me more about what
commands I could run to reinject trapped packets into the hardware data
path.

[RFC,v2,net-next,00/16] Better support for sandwiched LAGs with bridge and DSA

Message

Comments