
[v2,net-next,0/5] Make SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE blocking

Message ID 20210819160723.2186424-1-vladimir.oltean@nxp.com

Message

Vladimir Oltean Aug. 19, 2021, 4:07 p.m. UTC
Problem statement:

Any time a driver needs to create a private association with a bridge
upper interface and to use that association within its
SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE handler, we have an issue with FDB
entries deleted by the bridge when the port leaves. The issue is that
all switchdev drivers schedule a work item to have sleepable context,
and that work item can actually be scheduled after the port has left
the bridge, which means the association might have already been broken
by the time the scheduled FDB work item attempts to use it.

The solution is to modify switchdev to use its embedded SWITCHDEV_F_DEFER
mechanism to make the FDB notifiers emitted from the fastpath be
scheduled in sleepable context. All drivers are converted to handle
SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE from their blocking notifier block
handler (or register a blocking switchdev notifier handler if they
didn't have one). This solves the aforementioned problem because the
bridge waits for the switchdev deferred work items to finish before a
port leaves (del_nbp calls switchdev_deferred_process), whereas a work
item privately scheduled by the driver will obviously not be waited upon
by the bridge, leading to the possibility of having the race.
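
To illustrate the race (with hypothetical foo_* names, not taken from
any one driver), the pattern being replaced looks roughly like this:

/* Illustrative only: the atomic notifier cannot sleep, so the FDB entry
 * is offloaded from a privately scheduled work item. del_nbp() only
 * waits for switchdev's own deferred items (switchdev_deferred_process()),
 * not for this private workqueue, so the work item can run after the
 * port has already left the bridge.
 */
static struct workqueue_struct *foo_owq;	/* driver-private ordered wq */

struct foo_fdb_work {
	struct work_struct work;
	struct net_device *dev;
	unsigned long event;
};

static int foo_switchdev_event(struct notifier_block *nb,
			       unsigned long event, void *ptr)
{
	struct net_device *dev = switchdev_notifier_info_to_dev(ptr);
	struct foo_fdb_work *work;

	switch (event) {
	case SWITCHDEV_FDB_ADD_TO_DEVICE:
	case SWITCHDEV_FDB_DEL_TO_DEVICE:
		work = kzalloc(sizeof(*work), GFP_ATOMIC);
		if (!work)
			return NOTIFY_BAD;
		work->dev = dev;
		work->event = event;
		INIT_WORK(&work->work, foo_fdb_work_fn); /* hypothetical handler */
		/* racy: nothing orders this against the port leaving the bridge */
		queue_work(foo_owq, &work->work);
		return NOTIFY_OK;
	default:
		return NOTIFY_DONE;
	}
}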

This is a dependency for the "DSA FDB isolation" series posted here; it
was split out of that series, hence the numbering starts directly at v2.

https://patchwork.kernel.org/project/netdevbpf/cover/20210818120150.892647-1-vladimir.oltean@nxp.com/

Vladimir Oltean (5):
  net: switchdev: move SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE to the blocking
    notifier chain
  net: bridge: switchdev: make br_fdb_replay offer sleepable context to
    consumers
  net: switchdev: drop the atomic notifier block from
    switchdev_bridge_port_{,un}offload
  net: switchdev: don't assume RCU context in
    switchdev_handle_fdb_{add,del}_to_device
  net: dsa: handle SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE synchronously

 .../ethernet/freescale/dpaa2/dpaa2-switch.c   |  86 +++++------
 .../marvell/prestera/prestera_switchdev.c     | 110 +++++++-------
 .../mellanox/mlx5/core/en/rep/bridge.c        |  59 +++++++-
 .../mellanox/mlxsw/spectrum_switchdev.c       |  61 +++++++-
 .../microchip/sparx5/sparx5_switchdev.c       |  78 +++++-----
 drivers/net/ethernet/mscc/ocelot_net.c        |   3 -
 drivers/net/ethernet/rocker/rocker_main.c     |  73 ++++-----
 drivers/net/ethernet/rocker/rocker_ofdpa.c    |   4 +-
 drivers/net/ethernet/ti/am65-cpsw-nuss.c      |   4 +-
 drivers/net/ethernet/ti/am65-cpsw-switchdev.c |  57 ++++----
 drivers/net/ethernet/ti/cpsw_new.c            |   4 +-
 drivers/net/ethernet/ti/cpsw_switchdev.c      |  60 ++++----
 drivers/s390/net/qeth_l2_main.c               |  10 +-
 include/net/switchdev.h                       |  30 +++-
 net/bridge/br.c                               |   5 +-
 net/bridge/br_fdb.c                           |  40 ++++-
 net/bridge/br_private.h                       |   4 -
 net/bridge/br_switchdev.c                     |  18 +--
 net/dsa/dsa.c                                 |  15 --
 net/dsa/dsa_priv.h                            |  15 --
 net/dsa/port.c                                |   3 -
 net/dsa/slave.c                               | 138 ++++++------------
 net/switchdev/switchdev.c                     |  61 +++++++-
 23 files changed, 529 insertions(+), 409 deletions(-)

Comments

Vladimir Oltean Aug. 19, 2021, 11:18 p.m. UTC | #1
Hi Vlad,

On Thu, Aug 19, 2021 at 09:15:17PM +0300, Vlad Buslov wrote:
> On Thu 19 Aug 2021 at 19:07, Vladimir Oltean <vladimir.oltean@nxp.com> wrote:
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/rep/bridge.c b/drivers/net/ethernet/mellanox/mlx5/core/en/rep/bridge.c
> > index 0c38c2e319be..ea7c3f07f6fe 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/en/rep/bridge.c
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/rep/bridge.c
> > @@ -276,6 +276,55 @@ mlx5_esw_bridge_port_obj_attr_set(struct net_device *dev,
> >  	return err;
> >  }
> >
> > +static struct mlx5_bridge_switchdev_fdb_work *
> > +mlx5_esw_bridge_init_switchdev_fdb_work(struct net_device *dev, bool add,
> > +					struct switchdev_notifier_fdb_info *fdb_info,
> > +					struct mlx5_esw_bridge_offloads *br_offloads);
> > +
> > +static int
> > +mlx5_esw_bridge_fdb_event(struct net_device *dev, unsigned long event,
> > +			  struct switchdev_notifier_info *info,
> > +			  struct mlx5_esw_bridge_offloads *br_offloads)
> > +{
> > +	struct switchdev_notifier_fdb_info *fdb_info;
> > +	struct mlx5_bridge_switchdev_fdb_work *work;
> > +	struct mlx5_eswitch *esw = br_offloads->esw;
> > +	u16 vport_num, esw_owner_vhca_id;
> > +	struct net_device *upper, *rep;
> > +
> > +	upper = netdev_master_upper_dev_get_rcu(dev);
> > +	if (!upper)
> > +		return 0;
> > +	if (!netif_is_bridge_master(upper))
> > +		return 0;
> > +
> > +	rep = mlx5_esw_bridge_rep_vport_num_vhca_id_get(dev, esw,
> > +							&vport_num,
> > +							&esw_owner_vhca_id);
> > +	if (!rep)
> > +		return 0;
> > +
> > +	/* only handle the event on peers */
> > +	if (mlx5_esw_bridge_is_local(dev, rep, esw))
> > +		return 0;
>
> This check is only needed for SWITCHDEV_FDB_DEL_TO_BRIDGE case. Here it
> breaks the offload.

Very good point, thanks for looking. I copied the entire atomic notifier
handler and deleted the code which wasn't needed, but I actually took a
break while converting mlx5, and so I forgot to delete this part when I
came back.

> > +
> > +	fdb_info = container_of(info, struct switchdev_notifier_fdb_info, info);
> > +
> > +	work = mlx5_esw_bridge_init_switchdev_fdb_work(dev,
> > +						       event == SWITCHDEV_FDB_ADD_TO_DEVICE,
> > +						       fdb_info,
>
> Here FDB info can already be deallocated[1] since this is now executing
> asynchronously and races with fdb_rcu_free() that is scheduled to be
> called after rcu grace period by fdb_delete().

I am incredibly lucky that you caught this; apparently I needed to add
an msleep(1000) to see it as well.

It is not the struct switchdev_notifier_fdb_info *fdb_info that gets
freed under RCU. It is fdb_info->addr (the MAC address), since
switchdev_deferred_enqueue only performs a shallow copy. I will address
that in v3.
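
For reference, the general shape of such a fix could be the following
(hypothetical structure and function names, not the actual v3 patch):
make the deferred item own a copy of the MAC address instead of
pointing into memory that fdb_rcu_free() can release.

struct deferred_fdb_item {
	struct switchdev_notifier_fdb_info fdb_info;
	unsigned char addr[ETH_ALEN];	/* owned copy of the MAC address */
};

static void deferred_fdb_item_init(struct deferred_fdb_item *item,
				   const struct switchdev_notifier_fdb_info *fdb_info)
{
	item->fdb_info = *fdb_info;		/* shallow copy of the rest */
	ether_addr_copy(item->addr, fdb_info->addr);
	item->fdb_info.addr = item->addr;	/* no longer points into the bridge FDB */
}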

> > @@ -415,9 +470,7 @@ static int mlx5_esw_bridge_switchdev_event(struct notifier_block *nb,
> >  		/* only handle the event on peers */
> >  		if (mlx5_esw_bridge_is_local(dev, rep, esw))
> >  			break;
>
> I really like the idea of completely remove the driver wq from FDB
> handling code, but I'm not yet too familiar with bridge internals to
> easily determine whether same approach can be applied to
> SWITCHDEV_FDB_{ADD|DEL}_TO_BRIDGE event after this series is accepted.
> It seems that all current users already generate these events from
> blocking context, so would it be a trivial change for me to do in your
> opinion? That would allow me to get rid of mlx5_esw_bridge_offloads->wq
> in our driver.

If all callers really are in blocking context (and they do appear to be)
you can even forgo the switchdev_deferred_enqueue that switchdev_fdb_add_to_device
does, and just call_switchdev_blocking_notifiers() directly. Then you
move the bridge handler from br_switchdev_event() to br_switchdev_blocking_event().
It should be even simpler than this conversion.
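
A rough, untested sketch of that idea (hypothetical foo_* names),
assuming the bridge handler has been moved to the blocking chain as
described:

static int foo_fdb_offloaded_notify(struct net_device *brport_dev,
				    const unsigned char *addr, u16 vid)
{
	struct switchdev_notifier_fdb_info info = {
		.addr = addr,
		.vid = vid,
		.offloaded = true,
	};

	/* No switchdev_deferred_enqueue(), no private work item: the
	 * caller already runs in blocking context, so notify the blocking
	 * chain directly.
	 */
	return call_switchdev_blocking_notifiers(SWITCHDEV_FDB_ADD_TO_BRIDGE,
						 brport_dev, &info.info, NULL);
}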
Vlad Buslov Aug. 20, 2021, 7:36 a.m. UTC | #2
On Fri 20 Aug 2021 at 02:18, Vladimir Oltean <olteanv@gmail.com> wrote:
> Hi Vlad,
>
> On Thu, Aug 19, 2021 at 09:15:17PM +0300, Vlad Buslov wrote:
>> On Thu 19 Aug 2021 at 19:07, Vladimir Oltean <vladimir.oltean@nxp.com> wrote:
>> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/rep/bridge.c b/drivers/net/ethernet/mellanox/mlx5/core/en/rep/bridge.c
>> > index 0c38c2e319be..ea7c3f07f6fe 100644
>> > --- a/drivers/net/ethernet/mellanox/mlx5/core/en/rep/bridge.c
>> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/rep/bridge.c
>> > @@ -276,6 +276,55 @@ mlx5_esw_bridge_port_obj_attr_set(struct net_device *dev,
>> >  	return err;
>> >  }
>> >
>> > +static struct mlx5_bridge_switchdev_fdb_work *
>> > +mlx5_esw_bridge_init_switchdev_fdb_work(struct net_device *dev, bool add,
>> > +					struct switchdev_notifier_fdb_info *fdb_info,
>> > +					struct mlx5_esw_bridge_offloads *br_offloads);
>> > +
>> > +static int
>> > +mlx5_esw_bridge_fdb_event(struct net_device *dev, unsigned long event,
>> > +			  struct switchdev_notifier_info *info,
>> > +			  struct mlx5_esw_bridge_offloads *br_offloads)
>> > +{
>> > +	struct switchdev_notifier_fdb_info *fdb_info;
>> > +	struct mlx5_bridge_switchdev_fdb_work *work;
>> > +	struct mlx5_eswitch *esw = br_offloads->esw;
>> > +	u16 vport_num, esw_owner_vhca_id;
>> > +	struct net_device *upper, *rep;
>> > +
>> > +	upper = netdev_master_upper_dev_get_rcu(dev);
>> > +	if (!upper)
>> > +		return 0;
>> > +	if (!netif_is_bridge_master(upper))
>> > +		return 0;
>> > +
>> > +	rep = mlx5_esw_bridge_rep_vport_num_vhca_id_get(dev, esw,
>> > +							&vport_num,
>> > +							&esw_owner_vhca_id);
>> > +	if (!rep)
>> > +		return 0;
>> > +
>> > +	/* only handle the event on peers */
>> > +	if (mlx5_esw_bridge_is_local(dev, rep, esw))
>> > +		return 0;
>>
>> This check is only needed for SWITCHDEV_FDB_DEL_TO_BRIDGE case. Here it
>> breaks the offload.
>
> Very good point, thanks for looking. I copied the entire atomic notifier
> handler and deleted the code which wasn't needed, but I actually took a
> break while converting mlx5, and so I forgot to delete this part when I
> came back.
>
>> > +
>> > +	fdb_info = container_of(info, struct switchdev_notifier_fdb_info, info);
>> > +
>> > +	work = mlx5_esw_bridge_init_switchdev_fdb_work(dev,
>> > +						       event == SWITCHDEV_FDB_ADD_TO_DEVICE,
>> > +						       fdb_info,
>>
>> Here FDB info can already be deallocated[1] since this is now executing
>> asynchronously and races with fdb_rcu_free() that is scheduled to be
>> called after rcu grace period by fdb_delete().
>
> I am incredibly lucky that you caught this, apparently I needed to add
> an msleep(1000) to see it as well.
>
> It is not the struct switchdev_notifier_fdb_info *fdb_info that gets
> freed under RCU. It is fdb_info->addr (the MAC address), since
> switchdev_deferred_enqueue only performs a shallow copy. I will address
> that in v3.
>
>> > @@ -415,9 +470,7 @@ static int mlx5_esw_bridge_switchdev_event(struct notifier_block *nb,
>> >  		/* only handle the event on peers */
>> >  		if (mlx5_esw_bridge_is_local(dev, rep, esw))
>> >  			break;
>>
>> I really like the idea of completely remove the driver wq from FDB
>> handling code, but I'm not yet too familiar with bridge internals to
>> easily determine whether same approach can be applied to
>> SWITCHDEV_FDB_{ADD|DEL}_TO_BRIDGE event after this series is accepted.
>> It seems that all current users already generate these events from
>> blocking context, so would it be a trivial change for me to do in your
>> opinion? That would allow me to get rid of mlx5_esw_bridge_offloads->wq
>> in our driver.
>
> If all callers really are in blocking context (and they do appear to be)
> you can even forgo the switchdev_deferred_enqueue that switchdev_fdb_add_to_device
> does, and just call_switchdev_blocking_notifiers() directly. Then you
> move the bridge handler from br_switchdev_event() to br_switchdev_blocking_event().
> It should be even simpler than this conversion.

Thanks for your advice! I'll start looking into it as soon as this
series is accepted.
Ido Schimmel Aug. 20, 2021, 9:16 a.m. UTC | #3
On Thu, Aug 19, 2021 at 07:07:18PM +0300, Vladimir Oltean wrote:
> Problem statement:
> 
> Any time a driver needs to create a private association between a bridge
> upper interface and use that association within its
> SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE handler, we have an issue with FDB
> entries deleted by the bridge when the port leaves. The issue is that
> all switchdev drivers schedule a work item to have sleepable context,
> and that work item can be actually scheduled after the port has left the
> bridge, which means the association might have already been broken by
> the time the scheduled FDB work item attempts to use it.

This is handled in mlxsw by telling the device to flush the FDB entries
pointing to the {port, FID} when the VLAN is deleted (synchronously).

> 
> The solution is to modify switchdev to use its embedded SWITCHDEV_F_DEFER
> mechanism to make the FDB notifiers emitted from the fastpath be
> scheduled in sleepable context. All drivers are converted to handle
> SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE from their blocking notifier block
> handler (or register a blocking switchdev notifier handler if they
> didn't have one). This solves the aforementioned problem because the
> bridge waits for the switchdev deferred work items to finish before a
> port leaves (del_nbp calls switchdev_deferred_process), whereas a work
> item privately scheduled by the driver will obviously not be waited upon
> by the bridge, leading to the possibility of having the race.

How the problem is solved if after this patchset drivers still queue a
work item?

DSA supports learning, but does not report the entries to the bridge.
How are these entries deleted when a port leaves the bridge?

> 
> This is a dependency for the "DSA FDB isolation" posted here. It was
> split out of that series hence the numbering starts directly at v2.
> 
> https://patchwork.kernel.org/project/netdevbpf/cover/20210818120150.892647-1-vladimir.oltean@nxp.com/

What is FDB isolation? Cover letter says: "There are use cases which
need FDB isolation between standalone ports and bridged ports, as well
as isolation between ports of different bridges".

Does it mean that DSA currently forwards packets between ports even if
they are member in different bridges or standalone?

Vladimir Oltean Aug. 20, 2021, 9:37 a.m. UTC | #4
On Fri, Aug 20, 2021 at 12:16:10PM +0300, Ido Schimmel wrote:
> On Thu, Aug 19, 2021 at 07:07:18PM +0300, Vladimir Oltean wrote:
> > Problem statement:
> > 
> > Any time a driver needs to create a private association between a bridge
> > upper interface and use that association within its
> > SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE handler, we have an issue with FDB
> > entries deleted by the bridge when the port leaves. The issue is that
> > all switchdev drivers schedule a work item to have sleepable context,
> > and that work item can be actually scheduled after the port has left the
> > bridge, which means the association might have already been broken by
> > the time the scheduled FDB work item attempts to use it.
> 
> This is handled in mlxsw by telling the device to flush the FDB entries
> pointing to the {port, FID} when the VLAN is deleted (synchronously).

Again, central solution vs mlxsw solution.

If a port leaves a LAG that is offloaded but the LAG does not leave the
bridge, the driver still needs to initiate the VLAN deletion. I really
don't like that, it makes switchdev drivers bloated.

As long as you call switchdev_bridge_port_unoffload and you populate the
blocking notifier pointer, you will get replays of item deletion from
the bridge.
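
Roughly, the idea is the following (illustrative sketch with
hypothetical foo_* names; the exact switchdev_bridge_port_unoffload()
prototype is what patch 3 of this series reworks):

static struct notifier_block foo_switchdev_nb;		/* atomic events */
static struct notifier_block foo_switchdev_blocking_nb;	/* blocking events */

static void foo_port_bridge_leave(const void *ctx,
				  struct net_device *brport_dev)
{
	/* With the blocking notifier block populated, the bridge replays
	 * the deletion of its objects (FDB, VLANs, MDB) towards us, so
	 * the driver does not have to initiate the cleanup itself.
	 */
	switchdev_bridge_port_unoffload(brport_dev, ctx,
					&foo_switchdev_nb,
					&foo_switchdev_blocking_nb);
}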

> > The solution is to modify switchdev to use its embedded SWITCHDEV_F_DEFER
> > mechanism to make the FDB notifiers emitted from the fastpath be
> > scheduled in sleepable context. All drivers are converted to handle
> > SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE from their blocking notifier block
> > handler (or register a blocking switchdev notifier handler if they
> > didn't have one). This solves the aforementioned problem because the
> > bridge waits for the switchdev deferred work items to finish before a
> > port leaves (del_nbp calls switchdev_deferred_process), whereas a work
> > item privately scheduled by the driver will obviously not be waited upon
> > by the bridge, leading to the possibility of having the race.
> 
> How the problem is solved if after this patchset drivers still queue a
> work item?

It's only a problem if you bank on any stateful association between FDB
entries and your ports (aka you expect that port->bridge_dev still holds
the same value in the atomic handler as in the deferred work item). I
think drivers don't do this at the moment, otherwise they would be
broken.

When they need that, they will convert to synchronous handling and all
will be fine.

> DSA supports learning, but does not report the entries to the bridge.

Why is this relevant exactly?

> How are these entries deleted when a port leaves the bridge?

dsa_port_fast_age does the following
(a) deletes the hardware learned entries on a port, in all VLANs
(b) notifies the bridge to also flush its software FDB on that port

It is called
(a) when the STP state changes from a learning-capable state (LEARNING,
    FORWARDING) to a non-learning capable state (BLOCKING, LISTENING)
(b) when learning is turned off by the user
(c) when learning is turned off by the port becoming standalone after
    leaving a bridge (actually same code path as b)

So the FDB of a port is also flushed when a single switch port leaves a
LAG that is the actual bridge port (maybe not ideal, but I don't know
any better).
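
In code, the idea is roughly the following (simplified sketch, not the
exact net/dsa code):

static void dsa_port_fast_age_sketch(struct dsa_port *dp)
{
	struct switchdev_notifier_fdb_info info = {};
	struct dsa_switch *ds = dp->ds;

	if (!ds->ops->port_fast_age)
		return;

	/* (a) delete the hardware-learned entries on this port, in all VLANs */
	ds->ops->port_fast_age(ds, dp->index);

	/* (b) ask the bridge to also flush its software FDB on this port */
	call_switchdev_notifiers(SWITCHDEV_FDB_FLUSH_TO_BRIDGE,
				 dp->slave, &info.info, NULL);
}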

> > This is a dependency for the "DSA FDB isolation" posted here. It was
> > split out of that series hence the numbering starts directly at v2.
> > 
> > https://patchwork.kernel.org/project/netdevbpf/cover/20210818120150.892647-1-vladimir.oltean@nxp.com/
> 
> What is FDB isolation? Cover letter says: "There are use cases which
> need FDB isolation between standalone ports and bridged ports, as well
> as isolation between ports of different bridges".

FDB isolation means exactly what it says: that the hardware FDB lookup
of ports that are standalone, or under one bridge, is unable to find FDB entries
(same MAC address, same VID) learned on another port from another bridge.

> Does it mean that DSA currently forwards packets between ports even if
> they are member in different bridges or standalone?

No, that is plain forwarding isolation in my understanding of terms, and
we have had that for many years now.
Vladimir Oltean Aug. 20, 2021, 10:49 a.m. UTC | #5
On Fri, Aug 20, 2021 at 12:16:10PM +0300, Ido Schimmel wrote:
> On Thu, Aug 19, 2021 at 07:07:18PM +0300, Vladimir Oltean wrote:
> > Problem statement:
> >
> > Any time a driver needs to create a private association between a bridge
> > upper interface and use that association within its
> > SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE handler, we have an issue with FDB
> > entries deleted by the bridge when the port leaves. The issue is that
> > all switchdev drivers schedule a work item to have sleepable context,
> > and that work item can be actually scheduled after the port has left the
> > bridge, which means the association might have already been broken by
> > the time the scheduled FDB work item attempts to use it.
>
> This is handled in mlxsw by telling the device to flush the FDB entries
> pointing to the {port, FID} when the VLAN is deleted (synchronously).

If you have FDB entries pointing to bridge ports that are foreign
interfaces and you offload them, do you catch the VLAN deletion on the
foreign port and flush your entries towards it at that time?
Ido Schimmel Aug. 20, 2021, 4:09 p.m. UTC | #6
On Fri, Aug 20, 2021 at 12:37:23PM +0300, Vladimir Oltean wrote:
> On Fri, Aug 20, 2021 at 12:16:10PM +0300, Ido Schimmel wrote:
> > On Thu, Aug 19, 2021 at 07:07:18PM +0300, Vladimir Oltean wrote:
> > > Problem statement:
> > > 
> > > Any time a driver needs to create a private association between a bridge
> > > upper interface and use that association within its
> > > SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE handler, we have an issue with FDB
> > > entries deleted by the bridge when the port leaves. The issue is that
> > > all switchdev drivers schedule a work item to have sleepable context,
> > > and that work item can be actually scheduled after the port has left the
> > > bridge, which means the association might have already been broken by
> > > the time the scheduled FDB work item attempts to use it.
> > 
> > This is handled in mlxsw by telling the device to flush the FDB entries
> > pointing to the {port, FID} when the VLAN is deleted (synchronously).
> 
> Again, central solution vs mlxsw solution.

Again, a solution is forced on everyone regardless if it benefits them
or not. List is bombarded with version after version until patches are
applied. *EXHAUSTING*.

With these patches, except DSA, everyone gets another queue_work() for
each FDB entry. In some cases, it completely misses the purpose of the
patchset.

Want a central solution? Make sure it is properly integrated. "Don't
have the energy"? Ask for help. Do not try to force a solution on
everyone and motivate them to change the code by doing a poor conversion
yourself.

I don't accept "this will have to do".

> 
> If a port leaves a LAG that is offloaded but the LAG does not leave the
> bridge, the driver still needs to initiate the VLAN deletion. I really
> don't like that, it makes switchdev drivers bloated.
> 
> As long as you call switchdev_bridge_port_unoffload and you populate the
> blocking notifier pointer, you will get replays of item deletion from
> the bridge.
> 
> > > The solution is to modify switchdev to use its embedded SWITCHDEV_F_DEFER
> > > mechanism to make the FDB notifiers emitted from the fastpath be
> > > scheduled in sleepable context. All drivers are converted to handle
> > > SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE from their blocking notifier block
> > > handler (or register a blocking switchdev notifier handler if they
> > > didn't have one). This solves the aforementioned problem because the
> > > bridge waits for the switchdev deferred work items to finish before a
> > > port leaves (del_nbp calls switchdev_deferred_process), whereas a work
> > > item privately scheduled by the driver will obviously not be waited upon
> > > by the bridge, leading to the possibility of having the race.
> > 
> > How the problem is solved if after this patchset drivers still queue a
> > work item?
> 
> It's only a problem if you bank on any stateful association between FDB
> entries and your ports (aka you expect that port->bridge_dev still holds
> the same value in the atomic handler as in the deferred work item). I
> think drivers don't do this at the moment, otherwise they would be
> broken.
> 
> When they need that, they will convert to synchronous handling and all
> will be fine.
> 
> > DSA supports learning, but does not report the entries to the bridge.
> 
> Why is this relevant exactly?

Because I wanted to make sure that FDB entries that are not present in
the bridge are also flushed.

> 
> > How are these entries deleted when a port leaves the bridge?
> 
> dsa_port_fast_age does the following
> (a) deletes the hardware learned entries on a port, in all VLANs
> (b) notifies the bridge to also flush its software FDB on that port
> 
> It is called
> (a) when the STP state changes from a learning-capable state (LEARNING,
>     FORWARDING) to a non-learning capable state (BLOCKING, LISTENING)
> (b) when learning is turned off by the user
> (c) when learning is turned off by the port becoming standalone after
>     leaving a bridge (actually same code path as b)
> 
> So the FDB of a port is also flushed when a single switch port leaves a
> LAG that is the actual bridge port (maybe not ideal, but I don't know
> any better).
> 
> > > This is a dependency for the "DSA FDB isolation" posted here. It was
> > > split out of that series hence the numbering starts directly at v2.
> > > 
> > > https://patchwork.kernel.org/project/netdevbpf/cover/20210818120150.892647-1-vladimir.oltean@nxp.com/
> > 
> > What is FDB isolation? Cover letter says: "There are use cases which
> > need FDB isolation between standalone ports and bridged ports, as well
> > as isolation between ports of different bridges".
> 
> FDB isolation means exactly what it says: that the hardware FDB lookup
> of ports that are standalone, or under one bridge, is unable to find FDB entries
> (same MAC address, same VID) learned on another port from another bridge.
> 
> > Does it mean that DSA currently forwards packets between ports even if
> > they are member in different bridges or standalone?
> 
> No, that is plain forwarding isolation in my understanding of terms, and
> we have had that for many years now.

So if I have {00:01:02:03:04:05, 5} in br0, but not in br1 and now a
packet with this DMAC/VID needs to be forwarded in br1 it will be
dropped instead of being flooded?
Ido Schimmel Aug. 20, 2021, 4:11 p.m. UTC | #7
On Fri, Aug 20, 2021 at 01:49:48PM +0300, Vladimir Oltean wrote:
> On Fri, Aug 20, 2021 at 12:16:10PM +0300, Ido Schimmel wrote:
> > On Thu, Aug 19, 2021 at 07:07:18PM +0300, Vladimir Oltean wrote:
> > > Problem statement:
> > >
> > > Any time a driver needs to create a private association between a bridge
> > > upper interface and use that association within its
> > > SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE handler, we have an issue with FDB
> > > entries deleted by the bridge when the port leaves. The issue is that
> > > all switchdev drivers schedule a work item to have sleepable context,
> > > and that work item can be actually scheduled after the port has left the
> > > bridge, which means the association might have already been broken by
> > > the time the scheduled FDB work item attempts to use it.
> >
> > This is handled in mlxsw by telling the device to flush the FDB entries
> > pointing to the {port, FID} when the VLAN is deleted (synchronously).
> 
> If you have FDB entries pointing to bridge ports that are foreign
> interfaces and you offload them, do you catch the VLAN deletion on the
> foreign port and flush your entries towards it at that time?

Yes, that's how VXLAN offload works. VLAN addition is used to determine
the mapping between VNI and VLAN.
Vladimir Oltean Aug. 20, 2021, 5:06 p.m. UTC | #8
On Fri, Aug 20, 2021 at 07:09:18PM +0300, Ido Schimmel wrote:
> On Fri, Aug 20, 2021 at 12:37:23PM +0300, Vladimir Oltean wrote:
> > On Fri, Aug 20, 2021 at 12:16:10PM +0300, Ido Schimmel wrote:
> > > On Thu, Aug 19, 2021 at 07:07:18PM +0300, Vladimir Oltean wrote:
> > > > Problem statement:
> > > >
> > > > Any time a driver needs to create a private association between a bridge
> > > > upper interface and use that association within its
> > > > SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE handler, we have an issue with FDB
> > > > entries deleted by the bridge when the port leaves. The issue is that
> > > > all switchdev drivers schedule a work item to have sleepable context,
> > > > and that work item can be actually scheduled after the port has left the
> > > > bridge, which means the association might have already been broken by
> > > > the time the scheduled FDB work item attempts to use it.
> > >
> > > This is handled in mlxsw by telling the device to flush the FDB entries
> > > pointing to the {port, FID} when the VLAN is deleted (synchronously).
> >
> > Again, central solution vs mlxsw solution.
>
> Again, a solution is forced on everyone regardless if it benefits them
> or not. List is bombarded with version after version until patches are
> applied. *EXHAUSTING*.

So if I replace "bombarded" with a more neutral word, isn't that how
it's done though? What would you do if you wanted to achieve something
but the framework stood in your way? Would you work around it to avoid
bombarding the list?

> With these patches, except DSA, everyone gets another queue_work() for
> each FDB entry. In some cases, it completely misses the purpose of the
> patchset.

I also fail to see the point. Patch 3 will have to make things worse
before they get better. It is like that in DSA too, and made more
reasonable only in the last patch from the series.

If I saw any middle-ground way, like keeping the notifiers on the atomic
chain for unconverted drivers, I would have done it. But what do you do
if more than one driver listens for one event, one driver wants it
blocking, the other wants it atomic. Do you make the bridge emit it
twice? That's even worse than having one useless queue_work() in some
drivers.

So if you think I can avoid that please tell me how.

> Want a central solution? Make sure it is properly integrated. "Don't
> have the energy"? Ask for help. Do not try to force a solution on
> everyone and motivate them to change the code by doing a poor conversion
> yourself.
>
> I don't accept "this will have to do".

So I can make many suppositions about what I did wrong, but I would
prefer that you tell me.

Is it the timing, as we're late in the development cycle? Maybe, and
that would make a lot of sense, but I don't want to assume anything that
has not been said.

Is it that I converted too few drivers? You said I'm bombarding the
list. Can I convert more drivers with less code? I would be absolutely
glad to. I have more driver conversions unsubmitted, some tested on
hardware.

Is it that I didn't ask for help? I still believe that it is best I
leave the driver maintainers to do the rest of the conversion, at their
own pace and with hardware to test and find issues I can not using just
code analysis and non-expert knowledge. After all, with all due respect
to the net-next tree, I sent these patches to a development git tree,
not to a production facility.

> > > What is FDB isolation? Cover letter says: "There are use cases which
> > > need FDB isolation between standalone ports and bridged ports, as well
> > > as isolation between ports of different bridges".
> >
> > FDB isolation means exactly what it says: that the hardware FDB lookup
> > of ports that are standalone, or under one bridge, is unable to find FDB entries
> > (same MAC address, same VID) learned on another port from another bridge.
> >
> > > Does it mean that DSA currently forwards packets between ports even if
> > > they are member in different bridges or standalone?
> >
> > No, that is plain forwarding isolation in my understanding of terms, and
> > we have had that for many years now.
>
> So if I have {00:01:02:03:04:05, 5} in br0, but not in br1 and now a
> packet with this DMAC/VID needs to be forwarded in br1 it will be
> dropped instead of being flooded?

Yes.
Nikolay Aleksandrov Aug. 20, 2021, 11:36 p.m. UTC | #9
On 20/08/2021 20:06, Vladimir Oltean wrote:
> On Fri, Aug 20, 2021 at 07:09:18PM +0300, Ido Schimmel wrote:
>> On Fri, Aug 20, 2021 at 12:37:23PM +0300, Vladimir Oltean wrote:
>>> On Fri, Aug 20, 2021 at 12:16:10PM +0300, Ido Schimmel wrote:
>>>> On Thu, Aug 19, 2021 at 07:07:18PM +0300, Vladimir Oltean wrote:
>>>>> Problem statement:
>>>>>
>>>>> Any time a driver needs to create a private association between a bridge
>>>>> upper interface and use that association within its
>>>>> SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE handler, we have an issue with FDB
>>>>> entries deleted by the bridge when the port leaves. The issue is that
>>>>> all switchdev drivers schedule a work item to have sleepable context,
>>>>> and that work item can be actually scheduled after the port has left the
>>>>> bridge, which means the association might have already been broken by
>>>>> the time the scheduled FDB work item attempts to use it.
>>>>
>>>> This is handled in mlxsw by telling the device to flush the FDB entries
>>>> pointing to the {port, FID} when the VLAN is deleted (synchronously).
>>>
>>> Again, central solution vs mlxsw solution.
>>
>> Again, a solution is forced on everyone regardless if it benefits them
>> or not. List is bombarded with version after version until patches are
>> applied. *EXHAUSTING*.
> 
> So if I replace "bombarded" with a more neutral word, isn't that how
> it's done though? What would you do if you wanted to achieve something
> but the framework stood in your way? Would you work around it to avoid
> bombarding the list?
> 
>> With these patches, except DSA, everyone gets another queue_work() for
>> each FDB entry. In some cases, it completely misses the purpose of the
>> patchset.
> 
> I also fail to see the point. Patch 3 will have to make things worse
> before they get better. It is like that in DSA too, and made more
> reasonable only in the last patch from the series.
> 
> If I saw any middle-ground way, like keeping the notifiers on the atomic
> chain for unconverted drivers, I would have done it. But what do you do
> if more than one driver listens for one event, one driver wants it
> blocking, the other wants it atomic. Do you make the bridge emit it
> twice? That's even worse than having one useless queue_work() in some
> drivers.
> 
> So if you think I can avoid that please tell me how.
> 

Hi,
I don't like the double-queuing for each fdb for everyone either, it's forcing them
to rework it asap due to inefficiency even though that shouldn't be necessary. In the
long run I hope everyone would migrate to such scheme, but perhaps we can do it gradually.
For most drivers this is introducing more work (as in processing) rather than helping
them right now, give them the option to convert to it on their own accord or bite
the bullet and convert everyone so the change won't affect them, it holds rtnl, it is blocking
I don't see why not convert everyone to just execute their otherwise queued work.
I'm sure driver maintainers would appreciate such help and would test and review it. You're
halfway there already..

Cheers,
 Nik
Vladimir Oltean Aug. 21, 2021, 12:22 a.m. UTC | #10
On Sat, Aug 21, 2021 at 02:36:26AM +0300, Nikolay Aleksandrov wrote:
> Hi,
> I don't like the double-queuing for each fdb for everyone either, it's forcing them
> to rework it asap due to inefficiency even though that shouldn't be necessary.

Let's be honest, with the vast majority of drivers having absurdities such as the
"if (!fdb_info->added_by_user || fdb_info->is_local) => nothing to do here, bye"
check placed _inside_ the actual work item (and therefore scheduling for nothing
for entries dynamically learned by the bridge), it's hard to believe that driver
authors cared too much about inefficiency when mindlessly copy-pasting that snippet
from mlxsw

[ which for the record does call mlxsw_sp_span_respin for dynamically learned FDB
  entries, so that driver doesn't schedule for nothing like the rest - although
  maybe even mlxsw could call mlxsw_sp_port_dev_lower_find_rcu instead of
  mlxsw_sp_port_dev_lower_find, and could save a queue_work for FDB entries on
  foreign && non-VXLAN ports. Who knows?! ]

Now I get to care for them.

But I can see how a partial conversion could leave things in an even more absurd position.
I don't want to contribute to the absurdity.
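
For illustration (hypothetical foo_* driver), the same test can run in
the atomic handler, before queue_work(), instead of inside the work
item:

static int foo_switchdev_event(struct notifier_block *nb,
			       unsigned long event, void *ptr)
{
	struct switchdev_notifier_info *info = ptr;
	struct switchdev_notifier_fdb_info *fdb_info;

	switch (event) {
	case SWITCHDEV_FDB_ADD_TO_DEVICE:
	case SWITCHDEV_FDB_DEL_TO_DEVICE:
		fdb_info = container_of(info, struct switchdev_notifier_fdb_info,
					info);
		/* Testing here avoids a queue_work() for every entry that was
		 * dynamically learned by the bridge; many drivers instead do
		 * the test from inside the work item they just scheduled.
		 */
		if (!fdb_info->added_by_user || fdb_info->is_local)
			return NOTIFY_DONE;
		foo_schedule_fdb_work(fdb_info);	/* hypothetical */
		return NOTIFY_OK;
	default:
		return NOTIFY_DONE;
	}
}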

> In the
> long run I hope everyone would migrate to such scheme, but perhaps we can do it gradually.
> For most drivers this is introducing more work (as in processing) rather than helping
> them right now, give them the option to convert to it on their own accord or bite
> the bullet and convert everyone so the change won't affect them, it holds rtnl, it is blocking
> I don't see why not convert everyone to just execute their otherwise queued work.
> I'm sure driver maintainers would appreciate such help and would test and review it. You're
> halfway there already..

Agree, this needs more work. Thanks for looking.
Vladimir Oltean Aug. 21, 2021, 7:09 p.m. UTC | #11
On Fri, Aug 20, 2021 at 07:11:15PM +0300, Ido Schimmel wrote:
> On Fri, Aug 20, 2021 at 01:49:48PM +0300, Vladimir Oltean wrote:
> > On Fri, Aug 20, 2021 at 12:16:10PM +0300, Ido Schimmel wrote:
> > > On Thu, Aug 19, 2021 at 07:07:18PM +0300, Vladimir Oltean wrote:
> > > > Problem statement:
> > > >
> > > > Any time a driver needs to create a private association between a bridge
> > > > upper interface and use that association within its
> > > > SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE handler, we have an issue with FDB
> > > > entries deleted by the bridge when the port leaves. The issue is that
> > > > all switchdev drivers schedule a work item to have sleepable context,
> > > > and that work item can be actually scheduled after the port has left the
> > > > bridge, which means the association might have already been broken by
> > > > the time the scheduled FDB work item attempts to use it.
> > >
> > > This is handled in mlxsw by telling the device to flush the FDB entries
> > > pointing to the {port, FID} when the VLAN is deleted (synchronously).
> > 
> > If you have FDB entries pointing to bridge ports that are foreign
> > interfaces and you offload them, do you catch the VLAN deletion on the
> > foreign port and flush your entries towards it at that time?
> 
> Yes, that's how VXLAN offload works. VLAN addition is used to determine
> the mapping between VNI and VLAN.

I was only able to follow as far as:

mlxsw_sp_switchdev_blocking_event
-> mlxsw_sp_switchdev_handle_vxlan_obj_del
   -> mlxsw_sp_switchdev_vxlan_vlans_del
      -> mlxsw_sp_switchdev_vxlan_vlan_del
         -> ??? where are the FDB entries flushed?

I was expecting to see something along the lines of

mlxsw_sp_switchdev_blocking_event
-> mlxsw_sp_port_vlans_del
   -> mlxsw_sp_bridge_port_vlan_del
      -> mlxsw_sp_port_vlan_bridge_leave
         -> mlxsw_sp_bridge_port_fdb_flush

but that is exactly on the other branch of the "if (netif_is_vxlan(dev))"
condition (and also, mlxsw_sp_bridge_port_fdb_flush flushes an externally-facing
port, not really what I needed to know, see below).

Anyway, it also seems to me that we are referring to slightly different
things by "foreign" interfaces. To me, a "foreign" interface is one
towards which there is no hardware data path. Like for example if you
have a mlxsw port in a plain L2 bridge with an Intel card. The data path
is the CPU and that was my question: do you track FDB entries towards
those interfaces (implicitly: towards the CPU)? You've answered about
VXLAN, which is quite not "foreign" in the sense I am thinking about,
because mlxsw does have a hardware data path towards a VXLAN interface
(as you've mentioned, it associates a VID with each VNI).

I've been searching through the mlxsw driver and I don't see that this
is being done, so I'm guessing you might wonder/ask why you would want
to do that in the first place. If you bridge a mlxsw port with an Intel
card, then (from another thread where you've said that mlxsw always
injects control packets where hardware learning is not performed) my
guess is that the MAC addresses learned on the Intel bridge port will
never be learned on the mlxsw device. So every packet that ingresses the
mlxsw and must egress the Intel card will reach the CPU through flooding
(and will consequently be flooded in the entire broadcast domain of the
mlxsw side of the bridge). Right?
Ido Schimmel Aug. 22, 2021, 6:48 a.m. UTC | #12
On Sat, Aug 21, 2021 at 02:36:26AM +0300, Nikolay Aleksandrov wrote:
> On 20/08/2021 20:06, Vladimir Oltean wrote:
> > On Fri, Aug 20, 2021 at 07:09:18PM +0300, Ido Schimmel wrote:
> >> On Fri, Aug 20, 2021 at 12:37:23PM +0300, Vladimir Oltean wrote:
> >>> On Fri, Aug 20, 2021 at 12:16:10PM +0300, Ido Schimmel wrote:
> >>>> On Thu, Aug 19, 2021 at 07:07:18PM +0300, Vladimir Oltean wrote:
> >>>>> Problem statement:
> >>>>>
> >>>>> Any time a driver needs to create a private association between a bridge
> >>>>> upper interface and use that association within its
> >>>>> SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE handler, we have an issue with FDB
> >>>>> entries deleted by the bridge when the port leaves. The issue is that
> >>>>> all switchdev drivers schedule a work item to have sleepable context,
> >>>>> and that work item can be actually scheduled after the port has left the
> >>>>> bridge, which means the association might have already been broken by
> >>>>> the time the scheduled FDB work item attempts to use it.
> >>>>
> >>>> This is handled in mlxsw by telling the device to flush the FDB entries
> >>>> pointing to the {port, FID} when the VLAN is deleted (synchronously).
> >>>
> >>> Again, central solution vs mlxsw solution.
> >>
> >> Again, a solution is forced on everyone regardless if it benefits them
> >> or not. List is bombarded with version after version until patches are
> >> applied. *EXHAUSTING*.
> > 
> > So if I replace "bombarded" with a more neutral word, isn't that how
> > it's done though? What would you do if you wanted to achieve something
> > but the framework stood in your way? Would you work around it to avoid
> > bombarding the list?
> > 
> >> With these patches, except DSA, everyone gets another queue_work() for
> >> each FDB entry. In some cases, it completely misses the purpose of the
> >> patchset.
> > 
> > I also fail to see the point. Patch 3 will have to make things worse
> > before they get better. It is like that in DSA too, and made more
> > reasonable only in the last patch from the series.
> > 
> > If I saw any middle-ground way, like keeping the notifiers on the atomic
> > chain for unconverted drivers, I would have done it. But what do you do
> > if more than one driver listens for one event, one driver wants it
> > blocking, the other wants it atomic. Do you make the bridge emit it
> > twice? That's even worse than having one useless queue_work() in some
> > drivers.
> > 
> > So if you think I can avoid that please tell me how.
> > 
> 
> Hi,
> I don't like the double-queuing for each fdb for everyone either, it's forcing them
> to rework it asap due to inefficiency even though that shouldn't be necessary. In the
> long run I hope everyone would migrate to such scheme, but perhaps we can do it gradually.

The fundamental problem is that these operations need to be deferred in
the first place. It would have been much better if user space could get
a synchronous feedback.

It all stems from the fact that control plane operations need to be done
under a spin lock because the shared databases (e.g., FDB, MDB) or
states (e.g., STP) that they are updating can also be updated from the
data plane in softIRQ.

I don't have a clean solution to this problem without doing a surgery in
the bridge driver. Deferring updates from the data plane using a work
queue and converting the spin locks to mutexes. This will also allow us
to emit netlink notifications from process context and convert
GFP_ATOMIC to GFP_KERNEL.

Is that something you consider as acceptable? Does anybody have a better
idea?

> For most drivers this is introducing more work (as in processing) rather than helping
> them right now, give them the option to convert to it on their own accord or bite
> the bullet and convert everyone so the change won't affect them, it holds rtnl, it is blocking
> I don't see why not convert everyone to just execute their otherwise queued work.
> I'm sure driver maintainers would appreciate such help and would test and review it. You're
> halfway there already..
> 
> Cheers,
>  Nik
Ido Schimmel Aug. 22, 2021, 7:19 a.m. UTC | #13
On Sat, Aug 21, 2021 at 10:09:14PM +0300, Vladimir Oltean wrote:
> On Fri, Aug 20, 2021 at 07:11:15PM +0300, Ido Schimmel wrote:
> > On Fri, Aug 20, 2021 at 01:49:48PM +0300, Vladimir Oltean wrote:
> > > On Fri, Aug 20, 2021 at 12:16:10PM +0300, Ido Schimmel wrote:
> > > > On Thu, Aug 19, 2021 at 07:07:18PM +0300, Vladimir Oltean wrote:
> > > > > Problem statement:
> > > > >
> > > > > Any time a driver needs to create a private association between a bridge
> > > > > upper interface and use that association within its
> > > > > SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE handler, we have an issue with FDB
> > > > > entries deleted by the bridge when the port leaves. The issue is that
> > > > > all switchdev drivers schedule a work item to have sleepable context,
> > > > > and that work item can be actually scheduled after the port has left the
> > > > > bridge, which means the association might have already been broken by
> > > > > the time the scheduled FDB work item attempts to use it.
> > > >
> > > > This is handled in mlxsw by telling the device to flush the FDB entries
> > > > pointing to the {port, FID} when the VLAN is deleted (synchronously).
> > > 
> > > If you have FDB entries pointing to bridge ports that are foreign
> > > interfaces and you offload them, do you catch the VLAN deletion on the
> > > foreign port and flush your entries towards it at that time?
> > 
> > Yes, that's how VXLAN offload works. VLAN addition is used to determine
> > the mapping between VNI and VLAN.
> 
> I was only able to follow as far as:
> 
> mlxsw_sp_switchdev_blocking_event
> -> mlxsw_sp_switchdev_handle_vxlan_obj_del
>    -> mlxsw_sp_switchdev_vxlan_vlans_del
>       -> mlxsw_sp_switchdev_vxlan_vlan_del
>          -> ??? where are the FDB entries flushed?

 mlxsw_sp_switchdev_blocking_event
 -> mlxsw_sp_switchdev_handle_vxlan_obj_del
    -> mlxsw_sp_switchdev_vxlan_vlans_del
       -> mlxsw_sp_switchdev_vxlan_vlan_del
          -> mlxsw_sp_bridge_vxlan_leave
	     -> mlxsw_sp_nve_fid_disable
	        -> mlxsw_sp_nve_fdb_flush_by_fid

> 
> I was expecting to see something along the lines of
> 
> mlxsw_sp_switchdev_blocking_event
> -> mlxsw_sp_port_vlans_del
>    -> mlxsw_sp_bridge_port_vlan_del
>       -> mlxsw_sp_port_vlan_bridge_leave
>          -> mlxsw_sp_bridge_port_fdb_flush
> 
> but that is exactly on the other branch of the "if (netif_is_vxlan(dev))"
> condition (and also, mlxsw_sp_bridge_port_fdb_flush flushes an externally-facing
> port, not really what I needed to know, see below).
> 
> Anyway, it also seems to me that we are referring to slightly different
> things by "foreign" interfaces. To me, a "foreign" interface is one
> towards which there is no hardware data path. Like for example if you
> have a mlxsw port in a plain L2 bridge with an Intel card. The data path
> is the CPU and that was my question: do you track FDB entries towards
> those interfaces (implicitly: towards the CPU)? You've answered about
> VXLAN, which is quite not "foreign" in the sense I am thinking about,
> because mlxsw does have a hardware data path towards a VXLAN interface
> (as you've mentioned, it associates a VID with each VNI).
> 
> I've been searching through the mlxsw driver and I don't see that this
> is being done, so I'm guessing you might wonder/ask why you would want
> to do that in the first place. If you bridge a mlxsw port with an Intel
> card, then (from another thread where you've said that mlxsw always
> injects control packets where hardware learning is not performed) my
> guess is that the MAC addresses learned on the Intel bridge port will
> never be learned on the mlxsw device. So every packet that ingresses the
> mlxsw and must egress the Intel card will reach the CPU through flooding
> (and will consequently be flooded in the entire broadcast domain of the
> mlxsw side of the bridge). Right?

I can see how this use case makes sense on systems where the difference
in performance between the ASIC and the CPU is not huge, but it doesn't
make much sense with Spectrum and I have yet to get requests to support
it (might change). Keep in mind that Spectrum is able to forward several
Bpps with a switching capacity of several Tbps. It is usually connected
to a weak CPU (e.g., low-end ARM, Intel Atom) through a PCI bus with a
bandwidth of several Gbps. There is usually one "Intel card" on such
systems which is connected to the management network that is separated
from the data plane network.

If we were to support it, FDB entries towards "foreign" interfaces would
be programmed to trap packets to the CPU. For now, for correctness /
rigor purposes, I would prefer simply returning an error / warning via
extack when such topologies are configured.
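
For example, something along these lines at bridge join time
(illustrative sketch only, with hypothetical foo_* helpers):

static int foo_bridge_join_check(struct net_device *brport_dev,
				 struct net_device *br_dev,
				 struct netlink_ext_ack *extack)
{
	struct net_device *lower;
	struct list_head *iter;

	/* Reject topologies that mix our ports with foreign interfaces,
	 * rather than silently relying on CPU flooding for the missing
	 * hardware datapath.
	 */
	netdev_for_each_lower_dev(br_dev, lower, iter) {
		if (lower == brport_dev || foo_port_dev_check(lower))
			continue;
		NL_SET_ERR_MSG_MOD(extack,
				   "Bridging with foreign interfaces is not supported");
		return -EOPNOTSUPP;
	}

	return 0;
}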
Nikolay Aleksandrov Aug. 22, 2021, 9:12 a.m. UTC | #14
On 22/08/2021 09:48, Ido Schimmel wrote:
> On Sat, Aug 21, 2021 at 02:36:26AM +0300, Nikolay Aleksandrov wrote:
>> On 20/08/2021 20:06, Vladimir Oltean wrote:
>>> On Fri, Aug 20, 2021 at 07:09:18PM +0300, Ido Schimmel wrote:
>>>> On Fri, Aug 20, 2021 at 12:37:23PM +0300, Vladimir Oltean wrote:
>>>>> On Fri, Aug 20, 2021 at 12:16:10PM +0300, Ido Schimmel wrote:
>>>>>> On Thu, Aug 19, 2021 at 07:07:18PM +0300, Vladimir Oltean wrote:
>>>>>>> Problem statement:
>>>>>>>
>>>>>>> Any time a driver needs to create a private association between a bridge
>>>>>>> upper interface and use that association within its
>>>>>>> SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE handler, we have an issue with FDB
>>>>>>> entries deleted by the bridge when the port leaves. The issue is that
>>>>>>> all switchdev drivers schedule a work item to have sleepable context,
>>>>>>> and that work item can be actually scheduled after the port has left the
>>>>>>> bridge, which means the association might have already been broken by
>>>>>>> the time the scheduled FDB work item attempts to use it.
>>>>>>
>>>>>> This is handled in mlxsw by telling the device to flush the FDB entries
>>>>>> pointing to the {port, FID} when the VLAN is deleted (synchronously).
>>>>>
>>>>> Again, central solution vs mlxsw solution.
>>>>
>>>> Again, a solution is forced on everyone regardless if it benefits them
>>>> or not. List is bombarded with version after version until patches are
>>>> applied. *EXHAUSTING*.
>>>
>>> So if I replace "bombarded" with a more neutral word, isn't that how
>>> it's done though? What would you do if you wanted to achieve something
>>> but the framework stood in your way? Would you work around it to avoid
>>> bombarding the list?
>>>
>>>> With these patches, except DSA, everyone gets another queue_work() for
>>>> each FDB entry. In some cases, it completely misses the purpose of the
>>>> patchset.
>>>
>>> I also fail to see the point. Patch 3 will have to make things worse
>>> before they get better. It is like that in DSA too, and made more
>>> reasonable only in the last patch from the series.
>>>
>>> If I saw any middle-ground way, like keeping the notifiers on the atomic
>>> chain for unconverted drivers, I would have done it. But what do you do
>>> if more than one driver listens for one event, one driver wants it
>>> blocking, the other wants it atomic. Do you make the bridge emit it
>>> twice? That's even worse than having one useless queue_work() in some
>>> drivers.
>>>
>>> So if you think I can avoid that please tell me how.
>>>
>>
>> Hi,
>> I don't like the double-queuing for each fdb for everyone either, it's forcing them
>> to rework it asap due to inefficiency even though that shouldn't be necessary. In the
>> long run I hope everyone would migrate to such scheme, but perhaps we can do it gradually.
> 
> The fundamental problem is that these operations need to be deferred in
> the first place. It would have been much better if user space could get
> a synchronous feedback.
> 
> It all stems from the fact that control plane operations need to be done
> under a spin lock because the shared databases (e.g., FDB, MDB) or
> states (e.g., STP) that they are updating can also be updated from the
> data plane in softIRQ.
> 

Right, but changing that, as you've noted below, would require moving
the delaying to the bridge, I'd like to avoid that.

> I don't have a clean solution to this problem without doing a surgery in
> the bridge driver. Deferring updates from the data plane using a work
> queue and converting the spin locks to mutexes. This will also allow us
> to emit netlink notifications from process context and convert
> GFP_ATOMIC to GFP_KERNEL.
> 
> Is that something you consider as acceptable? Does anybody have a better
> idea?
> 

Moving the delays to the bridge for this purpose does not sound like a good solution,
I'd prefer the delaying to be done by the interested third party as in this case rather
than the bridge. If there's a solution that avoids delaying and doesn't hurt the software
fast-path then of course I'll be ok with that.
 
>> For most drivers this is introducing more work (as in processing) rather than helping
>> them right now, give them the option to convert to it on their own accord or bite
>> the bullet and convert everyone so the change won't affect them, it holds rtnl, it is blocking
>> I don't see why not convert everyone to just execute their otherwise queued work.
>> I'm sure driver maintainers would appreciate such help and would test and review it. You're
>> halfway there already..
>>
>> Cheers,
>>  Nik
Vladimir Oltean Aug. 22, 2021, 1:31 p.m. UTC | #15
On Sun, Aug 22, 2021 at 12:12:02PM +0300, Nikolay Aleksandrov wrote:
> On 22/08/2021 09:48, Ido Schimmel wrote:
> > On Sat, Aug 21, 2021 at 02:36:26AM +0300, Nikolay Aleksandrov wrote:
> >> On 20/08/2021 20:06, Vladimir Oltean wrote:
> >>> On Fri, Aug 20, 2021 at 07:09:18PM +0300, Ido Schimmel wrote:
> >>>> On Fri, Aug 20, 2021 at 12:37:23PM +0300, Vladimir Oltean wrote:
> >>>>> On Fri, Aug 20, 2021 at 12:16:10PM +0300, Ido Schimmel wrote:
> >>>>>> On Thu, Aug 19, 2021 at 07:07:18PM +0300, Vladimir Oltean wrote:
> >>>>>>> Problem statement:
> >>>>>>>
> >>>>>>> Any time a driver needs to create a private association between a bridge
> >>>>>>> upper interface and use that association within its
> >>>>>>> SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE handler, we have an issue with FDB
> >>>>>>> entries deleted by the bridge when the port leaves. The issue is that
> >>>>>>> all switchdev drivers schedule a work item to have sleepable context,
> >>>>>>> and that work item can be actually scheduled after the port has left the
> >>>>>>> bridge, which means the association might have already been broken by
> >>>>>>> the time the scheduled FDB work item attempts to use it.
> >>>>>>
> >>>>>> This is handled in mlxsw by telling the device to flush the FDB entries
> >>>>>> pointing to the {port, FID} when the VLAN is deleted (synchronously).
> >>>>>
> >>>>> Again, central solution vs mlxsw solution.
> >>>>
> >>>> Again, a solution is forced on everyone regardless if it benefits them
> >>>> or not. List is bombarded with version after version until patches are
> >>>> applied. *EXHAUSTING*.
> >>>
> >>> So if I replace "bombarded" with a more neutral word, isn't that how
> >>> it's done though? What would you do if you wanted to achieve something
> >>> but the framework stood in your way? Would you work around it to avoid
> >>> bombarding the list?
> >>>
> >>>> With these patches, except DSA, everyone gets another queue_work() for
> >>>> each FDB entry. In some cases, it completely misses the purpose of the
> >>>> patchset.
> >>>
> >>> I also fail to see the point. Patch 3 will have to make things worse
> >>> before they get better. It is like that in DSA too, and made more
> >>> reasonable only in the last patch from the series.
> >>>
> >>> If I saw any middle-ground way, like keeping the notifiers on the atomic
> >>> chain for unconverted drivers, I would have done it. But what do you do
> >>> if more than one driver listens for one event, one driver wants it
> >>> blocking, the other wants it atomic. Do you make the bridge emit it
> >>> twice? That's even worse than having one useless queue_work() in some
> >>> drivers.
> >>>
> >>> So if you think I can avoid that please tell me how.
> >>>
> >>
> >> Hi,
> >> I don't like the double-queuing for each fdb for everyone either, it's forcing them
> >> to rework it asap due to inefficiency even though that shouldn't be necessary. In the
> >> long run I hope everyone would migrate to such scheme, but perhaps we can do it gradually.
> > 
> > The fundamental problem is that these operations need to be deferred in
> > the first place. It would have been much better if user space could get
> > a synchronous feedback.
> > 
> > It all stems from the fact that control plane operations need to be done
> > under a spin lock because the shared databases (e.g., FDB, MDB) or
> > states (e.g., STP) that they are updating can also be updated from the
> > data plane in softIRQ.
> > 
> 
> Right, but changing that, as you've noted below, would require moving
> the delaying to the bridge, I'd like to avoid that.
> 
> > I don't have a clean solution to this problem without doing a surgery in
> > the bridge driver. Deferring updates from the data plane using a work
> > queue and converting the spin locks to mutexes. This will also allow us
> > to emit netlink notifications from process context and convert
> > GFP_ATOMIC to GFP_KERNEL.
> > 
> > Is that something you consider as acceptable? Does anybody have a better
> > idea?
> > 
> 
> Moving the delays to the bridge for this purpose does not sound like a good solution,
> I'd prefer the delaying to be done by the interested third party as in this case rather
> than the bridge. If there's a solution that avoids delaying and doesn't hurt the software
> fast-path then of course I'll be ok with that.

Maybe emitting two notifiers, one atomic and one blocking, per FDB
add/del event is not such a stupid idea after all.

Here's an alternative I've been cooking. Obviously it still has pros and
cons. Hopefully by reading the commit message you get the basic idea and
I don't need to post the full series.

-----------------------------[ cut here ]-----------------------------
From 9870699f0fafeb6175af3462173a957ece551322 Mon Sep 17 00:00:00 2001
From: Vladimir Oltean <vladimir.oltean@nxp.com>

Date: Sat, 21 Aug 2021 15:57:40 +0300
Subject: [PATCH] net: switchdev: add an option for
 SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE to be deferred

Most existing switchdev drivers either talk to firmware, or to a device
over a bus where the I/O is sleepable (SPI, I2C, MDIO etc). So there
exists a pattern where drivers make a sleepable context for offloading
the given FDB entry by registering an ordered workqueue and scheduling
work items on it, and doing all the work from there.

This solution works, but there are some issues with it:

1. It creates large amounts of duplication between switchdev drivers,
   and they don't even copy all the right patterns from each other.
   For example:

   * DSA, dpaa2-switch, rocker allocate an ordered workqueue with the
     WQ_MEM_RECLAIM flag and no one knows why.

   * dpaa2-switch, sparx5, am65_cpsw, cpsw, rocker, prestera, all have
     this check, or one very similar to it:

		if (!fdb_info->added_by_user || fdb_info->is_local)
			break; /* do nothing and exit */

     within the actually scheduled workqueue item. That is to say, they
     schedule and take the rtnl_mutex for nothing - every single time
     that an FDB entry is dynamically learned by the software bridge and
     they are not interested in it. Same thing for the *_dev_check
     function - the function which checks if an FDB entry was learned on
     a network interface owned by the driver.

2. The work items scheduled privately by the driver are not synchronous
   with bridge events (i.e. the bridge will not wait for the driver to
   finish deleting an FDB entry before e.g. calling del_nbp and deleting
   that interface as a bridge port). This might matter for middle layers
   like DSA which construct their own API to their downstream consumers
   on top of the switchdev primitives. With the current switchdev API
   design, it is not possible to guarantee that the bridge which
   generated an FDB entry deletion is still the upper interface by the
   time that the work item is scheduled and the FDB deletion is actually
   executed. To obtain this guarantee it would be necessary to introduce
   a refcounting system where the reference to the bridge is kept by DSA
   for as long as there are pending bridge FDB entry additions/deletions.
   Not really ideal if we look at the big picture.

3. There is a larger issue that SWITCHDEV_FDB_ADD_TO_DEVICE events are
   deferred by drivers even from code paths that are initially blocking
   (are running in process context):

br_fdb_add
-> __br_fdb_add
   -> fdb_add_entry
      -> fdb_notify
         -> br_switchdev_fdb_notify

    It seems fairly trivial to move the fdb_notify call outside of the
    atomic section of fdb_add_entry, but with switchdev offering only an
    API where the SWITCHDEV_FDB_ADD_TO_DEVICE is atomic, drivers would
    still have to defer these events and are unable to provide
    synchronous feedback to user space (error codes, extack).

The above issues would warrant an attempt to fix a central problem, and
make switchdev expose an API that is easier to consume rather than
having drivers implement lateral workarounds.

In this case, we must notice that

(a) switchdev already has the concept of notifiers emitted from the fast
    path that are still processed by drivers from blocking context. This
    is accomplished through the SWITCHDEV_F_DEFER flag which is used by
    e.g. SWITCHDEV_OBJ_ID_HOST_MDB.

(b) the bridge del_nbp() function already calls switchdev_deferred_process().
    So if we could hook into that, we could have a chance that the
    bridge simply waits for our FDB entry offloading procedure to finish
    before it calls netdev_upper_dev_unlink() - which is almost
    immediately afterwards, and also when switchdev drivers typically
    break their stateful associations between the bridge upper and
    private data.

So it is in fact possible to use switchdev's generic
switchdev_deferred_enqueue mechanism to get a sleepable callback, and
from there we can call_switchdev_blocking_notifiers().

To address all requirements:

- drivers that are unconverted from atomic to blocking still work
- drivers that currently have a private workqueue are not worse off
- drivers that want the bridge to wait for their deferred work can use
  the bridge's defer mechanism
- a SWITCHDEV_FDB_ADD_TO_DEVICE event which does not have any interested
  parties does not get deferred for no reason, because this takes the
  rtnl_mutex and schedules a worker thread for nothing

it looks like we can in fact start off by emitting
SWITCHDEV_FDB_ADD_TO_DEVICE on the atomic chain. But we add a new bit in
struct switchdev_notifier_fdb_info called "needs_defer", and any
interested party can set this to true.

This way:

- unconverted drivers do their work (i.e. schedule their private work
  item) based on the atomic notifier, and do not set "needs_defer"
- converted drivers only mark "needs_defer" and treat a separate
  notifier, on the blocking chain, called SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED
- SWITCHDEV_FDB_ADD_TO_DEVICE events with no interested party do not
  generate any follow-up SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED

Additionally, code paths that are blocking right now, like br_fdb_replay,
could notify only SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED, as long as all
consumers of the replayed FDB events support that (right now, that is
DSA and dpaa2-switch).

Once all consumers of SWITCHDEV_FDB_ADD_TO_DEVICE are converted to set
needs_defer as appropriate, then the notifiers emitted from process
context by the bridge could call SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED
directly, and we would also have fully blocking context all the way
down, with the opportunity for error propagation and extack.

Some disadvantages of this solution though:

- A driver now needs to check whether it is interested in an event
  twice: first on the atomic call chain, then again on the blocking call
  chain (because it is a notifier chain, it is potentially not the only
  driver subscribed to it, it may be listening to another driver's
  needs_defer request). The flip side: on systems with mixed switchdev
  setups (dpaa2-switch + DSA, and DSA sniffs dynamically learned FDB
  entries on foreign interfaces), there are some "synergies", and the
  SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED event is only emitted once, as
  opposed to what would happen if each driver scheduled its own private
  work item.

- Right now drivers take rtnl_lock() as soon as their private work item
  runs. They need the rtnl_lock for the call to
  call_switchdev_notifiers(SWITCHDEV_FDB_OFFLOADED). But it doesn't
  really seem necessary for them to perform the actual hardware
  manipulation (adding the FDB entry) with the rtnl_lock held (anyway
  most do that). But with the new option of servicing
  SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED, the rtnl_lock is taken top-level
  by switchdev, so even if these drivers wanted to be more self-conscious,
  they couldn't.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>

---
 include/net/switchdev.h   | 26 ++++++++++++++-
 net/bridge/br_switchdev.c |  6 ++--
 net/switchdev/switchdev.c | 69 +++++++++++++++++++++++++++++++++++++++
 3 files changed, 96 insertions(+), 5 deletions(-)

diff --git a/include/net/switchdev.h b/include/net/switchdev.h
index 6764fb7692e2..67ddb80c828f 100644
--- a/include/net/switchdev.h
+++ b/include/net/switchdev.h
@@ -193,6 +193,8 @@ enum switchdev_notifier_type {
 	SWITCHDEV_FDB_DEL_TO_BRIDGE,
 	SWITCHDEV_FDB_ADD_TO_DEVICE,
 	SWITCHDEV_FDB_DEL_TO_DEVICE,
+	SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED, /* Blocking. */
+	SWITCHDEV_FDB_DEL_TO_DEVICE_DEFERRED, /* Blocking. */
 	SWITCHDEV_FDB_OFFLOADED,
 	SWITCHDEV_FDB_FLUSH_TO_BRIDGE,
 
@@ -222,7 +224,8 @@ struct switchdev_notifier_fdb_info {
 	u16 vid;
 	u8 added_by_user:1,
 	   is_local:1,
-	   offloaded:1;
+	   offloaded:1,
+	   needs_defer:1;
 };
 
 struct switchdev_notifier_port_obj_info {
@@ -283,6 +286,13 @@ int switchdev_port_obj_add(struct net_device *dev,
 int switchdev_port_obj_del(struct net_device *dev,
 			   const struct switchdev_obj *obj);
 
+int
+switchdev_fdb_add_to_device(struct net_device *dev,
+			    struct switchdev_notifier_fdb_info *fdb_info);
+int
+switchdev_fdb_del_to_device(struct net_device *dev,
+			    struct switchdev_notifier_fdb_info *fdb_info);
+
 int register_switchdev_notifier(struct notifier_block *nb);
 int unregister_switchdev_notifier(struct notifier_block *nb);
 int call_switchdev_notifiers(unsigned long val, struct net_device *dev,
@@ -386,6 +396,20 @@ static inline int switchdev_port_obj_del(struct net_device *dev,
 	return -EOPNOTSUPP;
 }
 
+static inline int
+switchdev_fdb_add_to_device(struct net_device *dev,
+			    struct switchdev_notifier_fdb_info *fdb_info)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline int
+switchdev_fdb_del_to_device(struct net_device *dev,
+			    struct switchdev_notifier_fdb_info *fdb_info)
+{
+	return -EOPNOTSUPP;
+}
+
 static inline int register_switchdev_notifier(struct notifier_block *nb)
 {
 	return 0;
diff --git a/net/bridge/br_switchdev.c b/net/bridge/br_switchdev.c
index 7e62904089c8..687100ca7088 100644
--- a/net/bridge/br_switchdev.c
+++ b/net/bridge/br_switchdev.c
@@ -140,12 +140,10 @@ br_switchdev_fdb_notify(struct net_bridge *br,
 
 	switch (type) {
 	case RTM_DELNEIGH:
-		call_switchdev_notifiers(SWITCHDEV_FDB_DEL_TO_DEVICE,
-					 dev, &info.info, NULL);
+		switchdev_fdb_del_to_device(dev, &info);
 		break;
 	case RTM_NEWNEIGH:
-		call_switchdev_notifiers(SWITCHDEV_FDB_ADD_TO_DEVICE,
-					 dev, &info.info, NULL);
+		switchdev_fdb_add_to_device(dev, &info);
 		break;
 	}
 }
diff --git a/net/switchdev/switchdev.c b/net/switchdev/switchdev.c
index 0b2c18efc079..d2f0bfc8a0b4 100644
--- a/net/switchdev/switchdev.c
+++ b/net/switchdev/switchdev.c
@@ -378,6 +378,75 @@ int call_switchdev_blocking_notifiers(unsigned long val, struct net_device *dev,
 }
 EXPORT_SYMBOL_GPL(call_switchdev_blocking_notifiers);
 
+static void switchdev_fdb_add_deferred(struct net_device *dev, const void *data)
+{
+	const struct switchdev_notifier_fdb_info *fdb_info = data;
+	struct switchdev_notifier_fdb_info tmp = *fdb_info;
+	int err;
+
+	ASSERT_RTNL();
+	err = call_switchdev_blocking_notifiers(SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED,
+						dev, &tmp.info, NULL);
+	err = notifier_to_errno(err);
+	if (err && err != -EOPNOTSUPP)
+		netdev_err(dev, "failed to add FDB entry: %pe\n", ERR_PTR(err));
+}
+
+static void switchdev_fdb_del_deferred(struct net_device *dev, const void *data)
+{
+	const struct switchdev_notifier_fdb_info *fdb_info = data;
+	struct switchdev_notifier_fdb_info tmp = *fdb_info;
+	int err;
+
+	ASSERT_RTNL();
+	err = call_switchdev_blocking_notifiers(SWITCHDEV_FDB_DEL_TO_DEVICE_DEFERRED,
+						dev, &tmp.info, NULL);
+	err = notifier_to_errno(err);
+	if (err && err != -EOPNOTSUPP)
+		netdev_err(dev, "failed to delete FDB entry: %pe\n",
+			   ERR_PTR(err));
+}
+
+int
+switchdev_fdb_add_to_device(struct net_device *dev,
+			    struct switchdev_notifier_fdb_info *fdb_info)
+{
+	int err;
+
+	err = call_switchdev_notifiers(SWITCHDEV_FDB_ADD_TO_DEVICE, dev,
+				       &fdb_info->info, NULL);
+	err = notifier_to_errno(err);
+	if (err)
+		return err;
+
+	if (!fdb_info->needs_defer)
+		return 0;
+
+	return switchdev_deferred_enqueue(dev, fdb_info, sizeof(*fdb_info),
+					  switchdev_fdb_add_deferred);
+}
+EXPORT_SYMBOL_GPL(switchdev_fdb_add_to_device);
+
+int
+switchdev_fdb_del_to_device(struct net_device *dev,
+			    struct switchdev_notifier_fdb_info *fdb_info)
+{
+	int err;
+
+	err = call_switchdev_notifiers(SWITCHDEV_FDB_DEL_TO_DEVICE, dev,
+				       &fdb_info->info, NULL);
+	err = notifier_to_errno(err);
+	if (err)
+		return err;
+
+	if (!fdb_info->needs_defer)
+		return 0;
+
+	return switchdev_deferred_enqueue(dev, fdb_info, sizeof(*fdb_info),
+					  switchdev_fdb_del_deferred);
+}
+EXPORT_SYMBOL_GPL(switchdev_fdb_del_to_device);
+
 struct switchdev_nested_priv {
 	bool (*check_cb)(const struct net_device *dev);
 	bool (*foreign_dev_check_cb)(const struct net_device *dev,
-----------------------------[ cut here ]-----------------------------
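
For reference, a rough sketch of how a converted driver might consume the
above (hypothetical driver: drv_dev_check(), drv_hw_fdb_add() and
drv_hw_fdb_del() are made-up names; only the needs_defer bit and the
*_DEFERRED events come from the patch):

/* Atomic chain handler: only claim interest, no sleeping allowed here. */
static int drv_switchdev_event(struct notifier_block *nb,
			       unsigned long event, void *ptr)
{
	struct net_device *dev = switchdev_notifier_info_to_dev(ptr);
	struct switchdev_notifier_fdb_info *fdb_info = ptr;

	if (event != SWITCHDEV_FDB_ADD_TO_DEVICE &&
	    event != SWITCHDEV_FDB_DEL_TO_DEVICE)
		return NOTIFY_DONE;

	if (!drv_dev_check(dev) || !fdb_info->added_by_user ||
	    fdb_info->is_local)
		return NOTIFY_DONE;

	/* Ask switchdev to replay this event on the blocking chain. */
	fdb_info->needs_defer = true;

	return NOTIFY_OK;
}

/* Blocking chain handler: sleepable, do the actual hardware access here. */
static int drv_switchdev_blocking_event(struct notifier_block *nb,
					unsigned long event, void *ptr)
{
	struct net_device *dev = switchdev_notifier_info_to_dev(ptr);
	struct switchdev_notifier_fdb_info *fdb_info = ptr;
	int err;

	if (!drv_dev_check(dev))
		return NOTIFY_DONE;

	switch (event) {
	case SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED:
		err = drv_hw_fdb_add(dev, fdb_info->addr, fdb_info->vid);
		return notifier_from_errno(err);
	case SWITCHDEV_FDB_DEL_TO_DEVICE_DEFERRED:
		err = drv_hw_fdb_del(dev, fdb_info->addr, fdb_info->vid);
		return notifier_from_errno(err);
	}

	return NOTIFY_DONE;
}
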
Ido Schimmel Aug. 22, 2021, 5:06 p.m. UTC | #16
On Sun, Aug 22, 2021 at 04:31:45PM +0300, Vladimir Oltean wrote:
> 3. There is a larger issue that SWITCHDEV_FDB_ADD_TO_DEVICE events are

>    deferred by drivers even from code paths that are initially blocking

>    (are running in process context):

> 

> br_fdb_add

> -> __br_fdb_add

>    -> fdb_add_entry

>       -> fdb_notify

>          -> br_switchdev_fdb_notify

> 

>     It seems fairly trivial to move the fdb_notify call outside of the

>     atomic section of fdb_add_entry, but with switchdev offering only an

>     API where the SWITCHDEV_FDB_ADD_TO_DEVICE is atomic, drivers would

>     still have to defer these events and are unable to provide

>     synchronous feedback to user space (error codes, extack).

> 

> The above issues would warrant an attempt to fix a central problem, and

> make switchdev expose an API that is easier to consume rather than

> having drivers implement lateral workarounds.

> 

> In this case, we must notice that

> 

> (a) switchdev already has the concept of notifiers emitted from the fast

>     path that are still processed by drivers from blocking context. This

>     is accomplished through the SWITCHDEV_F_DEFER flag which is used by

>     e.g. SWITCHDEV_OBJ_ID_HOST_MDB.

> 

> (b) the bridge del_nbp() function already calls switchdev_deferred_process().

>     So if we could hook into that, we could have a chance that the

>     bridge simply waits for our FDB entry offloading procedure to finish

>     before it calls netdev_upper_dev_unlink() - which is almost

>     immediately afterwards, and also when switchdev drivers typically

>     break their stateful associations between the bridge upper and

>     private data.

> 

> So it is in fact possible to use switchdev's generic

> switchdev_deferred_enqueue mechanism to get a sleepable callback, and

> from there we can call_switchdev_blocking_notifiers().

> 

> To address all requirements:

> 

> - drivers that are unconverted from atomic to blocking still work

> - drivers that currently have a private workqueue are not worse off

> - drivers that want the bridge to wait for their deferred work can use

>   the bridge's defer mechanism

> - a SWITCHDEV_FDB_ADD_TO_DEVICE event which does not have any interested

>   parties does not get deferred for no reason, because this takes the

>   rtnl_mutex and schedules a worker thread for nothing

> 

> it looks like we can in fact start off by emitting

> SWITCHDEV_FDB_ADD_TO_DEVICE on the atomic chain. But we add a new bit in

> struct switchdev_notifier_fdb_info called "needs_defer", and any

> interested party can set this to true.

> 

> This way:

> 

> - unconverted drivers do their work (i.e. schedule their private work

>   item) based on the atomic notifier, and do not set "needs_defer"

> - converted drivers only mark "needs_defer" and treat a separate

>   notifier, on the blocking chain, called SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED

> - SWITCHDEV_FDB_ADD_TO_DEVICE events with no interested party do not

>   generate any follow-up SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED

> 

> Additionally, code paths that are blocking right not, like br_fdb_replay,

> could notify only SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED, as long as all

> consumers of the replayed FDB events support that (right now, that is

> DSA and dpaa2-switch).

> 

> Once all consumers of SWITCHDEV_FDB_ADD_TO_DEVICE are converted to set

> needs_defer as appropriate, then the notifiers emitted from process

> context by the bridge could call SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED

> directly, and we would also have fully blocking context all the way

> down, with the opportunity for error propagation and extack.


IIUC, at this stage all the FDB notifications drivers get are blocking,
either from the work queue (because they were deferred) or directly from
process context. If so, how do we synchronize the two and ensure drivers
get the notifications in the correct order?

I was thinking of adding all the notifications to the 'deferred' list
when 'hash_lock' is held and then calling switchdev_deferred_process()
directly in process context. It's not very pretty (do we return an error
only for the entry the user added or for any other entry we flushed from
the list?), but I don't have a better idea right now.

> 

> Some disadvantages of this solution though:

> 

> - A driver now needs to check whether it is interested in an event

>   twice: first on the atomic call chain, then again on the blocking call

>   chain (because it is a notifier chain, it is potentially not the only

>   driver subscribed to it, it may be listening to another driver's

>   needs_defer request). The flip side: on sistems with mixed switchdev

>   setups (dpaa2-switch + DSA, and DSA sniffs dynamically learned FDB

>   entries on foreign interfaces), there are some "synergies", and the

>   SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED event is only emitted once, as

>   opposed to what would happen if each driver scheduled its own private

>   work item.

> 

> - Right now drivers take rtnl_lock() as soon as their private work item

>   runs. They need the rtnl_lock for the call to

>   call_switchdev_notifiers(SWITCHDEV_FDB_OFFLOADED). 


I think RCU is enough?

>   But it doesn't really seem necessary for them to perform the actual

>   hardware manipulation (adding the FDB entry) with the rtnl_lock held

>   (anyway most do that). But with the new option of servicing

>   SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED, the rtnl_lock is taken

>   top-level by switchdev, so even if these drivers wanted to be more

>   self-conscious, they couldn't.


Yes, I want to remove this dependency in mlxsw assuming notifications
remain atomic. The more pressing issue is actually removing it from the
learning path.
Vladimir Oltean Aug. 22, 2021, 5:44 p.m. UTC | #17
On Sun, Aug 22, 2021 at 08:06:00PM +0300, Ido Schimmel wrote:
> On Sun, Aug 22, 2021 at 04:31:45PM +0300, Vladimir Oltean wrote:

> > 3. There is a larger issue that SWITCHDEV_FDB_ADD_TO_DEVICE events are

> >    deferred by drivers even from code paths that are initially blocking

> >    (are running in process context):

> > 

> > br_fdb_add

> > -> __br_fdb_add

> >    -> fdb_add_entry

> >       -> fdb_notify

> >          -> br_switchdev_fdb_notify

> > 

> >     It seems fairly trivial to move the fdb_notify call outside of the

> >     atomic section of fdb_add_entry, but with switchdev offering only an

> >     API where the SWITCHDEV_FDB_ADD_TO_DEVICE is atomic, drivers would

> >     still have to defer these events and are unable to provide

> >     synchronous feedback to user space (error codes, extack).

> > 

> > The above issues would warrant an attempt to fix a central problem, and

> > make switchdev expose an API that is easier to consume rather than

> > having drivers implement lateral workarounds.

> > 

> > In this case, we must notice that

> > 

> > (a) switchdev already has the concept of notifiers emitted from the fast

> >     path that are still processed by drivers from blocking context. This

> >     is accomplished through the SWITCHDEV_F_DEFER flag which is used by

> >     e.g. SWITCHDEV_OBJ_ID_HOST_MDB.

> > 

> > (b) the bridge del_nbp() function already calls switchdev_deferred_process().

> >     So if we could hook into that, we could have a chance that the

> >     bridge simply waits for our FDB entry offloading procedure to finish

> >     before it calls netdev_upper_dev_unlink() - which is almost

> >     immediately afterwards, and also when switchdev drivers typically

> >     break their stateful associations between the bridge upper and

> >     private data.

> > 

> > So it is in fact possible to use switchdev's generic

> > switchdev_deferred_enqueue mechanism to get a sleepable callback, and

> > from there we can call_switchdev_blocking_notifiers().

> > 

> > To address all requirements:

> > 

> > - drivers that are unconverted from atomic to blocking still work

> > - drivers that currently have a private workqueue are not worse off

> > - drivers that want the bridge to wait for their deferred work can use

> >   the bridge's defer mechanism

> > - a SWITCHDEV_FDB_ADD_TO_DEVICE event which does not have any interested

> >   parties does not get deferred for no reason, because this takes the

> >   rtnl_mutex and schedules a worker thread for nothing

> > 

> > it looks like we can in fact start off by emitting

> > SWITCHDEV_FDB_ADD_TO_DEVICE on the atomic chain. But we add a new bit in

> > struct switchdev_notifier_fdb_info called "needs_defer", and any

> > interested party can set this to true.

> > 

> > This way:

> > 

> > - unconverted drivers do their work (i.e. schedule their private work

> >   item) based on the atomic notifier, and do not set "needs_defer"

> > - converted drivers only mark "needs_defer" and treat a separate

> >   notifier, on the blocking chain, called SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED

> > - SWITCHDEV_FDB_ADD_TO_DEVICE events with no interested party do not

> >   generate any follow-up SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED

> > 

> > Additionally, code paths that are blocking right not, like br_fdb_replay,

> > could notify only SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED, as long as all

> > consumers of the replayed FDB events support that (right now, that is

> > DSA and dpaa2-switch).

> > 

> > Once all consumers of SWITCHDEV_FDB_ADD_TO_DEVICE are converted to set

> > needs_defer as appropriate, then the notifiers emitted from process

> > context by the bridge could call SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED

> > directly, and we would also have fully blocking context all the way

> > down, with the opportunity for error propagation and extack.

> 

> IIUC, at this stage all the FDB notifications drivers get are blocking,

> either from the work queue (because they were deferred) or directly from

> process context. If so, how do we synchronize the two and ensure drivers

> get the notifications at the correct order?


What does 'at this stage' mean? Does it mean 'assuming the patch we're
discussing now gets accepted'? If that's what it means, then 'at this
stage' all drivers would first receive the atomic FDB_ADD_TO_DEVICE,
then would set needs_defer, then would receive the blocking
FDB_ADD_TO_DEVICE.

Thinking a bit more - this two-stage notification process ends up being
more efficient for br_fdb_replay too. We don't queue up FDB entries
except if the driver tells us that it wants to process them in blocking
context.

> I was thinking of adding all the notifications to the 'deferred' list

> when 'hash_lock' is held and then calling switchdev_deferred_process()

> directly in process context. It's not very pretty (do we return an error

> only for the entry the user added or for any other entry we flushed from

> the list?), but I don't have a better idea right now.


I was thinking to add a switchdev_fdb_add_to_device_now(). As opposed to
the switchdev_fdb_add_to_device() which defers, this does not defer at
all but just call_switchdev_blocking_notifiers(). So it would not go
through switchdev_deferred_enqueue.  For the code path I talked above,
we would temporarily drop the spin_lock, then call
switchdev_fdb_add_to_device_now(), then if that fails, take the
spin_lock again and delete the software fdb entry we've just added.
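
As a rough illustration (hypothetical code, not part of the posted patch;
it only reuses the SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED notifier and the
embedded 'info' member from it), the _now() variant could be as simple as:

/* Hypothetical: notify the blocking chain in place, no deferral.
 * Assumes process context with rtnl_mutex held by the caller.
 */
int switchdev_fdb_add_to_device_now(struct net_device *dev,
				    struct switchdev_notifier_fdb_info *fdb_info)
{
	int err;

	ASSERT_RTNL();

	err = call_switchdev_blocking_notifiers(SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED,
						dev, &fdb_info->info, NULL);
	return notifier_to_errno(err);
}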

So as long as we use a _now() variant and don't resynchronize with the
deferred work, we shouldn't have any ordering issues, or am I
misunderstanding your question?

> 

> > 

> > Some disadvantages of this solution though:

> > 

> > - A driver now needs to check whether it is interested in an event

> >   twice: first on the atomic call chain, then again on the blocking call

> >   chain (because it is a notifier chain, it is potentially not the only

> >   driver subscribed to it, it may be listening to another driver's

> >   needs_defer request). The flip side: on sistems with mixed switchdev

> >   setups (dpaa2-switch + DSA, and DSA sniffs dynamically learned FDB

> >   entries on foreign interfaces), there are some "synergies", and the

> >   SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED event is only emitted once, as

> >   opposed to what would happen if each driver scheduled its own private

> >   work item.

> > 

> > - Right now drivers take rtnl_lock() as soon as their private work item

> >   runs. They need the rtnl_lock for the call to

> >   call_switchdev_notifiers(SWITCHDEV_FDB_OFFLOADED). 

> 

> I think RCU is enough?


Maybe, I haven't experimented with it. I thought br_fdb_offloaded_set
would notify back rtnetlink, but it looks like it doesn't.

> >   But it doesn't really seem necessary for them to perform the actual

> >   hardware manipulation (adding the FDB entry) with the rtnl_lock held

> >   (anyway most do that). But with the new option of servicing

> >   SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED, the rtnl_lock is taken

> >   top-level by switchdev, so even if these drivers wanted to be more

> >   self-conscious, they couldn't.

> 

> Yes, I want to remove this dependency in mlxsw assuming notifications

> remain atomic. The more pressing issue is actually removing it from the

> learning path.


Bah, I understand where you're coming from, but it would be tricky to
remove the rtnl_lock from switchdev_deferred_process_work (that's what
it boils down to). My switchdev_handle_fdb_add_to_device helper currently
assumes rcu_read_lock(). With the blocking variant of SWITCHDEV_FDB_ADD_TO_DEVICE,
it would still need to traverse the netdev adjacency lists, so it would
need the rtnl_mutex for that. If we remove the rtnl_lock from
switchdev_deferred_process_work we'd have to add it back in DSA and to
any other callers of switchdev_handle_fdb_add_to_device.
Ido Schimmel Aug. 23, 2021, 10:47 a.m. UTC | #18
On Sun, Aug 22, 2021 at 08:44:49PM +0300, Vladimir Oltean wrote:
> On Sun, Aug 22, 2021 at 08:06:00PM +0300, Ido Schimmel wrote:

> > On Sun, Aug 22, 2021 at 04:31:45PM +0300, Vladimir Oltean wrote:

> > > 3. There is a larger issue that SWITCHDEV_FDB_ADD_TO_DEVICE events are

> > >    deferred by drivers even from code paths that are initially blocking

> > >    (are running in process context):

> > > 

> > > br_fdb_add

> > > -> __br_fdb_add

> > >    -> fdb_add_entry

> > >       -> fdb_notify

> > >          -> br_switchdev_fdb_notify

> > > 

> > >     It seems fairly trivial to move the fdb_notify call outside of the

> > >     atomic section of fdb_add_entry, but with switchdev offering only an

> > >     API where the SWITCHDEV_FDB_ADD_TO_DEVICE is atomic, drivers would

> > >     still have to defer these events and are unable to provide

> > >     synchronous feedback to user space (error codes, extack).

> > > 

> > > The above issues would warrant an attempt to fix a central problem, and

> > > make switchdev expose an API that is easier to consume rather than

> > > having drivers implement lateral workarounds.

> > > 

> > > In this case, we must notice that

> > > 

> > > (a) switchdev already has the concept of notifiers emitted from the fast

> > >     path that are still processed by drivers from blocking context. This

> > >     is accomplished through the SWITCHDEV_F_DEFER flag which is used by

> > >     e.g. SWITCHDEV_OBJ_ID_HOST_MDB.

> > > 

> > > (b) the bridge del_nbp() function already calls switchdev_deferred_process().

> > >     So if we could hook into that, we could have a chance that the

> > >     bridge simply waits for our FDB entry offloading procedure to finish

> > >     before it calls netdev_upper_dev_unlink() - which is almost

> > >     immediately afterwards, and also when switchdev drivers typically

> > >     break their stateful associations between the bridge upper and

> > >     private data.

> > > 

> > > So it is in fact possible to use switchdev's generic

> > > switchdev_deferred_enqueue mechanism to get a sleepable callback, and

> > > from there we can call_switchdev_blocking_notifiers().

> > > 

> > > To address all requirements:

> > > 

> > > - drivers that are unconverted from atomic to blocking still work

> > > - drivers that currently have a private workqueue are not worse off

> > > - drivers that want the bridge to wait for their deferred work can use

> > >   the bridge's defer mechanism

> > > - a SWITCHDEV_FDB_ADD_TO_DEVICE event which does not have any interested

> > >   parties does not get deferred for no reason, because this takes the

> > >   rtnl_mutex and schedules a worker thread for nothing

> > > 

> > > it looks like we can in fact start off by emitting

> > > SWITCHDEV_FDB_ADD_TO_DEVICE on the atomic chain. But we add a new bit in

> > > struct switchdev_notifier_fdb_info called "needs_defer", and any

> > > interested party can set this to true.

> > > 

> > > This way:

> > > 

> > > - unconverted drivers do their work (i.e. schedule their private work

> > >   item) based on the atomic notifier, and do not set "needs_defer"

> > > - converted drivers only mark "needs_defer" and treat a separate

> > >   notifier, on the blocking chain, called SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED

> > > - SWITCHDEV_FDB_ADD_TO_DEVICE events with no interested party do not

> > >   generate any follow-up SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED

> > > 

> > > Additionally, code paths that are blocking right not, like br_fdb_replay,

> > > could notify only SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED, as long as all

> > > consumers of the replayed FDB events support that (right now, that is

> > > DSA and dpaa2-switch).

> > > 

> > > Once all consumers of SWITCHDEV_FDB_ADD_TO_DEVICE are converted to set

> > > needs_defer as appropriate, then the notifiers emitted from process

> > > context by the bridge could call SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED

> > > directly, and we would also have fully blocking context all the way

> > > down, with the opportunity for error propagation and extack.

> > 

> > IIUC, at this stage all the FDB notifications drivers get are blocking,

> > either from the work queue (because they were deferred) or directly from

> > process context. If so, how do we synchronize the two and ensure drivers

> > get the notifications at the correct order?

> 

> What does 'at this stage' mean? Does it mean 'assuming the patch we're

> discussing now gets accepted'? If that's what it means, then 'at this

> stage' all drivers would first receive the atomic FDB_ADD_TO_DEVICE,

> then would set needs_defer, then would receive the blocking

> FDB_ADD_TO_DEVICE.


I meant after:

"Once all consumers of SWITCHDEV_FDB_ADD_TO_DEVICE are converted to set
needs_defer as appropriate, then the notifiers emitted from process
context by the bridge could call SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED
directly, and we would also have fully blocking context all the way
down, with the opportunity for error propagation and extack."

IIUC, after the conversion the 'needs_defer' is gone and all the FDB
events are blocking? Either from syscall context or the workqueue.

If so, I'm not sure how we synchronize the two. That is, making sure
that an event from syscall context does not reach drivers before an
earlier event that was added to the 'deferred' list.

I mean, in syscall context we are holding RTNL so whatever is already on
the 'deferred' list cannot be dequeued and processed.


> 

> Thinking a bit more - this two-stage notification process ends up being

> more efficient for br_fdb_replay too. We don't queue up FDB entries

> except if the driver tells us that it wants to process them in blocking

> context.

> 

> > I was thinking of adding all the notifications to the 'deferred' list

> > when 'hash_lock' is held and then calling switchdev_deferred_process()

> > directly in process context. It's not very pretty (do we return an error

> > only for the entry the user added or for any other entry we flushed from

> > the list?), but I don't have a better idea right now.

> 

> I was thinking to add a switchdev_fdb_add_to_device_now(). As opposed to

> the switchdev_fdb_add_to_device() which defers, this does not defer at

> all but just call_blocking_switchdev_notifiers(). So it would not go

> through switchdev_deferred_enqueue.  For the code path I talked above,

> we would temporarily drop the spin_lock, then call

> switchdev_fdb_add_to_device_now(), then if that fails, take the

> spin_lock again and delete the software fdb entry we've just added.

> 

> So as long as we use a _now() variant and don't resynchronize with the

> deferred work, we shouldn't have any ordering issues, or am I

> misunderstanding your question?


Not sure I'm following. I tried to explain above.

> 

> > 

> > > 

> > > Some disadvantages of this solution though:

> > > 

> > > - A driver now needs to check whether it is interested in an event

> > >   twice: first on the atomic call chain, then again on the blocking call

> > >   chain (because it is a notifier chain, it is potentially not the only

> > >   driver subscribed to it, it may be listening to another driver's

> > >   needs_defer request). The flip side: on sistems with mixed switchdev

> > >   setups (dpaa2-switch + DSA, and DSA sniffs dynamically learned FDB

> > >   entries on foreign interfaces), there are some "synergies", and the

> > >   SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED event is only emitted once, as

> > >   opposed to what would happen if each driver scheduled its own private

> > >   work item.

> > > 

> > > - Right now drivers take rtnl_lock() as soon as their private work item

> > >   runs. They need the rtnl_lock for the call to

> > >   call_switchdev_notifiers(SWITCHDEV_FDB_OFFLOADED). 

> > 

> > I think RCU is enough?

> 

> Maybe, I haven't experimented with it. I thought br_fdb_offloaded_set

> would notify back rtnetlink, but it looks like it doesn't.


You mean emit a RTM_NEWNEIGH? This can be done even without RTNL (from
the data path, for example).

> 

> > >   But it doesn't really seem necessary for them to perform the actual

> > >   hardware manipulation (adding the FDB entry) with the rtnl_lock held

> > >   (anyway most do that). But with the new option of servicing

> > >   SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED, the rtnl_lock is taken

> > >   top-level by switchdev, so even if these drivers wanted to be more

> > >   self-conscious, they couldn't.

> > 

> > Yes, I want to remove this dependency in mlxsw assuming notifications

> > remain atomic. The more pressing issue is actually removing it from the

> > learning path.

> 

> Bah, I understand where you're coming from, but it would be tricky to

> remove the rtnl_lock from switchdev_deferred_process_work (that's what

> it boils down to). My switchdev_handle_fdb_add_to_device helper currently

> assumes rcu_read_lock(). With the blocking variant of SWITCHDEV_FDB_ADD_TO_DEVICE,

> it would still need to traverse the netdev adjacency lists, so it would

> need the rtnl_mutex for that. If we remove the rtnl_lock from

> switchdev_deferred_process_work we'd have to add it back in DSA and to

> any other callers of switchdev_handle_fdb_add_to_device.
Vladimir Oltean Aug. 23, 2021, 11 a.m. UTC | #19
On Mon, Aug 23, 2021 at 01:47:57PM +0300, Ido Schimmel wrote:
> On Sun, Aug 22, 2021 at 08:44:49PM +0300, Vladimir Oltean wrote:

> > On Sun, Aug 22, 2021 at 08:06:00PM +0300, Ido Schimmel wrote:

> > > On Sun, Aug 22, 2021 at 04:31:45PM +0300, Vladimir Oltean wrote:

> > > > 3. There is a larger issue that SWITCHDEV_FDB_ADD_TO_DEVICE events are

> > > >    deferred by drivers even from code paths that are initially blocking

> > > >    (are running in process context):

> > > >

> > > > br_fdb_add

> > > > -> __br_fdb_add

> > > >    -> fdb_add_entry

> > > >       -> fdb_notify

> > > >          -> br_switchdev_fdb_notify

> > > >

> > > >     It seems fairly trivial to move the fdb_notify call outside of the

> > > >     atomic section of fdb_add_entry, but with switchdev offering only an

> > > >     API where the SWITCHDEV_FDB_ADD_TO_DEVICE is atomic, drivers would

> > > >     still have to defer these events and are unable to provide

> > > >     synchronous feedback to user space (error codes, extack).

> > > >

> > > > The above issues would warrant an attempt to fix a central problem, and

> > > > make switchdev expose an API that is easier to consume rather than

> > > > having drivers implement lateral workarounds.

> > > >

> > > > In this case, we must notice that

> > > >

> > > > (a) switchdev already has the concept of notifiers emitted from the fast

> > > >     path that are still processed by drivers from blocking context. This

> > > >     is accomplished through the SWITCHDEV_F_DEFER flag which is used by

> > > >     e.g. SWITCHDEV_OBJ_ID_HOST_MDB.

> > > >

> > > > (b) the bridge del_nbp() function already calls switchdev_deferred_process().

> > > >     So if we could hook into that, we could have a chance that the

> > > >     bridge simply waits for our FDB entry offloading procedure to finish

> > > >     before it calls netdev_upper_dev_unlink() - which is almost

> > > >     immediately afterwards, and also when switchdev drivers typically

> > > >     break their stateful associations between the bridge upper and

> > > >     private data.

> > > >

> > > > So it is in fact possible to use switchdev's generic

> > > > switchdev_deferred_enqueue mechanism to get a sleepable callback, and

> > > > from there we can call_switchdev_blocking_notifiers().

> > > >

> > > > To address all requirements:

> > > >

> > > > - drivers that are unconverted from atomic to blocking still work

> > > > - drivers that currently have a private workqueue are not worse off

> > > > - drivers that want the bridge to wait for their deferred work can use

> > > >   the bridge's defer mechanism

> > > > - a SWITCHDEV_FDB_ADD_TO_DEVICE event which does not have any interested

> > > >   parties does not get deferred for no reason, because this takes the

> > > >   rtnl_mutex and schedules a worker thread for nothing

> > > >

> > > > it looks like we can in fact start off by emitting

> > > > SWITCHDEV_FDB_ADD_TO_DEVICE on the atomic chain. But we add a new bit in

> > > > struct switchdev_notifier_fdb_info called "needs_defer", and any

> > > > interested party can set this to true.

> > > >

> > > > This way:

> > > >

> > > > - unconverted drivers do their work (i.e. schedule their private work

> > > >   item) based on the atomic notifier, and do not set "needs_defer"

> > > > - converted drivers only mark "needs_defer" and treat a separate

> > > >   notifier, on the blocking chain, called SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED

> > > > - SWITCHDEV_FDB_ADD_TO_DEVICE events with no interested party do not

> > > >   generate any follow-up SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED

> > > >

> > > > Additionally, code paths that are blocking right not, like br_fdb_replay,

> > > > could notify only SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED, as long as all

> > > > consumers of the replayed FDB events support that (right now, that is

> > > > DSA and dpaa2-switch).

> > > >

> > > > Once all consumers of SWITCHDEV_FDB_ADD_TO_DEVICE are converted to set

> > > > needs_defer as appropriate, then the notifiers emitted from process

> > > > context by the bridge could call SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED

> > > > directly, and we would also have fully blocking context all the way

> > > > down, with the opportunity for error propagation and extack.

> > >

> > > IIUC, at this stage all the FDB notifications drivers get are blocking,

> > > either from the work queue (because they were deferred) or directly from

> > > process context. If so, how do we synchronize the two and ensure drivers

> > > get the notifications at the correct order?

> >

> > What does 'at this stage' mean? Does it mean 'assuming the patch we're

> > discussing now gets accepted'? If that's what it means, then 'at this

> > stage' all drivers would first receive the atomic FDB_ADD_TO_DEVICE,

> > then would set needs_defer, then would receive the blocking

> > FDB_ADD_TO_DEVICE.

>

> I meant after:

>

> "Once all consumers of SWITCHDEV_FDB_ADD_TO_DEVICE are converted to set

> needs_defer as appropriate, then the notifiers emitted from process

> context by the bridge could call SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED

> directly, and we would also have fully blocking context all the way

> down, with the opportunity for error propagation and extack."

>

> IIUC, after the conversion the 'needs_defer' is gone and all the FDB

> events are blocking? Either from syscall context or the workqueue.


We would not delete 'needs_defer'. It still offers a useful preliminary
filtering mechanism for the fast path (and for br_fdb_replay). In
retrospect, the SWITCHDEV_OBJ_ID_HOST_MDB would also benefit from 'needs_defer'
instead of jumping to blocking context (if we care so much about performance).

If a FDB event does not need to be processed by anyone (dynamically
learned entry on a switchdev port), the bridge notifies the atomic call
chain for the sake of it, but not the blocking chain.

> If so, I'm not sure how we synchronize the two. That is, making sure

> that an event from syscall context does not reach drivers before an

> earlier event that was added to the 'deferred' list.

>

> I mean, in syscall context we are holding RTNL so whatever is already on

> the 'deferred' list cannot be dequeued and processed.


So switchdev_deferred_process() has ASSERT_RTNL. If we call
switchdev_deferred_process() right before adding the blocking FDB entry
in process context (and we already hold rtnl_mutex), I thought that would
be enough to ensure we have a synchronization point: Everything that was
scheduled before is flushed now, everything that is scheduled while we
are running will run after we unlock the rtnl_mutex. Is that not the
order we expect? I mean, if there is a fast path FDB entry being learned
/ deleted while user space say adds that same FDB entry as static, how
is the relative ordering ensured between the two?
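
To put the first part in code, a minimal sketch of the synchronization
point I am describing (hypothetical helper; assumes the br_fdb_add()
syscall path already holds rtnl_mutex):

static int br_fdb_notify_blocking(struct net_device *dev,
				  struct switchdev_notifier_fdb_info *fdb_info)
{
	int err;

	ASSERT_RTNL();

	/* Flush whatever the fast path enqueued before this point, so the
	 * blocking notification below cannot overtake older deferred work.
	 */
	switchdev_deferred_process();

	err = call_switchdev_blocking_notifiers(SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED,
						dev, &fdb_info->info, NULL);
	return notifier_to_errno(err);
}
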
Ido Schimmel Aug. 23, 2021, 12:16 p.m. UTC | #20
On Mon, Aug 23, 2021 at 02:00:46PM +0300, Vladimir Oltean wrote:
> On Mon, Aug 23, 2021 at 01:47:57PM +0300, Ido Schimmel wrote:

> > On Sun, Aug 22, 2021 at 08:44:49PM +0300, Vladimir Oltean wrote:

> > > On Sun, Aug 22, 2021 at 08:06:00PM +0300, Ido Schimmel wrote:

> > > > On Sun, Aug 22, 2021 at 04:31:45PM +0300, Vladimir Oltean wrote:

> > > > > 3. There is a larger issue that SWITCHDEV_FDB_ADD_TO_DEVICE events are

> > > > >    deferred by drivers even from code paths that are initially blocking

> > > > >    (are running in process context):

> > > > >

> > > > > br_fdb_add

> > > > > -> __br_fdb_add

> > > > >    -> fdb_add_entry

> > > > >       -> fdb_notify

> > > > >          -> br_switchdev_fdb_notify

> > > > >

> > > > >     It seems fairly trivial to move the fdb_notify call outside of the

> > > > >     atomic section of fdb_add_entry, but with switchdev offering only an

> > > > >     API where the SWITCHDEV_FDB_ADD_TO_DEVICE is atomic, drivers would

> > > > >     still have to defer these events and are unable to provide

> > > > >     synchronous feedback to user space (error codes, extack).

> > > > >

> > > > > The above issues would warrant an attempt to fix a central problem, and

> > > > > make switchdev expose an API that is easier to consume rather than

> > > > > having drivers implement lateral workarounds.

> > > > >

> > > > > In this case, we must notice that

> > > > >

> > > > > (a) switchdev already has the concept of notifiers emitted from the fast

> > > > >     path that are still processed by drivers from blocking context. This

> > > > >     is accomplished through the SWITCHDEV_F_DEFER flag which is used by

> > > > >     e.g. SWITCHDEV_OBJ_ID_HOST_MDB.

> > > > >

> > > > > (b) the bridge del_nbp() function already calls switchdev_deferred_process().

> > > > >     So if we could hook into that, we could have a chance that the

> > > > >     bridge simply waits for our FDB entry offloading procedure to finish

> > > > >     before it calls netdev_upper_dev_unlink() - which is almost

> > > > >     immediately afterwards, and also when switchdev drivers typically

> > > > >     break their stateful associations between the bridge upper and

> > > > >     private data.

> > > > >

> > > > > So it is in fact possible to use switchdev's generic

> > > > > switchdev_deferred_enqueue mechanism to get a sleepable callback, and

> > > > > from there we can call_switchdev_blocking_notifiers().

> > > > >

> > > > > To address all requirements:

> > > > >

> > > > > - drivers that are unconverted from atomic to blocking still work

> > > > > - drivers that currently have a private workqueue are not worse off

> > > > > - drivers that want the bridge to wait for their deferred work can use

> > > > >   the bridge's defer mechanism

> > > > > - a SWITCHDEV_FDB_ADD_TO_DEVICE event which does not have any interested

> > > > >   parties does not get deferred for no reason, because this takes the

> > > > >   rtnl_mutex and schedules a worker thread for nothing

> > > > >

> > > > > it looks like we can in fact start off by emitting

> > > > > SWITCHDEV_FDB_ADD_TO_DEVICE on the atomic chain. But we add a new bit in

> > > > > struct switchdev_notifier_fdb_info called "needs_defer", and any

> > > > > interested party can set this to true.

> > > > >

> > > > > This way:

> > > > >

> > > > > - unconverted drivers do their work (i.e. schedule their private work

> > > > >   item) based on the atomic notifier, and do not set "needs_defer"

> > > > > - converted drivers only mark "needs_defer" and treat a separate

> > > > >   notifier, on the blocking chain, called SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED

> > > > > - SWITCHDEV_FDB_ADD_TO_DEVICE events with no interested party do not

> > > > >   generate any follow-up SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED

> > > > >

> > > > > Additionally, code paths that are blocking right not, like br_fdb_replay,

> > > > > could notify only SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED, as long as all

> > > > > consumers of the replayed FDB events support that (right now, that is

> > > > > DSA and dpaa2-switch).

> > > > >

> > > > > Once all consumers of SWITCHDEV_FDB_ADD_TO_DEVICE are converted to set

> > > > > needs_defer as appropriate, then the notifiers emitted from process

> > > > > context by the bridge could call SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED

> > > > > directly, and we would also have fully blocking context all the way

> > > > > down, with the opportunity for error propagation and extack.

> > > >

> > > > IIUC, at this stage all the FDB notifications drivers get are blocking,

> > > > either from the work queue (because they were deferred) or directly from

> > > > process context. If so, how do we synchronize the two and ensure drivers

> > > > get the notifications at the correct order?

> > >

> > > What does 'at this stage' mean? Does it mean 'assuming the patch we're

> > > discussing now gets accepted'? If that's what it means, then 'at this

> > > stage' all drivers would first receive the atomic FDB_ADD_TO_DEVICE,

> > > then would set needs_defer, then would receive the blocking

> > > FDB_ADD_TO_DEVICE.

> >

> > I meant after:

> >

> > "Once all consumers of SWITCHDEV_FDB_ADD_TO_DEVICE are converted to set

> > needs_defer as appropriate, then the notifiers emitted from process

> > context by the bridge could call SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED

> > directly, and we would also have fully blocking context all the way

> > down, with the opportunity for error propagation and extack."

> >

> > IIUC, after the conversion the 'needs_defer' is gone and all the FDB

> > events are blocking? Either from syscall context or the workqueue.

> 

> We would not delete 'needs_defer'. It still offers a useful preliminary

> filtering mechanism for the fast path (and for br_fdb_replay). In

> retrospect, the SWITCHDEV_OBJ_ID_HOST_MDB would also benefit from 'needs_defer'

> instead of jumping to blocking context (if we care so much about performance).

> 

> If a FDB event does not need to be processed by anyone (dynamically

> learned entry on a switchdev port), the bridge notifies the atomic call

> chain for the sake of it, but not the blocking chain.

> 

> > If so, I'm not sure how we synchronize the two. That is, making sure

> > that an event from syscall context does not reach drivers before an

> > earlier event that was added to the 'deferred' list.

> >

> > I mean, in syscall context we are holding RTNL so whatever is already on

> > the 'deferred' list cannot be dequeued and processed.

> 

> So switchdev_deferred_process() has ASSERT_RTNL. If we call

> switchdev_deferred_process() right before adding the blocking FDB entry

> in process context (and we already hold rtnl_mutex), I though that would

> be enough to ensure we have a synchronization point: Everything that was

> scheduled before is flushed now, everything that is scheduled while we

> are running will run after we unlock the rtnl_mutex. Is that not the

> order we expect? I mean, if there is a fast path FDB entry being learned

> / deleted while user space say adds that same FDB entry as static, how

> is the relative ordering ensured between the two?


I was thinking about the following case:

t0 - <MAC1,VID1,P1> is added in syscall context under 'hash_lock'
t1 - br_fdb_delete_by_port() flushes entries under 'hash_lock' in
     response to STP state. Notifications are added to 'deferred' list
t2 - switchdev_deferred_process() is called in syscall context
t3 - <MAC1,VID1,P1> is notified as blocking

Updates to the SW FDB are protected by 'hash_lock', but updates to the
HW FDB are not. In this case, <MAC1,VID1,P1> does not exist in SW, but
it will exist in HW.

Another case assuming switchdev_deferred_process() is called first:

t0 - switchdev_deferred_process() is called in syscall context
t1 - <MAC1,VID1,P1> is learned under 'hash_lock'. Notification is added
     to 'deferred' list
t2 - <MAC1,VID1,P1> is modified in syscall context under 'hash_lock' to
     <MAC1,VID1,P2>
t3 - <MAC1,VID1,P2> is notified as blocking
t4 - <MAC1,VID1,P1> is notified as blocking (next time the 'deferred'
     list is processed)

In this case, the HW will have <MAC1,VID1,P1>, but SW will have
<MAC1,VID1,P2>
Vladimir Oltean Aug. 23, 2021, 2:29 p.m. UTC | #21
On Mon, Aug 23, 2021 at 03:16:48PM +0300, Ido Schimmel wrote:
> I was thinking about the following case:

>

> t0 - <MAC1,VID1,P1> is added in syscall context under 'hash_lock'

> t1 - br_fdb_delete_by_port() flushes entries under 'hash_lock' in

>      response to STP state. Notifications are added to 'deferred' list

> t2 - switchdev_deferred_process() is called in syscall context

> t3 - <MAC1,VID1,P1> is notified as blocking

>

> Updates to the SW FDB are protected by 'hash_lock', but updates to the

> HW FDB are not. In this case, <MAC1,VID1,P1> does not exist in SW, but

> it will exist in HW.

>

> Another case assuming switchdev_deferred_process() is called first:

>

> t0 - switchdev_deferred_process() is called in syscall context

> t1 - <MAC1,VID,P1> is learned under 'hash_lock'. Notification is added

>      to 'deferred' list

> t2 - <MAC1,VID1,P1> is modified in syscall context under 'hash_lock' to

>      <MAC1,VID1,P2>

> t3 - <MAC1,VID1,P2> is notified as blocking

> t4 - <MAC1,VID1,P1> is notified as blocking (next time the 'deferred'

>      list is processed)

>

> In this case, the HW will have <MAC1,VID1,P1>, but SW will have

> <MAC1,VID1,P2>


Ok, so if the hardware FDB entry needs to be updated under the same
hash_lock as the software FDB entry, then it seems that the goal of
updating the hardware FDB synchronously and in a sleepable manner can
only be met if the data path defers the learning to sleepable context
too. That in turn means that there will be 'dead time' between the
reception of a packet from a given {MAC SA, VID} flow and the learning
of that address, which I don't think is really desirable. So I don't
know if it is actually realistic to do this.

Can we drop it from the requirements of this change, or do you feel like
it's not worth it to make my change if this problem is not solved?

There is of course the option of going half-way too, just like for
SWITCHDEV_PORT_ATTR_SET. You notify it once, synchronously, on the
atomic chain, the switchdev driver throws as many errors as it reasonably
can, then you defer the actual installation, which means a hardware access.
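
A hedged sketch of what that half-way model could look like from a
driver's perspective (all drv_* names are made up; needs_defer is the bit
from the patch earlier in the thread, and it assumes an extack can be
threaded down to the atomic call):

/* Atomic chain: cheap validation only, errors can still be reported. */
static int drv_fdb_prepare(struct drv_priv *priv,
			   struct switchdev_notifier_fdb_info *fdb_info,
			   struct netlink_ext_ack *extack)
{
	if (fdb_info->vid && !drv_vid_is_offloaded(priv, fdb_info->vid)) {
		NL_SET_ERR_MSG_MOD(extack, "VLAN is not offloaded");
		return -EOPNOTSUPP;
	}

	/* Ask for the deferred, blocking follow-up notification. */
	fdb_info->needs_defer = true;

	return 0;
}

/* Blocking chain (deferred): the actual, sleepable hardware access. */
static int drv_fdb_commit(struct drv_priv *priv,
			  const struct switchdev_notifier_fdb_info *fdb_info)
{
	return drv_hw_fdb_add(priv, fdb_info->addr, fdb_info->vid);
}
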
Ido Schimmel Aug. 23, 2021, 3:18 p.m. UTC | #22
On Mon, Aug 23, 2021 at 05:29:53PM +0300, Vladimir Oltean wrote:
> On Mon, Aug 23, 2021 at 03:16:48PM +0300, Ido Schimmel wrote:
> > I was thinking about the following case:
> >
> > t0 - <MAC1,VID1,P1> is added in syscall context under 'hash_lock'
> > t1 - br_fdb_delete_by_port() flushes entries under 'hash_lock' in
> >      response to STP state. Notifications are added to 'deferred' list
> > t2 - switchdev_deferred_process() is called in syscall context
> > t3 - <MAC1,VID1,P1> is notified as blocking
> >
> > Updates to the SW FDB are protected by 'hash_lock', but updates to the
> > HW FDB are not. In this case, <MAC1,VID1,P1> does not exist in SW, but
> > it will exist in HW.
> >
> > Another case assuming switchdev_deferred_process() is called first:
> >
> > t0 - switchdev_deferred_process() is called in syscall context
> > t1 - <MAC1,VID,P1> is learned under 'hash_lock'. Notification is added
> >      to 'deferred' list
> > t2 - <MAC1,VID1,P1> is modified in syscall context under 'hash_lock' to
> >      <MAC1,VID1,P2>
> > t3 - <MAC1,VID1,P2> is notified as blocking
> > t4 - <MAC1,VID1,P1> is notified as blocking (next time the 'deferred'
> >      list is processed)
> >
> > In this case, the HW will have <MAC1,VID1,P1>, but SW will have
> > <MAC1,VID1,P2>
>
> Ok, so if the hardware FDB entry needs to be updated under the same
> hash_lock as the software FDB entry, then it seems that the goal of
> updating the hardware FDB synchronously and in a sleepable manner is if
> the data path defers the learning to sleepable context too. That in turn
> means that there will be 'dead time' between the reception of a packet
> from a given {MAC SA, VID} flow and the learning of that address. So I
> don't think that is really desirable. So I don't know if it is actually
> realistic to do this.
>
> Can we drop it from the requirements of this change, or do you feel like
> it's not worth it to make my change if this problem is not solved?

I didn't pose it as a requirement, but as a desirable goal that I don't
know how to achieve w/o a surgery in the bridge driver that Nik and you
(understandably) don't like.

Regarding a more practical solution, earlier versions (not what you
posted yesterday) have the undesirable properties of being both
asynchronous (current state) and mandating RTNL to be held. If we are
going with the asynchronous model, then I think we should have a model
that doesn't force RTNL and allows batching.

I have the following proposal, which I believe solves your problem and
allows for batching without RTNL:

The pattern of enqueuing a work item per-entry is not very smart.
Instead, it is better to add the notification info to a list
(protected by a spin lock) and schedule a single work item whose
purpose is to dequeue entries from this list and batch process them.

Inside the work item you would do something like:

spin_lock_bh()
list_splice_init()
spin_unlock_bh()

mutex_lock() // rtnl or preferably private lock
list_for_each_entry_safe() 
	// process entry
	cond_resched()
mutex_unlock()
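
For illustration, a fleshed-out version of the above could look like the
following sketch (the 'foo' structures and foo_hw_fdb_{add,del}() helpers
are hypothetical, not from any in-tree driver; only the splice-then-process
pattern is the point):

#include <linux/if_ether.h>
#include <linux/list.h>
#include <linux/mutex.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/workqueue.h>

struct foo_fdb_event {
	struct list_head list;
	unsigned char addr[ETH_ALEN];
	u16 vid;
	bool adding;
};

struct foo_switch {
	spinlock_t fdb_lock;		/* protects fdb_event_list */
	struct list_head fdb_event_list;
	struct mutex hw_lock;		/* serializes hardware access */
	struct work_struct fdb_work;
};

static void foo_fdb_event_work(struct work_struct *work)
{
	struct foo_switch *sw = container_of(work, struct foo_switch, fdb_work);
	struct foo_fdb_event *ev, *tmp;
	LIST_HEAD(events);

	/* grab everything queued so far in one go */
	spin_lock_bh(&sw->fdb_lock);
	list_splice_init(&sw->fdb_event_list, &events);
	spin_unlock_bh(&sw->fdb_lock);

	mutex_lock(&sw->hw_lock);	/* private lock rather than rtnl */
	list_for_each_entry_safe(ev, tmp, &events, list) {
		/* a driver could batch up to N entries per transaction here */
		if (ev->adding)
			foo_hw_fdb_add(sw, ev->addr, ev->vid);
		else
			foo_hw_fdb_del(sw, ev->addr, ev->vid);
		list_del(&ev->list);
		kfree(ev);
		cond_resched();
	}
	mutex_unlock(&sw->hw_lock);
}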

In del_nbp(), after br_fdb_delete_by_port(), the bridge will emit some
new blocking event (e.g., SWITCHDEV_FDB_FLUSH_TO_DEVICE) that will
instruct the driver to flush all its pending FDB notifications. You
don't strictly need this notification because of the
netdev_upper_dev_unlink() that follows, but it helps in making things
more structured.

Pros:

1. Solves your problem?
2. Pattern is not worse than what we currently have
3. Does not force RTNL
4. Allows for batching. For example, mlxsw has the ability to program up
to 64 entries in one transaction with the device. I assume other devices
in the same grade have similar capabilities

Cons:

1. Asynchronous
2. Pattern we will see in multiple drivers? Can consider migrating it
into switchdev itself at some point
3. Something I missed / overlooked

> There is of course the option of going half-way too, just like for
> SWITCHDEV_PORT_ATTR_SET. You notify it once, synchronously, on the
> atomic chain, the switchdev throws as many errors as it reasonably
> can, then you defer the actual installation which means a hardware access.

Yes, the above proposal has the same property. You can throw errors
before enqueueing the notification info on your list.
Nikolay Aleksandrov Aug. 23, 2021, 3:42 p.m. UTC | #23
On 23/08/2021 18:18, Ido Schimmel wrote:
> On Mon, Aug 23, 2021 at 05:29:53PM +0300, Vladimir Oltean wrote:
>> On Mon, Aug 23, 2021 at 03:16:48PM +0300, Ido Schimmel wrote:
>>> I was thinking about the following case:
>>>
>>> t0 - <MAC1,VID1,P1> is added in syscall context under 'hash_lock'
>>> t1 - br_fdb_delete_by_port() flushes entries under 'hash_lock' in
>>>      response to STP state. Notifications are added to 'deferred' list
>>> t2 - switchdev_deferred_process() is called in syscall context
>>> t3 - <MAC1,VID1,P1> is notified as blocking
>>>
>>> Updates to the SW FDB are protected by 'hash_lock', but updates to the
>>> HW FDB are not. In this case, <MAC1,VID1,P1> does not exist in SW, but
>>> it will exist in HW.
>>>
>>> Another case assuming switchdev_deferred_process() is called first:
>>>
>>> t0 - switchdev_deferred_process() is called in syscall context
>>> t1 - <MAC1,VID,P1> is learned under 'hash_lock'. Notification is added
>>>      to 'deferred' list
>>> t2 - <MAC1,VID1,P1> is modified in syscall context under 'hash_lock' to
>>>      <MAC1,VID1,P2>
>>> t3 - <MAC1,VID1,P2> is notified as blocking
>>> t4 - <MAC1,VID1,P1> is notified as blocking (next time the 'deferred'
>>>      list is processed)
>>>
>>> In this case, the HW will have <MAC1,VID1,P1>, but SW will have
>>> <MAC1,VID1,P2>
>>
>> Ok, so if the hardware FDB entry needs to be updated under the same
>> hash_lock as the software FDB entry, then it seems that the goal of
>> updating the hardware FDB synchronously and in a sleepable manner is if
>> the data path defers the learning to sleepable context too. That in turn
>> means that there will be 'dead time' between the reception of a packet
>> from a given {MAC SA, VID} flow and the learning of that address. So I
>> don't think that is really desirable. So I don't know if it is actually
>> realistic to do this.
>>
>> Can we drop it from the requirements of this change, or do you feel like
>> it's not worth it to make my change if this problem is not solved?
>
> I didn't pose it as a requirement, but as a desirable goal that I don't
> know how to achieve w/o a surgery in the bridge driver that Nik and you
> (understandably) don't like.
>
> Regarding a more practical solution, earlier versions (not what you
> posted yesterday) have the undesirable properties of being both
> asynchronous (current state) and mandating RTNL to be held. If we are
> going with the asynchronous model, then I think we should have a model
> that doesn't force RTNL and allows batching.
>
> I have the following proposal, which I believe solves your problem and
> allows for batching without RTNL:
>
> The pattern of enqueuing a work item per-entry is not very smart.
> Instead, it is better to add the notification info to a list
> (protected by a spin lock) and schedule a single work item whose
> purpose is to dequeue entries from this list and batch process them.
>
> Inside the work item you would do something like:
>
> spin_lock_bh()
> list_splice_init()
> spin_unlock_bh()
>
> mutex_lock() // rtnl or preferably private lock
> list_for_each_entry_safe()
> 	// process entry
> 	cond_resched()
> mutex_unlock()
>
> In del_nbp(), after br_fdb_delete_by_port(), the bridge will emit some
> new blocking event (e.g., SWITCHDEV_FDB_FLUSH_TO_DEVICE) that will
> instruct the driver to flush all its pending FDB notifications. You
> don't strictly need this notification because of the
> netdev_upper_dev_unlink() that follows, but it helps in making things
> more structured.
>

I was also thinking about a solution along these lines; I like this proposition.

> Pros:
>
> 1. Solves your problem?
> 2. Pattern is not worse than what we currently have
> 3. Does not force RTNL
> 4. Allows for batching. For example, mlxsw has the ability to program up
> to 64 entries in one transaction with the device. I assume other devices
> in the same grade have similar capabilities

Batching would help a lot even if we don't remove rtnl; on loaded systems rtnl itself
is a bottleneck and we've seen crazy delays in commands because of contention. That,
coupled with the ability to program multiple entries, would be a nice win.

>
> Cons:
>
> 1. Asynchronous
> 2. Pattern we will see in multiple drivers? Can consider migrating it
> into switchdev itself at some point
> 3. Something I missed / overlooked
>
>> There is of course the option of going half-way too, just like for
>> SWITCHDEV_PORT_ATTR_SET. You notify it once, synchronously, on the
>> atomic chain, the switchdev throws as many errors as it reasonably
>> can, then you defer the actual installation which means a hardware access.
>
> Yes, the above proposal has the same property. You can throw errors
> before enqueueing the notification info on your list.

Thanks,
 Nik
Vladimir Oltean Aug. 23, 2021, 3:42 p.m. UTC | #24
On Mon, Aug 23, 2021 at 06:18:08PM +0300, Ido Schimmel wrote:
> On Mon, Aug 23, 2021 at 05:29:53PM +0300, Vladimir Oltean wrote:
> > On Mon, Aug 23, 2021 at 03:16:48PM +0300, Ido Schimmel wrote:
> > > I was thinking about the following case:
> > >
> > > t0 - <MAC1,VID1,P1> is added in syscall context under 'hash_lock'
> > > t1 - br_fdb_delete_by_port() flushes entries under 'hash_lock' in
> > >      response to STP state. Notifications are added to 'deferred' list
> > > t2 - switchdev_deferred_process() is called in syscall context
> > > t3 - <MAC1,VID1,P1> is notified as blocking
> > >
> > > Updates to the SW FDB are protected by 'hash_lock', but updates to the
> > > HW FDB are not. In this case, <MAC1,VID1,P1> does not exist in SW, but
> > > it will exist in HW.
> > >
> > > Another case assuming switchdev_deferred_process() is called first:
> > >
> > > t0 - switchdev_deferred_process() is called in syscall context
> > > t1 - <MAC1,VID,P1> is learned under 'hash_lock'. Notification is added
> > >      to 'deferred' list
> > > t2 - <MAC1,VID1,P1> is modified in syscall context under 'hash_lock' to
> > >      <MAC1,VID1,P2>
> > > t3 - <MAC1,VID1,P2> is notified as blocking
> > > t4 - <MAC1,VID1,P1> is notified as blocking (next time the 'deferred'
> > >      list is processed)
> > >
> > > In this case, the HW will have <MAC1,VID1,P1>, but SW will have
> > > <MAC1,VID1,P2>
> >
> > Ok, so if the hardware FDB entry needs to be updated under the same
> > hash_lock as the software FDB entry, then it seems that the goal of
> > updating the hardware FDB synchronously and in a sleepable manner is if
> > the data path defers the learning to sleepable context too. That in turn
> > means that there will be 'dead time' between the reception of a packet
> > from a given {MAC SA, VID} flow and the learning of that address. So I
> > don't think that is really desirable. So I don't know if it is actually
> > realistic to do this.
> >
> > Can we drop it from the requirements of this change, or do you feel like
> > it's not worth it to make my change if this problem is not solved?
>
> I didn't pose it as a requirement, but as a desirable goal that I don't
> know how to achieve w/o a surgery in the bridge driver that Nik and you
> (understandably) don't like.
>
> Regarding a more practical solution, earlier versions (not what you
> posted yesterday) have the undesirable properties of being both
> asynchronous (current state) and mandating RTNL to be held. If we are
> going with the asynchronous model, then I think we should have a model
> that doesn't force RTNL and allows batching.
>
> I have the following proposal, which I believe solves your problem and
> allows for batching without RTNL:
>
> The pattern of enqueuing a work item per-entry is not very smart.
> Instead, it is better to add the notification info to a list
> (protected by a spin lock) and schedule a single work item whose
> purpose is to dequeue entries from this list and batch process them.

I don't have hardware where FDB entries can be installed in bulk, so
this is new to me. Might make sense though where you are in fact talking
to firmware, and the firmware is in fact still committing to hardware
one by one, you are still reducing the number of round trips.

> Inside the work item you would do something like:
>
> spin_lock_bh()
> list_splice_init()
> spin_unlock_bh()
>
> mutex_lock() // rtnl or preferably private lock
> list_for_each_entry_safe()
> 	// process entry
> 	cond_resched()
> mutex_unlock()

When is the work item scheduled in your proposal? I assume not only when
SWITCHDEV_FDB_FLUSH_TO_DEVICE is emitted. Is there some sort of timer to
allow for some batching to occur?

>
> In del_nbp(), after br_fdb_delete_by_port(), the bridge will emit some
> new blocking event (e.g., SWITCHDEV_FDB_FLUSH_TO_DEVICE) that will
> instruct the driver to flush all its pending FDB notifications. You
> don't strictly need this notification because of the
> netdev_upper_dev_unlink() that follows, but it helps in making things
> more structured.
>
> Pros:
>
> 1. Solves your problem?
> 2. Pattern is not worse than what we currently have
> 3. Does not force RTNL
> 4. Allows for batching. For example, mlxsw has the ability to program up
> to 64 entries in one transaction with the device. I assume other devices
> in the same grade have similar capabilities
>
> Cons:
>
> 1. Asynchronous
> 2. Pattern we will see in multiple drivers? Can consider migrating it
> into switchdev itself at some point

I can already flush_workqueue(dsa_owq) in dsa_port_pre_bridge_leave()
and this will solve the problem in the same way, will it not?

It's not that I don't have driver-level solutions and hook points.
My concern is that there are way too many moving parts and the entrance
barrier for a new switchdev driver is getting higher and higher to
achieve even basic stuff.

For example, I need to maintain a DSA driver and a switchdev driver for
the exact same class of hardware (ocelot is switchdev, felix is DSA, but
the hardware is the same) and it is just so annoying that the interaction
with switchdev is so verbose and open-coded, it just leads to so much
duplication of basic patterns.
When I add support for SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE in ocelot I
really don't want to add a boatload of code, all copied from DSA.

> 3. Something I missed / overlooked
>
> > There is of course the option of going half-way too, just like for
> > SWITCHDEV_PORT_ATTR_SET. You notify it once, synchronously, on the
> > atomic chain, the switchdev throws as many errors as it reasonably
> > can, then you defer the actual installation which means a hardware access.
>
> Yes, the above proposal has the same property. You can throw errors
> before enqueueing the notification info on your list.
Ido Schimmel Aug. 23, 2021, 4:02 p.m. UTC | #25
On Mon, Aug 23, 2021 at 06:42:44PM +0300, Vladimir Oltean wrote:
> On Mon, Aug 23, 2021 at 06:18:08PM +0300, Ido Schimmel wrote:
> > On Mon, Aug 23, 2021 at 05:29:53PM +0300, Vladimir Oltean wrote:
> > > On Mon, Aug 23, 2021 at 03:16:48PM +0300, Ido Schimmel wrote:
> > > > I was thinking about the following case:
> > > >
> > > > t0 - <MAC1,VID1,P1> is added in syscall context under 'hash_lock'
> > > > t1 - br_fdb_delete_by_port() flushes entries under 'hash_lock' in
> > > >      response to STP state. Notifications are added to 'deferred' list
> > > > t2 - switchdev_deferred_process() is called in syscall context
> > > > t3 - <MAC1,VID1,P1> is notified as blocking
> > > >
> > > > Updates to the SW FDB are protected by 'hash_lock', but updates to the
> > > > HW FDB are not. In this case, <MAC1,VID1,P1> does not exist in SW, but
> > > > it will exist in HW.
> > > >
> > > > Another case assuming switchdev_deferred_process() is called first:
> > > >
> > > > t0 - switchdev_deferred_process() is called in syscall context
> > > > t1 - <MAC1,VID,P1> is learned under 'hash_lock'. Notification is added
> > > >      to 'deferred' list
> > > > t2 - <MAC1,VID1,P1> is modified in syscall context under 'hash_lock' to
> > > >      <MAC1,VID1,P2>
> > > > t3 - <MAC1,VID1,P2> is notified as blocking
> > > > t4 - <MAC1,VID1,P1> is notified as blocking (next time the 'deferred'
> > > >      list is processed)
> > > >
> > > > In this case, the HW will have <MAC1,VID1,P1>, but SW will have
> > > > <MAC1,VID1,P2>
> > >
> > > Ok, so if the hardware FDB entry needs to be updated under the same
> > > hash_lock as the software FDB entry, then it seems that the goal of
> > > updating the hardware FDB synchronously and in a sleepable manner is if
> > > the data path defers the learning to sleepable context too. That in turn
> > > means that there will be 'dead time' between the reception of a packet
> > > from a given {MAC SA, VID} flow and the learning of that address. So I
> > > don't think that is really desirable. So I don't know if it is actually
> > > realistic to do this.
> > >
> > > Can we drop it from the requirements of this change, or do you feel like
> > > it's not worth it to make my change if this problem is not solved?
> >
> > I didn't pose it as a requirement, but as a desirable goal that I don't
> > know how to achieve w/o a surgery in the bridge driver that Nik and you
> > (understandably) don't like.
> >
> > Regarding a more practical solution, earlier versions (not what you
> > posted yesterday) have the undesirable properties of being both
> > asynchronous (current state) and mandating RTNL to be held. If we are
> > going with the asynchronous model, then I think we should have a model
> > that doesn't force RTNL and allows batching.
> >
> > I have the following proposal, which I believe solves your problem and
> > allows for batching without RTNL:
> >
> > The pattern of enqueuing a work item per-entry is not very smart.
> > Instead, it is better to add the notification info to a list
> > (protected by a spin lock) and schedule a single work item whose
> > purpose is to dequeue entries from this list and batch process them.
>
> I don't have hardware where FDB entries can be installed in bulk, so
> this is new to me. Might make sense though where you are in fact talking
> to firmware, and the firmware is in fact still committing to hardware
> one by one, you are still reducing the number of round trips.

Yes

>
> > Inside the work item you would do something like:
> >
> > spin_lock_bh()
> > list_splice_init()
> > spin_unlock_bh()
> >
> > mutex_lock() // rtnl or preferably private lock
> > list_for_each_entry_safe()
> > 	// process entry
> > 	cond_resched()
> > mutex_unlock()
>
> When is the work item scheduled in your proposal?

Calling queue_work() whenever you get a notification. The work item
might already be queued, which is fine.

> I assume not only when SWITCHDEV_FDB_FLUSH_TO_DEVICE is emitted. Is
> there some sort of timer to allow for some batching to occur?

You can add an hysteresis timer if you want, but I don't think it's
necessary. Assuming user space is programming entries at a high rate,
then by the time you finish a batch, you will have a new one enqueued.
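
As an illustration of how the enqueue side could look (continuing the
hypothetical 'foo' sketch shown earlier in the thread; note that
queue_work() returns false and does nothing if the work item is already
pending, so back-to-back notifications naturally coalesce into one run):

#include <linux/etherdevice.h>
#include <linux/slab.h>
#include <net/switchdev.h>

static int foo_fdb_event_enqueue(struct foo_switch *sw,
				 const struct switchdev_notifier_fdb_info *fdb_info,
				 bool adding)
{
	struct foo_fdb_event *ev;

	ev = kzalloc(sizeof(*ev), GFP_ATOMIC);	/* atomic notifier context */
	if (!ev)
		return -ENOMEM;

	ether_addr_copy(ev->addr, fdb_info->addr);
	ev->vid = fdb_info->vid;
	ev->adding = adding;

	spin_lock_bh(&sw->fdb_lock);
	list_add_tail(&ev->list, &sw->fdb_event_list);
	spin_unlock_bh(&sw->fdb_lock);

	queue_work(system_long_wq, &sw->fdb_work);
	return 0;
}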

>
> >
> > In del_nbp(), after br_fdb_delete_by_port(), the bridge will emit some
> > new blocking event (e.g., SWITCHDEV_FDB_FLUSH_TO_DEVICE) that will
> > instruct the driver to flush all its pending FDB notifications. You
> > don't strictly need this notification because of the
> > netdev_upper_dev_unlink() that follows, but it helps in making things
> > more structured.
> >
> > Pros:
> >
> > 1. Solves your problem?
> > 2. Pattern is not worse than what we currently have
> > 3. Does not force RTNL
> > 4. Allows for batching. For example, mlxsw has the ability to program up
> > to 64 entries in one transaction with the device. I assume other devices
> > in the same grade have similar capabilities
> >
> > Cons:
> >
> > 1. Asynchronous
> > 2. Pattern we will see in multiple drivers? Can consider migrating it
> > into switchdev itself at some point
>
> I can already flush_workqueue(dsa_owq) in dsa_port_pre_bridge_leave()
> and this will solve the problem in the same way, will it not?

Problem is that you will deadlock if your work item tries to take RTNL.
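
A minimal illustration of that deadlock, with hypothetical DSA-style names
(the leave path holds rtnl and waits for the work item, while the work item
waits for rtnl):

static void foo_fdb_work_fn(struct work_struct *work)
{
	rtnl_lock();	/* never acquired: rtnl is already held by del_nbp() */
	/* ... program or delete the FDB entry in hardware ... */
	rtnl_unlock();
}

static void foo_port_pre_bridge_leave(struct foo_port *port)
{
	/* Runs under rtnl_lock and waits for foo_fdb_work_fn() to finish,
	 * while foo_fdb_work_fn() waits for rtnl_lock to be released:
	 * neither side can make progress.
	 */
	flush_workqueue(foo_owq);
}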

>
> It's not that I don't have driver-level solutions and hook points.
> My concern is that there are way too many moving parts and the entrance
> barrier for a new switchdev driver is getting higher and higher to
> achieve even basic stuff.

I understand the frustration, but that's my best proposal at the moment.
IMO, it doesn't make things worse and has some nice advantages.

>
> For example, I need to maintain a DSA driver and a switchdev driver for
> the exact same class of hardware (ocelot is switchdev, felix is DSA, but
> the hardware is the same) and it is just so annoying that the interaction
> with switchdev is so verbose and open-coded, it just leads to so much
> duplication of basic patterns.
> When I add support for SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE in ocelot I
> really don't want to add a boatload of code, all copied from DSA.
>
> > 3. Something I missed / overlooked
> >
> > > There is of course the option of going half-way too, just like for
> > > SWITCHDEV_PORT_ATTR_SET. You notify it once, synchronously, on the
> > > atomic chain, the switchdev throws as many errors as it reasonably
> > > can, then you defer the actual installation which means a hardware access.
> >
> > Yes, the above proposal has the same property. You can throw errors
> > before enqueueing the notification info on your list.
Vladimir Oltean Aug. 23, 2021, 4:11 p.m. UTC | #26
On Mon, Aug 23, 2021 at 07:02:15PM +0300, Ido Schimmel wrote:
> > > Inside the work item you would do something like:
> > >
> > > spin_lock_bh()
> > > list_splice_init()
> > > spin_unlock_bh()
> > >
> > > mutex_lock() // rtnl or preferably private lock
> > > list_for_each_entry_safe()
> > > 	// process entry
> > > 	cond_resched()
> > > mutex_unlock()
> >
> > When is the work item scheduled in your proposal?
>
> Calling queue_work() whenever you get a notification. The work item
> might already be queued, which is fine.
>
> > I assume not only when SWITCHDEV_FDB_FLUSH_TO_DEVICE is emitted. Is
> > there some sort of timer to allow for some batching to occur?
>
> You can add an hysteresis timer if you want, but I don't think it's
> necessary. Assuming user space is programming entries at a high rate,
> then by the time you finish a batch, you will have a new one enqueued.

With the current model, nobody really stops any driver from doing that
if they so wish. No switchdev or bridge changes needed. We have maximum
flexibility now, with this async model. Yet it just so happens that no
one is exploiting it, and instead the existing options are poorly
utilized by most drivers.

> > > In del_nbp(), after br_fdb_delete_by_port(), the bridge will emit some
> > > new blocking event (e.g., SWITCHDEV_FDB_FLUSH_TO_DEVICE) that will
> > > instruct the driver to flush all its pending FDB notifications. You
> > > don't strictly need this notification because of the
> > > netdev_upper_dev_unlink() that follows, but it helps in making things
> > > more structured.
> > >
> > > Pros:
> > >
> > > 1. Solves your problem?
> > > 2. Pattern is not worse than what we currently have
> > > 3. Does not force RTNL
> > > 4. Allows for batching. For example, mlxsw has the ability to program up
> > > to 64 entries in one transaction with the device. I assume other devices
> > > in the same grade have similar capabilities
> > >
> > > Cons:
> > >
> > > 1. Asynchronous
> > > 2. Pattern we will see in multiple drivers? Can consider migrating it
> > > into switchdev itself at some point
> >
> > I can already flush_workqueue(dsa_owq) in dsa_port_pre_bridge_leave()
> > and this will solve the problem in the same way, will it not?
>
> Problem is that you will deadlock if your work item tries to take RTNL.

I think we agreed that the rtnl_lock could be dropped from driver FDB work items.
I have not tried that yet though.

> > It's not that I don't have driver-level solutions and hook points.
> > My concern is that there are way too many moving parts and the entrance
> > barrier for a new switchdev driver is getting higher and higher to
> > achieve even basic stuff.
>
> I understand the frustration, but that's my best proposal at the moment.
> IMO, it doesn't make things worse and has some nice advantages.

Reconsidering my options, I don't want to reduce the available optimizations
that other switchdev drivers can make, in the name of a simpler baseline.
I am also not smart enough for reworking the bridge data path.
I will probably do something like flush_workqueue in the PRECHANGEUPPER
handler, see what other common patterns there might be, and try to synthesize
them in library code (a la switchdev_handle_*) that can be used by drivers
that wish to, and ignored by drivers that don't.
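
A rough sketch of that direction (illustrative names only, and assuming
the deferred FDB work no longer takes rtnl_lock, per the deadlock
discussion above):

#include <linux/netdevice.h>

static int foo_netdevice_event(struct notifier_block *nb, unsigned long event,
			       void *ptr)
{
	struct net_device *dev = netdev_notifier_info_to_dev(ptr);
	struct netdev_notifier_changeupper_info *info;

	if (!foo_port_dev_check(dev))
		return NOTIFY_DONE;

	switch (event) {
	case NETDEV_PRECHANGEUPPER:
		info = ptr;
		if (netif_is_bridge_master(info->upper_dev) && !info->linking)
			/* wait for FDB work scheduled while bridged to finish
			 * before the port is unlinked from the bridge
			 */
			flush_workqueue(foo_owq);
		break;
	}

	return NOTIFY_DONE;
}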
Vladimir Oltean Aug. 23, 2021, 4:23 p.m. UTC | #27
On Mon, Aug 23, 2021 at 07:02:15PM +0300, Ido Schimmel wrote:
> > When is the work item scheduled in your proposal?
>
> Calling queue_work() whenever you get a notification. The work item
> might already be queued, which is fine.
>
> > I assume not only when SWITCHDEV_FDB_FLUSH_TO_DEVICE is emitted. Is
> > there some sort of timer to allow for some batching to occur?
>
> You can add an hysteresis timer if you want, but I don't think it's
> necessary. Assuming user space is programming entries at a high rate,
> then by the time you finish a batch, you will have a new one enqueued.

I tried to do something similar in DSA. There we have .ndo_fdb_dump
because we don't sync the hardware FDB. We also have some drivers where
the FDB flush on a port is a very slow procedure, because the FDB needs
to be walked element by element to see what needs to be deleted.
So I wanted to defer the FDB flush to a background work queue, and let
the port leave the bridge quickly and not block the rtnl_mutex.
But it gets really nasty really quick. The FDB flush workqueue cannot
run concurrently with the ndo_fdb_dump, for reasons that have to do with
hardware access. Also, any fdb_add or fdb_del would need to flush the
FDB flush workqueue, for the same reasons. All these are currently
implicitly serialized by the rtnl_mutex. Your hardware/firmware might
be smarter, but I think that if you drop the rtnl_mutex requirement, you
will be seriously surprised by the amount of extra concurrency you need
to handle.
In the end I scrapped everything and I'm happy with a synchronous FDB
flush even if it's slow. YMMV of course.
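
For completeness, this is roughly the kind of explicit serialization that
scrapped attempt would have needed once the rtnl_mutex no longer orders
.ndo_fdb_dump, fdb_add/fdb_del and the deferred flush against each other
(hypothetical 'foo' names, a sketch of the problem rather than a
recommendation):

static int foo_port_fdb_dump(struct sk_buff *skb, struct netlink_callback *cb,
			     struct net_device *dev,
			     struct net_device *filter_dev, int *idx)
{
	struct foo_port *port = netdev_priv(dev);
	int err;

	/* don't walk the hardware FDB while a deferred flush is tearing
	 * it down
	 */
	flush_work(&port->fdb_flush_work);

	mutex_lock(&port->hw_lock);
	err = foo_hw_fdb_walk(port, skb, cb, idx);
	mutex_unlock(&port->hw_lock);

	return err;
}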