diff mbox series

[net-next,2/4] net: dsa: Link aggregation support

Message ID 20201119144508.29468-3-tobias@waldekranz.com
State New
Headers show
Series [net-next,1/4] net: bonding: Notify ports about their initial state | expand

Commit Message

Tobias Waldekranz Nov. 19, 2020, 2:45 p.m. UTC
Monitor the following events and notify the driver when:

- A DSA port joins/leaves a LAG.
- A LAG, made up of DSA ports, joins/leaves a bridge.
- A DSA port in a LAG is enabled/disabled (enabled meaning
  "distributing" in 802.3ad LACP terms).

Each LAG interface to which a DSA port is attached is represented by a
`struct dsa_lag` which is globally reachable from the switch tree and
from each associated port.

When a LAG joins a bridge, the DSA subsystem will treat that as each
individual port joining the bridge. The driver may look at the port's
LAG pointer to see if it is associated with any LAG, if that is
required. This is analogue to how switchdev events are replicated out
to all lower devices when reaching e.g. a LAG.

Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
---
 include/net/dsa.h  |  66 +++++++++++++++++++++
 net/dsa/dsa2.c     |   3 +
 net/dsa/dsa_priv.h |  31 ++++++++++
 net/dsa/port.c     | 143 +++++++++++++++++++++++++++++++++++++++++++++
 net/dsa/slave.c    |  55 +++++++++++++++--
 net/dsa/switch.c   |  49 ++++++++++++++++
 6 files changed, 341 insertions(+), 6 deletions(-)

Comments

Andrew Lunn Nov. 20, 2020, 12:30 a.m. UTC | #1
> +static struct dsa_lag *dsa_lag_get(struct dsa_switch_tree *dst,

> +				   struct net_device *dev)

> +{

> +	unsigned long busy = 0;

> +	struct dsa_lag *lag;

> +	int id;

> +

> +	list_for_each_entry(lag, &dst->lags, list) {

> +		set_bit(lag->id, &busy);

> +

> +		if (lag->dev == dev) {

> +			kref_get(&lag->refcount);

> +			return lag;

> +		}

> +	}

> +

> +	id = find_first_zero_bit(&busy, BITS_PER_LONG);

> +	if (id >= BITS_PER_LONG)

> +		return ERR_PTR(-ENOSPC);

> +

> +	lag = kzalloc(sizeof(*lag), GFP_KERNEL);

> +	if (!lag)

> +		return ERR_PTR(-ENOMEM);


Hi Tobias

My comment last time was to statically allocated them at probe
time. Worse case scenario is each port is alone in a LAG. Pointless,
but somebody could configure it. In dsa_tree_setup_switches() you can
count the number of ports and then allocate an array, or while setting
up a port, add one more lag to the list of lags.

   Andrew
Florian Fainelli Nov. 20, 2020, 2:43 a.m. UTC | #2
On 11/19/2020 4:30 PM, Andrew Lunn wrote:
>> +static struct dsa_lag *dsa_lag_get(struct dsa_switch_tree *dst,

>> +				   struct net_device *dev)

>> +{

>> +	unsigned long busy = 0;

>> +	struct dsa_lag *lag;

>> +	int id;

>> +

>> +	list_for_each_entry(lag, &dst->lags, list) {

>> +		set_bit(lag->id, &busy);

>> +

>> +		if (lag->dev == dev) {

>> +			kref_get(&lag->refcount);

>> +			return lag;

>> +		}

>> +	}

>> +

>> +	id = find_first_zero_bit(&busy, BITS_PER_LONG);

>> +	if (id >= BITS_PER_LONG)

>> +		return ERR_PTR(-ENOSPC);

>> +

>> +	lag = kzalloc(sizeof(*lag), GFP_KERNEL);

>> +	if (!lag)

>> +		return ERR_PTR(-ENOMEM);

> 

> Hi Tobias

> 

> My comment last time was to statically allocated them at probe

> time. Worse case scenario is each port is alone in a LAG. Pointless,

> but somebody could configure it. In dsa_tree_setup_switches() you can

> count the number of ports and then allocate an array, or while setting

> up a port, add one more lag to the list of lags.


The allocation is allowed to sleep (have not checked the calling context
of dsa_lag_get() whether this is OK) so what would be the upside of
doing upfront dsa_lag allocation which could be wasteful?
-- 
Florian
Andrew Lunn Nov. 20, 2020, 1:30 p.m. UTC | #3
On Thu, Nov 19, 2020 at 06:43:38PM -0800, Florian Fainelli wrote:
> 

> 

> On 11/19/2020 4:30 PM, Andrew Lunn wrote:

> >> +static struct dsa_lag *dsa_lag_get(struct dsa_switch_tree *dst,

> >> +				   struct net_device *dev)

> >> +{

> >> +	unsigned long busy = 0;

> >> +	struct dsa_lag *lag;

> >> +	int id;

> >> +

> >> +	list_for_each_entry(lag, &dst->lags, list) {

> >> +		set_bit(lag->id, &busy);

> >> +

> >> +		if (lag->dev == dev) {

> >> +			kref_get(&lag->refcount);

> >> +			return lag;

> >> +		}

> >> +	}

> >> +

> >> +	id = find_first_zero_bit(&busy, BITS_PER_LONG);

> >> +	if (id >= BITS_PER_LONG)

> >> +		return ERR_PTR(-ENOSPC);

> >> +

> >> +	lag = kzalloc(sizeof(*lag), GFP_KERNEL);

> >> +	if (!lag)

> >> +		return ERR_PTR(-ENOMEM);

> > 

> > Hi Tobias

> > 

> > My comment last time was to statically allocated them at probe

> > time. Worse case scenario is each port is alone in a LAG. Pointless,

> > but somebody could configure it. In dsa_tree_setup_switches() you can

> > count the number of ports and then allocate an array, or while setting

> > up a port, add one more lag to the list of lags.

> 

> The allocation is allowed to sleep (have not checked the calling context

> of dsa_lag_get() whether this is OK) so what would be the upside of

> doing upfront dsa_lag allocation which could be wasteful?


Hi Florian

It fits the pattern for the rest of the DSA core. We never allocate
anything at runtime. That keeps the error handling simple, we don't
need to deal with ENOMEM errors, undoing whatever we might of done,
implementing transactions etc.

And the memory involved here is small. I guess around 80 bytes per
lag? So even for a 10 port switch, we are only talking about 800
bytes. That is not a lot.

       Andrew
Tobias Waldekranz Nov. 26, 2020, 10:36 p.m. UTC | #4
On Fri, Nov 20, 2020 at 14:30, Andrew Lunn <andrew@lunn.ch> wrote:
> On Thu, Nov 19, 2020 at 06:43:38PM -0800, Florian Fainelli wrote:

>> > 

>> > Hi Tobias

>> > 

>> > My comment last time was to statically allocated them at probe

>> > time. Worse case scenario is each port is alone in a LAG. Pointless,

>> > but somebody could configure it. In dsa_tree_setup_switches() you can

>> > count the number of ports and then allocate an array, or while setting

>> > up a port, add one more lag to the list of lags.


Right, yes I forgot about that, sorry.

Permit me a final plea for the current implementation. If it is a hard
no, say the word and I will re-spin a v2 with your suggestion.

>> The allocation is allowed to sleep (have not checked the calling context

>> of dsa_lag_get() whether this is OK) so what would be the upside of


These events always originate from a sendmsg (netlink message), AFAIK
there is no other way for an interface to move to a new upper, so I do
not see any issues there.

>> doing upfront dsa_lag allocation which could be wasteful?

>

> Hi Florian

>

> It fits the pattern for the rest of the DSA core. We never allocate

> anything at runtime. That keeps the error handling simple, we don't

> need to deal with ENOMEM errors, undoing whatever we might of done,

> implementing transactions etc.


I understand that argument in principle. The reason that I think it
carries less weight in this case has to do with the dynamic nature of
the LAGs themselves.

In the cases of `struct dsa_port`s or `struct dsa_switch`es there is no
way for them to be dynamically removed, so it makes a lot of sense to
statically allocate everything.

Contrast that with a LAG that dynamically created and dynamically
modified, i.e. more ports can join/leave the LAG during its
lifecycle. This means you get all the headache of figuring out if the
LAG is in use or not anyway, i.e. you need to refcount them.

So essentially we _will_ be dynamically allocating/freeing them. We just
get to choose if we do it through the SLAB allocator, or through the
static array.

If you go with the static array, you theoretically can not get the
equivalent of an ENOMEM. Practically though you have to iterate through
the array and look for a free entry, but you still have to put a return
statement at the bottom of that function, right? Or panic I suppose. My
guess is you end up with:

    struct dsa_lag *dsa_lag_get(dst)
    {
        for (lag in dst->lag_array) {
            if (lag->dev == NULL)
                return lag;
        }

        return NULL;
    }

So now we have just traded dealing with an ENOMEM for a NULL pointer;
pretty much the same thing.
Andrew Lunn Nov. 26, 2020, 10:57 p.m. UTC | #5
> If you go with the static array, you theoretically can not get the

> equivalent of an ENOMEM. Practically though you have to iterate through

> the array and look for a free entry, but you still have to put a return

> statement at the bottom of that function, right? Or panic I suppose. My

> guess is you end up with:

> 

>     struct dsa_lag *dsa_lag_get(dst)

>     {

>         for (lag in dst->lag_array) {

>             if (lag->dev == NULL)

>                 return lag;

>         }

> 

>         return NULL;

>     }


I would put a WARN() in here, and return the NULL pointer. That will
then likely opps soon afterwards and kill of the user space tool
configuring the LAG, maybe leaving rtnl locked, and so all network
configuration dies. But that is all fine, since you cannot have more
LAGs than ports. This can never happen. If it does happen, something
is badly wrong and we want to know about it. And so a very obvious
explosion is good.

> So now we have just traded dealing with an ENOMEM for a NULL pointer;

> pretty much the same thing.


ENOMEM you have to handle correctly, unwind everything and leaving the
stack in the same state as before. Being out of memory is not a reason
to explode. Have you tested this? It is the sort of thing which does
not happen very often, so i expect is broken.

   Andrew
Tobias Waldekranz Nov. 27, 2020, 9:19 a.m. UTC | #6
On Thu, Nov 26, 2020 at 23:57, Andrew Lunn <andrew@lunn.ch> wrote:
>> If you go with the static array, you theoretically can not get the

>> equivalent of an ENOMEM. Practically though you have to iterate through

>> the array and look for a free entry, but you still have to put a return

>> statement at the bottom of that function, right? Or panic I suppose. My

>> guess is you end up with:

>> 

>>     struct dsa_lag *dsa_lag_get(dst)

>>     {

>>         for (lag in dst->lag_array) {

>>             if (lag->dev == NULL)

>>                 return lag;

>>         }

>> 

>>         return NULL;

>>     }

>

> I would put a WARN() in here, and return the NULL pointer. That will

> then likely opps soon afterwards and kill of the user space tool

> configuring the LAG, maybe leaving rtnl locked, and so all network

> configuration dies. But that is all fine, since you cannot have more


This is a digression, but I really do not get this shift from using
BUG()s to WARN()s in the kernel when you detect a violated invariant. It
smells of "On Error Resume Next" to me.

> LAGs than ports. This can never happen. If it does happen, something

> is badly wrong and we want to know about it. And so a very obvious

> explosion is good.


Indeed. That is why I think it should be BUG(). Leaving the system to
limp along can easily go undetected.

>> So now we have just traded dealing with an ENOMEM for a NULL pointer;

>> pretty much the same thing.

>

> ENOMEM you have to handle correctly, unwind everything and leaving the

> stack in the same state as before. Being out of memory is not a reason


We have to handle EWHATEVER correctly, no? I do not get what is so
special about ENOMEM. If we hit some other error after the allocation we
have to remember to free the memory of course, but that is just normal
cleanup. Is this what you mean by "unwind everything"? In that case we
need to also "free" the element we allocated from the array. Again, it
is fundamentally the same problem.

How would a call to kmalloc have any impact on the stack? (Barring
exotic bugs in the allocator that would allow the heap to intrude on
stack memory) Or am I misunderstanding what you mean by "the stack"?

> to explode. Have you tested this? It is the sort of thing which does

> not happen very often, so i expect is broken.


I have not run my system under memory pressure or done any error
injection testing. Is that customary to do whenever using kmalloc?  If
it would make a difference I could look into setting that up.

I have certainly tried my best to audit all the possible error paths to
make sure that no memory is ever leaked, and I feel confident in that
analysis.
Andrew Lunn Nov. 27, 2020, 4:28 p.m. UTC | #7
> This is a digression, but I really do not get this shift from using

> BUG()s to WARN()s in the kernel when you detect a violated invariant. It

> smells of "On Error Resume Next" to me.


A WARN() gives you a chance to actually use the machine to collect
logs, the kernel dump, etc. You might be able to sync the filesystems,
reducing the damage to the disks.  With BUG(), the machine is
dead. You cannot interact with it, all you have to go on, is what you
can see on the disk, or what you might have logged on the serial
console.

> We have to handle EWHATEVER correctly, no? I do not get what is so

> special about ENOMEM.


Nothing is particularly special about it. But looking at the current
code base the only other error we can get is probably ETIMEDOUT from
an MDIO read/write. But if that happens, there is probably no real
recovery. You have no idea what state the switch is in, and all other
MDIO calls are likely to fail in the same way.

> How would a call to kmalloc have any impact on the stack? (Barring

> exotic bugs in the allocator that would allow the heap to intrude on

> stack memory) Or am I misunderstanding what you mean by "the stack"?


I mean the network stack, top to bottom. Say we have a few vlan
interfaces, on top of the bridge, on top of a LAG, on top of DSA, on
top of IP over avian carriers. When setting up a LAG, what else has
happened by the time you get your ENOMEM? What notifications have
already happened, which some other layer has acted upon? It is not
just LAG inside DSA which needs to unwind, it is all the actions which
have been triggered so far.

The initial design of switchdev was transactions. First there was a
prepare call, where you validated the requested action is possible,
and allocate resources needed, but don't actually do it. This prepare
call is allowed to fail. Then there is a second call to actually do
it, and that call is not allowed to fail. This structure avoids most
of the complexity of the unwind, just free up some resources. If you
never had to allocate the resources in the first place, better still.

      Andrew
Tobias Waldekranz Nov. 27, 2020, 11:19 p.m. UTC | #8
On Fri, Nov 27, 2020 at 17:28, Andrew Lunn <andrew@lunn.ch> wrote:
>> This is a digression, but I really do not get this shift from using

>> BUG()s to WARN()s in the kernel when you detect a violated invariant. It

>> smells of "On Error Resume Next" to me.

>

> A WARN() gives you a chance to actually use the machine to collect

> logs, the kernel dump, etc. You might be able to sync the filesystems,


I completely agree that collecting the logs is very important. But I
think there are ways of solving that problem that works for both BUG()
and WARN(), e.g. pstore or mtdoops.

> reducing the damage to the disks.  With BUG(), the machine is


Well, you are betting that whatever bug you just discovered has not
affected the VFS in any way, causing you to sync back bad data to the
disk. I would rather take a hard reboot and rely on the filesystem's
journal to bring it up in a consistent state.

> dead. You cannot interact with it, all you have to go on, is what you

> can see on the disk, or what you might have logged on the serial

> console.


I guess we are coming at this problem from different angles. For me it
is not OK to just leave the device in limp-mode until an administrator
can inspect it and perform a reboot, as that can take weeks in some
cases. I would much rather have a hard reboot, bring the system back to
a fully operational state, detect the panic during the next boot,
prepare a debug tarball with the pstore data, and signal an alarm of
some kind.

Agree to disagree? :)

>> We have to handle EWHATEVER correctly, no? I do not get what is so

>> special about ENOMEM.

>

> Nothing is particularly special about it. But looking at the current

> code base the only other error we can get is probably ETIMEDOUT from

> an MDIO read/write. But if that happens, there is probably no real

> recovery. You have no idea what state the switch is in, and all other

> MDIO calls are likely to fail in the same way.

>

>> How would a call to kmalloc have any impact on the stack? (Barring

>> exotic bugs in the allocator that would allow the heap to intrude on

>> stack memory) Or am I misunderstanding what you mean by "the stack"?

>

> I mean the network stack, top to bottom. Say we have a few vlan

> interfaces, on top of the bridge, on top of a LAG, on top of DSA, on

> top of IP over avian carriers. When setting up a LAG, what else has

> happened by the time you get your ENOMEM? What notifications have

> already happened, which some other layer has acted upon? It is not

> just LAG inside DSA which needs to unwind, it is all the actions which

> have been triggered so far.

>

> The initial design of switchdev was transactions. First there was a

> prepare call, where you validated the requested action is possible,

> and allocate resources needed, but don't actually do it. This prepare

> call is allowed to fail. Then there is a second call to actually do

> it, and that call is not allowed to fail. This structure avoids most

> of the complexity of the unwind, just free up some resources. If you

> never had to allocate the resources in the first place, better still.


OK I think I finally see what you are saying. Sorry it took me this
long. I do not mean to be difficult, I just want to understand.

How about this:

- Add a `lags_max` field to `struct dsa_switch` to let each driver
  define the maximum number supported by the hardware. By default this
  would be zero, which would mean that LAG offloading is not supported.

- In dsa_tree_setup we allocate a static array of the minimum supported
  number across the entire tree.

- When joining a new LAG, we ensure that a slot is available in
  NETDEV_PRECHANGEUPPER, avoiding the issue you are describing.

- In NETDEV_CHANGEUPPER, we actually mark it as busy and start using it.
Florian Fainelli Nov. 28, 2020, 5:19 a.m. UTC | #9
On 11/27/2020 3:19 PM, Tobias Waldekranz wrote:
>> The initial design of switchdev was transactions. First there was a

>> prepare call, where you validated the requested action is possible,

>> and allocate resources needed, but don't actually do it. This prepare

>> call is allowed to fail. Then there is a second call to actually do

>> it, and that call is not allowed to fail. This structure avoids most

>> of the complexity of the unwind, just free up some resources. If you

>> never had to allocate the resources in the first place, better still.

> 

> OK I think I finally see what you are saying. Sorry it took me this

> long. I do not mean to be difficult, I just want to understand.

> 

> How about this:

> 

> - Add a `lags_max` field to `struct dsa_switch` to let each driver

>   define the maximum number supported by the hardware. By default this

>   would be zero, which would mean that LAG offloading is not supported.

> 

> - In dsa_tree_setup we allocate a static array of the minimum supported

>   number across the entire tree.

> 

> - When joining a new LAG, we ensure that a slot is available in

>   NETDEV_PRECHANGEUPPER, avoiding the issue you are describing.

> 

> - In NETDEV_CHANGEUPPER, we actually mark it as busy and start using it.

> 


Sounds reasonable to me.
-- 
Florian
Andrew Lunn Nov. 28, 2020, 4:38 p.m. UTC | #10
> > OK I think I finally see what you are saying. Sorry it took me this

> > long. I do not mean to be difficult, I just want to understand.


Not a problem. This is a bit different to normal, the complexity of
the stack means you need to handle this different to most drivers. If
you have done any deeply embedded stuff, RTOS, allocating everything
up front is normal, it eliminates a whole class of problems.

> > 

> > How about this:

> > 

> > - Add a `lags_max` field to `struct dsa_switch` to let each driver

> >   define the maximum number supported by the hardware.


This is all reasonable. I just wonder what that number is for the
mv88e6xx case, especially for D in DSA. I guess LAGs are global in
scope across a set of switches. So it does not matter if there is one
switch or lots of switches, the lags_max stays the same.

      Andrew
diff mbox series

Patch

diff --git a/include/net/dsa.h b/include/net/dsa.h
index 4e60d2610f20..3fd8f041ddbe 100644
--- a/include/net/dsa.h
+++ b/include/net/dsa.h
@@ -147,6 +147,9 @@  struct dsa_switch_tree {
 	/* List of switch ports */
 	struct list_head ports;
 
+	/* List of configured LAGs */
+	struct list_head lags;
+
 	/* List of DSA links composing the routing table */
 	struct list_head rtable;
 };
@@ -180,6 +183,49 @@  struct dsa_mall_tc_entry {
 	};
 };
 
+struct dsa_lag {
+	struct net_device *dev;
+	int id;
+
+	struct list_head ports;
+
+	/* For multichip systems, we must ensure that each hash bucket
+	 * is only enabled on a single egress port throughout the
+	 * whole tree, lest we send duplicates. Therefore we must
+	 * maintain a global list of active tx ports, so that each
+	 * switch can figure out which buckets to enable on which
+	 * ports.
+	 */
+	struct list_head tx_ports;
+	int num_tx;
+
+	struct kref refcount;
+	struct list_head list;
+};
+
+static inline struct dsa_lag *dsa_lag_by_dev(struct dsa_switch_tree *dst,
+					     struct net_device *dev)
+{
+	struct dsa_lag *lag;
+
+	list_for_each_entry(lag, &dst->lags, list)
+		if (lag->dev == dev)
+			return lag;
+
+	return NULL;
+}
+
+static inline struct net_device *dsa_lag_dev_by_id(struct dsa_switch_tree *dst,
+						   int id)
+{
+	struct dsa_lag *lag;
+
+	list_for_each_entry_rcu(lag, &dst->lags, list)
+		if (lag->id == id)
+			return lag->dev;
+
+	return NULL;
+}
 
 struct dsa_port {
 	/* A CPU port is physically connected to a master device.
@@ -220,6 +266,9 @@  struct dsa_port {
 	bool			devlink_port_setup;
 	struct phylink		*pl;
 	struct phylink_config	pl_config;
+	struct dsa_lag		*lag;
+	struct list_head	lag_list;
+	struct list_head	lag_tx_list;
 
 	struct list_head list;
 
@@ -624,6 +673,13 @@  struct dsa_switch_ops {
 	void	(*crosschip_bridge_leave)(struct dsa_switch *ds, int tree_index,
 					  int sw_index, int port,
 					  struct net_device *br);
+	int	(*crosschip_lag_change)(struct dsa_switch *ds, int sw_index,
+					int port, struct net_device *lag_dev,
+					struct netdev_lag_lower_state_info *info);
+	int	(*crosschip_lag_join)(struct dsa_switch *ds, int sw_index,
+				      int port, struct net_device *lag_dev);
+	void	(*crosschip_lag_leave)(struct dsa_switch *ds, int sw_index,
+				       int port, struct net_device *lag_dev);
 
 	/*
 	 * PTP functionality
@@ -655,6 +711,16 @@  struct dsa_switch_ops {
 	int	(*port_change_mtu)(struct dsa_switch *ds, int port,
 				   int new_mtu);
 	int	(*port_max_mtu)(struct dsa_switch *ds, int port);
+
+	/*
+	 * LAG integration
+	 */
+	int	(*port_lag_change)(struct dsa_switch *ds, int port,
+				   struct netdev_lag_lower_state_info *info);
+	int	(*port_lag_join)(struct dsa_switch *ds, int port,
+				 struct net_device *lag_dev);
+	void	(*port_lag_leave)(struct dsa_switch *ds, int port,
+				  struct net_device *lag_dev);
 };
 
 #define DSA_DEVLINK_PARAM_DRIVER(_id, _name, _type, _cmodes)		\
diff --git a/net/dsa/dsa2.c b/net/dsa/dsa2.c
index 183003e45762..708d5a34e150 100644
--- a/net/dsa/dsa2.c
+++ b/net/dsa/dsa2.c
@@ -66,6 +66,7 @@  static struct dsa_switch_tree *dsa_tree_alloc(int index)
 	INIT_LIST_HEAD(&dst->rtable);
 
 	INIT_LIST_HEAD(&dst->ports);
+	INIT_LIST_HEAD(&dst->lags);
 
 	INIT_LIST_HEAD(&dst->list);
 	list_add_tail(&dst->list, &dsa_tree_list);
@@ -659,6 +660,8 @@  static struct dsa_port *dsa_port_touch(struct dsa_switch *ds, int index)
 	dp->index = index;
 
 	INIT_LIST_HEAD(&dp->list);
+	INIT_LIST_HEAD(&dp->lag_list);
+	INIT_LIST_HEAD(&dp->lag_tx_list);
 	list_add_tail(&dp->list, &dst->ports);
 
 	return dp;
diff --git a/net/dsa/dsa_priv.h b/net/dsa/dsa_priv.h
index 7c96aae9062c..214051f3ced0 100644
--- a/net/dsa/dsa_priv.h
+++ b/net/dsa/dsa_priv.h
@@ -20,6 +20,9 @@  enum {
 	DSA_NOTIFIER_BRIDGE_LEAVE,
 	DSA_NOTIFIER_FDB_ADD,
 	DSA_NOTIFIER_FDB_DEL,
+	DSA_NOTIFIER_LAG_CHANGE,
+	DSA_NOTIFIER_LAG_JOIN,
+	DSA_NOTIFIER_LAG_LEAVE,
 	DSA_NOTIFIER_MDB_ADD,
 	DSA_NOTIFIER_MDB_DEL,
 	DSA_NOTIFIER_VLAN_ADD,
@@ -57,6 +60,14 @@  struct dsa_notifier_mdb_info {
 	int port;
 };
 
+/* DSA_NOTIFIER_LAG_* */
+struct dsa_notifier_lag_info {
+	struct netdev_lag_lower_state_info *info;
+	struct net_device *lag;
+	int sw_index;
+	int port;
+};
+
 /* DSA_NOTIFIER_VLAN_* */
 struct dsa_notifier_vlan_info {
 	const struct switchdev_obj_port_vlan *vlan;
@@ -135,6 +146,10 @@  void dsa_port_disable_rt(struct dsa_port *dp);
 void dsa_port_disable(struct dsa_port *dp);
 int dsa_port_bridge_join(struct dsa_port *dp, struct net_device *br);
 void dsa_port_bridge_leave(struct dsa_port *dp, struct net_device *br);
+int dsa_port_lag_change(struct dsa_port *dp,
+			struct netdev_lag_lower_state_info *linfo);
+int dsa_port_lag_join(struct dsa_port *dp, struct net_device *lag_dev);
+void dsa_port_lag_leave(struct dsa_port *dp, struct net_device *lag_dev);
 int dsa_port_vlan_filtering(struct dsa_port *dp, bool vlan_filtering,
 			    struct switchdev_trans *trans);
 bool dsa_port_skip_vlan_configuration(struct dsa_port *dp);
@@ -167,6 +182,22 @@  int dsa_port_link_register_of(struct dsa_port *dp);
 void dsa_port_link_unregister_of(struct dsa_port *dp);
 extern const struct phylink_mac_ops dsa_port_phylink_mac_ops;
 
+static inline bool dsa_port_can_offload(struct dsa_port *dp,
+					struct net_device *dev)
+{
+	/* Switchdev offloading can be configured on: */
+
+	if (dev == dp->slave)
+		/* DSA ports directly connected to a bridge. */
+		return true;
+
+	if (dp->lag && dev == dp->lag->dev)
+		/* DSA ports connected to a bridge via a LAG */
+		return true;
+
+	return false;
+}
+
 /* slave.c */
 extern const struct dsa_device_ops notag_netdev_ops;
 void dsa_slave_mii_bus_init(struct dsa_switch *ds);
diff --git a/net/dsa/port.c b/net/dsa/port.c
index 73569c9af3cc..4bb8a69d7ec2 100644
--- a/net/dsa/port.c
+++ b/net/dsa/port.c
@@ -193,6 +193,149 @@  void dsa_port_bridge_leave(struct dsa_port *dp, struct net_device *br)
 	dsa_port_set_state_now(dp, BR_STATE_FORWARDING);
 }
 
+static struct dsa_lag *dsa_lag_get(struct dsa_switch_tree *dst,
+				   struct net_device *dev)
+{
+	unsigned long busy = 0;
+	struct dsa_lag *lag;
+	int id;
+
+	list_for_each_entry(lag, &dst->lags, list) {
+		set_bit(lag->id, &busy);
+
+		if (lag->dev == dev) {
+			kref_get(&lag->refcount);
+			return lag;
+		}
+	}
+
+	id = find_first_zero_bit(&busy, BITS_PER_LONG);
+	if (id >= BITS_PER_LONG)
+		return ERR_PTR(-ENOSPC);
+
+	lag = kzalloc(sizeof(*lag), GFP_KERNEL);
+	if (!lag)
+		return ERR_PTR(-ENOMEM);
+
+	kref_init(&lag->refcount);
+	lag->id = id;
+	lag->dev = dev;
+	INIT_LIST_HEAD(&lag->ports);
+	INIT_LIST_HEAD(&lag->tx_ports);
+
+	INIT_LIST_HEAD(&lag->list);
+	list_add_tail_rcu(&lag->list, &dst->lags);
+	return lag;
+}
+
+static void dsa_lag_release(struct kref *refcount)
+{
+	struct dsa_lag *lag = container_of(refcount, struct dsa_lag, refcount);
+
+	list_del_rcu(&lag->list);
+	synchronize_rcu();
+	kfree(lag);
+}
+
+static void dsa_lag_put(struct dsa_lag *lag)
+{
+	kref_put(&lag->refcount, dsa_lag_release);
+}
+
+int dsa_port_lag_change(struct dsa_port *dp,
+			struct netdev_lag_lower_state_info *linfo)
+{
+	struct dsa_notifier_lag_info info = {
+		.sw_index = dp->ds->index,
+		.port = dp->index,
+		.info = linfo,
+	};
+	bool old, new;
+
+	if (!dp->lag)
+		return 0;
+
+	info.lag = dp->lag->dev;
+
+	/* If this port is on the tx list, it is already enabled. */
+	old = !list_empty(&dp->lag_tx_list);
+
+	/* On statically configured aggregates (e.g. loadbalance
+	 * without LACP) ports will always be tx_enabled, even if the
+	 * link is down. Thus we require both link_up and tx_enabled
+	 * in order to include it in the tx set.
+	 */
+	new = linfo->link_up && linfo->tx_enabled;
+
+	if (new == old)
+		return 0;
+
+	if (new) {
+		dp->lag->num_tx++;
+		list_add_tail(&dp->lag_tx_list, &dp->lag->tx_ports);
+	} else {
+		list_del_init(&dp->lag_tx_list);
+		dp->lag->num_tx--;
+	}
+
+	return dsa_port_notify(dp, DSA_NOTIFIER_LAG_CHANGE, &info);
+}
+
+int dsa_port_lag_join(struct dsa_port *dp, struct net_device *lag_dev)
+{
+	struct dsa_notifier_lag_info info = {
+		.sw_index = dp->ds->index,
+		.port = dp->index,
+		.lag = lag_dev,
+	};
+	struct dsa_lag *lag;
+	int err;
+
+	lag = dsa_lag_get(dp->ds->dst, lag_dev);
+	if (IS_ERR(lag))
+		return PTR_ERR(lag);
+
+	dp->lag = lag;
+	list_add_tail(&dp->lag_list, &lag->ports);
+
+	err = dsa_port_notify(dp, DSA_NOTIFIER_LAG_JOIN, &info);
+	if (err) {
+		dp->lag = NULL;
+		list_del_init(&dp->lag_list);
+		dsa_lag_put(lag);
+	}
+
+	return err;
+}
+
+void dsa_port_lag_leave(struct dsa_port *dp, struct net_device *lag_dev)
+{
+	struct dsa_notifier_lag_info info = {
+		.sw_index = dp->ds->index,
+		.port = dp->index,
+		.lag = lag_dev,
+	};
+	int err;
+
+	/* Port might have been part of a LAG that in turn was
+	 * attached to a bridge.
+	 */
+	if (dp->bridge_dev)
+		dsa_port_bridge_leave(dp, dp->bridge_dev);
+
+	list_del_init(&dp->lag_list);
+	list_del_init(&dp->lag_tx_list);
+
+	err = dsa_port_notify(dp, DSA_NOTIFIER_LAG_LEAVE, &info);
+	if (err)
+		pr_err("DSA: failed to notify DSA_NOTIFIER_LAG_LEAVE: %d\n",
+		       err);
+
+	dsa_lag_put(dp->lag);
+
+	dp->lag = NULL;
+}
+
 /* Must be called under rcu_read_lock() */
 static bool dsa_port_can_apply_vlan_filtering(struct dsa_port *dp,
 					      bool vlan_filtering)
diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index ff2266d2b998..ca61349886a4 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -334,7 +334,7 @@  static int dsa_slave_vlan_add(struct net_device *dev,
 	struct switchdev_obj_port_vlan vlan;
 	int vid, err;
 
-	if (obj->orig_dev != dev)
+	if (!dsa_port_can_offload(dp, obj->orig_dev))
 		return -EOPNOTSUPP;
 
 	if (dsa_port_skip_vlan_configuration(dp))
@@ -391,7 +391,7 @@  static int dsa_slave_port_obj_add(struct net_device *dev,
 
 	switch (obj->id) {
 	case SWITCHDEV_OBJ_ID_PORT_MDB:
-		if (obj->orig_dev != dev)
+		if (!dsa_port_can_offload(dp, obj->orig_dev))
 			return -EOPNOTSUPP;
 		err = dsa_port_mdb_add(dp, SWITCHDEV_OBJ_PORT_MDB(obj), trans);
 		break;
@@ -421,7 +421,7 @@  static int dsa_slave_vlan_del(struct net_device *dev,
 	struct switchdev_obj_port_vlan *vlan;
 	int vid, err;
 
-	if (obj->orig_dev != dev)
+	if (!dsa_port_can_offload(dp, obj->orig_dev))
 		return -EOPNOTSUPP;
 
 	if (dsa_port_skip_vlan_configuration(dp))
@@ -450,7 +450,7 @@  static int dsa_slave_port_obj_del(struct net_device *dev,
 
 	switch (obj->id) {
 	case SWITCHDEV_OBJ_ID_PORT_MDB:
-		if (obj->orig_dev != dev)
+		if (!dsa_port_can_offload(dp, obj->orig_dev))
 			return -EOPNOTSUPP;
 		err = dsa_port_mdb_del(dp, SWITCHDEV_OBJ_PORT_MDB(obj));
 		break;
@@ -1941,6 +1941,33 @@  static int dsa_slave_changeupper(struct net_device *dev,
 			dsa_port_bridge_leave(dp, info->upper_dev);
 			err = NOTIFY_OK;
 		}
+	} else if (netif_is_lag_master(info->upper_dev)) {
+		if (info->linking) {
+			err = dsa_port_lag_join(dp, info->upper_dev);
+			err = notifier_from_errno(err);
+		} else {
+			dsa_port_lag_leave(dp, info->upper_dev);
+			err = NOTIFY_OK;
+		}
+	}
+
+	return err;
+}
+
+static int dsa_slave_lag_changeupper(struct net_device *dev,
+				     struct netdev_notifier_changeupper_info *info)
+{
+	struct net_device *lower;
+	struct list_head *iter;
+	int err = NOTIFY_DONE;
+
+	netdev_for_each_lower_dev(dev, lower, iter) {
+		if (!dsa_slave_dev_check(lower))
+			continue;
+
+		err = dsa_slave_changeupper(lower, info);
+		if (notifier_to_errno(err))
+			break;
 	}
 
 	return err;
@@ -2038,10 +2065,26 @@  static int dsa_slave_netdevice_event(struct notifier_block *nb,
 		break;
 	}
 	case NETDEV_CHANGEUPPER:
+		if (dsa_slave_dev_check(dev))
+			return dsa_slave_changeupper(dev, ptr);
+
+		if (netif_is_lag_master(dev))
+			return dsa_slave_lag_changeupper(dev, ptr);
+
+		break;
+	case NETDEV_CHANGELOWERSTATE: {
+		struct netdev_notifier_changelowerstate_info *info = ptr;
+		struct dsa_port *dp;
+		int err;
+
 		if (!dsa_slave_dev_check(dev))
-			return NOTIFY_DONE;
+			break;
 
-		return dsa_slave_changeupper(dev, ptr);
+		dp = dsa_slave_to_port(dev);
+
+		err = dsa_port_lag_change(dp, info->lower_state_info);
+		return notifier_from_errno(err);
+	}
 	}
 
 	return NOTIFY_DONE;
diff --git a/net/dsa/switch.c b/net/dsa/switch.c
index 3fb362b6874e..3e518df7cd1f 100644
--- a/net/dsa/switch.c
+++ b/net/dsa/switch.c
@@ -178,6 +178,46 @@  static int dsa_switch_fdb_del(struct dsa_switch *ds,
 	return ds->ops->port_fdb_del(ds, port, info->addr, info->vid);
 }
 
+static int dsa_switch_lag_change(struct dsa_switch *ds,
+				 struct dsa_notifier_lag_info *info)
+{
+	if (ds->index == info->sw_index && ds->ops->port_lag_change)
+		return ds->ops->port_lag_change(ds, info->port, info->info);
+
+	if (ds->index != info->sw_index && ds->ops->crosschip_lag_change)
+		return ds->ops->crosschip_lag_change(ds, info->sw_index,
+						     info->port, info->lag,
+						     info->info);
+
+	return 0;
+}
+
+static int dsa_switch_lag_join(struct dsa_switch *ds,
+			       struct dsa_notifier_lag_info *info)
+{
+	if (ds->index == info->sw_index && ds->ops->port_lag_join)
+		return ds->ops->port_lag_join(ds, info->port, info->lag);
+
+	if (ds->index != info->sw_index && ds->ops->crosschip_lag_join)
+		return ds->ops->crosschip_lag_join(ds, info->sw_index,
+						   info->port, info->lag);
+
+	return 0;
+}
+
+static int dsa_switch_lag_leave(struct dsa_switch *ds,
+				struct dsa_notifier_lag_info *info)
+{
+	if (ds->index == info->sw_index && ds->ops->port_lag_leave)
+		ds->ops->port_lag_leave(ds, info->port, info->lag);
+
+	if (ds->index != info->sw_index && ds->ops->crosschip_lag_leave)
+		ds->ops->crosschip_lag_leave(ds, info->sw_index,
+					     info->port, info->lag);
+
+	return 0;
+}
+
 static bool dsa_switch_mdb_match(struct dsa_switch *ds, int port,
 				 struct dsa_notifier_mdb_info *info)
 {
@@ -325,6 +365,15 @@  static int dsa_switch_event(struct notifier_block *nb,
 	case DSA_NOTIFIER_FDB_DEL:
 		err = dsa_switch_fdb_del(ds, info);
 		break;
+	case DSA_NOTIFIER_LAG_CHANGE:
+		err = dsa_switch_lag_change(ds, info);
+		break;
+	case DSA_NOTIFIER_LAG_JOIN:
+		err = dsa_switch_lag_join(ds, info);
+		break;
+	case DSA_NOTIFIER_LAG_LEAVE:
+		err = dsa_switch_lag_leave(ds, info);
+		break;
 	case DSA_NOTIFIER_MDB_ADD:
 		err = dsa_switch_mdb_add(ds, info);
 		break;