diff mbox series

[4/4] cpsw: add switchdev support

Message ID 1527144984-31236-5-git-send-email-ilias.apalodimas@linaro.org
State New
Headers show
Series RFC CPSW switchdev mode | expand

Commit Message

Ilias Apalodimas May 24, 2018, 6:56 a.m. UTC
Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>

---
 drivers/net/ethernet/ti/cpsw.c      | 503 +++++++++++++++++++++++++++++++-----
 drivers/net/ethernet/ti/cpsw_priv.h |   9 +-
 2 files changed, 450 insertions(+), 62 deletions(-)

-- 
2.7.4

Comments

Andrew Lunn May 24, 2018, 1:12 p.m. UTC | #1
> @@ -2626,7 +2750,11 @@ static int cpsw_probe_dt(struct cpsw_platform_data *data,

>  	data->mac_control = prop;

>  

>  	if (of_property_read_bool(node, "dual_emac"))

> -		data->dual_emac = 1;

> +		data->switch_mode = CPSW_DUAL_EMAC;

> +

> +	/* switchdev overrides DTS */

> +	if (IS_ENABLED(CONFIG_TI_CPSW_SWITCHDEV))

> +		data->switch_mode = CPSW_SWITCHDEV;


Device tree is supposed to describe the hardware. Using that hardware
in different ways is not something you should describe in DT.

There are also a lot of IS_ENABLED() here, which i don't like. It is a
lot better than #ifdef, but we should try to do better. It would be
good to split this cleanly into three parts. A generic library, which
does not care about DUAL_MAC or SWITCHDEV. A driver which implements
legacy DUAL MAC etc. And a driver which implements SWITCHDEV. We can
then give this new switchdev driver a different compatible. It i still
encoding in device tree how to use the hardware, but it is more
implicit, rather than explicit.

	  Andrew
Ilias Apalodimas May 24, 2018, 1:32 p.m. UTC | #2
On Thu, May 24, 2018 at 03:12:29PM +0200, Andrew Lunn wrote:
> Device tree is supposed to describe the hardware. Using that hardware

> in different ways is not something you should describe in DT.

> 

The new switchdev mode is applied with a .config option in the kernel. What you
see is pre-existing code, so i am not sure if i should change it in this
patchset. Your point is valid though and we are on the same page.
> There are also a lot of IS_ENABLED() here, which i don't like. It is a

> lot better than #ifdef, but we should try to do better. 

I don't like it either i just tried to clean up code in "hot path" with ifdefs.
In theory this should replace "switch mode" in the near future so the ifdefs
will go away
> It would be

> good to split this cleanly into three parts. A generic library, which

> does not care about DUAL_MAC or SWITCHDEV. A driver which implements

> legacy DUAL MAC etc. And a driver which implements SWITCHDEV. We can

> then give this new switchdev driver a different compatible. It i still

> encoding in device tree how to use the hardware, but it is more

> implicit, rather than explicit.

Good idea, i'll sent the next version like that

Thanks,
Ilias
Andrew Lunn May 24, 2018, 4:39 p.m. UTC | #3
On Thu, May 24, 2018 at 04:32:34PM +0300, Ilias Apalodimas wrote:
> On Thu, May 24, 2018 at 03:12:29PM +0200, Andrew Lunn wrote:

> > Device tree is supposed to describe the hardware. Using that hardware

> > in different ways is not something you should describe in DT.

> > 

> The new switchdev mode is applied with a .config option in the kernel. What you

> see is pre-existing code, so i am not sure if i should change it in this

> patchset.


If you break the code up into a library and two drivers, it becomes a
moot point.

But what i don't like here is that the device tree says to do dual
mac. But you ignore that and do sometime else. I would prefer that if
DT says dual mac, and switchdev is compiled in, the probe fails with
EINVAL. Rather than ignore something, make it clear it is invalid.

      Andrew
Ilias Apalodimas May 25, 2018, 4:56 a.m. UTC | #4
On Thu, May 24, 2018 at 06:39:04PM +0200, Andrew Lunn wrote:
> On Thu, May 24, 2018 at 04:32:34PM +0300, Ilias Apalodimas wrote:

> > On Thu, May 24, 2018 at 03:12:29PM +0200, Andrew Lunn wrote:

> > > Device tree is supposed to describe the hardware. Using that hardware

> > > in different ways is not something you should describe in DT.

> > > 

> > The new switchdev mode is applied with a .config option in the kernel. What you

> > see is pre-existing code, so i am not sure if i should change it in this

> > patchset.

> 

> If you break the code up into a library and two drivers, it becomes a

> moot point.

Agree

> 

> But what i don't like here is that the device tree says to do dual

> mac. But you ignore that and do sometime else. I would prefer that if

> DT says dual mac, and switchdev is compiled in, the probe fails with

> EINVAL. Rather than ignore something, make it clear it is invalid.

The switch has 3 modes of operation as is.
1. switch mode, to enable that you don't need to add anything on
the DTS and linux registers a single netdev interface.
2. dual mac mode, this is when you need to add dual_emac; on the DTS.
3. switchdev mode which is controlled by a .config option, since as you 
pointed out DTS was not made for controlling config options. 

I agree that this is far from beautiful. If the driver remains as in though,
i'd prefer either keeping what's there or making "switchdev" a DTS option, 
following the pre-existing erroneous usage rather than making the device 
unusable.  If we end up returning some error and refuse to initialize, users 
that remote upgrade their equipment, without taking a good look at changelog,
will loose access to their devices with no means of remotely fixing that.

Regards
Ilias
Florian Fainelli June 1, 2018, 9:48 p.m. UTC | #5
On 05/24/2018 09:56 PM, Ilias Apalodimas wrote:
> On Thu, May 24, 2018 at 06:39:04PM +0200, Andrew Lunn wrote:

>> On Thu, May 24, 2018 at 04:32:34PM +0300, Ilias Apalodimas wrote:

>>> On Thu, May 24, 2018 at 03:12:29PM +0200, Andrew Lunn wrote:

>>>> Device tree is supposed to describe the hardware. Using that hardware

>>>> in different ways is not something you should describe in DT.

>>>>

>>> The new switchdev mode is applied with a .config option in the kernel. What you

>>> see is pre-existing code, so i am not sure if i should change it in this

>>> patchset.

>>

>> If you break the code up into a library and two drivers, it becomes a

>> moot point.

> Agree

> 

>>

>> But what i don't like here is that the device tree says to do dual

>> mac. But you ignore that and do sometime else. I would prefer that if

>> DT says dual mac, and switchdev is compiled in, the probe fails with

>> EINVAL. Rather than ignore something, make it clear it is invalid.

> The switch has 3 modes of operation as is.

> 1. switch mode, to enable that you don't need to add anything on

> the DTS and linux registers a single netdev interface.

> 2. dual mac mode, this is when you need to add dual_emac; on the DTS.

> 3. switchdev mode which is controlled by a .config option, since as you 

> pointed out DTS was not made for controlling config options. 

> 

> I agree that this is far from beautiful. If the driver remains as in though,

> i'd prefer either keeping what's there or making "switchdev" a DTS option, 

> following the pre-existing erroneous usage rather than making the device 

> unusable.  If we end up returning some error and refuse to initialize, users 

> that remote upgrade their equipment, without taking a good look at changelog,

> will loose access to their devices with no means of remotely fixing that.


It seems to me that the mistake here is seeing multiple modes of
operations for the cpsw. There are not actually many, there is one
usage, and then there is what you can and cannot offload. The basic
premise with switchdev and DSA (which uses switchdev) is that each
user-facing port of your switch needs to work as if it were a normal
Ethernet NIC, that is what you call dual-MAC I believe. Then, when you
create a bridge and you enslave those ports into the bridge, you need to
have forwarding done in hardware between these two ports when the
SMAC/DMAC are not for the host/CPU/management interface and you must
simultaneously still have the host have the ability to send/receive
traffic through the bridge device.

It seems to me like this is entirely doable given that the dual MAC use
case is supported already.

switchdev is just a stateless framework to get notified from the
networking stack about what you can possibly offload in hardware, so
having a DTS option gate that is unfortunately wrong because it is
really implementing a SW policy in DTS which is not what it is meant for.
-- 
Florian
Ilias Apalodimas June 2, 2018, 10:34 a.m. UTC | #6
Hi Florian, 

Thanks for taking time to look into this

On Fri, Jun 01, 2018 at 02:48:48PM -0700, Florian Fainelli wrote:
> 

> 

> On 05/24/2018 09:56 PM, Ilias Apalodimas wrote:

> > On Thu, May 24, 2018 at 06:39:04PM +0200, Andrew Lunn wrote:

> >> On Thu, May 24, 2018 at 04:32:34PM +0300, Ilias Apalodimas wrote:

> >>> On Thu, May 24, 2018 at 03:12:29PM +0200, Andrew Lunn wrote:

> >>>> Device tree is supposed to describe the hardware. Using that hardware

> >>>> in different ways is not something you should describe in DT.

> >>>>

> >>> The new switchdev mode is applied with a .config option in the kernel. What you

> >>> see is pre-existing code, so i am not sure if i should change it in this

> >>> patchset.

> >>

> >> If you break the code up into a library and two drivers, it becomes a

> >> moot point.

> > Agree

> > 

> >>

> >> But what i don't like here is that the device tree says to do dual

> >> mac. But you ignore that and do sometime else. I would prefer that if

> >> DT says dual mac, and switchdev is compiled in, the probe fails with

> >> EINVAL. Rather than ignore something, make it clear it is invalid.

> > The switch has 3 modes of operation as is.

> > 1. switch mode, to enable that you don't need to add anything on

> > the DTS and linux registers a single netdev interface.

> > 2. dual mac mode, this is when you need to add dual_emac; on the DTS.

> > 3. switchdev mode which is controlled by a .config option, since as you 

> > pointed out DTS was not made for controlling config options. 

> > 

> > I agree that this is far from beautiful. If the driver remains as in though,

> > i'd prefer either keeping what's there or making "switchdev" a DTS option, 

> > following the pre-existing erroneous usage rather than making the device 

> > unusable.  If we end up returning some error and refuse to initialize, users 

> > that remote upgrade their equipment, without taking a good look at changelog,

> > will loose access to their devices with no means of remotely fixing that.

> 

> It seems to me that the mistake here is seeing multiple modes of

> operations for the cpsw. There are not actually many, there is one

> usage, and then there is what you can and cannot offload. 

CPSW has in fact 2 modes of operation, different FIFO usage/lookup entry(it's
called ALE in the current driver) by-pass(which is used in dual emac for 
example) and other features. Again Grygorii is better suited to answer the 
exact differences.
> The basic

> premise with switchdev and DSA (which uses switchdev) is that each

> user-facing port of your switch needs to work as if it were a normal

> Ethernet NIC, that is what you call dual-MAC I believe. Then, when you

> create a bridge and you enslave those ports into the bridge, you need to

> have forwarding done in hardware between these two ports when the

> SMAC/DMAC are not for the host/CPU/management interface and you must

> simultaneously still have the host have the ability to send/receive

> traffic through the bridge device.

Yes dual emac does that. But dual emac configures the port facing VLAN to the
CPU port as well. So dual emac splits and uses 2 interfaces. VLAN 1 is
configured on port1 + CPU port and VLAN 2 is confired on port 2 + CPU port
That's exactly what the current RFC does as well, with the addition of
registering a sw0p0 (i'll explain why later on this mail)
A little more detail on the issue we are having. On my description 
sw0p0 -> CPU port, sw0p1 -> port 1 sw0p2 -> port 2. sw0p1/sw0p2 are the ports
that have PHYs attached. 

When we start in the new switchdev mode all interfaces are added to VLAN 0
so CPU port + port1 + port2 are all in the same VLAN group. In that case sw0p1
and sw0p2 are working as you describe. So those 2 interfaces can send/receive
traffic normally which matches the switchdev case.

When we add them on a bridge(let's say br0), VLAN1(or any default bridge VLAN)
is now configured on sw0p1 and sw0p2 but *not* on the CPU port. 
From this point on the whole fuunctionality just collapses. The switch will 
work and offload traffic between sw0p1/sw0p2 but the bridge won't be able to 
get an ip address (since VLAN1 is not a member of the CPU port and the packet 
gets dropped). 
IGMPv2/V3 messages will never reach the br_multicast.c code to trigger 
switchdev and configure the MDBs on the ports.  i am prety sure there are other
fail scenarios which i haven't discovered already, but those 2 are the most 
basic ones.  If we add VLAN1 on the CPU port, everything works as intended of 
course.

That's the reason we registered sw0p0 as the CPU port. It can't do any "real"
traffic, but you can configure the CPU port independantly and not be forced to
do an OR on every VLAN add/delete grouping the CPU port with your port command.
The TL;DR version of this is that the switch is working exactly as switchdev is
expecting offloading traffic to the hardware when possible as long as the CPU
port is member of the proper VLANs

Petr's patch solves this for us (9c86ce2c1ae337fc10568a12aea812ed03de8319).
We can now do "bridge vlan add dev br0 vid 100 pvid untagged self" and decide
when to add the CPU port or not. 

There are still a couple of cases that are not covered though, if we don't 
register the CPU port. We cant decide when to forward multicast
traffic on the CPU port if a join hasn't been sent from br0.
So let's say you got 2 hosts doing multicast and for whatever reason the host
wants to see that traffic. 
With the CPU port present you can do a 
"bridge mdb add dev br0 port sw0p0 grp 239.1.1.1 permanent" which will offload
the traffic to the CPU port and thus the host. If this goes away we are forced
to send a join.
 
> It seems to me like this is entirely doable given that the dual MAC use

> case is supported already.

> 

> switchdev is just a stateless framework to get notified from the

> networking stack about what you can possibly offload in hardware, so

> having a DTS option gate that is unfortunately wrong because it is

> really implementing a SW policy in DTS which is not what it is meant for.

The DTS option for configuration pre-existed, i don't like that either switchdev
mode is activated by a .config option not DTS(it just overrides whatever config 
you have on the DTS). Far from pretty though fair point, i am open to 
suggestions.
> -- 

> Florian


Thanks!
Ilias
Florian Fainelli June 2, 2018, 4:10 p.m. UTC | #7
On June 2, 2018 3:34:32 AM MST, Ilias Apalodimas <ilias.apalodimas@linaro.org> wrote:
>Hi Florian, 

>

>Thanks for taking time to look into this

>

>On Fri, Jun 01, 2018 at 02:48:48PM -0700, Florian Fainelli wrote:

>> 

>> 

>> On 05/24/2018 09:56 PM, Ilias Apalodimas wrote:

>> > On Thu, May 24, 2018 at 06:39:04PM +0200, Andrew Lunn wrote:

>> >> On Thu, May 24, 2018 at 04:32:34PM +0300, Ilias Apalodimas wrote:

>> >>> On Thu, May 24, 2018 at 03:12:29PM +0200, Andrew Lunn wrote:

>> >>>> Device tree is supposed to describe the hardware. Using that

>hardware

>> >>>> in different ways is not something you should describe in DT.

>> >>>>

>> >>> The new switchdev mode is applied with a .config option in the

>kernel. What you

>> >>> see is pre-existing code, so i am not sure if i should change it

>in this

>> >>> patchset.

>> >>

>> >> If you break the code up into a library and two drivers, it

>becomes a

>> >> moot point.

>> > Agree

>> > 

>> >>

>> >> But what i don't like here is that the device tree says to do dual

>> >> mac. But you ignore that and do sometime else. I would prefer that

>if

>> >> DT says dual mac, and switchdev is compiled in, the probe fails

>with

>> >> EINVAL. Rather than ignore something, make it clear it is invalid.

>> > The switch has 3 modes of operation as is.

>> > 1. switch mode, to enable that you don't need to add anything on

>> > the DTS and linux registers a single netdev interface.

>> > 2. dual mac mode, this is when you need to add dual_emac; on the

>DTS.

>> > 3. switchdev mode which is controlled by a .config option, since as

>you 

>> > pointed out DTS was not made for controlling config options. 

>> > 

>> > I agree that this is far from beautiful. If the driver remains as

>in though,

>> > i'd prefer either keeping what's there or making "switchdev" a DTS

>option, 

>> > following the pre-existing erroneous usage rather than making the

>device 

>> > unusable.  If we end up returning some error and refuse to

>initialize, users 

>> > that remote upgrade their equipment, without taking a good look at

>changelog,

>> > will loose access to their devices with no means of remotely fixing

>that.

>> 

>> It seems to me that the mistake here is seeing multiple modes of

>> operations for the cpsw. There are not actually many, there is one

>> usage, and then there is what you can and cannot offload. 

>CPSW has in fact 2 modes of operation, different FIFO usage/lookup

>entry(it's

>called ALE in the current driver) by-pass(which is used in dual emac

>for 

>example) and other features. Again Grygorii is better suited to answer

>the 

>exact differences.

>> The basic

>> premise with switchdev and DSA (which uses switchdev) is that each

>> user-facing port of your switch needs to work as if it were a normal

>> Ethernet NIC, that is what you call dual-MAC I believe. Then, when

>you

>> create a bridge and you enslave those ports into the bridge, you need

>to

>> have forwarding done in hardware between these two ports when the

>> SMAC/DMAC are not for the host/CPU/management interface and you must

>> simultaneously still have the host have the ability to send/receive

>> traffic through the bridge device.

>Yes dual emac does that. But dual emac configures the port facing VLAN

>to the

>CPU port as well. So dual emac splits and uses 2 interfaces. VLAN 1 is

>configured on port1 + CPU port and VLAN 2 is confired on port 2 + CPU

>port

>That's exactly what the current RFC does as well, with the addition of

>registering a sw0p0 (i'll explain why later on this mail)

>A little more detail on the issue we are having. On my description 

>sw0p0 -> CPU port, sw0p1 -> port 1 sw0p2 -> port 2. sw0p1/sw0p2 are the

>ports

>that have PHYs attached. 

>

>When we start in the new switchdev mode all interfaces are added to

>VLAN 0

>so CPU port + port1 + port2 are all in the same VLAN group. In that

>case sw0p1

>and sw0p2 are working as you describe. So those 2 interfaces can

>send/receive

>traffic normally which matches the switchdev case.

>

>When we add them on a bridge(let's say br0), VLAN1(or any default

>bridge VLAN)

>is now configured on sw0p1 and sw0p2 but *not* on the CPU port. 

>From this point on the whole fuunctionality just collapses. The switch

>will 

>work and offload traffic between sw0p1/sw0p2 but the bridge won't be

>able to 

>get an ip address (since VLAN1 is not a member of the CPU port and the

>packet 

>gets dropped). 

>IGMPv2/V3 messages will never reach the br_multicast.c code to trigger 

>switchdev and configure the MDBs on the ports.  i am prety sure there

>are other

>fail scenarios which i haven't discovered already, but those 2 are the

>most 

>basic ones.  If we add VLAN1 on the CPU port, everything works as

>intended of 

>course.

>

>That's the reason we registered sw0p0 as the CPU port. It can't do any

>"real"

>traffic, but you can configure the CPU port independantly and not be

>forced to

>do an OR on every VLAN add/delete grouping the CPU port with your port

>command.

>The TL;DR version of this is that the switch is working exactly as

>switchdev is

>expecting offloading traffic to the hardware when possible as long as

>the CPU

>port is member of the proper VLANs

>

>Petr's patch solves this for us

>(9c86ce2c1ae337fc10568a12aea812ed03de8319).

>We can now do "bridge vlan add dev br0 vid 100 pvid untagged self" and

>decide

>when to add the CPU port or not. 

>

>There are still a couple of cases that are not covered though, if we

>don't 

>register the CPU port. We cant decide when to forward multicast

>traffic on the CPU port if a join hasn't been sent from br0.

>So let's say you got 2 hosts doing multicast and for whatever reason

>the host

>wants to see that traffic. 

>With the CPU port present you can do a 

>"bridge mdb add dev br0 port sw0p0 grp 239.1.1.1 permanent" which will

>offload

>the traffic to the CPU port and thus the host. If this goes away we are

>forced

>to send a join.


Thanks for the detailed explanation. Somehow I was under the impression that cpsw had the ability, through specific DMA descriptor bits to direct traffic towards one external port or another and conversely, have that information from the HW when receiving packets. What you describe is exactly the same problem we have in DSA when the switch advertises DSA_TAG_PROTO_NONE where only VLAN tags could help differentiate traffic from external ports. At some point there was a discuss of making DSA_TAG_PROTO_NONE automatically create one VLAN per port but this is a good source for other problems...

Looking forward to your follow-up patch series!

-- 
Florian
Ilias Apalodimas June 2, 2018, 4:52 p.m. UTC | #8
On Sat, Jun 02, 2018 at 09:10:08AM -0700, Florian Fainelli wrote:
> On June 2, 2018 3:34:32 AM MST, Ilias Apalodimas <ilias.apalodimas@linaro.org> wrote:

> >Hi Florian, 

> >

> >Thanks for taking time to look into this

> >

> >On Fri, Jun 01, 2018 at 02:48:48PM -0700, Florian Fainelli wrote:

> >> 

> >> 

> >> On 05/24/2018 09:56 PM, Ilias Apalodimas wrote:

> >> > On Thu, May 24, 2018 at 06:39:04PM +0200, Andrew Lunn wrote:

> >> >> On Thu, May 24, 2018 at 04:32:34PM +0300, Ilias Apalodimas wrote:

> >> >>> On Thu, May 24, 2018 at 03:12:29PM +0200, Andrew Lunn wrote:

> >> >>>> Device tree is supposed to describe the hardware. Using that

> >hardware

> >> >>>> in different ways is not something you should describe in DT.

> >> >>>>

> >> >>> The new switchdev mode is applied with a .config option in the

> >kernel. What you

> >> >>> see is pre-existing code, so i am not sure if i should change it

> >in this

> >> >>> patchset.

> >> >>

> >> >> If you break the code up into a library and two drivers, it

> >becomes a

> >> >> moot point.

> >> > Agree

> >> > 

> >> >>

> >> >> But what i don't like here is that the device tree says to do dual

> >> >> mac. But you ignore that and do sometime else. I would prefer that

> >if

> >> >> DT says dual mac, and switchdev is compiled in, the probe fails

> >with

> >> >> EINVAL. Rather than ignore something, make it clear it is invalid.

> >> > The switch has 3 modes of operation as is.

> >> > 1. switch mode, to enable that you don't need to add anything on

> >> > the DTS and linux registers a single netdev interface.

> >> > 2. dual mac mode, this is when you need to add dual_emac; on the

> >DTS.

> >> > 3. switchdev mode which is controlled by a .config option, since as

> >you 

> >> > pointed out DTS was not made for controlling config options. 

> >> > 

> >> > I agree that this is far from beautiful. If the driver remains as

> >in though,

> >> > i'd prefer either keeping what's there or making "switchdev" a DTS

> >option, 

> >> > following the pre-existing erroneous usage rather than making the

> >device 

> >> > unusable.  If we end up returning some error and refuse to

> >initialize, users 

> >> > that remote upgrade their equipment, without taking a good look at

> >changelog,

> >> > will loose access to their devices with no means of remotely fixing

> >that.

> >> 

> >> It seems to me that the mistake here is seeing multiple modes of

> >> operations for the cpsw. There are not actually many, there is one

> >> usage, and then there is what you can and cannot offload. 

> >CPSW has in fact 2 modes of operation, different FIFO usage/lookup

> >entry(it's

> >called ALE in the current driver) by-pass(which is used in dual emac

> >for 

> >example) and other features. Again Grygorii is better suited to answer

> >the 

> >exact differences.

> >> The basic

> >> premise with switchdev and DSA (which uses switchdev) is that each

> >> user-facing port of your switch needs to work as if it were a normal

> >> Ethernet NIC, that is what you call dual-MAC I believe. Then, when

> >you

> >> create a bridge and you enslave those ports into the bridge, you need

> >to

> >> have forwarding done in hardware between these two ports when the

> >> SMAC/DMAC are not for the host/CPU/management interface and you must

> >> simultaneously still have the host have the ability to send/receive

> >> traffic through the bridge device.

> >Yes dual emac does that. But dual emac configures the port facing VLAN

> >to the

> >CPU port as well. So dual emac splits and uses 2 interfaces. VLAN 1 is

> >configured on port1 + CPU port and VLAN 2 is confired on port 2 + CPU

> >port

> >That's exactly what the current RFC does as well, with the addition of

> >registering a sw0p0 (i'll explain why later on this mail)

> >A little more detail on the issue we are having. On my description 

> >sw0p0 -> CPU port, sw0p1 -> port 1 sw0p2 -> port 2. sw0p1/sw0p2 are the

> >ports

> >that have PHYs attached. 

> >

> >When we start in the new switchdev mode all interfaces are added to

> >VLAN 0

> >so CPU port + port1 + port2 are all in the same VLAN group. In that

> >case sw0p1

> >and sw0p2 are working as you describe. So those 2 interfaces can

> >send/receive

> >traffic normally which matches the switchdev case.

> >

> >When we add them on a bridge(let's say br0), VLAN1(or any default

> >bridge VLAN)

> >is now configured on sw0p1 and sw0p2 but *not* on the CPU port. 

> >From this point on the whole fuunctionality just collapses. The switch

> >will 

> >work and offload traffic between sw0p1/sw0p2 but the bridge won't be

> >able to 

> >get an ip address (since VLAN1 is not a member of the CPU port and the

> >packet 

> >gets dropped). 

> >IGMPv2/V3 messages will never reach the br_multicast.c code to trigger 

> >switchdev and configure the MDBs on the ports.  i am prety sure there

> >are other

> >fail scenarios which i haven't discovered already, but those 2 are the

> >most 

> >basic ones.  If we add VLAN1 on the CPU port, everything works as

> >intended of 

> >course.

> >

> >That's the reason we registered sw0p0 as the CPU port. It can't do any

> >"real"

> >traffic, but you can configure the CPU port independantly and not be

> >forced to

> >do an OR on every VLAN add/delete grouping the CPU port with your port

> >command.

> >The TL;DR version of this is that the switch is working exactly as

> >switchdev is

> >expecting offloading traffic to the hardware when possible as long as

> >the CPU

> >port is member of the proper VLANs

> >

> >Petr's patch solves this for us

> >(9c86ce2c1ae337fc10568a12aea812ed03de8319).

> >We can now do "bridge vlan add dev br0 vid 100 pvid untagged self" and

> >decide

> >when to add the CPU port or not. 

> >

> >There are still a couple of cases that are not covered though, if we

> >don't 

> >register the CPU port. We cant decide when to forward multicast

> >traffic on the CPU port if a join hasn't been sent from br0.

> >So let's say you got 2 hosts doing multicast and for whatever reason

> >the host

> >wants to see that traffic. 

> >With the CPU port present you can do a 

> >"bridge mdb add dev br0 port sw0p0 grp 239.1.1.1 permanent" which will

> >offload

> >the traffic to the CPU port and thus the host. If this goes away we are

> >forced

> >to send a join.

> 

> Thanks for the detailed explanation. Somehow I was under the impression that cpsw had the ability, through specific DMA descriptor bits to direct traffic towards one external port or another and conversely, have that information from the HW when receiving packets. 

That's one mode of operation when by-passing the ALE if my understanding of
the hardware is correct. You can choose not to do that. I am still earning all
the details of the ahrdware myself. On Rx though you still need the CPU to
participate to receive the packet(and yes the descriptor indicates the port)

> What you describe is exactly the same problem we have in DSA when the switch advertises DSA_TAG_PROTO_NONE where only VLAN tags could help differentiate traffic from external ports. At some point there was a discuss of making DSA_TAG_PROTO_NONE automatically create one VLAN per port but this is a good source for other problems...

I am pretty sure i am not the first one to encounter this kind of problems
> 

> Looking forward to your follow-up patch series!

Will do!
> 

> -- 

> Florian


Thanks
Ilias
Grygorii Strashko June 5, 2018, 9:03 p.m. UTC | #9
On 06/02/2018 05:34 AM, Ilias Apalodimas wrote:
> Hi Florian,

> 

> Thanks for taking time to look into this

> 

> On Fri, Jun 01, 2018 at 02:48:48PM -0700, Florian Fainelli wrote:

>>

>>

>> On 05/24/2018 09:56 PM, Ilias Apalodimas wrote:

>>> On Thu, May 24, 2018 at 06:39:04PM +0200, Andrew Lunn wrote:

>>>> On Thu, May 24, 2018 at 04:32:34PM +0300, Ilias Apalodimas wrote:

>>>>> On Thu, May 24, 2018 at 03:12:29PM +0200, Andrew Lunn wrote:

>>>>>> Device tree is supposed to describe the hardware. Using that hardware

>>>>>> in different ways is not something you should describe in DT.

>>>>>>

>>>>> The new switchdev mode is applied with a .config option in the kernel. What you

>>>>> see is pre-existing code, so i am not sure if i should change it in this

>>>>> patchset.

>>>>

>>>> If you break the code up into a library and two drivers, it becomes a

>>>> moot point.

>>> Agree

>>>

>>>>

>>>> But what i don't like here is that the device tree says to do dual

>>>> mac. But you ignore that and do sometime else. I would prefer that if

>>>> DT says dual mac, and switchdev is compiled in, the probe fails with

>>>> EINVAL. Rather than ignore something, make it clear it is invalid.

>>> The switch has 3 modes of operation as is.

>>> 1. switch mode, to enable that you don't need to add anything on

>>> the DTS and linux registers a single netdev interface.

>>> 2. dual mac mode, this is when you need to add dual_emac; on the DTS.

>>> 3. switchdev mode which is controlled by a .config option, since as you

>>> pointed out DTS was not made for controlling config options.

>>>

>>> I agree that this is far from beautiful. If the driver remains as in though,

>>> i'd prefer either keeping what's there or making "switchdev" a DTS option,

>>> following the pre-existing erroneous usage rather than making the device

>>> unusable.  If we end up returning some error and refuse to initialize, users

>>> that remote upgrade their equipment, without taking a good look at changelog,

>>> will loose access to their devices with no means of remotely fixing that.

>>

>> It seems to me that the mistake here is seeing multiple modes of

>> operations for the cpsw. There are not actually many, there is one

>> usage, and then there is what you can and cannot offload.


> CPSW has in fact 2 modes of operation, different FIFO usage/lookup entry(it's

> called ALE in the current driver) by-pass(which is used in dual emac for

> example) and other features. Again Grygorii is better suited to answer the

> exact differences.


dual_mac is HW enabled mode of operation, so having DT option is pretty
reasonable as for me. 
1) when enabled it configures internal FIFOs in special way so both
external Ports became equal in the direction toward to Host port 0.

TRM "The intention of this mode is to allow packets from both ethernet ports
to be written into the FIFO without one port starving the other port."

2) ALE, out of the box, does not support this mode and, as result, two
default vlan have to be created to direct traffic P1->P0 (vlan1) and
P2-P0 (vlan2), but any packets P0->P1 and P0->P2 have to be directed
(and will bypass ALE). This way traffic separated on cpsw egress towards to P0, 

>> The basic

>> premise with switchdev and DSA (which uses switchdev) is that each

>> user-facing port of your switch needs to work as if it were a normal

>> Ethernet NIC, that is what you call dual-MAC I believe. Then, when you

>> create a bridge and you enslave those ports into the bridge, you need to

>> have forwarding done in hardware between these two ports when the

>> SMAC/DMAC are not for the host/CPU/management interface and you must

>> simultaneously still have the host have the ability to send/receive

>> traffic through the bridge device.


TRM
"When operating in dual mac mode the intention is to transfer packets between ports 0 and 1 and ports 0
and 2, but not between ports 1 and 2. Each CPGMAC_SL appears as a single MAC with no bridging
between MAC’s. Each CPGMAC_SL has at least one unique (not the same) mac address."

So, we cant't talk about forwarding in dual_mac mode. Interfaces expected to be 
completely independent without any packet leaking between interfaces.

!! BUT !! CPSW ALE (switch core) still expected to be used, but for packet filtering
- only registered vlans
- only registered mcast/bcast
- ingress mcast/bcast rate limiting (it's actually more like coalescing -
 limits number of mcast/bcast packets per sec. 

And all offloading ALE (val/mdb) entries should always contain two ports in masks:
p1&p0 or p2&p0. Never ever all three ports. 

> Yes dual emac does that. But dual emac configures the port facing VLAN to the

> CPU port as well. So dual emac splits and uses 2 interfaces. VLAN 1 is

> configured on port1 + CPU port and VLAN 2 is confired on port 2 + CPU port

> That's exactly what the current RFC does as well, with the addition of

> registering a sw0p0 (i'll explain why later on this mail)

> A little more detail on the issue we are having. On my description

> sw0p0 -> CPU port, sw0p1 -> port 1 sw0p2 -> port 2. sw0p1/sw0p2 are the ports

> that have PHYs attached.

> 

> When we start in the new switchdev mode all interfaces are added to VLAN 0

> so CPU port + port1 + port2 are all in the same VLAN group. In that case sw0p1

> and sw0p2 are working as you describe. So those 2 interfaces can send/receive

> traffic normally which matches the switchdev case.

> 

> When we add them on a bridge(let's say br0), VLAN1(or any default bridge VLAN)

> is now configured on sw0p1 and sw0p2 but *not* on the CPU port.

>  From this point on the whole fuunctionality just collapses. The switch will

> work and offload traffic between sw0p1/sw0p2 but the bridge won't be able to

> get an ip address (since VLAN1 is not a member of the CPU port and the packet

> gets dropped).

> IGMPv2/V3 messages will never reach the br_multicast.c code to trigger

> switchdev and configure the MDBs on the ports.  i am prety sure there are other

> fail scenarios which i haven't discovered already, but those 2 are the most

> basic ones.  If we add VLAN1 on the CPU port, everything works as intended of

> course.

> 

> That's the reason we registered sw0p0 as the CPU port. It can't do any "real"

> traffic, but you can configure the CPU port independantly and not be forced to

> do an OR on every VLAN add/delete grouping the CPU port with your port command.

> The TL;DR version of this is that the switch is working exactly as switchdev is

> expecting offloading traffic to the hardware when possible as long as the CPU

> port is member of the proper VLANs

> 

> Petr's patch solves this for us (9c86ce2c1ae337fc10568a12aea812ed03de8319).

> We can now do "bridge vlan add dev br0 vid 100 pvid untagged self" and decide

> when to add the CPU port or not.

> 

> There are still a couple of cases that are not covered though, if we don't

> register the CPU port. We cant decide when to forward multicast

> traffic on the CPU port if a join hasn't been sent from br0.

> So let's say you got 2 hosts doing multicast and for whatever reason the host

> wants to see that traffic.

> With the CPU port present you can do a

> "bridge mdb add dev br0 port sw0p0 grp 239.1.1.1 permanent" which will offload

> the traffic to the CPU port and thus the host. If this goes away we are forced

> to send a join.

>   

>> It seems to me like this is entirely doable given that the dual MAC use

>> case is supported already.

>>

>> switchdev is just a stateless framework to get notified from the

>> networking stack about what you can possibly offload in hardware, so

>> having a DTS option gate that is unfortunately wrong because it is

>> really implementing a SW policy in DTS which is not what it is meant for.

> The DTS option for configuration pre-existed, i don't like that either switchdev

> mode is activated by a .config option not DTS(it just overrides whatever config

> you have on the DTS). Far from pretty though fair point, i am open to

> suggestions.


Again this is option describing HW mode which not expected to be changed on the fly.
I'm honestly, propose descope dual_mac and concentrate on switch mode (switchdev) - 
right now I don't see how dual_mac can be supported with switchdev as per above.
The same way as I do not see how we can re-use switchdev with 50% of devices which 
have "only one" user-facing external port (P1 or P2).


-- 
regards,
-grygorii
Florian Fainelli June 5, 2018, 9:37 p.m. UTC | #10
On 06/05/2018 02:03 PM, Grygorii Strashko wrote:
> 

> 

> On 06/02/2018 05:34 AM, Ilias Apalodimas wrote:

>> Hi Florian,

>>

>> Thanks for taking time to look into this

>>

>> On Fri, Jun 01, 2018 at 02:48:48PM -0700, Florian Fainelli wrote:

>>>

>>>

>>> On 05/24/2018 09:56 PM, Ilias Apalodimas wrote:

>>>> On Thu, May 24, 2018 at 06:39:04PM +0200, Andrew Lunn wrote:

>>>>> On Thu, May 24, 2018 at 04:32:34PM +0300, Ilias Apalodimas wrote:

>>>>>> On Thu, May 24, 2018 at 03:12:29PM +0200, Andrew Lunn wrote:

>>>>>>> Device tree is supposed to describe the hardware. Using that hardware

>>>>>>> in different ways is not something you should describe in DT.

>>>>>>>

>>>>>> The new switchdev mode is applied with a .config option in the kernel. What you

>>>>>> see is pre-existing code, so i am not sure if i should change it in this

>>>>>> patchset.

>>>>>

>>>>> If you break the code up into a library and two drivers, it becomes a

>>>>> moot point.

>>>> Agree

>>>>

>>>>>

>>>>> But what i don't like here is that the device tree says to do dual

>>>>> mac. But you ignore that and do sometime else. I would prefer that if

>>>>> DT says dual mac, and switchdev is compiled in, the probe fails with

>>>>> EINVAL. Rather than ignore something, make it clear it is invalid.

>>>> The switch has 3 modes of operation as is.

>>>> 1. switch mode, to enable that you don't need to add anything on

>>>> the DTS and linux registers a single netdev interface.

>>>> 2. dual mac mode, this is when you need to add dual_emac; on the DTS.

>>>> 3. switchdev mode which is controlled by a .config option, since as you

>>>> pointed out DTS was not made for controlling config options.

>>>>

>>>> I agree that this is far from beautiful. If the driver remains as in though,

>>>> i'd prefer either keeping what's there or making "switchdev" a DTS option,

>>>> following the pre-existing erroneous usage rather than making the device

>>>> unusable.  If we end up returning some error and refuse to initialize, users

>>>> that remote upgrade their equipment, without taking a good look at changelog,

>>>> will loose access to their devices with no means of remotely fixing that.

>>>

>>> It seems to me that the mistake here is seeing multiple modes of

>>> operations for the cpsw. There are not actually many, there is one

>>> usage, and then there is what you can and cannot offload.

> 

>> CPSW has in fact 2 modes of operation, different FIFO usage/lookup entry(it's

>> called ALE in the current driver) by-pass(which is used in dual emac for

>> example) and other features. Again Grygorii is better suited to answer the

>> exact differences.

> 

> dual_mac is HW enabled mode of operation, so having DT option is pretty

> reasonable as for me. 

> 1) when enabled it configures internal FIFOs in special way so both

> external Ports became equal in the direction toward to Host port 0.

> 

> TRM "The intention of this mode is to allow packets from both ethernet ports

> to be written into the FIFO without one port starving the other port."

> 

> 2) ALE, out of the box, does not support this mode and, as result, two

> default vlan have to be created to direct traffic P1->P0 (vlan1) and

> P2-P0 (vlan2), but any packets P0->P1 and P0->P2 have to be directed

> (and will bypass ALE). This way traffic separated on cpsw egress towards to P0, 

> 

>>> The basic

>>> premise with switchdev and DSA (which uses switchdev) is that each

>>> user-facing port of your switch needs to work as if it were a normal

>>> Ethernet NIC, that is what you call dual-MAC I believe. Then, when you

>>> create a bridge and you enslave those ports into the bridge, you need to

>>> have forwarding done in hardware between these two ports when the

>>> SMAC/DMAC are not for the host/CPU/management interface and you must

>>> simultaneously still have the host have the ability to send/receive

>>> traffic through the bridge device.

> 

> TRM

> "When operating in dual mac mode the intention is to transfer packets between ports 0 and 1 and ports 0

> and 2, but not between ports 1 and 2. Each CPGMAC_SL appears as a single MAC with no bridging

> between MAC’s. Each CPGMAC_SL has at least one unique (not the same) mac address."

> 

> So, we cant't talk about forwarding in dual_mac mode. Interfaces expected to be 

> completely independent without any packet leaking between interfaces.

> 

> !! BUT !! CPSW ALE (switch core) still expected to be used, but for packet filtering

> - only registered vlans

> - only registered mcast/bcast

> - ingress mcast/bcast rate limiting (it's actually more like coalescing -

>  limits number of mcast/bcast packets per sec. 

> 

> And all offloading ALE (val/mdb) entries should always contain two ports in masks:

> p1&p0 or p2&p0. Never ever all three ports. 

> 

>> Yes dual emac does that. But dual emac configures the port facing VLAN to the

>> CPU port as well. So dual emac splits and uses 2 interfaces. VLAN 1 is

>> configured on port1 + CPU port and VLAN 2 is confired on port 2 + CPU port

>> That's exactly what the current RFC does as well, with the addition of

>> registering a sw0p0 (i'll explain why later on this mail)

>> A little more detail on the issue we are having. On my description

>> sw0p0 -> CPU port, sw0p1 -> port 1 sw0p2 -> port 2. sw0p1/sw0p2 are the ports

>> that have PHYs attached.

>>

>> When we start in the new switchdev mode all interfaces are added to VLAN 0

>> so CPU port + port1 + port2 are all in the same VLAN group. In that case sw0p1

>> and sw0p2 are working as you describe. So those 2 interfaces can send/receive

>> traffic normally which matches the switchdev case.

>>

>> When we add them on a bridge(let's say br0), VLAN1(or any default bridge VLAN)

>> is now configured on sw0p1 and sw0p2 but *not* on the CPU port.

>>  From this point on the whole fuunctionality just collapses. The switch will

>> work and offload traffic between sw0p1/sw0p2 but the bridge won't be able to

>> get an ip address (since VLAN1 is not a member of the CPU port and the packet

>> gets dropped).

>> IGMPv2/V3 messages will never reach the br_multicast.c code to trigger

>> switchdev and configure the MDBs on the ports.  i am prety sure there are other

>> fail scenarios which i haven't discovered already, but those 2 are the most

>> basic ones.  If we add VLAN1 on the CPU port, everything works as intended of

>> course.

>>

>> That's the reason we registered sw0p0 as the CPU port. It can't do any "real"

>> traffic, but you can configure the CPU port independantly and not be forced to

>> do an OR on every VLAN add/delete grouping the CPU port with your port command.

>> The TL;DR version of this is that the switch is working exactly as switchdev is

>> expecting offloading traffic to the hardware when possible as long as the CPU

>> port is member of the proper VLANs

>>

>> Petr's patch solves this for us (9c86ce2c1ae337fc10568a12aea812ed03de8319).

>> We can now do "bridge vlan add dev br0 vid 100 pvid untagged self" and decide

>> when to add the CPU port or not.

>>

>> There are still a couple of cases that are not covered though, if we don't

>> register the CPU port. We cant decide when to forward multicast

>> traffic on the CPU port if a join hasn't been sent from br0.

>> So let's say you got 2 hosts doing multicast and for whatever reason the host

>> wants to see that traffic.

>> With the CPU port present you can do a

>> "bridge mdb add dev br0 port sw0p0 grp 239.1.1.1 permanent" which will offload

>> the traffic to the CPU port and thus the host. If this goes away we are forced

>> to send a join.

>>   

>>> It seems to me like this is entirely doable given that the dual MAC use

>>> case is supported already.

>>>

>>> switchdev is just a stateless framework to get notified from the

>>> networking stack about what you can possibly offload in hardware, so

>>> having a DTS option gate that is unfortunately wrong because it is

>>> really implementing a SW policy in DTS which is not what it is meant for.

>> The DTS option for configuration pre-existed, i don't like that either switchdev

>> mode is activated by a .config option not DTS(it just overrides whatever config

>> you have on the DTS). Far from pretty though fair point, i am open to

>> suggestions.

> 

> Again this is option describing HW mode which not expected to be changed on the fly.

> I'm honestly, propose descope dual_mac and concentrate on switch mode (switchdev) - 

> right now I don't see how dual_mac can be supported with switchdev as per above.

> The same way as I do not see how we can re-use switchdev with 50% of devices which 

> have "only one" user-facing external port (P1 or P2).


This is still not appropriate for Device Tree, because this is
completely orthogonal from one another. Also, I think you tend to
conflate what the switch can do, with in which mode does it start by
default, which are, again, two different things.
-- 
Florian
diff mbox series

Patch

diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
index b16e7cf..8f8ebd8 100644
--- a/drivers/net/ethernet/ti/cpsw.c
+++ b/drivers/net/ethernet/ti/cpsw.c
@@ -18,12 +18,10 @@ 
 #include <linux/clk.h>
 #include <linux/timer.h>
 #include <linux/module.h>
-#include <linux/platform_device.h>
 #include <linux/irqreturn.h>
 #include <linux/interrupt.h>
 #include <linux/if_ether.h>
 #include <linux/etherdevice.h>
-#include <linux/netdevice.h>
 #include <linux/net_tstamp.h>
 #include <linux/phy.h>
 #include <linux/workqueue.h>
@@ -42,6 +40,7 @@ 
 #include "cpsw.h"
 #include "cpsw_ale.h"
 #include "cpsw_priv.h"
+#include "cpsw_switchdev.h"
 #include "cpts.h"
 #include "davinci_cpdma.h"
 
@@ -146,9 +145,6 @@  do {								\
 #define CPSW_CMINTMAX_INTVL	(1000 / CPSW_CMINTMIN_CNT)
 #define CPSW_CMINTMIN_INTVL	((1000 / CPSW_CMINTMAX_CNT) + 1)
 
-#define cpsw_slave_index(cpsw, priv)				\
-		((cpsw->data.dual_emac) ? priv->emac_port :	\
-		cpsw->data.active_slave)
 #define IRQ_NUM			2
 #define CPSW_MAX_QUEUES		8
 #define CPSW_CPDMA_DESCS_POOL_SIZE_DEFAULT 256
@@ -360,6 +356,13 @@  struct cpsw_hw_stats {
 	u32	rxdmaoverruns;
 };
 
+struct cpsw_switchdev_event_work {
+	struct work_struct work;
+	struct switchdev_notifier_fdb_info fdb_info;
+	struct cpsw_priv *priv;
+	unsigned long event;
+};
+
 #define CPSW_STAT(m)		CPSW_STATS,				\
 				sizeof(((struct cpsw_hw_stats *)0)->m), \
 				offsetof(struct cpsw_hw_stats, m)
@@ -433,18 +436,22 @@  static const struct cpsw_stats cpsw_gstrings_ch_stats[] = {
 		struct cpsw_slave *slave;				\
 		struct cpsw_common *cpsw = (priv)->cpsw;		\
 		int n;							\
-		if (cpsw->data.dual_emac)				\
-			(func)((cpsw)->slaves + priv->emac_port, ##arg);\
-		else							\
+		if (cpsw->data.switch_mode) {				\
+			if (priv->emac_port == HOST_PORT_NUM)		\
+				break;					\
+			(func)((cpsw)->slaves + (priv->emac_port - 1),  \
+			       ##arg);\
+		} else {						\
 			for (n = cpsw->data.slaves,			\
 					slave = cpsw->slaves;		\
 					n; n--)				\
 				(func)(slave++, ##arg);			\
+		}							\
 	} while (0)
 
 #define cpsw_dual_emac_src_port_detect(cpsw, status, ndev, skb)		\
 	do {								\
-		if (!cpsw->data.dual_emac)				\
+		if (!cpsw->data.switch_mode)				\
 			break;						\
 		if (CPDMA_RX_SOURCE_PORT(status) == 1) {		\
 			ndev = cpsw->slaves[0].ndev;			\
@@ -456,11 +463,13 @@  static const struct cpsw_stats cpsw_gstrings_ch_stats[] = {
 	} while (0)
 #define cpsw_add_mcast(cpsw, priv, addr)				\
 	do {								\
-		if (cpsw->data.dual_emac) {				\
+		if (cpsw->data.switch_mode) {				\
 			struct cpsw_slave *slave = cpsw->slaves +	\
-						priv->emac_port;	\
+						(priv->emac_port - 1);	\
 			int slave_port = cpsw_get_slave_port(		\
 						slave->slave_num);	\
+			if (priv->emac_port == HOST_PORT_NUM)		\
+				break;					\
 			cpsw_ale_add_mcast(cpsw->ale, addr,		\
 				1 << slave_port | ALE_PORT_HOST,	\
 				ALE_VLAN, slave->port_vlan, 0);		\
@@ -476,13 +485,41 @@  static inline int cpsw_get_slave_port(u32 slave_num)
 	return slave_num + 1;
 }
 
+static int cpsw_is_dual_mac(u8 switch_mode)
+{
+	return switch_mode == CPSW_DUAL_EMAC;
+}
+
+static int cpsw_is_switchdev(u8 switch_mode)
+{
+	return switch_mode == CPSW_SWITCHDEV;
+}
+
+static int cpsw_is_switch(u8 switch_mode)
+{
+	return switch_mode == CPSW_TI_SWITCH;
+}
+
+static int cpsw_slave_index(struct cpsw_priv *priv)
+{
+	struct cpsw_common *cpsw = priv->cpsw;
+
+#if IS_ENABLED(CONFIG_TI_CPSW_SWITCHDEV)
+	if (priv->emac_port == HOST_PORT_NUM)
+		return -1;
+#endif
+
+	return cpsw->data.switch_mode ? priv->emac_port - 1 :
+		cpsw->data.active_slave;
+}
+
 static void cpsw_set_promiscious(struct net_device *ndev, bool enable)
 {
 	struct cpsw_common *cpsw = ndev_to_cpsw(ndev);
 	struct cpsw_ale *ale = cpsw->ale;
 	int i;
 
-	if (cpsw->data.dual_emac) {
+	if (cpsw_is_dual_mac(cpsw->data.switch_mode)) {
 		bool flag = false;
 
 		/* Enabling promiscuous mode for one interface will be
@@ -508,7 +545,7 @@  static void cpsw_set_promiscious(struct net_device *ndev, bool enable)
 			cpsw_ale_control_set(ale, 0, ALE_BYPASS, 0);
 			dev_dbg(&ndev->dev, "promiscuity disabled\n");
 		}
-	} else {
+	} else if (cpsw_is_switch(cpsw->data.switch_mode)) {
 		if (enable) {
 			unsigned long timeout = jiffies + HZ;
 
@@ -548,6 +585,18 @@  static void cpsw_set_promiscious(struct net_device *ndev, bool enable)
 			}
 			dev_dbg(&ndev->dev, "promiscuity disabled\n");
 		}
+	} else if (cpsw_is_switchdev(cpsw->data.switch_mode)) {
+		/* When interfaces are placed into a bridge they'll switch to
+		 * promiscuous mode. In switchdev case ALE_P0_UNI_FLOOD is
+		 * changed whether the cpu port participates in the bridge
+		 */
+		struct cpsw_priv *priv = netdev_priv(ndev);
+		int slave_idx = cpsw_slave_index(priv);
+		int slave_num;
+
+		slave_num = cpsw_get_slave_port(slave_idx);
+		cpsw_ale_control_set(ale, slave_num, ALE_PORT_NOLEARN, 0);
+		cpsw_ale_control_set(ale, slave_num, ALE_PORT_NO_SA_UPDATE, 0);
 	}
 }
 
@@ -555,10 +604,11 @@  static void cpsw_ndo_set_rx_mode(struct net_device *ndev)
 {
 	struct cpsw_priv *priv = netdev_priv(ndev);
 	struct cpsw_common *cpsw = priv->cpsw;
+	int slave_no = cpsw_slave_index(priv);
 	int vid;
 
-	if (cpsw->data.dual_emac)
-		vid = cpsw->slaves[priv->emac_port].port_vlan;
+	if (cpsw_is_dual_mac(cpsw->data.switch_mode))
+		vid = cpsw->slaves[slave_no].port_vlan;
 	else
 		vid = cpsw->data.default_vlan;
 
@@ -629,8 +679,9 @@  static void cpsw_tx_handler(void *token, int len, int status)
 static void cpsw_rx_vlan_encap(struct sk_buff *skb)
 {
 	struct cpsw_priv *priv = netdev_priv(skb->dev);
-	struct cpsw_common *cpsw = priv->cpsw;
 	u32 rx_vlan_encap_hdr = *((u32 *)skb->data);
+	struct cpsw_common *cpsw = priv->cpsw;
+	int slave_no = cpsw_slave_index(priv);
 	u16 vtag, vid, prio, pkt_type;
 
 	/* Remove VLAN header encapsulation word */
@@ -651,8 +702,8 @@  static void cpsw_rx_vlan_encap(struct sk_buff *skb)
 	if (!vid)
 		return;
 	/* Ignore default vlans in dual mac mode */
-	if (cpsw->data.dual_emac &&
-	    vid == cpsw->slaves[priv->emac_port].port_vlan)
+	if (cpsw_is_dual_mac(cpsw->data.switch_mode) &&
+	    vid == cpsw->slaves[slave_no].port_vlan)
 		return;
 
 	prio = (rx_vlan_encap_hdr >>
@@ -681,9 +732,9 @@  static void cpsw_rx_handler(void *token, int len, int status)
 	cpsw_dual_emac_src_port_detect(cpsw, status, ndev, skb);
 
 	if (unlikely(status < 0) || unlikely(!netif_running(ndev))) {
-		/* In dual emac mode check for all interfaces */
-		if (cpsw->data.dual_emac && cpsw->usage_count &&
-		    (status >= 0)) {
+		/* In any other that switch mode check for all interfaces */
+		if (!cpsw_is_switch(cpsw->data.switch_mode) &&
+		    cpsw->usage_count && status >= 0) {
 			/* The packet received is for the interface which
 			 * is already down and the other interface is up
 			 * and running, instead of freeing which results
@@ -699,6 +750,11 @@  static void cpsw_rx_handler(void *token, int len, int status)
 		return;
 	}
 
+#if IS_ENABLED(CONFIG_TI_CPSW_SWITCHDEV)
+	if (cpsw_is_switchdev(cpsw->data.switch_mode))
+		skb->offload_fwd_mark = 1;
+#endif
+
 	new_skb = netdev_alloc_skb_ip_align(ndev, cpsw->rx_packet_max);
 	if (new_skb) {
 		skb_copy_queue_mapping(new_skb, skb);
@@ -1206,11 +1262,10 @@  static inline int cpsw_tx_packet_submit(struct cpsw_priv *priv,
 					struct sk_buff *skb,
 					struct cpdma_chan *txch)
 {
-	struct cpsw_common *cpsw = priv->cpsw;
-
 	skb_tx_timestamp(skb);
+
 	return cpdma_chan_submit(txch, skb, skb->data, skb->len,
-				 priv->emac_port + cpsw->data.dual_emac);
+				 priv->emac_port);
 }
 
 static inline void cpsw_add_dual_emac_def_ale_entries(
@@ -1283,7 +1338,7 @@  static void cpsw_slave_open(struct cpsw_slave *slave, struct cpsw_priv *priv)
 
 	slave_port = cpsw_get_slave_port(slave->slave_num);
 
-	if (cpsw->data.dual_emac)
+	if (cpsw_is_dual_mac(cpsw->data.switch_mode))
 		cpsw_add_dual_emac_def_ale_entries(priv, slave, slave_port);
 	else
 		cpsw_ale_add_mcast(cpsw->ale, priv->ndev->broadcast,
@@ -1362,8 +1417,8 @@  static void cpsw_init_host_port(struct cpsw_priv *priv)
 	control_reg = readl(&cpsw->regs->control);
 	control_reg |= CPSW_VLAN_AWARE | CPSW_RX_VLAN_ENCAP;
 	writel(control_reg, &cpsw->regs->control);
-	fifo_mode = (cpsw->data.dual_emac) ? CPSW_FIFO_DUAL_MAC_MODE :
-		     CPSW_FIFO_NORMAL_MODE;
+	fifo_mode = cpsw_is_dual_mac(cpsw->data.switch_mode) ?
+		CPSW_FIFO_DUAL_MAC_MODE : CPSW_FIFO_NORMAL_MODE;
 	writel(fifo_mode, &cpsw->host_port_regs->tx_in_ctl);
 
 	/* setup host port priority mapping */
@@ -1374,7 +1429,7 @@  static void cpsw_init_host_port(struct cpsw_priv *priv)
 	cpsw_ale_control_set(cpsw->ale, HOST_PORT_NUM,
 			     ALE_PORT_STATE, ALE_PORT_STATE_FORWARD);
 
-	if (!cpsw->data.dual_emac) {
+	if (!cpsw_is_dual_mac(cpsw->data.switch_mode)) {
 		cpsw_ale_add_ucast(cpsw->ale, priv->mac_addr, HOST_PORT_NUM,
 				   0, 0);
 		cpsw_ale_add_mcast(cpsw->ale, priv->ndev->broadcast,
@@ -1474,14 +1529,19 @@  static int cpsw_ndo_open(struct net_device *ndev)
 	/* Initialize host and slave ports */
 	if (!cpsw->usage_count)
 		cpsw_init_host_port(priv);
-	for_each_slave(priv, cpsw_slave_open, priv);
 
-	/* Add default VLAN */
-	if (!cpsw->data.dual_emac)
-		cpsw_add_default_vlan(priv);
-	else
-		cpsw_ale_add_vlan(cpsw->ale, cpsw->data.default_vlan,
-				  ALE_ALL_PORTS, ALE_ALL_PORTS, 0, 0);
+	if (!IS_ENABLED(CONFIG_TI_CPSW_SWITCHDEV) ||
+	    (IS_ENABLED(CONFIG_TI_CPSW_SWITCHDEV) &&
+	     priv->emac_port != HOST_PORT_NUM)) {
+		for_each_slave(priv, cpsw_slave_open, priv);
+
+		/* Add default VLAN */
+		if (cpsw_is_dual_mac(cpsw->data.switch_mode))
+			cpsw_ale_add_vlan(cpsw->ale, cpsw->data.default_vlan,
+					  ALE_ALL_PORTS, ALE_ALL_PORTS, 0, 0);
+		else
+			cpsw_add_default_vlan(priv);
+	}
 
 	/* initialize shared resources for every ndev */
 	if (!cpsw->usage_count) {
@@ -1575,6 +1635,13 @@  static netdev_tx_t cpsw_ndo_start_xmit(struct sk_buff *skb,
 	struct cpdma_chan *txch;
 	int ret, q_idx;
 
+#if IS_ENABLED(CONFIG_TI_CPSW_SWITCHDEV)
+	if (priv->emac_port == HOST_PORT_NUM) {
+		dev_kfree_skb_any(skb);
+		return NETDEV_TX_OK;
+	}
+#endif
+
 	if (skb_padto(skb, CPSW_MIN_PACKET_SIZE)) {
 		cpsw_err(priv, tx_err, "packet pad failed\n");
 		ndev->stats.tx_dropped++;
@@ -1655,8 +1722,12 @@  static void cpsw_hwtstamp_v2(struct cpsw_priv *priv)
 	struct cpsw_slave *slave;
 	struct cpsw_common *cpsw = priv->cpsw;
 	u32 ctrl, mtype;
+	int slave_no = cpsw_slave_index(priv);
+
+	if (slave_no < 0)
+		return;
 
-	slave = &cpsw->slaves[cpsw_slave_index(cpsw, priv)];
+	slave = &cpsw->slaves[slave_no];
 
 	ctrl = slave_read(slave, CPSW2_CONTROL);
 	switch (cpsw->version) {
@@ -1791,11 +1862,14 @@  static int cpsw_ndo_ioctl(struct net_device *dev, struct ifreq *req, int cmd)
 {
 	struct cpsw_priv *priv = netdev_priv(dev);
 	struct cpsw_common *cpsw = priv->cpsw;
-	int slave_no = cpsw_slave_index(cpsw, priv);
+	int slave_no = cpsw_slave_index(priv);
 
 	if (!netif_running(dev))
 		return -EINVAL;
 
+	if (slave_no < 0)
+		return -EOPNOTSUPP;
+
 	switch (cmd) {
 	case SIOCSHWTSTAMP:
 		return cpsw_hwtstamp_set(dev, req);
@@ -1832,6 +1906,7 @@  static int cpsw_ndo_set_mac_address(struct net_device *ndev, void *p)
 	struct cpsw_priv *priv = netdev_priv(ndev);
 	struct sockaddr *addr = (struct sockaddr *)p;
 	struct cpsw_common *cpsw = priv->cpsw;
+	int slave_no = cpsw_slave_index(priv);
 	int flags = 0;
 	u16 vid = 0;
 	int ret;
@@ -1845,8 +1920,8 @@  static int cpsw_ndo_set_mac_address(struct net_device *ndev, void *p)
 		return ret;
 	}
 
-	if (cpsw->data.dual_emac) {
-		vid = cpsw->slaves[priv->emac_port].port_vlan;
+	if (cpsw_is_dual_mac(cpsw->data.switch_mode)) {
+		vid = cpsw->slaves[slave_no].port_vlan;
 		flags = ALE_VLAN;
 	}
 
@@ -1884,8 +1959,11 @@  static inline int cpsw_add_vlan_ale_entry(struct cpsw_priv *priv,
 	u32 port_mask;
 	struct cpsw_common *cpsw = priv->cpsw;
 
-	if (cpsw->data.dual_emac) {
-		port_mask = (1 << (priv->emac_port + 1)) | ALE_PORT_HOST;
+	if (cpsw_is_switchdev(cpsw->data.switch_mode))
+		return -EOPNOTSUPP;
+
+	if (cpsw_is_dual_mac(cpsw->data.switch_mode)) {
+		port_mask = (1 << priv->emac_port) | ALE_PORT_HOST;
 
 		if (priv->ndev->flags & IFF_ALLMULTI)
 			unreg_mcast_mask = port_mask;
@@ -1929,6 +2007,9 @@  static int cpsw_ndo_vlan_rx_add_vid(struct net_device *ndev,
 	struct cpsw_common *cpsw = priv->cpsw;
 	int ret;
 
+	if (cpsw_is_switchdev(cpsw->data.switch_mode))
+		return 0;
+
 	if (vid == cpsw->data.default_vlan)
 		return 0;
 
@@ -1938,7 +2019,7 @@  static int cpsw_ndo_vlan_rx_add_vid(struct net_device *ndev,
 		return ret;
 	}
 
-	if (cpsw->data.dual_emac) {
+	if (cpsw_is_dual_mac(cpsw->data.switch_mode)) {
 		/* In dual EMAC, reserved VLAN id should not be used for
 		 * creating VLAN interfaces as this can break the dual
 		 * EMAC port separation
@@ -1965,6 +2046,9 @@  static int cpsw_ndo_vlan_rx_kill_vid(struct net_device *ndev,
 	struct cpsw_common *cpsw = priv->cpsw;
 	int ret;
 
+	if (cpsw_is_switchdev(cpsw->data.switch_mode))
+		return 0;
+
 	if (vid == cpsw->data.default_vlan)
 		return 0;
 
@@ -1974,7 +2058,7 @@  static int cpsw_ndo_vlan_rx_kill_vid(struct net_device *ndev,
 		return ret;
 	}
 
-	if (cpsw->data.dual_emac) {
+	if (cpsw_is_dual_mac(cpsw->data.switch_mode)) {
 		int i;
 
 		for (i = 0; i < cpsw->data.slaves; i++) {
@@ -1999,6 +2083,24 @@  static int cpsw_ndo_vlan_rx_kill_vid(struct net_device *ndev,
 	return ret;
 }
 
+static int cpsw_ndo_get_phys_port_name(struct net_device *ndev, char *name,
+				       size_t len)
+{
+	struct cpsw_priv *priv = netdev_priv(ndev);
+	struct cpsw_common *cpsw = priv->cpsw;
+	int err;
+
+	if (!cpsw_is_switchdev(cpsw->data.switch_mode))
+		return -EOPNOTSUPP;
+
+	err = snprintf(name, len, "p%d", priv->emac_port);
+
+	if (err >= len)
+		return -EINVAL;
+
+	return 0;
+}
+
 static int cpsw_ndo_set_tx_maxrate(struct net_device *ndev, int queue, u32 rate)
 {
 	struct cpsw_priv *priv = netdev_priv(ndev);
@@ -2065,6 +2167,7 @@  static const struct net_device_ops cpsw_netdev_ops = {
 #endif
 	.ndo_vlan_rx_add_vid	= cpsw_ndo_vlan_rx_add_vid,
 	.ndo_vlan_rx_kill_vid	= cpsw_ndo_vlan_rx_kill_vid,
+	.ndo_get_phys_port_name = cpsw_ndo_get_phys_port_name,
 };
 
 static int cpsw_get_regs_len(struct net_device *ndev)
@@ -2152,7 +2255,10 @@  static int cpsw_get_link_ksettings(struct net_device *ndev,
 {
 	struct cpsw_priv *priv = netdev_priv(ndev);
 	struct cpsw_common *cpsw = priv->cpsw;
-	int slave_no = cpsw_slave_index(cpsw, priv);
+	int slave_no = cpsw_slave_index(priv);
+
+	if (slave_no < 0)
+		return -EOPNOTSUPP;
 
 	if (!cpsw->slaves[slave_no].phy)
 		return -EOPNOTSUPP;
@@ -2166,7 +2272,10 @@  static int cpsw_set_link_ksettings(struct net_device *ndev,
 {
 	struct cpsw_priv *priv = netdev_priv(ndev);
 	struct cpsw_common *cpsw = priv->cpsw;
-	int slave_no = cpsw_slave_index(cpsw, priv);
+	int slave_no = cpsw_slave_index(priv);
+
+	if (slave_no < 0)
+		return -EOPNOTSUPP;
 
 	if (cpsw->slaves[slave_no].phy)
 		return phy_ethtool_ksettings_set(cpsw->slaves[slave_no].phy,
@@ -2179,7 +2288,10 @@  static void cpsw_get_wol(struct net_device *ndev, struct ethtool_wolinfo *wol)
 {
 	struct cpsw_priv *priv = netdev_priv(ndev);
 	struct cpsw_common *cpsw = priv->cpsw;
-	int slave_no = cpsw_slave_index(cpsw, priv);
+	int slave_no = cpsw_slave_index(priv);
+
+	if (slave_no < 0)
+		return;
 
 	wol->supported = 0;
 	wol->wolopts = 0;
@@ -2192,7 +2304,10 @@  static int cpsw_set_wol(struct net_device *ndev, struct ethtool_wolinfo *wol)
 {
 	struct cpsw_priv *priv = netdev_priv(ndev);
 	struct cpsw_common *cpsw = priv->cpsw;
-	int slave_no = cpsw_slave_index(cpsw, priv);
+	int slave_no = cpsw_slave_index(priv);
+
+	if (slave_no < 0)
+		return -EOPNOTSUPP;
 
 	if (cpsw->slaves[slave_no].phy)
 		return phy_ethtool_set_wol(cpsw->slaves[slave_no].phy, wol);
@@ -2451,7 +2566,10 @@  static int cpsw_get_eee(struct net_device *ndev, struct ethtool_eee *edata)
 {
 	struct cpsw_priv *priv = netdev_priv(ndev);
 	struct cpsw_common *cpsw = priv->cpsw;
-	int slave_no = cpsw_slave_index(cpsw, priv);
+	int slave_no = cpsw_slave_index(priv);
+
+	if (slave_no < 0)
+		return -EOPNOTSUPP;
 
 	if (cpsw->slaves[slave_no].phy)
 		return phy_ethtool_get_eee(cpsw->slaves[slave_no].phy, edata);
@@ -2463,7 +2581,10 @@  static int cpsw_set_eee(struct net_device *ndev, struct ethtool_eee *edata)
 {
 	struct cpsw_priv *priv = netdev_priv(ndev);
 	struct cpsw_common *cpsw = priv->cpsw;
-	int slave_no = cpsw_slave_index(cpsw, priv);
+	int slave_no = cpsw_slave_index(priv);
+
+	if (slave_no < 0)
+		return -EOPNOTSUPP;
 
 	if (cpsw->slaves[slave_no].phy)
 		return phy_ethtool_set_eee(cpsw->slaves[slave_no].phy, edata);
@@ -2475,7 +2596,10 @@  static int cpsw_nway_reset(struct net_device *ndev)
 {
 	struct cpsw_priv *priv = netdev_priv(ndev);
 	struct cpsw_common *cpsw = priv->cpsw;
-	int slave_no = cpsw_slave_index(cpsw, priv);
+	int slave_no = cpsw_slave_index(priv);
+
+	if (slave_no < 0)
+		return -EOPNOTSUPP;
 
 	if (cpsw->slaves[slave_no].phy)
 		return genphy_restart_aneg(cpsw->slaves[slave_no].phy);
@@ -2626,7 +2750,11 @@  static int cpsw_probe_dt(struct cpsw_platform_data *data,
 	data->mac_control = prop;
 
 	if (of_property_read_bool(node, "dual_emac"))
-		data->dual_emac = 1;
+		data->switch_mode = CPSW_DUAL_EMAC;
+
+	/* switchdev overrides DTS */
+	if (IS_ENABLED(CONFIG_TI_CPSW_SWITCHDEV))
+		data->switch_mode = CPSW_SWITCHDEV;
 
 	/*
 	 * Populate all the child nodes here...
@@ -2707,7 +2835,7 @@  static int cpsw_probe_dt(struct cpsw_platform_data *data,
 			if (ret)
 				return ret;
 		}
-		if (data->dual_emac) {
+		if (cpsw_is_dual_mac(data->switch_mode)) {
 			if (of_property_read_u32(slave_node, "dual_emac_res_vlan",
 						 &prop)) {
 				dev_err(&pdev->dev, "Missing dual_emac_res_vlan in DT.\n");
@@ -2787,9 +2915,13 @@  static int cpsw_probe_dual_emac(struct cpsw_priv *priv)
 	}
 	memcpy(ndev->dev_addr, priv_sl2->mac_addr, ETH_ALEN);
 
-	priv_sl2->emac_port = 1;
+	priv_sl2->emac_port = 2;
 	cpsw->slaves[1].ndev = ndev;
 	ndev->features |= NETIF_F_HW_VLAN_CTAG_FILTER;
+	if (cpsw_is_switchdev(cpsw->data.switch_mode)) {
+		ndev->features |= NETIF_F_NETNS_LOCAL;
+		cpsw_port_switchdev_init(ndev);
+	}
 
 	ndev->netdev_ops = &cpsw_netdev_ops;
 	ndev->ethtool_ops = &cpsw_ethtool_ops;
@@ -2806,6 +2938,49 @@  static int cpsw_probe_dual_emac(struct cpsw_priv *priv)
 	return ret;
 }
 
+static int cpsw_probe_cpu_port(struct cpsw_common *cpsw)
+{
+	struct cpsw_priv *priv_sl2;
+	struct net_device *ndev;
+	int ret = 0;
+
+	ndev = alloc_etherdev_mq(sizeof(struct cpsw_priv), CPSW_MAX_QUEUES);
+	if (!ndev) {
+		dev_err(cpsw->dev, "cpsw: error allocating net_device\n");
+		return -ENOMEM;
+	}
+
+	priv_sl2 = netdev_priv(ndev);
+	priv_sl2->cpsw = cpsw;
+	priv_sl2->ndev = ndev;
+	priv_sl2->dev  = &ndev->dev;
+	priv_sl2->msg_enable = netif_msg_init(debug_level, CPSW_DEBUG);
+
+	random_ether_addr(priv_sl2->mac_addr);
+	dev_info(cpsw->dev, "cpu port: Random MACID = %pM\n",
+		 priv_sl2->mac_addr);
+
+	memcpy(ndev->dev_addr, priv_sl2->mac_addr, ETH_ALEN);
+
+	priv_sl2->emac_port = HOST_PORT_NUM;
+	ndev->features |= NETIF_F_HW_VLAN_CTAG_FILTER | NETIF_F_NETNS_LOCAL;
+
+	ndev->netdev_ops = &cpsw_netdev_ops;
+
+	/* register the network device */
+	SET_NETDEV_DEV(ndev, cpsw->dev);
+	cpsw_port_switchdev_init(ndev);
+	ret = register_netdev(ndev);
+	if (ret) {
+		dev_err(cpsw->dev, "cpsw: error registering net device\n");
+		free_netdev(ndev);
+		ret = -ENODEV;
+	}
+	cpsw->master = ndev;
+
+	return ret;
+}
+
 #define CPSW_QUIRK_IRQ		BIT(0)
 
 static const struct platform_device_id cpsw_devtype[] = {
@@ -2844,6 +3019,187 @@  static const struct of_device_id cpsw_of_mtable[] = {
 };
 MODULE_DEVICE_TABLE(of, cpsw_of_mtable);
 
+static bool cpsw_port_dev_check(const struct net_device *dev)
+{
+	return dev->netdev_ops == &cpsw_netdev_ops;
+}
+
+static void cpsw_fdb_offload_notify(struct net_device *ndev,
+				    struct switchdev_notifier_fdb_info *rcv)
+{
+	struct switchdev_notifier_fdb_info info;
+
+	info.addr = rcv->addr;
+	info.vid = rcv->vid;
+	call_switchdev_notifiers(SWITCHDEV_FDB_OFFLOADED,
+				 ndev, &info.info);
+}
+
+static void cpsw_switchdev_event_work(struct work_struct *work)
+{
+	struct cpsw_switchdev_event_work *switchdev_work =
+		container_of(work, struct cpsw_switchdev_event_work, work);
+	struct cpsw_priv *priv = switchdev_work->priv;
+	struct switchdev_notifier_fdb_info *fdb;
+	struct cpsw_common *cpsw = priv->cpsw;
+
+	rtnl_lock();
+	switch (switchdev_work->event) {
+	case SWITCHDEV_FDB_ADD_TO_DEVICE:
+		fdb = &switchdev_work->fdb_info;
+		cpsw_ale_add_ucast(cpsw->ale, (u8 *)fdb->addr, priv->emac_port,
+				   ALE_VLAN | ALE_SECURE, fdb->vid);
+		cpsw_fdb_offload_notify(priv->ndev, fdb);
+		break;
+	case SWITCHDEV_FDB_DEL_TO_DEVICE:
+		fdb = &switchdev_work->fdb_info;
+		cpsw_ale_del_ucast(cpsw->ale, (u8 *)fdb->addr, priv->emac_port,
+				   ALE_VLAN | ALE_SECURE, fdb->vid);
+		break;
+	default:
+		break;
+	}
+	rtnl_unlock();
+
+	kfree(switchdev_work->fdb_info.addr);
+	kfree(switchdev_work);
+	dev_put(priv->ndev);
+}
+
+/* called under rcu_read_lock() */
+static int cpsw_switchdev_event(struct notifier_block *unused,
+				unsigned long event, void *ptr)
+{
+	struct net_device *ndev = switchdev_notifier_info_to_dev(ptr);
+	struct switchdev_notifier_fdb_info *fdb_info = ptr;
+	struct cpsw_switchdev_event_work *switchdev_work;
+	struct cpsw_priv *priv = netdev_priv(ndev);
+
+	if (!cpsw_port_dev_check(ndev))
+		return NOTIFY_DONE;
+
+	switchdev_work = kzalloc(sizeof(*switchdev_work), GFP_ATOMIC);
+	if (WARN_ON(!switchdev_work))
+		return NOTIFY_BAD;
+
+	INIT_WORK(&switchdev_work->work, cpsw_switchdev_event_work);
+	switchdev_work->priv = priv;
+	switchdev_work->event = event;
+
+	switch (event) {
+	case SWITCHDEV_FDB_ADD_TO_DEVICE:
+	case SWITCHDEV_FDB_DEL_TO_DEVICE:
+		memcpy(&switchdev_work->fdb_info, ptr,
+		       sizeof(switchdev_work->fdb_info));
+		switchdev_work->fdb_info.addr = kzalloc(ETH_ALEN, GFP_ATOMIC);
+		ether_addr_copy((u8 *)switchdev_work->fdb_info.addr,
+				fdb_info->addr);
+		dev_hold(ndev);
+		break;
+	default:
+		kfree(switchdev_work);
+		return NOTIFY_DONE;
+	}
+
+	queue_work(system_long_wq, &switchdev_work->work);
+
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block cpsw_switchdev_notifier = {
+	.notifier_call = cpsw_switchdev_event,
+};
+
+static void cpsw_netdevice_port_link(struct net_device *ndev)
+{
+	struct cpsw_priv *priv = netdev_priv(ndev);
+	struct cpsw_common *cpsw = priv->cpsw;
+
+	if (priv->emac_port != HOST_PORT_NUM)
+		return;
+
+	cpsw_ale_control_set(cpsw->ale, HOST_PORT_NUM, ALE_P0_UNI_FLOOD, 1);
+}
+
+static void cpsw_netdevice_port_unlink(struct net_device *ndev)
+{
+	struct cpsw_priv *priv = netdev_priv(ndev);
+	struct cpsw_common *cpsw = priv->cpsw;
+
+	if (priv->emac_port != HOST_PORT_NUM)
+		return;
+
+	cpsw_ale_control_set(cpsw->ale, HOST_PORT_NUM, ALE_P0_UNI_FLOOD, 0);
+}
+
+/* netdev notifier */
+static int cpsw_netdevice_event(struct notifier_block *unused,
+				unsigned long event, void *ptr)
+{
+	struct net_device *ndev = netdev_notifier_info_to_dev(ptr);
+	struct netdev_notifier_changeupper_info *info;
+
+	switch (event) {
+	case NETDEV_CHANGEUPPER:
+		info = ptr;
+		if (!info->master)
+			goto out;
+		if (info->linking)
+			cpsw_netdevice_port_link(ndev);
+		else
+			cpsw_netdevice_port_unlink(ndev);
+		break;
+	default:
+		return NOTIFY_DONE;
+	}
+
+out:
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block cpsw_netdevice_nb __read_mostly = {
+	.notifier_call = cpsw_netdevice_event,
+};
+
+static int cpsw_register_notifiers(struct cpsw_priv *priv)
+{
+	int ret;
+
+	ret = register_netdevice_notifier(&cpsw_netdevice_nb);
+	if (ret) {
+		cpsw_err(priv, probe, "can't register netdevice notifier\n");
+		return ret;
+	}
+
+	ret = register_switchdev_notifier(&cpsw_switchdev_notifier);
+	if (ret) {
+		cpsw_err(priv, probe, "can't register switchdev notifier\n");
+		goto unreg_netdevice;
+	}
+
+	return ret;
+
+unreg_netdevice:
+	ret = unregister_netdevice_notifier(&cpsw_netdevice_nb);
+
+	return ret;
+}
+
+static int cpsw_unregister_notifiers(struct cpsw_priv *priv)
+{
+	int ret;
+
+	ret = unregister_switchdev_notifier(&cpsw_switchdev_notifier);
+	if (ret)
+		dev_err(priv->dev, "can't unregister switchdev notifier\n");
+
+	ret += unregister_netdevice_notifier(&cpsw_netdevice_nb);
+	if (ret)
+		dev_err(priv->dev, "can't unregister netdevice notifier\n");
+
+	return ret;
+}
+
 static int cpsw_probe(struct platform_device *pdev)
 {
 	struct clk			*clk;
@@ -2935,7 +3291,11 @@  static int cpsw_probe(struct platform_device *pdev)
 		cpsw->slaves[i].slave_num = i;
 
 	cpsw->slaves[0].ndev = ndev;
-	priv->emac_port = 0;
+
+	if (cpsw_is_switch(cpsw->data.switch_mode))
+		priv->emac_port = HOST_PORT_NUM;
+	else
+		priv->emac_port = 1;
 
 	clk = devm_clk_get(&pdev->dev, "fck");
 	if (IS_ERR(clk)) {
@@ -3076,8 +3436,17 @@  static int cpsw_probe(struct platform_device *pdev)
 			cpsw->quirk_irq = true;
 	}
 
-	ndev->features |= NETIF_F_HW_VLAN_CTAG_FILTER | NETIF_F_HW_VLAN_CTAG_RX;
+	if (cpsw_is_switchdev(cpsw->data.switch_mode)) {
+		ret = cpsw_probe_cpu_port(cpsw);
+		if (ret) {
+			cpsw_err(priv, probe, "error probe cpu interface\n");
+			goto clean_dma_ret;
+		}
+		cpsw_port_switchdev_init(ndev);
+		ndev->features |= NETIF_F_NETNS_LOCAL;
+	}
 
+	ndev->features |= NETIF_F_HW_VLAN_CTAG_FILTER | NETIF_F_HW_VLAN_CTAG_RX;
 	ndev->netdev_ops = &cpsw_netdev_ops;
 	ndev->ethtool_ops = &cpsw_ethtool_ops;
 	netif_napi_add(ndev, &cpsw->napi_rx, cpsw_rx_poll, CPSW_POLL_WEIGHT);
@@ -3093,7 +3462,7 @@  static int cpsw_probe(struct platform_device *pdev)
 		goto clean_dma_ret;
 	}
 
-	if (cpsw->data.dual_emac) {
+	if (!cpsw_is_switch(cpsw->data.switch_mode)) {
 		ret = cpsw_probe_dual_emac(priv);
 		if (ret) {
 			cpsw_err(priv, probe, "error probe slave 2 emac interface\n");
@@ -3139,6 +3508,12 @@  static int cpsw_probe(struct platform_device *pdev)
 		goto clean_dma_ret;
 	}
 
+	if (cpsw_is_switchdev(cpsw->data.switch_mode)) {
+		ret = cpsw_register_notifiers(priv);
+		if (ret)
+			goto clean_dma_ret;
+	}
+
 	cpsw_notice(priv, probe,
 		    "initialized device (regs %pa, irq %d, pool size %d)\n",
 		    &ss_res->start, ndev->irq, dma_params.descs_pool_size);
@@ -3164,7 +3539,8 @@  static int cpsw_probe(struct platform_device *pdev)
 static int cpsw_remove(struct platform_device *pdev)
 {
 	struct net_device *ndev = platform_get_drvdata(pdev);
-	struct cpsw_common *cpsw = ndev_to_cpsw(ndev);
+	struct cpsw_priv *priv = netdev_priv(ndev);
+	struct cpsw_common *cpsw = priv->cpsw;
 	int ret;
 
 	ret = pm_runtime_get_sync(&pdev->dev);
@@ -3173,7 +3549,10 @@  static int cpsw_remove(struct platform_device *pdev)
 		return ret;
 	}
 
-	if (cpsw->data.dual_emac)
+	if (cpsw_is_switchdev(cpsw->data.switch_mode))
+		ret = cpsw_unregister_notifiers(priv);
+
+	if (cpsw->data.switch_mode)
 		unregister_netdev(cpsw->slaves[1].ndev);
 	unregister_netdev(ndev);
 
@@ -3182,8 +3561,10 @@  static int cpsw_remove(struct platform_device *pdev)
 	cpsw_remove_dt(pdev);
 	pm_runtime_put_sync(&pdev->dev);
 	pm_runtime_disable(&pdev->dev);
-	if (cpsw->data.dual_emac)
+	if (cpsw->data.switch_mode)
 		free_netdev(cpsw->slaves[1].ndev);
+	if (cpsw->master)
+		free_netdev(cpsw->master);
 	free_netdev(ndev);
 	return 0;
 }
@@ -3195,7 +3576,7 @@  static int cpsw_suspend(struct device *dev)
 	struct net_device	*ndev = platform_get_drvdata(pdev);
 	struct cpsw_common	*cpsw = ndev_to_cpsw(ndev);
 
-	if (cpsw->data.dual_emac) {
+	if (cpsw->data.switch_mode) {
 		int i;
 
 		for (i = 0; i < cpsw->data.slaves; i++) {
@@ -3224,7 +3605,7 @@  static int cpsw_resume(struct device *dev)
 
 	/* shut up ASSERT_RTNL() warning in netif_set_real_num_tx/rx_queues */
 	rtnl_lock();
-	if (cpsw->data.dual_emac) {
+	if (cpsw->data.switch_mode) {
 		int i;
 
 		for (i = 0; i < cpsw->data.slaves; i++) {
diff --git a/drivers/net/ethernet/ti/cpsw_priv.h b/drivers/net/ethernet/ti/cpsw_priv.h
index 3b02a83..4be5ffc 100644
--- a/drivers/net/ethernet/ti/cpsw_priv.h
+++ b/drivers/net/ethernet/ti/cpsw_priv.h
@@ -30,6 +30,12 @@ 
 #define CPSW2_TX_PRI_MAP    0x18 /* Tx Header Priority to Switch Pri Mapping */
 #define CPSW2_TS_SEQ_MTYPE  0x1c /* Time Sync Sequence ID Offset and Msg Type */
 
+enum {
+	CPSW_TI_SWITCH,
+	CPSW_DUAL_EMAC,
+	CPSW_SWITCHDEV,
+};
+
 struct cpsw_slave_data {
 	struct	device_node *phy_node;
 	char	phy_id[MII_BUS_ID_SIZE];
@@ -48,7 +54,7 @@  struct cpsw_platform_data {
 	u32	bd_ram_size;  /*buffer descriptor ram size */
 	u32	mac_control;	/* Mac control register */
 	u16	default_vlan;	/* Def VLAN for ALE lookup in VLAN aware mode*/
-	bool	dual_emac;	/* Enable Dual EMAC mode */
+	u8	switch_mode;    /* Enable Dual EMAC/switchdev mode */
 };
 
 struct cpsw_slave {
@@ -80,6 +86,7 @@  struct cpsw_common {
 	u32				coal_intvl;
 	u32				bus_freq_mhz;
 	int				rx_packet_max;
+	struct net_device		*master; /* used for switchdev */
 	struct cpsw_slave		*slaves;
 	struct cpdma_ctlr		*dma;
 	struct cpsw_vector		txv[CPSW_MAX_QUEUES];