
[RFC,net-next,2/4] net: dsa: remove the "dsa_to_port in a loop" antipattern from the core

Message ID 20210809190320.1058373-3-vladimir.oltean@nxp.com
State New
Series Remove the "dsa_to_port in a loop" antipattern

Commit Message

Vladimir Oltean Aug. 9, 2021, 7:03 p.m. UTC
Ever since Vivien's conversion of the ds->ports array into a dst->ports
list, and the introduction of dsa_to_port, iterating over the ports of
a switch became quadratic whenever dsa_to_port was called inside the loop.

dsa_to_port can either be called directly, or indirectly through the
dsa_is_{user,cpu,dsa,unused}_port helpers.

Introduce a basic switch port iterator macro, dsa_switch_for_each_port,
whose iterator variable is a struct dsa_port *dp directly, rather than
an int i. Going from i to dp is expensive, while going from dp to i is
cheap.

This macro iterates through the entire ds->dst->ports list and filters
by the ports belonging just to the switch provided as argument.

While at it, provide some more flavors of that macro which filter for
specific types of ports: at the moment just user ports and CPU ports.
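
For illustration, a minimal usage sketch, assuming a struct dsa_switch
*ds is in scope (the dev_info call is just a hypothetical example):

	struct dsa_port *dp;

	dsa_switch_for_each_user_port(dp, ds)
		dev_info(ds->dev, "user port %d\n", dp->index);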

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 include/net/dsa.h   | 19 +++++++++++++++----
 net/dsa/dsa.c       | 22 +++++++++++-----------
 net/dsa/dsa2.c      | 11 +++++------
 net/dsa/port.c      |  7 ++++---
 net/dsa/switch.c    | 41 ++++++++++++++++++-----------------------
 net/dsa/tag_8021q.c | 29 +++++++++++++----------------
 6 files changed, 66 insertions(+), 63 deletions(-)

Comments

Qingfang Deng Aug. 10, 2021, 3:33 a.m. UTC | #1
On Mon, Aug 09, 2021 at 10:03:18PM +0300, Vladimir Oltean wrote:
> Ever since Vivien's conversion of the ds->ports array into a dst->ports
> list, and the introduction of dsa_to_port, iterating over the ports of
> a switch became quadratic whenever dsa_to_port was called inside the loop.

So, what is the benefit of a linked list here? Do we allow users to
insert/delete a dsa_port at runtime? If not, how about using a
dynamically allocated array instead?

> 
> dsa_to_port can either be called directly, or indirectly through the
> dsa_is_{user,cpu,dsa,unused}_port helpers.
> 
> Introduce a basic switch port iterator macro, dsa_switch_for_each_port,
> whose iterator variable is a struct dsa_port *dp directly, rather than
> an int i. Going from i to dp is expensive, while going from dp to i is
> cheap.
> 
> This macro iterates through the entire ds->dst->ports list and filters
> by the ports belonging just to the switch provided as argument.
> 
> While at it, provide some more flavors of that macro which filter for
> specific types of ports: at the moment just user ports and CPU ports.
> 
> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Florian Fainelli Aug. 10, 2021, 9:41 a.m. UTC | #2
On 8/9/2021 8:33 PM, DENG Qingfang wrote:
> On Mon, Aug 09, 2021 at 10:03:18PM +0300, Vladimir Oltean wrote:
>> Ever since Vivien's conversion of the ds->ports array into a dst->ports
>> list, and the introduction of dsa_to_port, iterating over the ports of
>> a switch became quadratic whenever dsa_to_port was called inside the loop.
> 
> So, what is the benefit of a linked list here? Do we allow users to
> insert/delete a dsa_port at runtime? If not, how about using a
> dynamically allocated array instead?

The goal was to flatten the space while doing cross-switch operations,
which would have otherwise required iterating over dsa_switch instances
within a dsa_switch_tree, and then over dsa_port within each dsa_switch.
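
For instance, a cross-switch operation becomes a single flat walk over
the tree (a sketch; do_cross_chip_op and br are hypothetical):

	struct dsa_port *dp;

	/* one walk visits every port of every switch in the tree */
	list_for_each_entry(dp, &dst->ports, list)
		if (dp->bridge_dev == br)
			do_cross_chip_op(dp); /* hypothetical helper */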
Qingfang Deng Aug. 10, 2021, 5:04 p.m. UTC | #3
On Tue, Aug 10, 2021 at 07:35:33PM +0300, Vladimir Oltean wrote:
> If I were to guess what Qingfang was hinting at, it is that the receive
> path now needs to iterate over a list, whereas before it simply indexed
> an array:
> 
> static inline struct net_device *dsa_master_find_slave(struct net_device *dev,
> 						       int device, int port)
> {
> 	struct dsa_port *cpu_dp = dev->dsa_ptr;
> 	struct dsa_switch_tree *dst = cpu_dp->dst;
> 	struct dsa_port *dp;
> 
> 	list_for_each_entry(dp, &dst->ports, list)
> 		if (dp->ds->index == device && dp->index == port &&
> 		    dp->type == DSA_PORT_TYPE_USER)
> 			return dp->slave;
> 
> 	return NULL;
> }
> 
> I will try in the following days to make a prototype implementation of
> converting the linked list back into an array and see if there is any
> justifiable performance improvement.
> 
> [ even if this would make the "multiple CPU ports in LAG" implementation
>   harder ]

Yes, you got my point.

There is the RTL8390M series SoC, which has 52+ ports but a weak CPU (MIPS
34Kc, 700 MHz). In that case, the linear lookup time and the potential
cache misses could make a difference.
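
Concretely: a worst-case receive lookup on a 52-port tree walks 52 list
nodes (each one a potential cache miss) per packet, whereas an array
lookup is a single indexed load.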
Vladimir Oltean Aug. 11, 2021, 5:32 p.m. UTC | #4
On Wed, Aug 11, 2021 at 01:04:47AM +0800, DENG Qingfang wrote:
> On Tue, Aug 10, 2021 at 07:35:33PM +0300, Vladimir Oltean wrote:
> > If I were to guess what Qingfang was hinting at, it is that the receive
> > path now needs to iterate over a list, whereas before it simply indexed
> > an array:
> > 
> > static inline struct net_device *dsa_master_find_slave(struct net_device *dev,
> > 						       int device, int port)
> > {
> > 	struct dsa_port *cpu_dp = dev->dsa_ptr;
> > 	struct dsa_switch_tree *dst = cpu_dp->dst;
> > 	struct dsa_port *dp;
> > 
> > 	list_for_each_entry(dp, &dst->ports, list)
> > 		if (dp->ds->index == device && dp->index == port &&
> > 		    dp->type == DSA_PORT_TYPE_USER)
> > 			return dp->slave;
> > 
> > 	return NULL;
> > }
> > 
> > I will try in the following days to make a prototype implementation of
> > converting the linked list back into an array and see if there is any
> > justifiable performance improvement.
> > 
> > [ even if this would make the "multiple CPU ports in LAG" implementation
> >   harder ]
> 
> Yes, you got my point.
> 
> There is the RTL8390M series SoC, which has 52+ ports but a weak CPU (MIPS
> 34Kc, 700 MHz). In that case, the linear lookup time and the potential
> cache misses could make a difference.

Then I am not in a position to run relevant performance tests for that
scenario.

I have been testing with the following setup: an NXP LS1028A switch
(ocelot/felix driver) doing IPv4 forwarding of 64-byte UDP datagrams
sent by a data generator. Two ports at 1 Gbps each, 100% port load; IP
forwarding takes place between one port and the other.
Generator port A sends from 192.168.100.1 to 192.168.200.1
Generator port B sends from 192.168.200.1 to 192.168.100.1

Flow control is enabled on all switch ports: the user ports and the CPU port
(I don't really have a setup that I can test in any meaningful way
without flow control).

The script I run on the board to set things up for IP forwarding is:

# Switch the DSA master to the ocelot-8021q tagging protocol
# (the master must be down while changing it)
ip link set eno2 down && echo ocelot-8021q > /sys/class/net/eno2/dsa/tagging
# Give the two user ports distinct MAC addresses
ip link set swp0 address a0:00:00:00:00:02
ip link set swp1 address a0:00:00:00:00:04
for eth in swp0 swp1; do
	ip link set ${eth} up
done
ip addr add 192.168.100.2/24 dev swp0
ip addr add 192.168.200.2/24 dev swp1
echo 1 > /proc/sys/net/ipv4/ip_forward
# Static ARP entries for the traffic generator ports
arp -s 192.168.100.1 00:01:02:03:04:05 dev swp0
arp -s 192.168.200.1 00:01:02:03:04:06 dev swp1
# Steer each destination IP to its own RX queue on the DSA master
ethtool --config-nfc eno2 flow-type ip4 dst-ip 192.168.200.1 action 0
ethtool --config-nfc eno2 flow-type ip4 dst-ip 192.168.100.1 action 1
# Enable GRO with list-based segmentation on the master and user ports
ethtool -K eno2 gro on rx-gro-list on
ethtool -K swp0 gro on rx-gro-list on
ethtool -K swp1 gro on rx-gro-list on


The DSA patch I used on top of today's net-next was:

-----------------------------[ cut here ]-----------------------------
From 7733f643dd61431a93da5a8f5118848cdc037562 Mon Sep 17 00:00:00 2001
From: Vladimir Oltean <vladimir.oltean@nxp.com>

Date: Wed, 11 Aug 2021 00:27:07 +0300
Subject: [PATCH] net: dsa: setup a linear port cache for faster receive path

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>

---
 include/net/dsa.h  |  5 +++++
 net/dsa/dsa2.c     | 46 ++++++++++++++++++++++++++++++++++++++++++----
 net/dsa/dsa_priv.h |  9 ++++-----
 3 files changed, 51 insertions(+), 9 deletions(-)

diff --git a/include/net/dsa.h b/include/net/dsa.h
index 3203b200cc38..2a9ea4f57910 100644
--- a/include/net/dsa.h
+++ b/include/net/dsa.h
@@ -153,9 +153,14 @@ struct dsa_switch_tree {
 	struct net_device **lags;
 	unsigned int lags_len;
 
+	struct dsa_port **port_cache;
+
 	/* Track the largest switch index within a tree */
 	unsigned int last_switch;
 
+	/* Track the largest port count in a switch within a tree */
+	unsigned int max_num_ports;
+
 	/* Track the bridges with forwarding offload enabled */
 	unsigned long fwd_offloading_bridges;
 };
diff --git a/net/dsa/dsa2.c b/net/dsa/dsa2.c
index 0b7497dd60c3..3d2b92dbd603 100644
--- a/net/dsa/dsa2.c
+++ b/net/dsa/dsa2.c
@@ -941,6 +941,39 @@ static void dsa_tree_teardown_lags(struct dsa_switch_tree *dst)
 	kfree(dst->lags);
 }
 
+static int dsa_tree_setup_port_cache(struct dsa_switch_tree *dst)
+{
+	struct dsa_port *dp;
+
+	list_for_each_entry(dp, &dst->ports, list) {
+		if (dst->last_switch < dp->ds->index)
+			dst->last_switch = dp->ds->index;
+		if (dst->max_num_ports < dp->ds->num_ports)
+			dst->max_num_ports = dp->ds->num_ports;
+	}
+
+	dst->port_cache = kcalloc((dst->last_switch + 1) * dst->max_num_ports,
+				  sizeof(struct dsa_port *), GFP_KERNEL);
+	if (!dst->port_cache)
+		return -ENOMEM;
+
+	list_for_each_entry(dp, &dst->ports, list)
+		dst->port_cache[dp->ds->index * dst->max_num_ports + dp->index] = dp;
+
+	return 0;
+}
+
+static void dsa_tree_teardown_port_cache(struct dsa_switch_tree *dst)
+{
+	int i;
+
+	for (i = 0; i < (dst->last_switch + 1) * dst->max_num_ports; i++)
+		dst->port_cache[i] = NULL;
+
+	kfree(dst->port_cache);
+	dst->port_cache = NULL;
+}
+
 static int dsa_tree_setup(struct dsa_switch_tree *dst)
 {
 	bool complete;
@@ -956,10 +989,14 @@ static int dsa_tree_setup(struct dsa_switch_tree *dst)
 	if (!complete)
 		return 0;
 
-	err = dsa_tree_setup_cpu_ports(dst);
+	err = dsa_tree_setup_port_cache(dst);
 	if (err)
 		return err;
 
+	err = dsa_tree_setup_cpu_ports(dst);
+	if (err)
+		goto teardown_port_cache;
+
 	err = dsa_tree_setup_switches(dst);
 	if (err)
 		goto teardown_cpu_ports;
@@ -984,6 +1021,8 @@ static int dsa_tree_setup(struct dsa_switch_tree *dst)
 	dsa_tree_teardown_switches(dst);
 teardown_cpu_ports:
 	dsa_tree_teardown_cpu_ports(dst);
+teardown_port_cache:
+	dsa_tree_teardown_port_cache(dst);
 
 	return err;
 }
@@ -1003,6 +1042,8 @@ static void dsa_tree_teardown(struct dsa_switch_tree *dst)
 
 	dsa_tree_teardown_cpu_ports(dst);
 
+	dsa_tree_teardown_port_cache(dst);
+
 	list_for_each_entry_safe(dl, next, &dst->rtable, list) {
 		list_del(&dl->list);
 		kfree(dl);
@@ -1301,9 +1342,6 @@ static int dsa_switch_parse_member_of(struct dsa_switch *ds,
 		return -EEXIST;
 	}
 
-	if (ds->dst->last_switch < ds->index)
-		ds->dst->last_switch = ds->index;
-
 	return 0;
 }
 
diff --git a/net/dsa/dsa_priv.h b/net/dsa/dsa_priv.h
index 6310a15afe21..5c27f66fd62a 100644
--- a/net/dsa/dsa_priv.h
+++ b/net/dsa/dsa_priv.h
@@ -188,12 +188,11 @@ static inline struct net_device *dsa_master_find_slave(struct net_device *dev,
 	struct dsa_switch_tree *dst = cpu_dp->dst;
 	struct dsa_port *dp;
 
-	list_for_each_entry(dp, &dst->ports, list)
-		if (dp->ds->index == device && dp->index == port &&
-		    dp->type == DSA_PORT_TYPE_USER)
-			return dp->slave;
+	dp = dst->port_cache[device * dst->max_num_ports + port];
+	if (!dp || dp->type != DSA_PORT_TYPE_USER)
+		return NULL;
 
-	return NULL;
+	return dp->slave;
 }
 
 /* netlink.c */
-----------------------------[ cut here ]-----------------------------
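For reference, the lookup above is flat 2D indexing into the cache: with
dst->max_num_ports == 6, say, the cache slot for (device = 1, port = 2)
is 1 * 6 + 2 == 8, i.e. dp = dst->port_cache[8].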

The results I got were:

Before the patch:

684 Kpps = 459 Mbps

perf record -e cycles -C 0 sleep 10 && perf report
    10.17%  ksoftirqd/0      [kernel.kallsyms]  [k] enetc_pci_remove
     6.13%  ksoftirqd/0      [kernel.kallsyms]  [k] eth_type_trans
     5.48%  ksoftirqd/0      [kernel.kallsyms]  [k] enetc_poll
     4.99%  ksoftirqd/0      [kernel.kallsyms]  [k] kmem_cache_alloc
     4.56%  ksoftirqd/0      [kernel.kallsyms]  [k] dev_gro_receive
     2.89%  ksoftirqd/0      [kernel.kallsyms]  [k] enetc_start_xmit
     2.77%  ksoftirqd/0      [kernel.kallsyms]  [k] __skb_flow_dissect
     2.75%  ksoftirqd/0      [kernel.kallsyms]  [k] __siphash_aligned
     2.55%  ksoftirqd/0      [kernel.kallsyms]  [k] __netif_receive_skb_core
     2.48%  ksoftirqd/0      [kernel.kallsyms]  [k] build_skb
     2.47%  ksoftirqd/0      [kernel.kallsyms]  [k] take_page_off_buddy
     2.01%  ksoftirqd/0      [kernel.kallsyms]  [k] inet_gro_receive
     1.86%  ksoftirqd/0      [kernel.kallsyms]  [k] dsa_slave_xmit
     1.76%  ksoftirqd/0      [kernel.kallsyms]  [k] __dev_queue_xmit
     1.68%  ksoftirqd/0      [kernel.kallsyms]  [k] skb_zcopy_clear
     1.62%  ksoftirqd/0      [kernel.kallsyms]  [k] enetc_build_skb
     1.60%  ksoftirqd/0      [kernel.kallsyms]  [k] __build_skb_around
     1.50%  ksoftirqd/0      [kernel.kallsyms]  [k] dev_hard_start_xmit
     1.49%  ksoftirqd/0      [kernel.kallsyms]  [k] sch_direct_xmit
     1.42%  ksoftirqd/0      [kernel.kallsyms]  [k] __skb_get_hash
     1.29%  ksoftirqd/0      [kernel.kallsyms]  [k] __local_bh_enable_ip
     1.26%  ksoftirqd/0      [kernel.kallsyms]  [k] udp_gro_receive
     1.23%  ksoftirqd/0      [kernel.kallsyms]  [k] udp4_gro_receive
     1.21%  ksoftirqd/0      [kernel.kallsyms]  [k] skb_segment_list
     1.13%  ksoftirqd/0      [kernel.kallsyms]  [k] netdev_drivername
     1.05%  ksoftirqd/0      [kernel.kallsyms]  [k] dev_shutdown
     1.05%  ksoftirqd/0      [kernel.kallsyms]  [k] inet_gso_segment
     1.01%  ksoftirqd/0      [kernel.kallsyms]  [k] dsa_switch_rcv
     0.98%  ksoftirqd/0      [kernel.kallsyms]  [k] ocelot_rcv
     0.91%  ksoftirqd/0      [kernel.kallsyms]  [k] napi_gro_receive
     0.87%  ksoftirqd/0      [kernel.kallsyms]  [k] enetc_xmit
     0.85%  ksoftirqd/0      [kernel.kallsyms]  [k] kmem_cache_free_bulk
     0.84%  ksoftirqd/0      [kernel.kallsyms]  [k] do_csum
     0.84%  ksoftirqd/0      [kernel.kallsyms]  [k] memmove
     0.80%  ksoftirqd/0      [kernel.kallsyms]  [k] dsa_8021q_rcv
     0.79%  ksoftirqd/0      [kernel.kallsyms]  [k] __netif_receive_skb_list_core
     0.78%  ksoftirqd/0      [kernel.kallsyms]  [k] netif_skb_features
     0.76%  ksoftirqd/0      [kernel.kallsyms]  [k] netif_receive_skb_list_internal
     0.76%  ksoftirqd/0      [kernel.kallsyms]  [k] skb_release_all
     0.74%  ksoftirqd/0      [kernel.kallsyms]  [k] netdev_pick_tx

perf record -e cache-misses -C 0 sleep 10 && perf report
     7.22%  ksoftirqd/0      [kernel.kallsyms]  [k] skb_zcopy_clear
     6.46%  ksoftirqd/0      [kernel.kallsyms]  [k] inet_gro_receive
     6.41%  ksoftirqd/0      [kernel.kallsyms]  [k] enetc_pci_remove
     6.20%  ksoftirqd/0      [kernel.kallsyms]  [k] take_page_off_buddy
     6.13%  ksoftirqd/0      [kernel.kallsyms]  [k] build_skb
     5.06%  ksoftirqd/0      [kernel.kallsyms]  [k] inet_gso_segment
     4.47%  ksoftirqd/0      [kernel.kallsyms]  [k] dev_gro_receive
     4.28%  ksoftirqd/0      [kernel.kallsyms]  [k] memmove
     3.77%  ksoftirqd/0      [kernel.kallsyms]  [k] enetc_poll
     3.73%  ksoftirqd/0      [kernel.kallsyms]  [k] __copy_skb_header
     3.46%  ksoftirqd/0      [kernel.kallsyms]  [k] eth_type_trans
     3.06%  ksoftirqd/0      [kernel.kallsyms]  [k] skb_release_all
     2.76%  ksoftirqd/0      [kernel.kallsyms]  [k] ip_send_check
     2.36%  ksoftirqd/0      [kernel.kallsyms]  [k] __skb_get_hash
     2.24%  ksoftirqd/0      [kernel.kallsyms]  [k] kmem_cache_alloc
     2.23%  ksoftirqd/0      [kernel.kallsyms]  [k] __netif_receive_skb_core
     1.68%  ksoftirqd/0      [kernel.kallsyms]  [k] netdev_pick_tx
     1.56%  ksoftirqd/0      [kernel.kallsyms]  [k] skb_segment_list
     1.55%  ksoftirqd/0      [kernel.kallsyms]  [k] dsa_slave_xmit
     1.54%  ksoftirqd/0      [kernel.kallsyms]  [k] skb_headers_offset_update
     1.51%  ksoftirqd/0      [kernel.kallsyms]  [k] enetc_start_xmit
     1.48%  ksoftirqd/0      [kernel.kallsyms]  [k] netdev_core_pick_tx
     1.30%  ksoftirqd/0      [kernel.kallsyms]  [k] __dev_queue_xmit
     1.14%  ksoftirqd/0      [kernel.kallsyms]  [k] dsa_8021q_rcv
     1.05%  ksoftirqd/0      [kernel.kallsyms]  [k] __build_skb_around
     1.05%  ksoftirqd/0      [kernel.kallsyms]  [k] skb_pull_rcsum
     1.03%  ksoftirqd/0      [kernel.kallsyms]  [k] dsa_8021q_xmit
     0.98%  ksoftirqd/0      [kernel.kallsyms]  [k] __skb_flow_dissect
     0.89%  ksoftirqd/0      [kernel.kallsyms]  [k] ocelot_xmit
     0.84%  ksoftirqd/0      [kernel.kallsyms]  [k] enetc_build_skb
     0.73%  ksoftirqd/0      [kernel.kallsyms]  [k] kmem_cache_free_bulk
     0.63%  ksoftirqd/0      [kernel.kallsyms]  [k] enetc_refill_rx_ring
     0.55%  ksoftirqd/0      [kernel.kallsyms]  [k] netif_skb_features
     0.46%  ksoftirqd/0      [kernel.kallsyms]  [k] dma_unmap_page_attrs
     0.46%  ksoftirqd/0      [kernel.kallsyms]  [k] fib_table_lookup
     0.36%  ksoftirqd/0      [kernel.kallsyms]  [k] fib_lookup_good_nhc
     0.33%  ksoftirqd/0      [kernel.kallsyms]  [k] skb_release_data
     0.33%  ksoftirqd/0      [kernel.kallsyms]  [k] dsa_8021q_tx_vid
     0.32%  ksoftirqd/0      [kernel.kallsyms]  [k] enetc_flip_rx_buff
     0.29%  ksoftirqd/0      [kernel.kallsyms]  [k] napi_consume_skb

After the patch:

650 Kpps = 426 Mbps

perf record -e cycles -C 0 sleep 10 && perf report
     9.34%  ksoftirqd/0      [kernel.kallsyms]  [k] enetc_pci_remove
     7.70%  ksoftirqd/0      [kernel.kallsyms]  [k] enetc_poll
     5.49%  ksoftirqd/0      [kernel.kallsyms]  [k] eth_type_trans
     4.62%  ksoftirqd/0      [kernel.kallsyms]  [k] take_page_off_buddy
     4.55%  ksoftirqd/0      [kernel.kallsyms]  [k] kmem_cache_alloc
     4.36%  ksoftirqd/0      [kernel.kallsyms]  [k] dev_gro_receive
     3.22%  ksoftirqd/0      [kernel.kallsyms]  [k] skb_zcopy_clear
     2.59%  ksoftirqd/0      [kernel.kallsyms]  [k] __siphash_aligned
     2.51%  ksoftirqd/0      [kernel.kallsyms]  [k] enetc_start_xmit
     2.37%  ksoftirqd/0      [kernel.kallsyms]  [k] __skb_flow_dissect
     2.20%  ksoftirqd/0      [kernel.kallsyms]  [k] __netif_receive_skb_core
     1.84%  ksoftirqd/0      [kernel.kallsyms]  [k] kmem_cache_free_bulk
     1.69%  ksoftirqd/0      [kernel.kallsyms]  [k] inet_gro_receive
     1.65%  ksoftirqd/0      [kernel.kallsyms]  [k] __dev_queue_xmit
     1.63%  ksoftirqd/0      [kernel.kallsyms]  [k] dsa_slave_xmit
     1.60%  ksoftirqd/0      [kernel.kallsyms]  [k] build_skb
     1.45%  ksoftirqd/0      [kernel.kallsyms]  [k] enetc_build_skb
     1.41%  ksoftirqd/0      [kernel.kallsyms]  [k] dev_hard_start_xmit
     1.39%  ksoftirqd/0      [kernel.kallsyms]  [k] sch_direct_xmit
     1.39%  ksoftirqd/0      [kernel.kallsyms]  [k] __build_skb_around
     1.28%  ksoftirqd/0      [kernel.kallsyms]  [k] __skb_get_hash
     1.16%  ksoftirqd/0      [kernel.kallsyms]  [k] __local_bh_enable_ip
     1.14%  ksoftirqd/0      [kernel.kallsyms]  [k] udp4_gro_receive
     1.12%  ksoftirqd/0      [kernel.kallsyms]  [k] skb_release_all
     1.09%  ksoftirqd/0      [kernel.kallsyms]  [k] ocelot_rcv
     1.08%  ksoftirqd/0      [kernel.kallsyms]  [k] netdev_drivername
     1.08%  ksoftirqd/0      [kernel.kallsyms]  [k] dma_unmap_page_attrs
     1.08%  ksoftirqd/0      [kernel.kallsyms]  [k] udp_gro_receive
     1.05%  ksoftirqd/0      [kernel.kallsyms]  [k] dev_shutdown
     1.04%  ksoftirqd/0      [kernel.kallsyms]  [k] napi_consume_skb
     1.03%  ksoftirqd/0      [kernel.kallsyms]  [k] skb_segment_list
     0.90%  ksoftirqd/0      [kernel.kallsyms]  [k] napi_gro_receive
     0.86%  ksoftirqd/0      [kernel.kallsyms]  [k] inet_gso_segment
     0.83%  ksoftirqd/0      [kernel.kallsyms]  [k] dsa_switch_rcv
     0.77%  ksoftirqd/0      [kernel.kallsyms]  [k] memmove
     0.75%  ksoftirqd/0      [kernel.kallsyms]  [k] do_csum
     0.73%  ksoftirqd/0      [kernel.kallsyms]  [k] netif_skb_features
     0.71%  ksoftirqd/0      [kernel.kallsyms]  [k] dsa_8021q_rcv
     0.69%  ksoftirqd/0      [kernel.kallsyms]  [k] __netif_receive_skb_list_core
     0.67%  ksoftirqd/0      [kernel.kallsyms]  [k] netif_receive_skb_list_internal

perf record -e cache-misses -C 0 sleep 10 && perf report
    12.38%  ksoftirqd/0      [kernel.kallsyms]  [k] skb_zcopy_clear
     9.34%  ksoftirqd/0      [kernel.kallsyms]  [k] take_page_off_buddy
     8.62%  ksoftirqd/0      [kernel.kallsyms]  [k] enetc_pci_remove
     5.61%  ksoftirqd/0      [kernel.kallsyms]  [k] memmove
     5.44%  ksoftirqd/0      [kernel.kallsyms]  [k] inet_gro_receive
     4.61%  ksoftirqd/0      [kernel.kallsyms]  [k] enetc_poll
     4.20%  ksoftirqd/0      [kernel.kallsyms]  [k] inet_gso_segment
     3.58%  ksoftirqd/0      [kernel.kallsyms]  [k] dev_gro_receive
     3.19%  ksoftirqd/0      [kernel.kallsyms]  [k] build_skb
     3.11%  ksoftirqd/0      [kernel.kallsyms]  [k] skb_release_all
     2.80%  ksoftirqd/0      [kernel.kallsyms]  [k] __copy_skb_header
     2.79%  ksoftirqd/0      [kernel.kallsyms]  [k] eth_type_trans
     2.31%  ksoftirqd/0      [kernel.kallsyms]  [k] ip_send_check
     1.95%  ksoftirqd/0      [kernel.kallsyms]  [k] __skb_get_hash
     1.64%  ksoftirqd/0      [kernel.kallsyms]  [k] kmem_cache_alloc
     1.54%  ksoftirqd/0      [kernel.kallsyms]  [k] __netif_receive_skb_core
     1.52%  ksoftirqd/0      [kernel.kallsyms]  [k] dsa_slave_xmit
     1.51%  ksoftirqd/0      [kernel.kallsyms]  [k] kmem_cache_free_bulk
     1.49%  ksoftirqd/0      [kernel.kallsyms]  [k] netdev_pick_tx
     1.42%  ksoftirqd/0      [kernel.kallsyms]  [k] skb_headers_offset_update
     1.34%  ksoftirqd/0      [kernel.kallsyms]  [k] skb_segment_list
     1.20%  ksoftirqd/0      [kernel.kallsyms]  [k] enetc_build_skb
     1.19%  ksoftirqd/0      [kernel.kallsyms]  [k] enetc_start_xmit
     1.09%  ksoftirqd/0      [kernel.kallsyms]  [k] dsa_8021q_xmit
     0.94%  ksoftirqd/0      [kernel.kallsyms]  [k] netdev_core_pick_tx
     0.90%  ksoftirqd/0      [kernel.kallsyms]  [k] __dev_queue_xmit
     0.87%  ksoftirqd/0      [kernel.kallsyms]  [k] ocelot_xmit
     0.85%  ksoftirqd/0      [kernel.kallsyms]  [k] __skb_flow_dissect
     0.68%  ksoftirqd/0      [kernel.kallsyms]  [k] dsa_8021q_rcv
     0.63%  ksoftirqd/0      [kernel.kallsyms]  [k] skb_pull_rcsum
     0.63%  ksoftirqd/0      [kernel.kallsyms]  [k] enetc_flip_rx_buff
     0.61%  ksoftirqd/0      [kernel.kallsyms]  [k] napi_consume_skb
     0.50%  ksoftirqd/0      [kernel.kallsyms]  [k] skb_release_data
     0.48%  ksoftirqd/0      [kernel.kallsyms]  [k] dma_unmap_page_attrs
     0.41%  ksoftirqd/0      [kernel.kallsyms]  [k] enetc_refill_rx_ring
     0.37%  ksoftirqd/0      [kernel.kallsyms]  [k] __build_skb_around
     0.33%  ksoftirqd/0      [kernel.kallsyms]  [k] fib_table_lookup
     0.33%  ksoftirqd/0      [kernel.kallsyms]  [k] napi_skb_cache_put
     0.28%  ksoftirqd/0      [kernel.kallsyms]  [k] gro_cells_receive
     0.26%  ksoftirqd/0      [kernel.kallsyms]  [k] bpf_skb_load_helper_16

So the performance seems to be slightly worse with the patch, for
reasons that are not immediately apparent from perf, as far as I can
tell. If I just look at dsa_switch_rcv, I see a decrease in CPU cycles
from 1.01% to 0.83%, but for some reason that is not reflected in the
throughput.

(Also, please ignore some oddities in the perf reports like
"enetc_pci_remove"; this comes from enetc_lock_mdio/enetc_unlock_mdio,
and I have no idea why it gets printed like that.)

So yeah, I cannot actually change my setup such that the list iteration
is more expensive than this (swp0 is the first element, swp1 is the second).

Patch

diff --git a/include/net/dsa.h b/include/net/dsa.h
index d05c71a92715..4639a82bab66 100644
--- a/include/net/dsa.h
+++ b/include/net/dsa.h
@@ -471,14 +471,25 @@  static inline bool dsa_is_user_port(struct dsa_switch *ds, int p)
 	return dsa_to_port(ds, p)->type == DSA_PORT_TYPE_USER;
 }
 
+#define dsa_switch_for_each_port(_dp, _ds) \
+	list_for_each_entry((_dp), &(_ds)->dst->ports, list) \
+		if ((_dp)->ds == (_ds))
+
+#define dsa_switch_for_each_user_port(_dp, _ds) \
+	dsa_switch_for_each_port((_dp), (_ds)) \
+		if (dsa_port_is_user((_dp)))
+
+#define dsa_switch_for_each_cpu_port(_dp, _ds) \
+	dsa_switch_for_each_port((_dp), (_ds)) \
+		if (dsa_port_is_cpu((_dp)))
+
 static inline u32 dsa_user_ports(struct dsa_switch *ds)
 {
+	struct dsa_port *dp;
 	u32 mask = 0;
-	int p;
 
-	for (p = 0; p < ds->num_ports; p++)
-		if (dsa_is_user_port(ds, p))
-			mask |= BIT(p);
+	dsa_switch_for_each_user_port(dp, ds)
+		mask |= BIT(dp->index);
 
 	return mask;
 }
diff --git a/net/dsa/dsa.c b/net/dsa/dsa.c
index 5e73332a9707..8104f2382988 100644
--- a/net/dsa/dsa.c
+++ b/net/dsa/dsa.c
@@ -280,23 +280,22 @@  static int dsa_switch_rcv(struct sk_buff *skb, struct net_device *dev,
 }
 
 #ifdef CONFIG_PM_SLEEP
-static bool dsa_is_port_initialized(struct dsa_switch *ds, int p)
+static bool dsa_port_is_initialized(const struct dsa_port *dp)
 {
-	const struct dsa_port *dp = dsa_to_port(ds, p);
-
 	return dp->type == DSA_PORT_TYPE_USER && dp->slave;
 }
 
 int dsa_switch_suspend(struct dsa_switch *ds)
 {
-	int i, ret = 0;
+	struct dsa_port *dp;
+	int ret = 0;
 
 	/* Suspend slave network devices */
-	for (i = 0; i < ds->num_ports; i++) {
-		if (!dsa_is_port_initialized(ds, i))
+	dsa_switch_for_each_port(dp, ds) {
+		if (!dsa_port_is_initialized(dp))
 			continue;
 
-		ret = dsa_slave_suspend(dsa_to_port(ds, i)->slave);
+		ret = dsa_slave_suspend(dp->slave);
 		if (ret)
 			return ret;
 	}
@@ -310,7 +309,8 @@  EXPORT_SYMBOL_GPL(dsa_switch_suspend);
 
 int dsa_switch_resume(struct dsa_switch *ds)
 {
-	int i, ret = 0;
+	struct dsa_port *dp;
+	int ret = 0;
 
 	if (ds->ops->resume)
 		ret = ds->ops->resume(ds);
@@ -319,11 +319,11 @@  int dsa_switch_resume(struct dsa_switch *ds)
 		return ret;
 
 	/* Resume slave network devices */
-	for (i = 0; i < ds->num_ports; i++) {
-		if (!dsa_is_port_initialized(ds, i))
+	dsa_switch_for_each_port(dp, ds) {
+		if (!dsa_port_is_initialized(dp))
 			continue;
 
-		ret = dsa_slave_resume(dsa_to_port(ds, i)->slave);
+		ret = dsa_slave_resume(dp->slave);
 		if (ret)
 			return ret;
 	}
diff --git a/net/dsa/dsa2.c b/net/dsa/dsa2.c
index 8150e16aaa55..13794578a1d0 100644
--- a/net/dsa/dsa2.c
+++ b/net/dsa/dsa2.c
@@ -707,16 +707,15 @@  static int dsa_switch_setup_tag_protocol(struct dsa_switch *ds)
 {
 	const struct dsa_device_ops *tag_ops = ds->dst->tag_ops;
 	struct dsa_switch_tree *dst = ds->dst;
-	int port, err;
+	struct dsa_port *cpu_dp;
+	int err;
 
 	if (tag_ops->proto == dst->default_proto)
 		return 0;
 
-	for (port = 0; port < ds->num_ports; port++) {
-		if (!dsa_is_cpu_port(ds, port))
-			continue;
-
-		err = ds->ops->change_tag_protocol(ds, port, tag_ops->proto);
+	dsa_switch_for_each_cpu_port(cpu_dp, ds) {
+		err = ds->ops->change_tag_protocol(ds, cpu_dp->index,
+						   tag_ops->proto);
 		if (err) {
 			dev_err(ds->dev, "Unable to use tag protocol \"%s\": %pe\n",
 				tag_ops->name, ERR_PTR(err));
diff --git a/net/dsa/port.c b/net/dsa/port.c
index eedd9881e1ba..05b902d69863 100644
--- a/net/dsa/port.c
+++ b/net/dsa/port.c
@@ -540,7 +540,8 @@  static bool dsa_port_can_apply_vlan_filtering(struct dsa_port *dp,
 					      struct netlink_ext_ack *extack)
 {
 	struct dsa_switch *ds = dp->ds;
-	int err, i;
+	struct dsa_port *other_dp;
+	int err;
 
 	/* VLAN awareness was off, so the question is "can we turn it on".
 	 * We may have had 8021q uppers, those need to go. Make sure we don't
@@ -582,10 +583,10 @@  static bool dsa_port_can_apply_vlan_filtering(struct dsa_port *dp,
 	 * different ports of the same switch device and one of them has a
 	 * different setting than what is being requested.
 	 */
-	for (i = 0; i < ds->num_ports; i++) {
+	dsa_switch_for_each_port(other_dp, ds) {
 		struct net_device *other_bridge;
 
-		other_bridge = dsa_to_port(ds, i)->bridge_dev;
+		other_bridge = other_dp->bridge_dev;
 		if (!other_bridge)
 			continue;
 		/* If it's the same bridge, it also has same
diff --git a/net/dsa/switch.c b/net/dsa/switch.c
index fd1a1c6bf9cf..5ddcf37ecfa5 100644
--- a/net/dsa/switch.c
+++ b/net/dsa/switch.c
@@ -17,14 +17,11 @@ 
 static unsigned int dsa_switch_fastest_ageing_time(struct dsa_switch *ds,
 						   unsigned int ageing_time)
 {
-	int i;
-
-	for (i = 0; i < ds->num_ports; ++i) {
-		struct dsa_port *dp = dsa_to_port(ds, i);
+	struct dsa_port *dp;
 
+	dsa_switch_for_each_port(dp, ds)
 		if (dp->ageing_time && dp->ageing_time < ageing_time)
 			ageing_time = dp->ageing_time;
-	}
 
 	return ageing_time;
 }
@@ -117,7 +114,8 @@  static int dsa_switch_bridge_leave(struct dsa_switch *ds,
 	bool unset_vlan_filtering = br_vlan_enabled(info->br);
 	struct dsa_switch_tree *dst = ds->dst;
 	struct netlink_ext_ack extack = {0};
-	int err, port;
+	struct dsa_port *dp;
+	int err;
 
 	if (dst->index == info->tree_index && ds->index == info->sw_index &&
 	    ds->ops->port_bridge_leave)
@@ -138,10 +136,10 @@  static int dsa_switch_bridge_leave(struct dsa_switch *ds,
 	 * VLAN-aware bridge.
 	 */
 	if (unset_vlan_filtering && ds->vlan_filtering_is_global) {
-		for (port = 0; port < ds->num_ports; port++) {
+		dsa_switch_for_each_port(dp, ds) {
 			struct net_device *bridge_dev;
 
-			bridge_dev = dsa_to_port(ds, port)->bridge_dev;
+			bridge_dev = dp->bridge_dev;
 
 			if (bridge_dev && br_vlan_enabled(bridge_dev)) {
 				unset_vlan_filtering = false;
@@ -566,38 +564,35 @@  static int dsa_switch_change_tag_proto(struct dsa_switch *ds,
 				       struct dsa_notifier_tag_proto_info *info)
 {
 	const struct dsa_device_ops *tag_ops = info->tag_ops;
-	int port, err;
+	struct dsa_port *dp, *cpu_dp;
+	int err;
 
 	if (!ds->ops->change_tag_protocol)
 		return -EOPNOTSUPP;
 
 	ASSERT_RTNL();
 
-	for (port = 0; port < ds->num_ports; port++) {
-		if (!dsa_is_cpu_port(ds, port))
-			continue;
-
-		err = ds->ops->change_tag_protocol(ds, port, tag_ops->proto);
+	dsa_switch_for_each_cpu_port(cpu_dp, ds) {
+		err = ds->ops->change_tag_protocol(ds, cpu_dp->index,
+						   tag_ops->proto);
 		if (err)
 			return err;
 
-		dsa_port_set_tag_protocol(dsa_to_port(ds, port), tag_ops);
+		dsa_port_set_tag_protocol(cpu_dp, tag_ops);
 	}
 
 	/* Now that changing the tag protocol can no longer fail, let's update
 	 * the remaining bits which are "duplicated for faster access", and the
 	 * bits that depend on the tagger, such as the MTU.
 	 */
-	for (port = 0; port < ds->num_ports; port++) {
-		if (dsa_is_user_port(ds, port)) {
-			struct net_device *slave;
+	dsa_switch_for_each_user_port(dp, ds) {
+		struct net_device *slave;
 
-			slave = dsa_to_port(ds, port)->slave;
-			dsa_slave_setup_tagger(slave);
+		slave = dp->slave;
+		dsa_slave_setup_tagger(slave);
 
-			/* rtnl_mutex is held in dsa_tree_change_tag_proto */
-			dsa_slave_change_mtu(slave, slave->mtu);
-		}
+		/* rtnl_mutex is held in dsa_tree_change_tag_proto */
+		dsa_slave_change_mtu(slave, slave->mtu);
 	}
 
 	return 0;
diff --git a/net/dsa/tag_8021q.c b/net/dsa/tag_8021q.c
index 654697ebb6f3..b29f4eb9aed1 100644
--- a/net/dsa/tag_8021q.c
+++ b/net/dsa/tag_8021q.c
@@ -322,15 +322,13 @@  int dsa_switch_tag_8021q_vlan_del(struct dsa_switch *ds,
  * +-+-----+-+-----+-+-----+-+-----+-+    +-+-----+-+-----+-+-----+-+-----+-+
  *   swp0    swp1    swp2    swp3           swp0    swp1    swp2    swp3
  */
-static bool dsa_tag_8021q_bridge_match(struct dsa_switch *ds, int port,
+static bool dsa_tag_8021q_bridge_match(struct dsa_port *dp,
 				       struct dsa_notifier_bridge_info *info)
 {
-	struct dsa_port *dp = dsa_to_port(ds, port);
-
 	/* Don't match on self */
-	if (ds->dst->index == info->tree_index &&
-	    ds->index == info->sw_index &&
-	    port == info->port)
+	if (dp->ds->dst->index == info->tree_index &&
+	    dp->ds->index == info->sw_index &&
+	    dp->index == info->port)
 		return false;
 
 	if (dsa_port_is_user(dp))
@@ -344,8 +342,9 @@  int dsa_tag_8021q_bridge_join(struct dsa_switch *ds,
 {
 	struct dsa_switch *targeted_ds;
 	struct dsa_port *targeted_dp;
+	struct dsa_port *dp;
 	u16 targeted_rx_vid;
-	int err, port;
+	int err;
 
 	if (!ds->tag_8021q_ctx)
 		return 0;
@@ -354,11 +353,10 @@  int dsa_tag_8021q_bridge_join(struct dsa_switch *ds,
 	targeted_dp = dsa_to_port(targeted_ds, info->port);
 	targeted_rx_vid = dsa_8021q_rx_vid(targeted_ds, info->port);
 
-	for (port = 0; port < ds->num_ports; port++) {
-		struct dsa_port *dp = dsa_to_port(ds, port);
-		u16 rx_vid = dsa_8021q_rx_vid(ds, port);
+	dsa_switch_for_each_port(dp, ds) {
+		u16 rx_vid = dsa_8021q_rx_vid(ds, dp->index);
 
-		if (!dsa_tag_8021q_bridge_match(ds, port, info))
+		if (!dsa_tag_8021q_bridge_match(dp, info))
 			continue;
 
 		/* Install the RX VID of the targeted port in our VLAN table */
@@ -380,8 +378,8 @@  int dsa_tag_8021q_bridge_leave(struct dsa_switch *ds,
 {
 	struct dsa_switch *targeted_ds;
 	struct dsa_port *targeted_dp;
+	struct dsa_port *dp;
 	u16 targeted_rx_vid;
-	int port;
 
 	if (!ds->tag_8021q_ctx)
 		return 0;
@@ -390,11 +388,10 @@  int dsa_tag_8021q_bridge_leave(struct dsa_switch *ds,
 	targeted_dp = dsa_to_port(targeted_ds, info->port);
 	targeted_rx_vid = dsa_8021q_rx_vid(targeted_ds, info->port);
 
-	for (port = 0; port < ds->num_ports; port++) {
-		struct dsa_port *dp = dsa_to_port(ds, port);
-		u16 rx_vid = dsa_8021q_rx_vid(ds, port);
+	dsa_switch_for_each_port(dp, ds) {
+		u16 rx_vid = dsa_8021q_rx_vid(ds, dp->index);
 
-		if (!dsa_tag_8021q_bridge_match(ds, port, info))
+		if (!dsa_tag_8021q_bridge_match(dp, info))
 			continue;
 
 		/* Remove the RX VID of the targeted port from our VLAN table */