
[net-next,v1,1/6] lan743x: boost performance on cpu archs w/o dma cache snooping

Message ID 20210129195240.31871-2-TheSven73@gmail.com
State New
Series lan743x speed boost

Commit Message

Sven Van Asbroeck Jan. 29, 2021, 7:52 p.m. UTC
From: Sven Van Asbroeck <thesven73@gmail.com>

The buffers in the lan743x driver's receive ring are always 9K,
even when the largest packet that can be received (the mtu) is
much smaller. This performs particularly badly on cpu archs
without dma cache snooping (such as ARM): each received packet
results in a 9K dma_{map|unmap} operation, which is very expensive
because cpu caches need to be invalidated.

Careful measurement of the driver rx path on armv7 reveals that
the cpu spends the majority of its time waiting for cache
invalidation.

Optimize as follows:

1. Set the rx ring buffer size equal to the mtu. This limits the
   amount of cache that needs to be invalidated per dma_map().

2. When dma_unmap()ping, skip the cpu sync. Sync only the packet data
   actually received, the size of which the chip indicates in its
   rx ring descriptors. This limits the amount of cache that needs
   to be invalidated per dma_unmap().

These optimizations double the rx performance on armv7.
Third parties report 3x rx speedup on armv8.

Performance on dma cache snooping architectures (such as x86)
is expected to stay the same.

Tested with iperf3 on a freescale imx6qp + lan7430, with both sides
set to an mtu of 1500 bytes, measuring rx performance:

Before:
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-20.00  sec   550 MBytes   231 Mbits/sec    0
After:
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-20.00  sec  1.33 GBytes   570 Mbits/sec    0

Tested by Anders Roenningen (anders@ronningen.priv.no) on armv8,
    rx iperf3:
Before 102 Mbits/sec
After  279 Mbits/sec

Signed-off-by: Sven Van Asbroeck <thesven73@gmail.com>
---

Tree: git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git # 46eb3c108fe1

To: Bryan Whitehead <bryan.whitehead@microchip.com>
To: UNGLinuxDriver@microchip.com
To: "David S. Miller" <davem@davemloft.net>
To: Jakub Kicinski <kuba@kernel.org>
Cc: Andrew Lunn <andrew@lunn.ch>
Cc: Alexey Denisov <rtgbnm@gmail.com>
Cc: Sergej Bauer <sbauer@blackbox.su>
Cc: Tim Harvey <tharvey@gateworks.com>
Cc: Anders Rønningen <anders@ronningen.priv.no>
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org (open list)

 drivers/net/ethernet/microchip/lan743x_main.c | 35 ++++++++++++-------
 1 file changed, 23 insertions(+), 12 deletions(-)

Comments

Sven Van Asbroeck Jan. 30, 2021, 11:59 p.m. UTC | #1
Hi Bryan, thank you so much for reviewing, I really appreciate it.

On Sat, Jan 30, 2021 at 5:11 PM <Bryan.Whitehead@microchip.com> wrote:
>
> >                         /* unmap from dma */
> > +                       packet_length = RX_DESC_DATA0_FRAME_LENGTH_GET_
> > +                                       (descriptor->data0);
> It appears you moved this packet_length assignment from just below the following if block, however you left out the le32_to_cpu. See next comment
>

Correct on both counts. This is a merge snafu that crept in when I
rebased to Alexey's very recent little/big endian patch, at the last
minute.
My tests didn't catch it, because I'm running on a little-endian cpu,
where le32_to_cpu() compiles to a nop.

Had I had the good sense to run sparse on every patch, like Jakub has,
I would have caught it...
Sven Van Asbroeck Jan. 31, 2021, 12:14 a.m. UTC | #2
On Sat, Jan 30, 2021 at 5:11 PM <Bryan.Whitehead@microchip.com> wrote:
>
> It appears you moved this packet_length assignment from just below the following if block, however you left out the le32_to_cpu. See next comment

PS this merge snafu is removed completely by the next patch in the set.
So this will not prevent you from reviewing/testing the multi-buffer support,
should you want to.
Christoph Hellwig Feb. 5, 2021, 9:31 a.m. UTC | #3
This is a pattern we've seen in a few other net drivers, so we should
be ok. It is just rather hairy and needs a good justification, which
seems to be given here.
Sergej Bauer Feb. 5, 2021, 12:44 p.m. UTC | #4
Hi Sven
I can confirm great stability improvement after your patch
"lan743x: boost performance on cpu archs w/o dma cache snooping".
Please note that test_ber opens raw sockets with
s = socket(AF_PACKET, SOCK_RAW, ETH_P_ALL)
so the resulting 'average speed' is an average egress speed.

Test machine is Intel Pentium G4560 3.50GHz
lan743x with rejected virtual phy 'inside'

What I had before:
$ ifmtu eth7 500
$ test_ber -l eth7 -c 1000 -n 1000000 -f500 --no-conf
...
number of sent packets      = 1000000
number of received packets  = 289017
number of lost packets = 710983
number of out of order packets = 0
number of bit errors   = 0
total errors detected  = 710983
bit error rate         = 0.710983
average speed: 429.3799 Mbit/s

$ ifmtu eth7 1500
$ sudo test_ber -l eth7 -c 1000 -n 1000000 -f1500 --no-conf
...
number of sent packets      = 1000000
number of received packets  = 577194
number of lost packets = 422806
number of out of order packets = 0
number of bit errors   = 0
total errors detected  = 422806
bit error rate         = 0.422806
average speed: 644.6557 Mbit/s
---

and what I had after your patch:
$ ifmtu eth7 500
$ test_ber -l eth7 -c 1000 -n 1000000 -f500 --no-conf
...
number of sent packets      = 1000000
number of received packets  = 711329
number of lost packets = 288671
number of out of order packets = 0
number of bit errors   = 0
total errors detected  = 288671
bit error rate         = 0.288671
average speed: 429.2263 Mbit/s

$ ifmtu eth7 1500
$ test_ber -l eth7 -c 1000 -n 1000000 -f1500 --no-conf
...
number of sent packets      = 1000000
number of received packets  = 1000000
number of lost packets = 0
number of out of order packets = 0
number of bit errors   = 0
total errors detected  = 0
bit error rate         = 0
average speed: 644.5405 Mbit/s
Sven Van Asbroeck Feb. 5, 2021, 2:01 p.m. UTC | #5
Hi Christoph,

On Fri, Feb 5, 2021 at 4:31 AM Christoph Hellwig <hch@lst.de> wrote:
>
> This is a pattern we've seen in a few other net drivers, so we should
> be ok. It is just rather hairy and needs a good justification, which
> seems to be given here.

Thank you so much for taking the time to look into this.
That is certainly good news!
Sven Van Asbroeck Feb. 5, 2021, 2:07 p.m. UTC | #6
Hi Sergej,

On Fri, Feb 5, 2021 at 7:44 AM Sergej Bauer <sbauer@blackbox.su> wrote:
>
> Hi Sven
> I can confirm great stability improvement after your patch
> "lan743x: boost performance on cpu archs w/o dma cache snooping".
>
> Test machine is Intel Pentium G4560 3.50GHz
> lan743x with rejected virtual phy 'inside'

Interesting, so the speed boost patch seems to improve things even on Intel...

Would you be able to apply and test the multi-buffer patch as well?
To do that, you can simply apply patches [2/6] and [3/6] on top of
what you already have.

Keeping in mind that Bryan has identified an issue with the above
patch, which will get fixed in v2. So YMMV.
Sergej Bauer Feb. 5, 2021, 3:09 p.m. UTC | #7
On Friday, February 5, 2021 5:07:22 PM MSK you wrote:
> Hi Sergej,
>
> On Fri, Feb 5, 2021 at 7:44 AM Sergej Bauer <sbauer@blackbox.su> wrote:
> > Hi Sven
> > I can confirm great stability improvement after your patch
> > "lan743x: boost performance on cpu archs w/o dma cache snooping".
> >
> > Test machine is Intel Pentium G4560 3.50GHz
> > lan743x with rejected virtual phy 'inside'
>
> Interesting, so the speed boost patch seems to improve things even on
> Intel...
>
> Would you be able to apply and test the multi-buffer patch as well?
> To do that, you can simply apply patches [2/6] and [3/6] on top of
> what you already have.

Hi Sven
Tests after applying patches [2/6] and [3/6] are:
$ ifmtu eth7 500
$ sudo test_ber -l eth7 -c 1000 -n 1000000 -f500 --no-conf
...
number of sent packets      = 1000000
number of received packets  = 713288
number of lost packets = 286712
number of out of order packets = 0
number of bit errors   = 0
total errors detected  = 286712
bit error rate         = 0.286712
average speed: 427.8043 Mbit/s

$ ifmtu eth7 1500
$ sudo test_ber -l eth7 -c 1000 -n 1000000 -f500 --no-conf
...
number of sent packets      = 1000000
number of received packets  = 707869
number of lost packets = 292131
number of out of order packets = 0
number of bit errors   = 0
total errors detected  = 292131
bit error rate         = 0.292131
average speed: 431.0163 Mbit/s

$ sudo test_ber -l eth7 -c 1000 -n 1000000 -f1500 --no-conf
...
number of sent packets      = 1000000
number of received packets  = 1000000
number of lost packets = 0
number of out of order packets = 0
number of bit errors   = 0
total errors detected  = 0
bit error rate         = 0
average speed: 646.4932 Mbit/s

> Keeping in mind that Bryan has identified an issue with the above
> patch, which will get fixed in v2. So YMMV.

I'll perform tests with v2 as well.
Sven Van Asbroeck Feb. 5, 2021, 4:39 p.m. UTC | #8
Hi Sergej,

On Fri, Feb 5, 2021 at 10:09 AM Sergej Bauer <sbauer@blackbox.su> wrote:
>
> Tests after applying patches [2/6] and [3/6] are:
> $ ifmtu eth7 500
> $ sudo test_ber -l eth7 -c 1000 -n 1000000 -f500 --no-conf

Thank you! Is there a way for me to run test_ber myself?
Is this a standard, or a bespoke testing tool?
Sergej Bauer Feb. 5, 2021, 4:59 p.m. UTC | #9
On Friday, February 5, 2021 7:39:40 PM MSK Sven Van Asbroeck wrote:
> Hi Sergej,
>
> On Fri, Feb 5, 2021 at 10:09 AM Sergej Bauer <sbauer@blackbox.su> wrote:
> > Tests after applying patches [2/6] and [3/6] are:
> > $ ifmtu eth7 500
> > $ sudo test_ber -l eth7 -c 1000 -n 1000000 -f500 --no-conf
>
> Thank you! Is there a way for me to run test_ber myself?
> Is this a standard, or a bespoke testing tool?

It's kind of bespoke... It's part of a framework to assist the HW guys
in developing a PHY device. But the project is finished, so I could ask
for permission to send you the source.

Patch

diff --git a/drivers/net/ethernet/microchip/lan743x_main.c b/drivers/net/ethernet/microchip/lan743x_main.c
index f1f6eba4ace4..f485320e5784 100644
--- a/drivers/net/ethernet/microchip/lan743x_main.c
+++ b/drivers/net/ethernet/microchip/lan743x_main.c
@@ -1957,11 +1957,11 @@  static int lan743x_rx_next_index(struct lan743x_rx *rx, int index)
 
 static struct sk_buff *lan743x_rx_allocate_skb(struct lan743x_rx *rx)
 {
-	int length = 0;
+	struct net_device *netdev = rx->adapter->netdev;
 
-	length = (LAN743X_MAX_FRAME_SIZE + ETH_HLEN + 4 + RX_HEAD_PADDING);
-	return __netdev_alloc_skb(rx->adapter->netdev,
-				  length, GFP_ATOMIC | GFP_DMA);
+	return __netdev_alloc_skb(netdev,
+				  netdev->mtu + ETH_HLEN + 4 + RX_HEAD_PADDING,
+				  GFP_ATOMIC | GFP_DMA);
 }
 
 static void lan743x_rx_update_tail(struct lan743x_rx *rx, int index)
@@ -1977,9 +1977,10 @@  static int lan743x_rx_init_ring_element(struct lan743x_rx *rx, int index,
 {
 	struct lan743x_rx_buffer_info *buffer_info;
 	struct lan743x_rx_descriptor *descriptor;
-	int length = 0;
+	struct net_device *netdev = rx->adapter->netdev;
+	int length;
 
-	length = (LAN743X_MAX_FRAME_SIZE + ETH_HLEN + 4 + RX_HEAD_PADDING);
+	length = netdev->mtu + ETH_HLEN + 4 + RX_HEAD_PADDING;
 	descriptor = &rx->ring_cpu_ptr[index];
 	buffer_info = &rx->buffer_info[index];
 	buffer_info->skb = skb;
@@ -2148,11 +2149,18 @@  static int lan743x_rx_process_packet(struct lan743x_rx *rx)
 			descriptor = &rx->ring_cpu_ptr[first_index];
 
 			/* unmap from dma */
+			packet_length =	RX_DESC_DATA0_FRAME_LENGTH_GET_
+					(descriptor->data0);
 			if (buffer_info->dma_ptr) {
-				dma_unmap_single(&rx->adapter->pdev->dev,
-						 buffer_info->dma_ptr,
-						 buffer_info->buffer_length,
-						 DMA_FROM_DEVICE);
+				dma_sync_single_for_cpu(&rx->adapter->pdev->dev,
+							buffer_info->dma_ptr,
+							packet_length,
+							DMA_FROM_DEVICE);
+				dma_unmap_single_attrs(&rx->adapter->pdev->dev,
+						       buffer_info->dma_ptr,
+						       buffer_info->buffer_length,
+						       DMA_FROM_DEVICE,
+						       DMA_ATTR_SKIP_CPU_SYNC);
 				buffer_info->dma_ptr = 0;
 				buffer_info->buffer_length = 0;
 			}
@@ -2167,8 +2175,8 @@  static int lan743x_rx_process_packet(struct lan743x_rx *rx)
 			int index = first_index;
 
 			/* multi buffer packet not supported */
-			/* this should not happen since
-			 * buffers are allocated to be at least jumbo size
+			/* this should not happen since buffers are allocated
+			 * to be at least the mtu size configured in the mac.
 			 */
 
 			/* clean up buffers */
@@ -2628,6 +2636,9 @@  static int lan743x_netdev_change_mtu(struct net_device *netdev, int new_mtu)
 	struct lan743x_adapter *adapter = netdev_priv(netdev);
 	int ret = 0;
 
+	if (netif_running(netdev))
+		return -EBUSY;
+
 	ret = lan743x_mac_set_mtu(adapter, new_mtu);
 	if (!ret)
 		netdev->mtu = new_mtu;