diff mbox series

gianfar: fix jumbo packets+napi+rx overrun crash

Message ID 20210304031238.28880-1-michael-dev@fami-braun.de
State New
Headers show
Series gianfar: fix jumbo packets+napi+rx overrun crash | expand

Commit Message

michael-dev March 4, 2021, 3:12 a.m. UTC
From: Michael Braun <michael-dev@fami-braun.de>

When using jumbo packets and overrunning rx queue with napi enabled,
the following sequence is observed in gfar_add_rx_frag:

   | lstatus                              |       | skb                   |
t  | lstatus,  size, flags                | first | len, data_len, *ptr   |
---+--------------------------------------+-------+-----------------------+
13 | 18002348, 9032, INTERRUPT LAST       | 0     | 9600, 8000,  f554c12e |
12 | 10000640, 1600, INTERRUPT            | 0     | 8000, 6400,  f554c12e |
11 | 10000640, 1600, INTERRUPT            | 0     | 6400, 4800,  f554c12e |
10 | 10000640, 1600, INTERRUPT            | 0     | 4800, 3200,  f554c12e |
09 | 10000640, 1600, INTERRUPT            | 0     | 3200, 1600,  f554c12e |
08 | 14000640, 1600, INTERRUPT FIRST      | 0     | 1600, 0,     f554c12e |
07 | 14000640, 1600, INTERRUPT FIRST      | 1     | 0,    0,     f554c12e |
06 | 1c000080, 128,  INTERRUPT LAST FIRST | 1     | 0,    0,     abf3bd6e |
05 | 18002348, 9032, INTERRUPT LAST       | 0     | 8000, 6400,  c5a57780 |
04 | 10000640, 1600, INTERRUPT            | 0     | 6400, 4800,  c5a57780 |
03 | 10000640, 1600, INTERRUPT            | 0     | 4800, 3200,  c5a57780 |
02 | 10000640, 1600, INTERRUPT            | 0     | 3200, 1600,  c5a57780 |
01 | 10000640, 1600, INTERRUPT            | 0     | 1600, 0,     c5a57780 |
00 | 14000640, 1600, INTERRUPT FIRST      | 1     | 0,    0,     c5a57780 |

So at t=7 a new packets is started but not finished, probably due to rx
overrun - but rx overrun is not indicated in the flags. Instead a new
packets starts at t=8. This results in skb->len to exceed size for the LAST
fragment at t=13 and thus a negative fragment size added to the skb.

This then crashes:

kernel BUG at include/linux/skbuff.h:2277!
Oops: Exception in kernel mode, sig: 5 [#1]
...
NIP [c04689f4] skb_pull+0x2c/0x48
LR [c03f62ac] gfar_clean_rx_ring+0x2e4/0x844
Call Trace:
[ec4bfd38] [c06a84c4] _raw_spin_unlock_irqrestore+0x60/0x7c (unreliable)
[ec4bfda8] [c03f6a44] gfar_poll_rx_sq+0x48/0xe4
[ec4bfdc8] [c048d504] __napi_poll+0x54/0x26c
[ec4bfdf8] [c048d908] net_rx_action+0x138/0x2c0
[ec4bfe68] [c06a8f34] __do_softirq+0x3a4/0x4fc
[ec4bfed8] [c0040150] run_ksoftirqd+0x58/0x70
[ec4bfee8] [c0066ecc] smpboot_thread_fn+0x184/0x1cc
[ec4bff08] [c0062718] kthread+0x140/0x144
[ec4bff38] [c0012350] ret_from_kernel_thread+0x14/0x1c

This patch fixes this by checking for computed LAST fragment size, so a
negative sized fragment is never added.
In order to prevent the newer rx frame from getting corrupted, the FIRST
flag is checked to discard the incomplete older frame.

Signed-off-by: Michael Braun <michael-dev@fami-braun.de>
---
 drivers/net/ethernet/freescale/gianfar.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

Comments

Claudiu Manoil March 4, 2021, 10:05 a.m. UTC | #1
>-----Original Message-----
>From: michael-dev@fami-braun.de <michael-dev@fami-braun.de>
>Sent: Thursday, March 4, 2021 5:13 AM
>To: Claudiu Manoil <claudiu.manoil@nxp.com>
>Cc: netdev@vger.kernel.org; Michael Braun <michael-dev@fami-braun.de>
>Subject: [PATCH] gianfar: fix jumbo packets+napi+rx overrun crash
>
>From: Michael Braun <michael-dev@fami-braun.de>
>
>When using jumbo packets and overrunning rx queue with napi enabled,
>the following sequence is observed in gfar_add_rx_frag:
>

Hi,

Could you help provide some context, e.g.:
On what board/soc were you able to trigger this issue?
How often does the overrun occur? What's the use case? Is the issue triggered with smaller packets than 9600B?
Increasing the Rx ring size does significantly reduce ring overruns?
Thanks.

>   | lstatus                              |       | skb                   |
>t  | lstatus,  size, flags                | first | len, data_len, *ptr   |
>---+--------------------------------------+-------+-----------------------+
>13 | 18002348, 9032, INTERRUPT LAST       | 0     | 9600, 8000,  f554c12e |
>12 | 10000640, 1600, INTERRUPT            | 0     | 8000, 6400,  f554c12e |
>11 | 10000640, 1600, INTERRUPT            | 0     | 6400, 4800,  f554c12e |
>10 | 10000640, 1600, INTERRUPT            | 0     | 4800, 3200,  f554c12e |
>09 | 10000640, 1600, INTERRUPT            | 0     | 3200, 1600,  f554c12e |
>08 | 14000640, 1600, INTERRUPT FIRST      | 0     | 1600, 0,     f554c12e |
>07 | 14000640, 1600, INTERRUPT FIRST      | 1     | 0,    0,     f554c12e |
>06 | 1c000080, 128,  INTERRUPT LAST FIRST | 1     | 0,    0,     abf3bd6e |
>05 | 18002348, 9032, INTERRUPT LAST       | 0     | 8000, 6400,  c5a57780 |
>04 | 10000640, 1600, INTERRUPT            | 0     | 6400, 4800,  c5a57780 |
>03 | 10000640, 1600, INTERRUPT            | 0     | 4800, 3200,  c5a57780 |
>02 | 10000640, 1600, INTERRUPT            | 0     | 3200, 1600,  c5a57780 |
>01 | 10000640, 1600, INTERRUPT            | 0     | 1600, 0,     c5a57780 |
>00 | 14000640, 1600, INTERRUPT FIRST      | 1     | 0,    0,     c5a57780 |
>
>So at t=7 a new packets is started but not finished, probably due to rx
>overrun - but rx overrun is not indicated in the flags. Instead a new
>packets starts at t=8. This results in skb->len to exceed size for the LAST
>fragment at t=13 and thus a negative fragment size added to the skb.
>
>This then crashes:
>
>kernel BUG at include/linux/skbuff.h:2277!
>Oops: Exception in kernel mode, sig: 5 [#1]
>...
>NIP [c04689f4] skb_pull+0x2c/0x48
>LR [c03f62ac] gfar_clean_rx_ring+0x2e4/0x844
>Call Trace:
>[ec4bfd38] [c06a84c4] _raw_spin_unlock_irqrestore+0x60/0x7c (unreliable)
>[ec4bfda8] [c03f6a44] gfar_poll_rx_sq+0x48/0xe4
>[ec4bfdc8] [c048d504] __napi_poll+0x54/0x26c
>[ec4bfdf8] [c048d908] net_rx_action+0x138/0x2c0
>[ec4bfe68] [c06a8f34] __do_softirq+0x3a4/0x4fc
>[ec4bfed8] [c0040150] run_ksoftirqd+0x58/0x70
>[ec4bfee8] [c0066ecc] smpboot_thread_fn+0x184/0x1cc
>[ec4bff08] [c0062718] kthread+0x140/0x144
>[ec4bff38] [c0012350] ret_from_kernel_thread+0x14/0x1c
>
>This patch fixes this by checking for computed LAST fragment size, so a
>negative sized fragment is never added.
>In order to prevent the newer rx frame from getting corrupted, the FIRST
>flag is checked to discard the incomplete older frame.
>
>Signed-off-by: Michael Braun <michael-dev@fami-braun.de>
>---
> drivers/net/ethernet/freescale/gianfar.c | 14 ++++++++++++++
> 1 file changed, 14 insertions(+)
>
>diff --git a/drivers/net/ethernet/freescale/gianfar.c
>b/drivers/net/ethernet/freescale/gianfar.c
>index 541de32ea662..2aecae23bfd0 100644
>--- a/drivers/net/ethernet/freescale/gianfar.c
>+++ b/drivers/net/ethernet/freescale/gianfar.c
>@@ -2390,6 +2390,10 @@ static bool gfar_add_rx_frag(struct gfar_rx_buff
>*rxb, u32 lstatus,
> 		if (lstatus & BD_LFLAG(RXBD_LAST))
> 			size -= skb->len;
>
>+		WARN(size < 0, "gianfar: rx fragment size underflow");
>+		if (size < 0)
>+			return false;
>+
> 		skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, page,
> 				rxb->page_offset + RXBUF_ALIGNMENT,
> 				size, GFAR_RXB_TRUESIZE);
>@@ -2552,6 +2556,16 @@ static int gfar_clean_rx_ring(struct gfar_priv_rx_q
>*rx_queue,
> 		if (lstatus & BD_LFLAG(RXBD_EMPTY))
> 			break;
>
>+		/* lost RXBD_LAST descriptor due to overrun */
>+		if (skb &&
>+		    (lstatus & BD_LFLAG(RXBD_FIRST))) {
>+			/* discard faulty buffer */
>+			dev_kfree_skb(skb);
>+			skb = NULL;
>+
>+			/* can continue normally */
>+		}
>+

This is indeed an invalid state. If you hit this condition, discarding the skb is the right thing to do.

Acked-by: Claudiu Manoil <claudiu.manoil@nxp.com>

> 		/* order rx buffer descriptor reads */
> 		rmb();
>
>--
>2.20.1
michael-dev March 7, 2021, 8:51 p.m. UTC | #2
Hi,

sorry I missed the mail.

Am 04.03.2021 11:05, schrieb Claudiu Manoil:
> Could you help provide some context, e.g.:

> On what board/soc were you able to trigger this issue?


I have an OpenWRT running on P1020WLAN boards and have IPsec for some 
gretap tunnel configured.
I run iperf3 -s on the AP and iperf3 -c --udp -b 1000M on some server 
and then the AP reliably crashes.
It also crashed sometimes during normal operations when doing some fast 
download over the IPsec tunnel, but that was hard to reproduce.

> How often does the overrun occur?

Reliably within seconds with the above test.

> What's the use case? Is the issue triggered with smaller packets than 

> 9600B?

It cannot be triggered with 1500 Byte  MTU packets as these have FIRST 
and LAST set in one.
I use jumbo frames to speed up ipsec.

> Increasing the Rx ring size does significantly reduce ring overruns?


I did not test this.

Regards,
Michael Braun
diff mbox series

Patch

diff --git a/drivers/net/ethernet/freescale/gianfar.c b/drivers/net/ethernet/freescale/gianfar.c
index 541de32ea662..2aecae23bfd0 100644
--- a/drivers/net/ethernet/freescale/gianfar.c
+++ b/drivers/net/ethernet/freescale/gianfar.c
@@ -2390,6 +2390,10 @@  static bool gfar_add_rx_frag(struct gfar_rx_buff *rxb, u32 lstatus,
 		if (lstatus & BD_LFLAG(RXBD_LAST))
 			size -= skb->len;
 
+		WARN(size < 0, "gianfar: rx fragment size underflow");
+		if (size < 0)
+			return false;
+
 		skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, page,
 				rxb->page_offset + RXBUF_ALIGNMENT,
 				size, GFAR_RXB_TRUESIZE);
@@ -2552,6 +2556,16 @@  static int gfar_clean_rx_ring(struct gfar_priv_rx_q *rx_queue,
 		if (lstatus & BD_LFLAG(RXBD_EMPTY))
 			break;
 
+		/* lost RXBD_LAST descriptor due to overrun */
+		if (skb &&
+		    (lstatus & BD_LFLAG(RXBD_FIRST))) {
+			/* discard faulty buffer */
+			dev_kfree_skb(skb);
+			skb = NULL;
+
+			/* can continue normally */
+		}
+
 		/* order rx buffer descriptor reads */
 		rmb();