[net-next,v1,2/6] lan743x: support rx multi-buffer packets

Message ID 20210129195240.31871-3-TheSven73@gmail.com
State New
Series lan743x speed boost

Commit Message

Sven Van Asbroeck Jan. 29, 2021, 7:52 p.m. UTC
From: Sven Van Asbroeck <thesven73@gmail.com>

Multi-buffer packets enable us to use rx ring buffers smaller than
the mtu. This will allow us to change the mtu on-the-fly, without
having to stop the network interface in order to re-size the rx
ring buffers.

This is a big change touching a key driver function (process_packet),
so care has been taken to test this extensively:

Tests with debug logging enabled (add #define DEBUG).

1. Limit rx buffer size to 500, so mtu (1500) takes 3 buffers.
   Ping to chip, verify correct packet size is sent to OS.
   Ping large packets to chip (ping -s 1400), verify correct
     packet size is sent to OS.
   Ping using packets around the buffer size, verify number of
     buffers is changing, verify correct packet size is sent
     to OS:
     $ ping -s 472
     $ ping -s 473
     $ ping -s 992
     $ ping -s 993
   Verify that each packet is followed by extension processing.

2. Limit rx buffer size to 500, so mtu (1500) takes 3 buffers.
   Run iperf3 -s on chip, verify that packets come in 3 buffers
     at a time.
   Verify that packet size is equal to mtu.
   Verify that each packet is followed by extension processing.

3. Set chip and host mtu to 2000.
   Limit rx buffer size to 500, so mtu (2000) takes 4 buffers.
   Run iperf3 -s on chip, verify that packets come in 4 buffers
     at a time.
   Verify that packet size is equal to mtu.
   Verify that each packet is followed by extension processing.

Tests with debug logging DISabled (remove #define DEBUG).

4. Limit rx buffer size to 500, so mtu (1500) takes 3 buffers.
   Run iperf3 -s on chip, note sustained rx speed.
   Set chip and host mtu to 2000, so mtu takes 4 buffers.
   Run iperf3 -s on chip, note sustained rx speed.
   Verify no packets are dropped in both cases.

Tests with DEBUG_KMEMLEAK on:
 $ mount -t debugfs nodev /sys/kernel/debug/
 $ echo scan > /sys/kernel/debug/kmemleak

5. Limit rx buffer size to 500, so mtu (1500) takes 3 buffers.
   Run the following tests concurrently for at least one hour:
   - iperf3 -s on chip
   - ping -> chip
   Monitor reported memory leaks.

6. Set chip and host mtu to 2000.
   Limit rx buffer size to 500, so mtu (2000) takes 4 buffers.
   Run the following tests concurrently for at least one hour:
   - iperf3 -s on chip
   - ping -> chip
   Monitor reported memory leaks.

7. Simulate low-memory in lan743x_rx_allocate_skb(): fail every
     100 allocations.
   Repeat (5) and (6).
   Monitor reported memory leaks.

8. Simulate low-memory in lan743x_rx_allocate_skb(): fail 10
     allocations in a row in every 100.
   Repeat (5) and (6).
   Monitor reported memory leaks.

9. Simulate low-memory in lan743x_rx_trim_skb(): fail 1 allocation
     in every 100.
   Repeat (5) and (6).
   Monitor reported memory leaks.

Signed-off-by: Sven Van Asbroeck <thesven73@gmail.com>
---

Tree: git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git # 46eb3c108fe1

To: Bryan Whitehead <bryan.whitehead@microchip.com>
To: UNGLinuxDriver@microchip.com
To: "David S. Miller" <davem@davemloft.net>
To: Jakub Kicinski <kuba@kernel.org>
Cc: Andrew Lunn <andrew@lunn.ch>
Cc: Alexey Denisov <rtgbnm@gmail.com>
Cc: Sergej Bauer <sbauer@blackbox.su>
Cc: Tim Harvey <tharvey@gateworks.com>
Cc: Anders Rønningen <anders@ronningen.priv.no>
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org (open list)

 drivers/net/ethernet/microchip/lan743x_main.c | 321 ++++++++----------
 drivers/net/ethernet/microchip/lan743x_main.h |   2 +
 2 files changed, 143 insertions(+), 180 deletions(-)

Comments

Willem de Bruijn Jan. 29, 2021, 10:11 p.m. UTC | #1
On Fri, Jan 29, 2021 at 2:56 PM Sven Van Asbroeck <thesven73@gmail.com> wrote:
>
> From: Sven Van Asbroeck <thesven73@gmail.com>
>
> Multi-buffer packets enable us to use rx ring buffers smaller than
> the mtu. This will allow us to change the mtu on-the-fly, without
> having to stop the network interface in order to re-size the rx
> ring buffers.
>
> This is a big change touching a key driver function (process_packet),
> so care has been taken to test this extensively:
>
> Tests with debug logging enabled (add #define DEBUG).
>
> 1. Limit rx buffer size to 500, so mtu (1500) takes 3 buffers.
>    Ping to chip, verify correct packet size is sent to OS.
>    Ping large packets to chip (ping -s 1400), verify correct
>      packet size is sent to OS.
>    Ping using packets around the buffer size, verify number of
>      buffers is changing, verify correct packet size is sent
>      to OS:
>      $ ping -s 472
>      $ ping -s 473
>      $ ping -s 992
>      $ ping -s 993
>    Verify that each packet is followed by extension processing.
>
> 2. Limit rx buffer size to 500, so mtu (1500) takes 3 buffers.
>    Run iperf3 -s on chip, verify that packets come in 3 buffers
>      at a time.
>    Verify that packet size is equal to mtu.
>    Verify that each packet is followed by extension processing.
>
> 3. Set chip and host mtu to 2000.
>    Limit rx buffer size to 500, so mtu (2000) takes 4 buffers.
>    Run iperf3 -s on chip, verify that packets come in 4 buffers
>      at a time.
>    Verify that packet size is equal to mtu.
>    Verify that each packet is followed by extension processing.
>
> Tests with debug logging DISabled (remove #define DEBUG).
>
> 4. Limit rx buffer size to 500, so mtu (1500) takes 3 buffers.
>    Run iperf3 -s on chip, note sustained rx speed.
>    Set chip and host mtu to 2000, so mtu takes 4 buffers.
>    Run iperf3 -s on chip, note sustained rx speed.
>    Verify no packets are dropped in both cases.
>
> Tests with DEBUG_KMEMLEAK on:
>  $ mount -t debugfs nodev /sys/kernel/debug/
>  $ echo scan > /sys/kernel/debug/kmemleak
>
> 5. Limit rx buffer size to 500, so mtu (1500) takes 3 buffers.
>    Run the following tests concurrently for at least one hour:
>    - iperf3 -s on chip
>    - ping -> chip
>    Monitor reported memory leaks.
>
> 6. Set chip and host mtu to 2000.
>    Limit rx buffer size to 500, so mtu (2000) takes 4 buffers.
>    Run the following tests concurrently for at least one hour:
>    - iperf3 -s on chip
>    - ping -> chip
>    Monitor reported memory leaks.
>
> 7. Simulate low-memory in lan743x_rx_allocate_skb(): fail every
>      100 allocations.
>    Repeat (5) and (6).
>    Monitor reported memory leaks.
>
> 8. Simulate  low-memory in lan743x_rx_allocate_skb(): fail 10
>      allocations in a row in every 100.
>    Repeat (5) and (6).
>    Monitor reported memory leaks.
>
> 9. Simulate  low-memory in lan743x_rx_trim_skb(): fail 1 allocation
>      in every 100.
>    Repeat (5) and (6).
>    Monitor reported memory leaks.
>
> Signed-off-by: Sven Van Asbroeck <thesven73@gmail.com>
> ---
>
> Tree: git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git # 46eb3c108fe1
>
> To: Bryan Whitehead <bryan.whitehead@microchip.com>
> To: UNGLinuxDriver@microchip.com
> To: "David S. Miller" <davem@davemloft.net>
> To: Jakub Kicinski <kuba@kernel.org>
> Cc: Andrew Lunn <andrew@lunn.ch>
> Cc: Alexey Denisov <rtgbnm@gmail.com>
> Cc: Sergej Bauer <sbauer@blackbox.su>
> Cc: Tim Harvey <tharvey@gateworks.com>
> Cc: Anders Rønningen <anders@ronningen.priv.no>
> Cc: netdev@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org (open list)
>
>  drivers/net/ethernet/microchip/lan743x_main.c | 321 ++++++++----------
>  drivers/net/ethernet/microchip/lan743x_main.h |   2 +
>  2 files changed, 143 insertions(+), 180 deletions(-)


> +static struct sk_buff *
> +lan743x_rx_trim_skb(struct sk_buff *skb, int frame_length)
> +{
> +       if (skb_linearize(skb)) {

Is this needed? That will be quite expensive

> +               dev_kfree_skb_irq(skb);
> +               return NULL;
> +       }
> +       frame_length = max_t(int, 0, frame_length - RX_HEAD_PADDING - 2);
> +       if (skb->len > frame_length) {
> +               skb->tail -= skb->len - frame_length;
> +               skb->len = frame_length;
> +       }
> +       return skb;
> +}
> +
>  static int lan743x_rx_process_packet(struct lan743x_rx *rx)
>  {
> -       struct skb_shared_hwtstamps *hwtstamps = NULL;
> +       struct lan743x_rx_descriptor *descriptor, *desc_ext;
>         int result = RX_PROCESS_RESULT_NOTHING_TO_DO;
>         int current_head_index = le32_to_cpu(*rx->head_cpu_ptr);
>         struct lan743x_rx_buffer_info *buffer_info;
> -       struct lan743x_rx_descriptor *descriptor;
> +       struct skb_shared_hwtstamps *hwtstamps;
> +       int frame_length, buffer_length;
> +       struct sk_buff *skb;
>         int extension_index = -1;
> -       int first_index = -1;
> -       int last_index = -1;
> +       bool is_last, is_first;
>
>         if (current_head_index < 0 || current_head_index >= rx->ring_size)
>                 goto done;
> @@ -2068,170 +2075,126 @@ static int lan743x_rx_process_packet(struct lan743x_rx *rx)
>         if (rx->last_head < 0 || rx->last_head >= rx->ring_size)
>                 goto done;
>
> -       if (rx->last_head != current_head_index) {
> -               descriptor = &rx->ring_cpu_ptr[rx->last_head];
> -               if (le32_to_cpu(descriptor->data0) & RX_DESC_DATA0_OWN_)
> -                       goto done;
> +       if (rx->last_head == current_head_index)
> +               goto done;

Is it possible to avoid the large indentation change, or else do that
in a separate patch? It makes it harder to follow the functional
change.

>
> -               if (!(le32_to_cpu(descriptor->data0) & RX_DESC_DATA0_FS_))
> -                       goto done;
> +       descriptor = &rx->ring_cpu_ptr[rx->last_head];
> +       if (le32_to_cpu(descriptor->data0) & RX_DESC_DATA0_OWN_)
> +               goto done;
> +       buffer_info = &rx->buffer_info[rx->last_head];
>
> -               first_index = rx->last_head;
> -               if (le32_to_cpu(descriptor->data0) & RX_DESC_DATA0_LS_) {
> -                       last_index = rx->last_head;
> -               } else {
> -                       int index;
> -
> -                       index = lan743x_rx_next_index(rx, first_index);
> -                       while (index != current_head_index) {
> -                               descriptor = &rx->ring_cpu_ptr[index];
> -                               if (le32_to_cpu(descriptor->data0) & RX_DESC_DATA0_OWN_)
> -                                       goto done;
> -
> -                               if (le32_to_cpu(descriptor->data0) & RX_DESC_DATA0_LS_) {
> -                                       last_index = index;
> -                                       break;
> -                               }
> -                               index = lan743x_rx_next_index(rx, index);
> -                       }
> -               }
> -               if (last_index >= 0) {
> -                       descriptor = &rx->ring_cpu_ptr[last_index];
> -                       if (le32_to_cpu(descriptor->data0) & RX_DESC_DATA0_EXT_) {
> -                               /* extension is expected to follow */
> -                               int index = lan743x_rx_next_index(rx,
> -                                                                 last_index);
> -                               if (index != current_head_index) {
> -                                       descriptor = &rx->ring_cpu_ptr[index];
> -                                       if (le32_to_cpu(descriptor->data0) &
> -                                           RX_DESC_DATA0_OWN_) {
> -                                               goto done;
> -                                       }
> -                                       if (le32_to_cpu(descriptor->data0) &
> -                                           RX_DESC_DATA0_EXT_) {
> -                                               extension_index = index;
> -                                       } else {
> -                                               goto done;
> -                                       }
> -                               } else {
> -                                       /* extension is not yet available */
> -                                       /* prevent processing of this packet */
> -                                       first_index = -1;
> -                                       last_index = -1;
> -                               }
> -                       }
> -               }
> +       is_last = le32_to_cpu(descriptor->data0) & RX_DESC_DATA0_LS_;
> +       is_first = le32_to_cpu(descriptor->data0) & RX_DESC_DATA0_FS_;
> +
> +       if (is_last && le32_to_cpu(descriptor->data0) & RX_DESC_DATA0_EXT_) {
> +               /* extension is expected to follow */
> +               int index = lan743x_rx_next_index(rx, rx->last_head);
> +
> +               if (index == current_head_index)
> +                       /* extension not yet available */
> +                       goto done;
> +               desc_ext = &rx->ring_cpu_ptr[index];
> +               if (le32_to_cpu(desc_ext->data0) & RX_DESC_DATA0_OWN_)
> +                       /* extension not yet available */
> +                       goto done;
> +               if (!(le32_to_cpu(desc_ext->data0) & RX_DESC_DATA0_EXT_))
> +                       goto move_forward;
> +               extension_index = index;
>         }
> -       if (first_index >= 0 && last_index >= 0) {
> -               int real_last_index = last_index;
> -               struct sk_buff *skb = NULL;
> -               u32 ts_sec = 0;
> -               u32 ts_nsec = 0;
> -
> -               /* packet is available */
> -               if (first_index == last_index) {
> -                       /* single buffer packet */
> -                       struct sk_buff *new_skb = NULL;
> -                       int packet_length;
> -
> -                       new_skb = lan743x_rx_allocate_skb(rx);
> -                       if (!new_skb) {
> -                               /* failed to allocate next skb.
> -                                * Memory is very low.
> -                                * Drop this packet and reuse buffer.
> -                                */
> -                               lan743x_rx_reuse_ring_element(rx, first_index);
> -                               goto process_extension;
> -                       }
>
> -                       buffer_info = &rx->buffer_info[first_index];
> -                       skb = buffer_info->skb;
> -                       descriptor = &rx->ring_cpu_ptr[first_index];
> -
> -                       /* unmap from dma */
> -                       packet_length = RX_DESC_DATA0_FRAME_LENGTH_GET_
> -                                       (descriptor->data0);
> -                       if (buffer_info->dma_ptr) {
> -                               dma_sync_single_for_cpu(&rx->adapter->pdev->dev,
> -                                                       buffer_info->dma_ptr,
> -                                                       packet_length,
> -                                                       DMA_FROM_DEVICE);
> -                               dma_unmap_single_attrs(&rx->adapter->pdev->dev,
> -                                                      buffer_info->dma_ptr,
> -                                                      buffer_info->buffer_length,
> -                                                      DMA_FROM_DEVICE,
> -                                                      DMA_ATTR_SKIP_CPU_SYNC);
> -                               buffer_info->dma_ptr = 0;
> -                               buffer_info->buffer_length = 0;
> -                       }
> -                       buffer_info->skb = NULL;
> -                       packet_length = RX_DESC_DATA0_FRAME_LENGTH_GET_
> -                                       (le32_to_cpu(descriptor->data0));
> -                       skb_put(skb, packet_length - 4);
> -                       skb->protocol = eth_type_trans(skb,
> -                                                      rx->adapter->netdev);
> -                       lan743x_rx_init_ring_element(rx, first_index, new_skb);
> -               } else {
> -                       int index = first_index;
> +       /* Only the last buffer in a multi-buffer frame contains the total frame
> +        * length. All other buffers have a zero frame length. The chip
> +        * occasionally sends more buffers than strictly required to reach the
> +        * total frame length.
> +        * Handle this by adding all buffers to the skb in their entirety.
> +        * Once the real frame length is known, trim the skb.
> +        */
> +       frame_length =
> +               RX_DESC_DATA0_FRAME_LENGTH_GET_(le32_to_cpu(descriptor->data0));
> +       buffer_length = buffer_info->buffer_length;
>
> -                       /* multi buffer packet not supported */
> -                       /* this should not happen since buffers are allocated
> -                        * to be at least the mtu size configured in the mac.
> -                        */
> +       netdev_dbg(rx->adapter->netdev, "%s%schunk: %d/%d",
> +                  is_first ? "first " : "      ",
> +                  is_last  ? "last  " : "      ",
> +                  frame_length, buffer_length);
>
> -                       /* clean up buffers */
> -                       if (first_index <= last_index) {
> -                               while ((index >= first_index) &&
> -                                      (index <= last_index)) {
> -                                       lan743x_rx_reuse_ring_element(rx,
> -                                                                     index);
> -                                       index = lan743x_rx_next_index(rx,
> -                                                                     index);
> -                               }
> -                       } else {
> -                               while ((index >= first_index) ||
> -                                      (index <= last_index)) {
> -                                       lan743x_rx_reuse_ring_element(rx,
> -                                                                     index);
> -                                       index = lan743x_rx_next_index(rx,
> -                                                                     index);
> -                               }
> -                       }
> -               }
> +       /* unmap from dma */
> +       if (buffer_info->dma_ptr) {
> +               dma_unmap_single(&rx->adapter->pdev->dev,
> +                                buffer_info->dma_ptr,
> +                                buffer_info->buffer_length,
> +                                DMA_FROM_DEVICE);
> +               buffer_info->dma_ptr = 0;
> +               buffer_info->buffer_length = 0;
> +       }
> +       skb = buffer_info->skb;
>
> -process_extension:
> -               if (extension_index >= 0) {
> -                       descriptor = &rx->ring_cpu_ptr[extension_index];
> -                       buffer_info = &rx->buffer_info[extension_index];
> -
> -                       ts_sec = le32_to_cpu(descriptor->data1);
> -                       ts_nsec = (le32_to_cpu(descriptor->data2) &
> -                                 RX_DESC_DATA2_TS_NS_MASK_);
> -                       lan743x_rx_reuse_ring_element(rx, extension_index);
> -                       real_last_index = extension_index;
> -               }
> +       /* allocate new skb and map to dma */
> +       if (lan743x_rx_init_ring_element(rx, rx->last_head)) {
> +               /* failed to allocate next skb.
> +                * Memory is very low.
> +                * Drop this packet and reuse buffer.
> +                */
> +               lan743x_rx_reuse_ring_element(rx, rx->last_head);
> +               goto process_extension;
> +       }
> +
> +       /* add buffers to skb via skb->frag_list */
> +       if (is_first) {
> +               skb_reserve(skb, RX_HEAD_PADDING);
> +               skb_put(skb, buffer_length - RX_HEAD_PADDING);
> +               if (rx->skb_head)
> +                       dev_kfree_skb_irq(rx->skb_head);
> +               rx->skb_head = skb;
> +       } else if (rx->skb_head) {
> +               skb_put(skb, buffer_length);
> +               if (skb_shinfo(rx->skb_head)->frag_list)
> +                       rx->skb_tail->next = skb;
> +               else
> +                       skb_shinfo(rx->skb_head)->frag_list = skb;

Instead of chaining skbs into frag_list, you could perhaps delay skb
alloc until after reception, allocate buffers stand-alone, and link
them into the skb as skb_frags? That might avoid a few skb alloc +
frees. Though a big change, not sure how feasible.

> +               rx->skb_tail = skb;
> +               rx->skb_head->len += skb->len;
> +               rx->skb_head->data_len += skb->len;
> +               rx->skb_head->truesize += skb->truesize;
> +       } else {
> +               rx->skb_head = skb;
> +       }
>
> -               if (!skb) {
> -                       result = RX_PROCESS_RESULT_PACKET_DROPPED;
> -                       goto move_forward;
> +process_extension:
> +       if (extension_index >= 0) {
> +               u32 ts_sec;
> +               u32 ts_nsec;
> +
> +               ts_sec = le32_to_cpu(desc_ext->data1);
> +               ts_nsec = (le32_to_cpu(desc_ext->data2) &
> +                         RX_DESC_DATA2_TS_NS_MASK_);
> +               if (rx->skb_head) {
> +                       hwtstamps = skb_hwtstamps(rx->skb_head);
> +                       if (hwtstamps)

This is always true.

You can just call skb_hwtstamps(skb)->hwtstamp = ktime_set(ts_sec, ts_nsec);

Though I see that this is existing code just moved due to
aforementioned indentation change.
Sven Van Asbroeck Jan. 29, 2021, 11:02 p.m. UTC | #2
Hoi Willem, thanks a lot for reviewing this patch, much appreciated !!

On Fri, Jan 29, 2021 at 5:11 PM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> > +static struct sk_buff *
> > +lan743x_rx_trim_skb(struct sk_buff *skb, int frame_length)
> > +{
> > +       if (skb_linearize(skb)) {
>
> Is this needed? That will be quite expensive

The skb will only be non-linear when it's created from a multi-buffer frame.
Multi-buffer frames are only generated right after an mtu change - fewer than
32 frames will be non-linear after an mtu increase. So as long as people don't
change the mtu in a tight loop, skb_linearize is just a single comparison,
99.999999+% of the time.
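For reference, the check itself is roughly this (paraphrasing skb_linearize()
from include/linux/skbuff.h, not the verbatim kernel source):

    static inline int skb_linearize(struct sk_buff *skb)
    {
            /* skb_is_nonlinear() just tests skb->data_len, so a linear
             * single-buffer skb returns here immediately
             */
            return skb_is_nonlinear(skb) ? __skb_linearize(skb) : 0;
    }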

>
> Is it possible to avoid the large indentation change, or else do that
> in a separate patch? It makes it harder to follow the functional
> change.

It's not immediately obvious, but I have replaced the whole function with
slightly different logic; the replacement has a much flatter indentation
structure and should be easier to follow.

Or perhaps I am misinterpreting your question?

> > +
> > +       /* add buffers to skb via skb->frag_list */
> > +       if (is_first) {
> > +               skb_reserve(skb, RX_HEAD_PADDING);
> > +               skb_put(skb, buffer_length - RX_HEAD_PADDING);
> > +               if (rx->skb_head)
> > +                       dev_kfree_skb_irq(rx->skb_head);
> > +               rx->skb_head = skb;
> > +       } else if (rx->skb_head) {
> > +               skb_put(skb, buffer_length);
> > +               if (skb_shinfo(rx->skb_head)->frag_list)
> > +                       rx->skb_tail->next = skb;
> > +               else
> > +                       skb_shinfo(rx->skb_head)->frag_list = skb;
>
> Instead of chaining skbs into frag_list, you could perhaps delay skb
> alloc until after reception, allocate buffers stand-alone, and link
> them into the skb as skb_frags? That might avoid a few skb alloc +
> frees. Though a bit change, not sure how feasible.

The problem here is this (copypasta from somewhere else in this patch):

/* Only the last buffer in a multi-buffer frame contains the total frame
* length. All other buffers have a zero frame length. The chip
* occasionally sends more buffers than strictly required to reach the
* total frame length.
* Handle this by adding all buffers to the skb in their entirety.
* Once the real frame length is known, trim the skb.
*/

In other words, the chip sometimes sends more buffers than strictly needed to
fit the frame. linearize + trim deals with this thorny issue perfectly.

If the skb weren't linearized, we would run into trouble when trying to trim
(remove from the end) a chunk bigger than the last skb fragment.
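To put rough (hypothetical) numbers on it, with 500-byte rx buffers and a
1400-byte frame the chip might hand us four buffers:

    bytes added to the skb:   4 x 500          = 2000
    frame_length (from the last descriptor)    = 1400
    excess to trim off the tail                ~  600

(ignoring the RX_HEAD_PADDING / FCS corrections in lan743x_rx_trim_skb).
After linearizing, dropping those ~600 bytes is a simple skb->tail/skb->len
adjustment; on a frag_list skb the excess could span more than just the
last fragment.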

> > +process_extension:
> > +       if (extension_index >= 0) {
> > +               u32 ts_sec;
> > +               u32 ts_nsec;
> > +
> > +               ts_sec = le32_to_cpu(desc_ext->data1);
> > +               ts_nsec = (le32_to_cpu(desc_ext->data2) &
> > +                         RX_DESC_DATA2_TS_NS_MASK_);
> > +               if (rx->skb_head) {
> > +                       hwtstamps = skb_hwtstamps(rx->skb_head);
> > +                       if (hwtstamps)
>
> This is always true.
>
> You can just call skb_hwtstamps(skb)->hwtstamp = ktime_set(ts_sec, ts_nsec);

Thank you, will do !

>
> Though I see that this is existing code just moved due to
> aforementioned indentation change.

True, but I can make the change anyway.
Willem de Bruijn Jan. 29, 2021, 11:08 p.m. UTC | #3
On Fri, Jan 29, 2021 at 6:03 PM Sven Van Asbroeck <thesven73@gmail.com> wrote:
>
> Hoi Willem, thanks a lot for reviewing this patch, much appreciated !!
>
> On Fri, Jan 29, 2021 at 5:11 PM Willem de Bruijn
> <willemdebruijn.kernel@gmail.com> wrote:
> >
> > > +static struct sk_buff *
> > > +lan743x_rx_trim_skb(struct sk_buff *skb, int frame_length)
> > > +{
> > > +       if (skb_linearize(skb)) {
> >
> > Is this needed? That will be quite expensive
>
> The skb will only be non-linear when it's created from a multi-buffer frame.
> Multi-buffer frames are only generated right after a mtu change - fewer than
> 32 frames will be non-linear after an mtu increase. So as long as people don't
> change the mtu in a tight loop, skb_linearize is just a single comparison,
> 99.999999+% of the time.

Ah. I had missed the temporary state of this until the buffers are
reinitialized. Yes, then there is no reason to worry. Same for the
frag_list vs frags comment I made.

> >
> > Is it possible to avoid the large indentation change, or else do that
> > in a separate patch? It makes it harder to follow the functional
> > change.
>
> It's not immediately obvious, but I have replaced the whole function
> with slightly different logic, and the replacement content has a much
> flatter indentation structure, and should be easier to follow.
>
> Or perhaps I am misinterpreting your question?

Okay. I found it a bit hard to parse how much true code change was
mixed in with just reindenting existing code. If a lot, then no need
to split off the code refactor.

>
> > > +
> > > +       /* add buffers to skb via skb->frag_list */
> > > +       if (is_first) {
> > > +               skb_reserve(skb, RX_HEAD_PADDING);
> > > +               skb_put(skb, buffer_length - RX_HEAD_PADDING);
> > > +               if (rx->skb_head)
> > > +                       dev_kfree_skb_irq(rx->skb_head);
> > > +               rx->skb_head = skb;
> > > +       } else if (rx->skb_head) {
> > > +               skb_put(skb, buffer_length);
> > > +               if (skb_shinfo(rx->skb_head)->frag_list)
> > > +                       rx->skb_tail->next = skb;
> > > +               else
> > > +                       skb_shinfo(rx->skb_head)->frag_list = skb;
> >
> > Instead of chaining skbs into frag_list, you could perhaps delay skb
> > alloc until after reception, allocate buffers stand-alone, and link
> > them into the skb as skb_frags? That might avoid a few skb alloc +
> > frees. Though a bit change, not sure how feasible.
>
> The problem here is this (copypasta from somewhere else in this patch):
>
> /* Only the last buffer in a multi-buffer frame contains the total frame
> * length. All other buffers have a zero frame length. The chip
> * occasionally sends more buffers than strictly required to reach the
> * total frame length.
> * Handle this by adding all buffers to the skb in their entirety.
> * Once the real frame length is known, trim the skb.
> */
>
> In other words, the chip sometimes sends more buffers than strictly needed to
> fit the frame. linearize + trim deals with this thorny issue perfectly.
>
> If the skb weren't linearized, we would run into trouble when trying to trim
> (remove from the end) a chunk bigger than the last skb fragment.
>
> > > +process_extension:
> > > +       if (extension_index >= 0) {
> > > +               u32 ts_sec;
> > > +               u32 ts_nsec;
> > > +
> > > +               ts_sec = le32_to_cpu(desc_ext->data1);
> > > +               ts_nsec = (le32_to_cpu(desc_ext->data2) &
> > > +                         RX_DESC_DATA2_TS_NS_MASK_);
> > > +               if (rx->skb_head) {
> > > +                       hwtstamps = skb_hwtstamps(rx->skb_head);
> > > +                       if (hwtstamps)
> >
> > This is always true.
> >
> > You can just call skb_hwtstamps(skb)->hwtstamp = ktime_set(ts_sec, ts_nsec);
>
> Thank you, will do !
>
> >
> > Though I see that this is existing code just moved due to
> > aforementioned indentation change.
>
> True, but I can make the change anyway.
Bryan.Whitehead@microchip.com Jan. 31, 2021, 7:06 a.m. UTC | #4
Hi Sven, 

Looks good.
see comments below.

>  static int lan743x_rx_process_packet(struct lan743x_rx *rx)  {

It looks like this function no longer processes a packet, but rather only processes a single buffer.
So perhaps it should be renamed to lan743x_rx_process_buffer, so it is not misleading.

...
> +       /* unmap from dma */
> +       if (buffer_info->dma_ptr) {
> +               dma_unmap_single(&rx->adapter->pdev->dev,
> +                                buffer_info->dma_ptr,
> +                                buffer_info->buffer_length,
> +                                DMA_FROM_DEVICE);
> +               buffer_info->dma_ptr = 0;
> +               buffer_info->buffer_length = 0;
> +       }
> +       skb = buffer_info->skb;
> 
> -process_extension:
> -               if (extension_index >= 0) {
> -                       descriptor = &rx->ring_cpu_ptr[extension_index];
> -                       buffer_info = &rx->buffer_info[extension_index];
> -
> -                       ts_sec = le32_to_cpu(descriptor->data1);
> -                       ts_nsec = (le32_to_cpu(descriptor->data2) &
> -                                 RX_DESC_DATA2_TS_NS_MASK_);
> -                       lan743x_rx_reuse_ring_element(rx, extension_index);
> -                       real_last_index = extension_index;
> -               }
> +       /* allocate new skb and map to dma */
> +       if (lan743x_rx_init_ring_element(rx, rx->last_head)) {

If lan743x_rx_init_ring_element fails to allocate an skb,
Then lan743x_rx_reuse_ring_element will be called.
But that function expects the skb is already allocated and dma mapped.
But the dma was unmapped above.

Also, if lan743x_rx_init_ring_element fails to allocate an skb, then control
will jump to process_extension and therefore the currently received skb will
not be added to the skb list.
I assume that would corrupt the packet? Or am I missing something?

...
> -               if (!skb) {
> -                       result = RX_PROCESS_RESULT_PACKET_DROPPED;

It looks like this return value is no longer used.
If there is no longer a case where a packet will be dropped 
then maybe this return value should be deleted from the header file.

... 
>  move_forward:
> -               /* push tail and head forward */
> -               rx->last_tail = real_last_index;
> -               rx->last_head = lan743x_rx_next_index(rx, real_last_index);
> -       }
> +       /* push tail and head forward */
> +       rx->last_tail = rx->last_head;
> +       rx->last_head = lan743x_rx_next_index(rx, rx->last_head);
> +       result = RX_PROCESS_RESULT_PACKET_RECEIVED;

Since this function handles one buffer at a time, the return value
RX_PROCESS_RESULT_PACKET_RECEIVED is now misleading.
Can you change it to RX_PROCESS_RESULT_BUFFER_RECEIVED?
Sven Van Asbroeck Jan. 31, 2021, 3:25 p.m. UTC | #5
On Sun, Jan 31, 2021 at 2:06 AM <Bryan.Whitehead@microchip.com> wrote:
>

> >  static int lan743x_rx_process_packet(struct lan743x_rx *rx)  {
> It looks like this function no longer processes a packet, but rather only processes a single buffer.
> So perhaps it should be renamed to lan743x_rx_process_buffer, so it is not misleading.

Agreed, will do.

>
> If lan743x_rx_init_ring_element fails to allocate an skb,
> Then lan743x_rx_reuse_ring_element will be called.
> But that function expects the skb is already allocated and dma mapped.
> But the dma was unmapped above.

Good catch. I think you're right, the skb allocation always has to come before
the unmap. Because if we unmap, and then the skb allocation fails, there is no
guarantee that we can remap the old skb we've just unmapped (it could fail).
And then we'd end up with a broken driver.

BUT I actually joined skb alloc and init_ring_element, because of a very subtle
synchronization bug I was seeing: if someone changes the mtu _in_between_
skb alloc and init_ring_element, things will go haywire, because the skb and
mapping lengths would be different !

We could fix that by using a spinlock I guess, but synchronization primitives
in "hot paths" like these are far from ideal... Would be nice if we could
avoid that.

Here's an idea: what if we fold "unmap from dma" into init_ring_element()?
That way, we get the best of both worlds: length cannot change in the middle,
and the function can always "back out" without touching the ring element
in case an allocation or mapping fails.

Pseudo-code:

init_ring_element() {
    /* single "sampling" of mtu, so no synchronization required */
    length = netdev->mtu + ETH_HLEN + 4 + RX_HEAD_PADDING;

    skb = alloc(length);
    if (!skb) return FAIL;
    dma_ptr = dma_map(skb, length);
    if (!dma_ptr) {
        free(skb);
        return FAIL;
    }
    if (buffer_info->dma_ptr)
        dma_unmap(buffer_info->dma_ptr, buffer_info->buffer_length);
    buffer_info->skb = skb;
    buffer_info->dma_ptr = dma_ptr;
    buffer_info->buffer_length = length;

    return SUCCESS;
}

What do you think?
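For illustration only, a minimal C sketch of the above, reusing the names
from this patch (the real v2 may of course look different):

static int lan743x_rx_init_ring_element(struct lan743x_rx *rx, int index)
{
        struct net_device *netdev = rx->adapter->netdev;
        struct device *dev = &rx->adapter->pdev->dev;
        struct lan743x_rx_buffer_info *buffer_info = &rx->buffer_info[index];
        struct sk_buff *skb;
        dma_addr_t dma_ptr;
        int length;

        /* single "sampling" of mtu, so no synchronization required */
        length = netdev->mtu + ETH_HLEN + 4 + RX_HEAD_PADDING;

        skb = __netdev_alloc_skb(netdev, length, GFP_ATOMIC | GFP_DMA);
        if (!skb)
                return -ENOMEM;
        dma_ptr = dma_map_single(dev, skb->data, length, DMA_FROM_DEVICE);
        if (dma_mapping_error(dev, dma_ptr)) {
                dev_kfree_skb_any(skb);
                return -ENOMEM;
        }
        /* only drop the old mapping once the new one is in hand, so a
         * failure above leaves the ring element untouched
         */
        if (buffer_info->dma_ptr)
                dma_unmap_single(dev, buffer_info->dma_ptr,
                                 buffer_info->buffer_length, DMA_FROM_DEVICE);
        buffer_info->skb = skb;
        buffer_info->dma_ptr = dma_ptr;
        buffer_info->buffer_length = length;
        /* descriptor setup and tail update exactly as in the current patch */
        return 0;
}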

>
> Also if lan743x_rx_init_ring_element fails to allocate an skb.
> Then control will jump to process_extension and therefore
> the currently received skb will not be added to the skb list.
> I assume that would corrupt the packet? Or am I missing something?
>

Yes, an skb alloc failure in the middle of a multi-buffer frame will corrupt
the packet inside the frame. A chunk will be missing. I had assumed that this
would be caught by an upper network layer, since some checksum would be incorrect.

Are there current networking devices that would send a corrupted packet to
Linux if there is a corruption on the physical link? Especially if they don't
support checksumming?

Maybe my assumption is naive.
I'll fix this up if you believe that it could be an issue.

> ...
> > -               if (!skb) {
> > -                       result = RX_PROCESS_RESULT_PACKET_DROPPED;
> It looks like this return value is no longer used.
> If there is no longer a case where a packet will be dropped
> then maybe this return value should be deleted from the header file.

Agreed, will do.

>
> ...
> >  move_forward:
> > -               /* push tail and head forward */
> > -               rx->last_tail = real_last_index;
> > -               rx->last_head = lan743x_rx_next_index(rx, real_last_index);
> > -       }
> > +       /* push tail and head forward */
> > +       rx->last_tail = rx->last_head;
> > +       rx->last_head = lan743x_rx_next_index(rx, rx->last_head);
> > +       result = RX_PROCESS_RESULT_PACKET_RECEIVED;
>
> Since this function handles one buffer at a time,
>   The return value RX_PROCESS_RESULT_PACKET_RECEIVED is now misleading.
>   Can you change it to RX_PROCESS_RESULT_BUFFER_RECEIVED.

Agreed, will do.

RX_PROCESS_RESULT_XXX can now only take two values (RECEIVED and NOTHING_TO_DO),
so in theory it could be replaced by a bool. But perhaps we should keep the
current names, because they are clearer to the reader?

>

>
Bryan.Whitehead@microchip.com Feb. 1, 2021, 6:04 p.m. UTC | #6
Hi Sven, see below

> >
> > If lan743x_rx_init_ring_element fails to allocate an skb, Then
> > lan743x_rx_reuse_ring_element will be called.
> > But that function expects the skb is already allocated and dma mapped.
> > But the dma was unmapped above.
> 
> Good catch. I think you're right, the skb allocation always has to come before
> the unmap. Because if we unmap, and then the skb allocation fails, there is
> no guarantee that we can remap the old skb we've just unmapped (it could
> fail).
> And then we'd end up with a broken driver.
> 
> BUT I actually joined skb alloc and init_ring_element, because of a very
> subtle synchronization bug I was seeing: if someone changes the mtu
> _in_between_ skb alloc and init_ring_element, things will go haywire,
> because the skb and mapping lengths would be different !
> 
> We could fix that by using a spinlock I guess, but synchronization primitives in
> "hot paths" like these are far from ideal... Would be nice if we could avoid
> that.
> 
> Here's an idea: what if we fold "unmap from dma" into init_ring_element()?
> That way, we get the best of both worlds: length cannot change in the
> middle, and the function can always "back out" without touching the ring
> element in case an allocation or mapping fails.
> 
> Pseudo-code:
> 
> init_ring_element() {
>     /* single "sampling" of mtu, so no synchronization required */
>     length = netdev->mtu + ETH_HLEN + 4 + RX_HEAD_PADDING;
> 
>     skb = alloc(length);
>     if (!skb) return FAIL;
>     dma_ptr = dma_map(skb, length);
>     if (!dma_ptr) {
>         free(skb);
>         return FAIL;
>     }
>     if (buffer_info->dma_ptr)
>         dma_unmap(buffer_info->dma_ptr, buffer_info->buffer_length);
>     buffer_info->skb = skb;
>     buffer_info->dma_ptr = dma_ptr;
>     buffer_info->buffer_length = length;
> 
>     return SUCCESS;
> }
> 
> What do you think?

Yes, I believe this will work.

> 
> >
> > Also if lan743x_rx_init_ring_element fails to allocate an skb.
> > Then control will jump to process_extension and therefor the currently
> > received skb will not be added to the skb list.
> > I assume that would corrupt the packet? Or am I missing something?
> >
> 
> Yes if an skb alloc failure in the middle of a multi-buffer frame, will corrupt
> the packet inside the frame. A chunk will be missing. I had assumed that this
> would be caught by an upper network layer, some checksum would be
> incorrect?
> 
> Are there current networking devices that would send a corrupted packet to
> Linux if there is a corruption on the physical link? Especially if they don't
> support checksumming?
> 
> Maybe my assumption is naive.
> I'll fix this up if you believe that it could be an issue.

Yes the upper layers will catch it and drop it.
But if we already know the packet is corrupted, I think it would be better if we
dropped it here, to avoid unnecessary processing upstream.
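Something along these lines would do it (just a sketch of the idea, based on
the allocation-failure path in this patch; the actual fix is up to you):

if (lan743x_rx_init_ring_element(rx, rx->last_head)) {
        /* memory very low: reuse this buffer and drop the partially
         * assembled frame instead of passing a corrupt skb upstream
         */
        lan743x_rx_reuse_ring_element(rx, rx->last_head);
        if (rx->skb_head) {
                dev_kfree_skb_irq(rx->skb_head);
                rx->skb_head = NULL;
        }
        goto process_extension;
}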

...
> 
> RX_PROCESS_RESULT_XXX can now only take two values (RECEIVED and
> NOTHING_TO_DO), so in theory it could be replaced by a bool. But perhaps
> we should keep the current names, because they are clearer to the reader?
> 

I'm ok if you want to change it to a bool.
Sven Van Asbroeck Feb. 3, 2021, 6:53 p.m. UTC | #7
Thank you Bryan. I will prepare a v2 early next week.

Would Microchip be able to donate some time to test v2? My own tests
are certainly not perfect. Various stress tests across architectures
(intel/arm) would help create confidence in the multi-buffer frame
implementation. Perhaps Microchip has various test rigs already set
up?

After all, absence of bugs is more important than speed improvements.
Bryan.Whitehead@microchip.com Feb. 3, 2021, 8:14 p.m. UTC | #8
Hi Sven,

We can test on x86 PC. We will just need about a week after you release your next version.

Thanks,
Bryan

> -----Original Message-----
> From: Sven Van Asbroeck <thesven73@gmail.com>
> Sent: Wednesday, February 3, 2021 1:53 PM
> To: Bryan Whitehead - C21958 <Bryan.Whitehead@microchip.com>
> Cc: UNGLinuxDriver <UNGLinuxDriver@microchip.com>; David Miller
> <davem@davemloft.net>; Jakub Kicinski <kuba@kernel.org>; Andrew Lunn
> <andrew@lunn.ch>; Alexey Denisov <rtgbnm@gmail.com>; Sergej Bauer
> <sbauer@blackbox.su>; Tim Harvey <tharvey@gateworks.com>; Anders
> Rønningen <anders@ronningen.priv.no>; netdev
> <netdev@vger.kernel.org>; Linux Kernel Mailing List <linux-
> kernel@vger.kernel.org>
> Subject: Re: [PATCH net-next v1 2/6] lan743x: support rx multi-buffer
> packets
> 
> EXTERNAL EMAIL: Do not click links or open attachments unless you know the
> content is safe
> 
> Thank you Bryan. I will prepare a v2 early next week.
> 
> Would Microchip be able to donate some time to test v2? My own tests are
> certainly not perfect. Various stress tests across architectures
> (intel/arm) would help create confidence in the multi-buffer frame
> implementation. Perhaps Microchip has various test rigs already set up?
> 
> After all, absence of bugs is more important than speed improvements.
Sven Van Asbroeck Feb. 3, 2021, 8:25 p.m. UTC | #9
On Wed, Feb 3, 2021 at 3:14 PM <Bryan.Whitehead@microchip.com> wrote:
>
> We can test on x86 PC. We will just need about a week after you release your next version.
>

That's great. If you have any suggestions on how I can improve testing
on my end, feel free to reach out.
Bryan.Whitehead@microchip.com Feb. 3, 2021, 8:41 p.m. UTC | #10
> On Wed, Feb 3, 2021 at 3:14 PM <Bryan.Whitehead@microchip.com> wrote:
> >
> > We can test on x86 PC. We will just need about a week after you release
> > your next version.
> >
> 
> That's great. If you have any suggestions on how I can improve testing on my
> end, feel free to reach out.


If you are able, in addition to basic rx and tx iperf tests, I would recommend PTP tests.
PTP relies on the time stamps extracted from the extension descriptors, which are directly in the RX path you are modifying.

If you are not able, we will at least cover that for x86 PC.

Thanks,
Bryan

Patch

diff --git a/drivers/net/ethernet/microchip/lan743x_main.c b/drivers/net/ethernet/microchip/lan743x_main.c
index f485320e5784..b784e9feadac 100644
--- a/drivers/net/ethernet/microchip/lan743x_main.c
+++ b/drivers/net/ethernet/microchip/lan743x_main.c
@@ -1955,15 +1955,6 @@  static int lan743x_rx_next_index(struct lan743x_rx *rx, int index)
 	return ((++index) % rx->ring_size);
 }
 
-static struct sk_buff *lan743x_rx_allocate_skb(struct lan743x_rx *rx)
-{
-	struct net_device *netdev = rx->adapter->netdev;
-
-	return __netdev_alloc_skb(netdev,
-				  netdev->mtu + ETH_HLEN + 4 + RX_HEAD_PADDING,
-				  GFP_ATOMIC | GFP_DMA);
-}
-
 static void lan743x_rx_update_tail(struct lan743x_rx *rx, int index)
 {
 	/* update the tail once per 8 descriptors */
@@ -1972,37 +1963,37 @@  static void lan743x_rx_update_tail(struct lan743x_rx *rx, int index)
 				  index);
 }
 
-static int lan743x_rx_init_ring_element(struct lan743x_rx *rx, int index,
-					struct sk_buff *skb)
+static int lan743x_rx_init_ring_element(struct lan743x_rx *rx, int index)
 {
+	struct net_device *netdev = rx->adapter->netdev;
+	struct device *dev = &rx->adapter->pdev->dev;
 	struct lan743x_rx_buffer_info *buffer_info;
 	struct lan743x_rx_descriptor *descriptor;
-	struct net_device *netdev = rx->adapter->netdev;
+	struct sk_buff *skb;
+	dma_addr_t dma_ptr;
 	int length;
 
 	length = netdev->mtu + ETH_HLEN + 4 + RX_HEAD_PADDING;
+
 	descriptor = &rx->ring_cpu_ptr[index];
 	buffer_info = &rx->buffer_info[index];
-	buffer_info->skb = skb;
-	if (!(buffer_info->skb))
+	skb = __netdev_alloc_skb(netdev, length, GFP_ATOMIC | GFP_DMA);
+	if (!skb)
 		return -ENOMEM;
-	buffer_info->dma_ptr = dma_map_single(&rx->adapter->pdev->dev,
-					      buffer_info->skb->data,
-					      length,
-					      DMA_FROM_DEVICE);
-	if (dma_mapping_error(&rx->adapter->pdev->dev,
-			      buffer_info->dma_ptr)) {
-		buffer_info->dma_ptr = 0;
+	dma_ptr = dma_map_single(dev, skb->data, length, DMA_FROM_DEVICE);
+	if (dma_mapping_error(dev, dma_ptr)) {
+		dev_kfree_skb_any(skb);
 		return -ENOMEM;
 	}
 
+	buffer_info->skb = skb;
+	buffer_info->dma_ptr = dma_ptr;
 	buffer_info->buffer_length = length;
 	descriptor->data1 = cpu_to_le32(DMA_ADDR_LOW32(buffer_info->dma_ptr));
 	descriptor->data2 = cpu_to_le32(DMA_ADDR_HIGH32(buffer_info->dma_ptr));
 	descriptor->data3 = 0;
 	descriptor->data0 = cpu_to_le32((RX_DESC_DATA0_OWN_ |
 			    (length & RX_DESC_DATA0_BUF_LENGTH_MASK_)));
-	skb_reserve(buffer_info->skb, RX_HEAD_PADDING);
 	lan743x_rx_update_tail(rx, index);
 
 	return 0;
@@ -2051,16 +2042,32 @@  static void lan743x_rx_release_ring_element(struct lan743x_rx *rx, int index)
 	memset(buffer_info, 0, sizeof(*buffer_info));
 }
 
+static struct sk_buff *
+lan743x_rx_trim_skb(struct sk_buff *skb, int frame_length)
+{
+	if (skb_linearize(skb)) {
+		dev_kfree_skb_irq(skb);
+		return NULL;
+	}
+	frame_length = max_t(int, 0, frame_length - RX_HEAD_PADDING - 2);
+	if (skb->len > frame_length) {
+		skb->tail -= skb->len - frame_length;
+		skb->len = frame_length;
+	}
+	return skb;
+}
+
 static int lan743x_rx_process_packet(struct lan743x_rx *rx)
 {
-	struct skb_shared_hwtstamps *hwtstamps = NULL;
+	struct lan743x_rx_descriptor *descriptor, *desc_ext;
 	int result = RX_PROCESS_RESULT_NOTHING_TO_DO;
 	int current_head_index = le32_to_cpu(*rx->head_cpu_ptr);
 	struct lan743x_rx_buffer_info *buffer_info;
-	struct lan743x_rx_descriptor *descriptor;
+	struct skb_shared_hwtstamps *hwtstamps;
+	int frame_length, buffer_length;
+	struct sk_buff *skb;
 	int extension_index = -1;
-	int first_index = -1;
-	int last_index = -1;
+	bool is_last, is_first;
 
 	if (current_head_index < 0 || current_head_index >= rx->ring_size)
 		goto done;
@@ -2068,170 +2075,126 @@  static int lan743x_rx_process_packet(struct lan743x_rx *rx)
 	if (rx->last_head < 0 || rx->last_head >= rx->ring_size)
 		goto done;
 
-	if (rx->last_head != current_head_index) {
-		descriptor = &rx->ring_cpu_ptr[rx->last_head];
-		if (le32_to_cpu(descriptor->data0) & RX_DESC_DATA0_OWN_)
-			goto done;
+	if (rx->last_head == current_head_index)
+		goto done;
 
-		if (!(le32_to_cpu(descriptor->data0) & RX_DESC_DATA0_FS_))
-			goto done;
+	descriptor = &rx->ring_cpu_ptr[rx->last_head];
+	if (le32_to_cpu(descriptor->data0) & RX_DESC_DATA0_OWN_)
+		goto done;
+	buffer_info = &rx->buffer_info[rx->last_head];
 
-		first_index = rx->last_head;
-		if (le32_to_cpu(descriptor->data0) & RX_DESC_DATA0_LS_) {
-			last_index = rx->last_head;
-		} else {
-			int index;
-
-			index = lan743x_rx_next_index(rx, first_index);
-			while (index != current_head_index) {
-				descriptor = &rx->ring_cpu_ptr[index];
-				if (le32_to_cpu(descriptor->data0) & RX_DESC_DATA0_OWN_)
-					goto done;
-
-				if (le32_to_cpu(descriptor->data0) & RX_DESC_DATA0_LS_) {
-					last_index = index;
-					break;
-				}
-				index = lan743x_rx_next_index(rx, index);
-			}
-		}
-		if (last_index >= 0) {
-			descriptor = &rx->ring_cpu_ptr[last_index];
-			if (le32_to_cpu(descriptor->data0) & RX_DESC_DATA0_EXT_) {
-				/* extension is expected to follow */
-				int index = lan743x_rx_next_index(rx,
-								  last_index);
-				if (index != current_head_index) {
-					descriptor = &rx->ring_cpu_ptr[index];
-					if (le32_to_cpu(descriptor->data0) &
-					    RX_DESC_DATA0_OWN_) {
-						goto done;
-					}
-					if (le32_to_cpu(descriptor->data0) &
-					    RX_DESC_DATA0_EXT_) {
-						extension_index = index;
-					} else {
-						goto done;
-					}
-				} else {
-					/* extension is not yet available */
-					/* prevent processing of this packet */
-					first_index = -1;
-					last_index = -1;
-				}
-			}
-		}
+	is_last = le32_to_cpu(descriptor->data0) & RX_DESC_DATA0_LS_;
+	is_first = le32_to_cpu(descriptor->data0) & RX_DESC_DATA0_FS_;
+
+	if (is_last && le32_to_cpu(descriptor->data0) & RX_DESC_DATA0_EXT_) {
+		/* extension is expected to follow */
+		int index = lan743x_rx_next_index(rx, rx->last_head);
+
+		if (index == current_head_index)
+			/* extension not yet available */
+			goto done;
+		desc_ext = &rx->ring_cpu_ptr[index];
+		if (le32_to_cpu(desc_ext->data0) & RX_DESC_DATA0_OWN_)
+			/* extension not yet available */
+			goto done;
+		if (!(le32_to_cpu(desc_ext->data0) & RX_DESC_DATA0_EXT_))
+			goto move_forward;
+		extension_index = index;
 	}
-	if (first_index >= 0 && last_index >= 0) {
-		int real_last_index = last_index;
-		struct sk_buff *skb = NULL;
-		u32 ts_sec = 0;
-		u32 ts_nsec = 0;
-
-		/* packet is available */
-		if (first_index == last_index) {
-			/* single buffer packet */
-			struct sk_buff *new_skb = NULL;
-			int packet_length;
-
-			new_skb = lan743x_rx_allocate_skb(rx);
-			if (!new_skb) {
-				/* failed to allocate next skb.
-				 * Memory is very low.
-				 * Drop this packet and reuse buffer.
-				 */
-				lan743x_rx_reuse_ring_element(rx, first_index);
-				goto process_extension;
-			}
 
-			buffer_info = &rx->buffer_info[first_index];
-			skb = buffer_info->skb;
-			descriptor = &rx->ring_cpu_ptr[first_index];
-
-			/* unmap from dma */
-			packet_length =	RX_DESC_DATA0_FRAME_LENGTH_GET_
-					(descriptor->data0);
-			if (buffer_info->dma_ptr) {
-				dma_sync_single_for_cpu(&rx->adapter->pdev->dev,
-							buffer_info->dma_ptr,
-							packet_length,
-							DMA_FROM_DEVICE);
-				dma_unmap_single_attrs(&rx->adapter->pdev->dev,
-						       buffer_info->dma_ptr,
-						       buffer_info->buffer_length,
-						       DMA_FROM_DEVICE,
-						       DMA_ATTR_SKIP_CPU_SYNC);
-				buffer_info->dma_ptr = 0;
-				buffer_info->buffer_length = 0;
-			}
-			buffer_info->skb = NULL;
-			packet_length =	RX_DESC_DATA0_FRAME_LENGTH_GET_
-					(le32_to_cpu(descriptor->data0));
-			skb_put(skb, packet_length - 4);
-			skb->protocol = eth_type_trans(skb,
-						       rx->adapter->netdev);
-			lan743x_rx_init_ring_element(rx, first_index, new_skb);
-		} else {
-			int index = first_index;
+	/* Only the last buffer in a multi-buffer frame contains the total frame
+	 * length. All other buffers have a zero frame length. The chip
+	 * occasionally sends more buffers than strictly required to reach the
+	 * total frame length.
+	 * Handle this by adding all buffers to the skb in their entirety.
+	 * Once the real frame length is known, trim the skb.
+	 */
+	frame_length =
+		RX_DESC_DATA0_FRAME_LENGTH_GET_(le32_to_cpu(descriptor->data0));
+	buffer_length = buffer_info->buffer_length;
 
-			/* multi buffer packet not supported */
-			/* this should not happen since buffers are allocated
-			 * to be at least the mtu size configured in the mac.
-			 */
+	netdev_dbg(rx->adapter->netdev, "%s%schunk: %d/%d",
+		   is_first ? "first " : "      ",
+		   is_last  ? "last  " : "      ",
+		   frame_length, buffer_length);
 
-			/* clean up buffers */
-			if (first_index <= last_index) {
-				while ((index >= first_index) &&
-				       (index <= last_index)) {
-					lan743x_rx_reuse_ring_element(rx,
-								      index);
-					index = lan743x_rx_next_index(rx,
-								      index);
-				}
-			} else {
-				while ((index >= first_index) ||
-				       (index <= last_index)) {
-					lan743x_rx_reuse_ring_element(rx,
-								      index);
-					index = lan743x_rx_next_index(rx,
-								      index);
-				}
-			}
-		}
+	/* unmap from dma */
+	if (buffer_info->dma_ptr) {
+		dma_unmap_single(&rx->adapter->pdev->dev,
+				 buffer_info->dma_ptr,
+				 buffer_info->buffer_length,
+				 DMA_FROM_DEVICE);
+		buffer_info->dma_ptr = 0;
+		buffer_info->buffer_length = 0;
+	}
+	skb = buffer_info->skb;
 
-process_extension:
-		if (extension_index >= 0) {
-			descriptor = &rx->ring_cpu_ptr[extension_index];
-			buffer_info = &rx->buffer_info[extension_index];
-
-			ts_sec = le32_to_cpu(descriptor->data1);
-			ts_nsec = (le32_to_cpu(descriptor->data2) &
-				  RX_DESC_DATA2_TS_NS_MASK_);
-			lan743x_rx_reuse_ring_element(rx, extension_index);
-			real_last_index = extension_index;
-		}
+	/* allocate new skb and map to dma */
+	if (lan743x_rx_init_ring_element(rx, rx->last_head)) {
+		/* failed to allocate next skb.
+		 * Memory is very low.
+		 * Drop this packet and reuse buffer.
+		 */
+		lan743x_rx_reuse_ring_element(rx, rx->last_head);
+		goto process_extension;
+	}
+
+	/* add buffers to skb via skb->frag_list */
+	if (is_first) {
+		skb_reserve(skb, RX_HEAD_PADDING);
+		skb_put(skb, buffer_length - RX_HEAD_PADDING);
+		if (rx->skb_head)
+			dev_kfree_skb_irq(rx->skb_head);
+		rx->skb_head = skb;
+	} else if (rx->skb_head) {
+		skb_put(skb, buffer_length);
+		if (skb_shinfo(rx->skb_head)->frag_list)
+			rx->skb_tail->next = skb;
+		else
+			skb_shinfo(rx->skb_head)->frag_list = skb;
+		rx->skb_tail = skb;
+		rx->skb_head->len += skb->len;
+		rx->skb_head->data_len += skb->len;
+		rx->skb_head->truesize += skb->truesize;
+	} else {
+		rx->skb_head = skb;
+	}
 
-		if (!skb) {
-			result = RX_PROCESS_RESULT_PACKET_DROPPED;
-			goto move_forward;
+process_extension:
+	if (extension_index >= 0) {
+		u32 ts_sec;
+		u32 ts_nsec;
+
+		ts_sec = le32_to_cpu(desc_ext->data1);
+		ts_nsec = (le32_to_cpu(desc_ext->data2) &
+			  RX_DESC_DATA2_TS_NS_MASK_);
+		if (rx->skb_head) {
+			hwtstamps = skb_hwtstamps(rx->skb_head);
+			if (hwtstamps)
+				hwtstamps->hwtstamp = ktime_set(ts_sec, ts_nsec);
 		}
+		lan743x_rx_reuse_ring_element(rx, extension_index);
+		rx->last_head = extension_index;
+		netdev_dbg(rx->adapter->netdev, "process extension");
+	}
 
-		if (extension_index < 0)
-			goto pass_packet_to_os;
-		hwtstamps = skb_hwtstamps(skb);
-		if (hwtstamps)
-			hwtstamps->hwtstamp = ktime_set(ts_sec, ts_nsec);
+	if (is_last && rx->skb_head)
+		rx->skb_head = lan743x_rx_trim_skb(rx->skb_head, frame_length);
 
-pass_packet_to_os:
-		/* pass packet to OS */
-		napi_gro_receive(&rx->napi, skb);
-		result = RX_PROCESS_RESULT_PACKET_RECEIVED;
+	if (is_last && rx->skb_head) {
+		rx->skb_head->protocol = eth_type_trans(rx->skb_head,
+							rx->adapter->netdev);
+		netdev_dbg(rx->adapter->netdev, "sending %d byte frame to OS",
+			   rx->skb_head->len);
+		napi_gro_receive(&rx->napi, rx->skb_head);
+		rx->skb_head = NULL;
+	}
 
 move_forward:
-		/* push tail and head forward */
-		rx->last_tail = real_last_index;
-		rx->last_head = lan743x_rx_next_index(rx, real_last_index);
-	}
+	/* push tail and head forward */
+	rx->last_tail = rx->last_head;
+	rx->last_head = lan743x_rx_next_index(rx, rx->last_head);
+	result = RX_PROCESS_RESULT_PACKET_RECEIVED;
 done:
 	return result;
 }
@@ -2367,9 +2330,7 @@  static int lan743x_rx_ring_init(struct lan743x_rx *rx)
 
 	rx->last_head = 0;
 	for (index = 0; index < rx->ring_size; index++) {
-		struct sk_buff *new_skb = lan743x_rx_allocate_skb(rx);
-
-		ret = lan743x_rx_init_ring_element(rx, index, new_skb);
+		ret = lan743x_rx_init_ring_element(rx, index);
 		if (ret)
 			goto cleanup;
 	}
diff --git a/drivers/net/ethernet/microchip/lan743x_main.h b/drivers/net/ethernet/microchip/lan743x_main.h
index 751f2bc9ce84..17e31a6210c6 100644
--- a/drivers/net/ethernet/microchip/lan743x_main.h
+++ b/drivers/net/ethernet/microchip/lan743x_main.h
@@ -698,6 +698,8 @@  struct lan743x_rx {
 	struct napi_struct napi;
 
 	u32		frame_count;
+
+	struct sk_buff *skb_head, *skb_tail;
 };
 
 struct lan743x_adapter {